Corpora, resources and tools

Welcome to our collection of corpora, resources and tools from the Section for Computational Linguistics.

The Section for Computational Linguistics makes parts of its Natural Language Processing software, resources and corpora available to the public.

The links to the left give more information about our corpora, including Swedish Blog Sentences (2.7 billion tokens), the Stockholm Umeå Corpus (SUC, 1 million words), SUC-CORE (a 20 000-word subset of SUC with NP coreference annotation), and the Stockholm University Strindberg Corpus (400 000 tokens).

The tools we distribute include Stockholm Language Model with Entropy (SLME), Swedish Python Routines (SPyRo), which include compound analysis for Swedish, and the Stockholm Tagger (Stagger), a part-of-speech tagger and named-entity recognizer for Swedish.

Read more about our research and some of the projects we are currently working on.