Corpora and resources
- Stockholm-Umeå Corpus (SUC). SUC is a collection of Swedish texts totalling one million words. It has been released in three versions: SUC 1.0 (1997), SUC 2.0 (2006) and SUC 3.0 (2012).
- SUC-CORE: SUC 2.0 Annotated with NP Coreference. SUC-CORE is a 20 000-word subset of the Stockholm-Umeå Corpus (SUC 2.0) annotated with coreference relations between noun phrases. The corpus covers a wide range of genres and domains, and is freely available for research.
- Stockholm Internet Corpus (SIC). The SIC project aims to create a freely available, manually annotated corpus of Swedish Internet texts. So far, a small corpus (8174 tokens) of blog texts has been created. The tagset and data format are adapted from the Stockholm-Umeå Corpus (SUC), on which the corpus is modelled.
- Stockholm University Strindberg Corpus (SUSC). SUSC consists of seven novels by August Strindberg, annotated for parts of speech with morphological analysis and lemmas. The corpus is freely available.
- Swedish Blog Sentences (SBS). SBS is a collection of sentences from Swedish blog posts dating from November 2010 to September 2012.
- Stockholm MULtilingual TReebank (SMULTRON). SMULTRON is a parallel treebank that contains around 1000 sentences in English, German and Swedish.
