Corpora and resources

  • Stockholm-Umeå Corpus (SUC). The Stockholm-Umeå Corpus (SUC) is a collection of Swedish texts totalling one million words. SUC has been released in three versions: SUC 1.0 (1997), SUC 2.0 (2006) and SUC 3.0 (2012).
  • SUC-CORE: SUC 2.0 Annotated with NP Coreference. SUC-CORE is a 20,000-word subset of the Stockholm-Umeå Corpus (SUC 2.0) annotated with coreference relations between noun phrases. The corpus covers a wide range of genres and domains, and is freely available for research.
  • Stockholm Internet Corpus (SIC). The SIC project aims to create a freely available, manually annotated corpus of Swedish Internet texts. So far, a small corpus (8174 tokens) of blog texts has been created. The tagset and data format are adapted from the Stockholm-Umeå Corpus (SUC), on which the corpus is modelled.
  • Stockholm University Strindberg Corpus (SUSC). The Stockholm University Strindberg Corpus (SUSC) consists of seven novels by August Strindberg, annotated for parts of speech with morphological analysis and lemmas. The corpus is freely available.
  • Swedish Blog Sentences (SBS). SBS is a collection of sentences from Swedish blog posts from November 2010 to September 2012.
  • Stockholm MULtilingual TReebank (SMULTRON). SMULTRON is a parallel treebank that contains around 1000 sentences in English, German and Swedish.

CONTACT

Section head: Mats Wirén
Email: mats.wiren@ling.su.se

Website URL: www.ling.su.se/nlp

Section for Computational Linguistics:
www.ling.su.se/compling
www.ling.su.se/DaLi
