Corpora and resources

LONG-MINGLE: A Longitudinal Corpus of Child-Directed Speech

 LONG-MINGLE is a longitudinal corpus of child-directed speech. The corpus consists of orthographic transcripts of audio and video recordings of naturalistic free play sessions.

Stockholm–Umeå Corpus (SUC)
 The Stockholm–Umeå Corpus (SUC) is a collection of Swedish texts totalling one million words. SUC has been released in three versions: SUC 1.0 (1997), SUC 2.0 (2006) and SUC 3.0 (2012).

SUC-CORE: SUC 2.0 Annotated with NP Coreference
 SUC-CORE is a 20,000-word subset of the Stockholm–Umeå Corpus (SUC 2.0) annotated with coreference relations between noun phrases. The corpus covers a wide range of genres and domains, and is freely available for research.

Stockholm Internet Corpus (SIC)
 The SIC project aims to create a freely available corpus of Swedish Internet texts, manually annotated with part-of-speech (PoS) and named entity tags. So far, a small corpus (13,562 tokens) of blog texts has been created. The tagset and data format are adapted from the Stockholm–Umeå Corpus (SUC) version 3, on which the corpus is modelled.

Stockholm University Strindberg Corpus (SUSC)
 The Stockholm University Strindberg Corpus (SUSC) consists of seven novels by August Strindberg annotated with part-of-speech tags, morphological analysis and lemmas. The corpus is freely available.

Swedish Blog Sentences (SBS)
 SBS is a collection of sentences from Swedish blog posts published between November 2010 and September 2012.

Stockholm MULtilingual TReebank (SMULTRON)
 SMULTRON (Stockholm MULtilingual TReebank) is a parallel treebank that contains around 1000 sentences in English, German and Swedish.

Word order tables (80 KB)
 Quantitative word order data for 986 languages.



Section head: Mats Wirén
