Corpora and resources

SWE-AOA: Subjective ratings of age of acquisition
SWE-AOA is a freely available resource for research on age-of-acquisition in Swedish, that is, the age at which the average child learns a given word.

Sign Language Iconicity
In this project we study form-meaning relationships across sign languages.

LONG-MINGLE: A Longitudinal Corpus of Child-Directed Speech
LONG-MINGLE is a longitudinal corpus of child-directed speech. The corpus consists of orthographic transcripts of audio and video recordings of naturalistic free play sessions.

Stockholm–Umeå Corpus (SUC)
The Stockholm–Umeå Corpus (SUC) is a collection of Swedish texts totalling one million words. SUC has been released in three versions: SUC 1.0 (1997), SUC 2.0 (2006) and SUC 3.0 (2012).

SUC-CORE: SUC 2.0 Annotated with NP Coreference
SUC-CORE is a 20,000-word subset of the Stockholm–Umeå Corpus (SUC 2.0) annotated with coreference relations between noun phrases. The corpus covers a wide range of genres and domains, and is freely available for research.

Stockholm Internet Corpus (SIC)
The SIC project aims to create a freely available corpus of Swedish Internet texts, manually annotated with part-of-speech (PoS) and named-entity tags. So far, a small corpus (13,562 tokens) of blog texts has been created. The tagset and data format are adapted from the Stockholm–Umeå Corpus (SUC) version 3, on which the corpus is modelled.

Stockholm University Strindberg Corpus (SUSC)
The Stockholm University Strindberg Corpus (SUSC) consists of seven novels by August Strindberg, annotated with parts of speech, morphological analysis and lemmas. The corpus is freely available.

The Strindberg National Edition Corpus (SNEC)
The Strindberg National Edition Corpus (SNEC) contains the National Edition of August Strindberg's Collected Works, provided in a plain text version and a linguistically annotated CoNLL version.

Swedish Blog Sentences (SBS)
This is a collection of sentences from Swedish blog posts from November 2010 until December 2014.

Stockholm MULtilingual TReebank (SMULTRON)
SMULTRON (Stockholm MULtilingual TReebank) is a parallel treebank that contains around 1000 sentences in English, German and Swedish.

Word order tables (80 Kb)
Quantitative word order data for 986 languages.



Section head: Robert Östling
