Corpora and resources

— SWE-AOA: Subjective ratings of age of acquisition
SWE-AOA is a freely available resource for research on age of acquisition in Swedish, that is, the age at which the average child learns a given word.

— Sign Language Iconicity
In this project we study form-meaning relationships across sign languages.

— LONG-MINGLE: A Longitudinal Corpus of Child-Directed Speech
LONG-MINGLE is a longitudinal corpus of child-directed speech. The corpus consists of orthographic transcripts of audio and video recordings of naturalistic free play sessions.

— Stockholm–Umeå Corpus (SUC)
The Stockholm–Umeå Corpus (SUC) is a collection of Swedish texts totalling one million words. SUC has been released in three versions: SUC 1.0 (1997), SUC 2.0 (2006) and SUC 3.0 (2012).

— SUC-CORE: SUC 2.0 Annotated with NP Coreference
SUC-CORE is a 20,000-word subset of the Stockholm–Umeå Corpus (SUC 2.0) annotated with coreference relations between noun phrases. The corpus covers a wide range of genres and domains, and is freely available for research.

— Stockholm Internet Corpus (SIC)
The SIC project aims to create a freely available corpus of Swedish Internet texts, manually annotated with part-of-speech (PoS) and named-entity tags. So far, a small corpus (13,562 tokens) of blog texts has been created. The tagset and data format are adapted from the Stockholm–Umeå Corpus (SUC) version 3, on which the corpus is modelled.

— Stockholm University Strindberg Corpus (SUSC)
The Stockholm University Strindberg Corpus (SUSC) consists of seven novels by August Strindberg, annotated with parts of speech, morphological analysis and lemmas. The corpus is freely available.

— The Strindberg National Edition Corpus (SNEC)
The Strindberg National Edition Corpus (SNEC) contains the National Edition of August Strindberg's Collected Works, provided in a plain text version and a linguistically annotated CoNLL version.

— Swedish Blog Sentences (SBS)
A collection of sentences from Swedish blog posts dating from November 2010 to December 2014.

— Stockholm MULtilingual TReebank (SMULTRON)
SMULTRON (Stockholm MULtilingual TReebank) is a parallel treebank that contains around 1000 sentences in English, German and Swedish.

— Word order tables (80 Kb)
Quantitative word order data for 986 languages.

CONTACT

Section head: Robert Östling
Email: robert@ling.su.se

Website URL: www.ling.su.se/nlp

Section for Computational Linguistics:
www.ling.su.se/compling
www.ling.su.se/DaLi
