Stockholm Internet Corpus (SIC)

The SIC project aims to create a freely available corpus of Swedish Internet texts, manually annotated with Part of Speech (PoS) and Named Entity tags. So far, a small corpus (13,562 tokens) of blog texts has been created. The tagset and data format are adapted from the Stockholm–Umeå Corpus (SUC) version 3, on which the corpus is modelled.

SIC is primarily intended for researchers developing and testing Natural Language Processing (NLP) tools that work with Internet texts. Linguists and general users interested in searching texts from Swedish blogs will probably find the Korp concordancer at Språkbanken more useful.
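As a rough illustration of what testing a tagger against a manually annotated corpus involves, here is a minimal Python sketch that computes per-token tagging accuracy. The function name `tagging_accuracy` and the example tag sequences are assumptions made up for this illustration; they are not part of SIC or its distribution, and reading the corpus files themselves is left out because the file layout is not described here.

```python
from typing import Sequence


def tagging_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    # Count positions where the two tag sequences agree.
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical SUC-style tags for a four-token sentence (not taken from SIC).
gold = ["PN", "VB", "NN", "MAD"]
predicted = ["PN", "VB", "JJ", "MAD"]
accuracy = tagging_accuracy(gold, predicted)
print(f"accuracy: {accuracy:.2f}")
```

In practice the gold tags would come from the manual annotation and the predicted tags from the tool under test, aligned token by token.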

One important difference from SUC is that SIC uses a more permissive license (the Creative Commons Attribution-ShareAlike 3.0 Unported license), allowing researchers to modify and redistribute the corpus. The annotation was done by Robert Östling, Johan Sjons and Johannes Bjerva, who manually corrected the output of Stagger.

If you are the author of a Swedish blog, you can help us expand the corpus by licensing your blog under the same Creative Commons license (just put a note about it on your blog) and telling us about it!

Downloads: Download SIC (zip) (173 Kb)

Contact: Robert Östling

Last updated: February 11, 2016
Source: Department of Linguistics


CONTACT

Section head: Robert Östling
Email: robert@ling.su.se

Website URL: www.ling.su.se/nlp

Section for Computational Linguistics:
www.ling.su.se/compling
www.ling.su.se/DaLi
