Stockholm Internet Corpus (SIC)

The SIC project aims to create a freely available, manually annotated corpus of Swedish Internet texts. So far, a small corpus (8174 tokens) of blog texts has been created. The tagset and data format are adapted from the Stockholm-Umeå Corpus (SUC), on which the corpus is modelled.

SIC is primarily intended for researchers developing and testing Natural Language Processing (NLP) tools working with Internet texts. Linguists and general users interested in searching texts from Swedish blogs would probably find the Korp concordancer at Språkbanken to be more useful. 

 
 
One important difference from SUC is that SIC uses a more permissive license (the Creative Commons Attribution-ShareAlike 3.0 Unported license), which allows researchers to modify and redistribute the corpus.

If you are the author of a Swedish blog, you can help us expand the corpus by licensing your blog under the same Creative Commons license (just put a note about it on your blog) and telling us about it!

Downloads:  Current version (zip archive) (108 Kb)

Contact: Robert Östling

 


CONTACT

Section head: Mats Wirén
Email: mats.wiren@ling.su.se

Website URL: www.ling.su.se/nlp

Section for Computational Linguistics:
www.ling.su.se/compling
www.ling.su.se/DaLi
