Swedish Blog Sentences (SBS)

This is a collection of sentences from Swedish blog posts from November 2010 until September 2012.

For copyright reasons, the sentences have been randomly reordered so that the original texts cannot be reconstructed. The data is still very useful for many applications, and we publish this resource in the hope that it will be of use to the field of Swedish Natural Language Processing. You may also be interested in other sets of sentences from Språkbanken.

The texts have been automatically lemmatized and annotated with part-of-speech tags and named entities by Stagger. In total, the data contains about 2.7 billion tokens in over 220 million posts from 660,000 different blogs.

 
 
The data is licensed under the Creative Commons Attribution-ShareAlike 3.0 license, which means that you are free to use it as long as you give proper credit and share any modifications under the same conditions. Please consider citing the following paper if you use the Swedish Blog Sentences:
Robert Östling and Mats Wirén: Compounding in a Swedish Blog Corpus (to appear in Språkvetenskapliga föreningens årsskrift 2012, Stockholm University)

Download (20 GB): Full data (bzip2 compressed)
Contact: Robert Östling
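
Because the download is a single large bzip2 file, it may be more convenient to stream it than to unpack it to disk first. The following is a minimal sketch in Python using only the standard library. It assumes that the file is UTF-8 text that can be read line by line; the layout of each line is not described here, so the sketch makes no attempt to parse it. The function names iter_lines and preview are illustrative and not part of the resource.

```python
import bz2
from itertools import islice
from typing import Iterator, List


def iter_lines(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield decoded lines from a bzip2 file without unpacking it to disk.

    Assumption: the archive holds plain text in the given encoding.
    Undecodable bytes are replaced rather than raising an error.
    """
    with bz2.open(path, mode="rt", encoding=encoding, errors="replace") as handle:
        for line in handle:
            # Strip only the trailing newline; keep any other whitespace,
            # since it may be meaningful in the annotation format.
            yield line.rstrip("\n")


def preview(path: str, limit: int = 20) -> List[str]:
    """Return the first `limit` lines, decompressing only as much as needed."""
    return list(islice(iter_lines(path), limit))
```

Calling preview on the downloaded file with a small limit is a cheap way to inspect the annotation format before writing a parser for it, since only the beginning of the archive is decompressed.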
 


CONTACT

Section head: Mats Wirén
Email: mats.wiren@ling.su.se

Website URL: www.ling.su.se/nlp

Section for Computational Linguistics:
www.ling.su.se/compling
www.ling.su.se/DaLi
