For copyright reasons, the order of these sentences has been randomly rearranged so that the original texts cannot be reconstructed. The data is still very useful in many applications, and we publish this resource in the hope that it will be of use to the field of Swedish Natural Language Processing.

The data is distributed as an xz-compressed file with two tab-separated columns: a numeric blog identifier and the raw text of the sentence. In total it contains 434 million sentences from 726,346 different blogs, amounting to about 5.3 billion words.
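The two-column layout described above can be read one line at a time without unpacking the archive to disk. The following is a minimal sketch, not an official loader: the function names `iter_sentences` and `count_per_blog` are hypothetical, and UTF-8 encoding is an assumption that the description above does not confirm.

```python
import lzma
from collections import Counter
from itertools import islice
from typing import Iterator, Optional, Tuple


def iter_sentences(path: str) -> Iterator[Tuple[int, str]]:
    """Yield (blog_id, sentence) pairs from an xz-compressed, tab-separated file."""
    # lzma.open in text mode decompresses on the fly, so the file is
    # streamed rather than loaded into memory. UTF-8 is an assumption.
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the first tab only, in case a sentence contains tabs.
            blog_id, _, sentence = line.partition("\t")
            yield int(blog_id), sentence


def count_per_blog(path: str, limit: Optional[int] = None) -> Counter:
    """Count sentences per blog identifier over the first `limit` rows."""
    rows = islice(iter_sentences(path), limit)
    return Counter(blog_id for blog_id, _ in rows)
```

Calling `count_per_blog("blog_sentences.xz", limit=1_000_000)` would tally the first million rows, where the file name is a placeholder for wherever the download was saved. Because the sentence order has been shuffled, a prefix of the file should behave roughly like a random sample of the corpus.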

License and citation

 

This data is licensed under the Creative Commons Attribution-ShareAlike 3.0 license, which means that you are free to use it as long as proper credit is given, and that any modifications are shared under the same conditions.

Consider citing the following paper if you use the Swedish Blog Sentences:

Östling, R. & Wirén, M. (2013). Compounding in a Swedish Blog Corpus. In: Laura Álvarez López, Charlotta Seiler Brylla & Philip Shaw (Eds.), Computer mediated discourse across languages (pp. 45-63). Stockholm: Acta Universitatis Stockholmiensis.
Full text available in DiVA

Download

Download (8.5 GB): Full data (xz compressed)

Contact

Robert Östling: robert@ling.su.se