For copyright reasons, the order of these sentences has been randomly rearranged so that the original texts cannot be recreated. The data is still very useful in many applications, and we publish this resource hoping that it will be of use to the field of Swedish Natural Language Processing. You may also be interested in other sets of sentences from Språkbanken.

The texts have been automatically lemmatized and annotated for part of speech and named entities by Stagger. In total, our data contains about 2.7 billion tokens in over 220 million posts from 660,000 different blogs.


The data is licensed under the Creative Commons Attribution-ShareAlike 3.0 license, which means that you are free to use it as long as proper credit is given and any modifications are shared under the same conditions. Consider citing the following paper if you use the Swedish Blog Sentences:

Östling, R. & Wirén, M. (2013). Compounding in a Swedish Blog Corpus. In: Laura Álvarez López, Charlotta Seiler Brylla & Philip Shaw (Eds.), Computer mediated discourse across languages (pp. 45-63). Stockholm: Acta Universitatis Stockholmiensis.
Full text available in DiVA

Download (20 GB): Full data (bzip2 compressed)
Contact: Robert Östling
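
Because the download is a single 20 GB bzip2 archive, it may be more practical to stream it than to decompress it to disk first. The following is a minimal sketch in Python using only the standard library. It assumes the archive is UTF-8 text with one record per line; the actual file layout is not described on this page, so check a few lines before building on it. The function names `iter_lines` and `preview` are illustrative, not part of any released tooling.

```python
import bz2
import sys
from itertools import islice
from typing import Iterator, List


def iter_lines(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield decoded lines from a bzip2 file, decompressing on the fly.

    Assumption: the file is line-oriented text in the given encoding.
    Undecodable bytes are replaced rather than raising an error.
    """
    with bz2.open(path, mode="rt", encoding=encoding, errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")


def preview(path: str, n: int = 5) -> List[str]:
    """Return the first n lines, reading only as much of the archive as needed."""
    return list(islice(iter_lines(path), n))


if __name__ == "__main__":
    # Usage: python preview.py <path-to-downloaded-archive> [number-of-lines]
    if len(sys.argv) < 2:
        sys.exit("usage: python preview.py <archive.bz2> [n]")
    count = int(sys.argv[2]) if len(sys.argv) > 2 else 5
    for text in preview(sys.argv[1], count):
        print(text)
```

Streaming keeps memory use low and lets you inspect the annotation format without unpacking the whole file.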