Swedish Blog Sentences (SBS)

This is a collection of sentences from Swedish blog posts from November 2010 until December 2014.

For copyright reasons, the sentences have been randomly shuffled so that the original texts cannot be reconstructed. The data is still very useful for many applications, and we publish this resource in the hope that it will be of use to the field of Swedish Natural Language Processing.

The data is distributed as an xz-compressed file with two tab-separated columns: a numeric blog identifier and the raw text of the sentence. In total it contains 434 million sentences from 726,346 different blogs, amounting to about 5.3 billion words.
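
Because the file is large, it is usually more practical to stream it than to decompress it to disk first. The following is a minimal Python sketch of how the two-column format could be read. The file name `sbs.tsv.xz`, the UTF-8 encoding, and the helper names `iter_sentences` and `count_sentences_per_blog` are assumptions made for illustration, not part of the distribution.

```python
import lzma
from collections import Counter
from typing import Iterator, Tuple


def iter_sentences(path: str) -> Iterator[Tuple[int, str]]:
    """Yield (blog_id, sentence) pairs from an xz-compressed, tab-separated file.

    Assumes UTF-8 text with one sentence per line (an assumption, not
    something stated by the distribution).
    """
    # lzma.open in text mode decompresses on the fly, so the full
    # uncompressed data never has to be held in memory or written to disk.
    with lzma.open(path, mode="rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip empty lines, if any
            # Split on the first tab only, in case a sentence contains tabs.
            blog_id, _, sentence = line.partition("\t")
            yield int(blog_id), sentence


def count_sentences_per_blog(path: str, limit: int = 1_000_000) -> Counter:
    """Count sentences per blog identifier over the first `limit` lines."""
    counts: Counter = Counter()
    for i, (blog_id, _sentence) in enumerate(iter_sentences(path)):
        if i >= limit:
            break
        counts[blog_id] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical file name; replace with the path of the downloaded file.
    top = count_sentences_per_blog("sbs.tsv.xz").most_common(5)
    for blog_id, n in top:
        print(blog_id, n)
```

Since the sentences are shuffled, the first lines of the file should already form a random sample of the corpus, which is why the sketch caps the number of lines read with `limit`.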

License and citation

This data is licensed under the Creative Commons Attribution-ShareAlike 3.0 license, which means that you are free to use it as long as you give proper credit and share any modifications under the same conditions.

Consider citing the following paper if you use the Swedish Blog Sentences:

Östling, R. & Wirén, M. (2013). Compounding in a Swedish Blog Corpus. In: Laura Álvarez López, Charlotta Seiler Brylla & Philip Shaw (Eds.), Computer mediated discourse across languages (pp. 45-63). Stockholm: Acta Universitatis Stockholmiensis.
Full text available in DiVA.

Download

Download (8.5 GB): Full data (xz compressed)

Contact

Robert Östling: robert@ling.su.se

Last updated: September 16, 2022
Source: Department of Linguistics

CONTACT

Section head: Robert Östling
Email: robert@ling.su.se

Website URL: www.ling.su.se/nlp

Section for Computational Linguistics:
www.ling.su.se/compling
www.ling.su.se/DaLi