Stockholm University Strindberg Corpus (SUSC)

The Stockholm University Strindberg Corpus (SUSC) consists of seven novels by August Strindberg, annotated with parts of speech, morphological analysis, and lemmas. The corpus is freely available.

SUSC comprises approximately 400,000 tokens annotated with parts of speech, morphological analysis, and lemmas, using the Stockholm-Umeå Corpus tag set in PAROLE format. The annotated texts have been converted to XML, which makes the corpus searchable with corpus analysis tools such as Xaira. This makes it possible, for example, to retrieve concordances for a specific word form, part of speech, and/or lemma, to perform pattern matching, and to extract collocations.
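Because the corpus is distributed as XML, the same kind of concordance query can in principle be run with a general-purpose XML parser instead of a dedicated tool. The following is a minimal sketch in Python, not a description of the actual SUSC schema: the element names (`s`, `w`), the attribute names (`pos`, `lemma`), the simplified tag values, and the two sample sentences are all assumptions made for illustration, and would need to be adapted to the markup found in the downloaded files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data. Element names, attribute names, and tag values
# are illustrative assumptions, not the actual SUSC markup or PAROLE tags.
SAMPLE = """<text>
  <s>
    <w pos="PN" lemma="jag">Jag</w>
    <w pos="VB" lemma="vara">var</w>
    <w pos="JJ" lemma="ensam">ensam</w>
    <w pos="MAD" lemma=".">.</w>
  </s>
  <s>
    <w pos="PN" lemma="han">Han</w>
    <w pos="VB" lemma="vara">är</w>
    <w pos="JJ" lemma="ensam">ensam</w>
    <w pos="MAD" lemma=".">.</w>
  </s>
</text>"""


def concordance(root, lemma=None, pos=None, wordform=None, window=2):
    """Return (left context, keyword, right context) for each matching token.

    A token matches when it satisfies every criterion that is not None.
    Context is limited to `window` tokens on each side within the sentence.
    """
    hits = []
    for sentence in root.iter("s"):
        tokens = list(sentence.iter("w"))
        for i, tok in enumerate(tokens):
            text = tok.text or ""
            if lemma is not None and tok.get("lemma") != lemma:
                continue
            if pos is not None and tok.get("pos") != pos:
                continue
            if wordform is not None and text.lower() != wordform.lower():
                continue
            left = " ".join(t.text or "" for t in tokens[max(0, i - window):i])
            right = " ".join(t.text or "" for t in tokens[i + 1:i + 1 + window])
            hits.append((left, text, right))
    return hits


root = ET.fromstring(SAMPLE)
hits = concordance(root, lemma="vara", pos="VB")
for left, keyword, right in hits:
    print(f"{left:>15} [{keyword}] {right}")
```

Searching by lemma rather than word form retrieves both inflected forms in the sample ("var" and "är"), which is the main practical benefit of the lemma annotation. For the full corpus, `ET.parse` on a file path would replace `ET.fromstring`.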

The current version of the corpus includes seven works which can be classified as autobiographical:

  • Tjänstekvinnans son (The son of a servant, 1886-87)
  • Han och hon (He and she, 1919)
  • Inferno (Inferno, 1897)
  • Legender and Jakob brottas (Legends and Jacob wrestles, 1898)
  • Fagervik och Skamsund (Fair haven and Foulstrand, 1902)
  • Ensam (Alone, 1903)

We are aware of three other electronic collections of Strindberg’s works: Projekt Runeberg, Litteraturbanken and Språkbanken. While these are valuable resources, SUSC is an important addition because, unlike the first two, it is linguistically annotated, and unlike the third, the data is available for download and thus can be fully inspected and processed using the researcher’s software of choice. Even more importantly, researchers can add their analyses as new layers of annotation of the corpus.

Download

Presentations
Kristina Nilsson Björkenstam, Sofia Gustafson Capková & Mats Wirén. Stockholm University Strindberg Corpus: Contents and possibilities. In: Arvet efter Strindberg - The Strindberg Legacy. The 18th International Strindberg Conference. Stockholm University, May 31 to June 3, 2012.

Contact
Kristina Nilsson Björkenstam, kristina.nilsson@ling.su.se
Sofia Gustafson Capková, sofia@ling.su.se

 


CONTACT

Section head: Mats Wirén
Email: mats.wiren@ling.su.se

Website URL: www.ling.su.se/nlp

Section for Computational Linguistics:
www.ling.su.se/compling
www.ling.su.se/DaLi
