The course deals with corpus-based methods, that is, the large-scale study of written text, or spoken or signed utterances.

Contents: Data, methods and evidence in different linguistic traditions. Quantitative properties of language, frequencies, n-grams. Data collection for different types of corpora (including traditional sample corpora, monitor corpora and web corpora) and modalities (text, speech, signing). Representation of corpora in XML. Overview of computational linguistic methods for automatic segmentation and annotation of text, including tokenisation, part-of-speech tagging and syntactic analysis. Searching corpora using regular expressions. Analysis of corpora based on occurrences and co-occurrences. Relationship between corpus material and research questions. Ethics, copyright, licenses.
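As a rough illustration of three of the topics listed above (tokenisation, n-gram frequencies and regular-expression search), the following minimal Python sketch may help. It is not taken from the course material: the toy corpus, the crude tokenisation rule and the helper names `tokenise` and `ngrams` are all assumptions made for this example, and it uses only the Python standard library.

```python
import re
from collections import Counter

# Toy corpus, invented for illustration (not course material).
corpus = "The cat sat on the mat. The cat saw the dog."


def tokenise(text):
    """Lowercase the text and return its word tokens.

    Deliberately crude: punctuation is dropped and sentence
    boundaries are ignored.
    """
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenise(corpus)

# Frequencies of single tokens and of bigrams (n = 2).
# Because sentence boundaries are ignored, some bigrams span two sentences.
unigram_counts = Counter(tokens)
bigram_counts = Counter(ngrams(tokens, 2))

# Regular-expression search: every word form beginning with "s".
s_words = re.findall(r"\bs\w*", corpus.lower())

print(unigram_counts.most_common(2))
print(bigram_counts.most_common(1))
print(s_words)
```

A real corpus study would normally use a proper tokeniser and respect sentence boundaries; the sketch only shows how the counting and searching steps fit together.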

Syllabus and application

Schedule and literature list

Please note that the schedule is preliminary until the course starts.
Schedule Autumn 2018 | Literature list Autumn 2018 (98 Kb)


Mats Wirén


The teaching consists of lectures and laboratory exercises.

Instruction language


Prerequisites and special admittance requirements

A Bachelor's degree with a major in language sciences, including a thesis in language sciences. English 6 from Swedish upper secondary school, or equivalent.