Research in Computational Linguistics
Our current research concerns computational models of first-language acquisition, automatic identification of linguistic constructions, and medical text mining (details below).
- Computational models of first-language acquisition. An assumption of this work is that a cognitive model of language learning needs to be dialogue-driven and multimodal, taking both linguistic and non-linguistic aspects of interaction into account. To this end, we are developing a longitudinal corpus of video and audio recordings of parent–child interaction with verbal and non-verbal annotation (speech transcription, eye gaze, object-related actions, gestures and discourse information). Examples of phenomena that we study are synchrony (patterns converging across modalities, such as a parent holding an object in the infant's visual field while referring to it verbally), variation sets (partial repetitions of successive utterances in child-directed speech), and disfluency as a device that helps structure the input for the child. The goal is to understand how an inventory of constructions (see "Automatic identification of linguistic constructions" below) can be learned from linguistic and non-linguistic input, and how this learning can be modelled using unsupervised machine learning. This research is carried out within the MINGLE project, "Modelling the emergence of linguistic structures in early childhood", funded by the Swedish Research Council. It is a collaboration with the Section for Phonetics.
- Automatic identification of linguistic constructions. This work studies phenomena like multiword expressions, collocations, compounding and morphology through the lens of constructions, that is, conventionalised pairings of meaning and form at different levels of abstraction. Potential applications include first-language acquisition (systematising the constructions that children learn on a longitudinal scale), second-language acquisition (providing help with idiomatic expressions to language learners) and linguistic typology (as an alternative to traditional reference grammars, etc.). The techniques rely on machine learning and include hybrid n-grams (simultaneously covering word forms, lemmas and parts of speech), distributional semantics, and massively parallel corpora (with translation equivalents on the order of several hundred). This work is carried out as part of Robert Östling's thesis work and ties in with both the first-language acquisition research and the medical text mining research described here. It involves collaboration with the Section for General Linguistics.
- Medical text mining. One aim of this recently started project is to develop techniques to augment professional medical text with paraphrases in language that a layperson can understand. The main domain studied is medical records, motivated by an ongoing effort to make these records available online for patients in Sweden. One problem with this domain is that standard measures of readability are not applicable because of the extremely telegraphic style, excessive use of abbreviations, misspellings, etc. Thus, new measures need to be developed in order to arrive at objective criteria for how to make the records accessible to the general public. The core of this work is carried out in Gintaré Grigonyté's postdoc project, and in part draws on techniques from the work on automatic identification of linguistic constructions. It is a collaboration with the Clinical Text Mining Group at the Department of Computer and Systems Sciences (DSV).
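As a rough illustration of the hybrid n-grams mentioned under construction identification above, the sketch below enumerates n-grams in which each position may be a word form, a lemma, or a part of speech. It is only a minimal sketch under stated assumptions, not the method used in the research: the function name `hybrid_ngrams`, the level labels, and the hand-annotated example sentence are all hypothetical.

```python
from itertools import product

# Assumed annotation levels for each token (hypothetical labels for this sketch).
LEVELS = ("form", "lemma", "pos")


def hybrid_ngrams(tokens, n):
    """Return the set of hybrid n-grams over a sequence of annotated tokens.

    Each token is a (word form, lemma, part-of-speech) triple. For every
    window of n consecutive tokens, each position is realised at one of the
    three levels, giving up to 3**n patterns per window.
    """
    patterns = set()
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        # For each position, list its three labelled realisations.
        options = [
            [f"{level}:{value}" for level, value in zip(LEVELS, token)]
            for token in window
        ]
        # Every combination of one realisation per position is a pattern.
        patterns.update(product(*options))
    return patterns


# Hand-written annotation for illustration only (not tagger output).
sentence = [
    ("kicked", "kick", "VERB"),
    ("the", "the", "DET"),
    ("bucket", "bucket", "NOUN"),
]

bigrams = hybrid_ngrams(sentence, 2)
trigrams = hybrid_ngrams(sentence, 3)
```

A mixed pattern such as `("lemma:kick", "form:the", "form:bucket")` abstracts over the inflection of the verb while keeping the remaining words fixed, which is the kind of generalisation that makes such patterns candidates for idiomatic constructions. How candidates are then scored or filtered is not specified in the text and is left out here.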
September 13, 2013
Source: Department of Linguistics