Next week, Shiva Taslimipoor will defend her thesis in her viva voce, which will conclude her four-year PhD with the Research Group in Computational Linguistics. In the run-up to her viva, Shiva presented her thesis and the research she has undertaken to the group.
Title: Automatic Identification and Translation of Multiword Expressions
Abstract: Multiword Expressions (MWEs) belong to a class of phraseological phenomena that is ubiquitous in the study of language. They are heterogeneous lexical items consisting of more than one word and feature lexical, syntactic, semantic and pragmatic idiosyncrasies. Scholarly research in MWEs immensely benefits both natural language processing (NLP) applications and end users. Along with the improvement of general NLP techniques, the methodologies to deal with MWEs should be improved.
This thesis involves designing new methodologies to identify and translate MWEs. In order to deal with MWE identification, we first develop datasets of annotated verb-noun MWEs in context. We then propose a method which employs word embeddings to disambiguate between literal and idiomatic usages of expressions. The existence of expression types with varying distributions of idiomatic and literal usages leads us to re-examine their modelling and evaluation.
We propose a type-aware train and test splitting approach to prevent models from overfitting and avoid misleading evaluation results.
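As a rough illustration of what a type-aware split might look like, the sketch below groups annotated instances by expression type and assigns whole types to either the training or the test set, so that no type is seen on both sides. This is only one plausible reading of the approach; the function name `type_aware_split`, the tuple layout and the `test_ratio` parameter are assumptions made for this example and are not taken from the thesis.

```python
import random
from collections import defaultdict


def type_aware_split(instances, test_ratio=0.2, seed=0):
    """Split instances so that no expression type occurs in both sets.

    Each instance is assumed to be a tuple whose first element is the
    expression type (e.g. a verb-noun pair); this layout is hypothetical.
    """
    # Group all occurrences of the same expression type together.
    by_type = defaultdict(list)
    for inst in instances:
        by_type[inst[0]].append(inst)

    # Shuffle the types (not the instances) reproducibly.
    types = sorted(by_type)
    random.Random(seed).shuffle(types)

    # Reserve a share of the types, at least one, for testing.
    n_test = max(1, round(len(types) * test_ratio))
    test_types = set(types[:n_test])

    train = [i for t in types if t not in test_types for i in by_type[t]]
    test = [i for t in types if t in test_types for i in by_type[t]]
    return train, test
```

Splitting by type rather than by instance means a model cannot score well simply by memorising whether a given expression was mostly idiomatic or mostly literal in the training data.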
Identification of MWEs in context can be modelled with tagging methodologies. To this end, we devise a new neural network architecture, which combines convolutional neural networks and long short-term memory networks, with an optional conditional random field layer on top. We conduct extensive evaluations on several languages, demonstrating better performance than state-of-the-art systems. Experiments show that the model generalises particularly well when predicting unseen MWEs.
In order to find translations for verb-noun MWEs, we propose a bilingual distributional similarity approach derived from a word embedding model that supports arbitrary contexts. The technique is devised to extract translation equivalents from comparable corpora, which are an alternative resource to costly parallel corpora. Finally, we conduct a series of experiments to investigate the effects of the size and quality of comparable corpora on the automatic extraction of translation equivalents.
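To give a feel for how distributional similarity can rank translation candidates, the sketch below assumes that source and target expressions have already been embedded in a shared vector space and orders the candidates by cosine similarity. The shared space, the function names `cosine` and `rank_translations`, and the toy vectors are all assumptions made for illustration; the thesis's actual model and data are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_translations(source_vec, candidates):
    """Return candidate names ordered from most to least similar.

    `candidates` maps a target-language expression to its vector in the
    (assumed) shared bilingual space.
    """
    return sorted(
        candidates,
        key=lambda name: cosine(source_vec, candidates[name]),
        reverse=True,
    )


# Made-up vectors, purely for illustration.
source_mwe = [0.9, 0.1, 0.2]
target_candidates = {
    "candidate_close": [0.8, 0.2, 0.1],
    "candidate_far": [0.1, 0.9, 0.3],
}
ranking = rank_translations(source_mwe, target_candidates)
```

With these toy vectors, `candidate_close` is ranked first because its direction is closer to that of the source vector.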