On Monday, Antoni Oliver González, from the Universitat Oberta de Catalunya (UOC) in Barcelona, arrived at RGCL for a two-week stay to form research collaborations with members of the group. On Thursday, Antoni gave the following talk to the group:
Title: Automatic detection of translation equivalents of terms in large parallel and comparable corpora
Abstract: In this talk, some methodologies for finding the translation equivalents of a term in large parallel and comparable corpora will be presented. For parallel corpora we are using translation tables from Statistical Machine Translation systems (Moses). For comparable corpora we are experimenting with vecmap, a tool to create cross-lingual word embedding mappings. The experiments will be carried out using the IATE database for English for two subjects: International Relations and International Organizations. The goal is to enlarge the Spanish IATE database and to create this database for Catalan.
These experiments are being performed during a short research stay and we will only be able to present preliminary results.
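As a rough illustration of the parallel-corpus side of the abstract, the sketch below reads a few lines in the `|||`-separated layout of a Moses phrase table and ranks candidate translations of a term. It is not the speaker's pipeline: the sample entries and their scores are invented, the helper names are hypothetical, and it assumes the common four-score layout in which the third score is the direct phrase translation probability.

```python
from collections import defaultdict

# Invented sample lines in the usual Moses layout:
# source ||| target ||| scores ||| alignment ||| counts
SAMPLE_TABLE = [
    "international organisation ||| organización internacional ||| 0.71 0.52 0.83 0.60 ||| 0-1 1-0 ||| 40 35 29",
    "international organisation ||| organismo internacional ||| 0.40 0.31 0.12 0.09 ||| 0-1 1-0 ||| 12 35 4",
    "international organisation ||| internacional ||| 0.01 0.20 0.03 0.25 ||| 0-0 ||| 300 35 1",
    "international relations ||| relaciones internacionales ||| 0.90 0.66 0.88 0.61 ||| 0-1 1-0 ||| 52 54 47",
]


def load_phrase_table(lines):
    """Map each source phrase to a list of (target phrase, direct probability)."""
    table = defaultdict(list)
    for line in lines:
        fields = [field.strip() for field in line.split("|||")]
        if len(fields) < 3:
            continue  # skip malformed lines
        scores = fields[2].split()
        if len(scores) < 4:
            continue  # assumption: four scores, direct probability is the third
        table[fields[0]].append((fields[1], float(scores[2])))
    return table


def best_equivalents(table, term, top_n=3):
    """Return the top_n candidate translations of term, best first."""
    candidates = sorted(table.get(term, []), key=lambda pair: pair[1], reverse=True)
    return candidates[:top_n]


table = load_phrase_table(SAMPLE_TABLE)
for target, probability in best_equivalents(table, "international organisation"):
    print(f"{target}\t{probability:.2f}")
```

In practice the candidate list would still need filtering, for example against existing IATE entries, before anything is added to a terminology database.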
The Research Group in Computational Linguistics (RGCL) has been successful in their application for a European Masters in Technology for Translation and Interpreting (EM TTI).
EM TTI will be run by a strong consortium consisting of the University of Wolverhampton, University of Malaga (Spain), University of Ljubljana (Slovenia) and New Bulgarian University (Bulgaria), and will deliver a cohesive, integrated Europe-wide programme. Bringing together these four higher education institutions, which are leaders in research on computational aspects of language study as well as in state-of-the-art technology for translation and interpreting, will give students access to high-profile academics and best practices across the field. Students on the two-year degree course have the opportunity to study at multiple universities and undertake industry placements related to their dissertation.
EM TTI will produce specialists in translation and interpreting who are up-to-date with the latest applications which support their daily work. The disciplines involved are translation, interpreting, language technology, and linguistics.
This was a highly competitive application process. Prof. R Mitkov, the coordinator of the programme and Director of the Research Institute, commented: ‘This programme is not only the first Erasmus Mundus Master programme on Technology for Translation and Interpreting but the very first Master programme in the world on this topic. It will not only enhance the visibility of the research group and university, but will also create a very special teaching and research vibrant environment on the topics covered.’
The funding of 3 million Euros granted by the EC will cover 60 scholarships across the consortium. The offer of scholarships will drive competition for places and ensure candidates of the highest calibre are selected. Students will be awarded a Multiple Master’s degree from the institutions where they study.
The new programme will begin in September 2019, with applications opening in November/December 2018. For any further information, please contact Amanda Bloore, Project and Funding Officer for RIILP (A.Bloore@wlv.ac.uk).
Congratulations to Shiva Taslimipoor who successfully defended her thesis, entitled ‘Automatic Identification and Translation of Multiword Expressions’, on Tuesday. She is pictured (left-right) with Professor Dew Harrison (Chair of the viva), Dr Aline Villavicencio (External Examiner), Professor Mike Thelwall (Internal Examiner) and Professor Ruslan Mitkov (Director of Studies). We are all thrilled for Shiva and wish her the very best for her next venture!
Branislava Šandrih, from the University of Belgrade, has spent a week with us at RGCL. During this time, she has formed collaborations with many members of the group. Branislava gave a talk to the group which outlined her research.
Title: Fingerprints in SMS messages
The talk will present a study which seeks to find answers to the following questions:
- Is it possible to tell who sent a short message by analysing only the distribution of characters, and not the meaning of the content itself?
- If possible, how reliable would the judgment be?
- Are we leaving some kind of ‘fingerprints’ when we text, and can we tell anything about a person based on the way this person writes short messages?
A multilingual corpus of SMS messages was collected from a single smart phone to underpin the development of a methodology to address the above challenges. First, a binary classifier was trained to distinguish between messages composed and sent by a public service (e.g. parking service, bank reports etc.) and messages written by humans. A second classifier caters for the more challenging task of distinguishing between messages written by the owner of the smart phone and messages sent by other senders.
Branislava’s presentation outlined the experiments related to the above classifiers and reported the evaluation results.
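To make the idea of classifying messages by their character distribution more concrete, here is a minimal sketch. It is not the classifier used in the study, which the summary above does not specify: it simply compares a message's character frequencies with an average profile per class using cosine similarity. The training messages and function names are invented for illustration.

```python
import math
from collections import Counter


def char_profile(messages):
    """Relative frequency of every character across a list of messages."""
    counts = Counter()
    for message in messages:
        counts.update(message)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse character-frequency profiles."""
    dot = sum(value * q.get(ch, 0.0) for ch, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def classify(message, profiles):
    """Return the label whose profile is most similar to the message."""
    target = char_profile([message])
    return max(profiles, key=lambda label: cosine(target, profiles[label]))


# Invented toy data: two messages per class.
TRAINING = {
    "service": [
        "PARKING: ZONE 2 PAID UNTIL 14:30. REF 88231",
        "BANK: CARD 4411 DEBITED 25.00 EUR ON 12.03.",
    ],
    "human": [
        "hey, are you coming tonight? :)",
        "ok see you later, thanks!!",
    ],
}
PROFILES = {label: char_profile(msgs) for label, msgs in TRAINING.items()}

print(classify("YOUR PARKING IN ZONE 1 EXPIRES AT 16:00", PROFILES))
print(classify("see you tonight, ok?", PROFILES))
```

The same profile comparison could in principle be applied to the harder owner-versus-other-senders task, although a real system would need far more data and a proper evaluation.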
Next week, Shiva Taslimipoor is to defend her thesis in her viva voce, which will conclude her four-year PhD with the Research Group in Computational Linguistics. In the run-up to her viva, Shiva presented her thesis and the research she has undertaken to the group.
Title: Automatic Identification and Translation of Multiword Expressions
Abstract: Multiword Expressions (MWEs) belong to a class of phraseological phenomena that is ubiquitous in the study of language. They are heterogeneous lexical items consisting of more than one word and feature lexical, syntactic, semantic and pragmatic idiosyncrasies. Scholarly research in MWEs immensely benefits both natural language processing (NLP) applications and end users. Along with the improvement of general NLP techniques, the methodologies to deal with MWEs should be improved.
This thesis involves designing new methodologies to identify and translate MWEs. In order to deal with MWE identification, we first develop datasets of annotated verb-noun MWEs in context. We then propose a method which employs word embeddings to disambiguate between literal and idiomatic usages of expressions. Existence of expression types with various idiomatic and literal distributions leads us to re-examine their modelling and evaluation.
We propose a type-aware train and test splitting approach to prevent models from overfitting and avoid misleading evaluation results.
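The abstract does not spell out how the split is built. One plausible reading, sketched below as an assumption rather than as the thesis's actual procedure, is that all occurrences of a given expression type are kept on the same side, so a model is never tested on an expression it saw during training. The instance format and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def type_aware_split(instances, test_fraction=0.25, seed=0):
    """Split instances so that no expression type appears in both sets."""
    by_type = defaultdict(list)
    for instance in instances:
        by_type[instance["type"]].append(instance)

    # Shuffle the types (not the individual instances) reproducibly.
    types = sorted(by_type)
    random.Random(seed).shuffle(types)

    n_test = max(1, round(len(types) * test_fraction))
    test = [inst for t in types[:n_test] for inst in by_type[t]]
    train = [inst for t in types[n_test:] for inst in by_type[t]]
    return train, test


# Invented toy instances: an expression type, a context, and a usage label.
INSTANCES = [
    {"type": "take place", "context": "the meeting will take place", "label": "idiomatic"},
    {"type": "take place", "context": "take your place in line", "label": "literal"},
    {"type": "make sense", "context": "that does not make sense", "label": "idiomatic"},
    {"type": "give way", "context": "the bridge may give way", "label": "idiomatic"},
    {"type": "give way", "context": "give way to traffic", "label": "idiomatic"},
    {"type": "take part", "context": "she will take part", "label": "idiomatic"},
]

train, test = type_aware_split(INSTANCES)
print(sorted({inst["type"] for inst in train}))
print(sorted({inst["type"] for inst in test}))
```

A purely random split over instances would instead let the same expression appear on both sides, which is the situation the abstract warns can produce misleading evaluation results.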
Identification of MWEs in context can be modelled with tagging methodologies. To this end, we devise a new neural network architecture, which is a combination of convolutional neural networks and long short-term memory (LSTM) networks with an optional conditional random field layer on top. We conduct extensive evaluations on several languages demonstrating a better performance compared to the state-of-the-art systems. Experiments show that the generalisation power of the model in predicting unseen MWEs is outstanding.
In order to find translations for verb-noun MWEs, we propose a bilingual distributional similarity approach derived from a word embedding model that supports arbitrary context. The technique is devised to extract translation equivalents from comparable corpora, which are an alternative resource to costly parallel corpora. We finally conduct a series of experiments to investigate the effects of size and quality of comparable corpora on automatic extraction of translation equivalents.
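The general idea behind distributional approaches to translation extraction can be sketched as a nearest-neighbour search: once expressions from both languages are represented in a shared vector space, the closest target-language vectors are taken as translation candidates. The example below assumes such a shared space already exists and uses plain cosine similarity, which is a stand-in rather than the similarity measure of the thesis. The vectors and the choice of Spanish as target language are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def translation_candidates(source_vector, target_space, top_n=2):
    """Rank target-language expressions by similarity to the source vector."""
    scored = [
        (phrase, cosine(source_vector, vector))
        for phrase, vector in target_space.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Invented three-dimensional vectors, assumed to live in one shared space.
ENGLISH = {"take place": [0.9, 0.1, 0.2]}
SPANISH = {
    "tener lugar": [0.85, 0.15, 0.25],
    "dar un paseo": [0.1, 0.9, 0.3],
    "tomar una decisión": [0.2, 0.3, 0.9],
}

for phrase, score in translation_candidates(ENGLISH["take place"], SPANISH):
    print(f"{phrase}\t{score:.3f}")
```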
We will all be very sorry to say goodbye to Larissa when she soon returns to Universidade Federal de Minas Gerais (UFMG) in Brazil. This week, Larissa gave the group a talk which outlined her research and the work she has been doing in Wolverhampton for the past year.
Title: Mining Short Text data using Parallel programming
Abstract: This work describes the classification of texts as either crime-related or non-crime-related. Given the spontaneity and popularity of Twitter, we collected posts related to crime and criminology in the state of São Paulo (SP), Brazil. However, this data set is not a collection of crime reports. As web language is characterized by diversity, including flexibility, spontaneity and informality, we need a classification rule to filter for the documents which really are in context. The proposed methodology works in a two-step framework. In the first step we partition the text database into smaller data sets which define text collections with characteristics (not necessarily directly observable) which allow a better classification process. This enables the use of parallel computing, which decreases the processing time required to run the technique. In the second step, each subset of the data induces a distinct classification rule through a supervised machine learning technique. For the sake of simplicity we work with KMeans, KMedoids and linear SVM. We will present our results in terms of speed and classification accuracy using various feature sets, including semantic codes.
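The two-step framework described in the abstract can be sketched in a few lines, under several simplifying assumptions: the data are invented 2-D points rather than text features, the partitioning is a hand-rolled k-means, and a perceptron stands in for the linear SVM so that the example needs only the standard library. Threads are used only to show where the per-subset training can run independently; none of this reproduces the speaker's implementation.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, iterations=10):
    """Lloyd's k-means with deterministic farthest-first initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a group happens to be empty.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


def train_perceptron(task, epochs=50):
    """Step 2: fit one linear rule on a subset, centred on its centroid."""
    centroid, examples = task
    weights, bias = [0.0] * len(centroid), 0.0
    for _ in range(epochs):
        for point, label in examples:
            x = [a - c for a, c in zip(point, centroid)]
            y = 1 if label == 1 else -1
            if y * (sum(w * a for w, a in zip(weights, x)) + bias) <= 0:
                weights = [w + y * a for w, a in zip(weights, x)]
                bias += y
    return weights, bias


def fit(points, labels, k):
    """Step 1: partition the data. Step 2: train one model per subset."""
    centroids = kmeans(points, k)
    subsets = [[] for _ in range(k)]
    for point, label in zip(points, labels):
        subsets[nearest(point, centroids)].append((point, label))
    with ThreadPoolExecutor() as pool:  # subsets are independent of each other
        models = list(pool.map(train_perceptron, zip(centroids, subsets)))
    return centroids, models


def predict(point, centroids, models):
    """Route the point to its subset and apply that subset's rule."""
    i = nearest(point, centroids)
    weights, bias = models[i]
    x = [a - c for a, c in zip(point, centroids[i])]
    return 1 if sum(w * a for w, a in zip(weights, x)) + bias > 0 else 0


# Invented data: two well-separated groups, each with its own decision rule.
POINTS = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)]
LABELS = [0, 1, 0, 1, 0, 1, 0, 1]

centroids, models = fit(POINTS, LABELS, k=2)
print([predict(p, centroids, models) for p in POINTS])
```

In this toy data a single linear rule cannot separate the labels across both groups, while one rule per subset can, which is the motivation for partitioning first. For CPU-bound training in Python, a process pool would normally replace the thread pool.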