Congratulations to Shiva Taslimipoor, who successfully defended her thesis, entitled ‘Automatic Identification and Translation of Multiword Expressions’, on Tuesday. She is pictured (left to right) with Professor Dew Harrison (Chair of the viva), Dr Aline Villavicencio (External Examiner), Professor Mike Thelwall (Internal Examiner) and Professor Ruslan Mitkov (Director of Studies). We are all thrilled for Shiva and wish her the very best for her next venture!
Branislava Šandrih, from the University of Belgrade, has spent a week with us at RGCL. During this time, she has formed collaborations with many members of the group. Branislava gave a talk to the group which outlined her research.
Title: Fingerprints in SMS messages
The presentation describes a study which seeks to answer the following questions:
- Is it possible to tell who sent a short message by analysing only the distribution of characters, and not the meaning of the content itself?
- If possible, how reliable would the judgment be?
- Are we leaving some kind of ‘fingerprints’ when we text, and can we tell anything about a person based on the way this person writes short messages?
A multilingual corpus of SMS messages was collected from a single smart phone to underpin the development of a methodology to address the above challenges. First, a binary classifier was trained to distinguish between messages composed and sent by a public service (e.g. parking service, bank reports etc.) and messages written by humans. A second classifier caters for the more challenging task of distinguishing between messages written by the owner of the smart phone and messages sent by other senders.
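The idea of classifying messages by character distribution alone can be illustrated with a minimal sketch. It is not the method used in the study, whose features and classifiers are not described here. The sketch assumes character bigram frequencies as features and a nearest-centroid rule with cosine similarity as the classifier; the function names and the toy messages are invented for illustration.

```python
from collections import defaultdict
from math import sqrt


def char_profile(text, n=2):
    """Relative frequency of each character n-gram in one message."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    profile = defaultdict(float)
    for gram in grams:
        profile[gram] += 1.0 / len(grams)
    return dict(profile)


def cosine(p, q):
    """Cosine similarity between two sparse n-gram profiles."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def train(messages_by_sender, n=2):
    """Average the profiles of each sender's messages into one centroid."""
    centroids = {}
    for sender, messages in messages_by_sender.items():
        total = defaultdict(float)
        for message in messages:
            for gram, value in char_profile(message, n).items():
                total[gram] += value / len(messages)
        centroids[sender] = dict(total)
    return centroids


def predict(centroids, message, n=2):
    """Return the sender whose centroid is most similar to the message."""
    profile = char_profile(message, n)
    return max(centroids, key=lambda sender: cosine(profile, centroids[sender]))


# Invented toy data: automated service messages versus human-written ones.
TOY_MESSAGES = {
    "service": [
        "PARKING: zone 2 paid until 14:30. Ref 88213",
        "BANK: card payment 12.50 EUR on 03/05. Balance 240.10 EUR",
    ],
    "human": [
        "hey are you coming tonight?",
        "lol ok see you later :)",
    ],
}

centroids = train(TOY_MESSAGES)
print(predict(centroids, "BANK: payment 7.20 EUR on 11/06. Balance 99.00 EUR"))
print(predict(centroids, "ok see you tonight lol"))
```

Case and punctuation are deliberately left untouched, since capitalisation habits and emoticons are part of the character distribution. The same structure would apply to the second task (owner versus other senders) by changing the labels.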
Branislava’s presentation outlined the experiments related to the above classifiers and reported the evaluation results.
Next week, Shiva Taslimipoor is to defend her thesis in her viva voce, which will conclude her four-year PhD with the Research Group in Computational Linguistics. In the run-up to her viva, Shiva presented her thesis and the research she has undertaken to the group.
Title: Automatic Identification and Translation of Multiword Expressions
Abstract: Multiword Expressions (MWEs) belong to a class of phraseological phenomena that is ubiquitous in the study of language. They are heterogeneous lexical items consisting of more than one word and feature lexical, syntactic, semantic and pragmatic idiosyncrasies. Scholarly research in MWEs immensely benefits both natural language processing (NLP) applications and end users. Along with the improvement of general NLP techniques, the methodologies to deal with MWEs should be improved.
This thesis involves designing new methodologies to identify and translate MWEs. In order to deal with MWE identification, we first develop datasets of annotated verb-noun MWEs in context. We then propose a method which employs word embeddings to disambiguate between literal and idiomatic usages of expressions. Existence of expression types with various idiomatic and literal distributions leads us to re-examine their modelling and evaluation.
We propose a type-aware train and test splitting approach to prevent models from overfitting and avoid misleading evaluation results.
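One plausible reading of a type-aware split is that all occurrences of a given expression type go entirely into either the training set or the test set, so that a model cannot score well simply by memorising types it has already seen. The sketch below illustrates that reading only. The thesis may define the split differently, and the function name, the tuple layout and the toy instances are assumptions made for illustration.

```python
import random
from collections import defaultdict


def type_aware_split(instances, test_fraction=0.25, seed=0):
    """Split (expression_type, sentence, label) tuples by expression type.

    Every expression type ends up entirely in train or entirely in test.
    """
    by_type = defaultdict(list)
    for instance in instances:
        by_type[instance[0]].append(instance)

    types = sorted(by_type)                 # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(types)
    n_test = max(1, round(len(types) * test_fraction))

    test = [inst for t in types[:n_test] for inst in by_type[t]]
    train = [inst for t in types[n_test:] for inst in by_type[t]]
    return train, test


# Invented toy instances: verb-noun expressions in context with a usage label.
TOY_INSTANCES = [
    ("take place", "The meeting will take place on Monday.", "idiomatic"),
    ("take place", "Please take your place at the table.", "literal"),
    ("make face", "She made a face at the smell.", "idiomatic"),
    ("make face", "He made the face of the puppet from clay.", "literal"),
    ("give hand", "Can you give me a hand with this?", "idiomatic"),
    ("break ice", "A joke helped to break the ice.", "idiomatic"),
    ("break ice", "They had to break the ice on the pond.", "literal"),
]

train_set, test_set = type_aware_split(TOY_INSTANCES)
print(sorted({inst[0] for inst in train_set}))
print(sorted({inst[0] for inst in test_set}))
```

A purely random split over instances would instead place occurrences of the same type on both sides, which is the situation that can inflate evaluation scores.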
Identification of MWEs in context can be modelled with tagging methodologies. To this end, we devise a new neural network architecture, which is a combination of convolutional neural networks and long short-term memory networks with an optional conditional random field layer on top. We conduct extensive evaluations on several languages, demonstrating better performance than state-of-the-art systems. Experiments show that the generalisation power of the model in predicting unseen MWEs is outstanding.
In order to find translations for verb-noun MWEs, we propose a bilingual distributional similarity approach derived from a word embedding model that supports arbitrary context. The technique is devised to extract translation equivalents from comparable corpora, which are an alternative resource to costly parallel corpora. We finally conduct a series of experiments to investigate the effects of size and quality of comparable corpora on automatic extraction of translation equivalents.
We will all be very sorry to say goodbye to Larissa when she soon returns to Universidade Federal de Minas Gerais (UFMG) in Brazil. This week, Larissa gave the group a talk which outlined her research and the work she has been doing in Wolverhampton for the past year.
Title: Mining Short Text data using Parallel programming
Abstract: This work describes the classification of texts as being either crime-related or non-crime-related. Given the spontaneity and popularity of Twitter, we collected some posts related to crime and criminology in the state of São Paulo (SP), Brazil. However, this data set is not a collection of crime reports. As the web language is characterized by diversity, including flexibility, spontaneity and informality, we need a classification rule to filter the documents which really are in the context. The proposed methodology works in a two-step framework. In the first step, we partition the text database into smaller data sets which define text collections with characteristics (not necessarily directly observable) which allow a better classification process. This enables the usage of parallel computing, which decreases the processing time required to execute the technique. Later, each subset of the data induces a distinct classification rule with a Supervised Machine Learning technique. For the sake of simplicity, we work with KMeans and KMedoids and linear SVM. We will present our results in terms of speed and classification accuracy using various feature sets, including semantic codes.
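The two-step framework can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the implementation presented in the talk: it uses bag-of-words counts as features, a bare-bones k-means seeded with the first k documents, and a nearest class centroid rule standing in for the linear SVM. All names and the toy posts are invented. The per-cluster rules are fitted through an executor to show where the parallelism enters.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def bag_of_words(text):
    return Counter(text.lower().split())


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse count vectors."""
    return sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in set(a) | set(b))


def mean_vector(vectors):
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {k: c / len(vectors) for k, c in total.items()}


def nearest(vector, centroids, candidates):
    return min(candidates, key=lambda c: sq_dist(vector, centroids[c]))


def kmeans(vectors, k, iterations=10):
    """Step 1: partition the documents (assumes k <= number of documents)."""
    centroids = [dict(v) for v in vectors[:k]]   # deterministic seeding
    for _ in range(iterations):
        assignment = [nearest(v, centroids, range(k)) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return centroids, [nearest(v, centroids, range(k)) for v in vectors]


def fit_cluster_rule(labelled_vectors):
    """Step 2: one rule per cluster (class centroids, standing in for an SVM)."""
    by_label = {}
    for vector, label in labelled_vectors:
        by_label.setdefault(label, []).append(vector)
    return {label: mean_vector(vs) for label, vs in by_label.items()}


def fit(texts, labels, k=2):
    vectors = [bag_of_words(t) for t in texts]
    centroids, assignment = kmeans(vectors, k)
    subsets = [[(v, y) for v, y, a in zip(vectors, labels, assignment) if a == c]
               for c in range(k)]
    # Clusters are independent, so their rules can be fitted concurrently.
    # Threads only show the structure; CPU-bound work would need processes.
    with ThreadPoolExecutor() as pool:
        rules = list(pool.map(fit_cluster_rule, subsets))
    return centroids, rules


def predict(model, text):
    centroids, rules = model
    vector = bag_of_words(text)
    usable = [c for c in range(len(centroids)) if rules[c]]  # skip empty clusters
    rule = rules[nearest(vector, centroids, usable)]
    return min(rule, key=lambda label: sq_dist(vector, rule[label]))


# Invented toy posts: two topics, each mixing crime-related and other posts.
TEXTS = [
    "street night robbery police", "bank card fraud police",
    "street night party music", "bank card holiday offer",
    "street night assault police", "bank card theft police",
    "street night concert music", "bank card travel offer",
]
LABELS = ["crime", "crime", "other", "other", "crime", "crime", "other", "other"]

model = fit(TEXTS, LABELS, k=2)
print(predict(model, "street night shooting police"))
print(predict(model, "bank card summer offer"))
```

In this toy data the clustering step groups posts by topic, and each topic then gets its own crime versus non-crime rule, which mirrors the idea that each subset induces a distinct classifier.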
The 2nd Conference on Recent Advances in Artificial Intelligence (RAAI) took place on June 25-26 in Bucharest, Romania. It was organized by Prof. Liviu P. Dinu and colleagues from the Faculty of Mathematics and Computer Science of the University of Bucharest.
The conference lasted two days. The first day focused on Natural Language Processing with three invited speakers: Cornelia Caragea from Kansas State University, Marius Pasca from Google, and Marcos Zampieri from the University of Wolverhampton, as well as several presenters from Romania and from abroad. The second day featured presentations on computer vision and other areas of A.I. including a panel discussion with researchers and developers from local A.I. companies such as Bitdefender.
Marcos’ presentation, entitled “Automatic Language Identification: A Solved Task? Modelling Dialectal Variation in Language Identification Systems”, provided an overview of the main challenges in language identification, with a special focus on dialectal variation, taking into account the lessons learned in the five years of the VarDial workshop.
I recently participated in the LxMLS summer school in Lisbon, Portugal. This is an annual event that covers the theory and application of machine learning, with an emphasis on natural language processing. The lectures followed a linear progression, starting from the fundamentals of traditional machine learning and later covering developments in deep learning. Each morning there was a lecture on some aspect of machine learning, and after lunch students were assembled into groups to participate in the practical programming sessions. In the afternoons there was a talk on some application of machine learning in an actual research project.
In total there were more than 230 participants and the summer school lasted for 8 days. The lecturers are accomplished researchers in the field and the presentations were usually engaging and informative. I particularly enjoyed the talks given by Noah Smith, Chris Dyer, and Kyunghyun Cho. The event also included a poster presentation and a demo day where regional IT companies showcased their work and did recruitment advertising.
During the summer school I had the opportunity to get to know several PhD students working in the field from universities around the world, and the networking was very valuable. The practical coding sessions could have been organised better, with more supervision, but overall I consider the experience positive and worthwhile. I also found a bit of time during the day off to explore Lisbon and its surrounding areas. I enjoyed the historical delights and the amazing seafood, and I look forward to revisiting Portugal soon.