Shiva presents her thesis to RGCL

Next week, Shiva Taslimipoor is to defend her thesis in her viva voce, which will conclude her four-year PhD with the Research Group in Computational Linguistics. In the run-up to her viva, Shiva presented her thesis and the research she has undertaken to the group.

Title: Automatic Identification and Translation of Multiword Expressions

Abstract: Multiword Expressions (MWEs) belong to a class of phraseological phenomena that is ubiquitous in the study of language. They are heterogeneous lexical items consisting of more than one word and feature lexical, syntactic, semantic and pragmatic idiosyncrasies. Scholarly research in MWEs immensely benefits both natural language processing (NLP) applications and end users. Along with the improvement of general NLP techniques, the methodologies to deal with MWEs should be improved.

This thesis involves designing new methodologies to identify and translate MWEs. In order to deal with MWE identification, we first develop datasets of annotated verb-noun MWEs in context. We then propose a method which employs word embeddings to disambiguate between literal and idiomatic usages of expressions. The existence of expression types with various idiomatic and literal distributions leads us to re-examine their modelling and evaluation.

We propose a type-aware train and test splitting approach to prevent models from overfitting and avoid misleading evaluation results.
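The idea of a type-aware split can be illustrated with a minimal sketch. It is not the thesis implementation; it only assumes that each annotated instance carries its expression type, and it assigns whole types to either the training set or the test set so that no type appears in both. The function name `type_aware_split`, the tuple layout and the toy sentences are hypothetical.

```python
import random
from collections import defaultdict


def type_aware_split(instances, test_fraction=0.2, seed=0):
    """Split instances so that no expression type occurs in both sets.

    Each instance is assumed to be a tuple (expression_type, sentence, label).
    """
    # Group all instances by their expression type.
    by_type = defaultdict(list)
    for inst in instances:
        by_type[inst[0]].append(inst)

    # Shuffle the types (not the instances) reproducibly.
    types = sorted(by_type)
    random.Random(seed).shuffle(types)

    # Reserve a fraction of the types, at least one, for testing.
    n_test = max(1, round(len(types) * test_fraction))
    test_types = set(types[:n_test])

    train = [i for t in types if t not in test_types for i in by_type[t]]
    test = [i for t in types if t in test_types for i in by_type[t]]
    return train, test


# Toy data, invented for illustration only.
data = [
    ("take place", "The meeting took place on Monday.", "idiomatic"),
    ("take place", "Please take your place in the queue.", "literal"),
    ("make face", "She made a face at the smell.", "idiomatic"),
    ("give hand", "Can you give me a hand with this?", "idiomatic"),
    ("give hand", "He gave her his hand to shake.", "literal"),
    ("hit road", "We hit the road at dawn.", "idiomatic"),
]

train, test = type_aware_split(data, test_fraction=0.25)
print("train types:", sorted({t for t, _, _ in train}))
print("test types:", sorted({t for t, _, _ in test}))
```

With a random instance-level split, sentences containing the same expression type could land on both sides, so a model could score well simply by memorising types; splitting by type avoids that.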

Identification of MWEs in context can be modelled with tagging methodologies. To this end, we devise a new neural network architecture, which is a combination of convolutional neural networks and long short-term memory networks with an optional conditional random field layer on top. We conduct extensive evaluations on several languages, demonstrating better performance than state-of-the-art systems. Experiments show that the generalisation power of the model in predicting unseen MWEs is outstanding.

In order to find translations for verb-noun MWEs, we propose a bilingual distributional similarity approach derived from a word embedding model that supports arbitrary context. The technique is devised to extract translation equivalents from comparable corpora, which are an alternative resource to costly parallel corpora. We finally conduct a series of experiments to investigate the effects of size and quality of comparable corpora on automatic extraction of translation equivalents.
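As a rough illustration of distributional similarity across languages, the sketch below ranks candidate translations by cosine similarity to a source expression, assuming both already have vectors in a shared space. This is a generic simplification and not the model described in the thesis; the function names, candidate labels and vector values are all hypothetical toy inputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_translations(source_vec, candidates, top_k=3):
    """Return the top_k (candidate, score) pairs, most similar first."""
    scored = [(name, cosine(source_vec, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors, invented for illustration; real vectors would be learned
# from comparable corpora.
source_vec = [0.9, 0.1, 0.3]
candidates = {
    "candidate_a": [0.8, 0.2, 0.4],
    "candidate_b": [0.1, 0.9, 0.2],
    "candidate_c": [-0.5, 0.3, 0.1],
}

ranked = rank_translations(source_vec, candidates)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```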

Larissa Sayuri Futino Castro dos Santos ends her year at RGCL by giving a talk to discuss her research

We will all be very sorry to say goodbye to Larissa when she soon returns to Universidade Federal de Minas Gerais (UFMG) in Brazil. This week, Larissa gave the group a talk which outlined her research and the work she has been doing in Wolverhampton for the past year.

Title: Mining Short Text data using Parallel programming

Abstract: This work describes the classification of texts as being either crime-related or non-crime-related. Given the spontaneity and popularity of Twitter, we collected posts related to crime and criminology in the state of São Paulo (SP), Brazil. However, this data set is not a collection of crime reports. As web language is characterised by diversity, including flexibility, spontaneity and informality, we need a classification rule to filter the documents which really are in context. The proposed methodology works in a two-step framework. In the first step we partition the text database into smaller data sets which define text collections with characteristics (not necessarily directly observable) that allow a better classification process. This enables the use of parallel computing, which decreases the processing time required to execute the technique. In the second step, each subset of the data induces a distinct classification rule using a supervised machine learning technique. For the sake of simplicity we work with KMeans, KMedoids and a linear SVM. We will present our results in terms of speed and classification accuracy using various feature sets, including semantic codes.

Dr. Marcos Zampieri – invited speaker at RAAI 2018

The 2nd Conference on Recent Advances in Artificial Intelligence (RAAI) took place on June 25-26 in Bucharest, Romania. It was organized by Prof. Liviu P. Dinu and colleagues from the Faculty of Mathematics and Computer Science of the University of Bucharest.

The conference lasted two days. The first day focused on Natural Language Processing with three invited speakers: Cornelia Caragea from Kansas State University, Marius Pasca from Google, and Marcos Zampieri from the University of Wolverhampton, as well as several presenters from Romania and from abroad. The second day featured presentations on computer vision and other areas of A.I. including a panel discussion with researchers and developers from local A.I. companies such as Bitdefender.

Marcos’ presentation, entitled “Automatic Language Identification: A Solved Task? Modelling Dialectal Variation in Language Identification Systems”, provided an overview of the main challenges in language identification, with a special focus on dialectal variation, taking into account the lessons learned in the five years of the VarDial workshop.

Omid Rohanian: My visit to LxMLS summer school

I recently participated in the LxMLS summer school in Lisbon, Portugal. This is an annual event that focuses on the theory and application of machine learning, with an emphasis on natural language processing. The lectures followed a linear progression, starting from the fundamentals of traditional machine learning and later covering developments in deep learning. Each morning there was a lecture on some aspect of machine learning, and after lunch students were assembled into groups to participate in practical programming sessions. In the afternoons there was a talk on some application of machine learning in an actual research project.

In total there were more than 230 participants and the summer school lasted for 8 days. The lecturers are accomplished researchers in the field and the presentations were usually engaging and informative. I particularly enjoyed the talks given by Noah Smith, Chris Dyer, and Kyunghyun Cho. The event also included a poster presentation and a demo day where regional IT companies showcased their work and did recruitment advertising.

During the summer school I got the opportunity to get to know several PhD students working in the field from universities around the world, and the networking was very valuable. The practical coding sessions could have been organised better, with more supervision, but overall I consider the experience positive and worthwhile. I also found a bit of time during the day off to explore Lisbon and its surrounding areas. I enjoyed the historical delights and the amazing seafood and look forward to revisiting Portugal soon.

Dr Michael Oakes attends the LingPhil Summer School


The LingPhil summer school is an annual event primarily for the training of Ph.D. students in linguistics and philosophy in Norway, but students from other countries can come as well. This year it was held at the Solstrand Hotel and Spa, a beautiful old hotel built in 1896, near Bergen in Norway.

The school opened on Monday June 4th with a session by Ewa Dąbrowska highlighting the “Seven Deadly Sins of Cognitive Linguistics”, which include excessive reliance on introspective evidence.

The following day Paul Kerswill from the University of York spoke on sociolinguistics – demography, social structure and identity in language change. As case studies, he talked about contact varieties of English which grew up in the Industrial Revolution, and recent developments in London English. Languages which are in contact become simplified, while languages which are isolated grow more complex. Steve Mann from Warwick University gave a training session on the research interview – how to collect data and analyse it, and the pros and cons of individual interviews and focus groups.

Wednesday was eventful, starting with a session on Corpus Pragmatics. Our excursion was in the afternoon, a boat trip along the Bjørner Fjord to the island of Lysøen, which was once owned by Ole Bull, a world-famous violinist. He had built an ornately carved wooden house there, with a main room that could be used as a small concert hall. One of the organisers, Gisle Andersen, came with his choir to sing songs composed by Ole Bull and Edvard Grieg.

On the Thursday, I gave my sessions on “Statistics for Linguistics”, using Chris Butler’s book of the same name as the basis of the course. It was a busy day for me, as in the evening I led a discussion group with students who are using statistics in their Ph.D. studies. Agnes Marie Bamford, who runs her own consultancy, and Claudia F. Hegrenæs from the Norwegian School of Economics ran the career workshop. They pointed out that many transferable skills can be gained from studying for a Ph.D., such as writing, networking, time management, analytic skills, critical thinking, problem solving, processing information quickly, endurance, grant writing skills, presentation skills, organising and coordinating, teaching experience even outside your comfort zone, and how to pitch your project. In fact, there was a special session devoted to the students all preparing “elevator pitches” to describe their work in two or three minutes.

On the final day, Åsta Haukås from the University of Bergen gave a session on multilingualism and the strategies that people who are already at least bilingual use to learn new languages. Many of these had been discovered using questionnaires based on SILL, the Strategy Inventory for Language Learning. As an illustration, we were given an article about Juliette Binoche in Dutch, and guessed the origins of each word in the text.

PhD studentship in Translation Technology

Closing date 20th June 2018, Skype interviews 26th June 2018

The Research Group in Computational Linguistics (http://rgcl.wlv.ac.uk) at the University of Wolverhampton invites applications for a 3-year PhD studentship in the area of translation technology. This PhD studentship is part of a larger university investment which includes other PhD students and members of staff, with the aim of strengthening the existing research undertaken by members of the group in this area. This funded student bursary consists of a stipend towards living expenses (£14,500 per year) and remission of fees.

We invite applications in the area of translation technology, defined in the broadest sense possible and ranging from advanced methods in machine translation to user studies which involve the use of technology in the translation process. We welcome proposals focusing on Natural Language Processing techniques for translation memory systems and translation tools in general. Given the current research interests of the group and its focus on computational approaches, we would be interested in topics including but not limited to:

– Enhancing retrieval and matching from translation memories with linguistic information
– The use of deep learning (and in general, statistical) techniques in translation memories
– (Machine) translation of user-generated content
– The use of machine translation in cross-lingual applications (with particular interest in sentiment analysis, automatic summarisation and question answering)
– Phraseology and computational treatment of multi-word expressions in machine translation and translation memory systems
– Quality estimation for translation professionals

Other topics will also be considered as long as they align with the interests of the group. The appointed student is expected to work on a project that has a significant computational component. For this reason we expect that the successful candidate will have a good background in computer science and programming.

The application deadline is 20th June 2018 and Skype interviews with the shortlisted candidates are planned for the 26th June. The starting date of the PhD position is as soon as possible after the offer is made.

The successful applicant must have:

– A good honours degree or equivalent in Computational Linguistics, Computer Science, Translation Studies or Linguistics
– A strong background in Programming and Statistics/Mathematics or in closely related areas (if relevant to the proposed topic)
– Experience in Computational Linguistics / Natural Language Processing, including statistical, machine learning and deep learning applications to Natural Language Processing
– Experience with translation technology
– Experience with programming languages such as Python, Java or R is a plus
– An IELTS certificate with a score of 6.5 is required from candidates whose native language is not English. If a certificate is not available at the time of application, the successful candidate must be able to obtain it within one month of the offer being made.

Candidates from the UK/EU and from non-EU countries can apply. We encourage applications from female candidates.

Applications must include:

1. A curriculum vitae indicating degrees obtained, courses covered, publications, relevant work experience and names of two referees that could be contacted if necessary

2. A research statement which outlines the topics of interest. More information about the expected structure of the research statement can be found at https://www.wlv.ac.uk/media/departments/star-office/documents/Guidelines-for-completion-of-Research-Statement.doc

Information on RGCL:

Established by Prof Mitkov in 1998, the research group in Computational Linguistics delivers cutting-edge research in a number of NLP areas. The results from the latest Research Excellence Framework confirm the research group in Computational Linguistics as one of the top performers in UK research with its research defined as ‘internationally leading, internationally excellent and internationally recognised’. The research group has recently completed the coordination of the EXPERT project, a successful EC Marie Curie Initial Training Network promoting research, development and use of data-driven technologies in machine translation and translation technology (http://expert-itn.eu).


Contact:

To find out more, please contact:

Dr Constantin Orasan (Reader in Computational Linguistics, Deputy Head of the Research Group in Computational Linguistics)

Research Group in Computational Linguistics
Research Institute of Information and Language Processing
University of Wolverhampton
MC139 Stafford Street
Wolverhampton
WV1 1LY

Tel. +44 (0) 1902 321630
Email: C.Orasan at wlv.ac.uk
Homepage: http://dinel.org.uk