Jeremy Chelala reflects on his Summer Internship at RGCL

My time at RGCL — Jeremy Chelala

My name is Jeremy Chelala and I am a student at the Université Catholique de Louvain in Belgium. As part of my Master's course in NLP, I worked as a trainee at RGCL. Thanks to the Erasmus+ programme, I had the chance to work with the RGCL staff members for nine weeks in the summer of 2017. My fields of interest for this internship were automatic simplification and summarization, with a particular focus on the ways techniques from both fields can be combined to improve automatic summary generation. During my time at RGCL, I implemented a sentence compressor, working together with simplification and summarization specialists such as R. Evans and Dr. C. Orasan, who was my supervisor at RGCL. This compression tool represents a substantial first step in the development of a larger summarization system, which I will present in my Master's thesis in 2018.

During my traineeship, I was able to draw on several NLP researchers' experience and advice to help me develop my program, not to mention technical and logistical support. I was taught to use new tools and techniques to solve specific NLP problems. Furthermore, by being part of the Group, I could participate in several seminars given by experienced researchers, learn about their latest advances and see how a research centre operates in general. I also met a lot of people from all around the world, whom I hope to see again one day.

The evaluation of my program is still in progress, but I can already tell that my time at RGCL has been beneficial for my project, as I learned a lot from this experience. If the results are promising, a paper presenting my compressor might be published.

Jeremy Chelala

Journal of Natural Language Engineering: Call for special issue proposals

The area of Natural Language Engineering, and Natural Language Processing in general, is following the trend of many other areas in becoming highly specialised, with a number of application-orientated and narrow-domain topics emerging or growing in importance. These developments, often coinciding with a lack of related literature, necessitate and warrant the publication of specialised volumes focusing on a specific topic of interest to the Natural Language Processing (NLP) research community.

The Journal of Natural Language Engineering (JNLE), which now features six 160-page issues per year and has increased its impact factor for the third consecutive year, invites proposals for special issues on a competitive basis on any topic in applied NLP that has emerged as an important recent development and has attracted the attention of a number of researchers or research groups. In recent years, Calls for Proposals for special issues have resulted in high-quality outputs and this year we look forward to another successful competition.


2 PhD studentships in Translation Technology

The Research Group in Computational Linguistics invites applications for TWO 3-year PhD studentships in the area of translation technology. These two PhD studentships are part of a university investment which also includes the appointment of a reader (the equivalent of an associate professor) and a research fellow, with the aim of strengthening the existing research undertaken by members of the group in this area. These funded student bursaries consist of a stipend towards living expenses (£14,500 per year) and remission of fees.


RIILP Annual PhD Poster Presentations


Last week, the RGCL and SCRG PhD Students presented their research to their peers and staff members from across the University. The posters were well received.

Statistical Cybermetrics Research Group

David Foster: ‘Determining YouTube Video Popularity: Analysing YouTube User Behaviours’

Kuk Aduku: ‘Do Patents Cite Conference Papers as Often as Journal Articles in Engineering? An Investigation of Four Fields’

 

Research Group in Computational Linguistics

Mohammad Alharbu: ‘Readability Assessment for Arabic as a Second Language’

Najah Albaqawi: ‘Gender Variation in Gulf Pidgin Arabic’

This poster presents a quantitative variationist analysis of variability in GPA morpho-syntax (Arabic definiteness markers, Arabic conjunction markers, object or possessive pronouns, the GPA copula, and agreement in the verb phrase and the noun phrase), aiming to discover the potential effect of three factors: speaker gender, speakers’ first language, and the number of years spent in the Gulf.

Richard Evans: ‘Sentence Rewriting for Language Processing’

This poster provided an overview of the OB1 sentence simplification system. In this approach, the functions of various textual markers of syntactic complexity (conjunctions, relative pronouns, and punctuation marks) are identified and used to inform an iterative rule-based sentence transformation process.
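As a rough illustration of this general idea (and emphatically not the OB1 system itself), a single rule that splits a sentence at a coordinating conjunction preceded by a comma might look like the minimal Python sketch below; the marker set and the clause handling here are deliberately simplified assumptions, whereas OB1 uses much richer sign classes and iterates the transformation.

```python
import re

# Toy illustration only: split a sentence at a coordinating conjunction
# that follows a comma, producing two shorter sentences.
# Real systems such as OB1 use far richer marker classes and iterate.
COORDINATORS = re.compile(r",\s+(and|but|or)\s+", re.IGNORECASE)

def naive_split(sentence: str) -> list[str]:
    """Split once at the first ', and/but/or ' marker, if any."""
    match = COORDINATORS.search(sentence)
    if not match:
        return [sentence]
    left = sentence[:match.start()].rstrip(" ,") + "."
    right = sentence[match.end():].strip()
    right = right[0].upper() + right[1:]
    return [left, right]

print(naive_split("The committee approved the plan, but the budget was rejected."))
# ['The committee approved the plan.', 'The budget was rejected.']
```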

Ahmed Omer: ‘New Techniques For Finding Authorship in Arabic Texts’

In general, the technique first identifies a set of linguistic features and a difference measure that successfully discriminate between texts known to be by author A or author B. The degree of stylistic difference between a pair of documents can then be found by any of a number of measures which compare the sets of linguistic features extracted from each document. Texts of unknown authorship are then compared against the known texts to see whether their writing style is more similar to that of author A or author B.
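A minimal sketch of this general approach is given below, assuming a hypothetical feature set of common English function-word frequencies and cosine distance as the difference measure; neither choice is necessarily what this particular work uses.

```python
from collections import Counter
import math

# Hypothetical feature set: relative frequencies of common function words,
# one simple choice of "linguistic features" for stylometric comparison.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "is", "was", "for"]

def feature_vector(text: str) -> list[float]:
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - (dot / norm if norm else 0.0)

def attribute(unknown: str, author_a: str, author_b: str) -> str:
    """Attribute an unknown text to whichever known author is stylistically closer."""
    u, a, b = map(feature_vector, (unknown, author_a, author_b))
    return "A" if cosine_distance(u, a) < cosine_distance(u, b) else "B"
```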

Omid Rohanian: ‘NLP Approaches to Estimating Text Difficulty’

I am exploring NLP approaches to investigating text difficulty at the level of concepts.

Shiva Taslimpoor: ‘Automatic Extraction and Translation of Multiword Expressions’

We employ state-of-the-art word embedding approaches to automatically identify and translate idiosyncratic multiword expressions.

RGCL welcomes Javier Pérez-Guerra

On Wednesday 7 June, RGCL welcomed Javier Pérez-Guerra from the University of Vigo in Spain. Javier is currently a Visiting Researcher at the Department of Linguistics and English Language, Lancaster University, and we were very pleased that he could spare the time to visit and to give a talk to our Research Group. The talk was well attended and very well received!

TITLE: Coping with markedness in English syntax: on the ordering of complements and adjuncts

ABSTRACT:

This talk examines the forces that trigger two word-order designs in English: (i) object-verb sentences (*?The teacher the student hit) and (ii) adjunct-complement vs. complement-adjunct constructions (He taught yesterday Maths vs. He taught Maths yesterday). The study focuses both on the diachronic tendencies observed in data from Middle English, Early Modern English and Late Modern English, and on their synchronic design in Present-Day English. The approach is corpus-based (or even corpus-driven) and the data, representing different periods and text types, are taken from a number of corpora (the Penn-Helsinki Parsed Corpus of Middle English, the Penn-Helsinki Parsed Corpus of Early Modern English, the Penn Parsed Corpus of Modern British English and the British National Corpus, among others). The aim of this talk is to look at the consequences that the placement of major constituents (e.g. complements) has for the parsing of the phrases in which they occur. I examine whether the data are in keeping with determinants of word order such as complements-first (complement plus adjunct) and end-weight in the periods under investigation. Some statistical analyses will help determine the explanatory power of such determinants.

Short term job opportunity: Research Associate – AUTOR

This post is being offered on a casual basis until 31 July 2017

The Research Group in Computational Linguistics at the University of Wolverhampton is currently recruiting a Research Associate to conduct research on the AUTOR project, which aims to help people with autism read and understand text better (for more information on this project, please visit http://autor4autism.com/).

As a Research Associate you will use relevant NLP technologies such as lexical, syntactic, and semantic processing to design and implement applications that support AUTOR's core mission of developing educational assistance for people with autism.

You should hold a Bachelor’s or Master’s degree, and ideally a PhD, in Information Science, Computer Science or Natural Language Processing, together with experience of software development or employment in these fields. You should have experience of language technologies and resources and be willing to work as part of an extended team to research computational linguistics approaches to support the development of education-assistance tools for people with autism. Knowledge of machine learning is required.

Interview dates to be confirmed. Start of the post to be agreed with the successful candidate. This is a temporary, zero-hour contract.

For informal discussion about the role please contact Dr Victoria Yaneva (v.yaneva@wlv.ac.uk).

For more information and to apply online, click here.

RGCL Staff Research Seminar

This week Dr Constantin Orasan gave a staff research seminar profiling his current and future research on the Feedback Analysis Tool. The talk was well received and there was an interesting debate and questions afterwards.

Title: Presentation of the Feedback Analysis Tool

Abstract: 

The Feedback Analyser is an open-source intelligent tool designed to analyse feedback provided by participants in various activities. The tool relies on a set of modules to analyse the sentiment of unstructured texts, identify recurring themes that occur in them, and allow easy comparison between various activities and the users involved in those activities. The tool produces reports fully automatically, but its real strength comes from the fact that it allows an analyst to drill down into the data and identify information that otherwise could not be found without significant effort. The idea for the tool started from a discussion with the University Outreach team, who wanted to extract changes in feelings and aspirations towards Higher Education by processing hundreds of pieces of free-text student data in a matter of minutes.

This talk will provide an overview of the modules currently incorporated in the system and present results from a small-scale pilot. The possibility of developing the tool further will be discussed, with the audience invited to give suggestions.
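For readers unfamiliar with this kind of pipeline, the following is a purely illustrative sketch of the two kinds of module mentioned above, a sentiment scorer and a recurring-theme detector. It assumes a toy sentiment lexicon and simple word-frequency themes, and is not the actual Feedback Analyser implementation, whose modules are not described in detail here.

```python
from collections import Counter
import re

# Illustrative only: a tiny pipeline in the spirit of the description above,
# with a lexicon-based sentiment module and frequency-based theme detection.
POSITIVE = {"good", "great", "helpful", "inspiring", "enjoyed"}
NEGATIVE = {"bad", "boring", "confusing", "difficult", "poor"}
STOPWORDS = {"the", "a", "an", "and", "to", "of", "was", "it", "i", "is", "in"}

def sentiment(comment: str) -> int:
    """Crude polarity score: positive hits minus negative hits."""
    words = re.findall(r"[a-z']+", comment.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def recurring_themes(comments: list[str], top_n: int = 5) -> list[str]:
    """Most frequent non-stopword terms across all comments."""
    words = [w for c in comments for w in re.findall(r"[a-z']+", c.lower())
             if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(top_n)]

feedback = ["The workshop was great and inspiring", "Registration was confusing"]
print([sentiment(c) for c in feedback], recurring_themes(feedback))
```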

RGCL Welcomes Lut Colman

Last week Lut Colman visited RGCL from the Instituut voor de Nederlandse Taal, Leiden (INT).

The main objective of Lut’s visit was to gain a deeper understanding of Corpus Pattern Analysis (CPA), a corpus-driven technique developed by Prof. Hanks and implemented in the Pattern Dictionary of English Verbs (PDEV), and to test the lexicographic tools used for PDEV in order to establish whether or not they are suitable for her Dutch pilot project.  Whilst Lut was here, she gave a talk on her upcoming research project.

Title: Dutch Verb Patterns Online: A Collocation and Pattern Dictionary of Dutch Verbs

Abstract:

Dutch Verb Patterns Online is a project to be developed at the Dutch Language Institute (INT) in Leiden. A pilot will consist of a collocation and pattern dictionary of a selection of verbs for advanced learners of Dutch as a second language. For that purpose, the institute will form a consortium with two partners who have expertise in developing e-learning material for language learners.

The aim of the project is to build a database and web application with the following information sections on verbs for language learners:

1) collocations: semi-fixed lexical combinations and fixed grammatical collocations that need not be defined, such as een fout {maken, begaan} (make a mistake), vertrouwen op (rely on), etc.

2) idioms: expressions that have to be defined because the meaning is opaque, such as de strijdbijl begraven (bury the hatchet)

3) GDEX-examples. GDEX stands for good dictionary examples: short, representative and illustrative example sentences from a corpus

4) verb patterns: semantically motivated pieces of phraseology in which the valency slots of the verb are occupied by arguments of a particular semantic type (e.g. human, location). Semantic types are realized by lexical sets: lists of words and phrases that occur as collocates. Each pattern corresponds to a meaning. Patterns are identified by means of Corpus Pattern Analysis (CPA), a lexicographical technique used by Patrick Hanks in the Pattern Dictionary of English Verbs, PDEV (http://pdev.org.uk/) and based on his Theory of Norms and Exploitations (Hanks 2013).

The Dutch project aims to combine a pattern dictionary with a collocation application like Sketch Engine for Language Learning (SkELL) (Baisa & Suchomel, n.d.). A SkELL for Dutch can be developed before we get started with the more labour-intensive pattern descriptions. Eventually, both functionalities can be merged and included as a plug-in resource in the language material for second-language learners. Students will not only have access to patterns or collocation lists separately, but will also be able to see which collocations fill a semantic-type slot in a pattern.

References

Baisa, V., & Suchomel, V. (n.d.). SkELL: Web Interface for English Language Learning.

Hanks, P. (2013). Lexical Analysis: Norms and Exploitations. MIT Press.

 

RGCL welcomes Ximena Gutierrez-Vasques

Ximena Gutierrez-Vasques is currently visiting the Research Group in Computational Linguistics from the National Autonomous University of Mexico to collaborate with members of the group. On 25th April, Ximena gave the group a talk on her research area.

Title: Bilingual lexicon extraction for a low-resource language pair

Abstract:

Bilingual lexicon extraction is the task of obtaining a list of word pairs deemed to be word-level translations. This has been an active area of NLP research for several years, especially with the availability of large amounts of parallel, comparable and monolingual corpora that allow us to model the relations between the lexical units of two languages.

However, the complexity of this task increases when we deal with typologically different languages where little data is available.

We focus on the language pair Spanish-Nahuatl. These two languages are spoken in the same country (Mexico) but are distant from each other, belonging to different language families: Indo-European and Uto-Aztecan. Nahuatl is an indigenous language with around 1.5 million speakers, and monolingual and parallel corpora for it are scarce.

Our work comprises the construction of the first publicly available digital parallel corpus for this language pair. Moreover, we explore the combination of several language features and statistical methods to estimate bilingual word correspondences.
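One common baseline for estimating such correspondences from a sentence-aligned parallel corpus, not necessarily the method used in this work, scores candidate word pairs by their sentence-level co-occurrence using the Dice coefficient. A minimal sketch with toy, made-up data follows.

```python
from collections import Counter
from itertools import product

# Baseline sketch (not necessarily the authors' method): score candidate
# source-target word pairs by the Dice coefficient of their sentence-level
# co-occurrence in a sentence-aligned parallel corpus.
def dice_lexicon(parallel, min_score=0.3):
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in parallel:
        src_words, tgt_words = set(src_sent.split()), set(tgt_sent.split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        pair_freq.update(product(src_words, tgt_words))
    lexicon = {}
    for (s, t), c in pair_freq.items():
        score = 2 * c / (src_freq[s] + tgt_freq[t])
        if score >= min_score:
            lexicon[(s, t)] = score
    return lexicon

# Toy sentence-aligned corpus (illustrative strings, not real data).
corpus = [("la casa grande", "kali weyi"), ("la casa", "kali")]
print(dice_lexicon(corpus))
```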

Welcome to Prof. Mikel Forcada

On Wednesday 6th April, RGCL were very pleased to welcome Prof. Mikel Forcada from the University of Alicante, Spain. Mikel is currently undertaking a sabbatical in England and we were very pleased that he could spare the time to visit and to give a talk to our Research Group. The talk, about translation technologies, was well attended and very well received!

Title: Towards effort-driven combination of translation technologies in computer-aided translation

Abstract:

The talk puts forward a general framework for the measurement and estimation of professional translation effort in computer-aided translation. It then outlines the application of this framework to optimize and seamlessly combine available translation technologies (machine translation, translation memory, etc.) in a principled manner to reduce professional translation effort. Finally, it shows some results that point to existing challenges, particularly as regards machine translation.