The Research Group in Computational Linguistics at the University of Wolverhampton (http://rgcl.wlv.ac.uk) is currently recruiting a Lecturer/Senior Lecturer in Translation Technology (permanent). The purpose of this post is to strengthen the research group by enhancing its research and publications in the field of translation technology. The appointed candidate will be expected to produce REF-returnable outputs, attract external income, seek industrial collaborations, teach at Masters level and supervise PhD students. He/she will join a recently appointed research fellow and two PhD students in translation technology. All these posts are part of a university investment in the area of translation technology.
For the last Staff Research Seminar of 2017, Dr Michael Oakes gave a talk on his current research. The talk was well received and was followed by an interesting debate and questions.
TITLE: Experiments on “The Dark Tower”, the Indus Script and the ENNTT Corpus.
In this talk I will give a brief introduction to the research I have been doing this year (and earlier).
Firstly, I will talk about the use of disputed authorship techniques, especially principal components analysis, to look at the probable authorship of the “Dark Tower”, which is generally attributed to the author of the “Narnia” series, C. S. Lewis.
Secondly, I will look at the use of LNRE (Large Number of Rare Events) models to estimate the vocabulary size of the undeciphered Indus script, which was used in Northern India and Pakistan from approximately 2600 to 1900 BC.
Thirdly, the ENNTT (Europarl Corpus of Native, Non-native and Translated Texts) corpus, developed by Rabinovich et al., is a subset of the Europarl corpus. Using the ENNTT sub-corpus of texts translated into English, principal components analysis can be used to determine the language family (Romance or Germanic) that the texts were originally written in, and to a lesser extent, even the individual language.
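As an illustration of how principal components analysis can separate texts by style, here is a minimal sketch in Python. It is not the method used in the talk: the marker words, the whitespace tokenisation and the function names `profile` and `pca_scores` are all assumptions made for this example. Each text is reduced to relative frequencies of a few common words, and the texts are then projected onto the first principal component, where texts with similar profiles should land close together.

```python
from collections import Counter
from math import sqrt

# Hypothetical marker words; a real study would select features empirically.
MARKER_WORDS = ["the", "of", "and", "to", "in", "that"]


def profile(text):
    """Relative frequency of each marker word in a text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in MARKER_WORDS]


def pca_scores(rows, iterations=200):
    """Project each row onto the first principal component (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred data.
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Non-uniform start vector, so it is unlikely to be orthogonal to the answer.
    vec = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in nxt))
        if norm == 0:  # no variance at all: nothing to project onto
            break
        vec = [x / norm for x in nxt]
    return [sum(c[j] * vec[j] for j in range(d)) for c in centred]


if __name__ == "__main__":
    # Toy texts, invented for this example only.
    texts = [
        "the report of the committee and the opinion of the council",
        "the vote of the house and the view of the members",
        "we want to go to work in order to help that country",
        "i hope that we agree to act in time to stop that",
    ]
    for text, score in zip(texts, pca_scores([profile(t) for t in texts])):
        print(round(score, 3), text)
```

With real data, the scores on the first one or two components would be plotted and inspected for clusters; the sign of a component is arbitrary, so only relative positions matter.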
In November, Dr Constantin Orasan gave a staff research seminar profiling his current and future research, including a user study on quality estimation for professional translators. The talk was well received and was followed by an interesting debate and questions.
Title: Quality estimation for professional translators: a user study
Postediting of machine translation output has become an important step in the workflows employed by translation companies. The idea behind postediting is that it is possible to improve the productivity of professional translators by asking them to correct the output of machine translation systems rather than to translate from scratch. In cases in which the quality of translation is poor this is not necessarily true. The field of quality estimation could prove useful to decide which sentences can be postedited and which should be translated from scratch. This talk will report the results of a user study which recorded the productivity of four professional translators when they were asked to postedit and translate sentences in different scenarios.
Our results show that quality estimation information, when accurate, improves post-editing efficiency. The analysis has also raised a number of questions which are worth investigating.
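The decision described in the abstract, postediting a sentence or translating it from scratch, can be pictured as a simple threshold on a predicted quality score. The sketch below is only an illustration under stated assumptions: the function name `route_segments`, the idea of a predicted effort score between 0 and 1, and the cut-off of 0.4 are hypothetical and are not taken from the study.

```python
def route_segments(segments, max_predicted_effort=0.4):
    """Split (source, predicted_effort) pairs into post-edit and translate lists.

    predicted_effort is assumed to come from some quality estimation model;
    lower values mean the machine translation is expected to need less editing.
    The default threshold is a hypothetical value chosen for illustration.
    """
    postedit, translate = [], []
    for source, predicted_effort in segments:
        if predicted_effort <= max_predicted_effort:
            postedit.append(source)
        else:
            translate.append(source)
    return postedit, translate


if __name__ == "__main__":
    # Invented sentences and scores, for demonstration only.
    batch = [("Press the red button.", 0.10),
             ("The aforementioned clause notwithstanding.", 0.75)]
    to_postedit, to_translate = route_segments(batch)
    print("post-edit:", to_postedit)
    print("translate from scratch:", to_translate)
```

As the abstract notes, such routing only helps when the quality estimates are accurate; an unreliable score would send sentences to the wrong workflow.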
The Research Group in Computational Linguistics invites applications for TWO 3-year PhD studentships in the area of translation technology. These two PhD studentships are part of a larger university investment, which includes other PhD students and members of staff, with the aim of strengthening the existing research undertaken by members of the group in this area. These funded student bursaries consist of a stipend towards living expenses (£14,500 per year) and remission of fees.
In the middle of November, RGCL welcomed Johanna Monti, an Associate Professor of Modern Languages Teaching at the “L’Orientale” University of Naples. Her research activities are in the field of hybrid approaches to Machine Translation and NLP applications. Whilst Johanna was here, she gave two lectures, on Multi-word Expressions and on Gender Issues in Machine Translation. The lectures were well received and were also attended by the Research Group’s MA students.
TITLE: Parseme-It Corpus: An annotated Corpus of Verbal Multiword Expressions in Italian
ABSTRACT: This talk outlines the development of a new language resource for Italian, namely the PARSEME-It VMWE corpus, annotated with Italian MWEs of a particular class: verbal multiword expressions (VMWE). The PARSEME-It VMWE corpus has been developed by the PARSEME-IT research group in the framework of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (Savary et al., 2017), a joint effort, carried out within a European research network, to elaborate universal terminologies and annotation guidelines for verbal multiword expressions in 18 languages, including Italian. Notably, multiword expressions are difficult lexical constructions for Natural Language Processing (NLP) tools, such as parsers and machine translation engines, to identify, model and treat, mainly because of their non-compositional nature. Verbal multiword expressions are particularly challenging because they have different syntactic structures (prendere una decisione ‘make a decision’, decisioni prese precedentemente ‘decisions made previously’), may be continuous or discontinuous (andare e venire versus andare in malora in Luigi ha fatto andare la società in malora), and may have both a literal and a figurative meaning (abboccare all’amo ‘bite the hook’ or ‘be deceived’). The talk will describe the state of the art in VMWE annotation and identification for the Italian language, the methodology, the Italian VMWE categories taken into account for the annotation task, the corpus, the annotation process and the results.
TITLE: Gender Issues in Machine Translation
ABSTRACT: Machine Translation is one of the most widely used Artificial Intelligence applications on the Internet: it is so widespread in online services of various types that sometimes users do not realize that they are using the results of an automatic translation process. In spite of the remarkable progress achieved in this field over the last twenty years, thanks to the enhanced calculating capacity of computers and advanced technologies in the field of Natural Language Processing (NLP), machine translation systems, even the most widely used ones on the net such as Google Translate, have high error rates. Among the most frequent problems in state-of-the-art MT systems, whether based on linguistic data like Systran, on statistical approaches like Google Translate, or on the recent neural approach, translation of gender still represents a recurrent source of mistranslations: incorrect gender attribution to proforms (personal pronouns, relative pronouns, among others), reproduction of gender stereotypes and overuse of male pronouns are among the most frequent problems in MT.
Deadline: 30th Nov 2017
The Research Group in Computational Linguistics at the University of Wolverhampton invites applications for a 3-year PhD studentship in the area of translation technology. This PhD studentship is part of a university investment which also includes the appointment of a senior lecturer, a research fellow and another PhD student, with the aim of strengthening the existing research undertaken by members of the group in this area. This bursary consists of a stipend towards living expenses (£14,500 per year) and remission of fees.
We invite applications in the area of translation technology defined in the broadest sense possible and ranging from advanced methods in machine translation to other ways of involving technology in the translation process. The proposals should focus on Natural Language Processing techniques for translation memory systems and translation tools in general. Given the current research interests of the group and its focus on computational approaches, we would be interested in topics including but not limited to:
- Enhancing retrieval and matching from translation memories with linguistic information
- The use of deep learning (and in general, statistical) techniques in translation memories
- (Machine) translation of user generated content
- The use of machine translation in cross-lingual applications
- Phraseology and computational treatment of multi-word expressions in machine translation and translation memory systems
- Quality estimation for translation professionals
Other topics will also be considered as long as they align with the interests of the group. The appointed student is expected to work on a project that has a significant computational component. For this reason we expect that the successful candidate will have a good background in computer science and programming.
The application deadline is 30 November 2017 and the interviews will take place in the first half of December by Skype. The starting date of the PhD position is 1st Jan 2018 or as soon as possible after that.
A successful applicant must have:
- A good honours degree or equivalent in Computational Linguistics, Computer Science, Translation studies or Linguistics
- A strong background in programming and in statistics, mathematics or closely related areas
- Experience in Computational Linguistics / Natural Language Processing, including at least some of the following: Statistical Processing, Machine Learning and Deep Learning, and their applications to Natural Language Processing
- Experience with translation technology
- Experience with programming languages such as Python, Java or R.
- If not a native speaker of English, an IELTS certificate with a score of 6.5. If a certificate is not available at the time of application, the successful candidate must be able to obtain it within one month from the offer being made.
Candidates from the UK/EU and from non-EU countries can apply.
Applications must include:
- A curriculum vitae indicating degrees obtained, courses covered, publications, relevant work experience and names of two referees that could be contacted if necessary
- A research statement which outlines the topics of interest. More information about the expected structure of the research statement can be found at https://www.wlv.ac.uk/media/departments/star-office/documents/Guidelines-for-completion-of-Research-Statement.doc
Established by Prof Mitkov in 1998, the research group in Computational Linguistics delivers cutting-edge research in a number of NLP areas.
The results from the latest Research Excellence Framework confirm the research group in Computational Linguistics as one of the top performers in UK research, with its research defined as ‘internationally leading, internationally excellent and internationally recognised’. The research group has recently completed the coordination of the EXPERT project, a successful EC Marie Curie Initial Training Network promoting research, development and use of data-driven technologies in machine translation and translation technology (http://expert-itn.eu).
My time at RGCL — Jeremy Chelala
My name is Jeremy Chelala, a Belgian student from the Université Catholique de Louvain in Belgium, and in the context of my Master's course in NLP, I worked as a trainee at RGCL. Thanks to the Erasmus+ programme, I had the chance to work with the RGCL staff members for nine weeks in the summer of 2017. My fields of interest for this internship were automatic simplification and summarization, with a particular focus on the way we can combine techniques from both fields to improve automatic summary generation. During my time at RGCL, I implemented a sentence compressor, working together with simplification and summarization specialists such as R. Evans and Dr. C. Orasan, who was my supervisor at RGCL. This compression tool represents a substantial first step in the elaboration of a larger summarization system, which I will present in my Master's thesis in 2018.
During my traineeship, I could take advantage of several NLP researchers’ experience and advice to help me develop my program, not to mention technical and logistical support. I was taught to use new tools and techniques to solve specific NLP problems. Furthermore, by being part of the Group, I could participate in several seminars given by experienced researchers, learn about their latest advances and see how a research centre operates in general. I also met a lot of people from all around the world, whom I hope to see again one day.
The evaluation process of my program is still in progress, but I can already tell that my time at RGCL has been beneficial for my project, as I learned a lot from this experience. If results are promising, a paper presenting my compressor might be published.
The area of Natural Language Engineering, and Natural Language Processing in general, is following the trend of many other areas in becoming highly specialised, with a number of application-orientated and narrow-domain topics emerging or growing in importance. These developments, often coinciding with a lack of related literature, necessitate and warrant the publication of specialised volumes focusing on a specific topic of interest to the Natural Language Processing (NLP) research community.
The Journal of Natural Language Engineering (JNLE), which now features six 160-page issues per year and has increased its impact factor for the third consecutive year, invites proposals for special issues on a competitive basis regarding any topics surrounding applied NLP which have emerged as important recent developments and that have attracted the attention of a number of researchers or research groups. In recent years, Calls for Proposals for special issues have resulted in high-quality outputs and this year we look forward to another successful competition.
Last week, the RGCL and SCRG PhD Students presented their research to their peers and staff members from across the University. The posters were well received.
Statistical Cybermetrics Research Group
David Foster: ‘Determining YouTube Video Popularity: Analysing YouTube User Behaviours’
Kuk Aduku: ‘Do Patents Cite Conference Papers as Often as Journal Articles in Engineering? An Investigation of Four Fields’
Research Group in Computational Linguistics
Mohammad Alharbu: ‘Readability Assessment for Arabic as a Second Language’
Najah Albaqawi: ‘Gender Variation in Gulf Pidgin Arabic’
This poster is an attempt to provide a quantitative variationist analysis of variability in GPA morpho-syntax (Arabic definiteness markers, Arabic conjunction markers, object or possessive pronouns, GPA copula, and agreement in the verb phrase and the noun phrase), which aims to discover the potential effect of three factors: speakers’ gender, speakers’ first language, and number of years spent in the Gulf.
Richard Evans: ‘Sentence Rewriting for Language Processing’
This poster provided an overview of the OB1 sentence simplification system. In this approach, the functions of various textual markers of syntactic complexity (conjunctions, relative pronouns, and punctuation marks) are identified and used to inform an iterative rule-based sentence transformation process.
Ahmed Omer: ‘New Techniques For Finding Authorship in Arabic Texts’
The degree of stylistic difference between a pair of documents can be found by any of a number of measures which compare the sets of linguistic features for each document. In general, the technique is used first to find a set of linguistic features and a difference measure which successfully discriminate between texts known to be by either author A or author B. Then texts of unknown authorship are compared against these texts to see whether their writing style is more similar to that of author A or author B.
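The comparison step described above can be sketched in a few lines of Python. This is an illustrative example only, not the poster's method: the feature set (relative frequencies of the most frequent words in the known texts), the Manhattan distance, whitespace tokenisation and the function names are all assumptions. The validation step, checking that the features and measure really discriminate between texts of known authorship, is omitted here.

```python
from collections import Counter


def relative_freqs(text, vocabulary):
    """Relative frequency in the text of each word in the vocabulary."""
    tokens = text.split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in vocabulary]


def manhattan(p, q):
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def attribute(unknown, known_a, known_b, top_n=50):
    """Return ('A' or 'B', distance to A, distance to B) for an unknown text."""
    # Hypothetical feature set: the most frequent words across both known samples.
    pooled = Counter((known_a + " " + known_b).split())
    vocabulary = [w for w, _ in pooled.most_common(top_n)]
    target = relative_freqs(unknown, vocabulary)
    dist_a = manhattan(target, relative_freqs(known_a, vocabulary))
    dist_b = manhattan(target, relative_freqs(known_b, vocabulary))
    return ("A" if dist_a <= dist_b else "B"), dist_a, dist_b


if __name__ == "__main__":
    # Toy strings standing in for real documents of known and unknown authorship.
    label, d_a, d_b = attribute("x x x x y", "x x x y", "y y y x")
    print(label, round(d_a, 3), round(d_b, 3))
```

A smaller distance means a more similar style under this particular measure; other features and measures can be swapped in without changing the overall procedure.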
Omid Rohanian: ‘NLP Approaches to Estimating Text Difficulty’
I am exploring NLP approaches to investigating text difficulty at the level of concepts.
Shiva Taslimpoor: ‘Automatic Extraction and Translation of Multiword Expressions’
We employ state-of-the-art word embedding approaches to automatically identify and translate idiosyncratic Multiword Expressions.