Category Archives: Seminars 2017

RGCL Staff Research Seminar

In November, Dr Constantin Orasan gave a staff research seminar profiling his current and future research: a user study on quality estimation for professional translators.  The talk was well received and there was an interesting debate and questions afterwards.

Title:  Quality estimation for professional translators: a user study

Abstract: 

Postediting of machine translation output has become an important step in the workflows employed by translation companies. The idea behind postediting is that it is possible to improve the productivity of professional translators by asking them to correct the output of machine translation systems rather than translate from scratch. In cases where the quality of the translation is poor, this is not necessarily true. The field of quality estimation could prove useful for deciding which sentences can be postedited and which should be translated from scratch. This talk will report the results of a user study which recorded the productivity of four professional translators when they were asked to postedit and translate sentences in different scenarios.

Our results show that quality estimation information, when accurate, improves post-editing efficiency. The analysis has also raised a number of questions which are worth investigating further.
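To make the decision the abstract refers to more concrete, here is a minimal sketch of how a sentence-level quality estimation score could be used to route sentences either to post-editing or to translation from scratch. The threshold, scores and function names below are hypothetical and are not taken from the study.

# Toy sketch: route sentences to post-editing or from-scratch translation
# using a hypothetical sentence-level quality estimation (QE) score.
# The 0-1 scores and the threshold below are purely illustrative.

POSTEDIT_THRESHOLD = 0.6  # hypothetical cut-off

def route_sentence(qe_score: float) -> str:
    """Decide how a translator should handle one machine-translated sentence."""
    if qe_score >= POSTEDIT_THRESHOLD:
        return "post-edit"            # MT output is likely good enough to correct
    return "translate from scratch"   # correcting poor MT would slow the translator down

sentences = [
    ("Das ist ein Test.", "This is a test.", 0.92),
    ("Der Vertrag tritt morgen in Kraft.", "The contract enters tomorrow in force.", 0.41),
]

for source, mt_output, score in sentences:
    print(f"{source!r} -> {route_sentence(score)}")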

RGCL welcomes Johanna Monti

In the middle of November, RGCL welcomed Johanna Monti, an Associate Professor of Modern Languages Teaching at the “L’Orientale” University of Naples. Her research activities are in the field of hybrid approaches to Machine Translation and NLP applications.  Whilst Johanna was here, she gave two lectures, on Multiword Expressions and on Gender Issues in Machine Translation.  The lectures were well received and were also attended by the Research Group’s MA students.

TITLE: Parseme-It Corpus: An annotated Corpus of Verbal Multiword Expressions in Italian

ABSTRACT:  This talk outlines the development of a new language resource for Italian, namely the PARSEME-It VMWE corpus, annotated with Italian MWEs of a particular class: verbal multiword expressions (VMWEs). The PARSEME-It VMWE corpus has been developed by the PARSEME-IT research group in the framework of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (Savary et al., 2017), a joint effort, carried out within a European research network, to elaborate universal terminologies and annotation guidelines for verbal multiword expressions in 18 languages, Italian among them. Notably, multiword expressions are difficult for Natural Language Processing (NLP) tools, such as parsers and machine translation engines, to identify, model and treat, mainly due to their non-compositionality. Among multiword expressions, verbal ones are particularly challenging because they have different syntactic structures (prendere una decisione ‘make a decision’, decisioni prese precedentemente ‘decisions made previously’), may be continuous or discontinuous (andare e venire ‘come and go’ versus andare in malora ‘go to ruin’ in Luigi ha fatto andare la società in malora), and may have a literal or a figurative meaning (abboccare all’amo ‘bite the hook’ or ‘be deceived’). The talk will describe the state of the art in VMWE annotation and identification for the Italian language, the methodology, the Italian VMWE categories taken into account for the annotation task, the corpus, the annotation process and the results.
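To make the discontinuity problem mentioned above concrete, here is a small sketch of how a discontinuous VMWE might be marked with token indices; the record layout and category label are invented for illustration and are not the actual PARSEME annotation format.

# Sketch: a discontinuous verbal MWE marked by token indices.
# "andare ... in malora" spans non-adjacent tokens, so a tool cannot
# rely on contiguous n-grams to find it.
tokens = ["Luigi", "ha", "fatto", "andare", "la", "società", "in", "malora"]

# Hypothetical annotation record (not the actual PARSEME format):
vmwe = {"category": "verbal idiom (illustrative label)", "token_ids": [3, 6, 7]}

surface = " ".join(tokens[i] for i in vmwe["token_ids"])
print(surface)  # -> "andare in malora", despite the intervening "la società"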


TITLE: Gender Issues in Machine Translation

ABSTRACT:  Machine Translation is one of the most widely used Artificial Intelligence applications on the Internet: it is so widespread in online services of various types that users sometimes do not realize that they are using the results of an automatic translation process. In spite of the remarkable progress achieved in this field over the last twenty years, thanks to the enhanced computing capacity of computers and advanced technologies in the field of Natural Language Processing (NLP), machine translation systems, even the most widely used ones on the net such as Google Translate, still have high error rates. Among the most frequent problems in state-of-the-art MT systems, whether based on linguistic rules like Systran, statistical approaches like Google Translate or the recent neural approach, the translation of gender still represents a recurrent source of mistranslations: incorrect gender attribution to proforms (personal pronouns and relative pronouns, among others), reproduction of gender stereotypes and overuse of male pronouns are among the most frequent problems in MT.

RGCL welcomes Javier Pérez-Guerra

On Wednesday 7 June, RGCL welcomed Javier Pérez-Guerra from the University of Vigo in Spain. Javier is currently a Visiting Researcher at the Department of Linguistics and English Language, Lancaster University, and we were very pleased that he could spare the time to visit and give a talk to our Research Group. The talk was well attended and very well received!

TITLE: Coping with markedness in English syntax: on the ordering of complements and adjuncts

ABSTRACT:

This talk examines the forces that trigger two word-order designs in English: (i) object-verb sentences (*?The teacher the student hit) and (ii) adjunct-complement vs. complement-adjunct constructions (He taught yesterday Maths vs. He taught Maths yesterday). The study focuses both on the diachronic tendencies observed in data from Middle English, Early Modern English and Late Modern English, and on their synchronic design in Present-Day English. The approach is corpus-based (or even corpus-driven) and the data, representing different periods and text types, are taken from a number of corpora (the Penn-Helsinki Parsed Corpus of Middle English, the Penn-Helsinki Parsed Corpus of Early Modern English, the Penn Parsed Corpus of Modern British English and the British National Corpus, among others). The aim of this talk is to look at the consequences that the placement of major constituents (e.g. complements) has for the parsing of the phrases in which they occur. I examine whether the data are in keeping with determinants of word order like complements-first (complement plus adjunct) and end-weight in the periods under investigation. Some statistical analyses will help determine the explanatory power of such determinants.

RGCL Staff Research Seminar

This week Dr Constantin Orasan gave a staff research seminar profiling his current and future research on the Feedback Analysis Tool.  The paper was well received and there was an interesting debate and questions afterwards.

Title:  Presentation of the Feedback Analysis Tool

Abstract: 

The Feedback Analyser is an open-source intelligent tool designed to analyse feedback provided by participants in various activities. The tool relies on a set of modules to analyse the sentiment of unstructured texts, identify recurring themes that occur in them and allow easy comparison between various activities and the users involved in these activities. The tool produces reports fully automatically, but its real strength comes from the fact that it allows an analyst to drill down into the data and identify information that otherwise could not be identified without significant effort. The idea for the tool started from a discussion with the University Outreach team, who wanted to extract changes in feelings and aspirations towards Higher Education by processing hundreds of pieces of free-text student data in a matter of minutes.

This talk will provide an overview of the modules currently incorporated in the system and present the results of a small-scale pilot. The possibility of developing this tool further will be discussed, with the audience invited to give suggestions.
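As a rough, self-contained sketch of the kind of processing such a tool chains together, the snippet below combines crude lexicon-based sentiment scoring with keyword-based theme counting. The word lists, theme names and scoring are invented for illustration and do not reflect the actual Feedback Analyser modules.

# Illustrative sketch only: lexicon-based sentiment plus keyword-based
# theme counting over free-text feedback. All word lists are invented.
from collections import Counter

POSITIVE = {"enjoyed", "great", "helpful", "inspiring", "good"}
NEGATIVE = {"boring", "confusing", "bad", "difficult", "unhelpful"}
THEMES = {
    "aspirations": {"university", "degree", "career", "future"},
    "teaching": {"lecture", "workshop", "session", "tutor"},
}

def analyse(comment):
    """Return a crude sentiment score and the themes touched by one comment."""
    words = {w.strip(".,!?").lower() for w in comment.split()}
    sentiment = len(words & POSITIVE) - len(words & NEGATIVE)
    themes = [name for name, keywords in THEMES.items() if words & keywords]
    return sentiment, themes

feedback = [
    "The workshop was great and made me think about going to university.",
    "The session was confusing and a bit boring.",
]

theme_counts = Counter()
for comment in feedback:
    sentiment, themes = analyse(comment)
    theme_counts.update(themes)
    print(sentiment, themes)   # per-comment view, e.g. 1 ['aspirations', 'teaching']

print(theme_counts)            # aggregate view across all comments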

RGCL Welcomes Lut Colman

Last week Lut Colman visited RGCL from the Instituut voor de Nederlandse Taal, Leiden (INT).

The main objective of Lut’s visit was to gain a deeper understanding of Corpus Pattern Analysis (CPA), a corpus-driven technique developed by Prof. Hanks and implemented in the Pattern Dictionary of English Verbs (PDEV), and to test the lexicographic tools used for PDEV in order to establish whether or not they are suitable for her Dutch pilot project.  Whilst Lut was here, she gave a talk on her upcoming research project.

Title: Dutch Verb Patterns Online: A Collocation and Pattern Dictionary of Dutch Verbs

Abstract:

Dutch Verb Patterns Online is a project to be developed at the Dutch Language Institute (INT) in Leiden. A pilot will consist of a collocation and pattern dictionary of a selection of verbs for advanced learners of Dutch as a second language. For that purpose, the institute will form a consortium with two partners who have expertise in developing e-learning material for language learners.

The aim of the project is a database and web application with information sections on verbs for language learners:

1) collocations: semi-fixed lexical combinations and fixed grammatical collocations that need not be defined, such as een fout {maken, begaan} (make a mistake), vertrouwen op (rely on), etc.

2) idioms: expressions that have to be defined because the meaning is opaque, such as de strijdbijl begraven (bury the hatchet)

3) GDEX-examples. GDEX stands for good dictionary examples: short, representative and illustrative example sentences from a corpus

4) verb patterns: semantically motivated pieces of phraseology in which the valency slots of the verb are occupied by arguments of a particular semantic type (e.g. human, location). Semantic types are realized by lexical sets: lists of words and phrases that occur as collocates. Each pattern corresponds to a meaning. Patterns are identified by means of Corpus Pattern Analysis (CPA), a lexicographical technique used by Patrick Hanks in the Pattern Dictionary of English Verbs, PDEV (http://pdev.org.uk/), and based on his Theory of Norms and Exploitations (Hanks 2013).

The Dutch project aims to combine a pattern dictionary with a collocation application like Sketch Engine for Language Learners (SkELL) (Baisa & Suchomel, n.d.). A SkELL for Dutch can be developed before we get started with the more labour-intensive pattern descriptions. Eventually, both functionalities can be merged and included as a plug-in resource in the language material for second language learners. Students will not only have access to patterns or collocation lists separately, but will also be able to see which collocations fill a semantic-type slot in a pattern.
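As an illustration of the kind of entry described in point 4 above, here is a rough sketch of how a single verb pattern, with its semantic types and lexical sets, might be represented in a database. The field names and the Dutch example are my own illustration, not the project's actual schema.

# Hypothetical sketch of one CPA-style verb-pattern entry: valency slots
# filled by semantic types, each type backed by a lexical set of collocates.
# Field names and the example pattern are illustrative, not the project's schema.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class VerbPattern:
    verb: str
    pattern: str                              # schematic pattern with semantic types
    meaning: str                              # the sense tied to this pattern
    lexical_sets: Dict[str, List[str]] = field(default_factory=dict)

nemen_decision = VerbPattern(
    verb="nemen",
    pattern="[[Human]] neemt [[Decision]]",
    meaning="to make a decision",
    lexical_sets={
        "Human": ["de minister", "het bestuur"],
        "Decision": ["een besluit", "een beslissing"],
    },
)

print(nemen_decision.pattern, "->", nemen_decision.meaning)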

References

Baisa, V., & Suchomel, V. (n.d.). SkELL: Web Interface for English Language Learning.

Hanks, P. (2013). Lexical Analysis: Norms and Exploitations. MIT Press.