Category Archives: A day in the life of…

PhD Research Diaries: Dr. Maria Kunilovskaya

Welcome to the first blog of our new series of PhD Research Diaries – as we all work at home during the pandemic, the opportunity for that informal research chat has diminished, if not disappeared. Although perhaps a poor substitute, let us start the research chat with a post from one of our full time PhD students: Dr. Maria Kunilovskaya.

I am engaged in a time- and effort-consuming, but insightful and potentially useful experiment in human translation quality annotation.

(It is a pity I cannot discuss it over lunch! No technical progress in Zoom, MS teams or Skype will be able to compensate for the opportunity to ruminate the research ideas over lunch and for the motivations of the office life. Let’s see whether blogging can be helpful.)

One of the problems I faced in my attempts to build a quality estimation system for human translation is the lack of a reliable and available gold standard. 

Which quality labels are most useful in a machine learning setting? But more importantly how do you get them?  

1. Previous attempts

I started my experiments with implementing two of the possible solutions:

1.1 ‘Good’ vs. ‘Bad’. I used graded exam translations produced by professional translation programmes in several universities as well as data from a number of translation contests to build a quality annotated dataset, which featured binary labels (‘good’ vs. ‘bad’). We assumed that a binary classification into sharply opposed classes can be an easy task for a reasonably sophisticated learning setup. The improved dataset counted more than 400 texts of about 400 words each. Our best results on the proven translationese features reached accuracy of 67% (with the chance level as 55%). See details in Kunilovskaya, M., & Lapshinova-Koltunski, E. (2019). Translationese Features as Indicators of Quality in English-Russian Human Translation. In Proceedings of the 2nd Workshop on Human-Informed Translation and Interpreting Technology (HiT-IT 2019) (pp. 47–56).

I also tried other approaches to text representation, ranging from tf-idf scaled character trigrams (and other bag-of-word representations) to get the accuracy of good-bad classification of 68% and bilingual vectors learnt on lemmatised corpora + Siamese architecture of bidirectional LSTM to feed the averaged vectors for the source and target texts. The latter experiment yielded the accuracy of 64%

Those efforts indicated that translation quality is either extremely difficult to learn or that I might have a problem with the labels in the dataset or the binary setup in general. The latter option sprang to mind after I verified the labels by re-annotating 20% of the data and removing a few texts that caused disagreements.

1.2 Professional vs Student. Another way to approach the lack of reliable labels is to utilise the natural classes of texts, which can be assumed to reflect their quality (aka distant supervision). A typical case of such classes is (well-published) professional and learner translations. On the same set of morphosyntactic indicator of translationese we got the accuracy of 78% (see details in Kunilovskaya, M., & Lapshinova-Koltunski, E. (2020). Lexicogrammatic Translationese across Two Targets and Competence Levels. Proceedings Ofthe 12th Conference on Language Resources and Evaluation (LREC 2020), 4102–4112. http://ufal.mff.cuni.cz/udpipe/models#). While this is an impressive achievement, given our trials with good-bad opposition, we need to accept that professionalism is an adequate proxy for translation quality.

2. Towards a continuous sentence-level quality score 

The current annotation experiment aims to produce human judgments about translation quality in the Direct Assessment (DA) setting popular today in the field of machine translation (MT). It is going to be compared with the existing score based on error annotation to try and offer a triangulation of the human quality estimates.

(1) Annotator teams. The experiments involves 12 volunteers who are final year linguistics degree BA students in a Russian university. All participants have Russian L1 and English at B2 level; they are evaluating translations from English into Russian. They are assigned to two teams. Each team includes a group of three translators, i.e. students who have attended theoretical and practical courses in translation studies and three linguists, i.e. students who majored in English teaching or contrastive linguistics. This profiles are supposed to give insights into the impact of (almost complete) translation education in the task of evaluating translation quality.

(2) Data. We use error-annotated subset of Russian Learner Translator Corpus (www.rus-ltc.org/search) limited to general domain mass-media texts. Error annotation is used to produce a quality score to be used to cross-validate the annotations in the current experiment. Getting the quality score from error annotations is again a matter of subjectively assigned weights, unfortunately. We decided to go with the following procedure: we calculate a translation’s negative score for language and content errors assigning the following weights to the attributes: ‘minor’: 10, ‘major’: 20, ‘critical’: 30, ‘kudo’: -10. To get the quality score for a text we subtract this score from a 100. To get one quality score for a target text we average the two scores for fluency and accuracy based on the number and weights for language and content errors. 

For annotation we offer texts that come from the extreme good-bad categories to provide sharp contrasts and clear quality distinctions in the data. 

(3) Results of the calibration session. The experiments started in early October with a series of enlightening calibration sessions. Inspired by the reasoning in Graham et al (2013, 2017) and Deams et al (2013), where they suggest that fluency should be judged independently of adequacy/accuracy, we arranged for the annotators to evaluate translations in two conditions: as a text in the target language, independent of its source (fluency aspect of translation quality, ‘) and in the bilingual setting where the annotators are asked to compare the source segment and the target segment.

However, the results of the calibration session indicated that the participants struggle with separating the two aspects. In particular, even in the second round of annotation following an extended session analysing the oddities in the scores assigned in the first round, the inter-rater agreement was very low and the average annotated scores were miles away from the ones calculated from errors annotated by a translation examiner in the same text. The discrepancies were particularly noticeable in the accuracy setting. 

This made me reconsider the experiment and respect the arguments in Callison-Burch et al. (2007) and Guzmán et al. (2019) who argue for syncretic quality annotation.

The third round of calibration in the setting which does not single out traditional quality aspects (fluency, accuracy) but asks the annotator to “Use the slider to indicate how much you agree that the text in bold is an adequate translation of the original English sentence, given the context.”, returns reasonable inter-rater agreement and the (expected) better match with the error-based scores, which have the role of gold standard in this annotation experiment.

Interestingly, based on the limited data from the calibration sessions, linguists tent to be more critical in their estimates, there is also less agreement between the raters with no prior translation education. On the other hand, translates tend to agree more and are more forgiving towards the work of their fellows in the profession. 

(4) Task. The experiment is set up on the QuestionPro platform which importantly offers a slider question setup and unlimited number of surveys and participants for free, which suits our needs. While the annotation takes place on a sentence level, the task page contains a complete text and the annotators are specifically asked to pay attention to cohesion and textuality issues in translations. 

After the lessons learnt from several calibration sessions, we issued the first batch of 20 text pairs (461 sentence pairs) for annotation. The task is now formulated as “Read the source text. Use the slider to indicate how much you agree that the text in bold is an adequate translation of the original English sentence, given the context.”

We use a 100-point slider in accordance with the DA method of benchmarking translation quality in MT. It is supposed to alleviate the pressure to choose between 5 or 7 discreet bins in a traditional Likert-scale setup. We also repeatedly urged the annotators to pay attention to the textual aspects of quality that can only be noticed in context, but have to be annotated on the sentence level (see Voita et al. 2019).

Naturally, I would be happy to get any feedback or ideas how to adjust the experiment setting to return the most reliable results.

Thanks for listening, anyway 🙂

Post written by PhD student Dr. Maria Kunilovskaya

Teaching and Learning in a Pandemic

First face to face interaction for a very long time.

Last Thursday, as the country went into the second lockdown, we conducted our first face to face practical session. It was very emotional (at least for me). I could not only feel, but also see the massive amount of efforts that has been spent, so that such a simple thing can happen. Students had to travel from their home countries, facing risks and quarantines. Facilities have to make the classroom covid secured, with individual tables rightly spaced. The university is doing their best to keep itself open. All that efforts, so that us (lecturers and students) can have those two hours of valuable face to face teaching and learning. The session was very productive. We will do it again this week, but will improve it further by inviting students who cannot present physically to join in virtually. Let see how it goes. I bet that there will be problems, but none would prove too big to overcome.

Dr Le An Ha

“When the lockdown was announced, many of us thought that face to face teaching will be postponed for some additional time. Luckily, the UoW is doing their best, following a lot of covid protocols, to keep the facilities a safe place and last Thursday we had our first practical session on campus. The session was very productive, we went through the material we studied in class but also learnt some (very handy) bonus lines of codes! Also, it was nice to meet in person everyone who made it to the UK.”

Nikola Spasovski

“It was a lovely practical session! I was happy to meet Dr. Ha and the fellow students. During the class Dr. Ha shared his experience in programming and was trying to get us think a programming language. I appreciate the effort of the EMTTI Team and the University of Wolverhampton to provide this opportunity of offline classes for us given the current state of affairs! Many thanks.”

 Katerina Poltorak

Jeremy Chelala reflects on his Summer Internship at RGCL

My time at RGCL — Jeremy Chelala

My name is Jeremy Chelala, a Belgian student from the Université Catholique de Louvain in Belgium, and in the context of my Master course in NLP, I worked as a trainee at RGCL. Thanks to the Erasmus+ programme, I had the chance to work with the RGCL staff members for nine weeks in the summer of 2017. My fields of interest for this internship were automatic simplification and summarization, with a particular focus on the way we can combine techniques from both fields to improve automatic summary generation. During my time at RGCL, I implemented a sentence compressor, working together with simplification and summarization specialists as R. Evans and Dr. C. Orasan, who was my supervisor at RGCL. This compression tool represents a substantial first step in the elaboration of a larger summarization system, which I will present in my Master thesis in 2018.

During my traineeship, I could take advantage of several NLP researchers’ experience and advice to help me develop my program, not to mention technical and logistical support. I was taught to use new tools and techniques to solve specific NLP problems. Furthermore, by being part of the Group, I could participate to several seminars given by experienced researchers, learn about their latest advances and see how a research centre operates in general. I also met a lot of people from all around the world, whom I hope to see again one day.

The evaluation process of my program is still in progress, but I can already tell that my time at RGCL has been beneficial for my project, as I learned a lot from this experience. If results are promising, a paper presenting my compressor might be published.

Jeremy Chelala

A day in the life of…Le An Ha

Le An HaMy name is Le An Ha. I am a senior lecturer at the RGCL. I came to Wolverhampton in 2000 from Vietnam to pursue a PhD. Since finishing my PhD, I have been working at the research group until now (2015). My son was born here. So if you ask me whether the RGCL is a good place to work, or whether Wolverhampton is a good place to live, you have your answers already. Continue reading

A day in the life of…Najah Albaqawi

DeskAlsalam Alykom, my name is Najah Albaqawi. I am a Saudi PhD student in computational linguistics. I earned my Bachelor’s Degree in English Language in 2006 from King Faisal University. Since graduating, I have worked for three years as an English assistant teacher and in 2012 I obtained my master’s degree in Applied Linguistics from the University of Al-Emam Mohammed Ibn Saud. These experiences have made me really want to work with new people in new places, which is why I chose to pursue a doctoral degree in the UK. Fortunately for me, the University of Wolverhampton is one of the best in the UK for my subject area. Continue reading

A day in the life of…Richard Evans

Richard Evans

Hello, my name is Richard Evans. I’m employed as a research fellow and am currently undertaking a part-time PhD at the University of Wolverhampton. I first joined the research group in 1998 having obtained BA (Hons) Linguistics from the University of Wales (Bangor) and an MSc in Cognitive Science and Natural Language at the University of Edinburgh. I love the challenges of computational linguistics and natural language processing, and the creativity that those challenges inspire. Continue reading