Search Solutions is an annual event run by the Information Retrieval Specialist Group, the section of the British Computer Society which has a special interest in search engines. This year it took place on Wednesday 24th November, and was held online for invited speakers from industry to talk about their work in information retrieval. The British Computer Society has new offices at 25 Copthall Avenue in London, near the Bank of England. On the day before, a series of tuorials designed to introduce people to related topics were held, such as one given by Ingo Frommholz from our own computing department on search engine evaluation.
The tutorial on Natural Language Processing was given by myself. Unlike the others, it was an all-day event, and held face-to-face. After having had experience of online teaching during the pandemic, I know that I prefer the closer interaction with the students which comes with face-to-face teaching.
The contents of the tutorial were almost the same as the first three weeks of lectures that I give on the MA Computational Linguistics module in RIILP. I used the structure of the textbook by Jurafsky and Martin as a skeleton, but brought in other things such as the practical exercises from the Edinburgh Textbooks in Empirical Linguistics on stemming and automatic part of speech tagging. Stemming covers techniques for regarding different grammatical forms of a word as being related to each other, and part-of-speech tagging is assigning a part-of-speech category (such as noun or verb) to each word in the input sentence. I used the first edition of Jurafsky and Martin to open the discussion with a short dialogue between Dave the astronaut and HAL the computer from the film “2001 – A Space Odyssey”. What natural language techniques would HAL need to know to carry out this conversation?
At the event, I was pleased to see some old friends in the audience, including Ingo in the morning, before his own workshop began.
Welcome to the first blog of our new series of PhD Research Diaries – as we all work at home during the pandemic, the opportunity for that informal research chat has diminished, if not disappeared. Although perhaps a poor substitute, let us start the research chat with a post from one of our full time PhD students: Dr. Maria Kunilovskaya.
I am engaged in a time- and effort-consuming, but insightful and potentially useful experiment in human translation quality annotation.
(It is a pity I cannot discuss it over lunch! No technical progress in Zoom, MS teams or Skype will be able to compensate for the opportunity to ruminate the research ideas over lunch and for the motivations of the office life. Let’s see whether blogging can be helpful.)
One of the problems I faced in my attempts to build a quality estimation system for human translation is the lack of a reliable and available gold standard.
Which quality labels are most useful in a machine learning setting? But more importantly how do you get them?
1. Previous attempts
I started my experiments with implementing two of the possible solutions:
1.1 ‘Good’ vs. ‘Bad’. I used graded exam translations produced by professional translation programmes in several universities as well as data from a number of translation contests to build a quality annotated dataset, which featured binary labels (‘good’ vs. ‘bad’). We assumed that a binary classification into sharply opposed classes can be an easy task for a reasonably sophisticated learning setup. The improved dataset counted more than 400 texts of about 400 words each. Our best results on the proven translationese features reached accuracy of 67% (with the chance level as 55%). See details in Kunilovskaya, M., & Lapshinova-Koltunski, E. (2019). Translationese Features as Indicators of Quality in English-Russian Human Translation. In Proceedings of the 2nd Workshop on Human-Informed Translation and Interpreting Technology (HiT-IT 2019) (pp. 47–56).
I also tried other approaches to text representation, ranging from tf-idf scaled character trigrams (and other bag-of-word representations) to get the accuracy of good-bad classification of 68% and bilingual vectors learnt on lemmatised corpora + Siamese architecture of bidirectional LSTM to feed the averaged vectors for the source and target texts. The latter experiment yielded the accuracy of 64%
Those efforts indicated that translation quality is either extremely difficult to learn or that I might have a problem with the labels in the dataset or the binary setup in general. The latter option sprang to mind after I verified the labels by re-annotating 20% of the data and removing a few texts that caused disagreements.
1.2 Professional vs Student. Another way to approach the lack of reliable labels is to utilise the natural classes of texts, which can be assumed to reflect their quality (aka distant supervision). A typical case of such classes is (well-published) professional and learner translations. On the same set of morphosyntactic indicators of translationese we got the accuracy of 78% (see details in Kunilovskaya, M., & Lapshinova-Koltunski, E. (2020). Lexicogrammatic Translationese across Two Targets and Competence Levels. Proceedings Of the 12th Conference on Language Resources and Evaluation (LREC 2020), 4102–4112). While this is an impressive achievement, given our trials with good-bad labels, we need to accept that professionalism is an adequate proxy for translation quality and that the student and professional collections do not differ much in terms of register. By assuming that professionalism is quality in this setting we effectively ignore the many factors associated with text production for students and professional (levels of responsibility, time constraints, levels of stress, extralinguistic motivations and situational conventions).
2. Towards a continuous sentence-level quality score
The current annotation experiment aims to produce human judgments about translation quality in the Direct Assessment (DA) setting popular today in the field of machine translation (MT). It is going to be compared with the existing score based on error annotation to try and offer a triangulation of the human quality estimates.
(1) Annotator teams. The experiments involves 12 volunteers who are final year linguistics degree BA students in a Russian university. All participants have Russian L1 and English at B2 level; they are evaluating translations from English into Russian. They are assigned to two teams. Each team includes a group of three translators, i.e. students who have attended theoretical and practical courses in translation studies and three linguists, i.e. students who major in English teaching or contrastive linguistics. This profiles are supposed to give insights into the impact of (almost complete) translation education in the task of evaluating translation quality.
(2) Data. The student translations come from the error-annotated subset of Russian Learner Translator Corpus (www.rus-ltc.org) limited to general domain mass-media texts. Error annotation is used to produce a quality score to cross-validate the annotations in the current experiment. Getting the quality score from error annotations is again a matter of subjectively assigned weights, unfortunately. We decided to go with the following procedure: we calculate a translation’s negative score for language and content errors assigning the following weights to the attributes: ‘minor’: 10, ‘major’: 20, ‘critical’: 30, ‘kudo’: -10. To get the quality score for a text we subtract this score from a 100. To get one quality score for a target text we average the two scores for fluency and accuracy based on the number and weights for language and content errors.
For annotation we offer texts that come from the extreme good-bad categories to provide sharp contrasts and clear quality distinctions in the data.
(3) Results of the calibration session. The experiments started in early October with a series of enlightening calibration sessions. Inspired by the reasoning in Graham et al (2013, 2017) and Deams et al (2013), where they suggest that fluency should be judged independently of adequacy/accuracy, we arranged for the annotators to evaluate translations in two conditions: as a text in the target language, independent of its source (fluency aspect of translation quality) and in the bilingual setting where the annotators are asked to compare the source segment and the target segment.
However, the results of the calibration session indicated that the participants struggle with separating the two aspects. In particular, even in the second round of annotation following an extended session analysing the oddities in the scores assigned in the first round, the inter-rater agreement was very low and the average annotated scores were miles away from the ones calculated from errors, annotated by a translation examiner in the same text. The discrepancies were particularly noticeable in the accuracy setting.
This made me reconsider the experiment and respect the arguments in Callison-Burch et al. (2007) and Guzmán et al. (2019) who argue for syncretic quality annotation.
The third round of calibration in the setting which does not single out traditional quality aspects (fluency, accuracy) but asks the annotator to indicate whether a Russian segment is an adequate translation of the English segment, given the context, returns reasonable inter-rater agreement and the (expected) better match with the error-based scores, which have the role of gold standard in this annotation experiment.
Interestingly, based on the limited data from the calibration sessions, linguists tent to be more critical in their estimates, there is also less agreement between the raters with no prior translation education. On the other hand, translates tend to agree more and are more forgiving towards the work of their fellows in the profession.
(4)Task. The experiment is set up on the QuestionPro platform which importantly offers a slider question and unlimited number of surveys and participants for free, which suits our needs. While the annotation takes place on a sentence level, the task page contains a complete text to provide access to cohesion and textuality issues in translations.
After the lessons learnt from several calibration sessions, we issued the first batch of 40 text pairs (922 sentence pairs) for annotation. The task is now formulated as “Read the source text. Use the slider to indicate how much you agree that the text in bold is an adequate translation of the original English sentence, given the context.”
We use a 100-point slider in accordance with the DA method of benchmarking translation quality in MT. It is supposed to alleviate the pressure to choose between 5 or 7 discreet bins in a traditional Likert scale setup. We also repeatedly urged the annotators to pay attention to the textual aspects of quality that can only be noticed in context, but have to be annotated on the sentence level (see Voita et al. 2019).
Naturally, I would be happy to get any feedback or ideas how to adjust the experiment setting to return the most reliable results.
Thanks for listening, anyway 🙂
Dr. Maria Kunilovskaya
Research Group in Computational Linguistics
University of Wolverhampton, UK
Post written by PhD student Dr. Maria Kunilovskaya
First face to face
interaction for a very long time.
Last Thursday, as the
country went into the second lockdown, we conducted our first face to face
practical session. It was very emotional (at least for me). I could not only
feel, but also see the massive amount of efforts that has been spent, so that
such a simple thing can happen. Students had to travel from their home
countries, facing risks and quarantines. Facilities have to make the classroom
covid secured, with individual tables rightly spaced. The university is doing
their best to keep itself open. All that efforts, so that us (lecturers and
students) can have those two hours of valuable face to face teaching and
learning. The session was very productive. We will do it again this week,
but will improve it further by inviting students who cannot present physically
to join in virtually. Let see how it goes. I bet that there will be problems,
but none would prove too big to overcome.
Dr Le An Ha
“When the lockdown was
announced, many of us thought that face to face teaching will be postponed for
some additional time. Luckily, the UoW is doing their best, following a lot of
covid protocols, to keep the facilities a safe place and last Thursday we had
our first practical session on campus. The session was very productive, we went
through the material we studied in class but also learnt some (very handy)
bonus lines of codes! Also, it was nice to meet in person everyone who made it
to the UK.”
“It was a lovely practical session! I was happy
to meet Dr. Ha and the fellow students. During the class Dr. Ha shared his
experience in programming and was trying to get us think a programming
language. I appreciate the effort of the EMTTI Team and the University of
Wolverhampton to provide this opportunity of offline classes for us given the
current state of affairs! Many thanks.”
My name is Jeremy Chelala, a Belgian student from the Université Catholique de Louvain in Belgium, and in the context of my Master course in NLP, I worked as a trainee at RGCL. Thanks to the Erasmus+ programme, I had the chance to work with the RGCL staff members for nine weeks in the summer of 2017. My fields of interest for this internship were automatic simplification and summarization, with a particular focus on the way we can combine techniques from both fields to improve automatic summary generation. During my time at RGCL, I implemented a sentence compressor, working together with simplification and summarization specialists as R. Evans and Dr. C. Orasan, who was my supervisor at RGCL. This compression tool represents a substantial first step in the elaboration of a larger summarization system, which I will present in my Master thesis in 2018.
During my traineeship, I could take advantage of several NLP researchers’ experience and advice to help me develop my program, not to mention technical and logistical support. I was taught to use new tools and techniques to solve specific NLP problems. Furthermore, by being part of the Group, I could participate to several seminars given by experienced researchers, learn about their latest advances and see how a research centre operates in general. I also met a lot of people from all around the world, whom I hope to see again one day.
The evaluation process of my program is still in progress, but I can already tell that my time at RGCL has been beneficial for my project, as I learned a lot from this experience. If results are promising, a paper presenting my compressor might be published.
My name is Le An Ha. I am a senior lecturer at the RGCL. I came to Wolverhampton in 2000 from Vietnam to pursue a PhD. Since finishing my PhD, I have been working at the research group until now (2015). My son was born here. So if you ask me whether the RGCL is a good place to work, or whether Wolverhampton is a good place to live, you have your answers already. Continue reading →
Alsalam Alykom, my name is Najah Albaqawi. I am a Saudi PhD student in computational linguistics. I earned my Bachelor’s Degree in English Language in 2006 from King Faisal University. Since graduating, I have worked for three years as an English assistant teacher and in 2012 I obtained my master’s degree in Applied Linguistics from the University of Al-Emam Mohammed Ibn Saud. These experiences have made me really want to work with new people in new places, which is why I chose to pursue a doctoral degree in the UK. Fortunately for me, the University of Wolverhampton is one of the best in the UK for my subject area. Continue reading →