This week we have had the pleasure of welcoming Dr Sheila Castilho and Dr Natalia Resende for a one-week research stay at the Research Group in Computational Linguistics. Sheila and Natalia both come from the ADAPT Centre, Dublin, and have come to discuss collaborations with members of our research group. During their stay, both Natalia and Sheila gave the group a talk about their research. Details of the talks can be found below:
Speaker: Dr Sheila Castilho
Date of talk: 19th November 2018
Title: Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Machine Translation
Abstract: We reassess a recent study (Hassan et al., 2018) that claimed that machine translation (MT) has reached human parity for the translation of news from Chinese into English, using pairwise ranking and considering three variables that were not taken into account in that previous study: the language in which the source side of the test set was originally written, the translation proficiency of the evaluators, and the provision of inter-sentential context. If we consider only original source text (i.e. not translated from another language, or translationese), then we find evidence showing that human parity has not been achieved. We compare the judgments of professional translators against those of non-experts and discover that those of the experts result in higher inter-annotator agreement and better discrimination between human and machine translations. In addition, we analyse the human translations of the test set and identify important translation issues. Finally, based on these findings, we provide a set of recommendations for future human evaluations of MT.
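The inter-annotator agreement mentioned in the abstract can be illustrated with a small sketch. The abstract does not say which agreement statistic was used, so Cohen's kappa is an assumption here, and the function name, labels and judgements below are hypothetical examples rather than data from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical pairwise-ranking judgements: which output was preferred
# for each sentence (human translation, machine translation, or a tie).
annotator_1 = ["HT", "HT", "MT", "tie", "HT", "MT"]
annotator_2 = ["HT", "MT", "MT", "tie", "HT", "HT"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Higher values indicate agreement further above chance, which is the sense in which the experts' judgments are reported to agree more than those of the non-experts.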
Speaker: Dr Natalia Resende
Date of talk: 21st November 2018
Title: Classifying nouns in Portuguese into gender categories: a deep learning approach
Abstract: In Portuguese, all nouns are distributed into two gender categories: feminine and masculine. On the one hand, gender can be predicted from the phonological cues present in the endings of the nouns. For example, nouns ending in -a tend to be feminine and nouns ending in -o tend to be masculine. On the other hand, the relationship between word ending and gender is far from being a consistent rule, since nouns ending in other phonemes may be of either gender. In the present study, a connectionist network was trained to classify Portuguese nouns into gender categories considering their phonological structure as a whole. The performance of the network was analysed in detail to check whether the network considers only the endings of the nouns or their whole phonological structure for gender decisions. In addition, we analysed what type of information the network takes into account to decide the gender of nouns whose endings are not predictive of gender. Results show an error-free performance when the network takes into account the phonological information present in the endings of the nouns, and frequency effects for non-predictive endings. The present study has implications for the training of NLP systems that classify nouns into gender categories.
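The ending-based tendency described in the abstract can be sketched as a simple baseline. This is not the connectionist network from the study: it is a hypothetical rule that looks only at the final letter, using spelling as a rough stand-in for phonological endings, with a small hand-picked word list.

```python
def guess_gender_from_ending(noun):
    """Guess gender from the final letter; None if the ending is not predictive."""
    word = noun.strip().lower()
    if word.endswith("a"):
        return "feminine"
    if word.endswith("o"):
        return "masculine"
    return None


# Hand-picked illustrative nouns with their actual gender.
sample = {
    "casa": "feminine",
    "livro": "masculine",
    "ponte": "feminine",
    "leite": "masculine",
    "dia": "masculine",  # ends in -a but is masculine
}

guesses = {noun: guess_gender_from_ending(noun) for noun in sample}
# Nouns for which the rule makes a guess at all.
covered = [noun for noun, guess in guesses.items() if guess is not None]
correct = [noun for noun in covered if guesses[noun] == sample[noun]]
print(f"covered {len(covered)} of {len(sample)}, correct {len(correct)}")
```

The rule makes no guess for "ponte" and "leite" and gets "dia" wrong, which illustrates why the ending alone is a tendency rather than a consistent rule.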
In August, Professor Alexander Gelbukh began a 12-month sabbatical at RGCL. As part of his visit, Prof. Gelbukh presented a research seminar to the group on ‘Opinion Mining and Sentiment Analysis’. During his time here, he has held many meetings with members of the group to discuss both his research and future opportunities for collaboration.
Last week, Antonio Pascucci, a visiting Ph.D. Industrial Student in Computational Linguistics for Authorship and Gender Attribution in Italian social media texts from the University of Naples – L’Orientale, gave a Research Seminar to the group.
Title: ‘Computational Stylometry for Authorship Attribution in social media texts’
Computational Stylometry (CS) is the study of stylistic features (linguistic choices). Writing style is a combination of decisions in language production. Through statistical analysis of these decisions, we can infer the author's identity and many other characteristics of the author. Writing style is, in fact, unique to an individual, which is why we talk about authorial DNA.
CS for authorship attribution is the topic of my research project, which aims to apply it to social media texts. During the seminar, I will present the research project and my first steps in gender attribution, along with research on cyberbullying detection conducted using software made available by Expert System Corp.
Pablo Calleja, from the Ontology Engineering Group at the Universidad Politécnica de Madrid, Spain, is currently completing an internship with RGCL as part of his PhD. Yesterday, Pablo gave a talk to the group about his research.
Title: Role-based Named Entity Recognition over unstructured texts
Named Entity Recognition (NER) poses new challenges in real-world documents, in which entities have different roles according to their purpose or meaning. Retrieving all possible entities in scenarios where only a subset, selected by role, is needed introduces noise and lowers overall precision.
The talk will present a Role-based NER task that relies on role classification hierarchy models that support recognizing entities with a specific role. The proposed task has been implemented in two use cases: one in the biomedical domain, using Spanish drug Summaries of Product Characteristics, and the other in the legal domain, using multilingual and heterogeneous emails from the Panama Papers investigation.
Last week, we were visited by Dr Chun Chang and Zhao Jie from the Institute of Scientific and Technical Information of China. Dr Chang and Zhao Jie spent the day in meetings with members of RGCL discussing future collaborations, but in between meetings, Dr Chang found the time to give a talk to the group. The details can be found below:
Title: The Construction and Application of Chinese Thesaurus in China
Dr. Chun Chang, Professor, The Institute of Scientific and Technical Information of China. Dr. Chang has long been engaged in the construction and application of the knowledge organization system.
Focusing on the construction and application of the Chinese Thesaurus, the lecture will discuss three main aspects: the definition of a thesaurus and basic information on constructing thesauri in China; the history and current situation of constructing the Chinese Thesaurus, and the application of Computational Linguistics in the compilation process; and the current and prospective application of the Chinese Thesaurus in information retrieval.
On Monday, Antoni Oliver González, from the Universitat Oberta de Catalunya (UOC) in Barcelona, arrived at RGCL for a two-week stay to form research collaborations with members of the group. On Thursday, Antoni gave the following talk to the group:
Title: Automatic detection of translation equivalents of terms in large parallel and comparable corpora
Abstract: In this talk, some methodologies for finding the translation equivalents of a term in large parallel and comparable corpora will be presented. For parallel corpora, we are using translation tables from Statistical Machine Translation systems (Moses). For comparable corpora, we are experimenting with vecmap, a tool to create cross-lingual word embedding mappings. The experiments will be carried out using the IATE database for English for two subjects: International Relations and International Organizations. The goal is to enlarge the Spanish IATE database and to create this database for Catalan.
These experiments are being performed during a short research stay and we will only be able to present preliminary results.
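Once embeddings for two languages have been mapped into a shared space, one common way to propose translation equivalents is to rank target-language terms by cosine similarity to the source term. The sketch below shows only that retrieval step, under stated assumptions: it does not call vecmap, the function names are hypothetical, and the three-dimensional vectors are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(source_vec, target_vecs, top_k=3):
    """Return the top_k (term, similarity) pairs, most similar first."""
    scored = [(term, cosine(source_vec, vec)) for term, vec in target_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up vectors standing in for embeddings already mapped to a shared space.
english = {"treaty": [0.9, 0.1, 0.2]}
spanish = {
    "tratado": [0.85, 0.15, 0.25],
    "organización": [0.1, 0.9, 0.3],
    "frontera": [0.3, 0.2, 0.9],
}

ranking = rank_candidates(english["treaty"], spanish)
best = ranking[0][0]
print(best)
```

In practice the candidate list would be filtered further, for example by keeping only terms above a similarity threshold, before being proposed as new terminology entries.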