Datasets annotated with signs of syntactic complexity

In our paper:

Evans, R., & Orasan, C. (2013). Annotating signs of syntactic complexity to support sentence simplification. In I. Habernal & V. Matousek (Eds.), Text, Speech and Dialogue. Proceedings of the 16th International Conference TSD 2013. Plzen, Czech Republic: Springer. pp. 92 – 104

we present the annotation of a dataset that is used by our syntactic simplification method to identify places where rewriting rules have to be applied in order to produce simpler sentences. 

The datasets are available in XML format as three independent files, each representing a different genre.

Each file contains a list of sentences annotated using the following format:

<S ID="2"><SIGN ID="2" CLASS="SSEV">That</SIGN> is 
<SIGN ID="3" CLASS="HELP">,</SIGN> a high-fibre diet, 
fluid <SIGN ID="4" CLASS="CLN">,</SIGN> etc.</S>

Sentences are marked with the S tag and signs with the SIGN tag, as in the example above. The type of sign is encoded by the CLASS attribute. The sentences were annotated in isolation, so the files do not contain coherent texts, but sequences of sentences extracted from different files.
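As a minimal sketch of how this format could be read, the Python snippet below parses the example sentence with the standard library's xml.etree.ElementTree and lists its signs. The function names extract_signs and plain_text are hypothetical helpers introduced for illustration, and the snippet works on a single S element only: the overall layout of the files (for instance, the name of the root element) is not described here, so it is not assumed.

```python
import xml.etree.ElementTree as ET

# The example sentence from the text above, joined into one string.
SAMPLE = (
    '<S ID="2"><SIGN ID="2" CLASS="SSEV">That</SIGN> is '
    '<SIGN ID="3" CLASS="HELP">,</SIGN> a high-fibre diet, '
    'fluid <SIGN ID="4" CLASS="CLN">,</SIGN> etc.</S>'
)


def extract_signs(sentence):
    """Return (ID, CLASS, text) for every SIGN element inside an S element."""
    return [(s.get("ID"), s.get("CLASS"), s.text) for s in sentence.iter("SIGN")]


def plain_text(sentence):
    """Return the sentence text with the annotation tags stripped."""
    return "".join(sentence.itertext())


sentence = ET.fromstring(SAMPLE)
signs = extract_signs(sentence)
text = plain_text(sentence)

print("Sentence ID:", sentence.get("ID"))
for sign_id, sign_class, sign_text in signs:
    print(sign_id, sign_class, repr(sign_text))
print(text)
```

If a dataset file is well-formed XML with a single root element (an assumption), the same helpers could be applied to each sentence by parsing the file with ET.parse and looping over root.iter("S").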

To understand the differences between the classes and how the annotation process was carried out, please consult the annotation guidelines. Specific questions about the annotation should be sent to Richard Evans. A demo of the sign tagger is available at http://rgcl.wlv.ac.uk/demos/SignTaggerWebDemo/

You can find out more about our approach for syntactic simplification in our recent paper

Evans, Richard, and Constantin Orǎsan. 2018. “Identifying Signs of Syntactic Complexity for Rule-Based Sentence Simplification.” Natural Language Engineering. https://doi.org/10.1017/S1351324918000384.

Sheila Castilho and Natalia Resende spend a week at RGCL

This week we have had the pleasure of welcoming Dr Sheila Castilho and Dr Natalia Resende for a one-week research stay at the Research Group in Computational Linguistics. Sheila and Natalia both come from the ADAPT Centre, Dublin, and have come to discuss collaborations with members of our research group. During their stay, both Natalia and Sheila gave the group a talk about their research. Details of the talks can be found below.

 

Speaker: Dr Sheila Castilho

Date of talk: 19th November 2018

Title: Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Machine Translation

Abstract: We reassess a recent study (Hassan et al., 2018) that claimed that machine translation (MT) has reached human parity for the translation of news from Chinese into English, using pairwise ranking and considering three variables that were not taken into account in that previous study: the language in which the source side of the test set was originally written, the translation proficiency of the evaluators, and the provision of inter-sentential context. If we consider only original source text (i.e. not translated from another language, or translationese), then we find evidence showing that human parity has not been achieved. We compare the judgments of professional translators against those of non-experts and discover that those of the experts result in higher inter-annotator agreement and better discrimination between human and machine translations. In addition, we analyse the human translations of the test set and identify important translation issues. Finally, based on these findings, we provide a set of recommendations for future human evaluations of MT.

Speaker: Dr Natalia Resende

Date of talk: 21st November 2018

Title: Classifying nouns in Portuguese into gender categories: a deep learning approach

Abstract: In Portuguese, all nouns are distributed into two gender categories: feminine and masculine. On the one hand, gender can be predicted from the phonological cues present in the endings of the nouns. For example, nouns ending in -a tend to be feminine and nouns ending in -o tend to be masculine. On the other hand, the relationship between word ending and gender is far from being a consistent rule, since nouns ending in other phonemes may be of either gender. In the present study, a connectionist network was trained to classify Portuguese nouns into gender categories considering their phonological structure as a whole. The performance of the network was analysed in detail to check whether the network considers only the endings of the nouns or their whole phonological structure for gender decisions. In addition, we analysed what type of information the network takes into account to decide the gender of nouns whose endings are not predictive of gender. Results show an error-free performance when the network takes into account the phonological information present in the endings of the nouns, and frequency effects for nonpredictive endings. The present study has implications for the training of NLP systems when classifying nouns into gender categories.

Alexander Gelbukh gives a talk to RGCL

In August, Professor Alexander Gelbukh began a 12-month sabbatical at RGCL. As part of his visit, Prof. Gelbukh presented a research seminar to the group on ‘Opinion Mining and Sentiment Analysis’. During his time here, he has held many meetings with members of the group, both to explore future opportunities for collaboration and to discuss his research with interested people.

Mireille Makary completes her Viva!

Congratulations to Mireille Makary for completing her Viva Voce exam on 17th October. Mireille, a part-time distance RGCL student, was defending her thesis ‘Ranking retrieval systems using minimal human assessments’.

After the Viva, Mireille celebrated with the group in the traditional RGCL way!

Research Seminar – Antonio Pascucci

Last week, Antonio Pascucci, a visiting industrial Ph.D. student from the University of Naples – L’Orientale working on computational linguistics for authorship and gender attribution in Italian social media texts, gave a research seminar to the group.

Title: ‘Computational Stylometry for Authorship Attribution in social media texts’

Abstract: 

Computational Stylometry (CS) is the study of stylistic features (linguistic choices). Writing style is a combination of decisions in language production. Through a statistical analysis of these decisions, we can determine an author’s identity and many other characteristics about him/her. Writing style, in fact, is unique to an individual, and that is why we talk about authorial DNA.

CS for authorship attribution is the topic of my research project, and the aim is to use CS for authorship attribution in social media texts. During the seminar, the research project and my first steps in gender attribution will be presented, along with research on cyberbullying detection conducted using software made available by Expert System Corp.

 

Dr. Marcos Zampieri is awarded an ERAS fellowship 2018-2019

Every year, the University of Wolverhampton awards 10 research fellowships to support projects led by researchers who obtained their PhD in the last 5 years. The initiative is called the Early Researcher Award Scheme (ERAS) [1] and provides a budget of up to 5,000 pounds to each project. The program has existed since 2016 and applications are selected on a competitive basis.

Marcos Zampieri, a member of RGCL and RIILP, was selected to be part of the 2018-2019 cohort of ERAS fellows with a project entitled “Identifying and Categorizing Offensive Language in Social Media”. The project deals with the application of computational methods to identify offensive and aggressive language and hate speech in social media. The funding will support the annotation of a large offensive language dataset that will be used in a SemEval 2019 task [2].

For more information, please check Marcos’ recent publications on the topic [3,4,5].

[1] https://www.wlv.ac.uk/research/the-doctoral-college/early-researcher-award-scheme-eras/

[2] https://competitions.codalab.org/competitions/20011

[3] http://web.science.mq.edu.au/~smalmasi/trac1/pdf/W18-4401.pdf

[4] https://arxiv.org/pdf/1803.05495.pdf

[5] https://arxiv.org/pdf/1712.06427.pdf