In our paper:
Evans, R., & Orasan, C. (2013). Annotating signs of syntactic complexity to support sentence simplification. In I. Habernal & V. Matousek (Eds.), Text, Speech and Dialogue. Proceedings of the 16th International Conference TSD 2013. Plzen, Czech Republic: Springer. pp. 92 – 104
we present the annotation of a dataset that is used by our syntactic simplification method to identify places where rewriting rules have to be applied in order to produce simpler sentences.
The datasets are available in XML format as three independent files, each representing a different genre.
Each file contains a list of sentences annotated using the following format:
<S ID="2"><SIGN ID="2" CLASS="SSEV">That</SIGN> is
<SIGN ID="3" CLASS="HELP">,</SIGN> a high-fibre diet,
fluid <SIGN ID="4" CLASS="CLN">,</SIGN> etc.</S>
Sentences are marked with the S tag and signs with the SIGN tag. The type of sign is encoded in the CLASS attribute. The sentences were annotated in isolation, so the files do not contain coherent texts but sequences of sentences extracted from different source files.
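As a rough illustration of how the annotation can be read programmatically, here is a minimal Python sketch that parses the sample sentence shown above with the standard library. It assumes the tag and attribute names seen in that sample (S, SIGN, ID, CLASS). The function name `extract_signs` is ours, and the layout of the full files (for example, their root element) is not described here, so the sketch works on a single sentence only.

```python
import xml.etree.ElementTree as ET

# The sample sentence from this page, copied verbatim.
SAMPLE = (
    '<S ID="2"><SIGN ID="2" CLASS="SSEV">That</SIGN> is\n'
    '<SIGN ID="3" CLASS="HELP">,</SIGN> a high-fibre diet,\n'
    'fluid <SIGN ID="4" CLASS="CLN">,</SIGN> etc.</S>'
)


def extract_signs(sentence_xml):
    """Return (sign ID, class, surface text) for each SIGN in one S element."""
    sentence = ET.fromstring(sentence_xml)
    return [
        (sign.get("ID"), sign.get("CLASS"), sign.text or "")
        for sign in sentence.iter("SIGN")
    ]


signs = extract_signs(SAMPLE)

# The unannotated sentence text can be recovered by joining all text nodes.
plain_text = "".join(ET.fromstring(SAMPLE).itertext())

for sign_id, sign_class, surface in signs:
    print(sign_id, sign_class, repr(surface))
```

The same function could be applied to every S element of a file once the enclosing structure is known.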
To understand the differences between the classes and how the annotation process was carried out, please consult the annotation guidelines. Specific questions about the annotation should be sent to Richard Evans. A demo of the sign tagger is available at http://rgcl.wlv.ac.uk/demos/SignTaggerWebDemo/
You can find out more about our approach to syntactic simplification in our recent paper:
Evans, Richard, and Constantin Orǎsan. 2018. “Identifying Signs of Syntactic Complexity for Rule-Based Sentence Simplification.” Natural Language Engineering. https://doi.org/10.1017/S1351324918000384.
Congratulations to Mireille Makary for completing her Viva Voce exam on 17th October. Mireille, a part-time distance RGCL student, was defending her thesis ‘Ranking retrieval systems using minimal human assessments’.
After the Viva, Mireille celebrated with the group in the traditional RGCL way!
Every year, the University of Wolverhampton awards 10 research fellowships to support projects led by researchers who obtained their PhD in the last 5 years. The initiative, called the Early Research Award Scheme (ERAS), provides a budget of up to 5,000 pounds for each project. The programme has run since 2016, and fellowships are awarded on a competitive basis.
Marcos Zampieri, a member of RGCL and RIILP, was selected to be part of the 2018-2019 cohort of ERAS fellows with a project entitled “Identifying and Categorizing Offensive Language in Social Media”. The project deals with the application of computational methods to identify offensive and aggressive language and hate speech in social media. The funding will support the annotation of a large offensive language dataset that will be used in a SemEval 2019 task.
For more information, please check Marcos’ recent publications on the topic [3,4,5].
The Research Group in Computational Linguistics (RGCL) has been successful in its application for a European Masters in Technology for Translation and Interpreting (EM TTI).
EM TTI will be run by a strong consortium consisting of the University of Wolverhampton, the University of Malaga (Spain), the University of Ljubljana (Slovenia) and New Bulgarian University (Bulgaria), and will deliver a cohesive, integrated Europe-wide programme. Bringing together these four higher education institutions, which are leaders in research on computational aspects of language study and in state-of-the-art technology for translation and interpreting, will give students access to high-profile academics and best practices across the field. Students on the two-year degree course will have the opportunity to study at multiple universities and undertake industry placements related to their dissertation.
EM TTI will produce specialists in translation and interpreting who are up-to-date with the latest applications which support their daily work. The disciplines involved are translation, interpreting, language technology, and linguistics.
This was a highly competitive application process. Prof. R Mitkov, the coordinator of the programme and Director of the Research Institute, commented: ‘This programme is not only the first Erasmus Mundus Master programme on Technology for Translation and Interpreting but the very first Master programme in the world on this topic. It will not only enhance the visibility of the research group and university, but will also create a very special teaching and research vibrant environment on the topics covered.’
The funding of 3 million Euros granted by the EC will cover 60 scholarships across the consortium. The offer of scholarships will drive competition for places and ensure candidates of the highest calibre are selected. Students will be awarded a Multiple Master’s degree from the institutions where they study.
The new programme will begin in September 2019, with applications opening in November/December 2018. For any further information, please contact Amanda Bloore, Project and Funding Officer for RIILP (A.Bloore@wlv.ac.uk).
Congratulations to Shiva Taslimipoor who successfully defended her thesis, entitled ‘Automatic Identification and Translation of Multiword Expressions’, on Tuesday. She is pictured (left-right) with Professor Dew Harrison (Chair of the viva), Dr Aline Villavicencio (External Examiner), Professor Mike Thelwall (Internal Examiner) and Professor Ruslan Mitkov (Director of Studies). We are all thrilled for Shiva and wish her the very best for her next venture!
Next week, Shiva Taslimipoor is to defend her thesis in her viva voce, which will conclude her four-year PhD with the Research Group in Computational Linguistics. In the run-up to her viva, Shiva presented her thesis and the research she has undertaken to the group.
Title: Automatic Identification and Translation of Multiword Expressions
Abstract: Multiword Expressions (MWEs) belong to a class of phraseological phenomena that is ubiquitous in the study of language. They are heterogeneous lexical items consisting of more than one word and feature lexical, syntactic, semantic and pragmatic idiosyncrasies. Scholarly research in MWEs immensely benefits both natural language processing (NLP) applications and end users. Along with the improvement of general NLP techniques, the methodologies to deal with MWEs should be improved.
This thesis involves designing new methodologies to identify and translate MWEs. In order to deal with MWE identification, we first develop datasets of annotated verb-noun MWEs in context. We then propose a method which employs word embeddings to disambiguate between literal and idiomatic usages of expressions. The existence of expression types with various idiomatic and literal distributions leads us to re-examine their modelling and evaluation.
We propose a type-aware train and test splitting approach to prevent models from overfitting and avoid misleading evaluation results.
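The abstract does not spell out the splitting procedure, so the following Python sketch only illustrates the general idea under our own assumptions: every instance of a given expression type goes to either the training set or the test set, never both, so a model cannot score well simply by memorising types. The function name `type_aware_split`, the tuple layout and the toy data are hypothetical and not taken from the thesis.

```python
import random
from collections import defaultdict


def type_aware_split(instances, test_fraction=0.2, seed=0):
    """Split (expression_type, sentence, label) tuples so that no
    expression type appears in both the training and the test set."""
    by_type = defaultdict(list)
    for instance in instances:
        by_type[instance[0]].append(instance)

    # Shuffle the types (not the instances) with a fixed seed.
    types = sorted(by_type)
    random.Random(seed).shuffle(types)

    # Hold out at least one whole type for testing.
    n_test = max(1, round(len(types) * test_fraction))
    test = [x for t in types[:n_test] for x in by_type[t]]
    train = [x for t in types[n_test:] for x in by_type[t]]
    return train, test


# Toy data: four hypothetical verb-noun types with placeholder sentences.
instances = [
    ("take place", "s1", "idiomatic"),
    ("take place", "s2", "literal"),
    ("make face", "s3", "idiomatic"),
    ("make face", "s4", "idiomatic"),
    ("give hand", "s5", "literal"),
    ("give hand", "s6", "idiomatic"),
    ("have word", "s7", "idiomatic"),
]

train, test = type_aware_split(instances, test_fraction=0.25, seed=0)
print(len(train), "training instances;", len(test), "test instances")
```

Note that because whole types are held out, the share of test instances can differ from `test_fraction` when types have unequal numbers of instances.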
Identification of MWEs in context can be modelled with tagging methodologies. To this end, we devise a new neural network architecture, which is a combination of convolutional neural networks and long short-term memory (LSTM) networks with an optional conditional random field layer on top. We conduct extensive evaluations on several languages, demonstrating better performance than state-of-the-art systems. Experiments show that the generalisation power of the model in predicting unseen MWEs is outstanding.
In order to find translations for verb-noun MWEs, we propose a bilingual distributional similarity approach derived from a word embedding model that supports arbitrary context. The technique is devised to extract translation equivalents from comparable corpora, which are an alternative resource to costly parallel corpora. Finally, we conduct a series of experiments to investigate the effects of the size and quality of comparable corpora on the automatic extraction of translation equivalents.
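The abstract gives no implementation details, so the sketch below is not the thesis method. It only illustrates the general idea behind distributional similarity for translation: once a source expression and candidate target expressions are represented as vectors in a shared space, candidates can be ranked by cosine similarity. The vectors, the example expressions and the function names are all invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_translations(source_vec, candidates):
    """Rank candidate translations by similarity to the source vector."""
    scored = [(name, cosine(source_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors, assumed to live in one shared bilingual space.
source_vec = [0.9, 0.1, 0.2]  # hypothetical vector for "take place"
candidates = {
    "tener lugar": [0.8, 0.2, 0.1],
    "dar un paseo": [0.1, 0.9, 0.3],
    "hacer cola": [0.5, 0.5, 0.5],
}

ranked = rank_translations(source_vec, candidates)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In practice the vectors would be learned from comparable corpora rather than written by hand, and the quality of the ranking would depend on the size and quality of those corpora.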