Resources

Tools and Datasets Developed at RGCL

Core NLP
Core NLP (utility)
Language Processing for Assistive Technologies
Lexicography (Applied NLP)
NLP for Social Media
NLP for Technology-Enhanced Learning
Technology-Enhanced Learning
Translation Technologies

— Core NLP —

Datasets

Resource | Contact person | Description | Language(s)
itVN | Shiva Taslimipoor (st797@cam.ac.uk) | Concordances of Verb+Noun Multiword Expressions (MWEs) | Italian
Signs corpus | Richard Evans (r.j.evans@wlv.ac.uk) | Corpus annotated with information about lexical and punctuational markers of syntactic complexity. Corpus used to train shallow syntactic analysers (sign tagger). | English
mwe-geco | Le An Ha (ha.l.a@wlv.ac.uk) | Extracted files from the GECO eye-tracking corpus annotated with Verb-Noun and Verb-Particle constructions. | English
STARS corpus | Richard Evans (r.j.evans@wlv.ac.uk) | Token sequences labelled with information about compound constituents and complex constituents. Dataset used to train partial parsers. | English

NLP Tools

ReCor | Gloria Corpas (gcorpas@uma.es) | ReCor determines the minimum size of a corpus or textual collection, regardless of the language or textual genre of the collection. It establishes the minimum threshold for representation by means of an algorithm (N-Cor) and analyses lexical density as the corpus grows incrementally. | Language independent
INTELITERM | Gloria Corpas (gcorpas@uma.es) | Inteliterm is an intelligent multilingual dictionary written in Java for the health and beauty tourism sector. It quickly displays the information for selected terms included in the Inteliterm database. It also has a TBX database management module and is linked to a corpus manager that allows searching for concordances, n-grams, etc. | Language independent
OntoDiccionario | Gloria Corpas (gcorpas@uma.es) | OntoDiccionario is a software application that displays the conceptual and terminological information implemented in an RDF/OWL ontology (created with TopBraid Composer). Users can search for terms and navigate through concepts by means of hyperlinks. The application parses the ontology code so that all classes, relations, properties and labels are captured; these data are then presented in a simple user interface, which includes a list of concepts, a search engine, and a display window that shows the data for each concept. | Language independent
Gappy MWEs | Le An Ha (ha.l.a@wlv.ac.uk) | Bridging the Gap: Attending to Discontinuity in Identification of Multiword Expressions. Code to identify MWEs which contain gaps. | English, French, German, Persian
UDsyntax | Maria Kunilovskaya (maria.kunilovskaya@wlv.ac.uk) | Calculates (1) the percentage of non-projective dependencies among all dependencies in a treebank (or any .conllu file) and (2) the percentage of sentences (trees) containing such dependencies. | Language independent
VMWE Identification | Shiva Taslimipoor (st797@cam.ac.uk) | Code and documentation for “automatic identification of verbal multiword expressions” | -
discriminative_attribute | Shiva Taslimipoor (st797@cam.ac.uk) | Code and documentation for “Capturing Discriminative Attributes” | -
Finding Discriminative Attributes | Le An Ha (ha.l.a@wlv.ac.uk) | Code and documentation for the SemEval 2018 shared task Capturing Discriminative Attributes: a classification system to determine whether an attribute word can distinguish one word from another. | English
corpusometry | Maria Kunilovskaya (maria.kunilovskaya@wlv.ac.uk) | Data and code related to the task of creating homogeneous and functionally similar subsets of two arbitrary text collections | Language independent
Model for metaphor classification informed by detection of multiword units | Le An Ha (ha.l.a@wlv.ac.uk) | Model for metaphor classification informed by detection of multiword units | English
dms | Maria Kunilovskaya (maria.kunilovskaya@wlv.ac.uk) | Scripts and support files for processing, extracting statistics from, and visualising learner tmx, professional tmx, and reference (txt) corpora with regard to discourse marker analysis | Language independent
Arabic SOS | Emad Mohamed (E.Mohamed2@wlv.ac.uk) | Segmenter and Orthography Standardizer (SOS) for Classical Arabic (CA) | -
Siamese-Recurrent-Architectures | Tharindu D. Ranasinghe Hettiarachchige (t.d.ranasinghehettiarachchige@wlv.ac.uk) | Siamese neural networks for semantic textual similarity. | Language independent
VMWE-Identification | Le An Ha (ha.l.a@wlv.ac.uk) | Tagging verbal multiword expressions | English, Spanish
hypohyper | Maria Kunilovskaya (maria.kunilovskaya@wlv.ac.uk) | Taxonomy enrichment for Russian | Russian
bilm-tf | Maria Kunilovskaya (maria.kunilovskaya@wlv.ac.uk) | TensorFlow implementation of the pretrained biLM used to compute ELMo representations from “Deep contextualized word representations”. | Language independent
Prediction of multiword expressions using eye tracking data | Le An Ha (ha.l.a@wlv.ac.uk) | This repository contains the source code, data, and analyses behind the paper Using Gaze Data to Predict Multiword Expressions. | English
Simple-Sentence-Similarity | Tharindu D. Ranasinghe Hettiarachchige (t.d.ranasinghehettiarachchige@wlv.ac.uk) | Unsupervised methods for calculating semantic textual similarity | Language independent
webvectors | Maria Kunilovskaya (maria.kunilovskaya@wlv.ac.uk) | Web-ify your word2vec: framework to serve distributional semantic models online | Language independent
Classifying Referential and Non-referential It Using Gaze (includes data) | Le An Ha (ha.l.a@wlv.ac.uk) | This repository contains code, annotation and data for the study on classifying referential and non-referential it using gaze. | English
Sign tagger (web demo) | Richard Evans (r.j.evans@wlv.ac.uk) | Shallow syntactic analysis | English
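
The two measures listed for UDsyntax in the table above can be illustrated with a minimal sketch. This is not the code from the UDsyntax repository; the function names are hypothetical, and it assumes a well-formed CoNLL-U input with integer HEAD values. It also assumes one common definition of non-projectivity: an arc is non-projective if some token between the head and the dependent is not dominated by the head.

```python
def read_heads(conllu_text):
    """Yield one {token_id: head_id} dict per sentence (head 0 = root)."""
    heads = {}
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:                      # a blank line ends a sentence
            if heads:
                yield heads
                heads = {}
            continue
        if line.startswith("#"):          # sentence-level comments
            continue
        cols = line.split("\t")
        if cols[0].isdigit():             # skip ranges (1-2) and empty nodes (1.1)
            heads[int(cols[0])] = int(cols[6])
    if heads:
        yield heads


def dominates(ancestor, node, heads):
    """True if `ancestor` is `node` or lies on its path to the root (0)."""
    for _ in range(len(heads) + 1):       # bounded loop guards against cycles
        if node == ancestor:
            return True
        if node == 0:
            return False
        node = heads[node]
    return False


def count_nonprojective(heads):
    """Count arcs with an intervening token that the head does not dominate."""
    count = 0
    for dep, head in heads.items():
        lo, hi = sorted((dep, head))
        if any(not dominates(head, k, heads) for k in range(lo + 1, hi)):
            count += 1
    return count


def nonprojectivity_stats(conllu_text):
    """Return (% non-projective arcs, % sentences with at least one such arc)."""
    arcs = bad_arcs = sents = bad_sents = 0
    for heads in read_heads(conllu_text):
        n = count_nonprojective(heads)
        arcs += len(heads)
        bad_arcs += n
        sents += 1
        if n:
            bad_sents += 1
    if arcs == 0:
        return 0.0, 0.0
    return 100 * bad_arcs / arcs, 100 * bad_sents / sents
```

Calling `nonprojectivity_stats(open(path, encoding="utf-8").read())` on a .conllu file would return the two percentages as a tuple; `path` here stands for any treebank file you supply.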


— Core NLP (utility) —

NLP Tools

Scleaner | Gloria Corpas (gcorpas@uma.es) | Scleaner is a program that helps users format text copied from a PDF file. Copying and pasting from a PDF can introduce various formatting problems: stray white space, tabs, broken sentence boundaries, etc. Scleaner removes extra tabs and white space and automatically splits sentences in the right place. | Language independent
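
The kind of clean-up described in the Scleaner entry above can be sketched as follows. This is not Scleaner's implementation; the function name is hypothetical, and the regular expressions are simple heuristics chosen for illustration (the sentence splitter assumes sentences end in ., ! or ? and start with an upper-case ASCII letter).

```python
import re


def clean_pdf_text(text):
    """Normalise white space in pasted PDF text and split it into sentences."""
    # Collapse tabs, line breaks and repeated spaces into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    if not text:
        return []
    # Naive split: sentence-final punctuation, white space, then a capital.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
```

A heuristic this simple will mis-split after abbreviations such as "Dr."; a production tool would need a proper sentence segmenter.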

— Language Processing for Assistive Technologies —

Datasets

Gaze data from participants with and without autism completing web searching tasks | Le An Ha (ha.l.a@wlv.ac.uk) | Contains gaze data from participants with and without autism completing web searching tasks | English


— Lexicography (Applied NLP) —

Datasets

PDEV | Patrick Hanks (patrick.w.hanks@gmail.com) | Corpus-driven dictionary of English verb patterns | English
PDEV | Sara Moze (S.Moze@wlv.ac.uk) | Corpus-driven dictionary of English verb patterns | English


— NLP for Social Media —

Datasets

Offensive Language Identification Dataset (OLID) | Marcos Zampieri (marcos.zampieri@rit.edu) | A collection of 14,200 English tweets annotated with a model that encompasses the following three levels: (A) Offensive Language Detection; (B) Categorization of Offensive Language; (C) Offensive Language Target Identification | English
Greek Dataset for Offensive Language Identification | Tharindu D. Ranasinghe Hettiarachchige (t.d.ranasinghehettiarachchige@wlv.ac.uk) | Offensive Language Identification | Greek

NLP Tools

Irony Detection | Shiva Taslimipoor (st797@cam.ac.uk) | Code and documentation for “WLV at SemEval-2018 Task 3” | -
DeepOffense | Tharindu D. Ranasinghe Hettiarachchige (t.d.ranasinghehettiarachchige@wlv.ac.uk) | Multilingual Offensive Language Identification with Cross-lingual Embeddings | Bengali, English, Hindi, Spanish


— NLP for Technology-Enhanced Learning —

NLP Tools

QA-for-medical-MCQs (includes data) | Le An Ha (ha.l.a@wlv.ac.uk) | This repository contains the Python code and public data set for | English


— Technology-Enhanced Learning —

Datasets

Tell-Me | Gloria Corpas (gcorpas@uma.es) | Comparable corpus of medical texts | English, German, Spanish


— Translation Technologies —

Datasets

Russian Learner Translator Corpus (RusLTC) | Maria Kunilovskaya (maria.kunilovskaya@wlv.ac.uk) | Bi-directional multiple corpus that stores English-Russian translations done by university translation students | English, Russian
EnEsCC | Shiva Taslimipoor (st797@cam.ac.uk) | English-Spanish comparable corpora | -
Ecoturismo | Gloria Corpas (gcorpas@uma.es) | Comparable corpus of tourism texts | English, Spanish
Inteliterm (Comparable) | Gloria Corpas (gcorpas@uma.es) | Comparable corpus of health, beauty, and tourism texts | English, German, Italian, Spanish
Inteliterm (Parallel) | Gloria Corpas (gcorpas@uma.es) | Parallel corpus of health, beauty, and tourism texts | English, German, Italian, Spanish
Termitur (Comparable) | Gloria Corpas (gcorpas@uma.es) | Comparable corpus of tourism texts | English, German, Spanish
Termitur (Parallel) | Gloria Corpas (gcorpas@uma.es) | Comparable corpus of tourism texts | English, German, Spanish
Turicor | Gloria Corpas (gcorpas@uma.es) | Comparable corpus of tourism/maritime transport texts | English, French, German, Spanish

NLP Tools

TERMITUR | Gloria Corpas (gcorpas@uma.es) | Termitur is a multilingual lexicographic system oriented to the tourism sector. It is a proposal for an intelligent specialised dictionary based on documents and digital resources related to tourism 2.0, combined with intelligent terminology management systems and (semi)automatic corpus compilation. It uses corpora previously compiled by the research team on rural and nature tourism and on health and beauty tourism, as well as other corpora compiled (semi)automatically. The result is a hybrid system that allows translators and interpreters to acquire specialised knowledge of the tourism sector in German, English and Spanish, as well as in the resulting language pairs. | English, German, Spanish
Trandix | Gloria Corpas (gcorpas@uma.es) | Trandix is a computer application that aims to assist the translator during the process of decoding and encoding messages. It makes consulting the terminological information a translator may need fast and convenient. The application also allows users to upload TBX files of any size. Those files can be exported from a terminology database in any specialist field. | Language independent
fluency | Maria Kunilovskaya (maria.kunilovskaya@wlv.ac.uk) | An attempt to capture fluency as an aspect of translation quality along with accuracy | English, Russian
HiT-IT | Maria Kunilovskaya (maria.kunilovskaya@wlv.ac.uk) | Code and data related to the translationese-for-quality project, presented at HiT-IT | Language independent
parcorp | Maria Kunilovskaya (maria.kunilovskaya@wlv.ac.uk) | Code to create a register-balanced corpus for translationese studies | Language independent
translationese45 | Maria Kunilovskaya (maria.kunilovskaya@wlv.ac.uk) | Code to extract 45 translationese indicators for English, German and Russian, most of which were used in the research presented at LREC 2020 | English, German, Russian
scrape | Maria Kunilovskaya (maria.kunilovskaya@wlv.ac.uk) | Tool to collect parallel texts from the web | Language independent
Intelligent-Translation-Memories | Tharindu D. Ranasinghe Hettiarachchige (t.d.ranasinghehettiarachchige@wlv.ac.uk) | Semantically powerful translation memory matching and retrieval | Language independent
TransQuest | Tharindu D. Ranasinghe Hettiarachchige (t.d.ranasinghehettiarachchige@wlv.ac.uk) | Translation Quality Estimation with Cross-lingual Transformers. | Language independent
accuracy | Maria Kunilovskaya (maria.kunilovskaya@wlv.ac.uk) | Tool using cross-linguistic text similarity to capture accuracy | English, Russian
