Tools and Datasets Developed at RGCL
Core NLP
Core NLP (utility)
Language Processing for Assistive Technologies
Lexicography (Applied NLP)
NLP for Social Media
NLP for Technology-Enhanced Learning
Technology-Enhanced Learning
Translation Technologies
— Core NLP —
Datasets
Resource | Contact person | Description | Language(s) |
---|---|---|---|
itVN | Shiva Taslimipoor st797@cam.ac.uk | Concordances of Verb+Noun Multiword Expressions (MWEs) | Italian |
Signs corpus | Richard Evans r.j.evans@wlv.ac.uk | Corpus annotated with information about lexical and punctuational markers of syntactic complexity. Corpus used to train shallow syntactic analysers (sign tagger). | English |
mwe-geco | Le An Ha ha.l.a@wlv.ac.uk | Extracted files from the GECO eye-tracking corpus annotated with Verb-Noun and Verb-Particle constructions. | English |
STARS corpus | Richard Evans r.j.evans@wlv.ac.uk | Token sequences labelled with information about compound constituents and complex constituents. Dataset used to train partial parsers | English |
NLP Tools
ReCor | Gloria Corpas gcorpas@uma.es | ReCor determines the minimum size of a corpus or text collection, regardless of the language or genre of the collection. It establishes the minimum threshold for representativeness by means of an algorithm (N-Cor) and analyses lexical density as the corpus grows incrementally. | Language independent |
INTELITERM | Gloria Corpas gcorpas@uma.es | Inteliterm is an intelligent multilingual dictionary written in Java for the health and beauty tourism sector. It lets users quickly display information on selected terms included in the Inteliterm database. It also has a TBX database management module and is linked to a corpus manager that supports searching for concordances, n-grams, etc. | Language independent |
OntoDiccionario | Gloria Corpas gcorpas@uma.es | OntoDiccionario is a software application that displays the conceptual and terminological information implemented in an RDF/OWL ontology (created with TopBraid Composer); it allows users to search for terms and to navigate through concepts by means of hyperlinks. The application takes the ontology code and parses its content so that all classes, relations, properties and labels are captured; these data are then presented in a simple user interface, which includes a list of concepts, a search engine, and a display window that shows the data for each concept. | Language independent |
Gappy MWEs | Le An Ha ha.l.a@wlv.ac.uk | Bridging the Gap: Attending to Discontinuity in Identification of Multiword Expressions. Code to identify MWEs which contain gaps. | English, French, German, Persian |
UDsyntax | Maria Kunilovskaya maria.kunilovskaya@wlv.ac.uk | Calculates (1) the percentage of non-projective dependencies among all dependencies in a treebank (or any .conllu file) and (2) the percentage of sentences (trees) that contain such dependencies | Language independent |
VMWE Identification | Shiva Taslimipoor st797@cam.ac.uk | Code and documentation for “automatic identification of verbal multiword expressions” | |
discriminative_attribute | Shiva Taslimipoor st797@cam.ac.uk | Code and documentation for “Capturing Discriminative Attributes” | |
Finding Discriminative Attributes | Le An Ha ha.l.a@wlv.ac.uk | Code and documentation for the SemEval 2018 shared task Capturing Discriminative Attributes: a classification system that determines whether an attribute word can distinguish one word from another. | English |
corpusometry | Maria Kunilovskaya maria.kunilovskaya@wlv.ac.uk | Data and code related to the task of creating homogeneous and functionally similar subsets of two arbitrary text collections | Language independent |
Model for metaphor classification informed by detection of multiword units | Le An Ha ha.l.a@wlv.ac.uk | Model for metaphor classification informed by detection of multiword units | English |
dms | Maria Kunilovskaya maria.kunilovskaya@wlv.ac.uk | Scripts and support files for processing, extracting statistics from, and visualising learner and professional TMX files and reference (TXT) corpora with regard to discourse marker analysis | Language independent |
Arabic SOS | Emad Mohamed E.Mohamed2@wlv.ac.uk | Segmenter and Orthography Standardizer (SOS) for Classical Arabic (CA) | |
Siamese-Recurrent-Architectures | Tharindu D. Ranasinghe Hettiarachchige t.d.ranasinghehettiarachchige@wlv.ac.uk | Siamese neural networks for semantic textual similarity. | Language independent |
VMWE-Identification | Le An Ha ha.l.a@wlv.ac.uk | Tagging verbal multiword expressions | English, Spanish |
hypohyper | Maria Kunilovskaya maria.kunilovskaya@wlv.ac.uk | Taxonomy enrichment for Russian | Russian |
bilm-tf | Maria Kunilovskaya maria.kunilovskaya@wlv.ac.uk | Tensorflow implementation of the pretrained biLM used to compute ELMo representations from “Deep contextualized word representations”. | Language independent |
Prediction of multiword expressions using eye tracking data | Le An Ha ha.l.a@wlv.ac.uk | This repository contains the source code, data, and analyses behind the paper Using Gaze Data to Predict Multiword Expressions. | English |
Simple-Sentence-Similarity | Tharindu D. Ranasinghe Hettiarachchige t.d.ranasinghehettiarachchige@wlv.ac.uk | Unsupervised methods to calculate the semantic textual similarity | Language independent |
webvectors | Maria Kunilovskaya maria.kunilovskaya@wlv.ac.uk | Web-ify your word2vec: framework to serve distributional semantic models online | Language independent |
Classifying Referential and Non-referential It Using Gaze (includes data) | Le An Ha ha.l.a@wlv.ac.uk | This repository contains code, annotation and data for the study on classifying referential and non-referential it using gaze. | English |
Sign tagger (web demo) | Richard Evans r.j.evans@wlv.ac.uk | Shallow syntactic analysis | English |
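The two figures reported by the UDsyntax entry in the table above can be illustrated with a minimal Python sketch. This is not the UDsyntax code: the function names are hypothetical, and the sketch assumes a fully annotated CoNLL-U string (integer HEAD values, contiguous token IDs). It also assumes the dominance-based definition, under which an arc is non-projective when some token lying between the head and the dependent is not a descendant of the head. The repository may use a different definition or count arcs differently.

```python
def read_heads(conllu_text):
    """Yield one {token_id: head_id} dict per sentence of a CoNLL-U string."""
    heads = {}
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:                      # a blank line closes a sentence
            if heads:
                yield heads
                heads = {}
        elif not line.startswith("#"):    # skip sentence-level comments
            cols = line.split("\t")
            # Keep plain integer IDs only: this skips multiword-token
            # ranges ("1-2") and empty nodes ("3.1").
            if cols[0].isdigit():
                heads[int(cols[0])] = int(cols[6])
    if heads:
        yield heads


def dominates(head, node, heads):
    """True if `head` is `node` itself or one of its ancestors (0 = root)."""
    for _ in range(len(heads) + 1):       # bounded walk guards against cycles
        if node == head:
            return True
        if node == 0:
            return False
        node = heads[node]
    return False


def is_non_projective(dep, heads):
    """An arc is non-projective if a token between head and dependent
    is not dominated by the head (assumed definition)."""
    head = heads[dep]
    lo, hi = sorted((head, dep))
    return any(not dominates(head, k, heads) for k in range(lo + 1, hi))


def non_projectivity(conllu_text):
    """Return (% non-projective arcs, % sentences with at least one such arc)."""
    arcs = bad_arcs = sents = bad_sents = 0
    for heads in read_heads(conllu_text):
        n_bad = sum(is_non_projective(dep, heads) for dep in heads)
        arcs += len(heads)                # root arcs are counted as arcs too
        bad_arcs += n_bad
        sents += 1
        bad_sents += n_bad > 0
    if arcs == 0:
        return 0.0, 0.0
    return 100.0 * bad_arcs / arcs, 100.0 * bad_sents / sents
```

Calling `non_projectivity(open(path, encoding="utf-8").read())` on a `.conllu` file would return both percentages as a tuple. Root arcs are included in the denominator here, which is a design choice of this sketch rather than something stated in the entry.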
— Core NLP (utility) —
NLP Tools
Scleaner | Gloria Corpas gcorpas@uma.es | Scleaner is a program that helps users format text copied from a PDF file. Copying and pasting from a PDF file can introduce various formatting problems: white spaces, tabulations, broken sentence boundaries, etc. Scleaner removes extra tabs and white spaces, and splits sentences in the right place automatically. | Language independent |
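The kind of clean-up the Scleaner entry describes can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not Scleaner's implementation: the function name `clean_pdf_text` is hypothetical, and the sentence splitter is a naive rule (a break after `.`, `!` or `?` when followed by a capital letter) that will mis-split abbreviations such as "Dr. Smith".

```python
import re


def clean_pdf_text(raw):
    """Collapse PDF copy-paste whitespace and return a list of sentences."""
    # Tabs, hard line breaks and runs of spaces all become a single space.
    text = re.sub(r"\s+", " ", raw).strip()
    if not text:
        return []
    # Naive sentence boundary: end punctuation, one space, then a capital.
    return re.split(r"(?<=[.!?]) (?=[A-Z])", text)
```

For example, `clean_pdf_text("This  is\ta line\nbroken by the PDF.   Next\nsentence here.")` rejoins the hard-wrapped lines and returns two sentences.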
— Language Processing for Assistive Technologies —
Datasets
Gaze data from participants with and without autism completing web searching tasks | Le An Ha ha.l.a@wlv.ac.uk | Contains gaze data from participants with and without autism completing web searching tasks | English |
— Lexicography (Applied NLP) —
Datasets
PDEV | Patrick Hanks patrick.w.hanks@gmail.com | Corpus-driven dictionary of English verb patterns | English |
PDEV | Sara Moze S.Moze@wlv.ac.uk | Corpus-driven dictionary of English verb patterns | English |
— NLP for Social Media —
Datasets
Offensive Language Identification Dataset (OLID) | Marcos Zampieri marcos.zampieri@rit.edu | A collection of 14,200 annotated English tweets using an annotation model that encompasses the following three levels: (A) Offensive Language Detection; (B) Categorization of Offensive Language; (C) Offensive Language Target Identification | English |
Greek Dataset for Offensive Language Identification | Tharindu D. Ranasinghe Hettiarachchige t.d.ranasinghehettiarachchige@wlv.ac.uk | Offensive Language Identification | Greek |
NLP Tools
Irony Detection | Shiva Taslimipoor st797@cam.ac.uk | Code and documentation for “WLV at SemEval-2018 Task 3” | |
DeepOffense | Tharindu D. Ranasinghe Hettiarachchige t.d.ranasinghehettiarachchige@wlv.ac.uk | Multilingual Offensive Language Identification with Cross-lingual Embeddings | Bengali, English, Hindi, Spanish |
— NLP for Technology-Enhanced Learning —
NLP Tools
QA-for-medical-MCQs (includes data) | Le An Ha ha.l.a@wlv.ac.uk | This repository contains the python code and public data set for | English |
— Technology-Enhanced Learning —
Datasets
Tell-Me | Gloria Corpas gcorpas@uma.es | Comparable corpus of medical texts | English, German, Spanish |
— Translation Technologies —
Datasets
Russian Learner Translator Corpus (RusLTC) | Maria Kunilovskaya maria.kunilovskaya@wlv.ac.uk | Bi-directional multiple corpus, which stores English-Russian translations done by university translation students | English, Russian |
EnEsCC | Shiva Taslimipoor st797@cam.ac.uk | English-Spanish comparable corpora | English, Spanish |
Ecoturismo | Gloria Corpas gcorpas@uma.es | Comparable corpus of tourism texts | English, Spanish |
Inteliterm (Comparable) | Gloria Corpas gcorpas@uma.es | Comparable corpus of health, beauty, and tourism texts | English, German, Italian, Spanish |
Inteliterm (Parallel) | Gloria Corpas gcorpas@uma.es | Parallel corpus of health, beauty, and tourism texts | English, German, Italian, Spanish |
Termitur (Comparable) | Gloria Corpas gcorpas@uma.es | Comparable corpus of tourism texts | English, German, Spanish |
Termitur (Parallel) | Gloria Corpas gcorpas@uma.es | Parallel corpus of tourism texts | English, German, Spanish |
Turicor | Gloria Corpas gcorpas@uma.es | Comparable corpus of tourism/maritime transport texts | English, French, German, Spanish |
NLP Tools
TERMITUR | Gloria Corpas gcorpas@uma.es | Termitur is a multilingual lexicographic system for the tourism sector. It is a proposal for an intelligent specialised dictionary based on documents and digital resources related to tourism 2.0, combined with intelligent terminology management systems and (semi)automatic corpus compilation. It uses corpora previously compiled by the research team on rural and nature tourism and on health and beauty tourism, as well as other corpora compiled (semi)automatically. The result is a hybrid system that allows translators and interpreters to acquire specialised knowledge of the tourism sector in German, English and Spanish, as well as in the resulting language pairs. | English, German, Spanish |
Trandix | Gloria Corpas gcorpas@uma.es | Trandix is a computer application that aims to assist the translator during the process of decoding and encoding messages. It makes the terminological information a translator may need faster and more convenient to consult. The application also allows users to upload TBX files with no size limit. These files can be exported from a terminology database in any specialised field. | Language independent |
fluency | Maria Kunilovskaya maria.kunilovskaya@wlv.ac.uk | An attempt to capture fluency as an aspect of translation quality along with accuracy | English, Russian |
HiT-IT | Maria Kunilovskaya maria.kunilovskaya@wlv.ac.uk | Code and data related to translationese-for-quality project, presented at HiT-IT | Language independent |
parcorp | Maria Kunilovskaya maria.kunilovskaya@wlv.ac.uk | Code to create a register-balanced corpus for translationese studies | Language independent |
translationese45 | Maria Kunilovskaya maria.kunilovskaya@wlv.ac.uk | Code to extract 45 translationese indicators for English, German and Russian, most of which were used in the research presented at LREC 2020 | English, German, Russian |
scrape | Maria Kunilovskaya maria.kunilovskaya@wlv.ac.uk | Tool to collect parallel texts from the web | Language independent |
Intelligent-Translation-Memories | Tharindu D. Ranasinghe Hettiarachchige t.d.ranasinghehettiarachchige@wlv.ac.uk | Semantically powerful translation memory matching and retrieval | Language independent |
TransQuest | Tharindu D. Ranasinghe Hettiarachchige t.d.ranasinghehettiarachchige@wlv.ac.uk | Translation Quality Estimation with Cross-lingual Transformers. | Language independent |
accuracy | Maria Kunilovskaya maria.kunilovskaya@wlv.ac.uk | Tool using cross-linguistic text similarity to capture accuracy | English, Russian |