We will all be very sorry to say goodbye to Larissa when she soon returns to Universidade Federal de Minas Gerais (UFMG) in Brazil. This week, Larissa gave the group a talk which outlined her research and the work she has been doing in Wolverhampton for the past year.
Title: Mining Short Text Data Using Parallel Programming
Abstract: This work describes the classification of texts as either crime-related or non-crime-related. Given the spontaneity and popularity of Twitter, we collected posts related to crime and criminology in the state of São Paulo (SP), Brazil. However, this data set is not a collection of crime reports: because web language is characterised by diversity, including flexibility, spontaneity and informality, we need a classification rule to filter out the documents that are genuinely relevant. The proposed methodology works in a two-step framework. In the first step we partition the text database into smaller data sets, defining text collections with characteristics (not necessarily directly observable) that allow a better classification process. This enables the use of parallel computing, which reduces the processing time required to execute the technique. Each subset of the data then induces a distinct classification rule with a supervised machine learning technique. For the sake of simplicity we work with K-Means, K-Medoids, and linear SVMs. We will present our results in terms of speed and classification accuracy using various feature sets, including semantic codes.
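The two-step framework described in the abstract can be sketched with scikit-learn: first partition the corpus with k-means, then train one linear SVM per partition. The tiny toy corpus and all parameter values below are illustrative assumptions, not the authors' actual data or settings.

```python
# Minimal sketch of the two-step framework: cluster, then classify per cluster.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

texts = [
    "robbery reported downtown last night",
    "police arrest suspect in theft case",
    "great pizza at the new restaurant",
    "enjoying the sunny weather today",
    "burglary at local shop under investigation",
    "watching football with friends tonight",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = crime-related, 0 = not (invented toy labels)

# Step 1: partition the text database into smaller subsets with k-means.
vec = TfidfVectorizer()
X = vec.fit_transform(texts)
k = 2
clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Step 2: induce one linear-SVM classifier per partition. Each iteration is
# independent, which is what makes the parallel execution in the talk possible.
classifiers = {}
for c in range(k):
    idx = [i for i, ci in enumerate(clusters) if ci == c]
    cluster_labels = [labels[i] for i in idx]
    if len(set(cluster_labels)) > 1:  # an SVM needs both classes present
        classifiers[c] = LinearSVC().fit(X[idx], cluster_labels)
```

In a real setting the per-cluster training loop would be distributed across workers (e.g. with `joblib`), which is where the speed-up reported in the talk comes from.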
Last week the Research Group in Computational Linguistics hosted a Women in Science Research Seminar. The invited speaker was Dr Corina Forascu (Faculty of Computer Science, Alexandru Ioan Cuza University of Iasi, Romania; Fulbright scholar at the University of Rochester, NY, USA).
The seminar was attended by colleagues from across the University and there was an engaged discussion after Corina’s talk – perhaps the start of a fledgling ‘Women in Science Network’ at Wolverhampton?
Abstract: Let’s do IT, ladies!
Nowadays, given the existing disparity between men and women in IT, there are many initiatives aiming to bridge this gap. The speaker will introduce the main groups and activities related to women in IT, Computer Science and related fields, and will present current opportunities dedicated to them, such as conferences and contests. Based on the initiatives and events organised within the Women in Information Technology group of Iasi, the speaker will suggest steps that could be taken by a similar group at the University of Wolverhampton.
Last week, Dr Lucas Vieira from the University of Bristol visited RGCL and gave a seminar as part of the MA Computational Linguistics programme. Dr Vieira’s talk was well received by the MA students and attended by members of the Research Group.
Title: Cognitive effort and translation technologies
Measuring the level of effort required by translation tasks has several applications in academia and industry, including understanding what goes on in translators’ minds and being able to predict effort when planning and managing translation projects. Effort is therefore a frequently researched topic in translation, but measuring cognitive (or mental) effort is not straightforward. This talk will examine the implications of using different data sources to estimate cognitive effort. First, it will present results of a study where a multivariate analysis was used to investigate the relationship between different measures of cognitive effort in post-editing of machine translation, including eye-movement metrics, pauses, and subjective ratings. It will then provide preliminary results of a second study where eye tracking was used to observe trainee translators’ first-ever interaction with computer-assisted translation tools. The talk will show how mouse clicks can be used to complement eye-movement data in translation research and will argue that care should be taken in choosing appropriate measures of effort depending on the study questions and the sample size. Although different measures of cognitive effort are, as expected, often found to be correlated, it was observed that these measures can be clustered together in different groups, which has implications for research and for how the concept of cognitive effort is formulated and understood.
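The study described above correlates different measures of cognitive effort (eye-movement metrics, pauses, subjective ratings). As a hedged illustration of that kind of analysis, the sketch below computes Spearman's rho between two hypothetical per-segment effort measures; the numbers are invented and not from Dr Vieira's data.

```python
# Illustrative only: correlating two invented measures of cognitive effort
# (fixation counts and pause durations) with Spearman's rank correlation.
from scipy.stats import spearmanr

fixations = [12, 30, 25, 41, 18, 55, 33, 47]            # eye fixations per segment
pause_sec = [1.1, 2.8, 2.2, 3.9, 1.5, 5.2, 3.4, 3.8]    # pause duration per segment

rho, p_value = spearmanr(fixations, pause_sec)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```

As the talk notes, such measures are often correlated yet can still cluster into distinct groups, which is why the choice of measure matters for a given research question and sample size.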
Last week RGCL welcomed Dr Aline Villavicencio from the University of Essex (UK) and the Federal University of Rio Grande do Sul (Brazil). Dr Villavicencio gave a Research Seminar, which was well attended and followed by an interesting discussion.
Upcoming Seminars can be found on our website here.
Title: Identifying Idiomatic Language with Distributional Semantic Models
Precise natural language understanding requires adequate treatments both of single words and of larger units. However, expressions like compound nouns may display idiomaticity, and while a police car is a car used by the police, a loan shark is not a fish that can be borrowed. Therefore it is important to identify which expressions are idiomatic, and which are not, as the latter can be interpreted from a combination of the meanings of their component words while the former cannot. In this talk I discuss the ability of distributional semantic models (DSMs) to capture idiomaticity in compounds, by means of a large-scale multilingual evaluation of DSMs in French and English. A total of 816 DSMs were constructed in 2,856 evaluations. The results obtained show a high correlation with human judgments about compound idiomaticity (Spearman’s ρ=.82 in one dataset), indicating that these models are able to successfully detect idiomaticity.
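The core intuition behind using DSMs for idiomaticity can be shown in a few lines: compare a compound's own distributional vector with a composition (here, the average) of its parts' vectors, and treat low similarity as a signal of idiomaticity. The vectors below are invented for demonstration and are not from the evaluated models.

```python
# Toy illustration: a literal compound ("police car") sits close to the
# composition of its parts; an idiomatic one ("loan shark") does not.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

vectors = {  # invented 3-dimensional "distributional" vectors
    "police":     np.array([0.9, 0.1, 0.0]),
    "car":        np.array([0.1, 0.9, 0.1]),
    "police car": np.array([0.5, 0.5, 0.05]),  # near the average of its parts
    "loan":       np.array([0.0, 0.2, 0.9]),
    "shark":      np.array([0.1, 0.8, 0.3]),
    "loan shark": np.array([0.8, 0.1, 0.2]),   # far from the average of its parts
}

def compositionality(compound, w1, w2):
    composed = (vectors[w1] + vectors[w2]) / 2
    return cosine(vectors[compound], composed)

literal = compositionality("police car", "police", "car")
idiom = compositionality("loan shark", "loan", "shark")
```

Real DSMs use hundreds of dimensions learned from corpora, but the comparison itself is the same: `literal` scores high and `idiom` scores low, mirroring the human idiomaticity judgments the models were evaluated against.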
RGCL welcomed visitors Matthias Schlögl and Katalin Eszter Lejtovicz from the Austrian Centre for Digital Humanities earlier this week. Whilst they were here, Matthias and Katalin presented their current research to the group. The seminar was well received and there was an interesting discussion.
The Austrian Centre for Digital Humanities (ACDH) of the Austrian Academy of Sciences is a research institute which was set up with the declared intention of fostering the humanities by applying digital methods and tools in a wide range of academic fields. It offers a growing portfolio of services: running a repository for digital language resources, hosting and publishing data, developing software, and working to establish a tightly knit network of specialised knowledge centres by offering advice and guidance to the research community.
In his presentation Matthias Schlögl will concentrate on the APIS project, whose ultimate goal is the semantic enrichment of the roughly 18,000 biographies published so far in the Austrian Biographical Dictionary (ÖBL). In the course of the project a Virtual Research Environment (VRE) was developed that allows researchers to annotate biographies, link entities to reference resources, and visualise/export the results. Data generated in the VRE is used to train and evaluate Natural Language Processing (NLP) tools, which in turn store their annotations in the VRE (where they can be reviewed by researchers). Matthias will also show some NLP-related tools (e.g. a web-based tool to re-train named entity recognition (NER) models) that are in use or in development at the ACDH to support digital humanities projects like APIS.
In her presentation Katalin Eszter Lejtovicz will describe how the APIS project aims to extract information from unstructured biographical documents by detecting named entities, linking them to Linked Open Data vocabularies, and finding relations between the entities. The presentation will give an introduction to the steps of information extraction in APIS and a brief overview of the tools and resources used in the project, as well as those the team has been experimenting with. To name a few: Apache Stanbol for entity linking, IEPY and GATE for relation extraction, and GermaNet and Wikidata for disambiguation.
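The entity-linking step described above, mapping surface mentions in a biography to knowledge base identifiers, can be caricatured in a few lines. Real systems such as Apache Stanbol perform fuzzy matching and disambiguation against large vocabularies; the mini "knowledge base" below, and its Wikidata-style QIDs, are illustrative placeholders only.

```python
# Deliberately simple dictionary-based entity linking: each known surface
# form found in the text is mapped to a knowledge base identifier.
KB = {  # placeholder knowledge base; QIDs are illustrative, not verified
    "Vienna": "Q1741",
    "Austrian Academy of Sciences": "Q_PLACEHOLDER",
}

def link_entities(text, kb):
    """Return (mention, kb_id) pairs for every KB surface form found in text."""
    return [(name, qid) for name, qid in kb.items() if name in text]

bio = "He studied in Vienna and later joined the Austrian Academy of Sciences."
links = link_entities(bio, KB)
```

The hard parts that APIS actually tackles, disambiguating ambiguous names and extracting relations between the linked entities, sit on top of this basic lookup.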
Last week we enjoyed a visit from Dr. Shiyan Ou from the School of Information Management, Nanjing University, China. The group enjoyed her visit and her seminar was very well received.
Title: Unsupervised Citation Sentence Identification based on Similarity Measurement
Abstract: Citation context analysis has attracted the interest of many researchers in the field of bibliometrics. The first step is to extract the context of each citation from a citing paper. We proposed a novel unsupervised approach for identifying implicit citation sentences, i.e. those without an attached citation tag. Our approach selects the sentences neighbouring an explicit citation sentence as candidates, calculates the similarity between each candidate sentence and the cited or citing paper, and deems those that are more similar to the cited paper to be implicit citation sentences. To calculate text similarity, we proposed four methods based on the Doc2vec model, the Vector Space Model (VSM) and the LDA model respectively. The experimental results showed that a hybrid method combining the probabilistic TF-IDF-weighted VSM with the TF-IDF-weighted Doc2vec obtained the best performance. Compared with supervised methods, our approach does not need an annotated training corpus and thus, in theory, can easily be applied to other domains.
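The similarity step at the heart of the approach can be sketched with one of the simpler measures mentioned in the abstract, a TF-IDF vector space model: candidate sentences near an explicit citation are scored against the cited paper, and the most similar ones are kept as implicit citation sentences. The texts and the threshold below are invented for illustration.

```python
# Sketch of the VSM similarity step: TF-IDF cosine similarity between
# candidate sentences and (a stand-in for) the cited paper's text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cited_paper = "neural machine translation with attention improves fluency"
candidates = [
    "Their attention-based translation model improves fluency markedly.",  # implicit citation
    "We now turn to the description of our own data set.",                 # unrelated
]

vec = TfidfVectorizer().fit([cited_paper] + candidates)
cited_vec = vec.transform([cited_paper])
cand_vecs = vec.transform(candidates)

scores = cosine_similarity(cand_vecs, cited_vec).ravel()
implicit = [s for s, score in zip(candidates, scores) if score > 0.2]  # 0.2 is an invented threshold
```

The reported best method additionally blends in Doc2vec similarity, but the candidate-selection and thresholding logic follows the same pattern.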
Our next visitor will be Professor Gloria Corpas Pastor who will be giving lectures on the 9th and 10th April.