Last week, Antonio Pascucci, a visiting industrial Ph.D. student in Computational Linguistics working on authorship and gender attribution in Italian social media texts at the University of Naples ‘L’Orientale’, gave a Researcher Seminar to the group.
Title: ‘Computational Stylometry for Authorship Attribution in social media texts’
Computational Stylometry (CS) is the study of stylistic features, i.e. linguistic choices. Writing style is a combination of decisions made in language production. Through a statistical analysis of these decisions, we can identify the author and infer many other characteristics about them. Writing style, in fact, is unique to an individual, which is why we speak of an authorial DNA.
CS for authorship attribution is the topic of my research project, and the aim is to use CS for authorship attribution in social media texts. During the seminar, the research project and my first steps in gender attribution will be presented, in addition to cyberbullying detection research conducted using software made available by Expert System Corp.
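The stylometric idea can be illustrated with a common baseline: character n-gram frequency profiles compared by cosine similarity. This is a minimal sketch, not the method presented in the seminar, and the author names and texts below are invented:

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Frequency profile of character n-grams, a common stylometric feature."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    """Cosine similarity between two n-gram frequency profiles."""
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def attribute(unknown, author_profiles):
    """Attribute `unknown` to the author with the most similar profile."""
    profile = char_ngrams(unknown)
    return max(author_profiles, key=lambda a: cosine(profile, author_profiles[a]))

# Invented example: two 'authors' with very different writing habits.
profiles = {
    "alice": char_ngrams("i really really love cats and i text like thiiis lol"),
    "bob": char_ngrams("Dear colleague, please find attached the requested report."),
}
print(attribute("omg cats are sooo cute lol", profiles))  # -> alice
```

Real stylometric systems use many more feature types (function words, punctuation habits, syntax), but the principle of comparing an unseen text against per-author profiles is the same.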
Pablo Calleja, from the Ontology Engineering Group at the Universidad Politécnica de Madrid, Spain, is currently completing an Internship with RGCL as part of his PhD. Yesterday, Pablo gave a talk to the group about his research.
Title: Role-based Named Entity Recognition over unstructured texts
Named Entity Recognition (NER) poses new challenges in real-world documents, in which entities can play different roles according to their purpose or meaning. In scenarios where only a subset of entities, selected by role, is needed, retrieving all possible entities introduces noise and lowers overall precision.
The talk will present a role-based NER task that relies on role classification hierarchy models to recognize entities with a specific role. The proposed task has been implemented in two use cases: one in the biomedical domain, using Spanish drug Summaries of Product Characteristics, and the other in the legal domain, using the multilingual and heterogeneous emails of the Panama Papers investigation.
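As a rough sketch of the filtering idea: entities produced by an upstream NER step are kept only if their assigned role falls under a requested branch of a role hierarchy. The role labels and hierarchy below are invented for illustration, not those used in the talk:

```python
# Toy role hierarchy as child -> parent edges (invented labels).
ROLE_PARENT = {
    "active_ingredient": "drug_component",
    "excipient": "drug_component",
    "drug_component": "biomedical_entity",
    "adverse_effect": "biomedical_entity",
}

def has_role(role, target):
    """True if `role` equals `target` or is a descendant of it in the hierarchy."""
    while role is not None:
        if role == target:
            return True
        role = ROLE_PARENT.get(role)
    return False

def filter_by_role(entities, target):
    """Keep only entities whose role falls under the requested branch."""
    return [e for e in entities if has_role(e["role"], target)]

entities = [
    {"text": "ibuprofen", "role": "active_ingredient"},
    {"text": "lactose", "role": "excipient"},
    {"text": "nausea", "role": "adverse_effect"},
]
# Asking for drug components keeps ibuprofen and lactose but drops nausea.
print(filter_by_role(entities, "drug_component"))
```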
Last week, we were visited by Dr Chun Chang and Zhao Jie from the Institute of Scientific and Technical Information of China. Dr Chang and Zhao Jie spent the day in meetings with members of RGCL discussing future collaborations, but in between meetings Dr Chang found the time to give a talk to the group. The details can be found below:
Title: The Construction and Application of Chinese Thesaurus in China
Dr Chun Chang is a Professor at the Institute of Scientific and Technical Information of China and has long been engaged in the construction and application of knowledge organization systems.
Focusing on the construction and application of the Chinese Thesaurus, the lecture will cover three main aspects: the definition of a thesaurus and the basics of thesaurus construction in China; the history and current state of the Chinese Thesaurus and the application of Computational Linguistics in its compilation; and the current and prospective applications of the Chinese Thesaurus in information retrieval.
On Monday, Antoni Oliver González, from the Universitat Oberta de Catalunya (UOC) in Barcelona, arrived at RGCL for a two-week stay to form research collaborations with members of the group. On Thursday, Antoni gave the following talk to the group:
Title: Automatic detection of translation equivalents of terms in large parallel and comparable corpora
Abstract: In this talk, some methodologies for finding the translation equivalents of a term in large parallel and comparable corpora will be presented. For parallel corpora, we use the translation tables of Statistical Machine Translation systems (Moses). For comparable corpora, we are experimenting with vecmap, a tool for creating cross-lingual word embedding mappings. The experiments are carried out using the IATE database for English, for two subject fields: International Relations and International Organizations. The goal is to enlarge the Spanish IATE database and to create this database for Catalan.
These experiments are being performed during a short research stay, so only preliminary results will be presented.
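To illustrate the parallel-corpus side: a Moses phrase table is a plain-text file of `source ||| target ||| scores ...` lines, from which translation candidates for a term can be read off. The following is a minimal sketch under stated assumptions: the position of the direct translation probability among the scores depends on the Moses configuration (index 2 is assumed here), and the table lines are invented examples:

```python
def translation_candidates(term, phrase_table_lines, score_index=2):
    """Collect (target, score) translation candidates for `term` from
    Moses phrase-table lines, sorted by descending score.

    Assumes the direct translation probability sits at `score_index`
    in the score column; adjust for your Moses setup."""
    candidates = []
    for line in phrase_table_lines:
        parts = [p.strip() for p in line.split("|||")]
        if len(parts) < 3 or parts[0] != term:
            continue
        scores = parts[2].split()
        if len(scores) > score_index:
            candidates.append((parts[1], float(scores[score_index])))
    return sorted(candidates, key=lambda c: -c[1])

# Invented phrase-table excerpt (English -> Spanish).
table = [
    "international relations ||| relaciones internacionales ||| 0.1 0.2 0.8 0.3",
    "international relations ||| relaciones exteriores ||| 0.2 0.1 0.15 0.2",
    "organization ||| organizacion ||| 0.3 0.1 0.9 0.4",
]
print(translation_candidates("international relations", table))
```

In practice the top-scoring target phrase is then validated (e.g. against IATE) before being accepted as a translation equivalent of the term.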
Branislava Šandrih, from the University of Belgrade, has spent a week with us at RGCL. During this time, she has formed collaborations with many members of the group. Branislava gave a talk to the group which outlined her research.
Title: Fingerprints in SMS messages
The presentation will describe a study that seeks answers to the following questions:
- Is it possible to tell who the sender of a short message is by analysing only the distribution of characters, and not the meaning of the content itself?
- If possible, how reliable would the judgment be?
- Are we leaving some kind of ‘fingerprints’ when we text, and can we tell anything about a person based on the way this person writes short messages?
A multilingual corpus of SMS messages was collected from a single smartphone to underpin the development of a methodology to address the above challenges. First, a binary classifier was trained to distinguish between messages composed and sent by a public service (e.g. a parking service, bank reports, etc.) and messages written by humans. A second classifier caters for the more challenging task of distinguishing between messages written by the owner of the smartphone and messages sent by other senders.
Branislava’s presentation outlined the experiments related to the above classifiers and reported the evaluation results.
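The character-distribution idea can be sketched with a simple nearest-centroid classifier over character frequency profiles. This is a toy baseline, not the classifiers from the talk, and the example messages are invented:

```python
from collections import Counter

def char_distribution(text):
    """Relative frequency of each character - the 'fingerprint' features."""
    counts = Counter(text)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def centroid(messages):
    """Average character distribution over a set of messages."""
    acc = Counter()
    for m in messages:
        for c, f in char_distribution(m).items():
            acc[c] += f
    return {c: f / len(messages) for c, f in acc.items()}

def distance(p, q):
    """Squared Euclidean distance between two character distributions."""
    return sum((p.get(c, 0) - q.get(c, 0)) ** 2 for c in set(p) | set(q))

def classify(message, class_centroids):
    """Assign the message to the class with the nearest centroid."""
    d = char_distribution(message)
    return min(class_centroids, key=lambda k: distance(d, class_centroids[k]))

# Invented training data: service messages lean on capitals and digits,
# human messages on lowercase chat language.
centroids = {
    "service": centroid(["YOUR PARKING TICKET 1234 EXPIRES AT 18:00.",
                         "BALANCE: 250.00 EUR ON ACCOUNT 9876."]),
    "human": centroid(["hey are u coming tonight? lol",
                       "omg that was so funny haha see u soon"]),
}
print(classify("REMINDER: TICKET 5678 EXPIRES 17:30.", centroids))  # -> service
```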
We will all be very sorry to say goodbye to Larissa when she soon returns to Universidade Federal de Minas Gerais (UFMG) in Brazil. This week, Larissa gave the group a talk which outlined her research and the work she has been doing in Wolverhampton for the past year.
Title: Mining Short Text Data Using Parallel Programming
Abstract: This work describes the classification of texts as either crime-related or non-crime-related. Given the spontaneity and popularity of Twitter, we collected posts related to crime and criminology in the state of São Paulo, Brazil. However, this data set is not a collection of crime reports: because web language is characterized by diversity, flexibility, spontaneity and informality, we need a classification rule to filter the documents that really belong to the context. The proposed methodology works in a two-step framework. In the first step, we partition the text database into smaller data sets that define text collections with characteristics (not necessarily directly observable) which allow a better classification process. This enables the use of parallel computing, which decreases the processing time required to execute the technique. Each subset of the data then induces a distinct classification rule with a supervised machine learning technique. For the sake of simplicity, we work with KMeans, KMedoids and linear SVMs. We will present our results in terms of speed and classification accuracy using various feature sets, including semantic codes.
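The two-step framework can be sketched as: partition the data with k-means, then train one classification rule per partition, with the per-partition training done in parallel. This toy version substitutes a nearest-class-centroid rule for the linear SVM, uses a thread pool for the parallel step, and runs on invented numeric features rather than real text vectors:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import random

def dist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20, seed=0):
    """Step 1: plain k-means partitioning; returns one cluster id per point."""
    centers = random.Random(seed).sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return assign

def train_rule(examples):
    """Step 2: a supervised rule for one partition - here a nearest-class-
    centroid classifier standing in for the linear SVM of the talk."""
    by_class = defaultdict(list)
    for x, y in examples:
        by_class[y].append(x)
    cents = {y: [sum(xs) / len(vs) for xs in zip(*vs)] for y, vs in by_class.items()}
    return lambda x: min(cents, key=lambda y: dist(x, cents[y]))

def fit_two_step(points, labels, k=2):
    """Partition, then train the per-partition rules in parallel."""
    assign = kmeans(points, k)
    clusters = defaultdict(list)
    for p, y, a in zip(points, labels, assign):
        clusters[a].append((p, y))
    with ThreadPoolExecutor() as pool:  # one training job per partition
        futures = {a: pool.submit(train_rule, exs) for a, exs in clusters.items()}
        rules = {a: f.result() for a, f in futures.items()}
    return assign, rules
```

At prediction time a new document is routed to its nearest cluster and classified by that cluster's rule, which is what makes the smaller, more homogeneous partitions pay off.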