RGCL is delighted to share that a team of our academics and a PhD student has recently been awarded the Best Paper Award for the Qur’an QA 2022 shared task at the 5th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT5), held at the 13th Language Resources and Evaluation Conference (LREC 2022). This is the first best paper award for the recently established RIGHT Lab.
The organisers evaluated the submitted papers against several criteria.
Title: DTW at Qur’an QA 2022: Utilising Transfer Learning with Transformers for Question Answering in a Low-resource Domain
Abstract: The task of machine reading comprehension (MRC) is a useful benchmark for evaluating the natural language understanding of machines. It has gained popularity in the natural language processing (NLP) field mainly due to the large number of datasets released for many languages. However, MRC remains understudied in several domains, including religious texts. The goal of the Qur’an QA 2022 shared task is to fill this gap by producing state-of-the-art question answering and reading comprehension research on the Qur’an. This paper describes the DTW entry to the Qur’an QA 2022 shared task. Our methodology uses transfer learning to take advantage of available Arabic MRC data. We further improve the results using various ensemble learning strategies. Our approach achieved a partial Reciprocal Rank (pRR) score of 0.49 on the test set, demonstrating strong performance on the task.
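As a rough illustration of the ensemble idea (not the DTW team’s actual models or data), a minimal score-fusion ensemble over candidate answer spans might look like this, with hypothetical spans and confidence scores:

```python
from collections import defaultdict

def ensemble_answers(model_predictions):
    """Combine ranked answer spans from several QA models by summing
    their confidence scores, then re-rank (a simple score-fusion ensemble)."""
    fused = defaultdict(float)
    for predictions in model_predictions:
        for span, score in predictions:
            fused[span] += score
    # Highest fused score first
    return sorted(fused.items(), key=lambda kv: -kv[1])

# Hypothetical outputs of three fine-tuned transformer QA models
preds = [
    [("span A", 0.9), ("span B", 0.4)],
    [("span B", 0.8), ("span A", 0.3)],
    [("span A", 0.5), ("span C", 0.2)],
]
ranked = ensemble_answers(preds)
print(ranked[0][0])  # span with the highest combined score
```

Other fusion rules (majority voting over top answers, rank-based weighting) slot into the same structure.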
Title: Content Analysis of Digital Text with Special Focus on Named Entity Recognition and Linking
Abstract: Digital humanities institutions are steadily contributing an increasing number of digital documents (either born-digital or digitised). Billions of digital documents are scanned and archived as images, which represent a substantial resource for natural language processing (NLP) tasks. The analysis of digital documents therefore requires text extraction using optical character recognition (OCR) systems. Several studies have shown that named entities (NEs) are widely used to index documents, since they are the first point of entry in a search system for document retrieval. For this reason, NEs can be given a higher semantic value than other words. In order to improve the quality of user searches in a system, it is therefore necessary to ensure the quality of these particular terms. However, most digital documents are indexed through their OCRed version, which includes numerous errors that may hinder access to them. In my talk, I will speak about named entity recognition (NER) and entity linking (EL) in digital text, the impact of OCR errors on NER and EL system performance, as well as existing strategies and solutions for dealing with OCR noise.
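To give a feel for how OCR noise can be studied in isolation (a simplified sketch, not any specific system from the talk), one can inject common OCR-style character confusions into clean text before feeding it to an NER or EL system and measure the degradation:

```python
import random

# Common OCR character confusions (an illustrative subset, not a full noise model)
OCR_CONFUSIONS = {"l": "1", "I": "l", "O": "0", "S": "5", "m": "rn"}

def add_ocr_noise(text, rate=0.3, seed=42):
    """Corrupt a clean string with OCR-style substitutions at the given rate,
    to probe how downstream NER/EL systems degrade on noisy input."""
    rng = random.Random(seed)  # seeded for reproducible experiments
    out = []
    for ch in text:
        if ch in OCR_CONFUSIONS and rng.random() < rate:
            out.append(OCR_CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)

clean = "Illinois Senator Samuel Olson"
noisy = add_ocr_noise(clean)
print(noisy)
```

Running a NER model on both `clean` and `noisy` versions of a corpus then quantifies the impact of OCR errors directly.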
Speaker Bio: Ahmed Hamdi is a lecturer-researcher at the L3i laboratory, La Rochelle University, France. He received his PhD from Aix-Marseille University, France. He is well known for his work in automatic language processing, information extraction and document analysis. He has published in top conferences such as SIGIR and CoNLL. He has been working on different projects related to digital humanities, such as NewsEye (https://www.newseye.eu/), where he used different machine learning techniques to process historic newspapers. More details are available at https://pageperso.univ-lr.fr/ahmed.hamdi/.
Title: The Use of NLP for Data Creation and Analysis in Political Science: Computational Text Analysis using Newspapers and Legislation Documents
Abstract: In recent decades, governments have started to maintain an online presence of their archives and documentation of their proceedings and decisions. Newspapers around the world continue to produce daily textual data. Different groups and individuals are also adopting online platforms such as Twitter, Facebook, and Reddit at a rapid rate, platforms that constantly store data about users’ activities. All of this has led to the availability of extensive text data online that social scientists can use to answer pressing research questions that were previously difficult to approach. In this talk, I speak about the applications of Text as Data in the field of political science. Specifically, I focus on two types of text as data: newspaper articles and US legislation. The talk discusses a recent publication that uses NLP and text analysis on over one million news articles to identify the prevalence of Russian illiberal discourse and its timing relative to German elections. The talk also underlines how NLP and computational text analysis methods are used on US legislation to build a dataset about economic sanctions that improves coverage of US sanction cases over previous datasets.
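A dictionary-based prevalence measure of the kind used in such large-scale news studies can be sketched in a few lines (toy corpus and keywords for illustration; the actual study’s data and method are more involved):

```python
from collections import Counter

def monthly_prevalence(articles, keywords):
    """For each month, compute the share of articles mentioning at least
    one keyword -- a simple dictionary-based prevalence measure."""
    hits, totals = Counter(), Counter()
    kw = [k.lower() for k in keywords]
    for month, text in articles:
        totals[month] += 1
        if any(k in text.lower() for k in kw):
            hits[month] += 1
    return {m: hits[m] / totals[m] for m in totals}

# Toy corpus: (month, article text) pairs -- illustrative only
corpus = [
    ("2017-08", "Critics of liberal democracy gained airtime this week."),
    ("2017-08", "Local football results and fixtures."),
    ("2017-09", "Coverage of the election campaign continued."),
]
print(monthly_prevalence(corpus, ["liberal democracy"]))
```

Aligning such a monthly series with election dates is then a standard time-series comparison.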
Speaker Bio: Ashrakat Elshehawy is a visiting PhD student at Yale University and a doctoral student at the Department of Politics and International Relations at the University of Oxford. Her research interests lie in the field of comparative political economy. Her research draws on questions related to the politics of public service provision and the politics of information. In her recent publications, she has focused on how foreign policy tools, such as economic sanctions, interact with domestic politics and how NLP techniques can be used to analyse them. She has authored several journal publications related to digital humanities, including “SASCAT: Natural Language Processing Approach to the Study of Economic Sanctions” and “Illiberal Communication and Election Intervention during the Refugee Crisis in Germany”. She has also taught several courses at the graduate level on Applied Statistics, Python, and Computational Text Analysis.
Speaker: Dr Antonio Pascucci, L’Orientale University of Naples
Title: Stylistic analysis of a hate speech corpus
Abstract: The hate speech phenomenon is a cybercrime that has been growing in recent years. Hate speech, although a sociological phenomenon, has its full realisation through written and spoken texts. On the internet, and in social networks in particular, people are more likely to adopt aggressive behaviour because of the anonymity provided by these environments (Burnap and Williams, 2015). Social media represent a sort of echo chamber, in which more radical expressions than those of face-to-face interaction are used. The NLP community is at the forefront of developing AI systems for hate speech detection on the web. Despite this, Fortuna and Nunes (2018) emphasise the need to use multi-class approaches (based on different hate speech categories) instead of only binary classification (e.g. hate speech vs. non-hate speech, misogyny vs. non-misogyny, and so on). For this reason, I carried out research on a hate speech corpus in order to investigate stylistic differences between different categories of hate speech (e.g. racism, hate based on religion, LGBTQI+phobia, misogyny). My aim was to show that it is possible to distinguish between i) hate speech and non-hate speech texts and ii) hate speech categories by focusing on haters’ writing style.
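Shallow stylometric features of the kind used to compare writing style across categories can be sketched as follows (an illustrative feature set, not the study’s actual one):

```python
def style_features(text):
    """Extract a few shallow stylometric features of the kind used to
    compare writing style across text categories (illustrative set)."""
    words = text.split()
    n = max(len(words), 1)
    return {
        # Average word length, including attached punctuation
        "avg_word_len": sum(len(w) for w in words) / n,
        # Share of characters that are exclamation marks
        "exclamation_rate": text.count("!") / max(len(text), 1),
        # Share of fully uppercase tokens ("shouting")
        "uppercase_ratio": sum(w.isupper() for w in words) / n,
    }

print(style_features("YOU are ALL wrong!!!"))
```

Feature vectors like these, computed per document, feed a standard multi-class classifier over the hate speech categories.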
Speaker Bio: Antonio Pascucci is a research assistant at the L’Orientale University of Naples and a member of the UNIOR NLP Research Group. He received his PhD from the same university in 2022. He is well known for his work in computational stylometry, hate-speech detection and authorship attribution. He is also a member of COST Action 17124 “Digital Forensics: Evidence Analysis via Intelligent Systems and Practices”. He has published in conferences and workshops such as ACII, CLiC-it and TRAC. Furthermore, he serves on the programme committees of many AI and NLP workshops targeting author profiling in abusive language, such as ResT-UP.
Speaker: Dr Daniel Alcaraz-Carrión, University of Wisconsin-Madison
Title: Multimodal analysis using TV data: new tools for the study of language and gesture
Abstract: In this talk, I will describe some of the data and methods offered by the Red Hen Lab. The first section will be devoted to the NewsScape archive, a television repository with over 400,000 hours of TV news recorded from 2004 until the present. This dataset allows researchers to look up specific linguistic expressions and to obtain all the instances in which they were uttered on TV. The NewsScape library gives access to massive amounts of multimodal data useful for big-data approaches to linguistics, political science and computer science, amongst many other disciplines. To illustrate this, I will present some of my research on temporal and numerical cognition, mixing corpus-based linguistic methods and large-scale gesture analysis.
The second section will focus on new tools for the analysis of visual data and multimodal communication. I will present OpenPose, an open-source Python package that automatically detects body key points, shifting gesture recognition from manual to machine-assisted annotation. Following that, I will introduce the Red Hen Anonymizer, a software tool capable of substituting facial features with computer-generated images while maintaining facial gestures (e.g., lips, eyebrows). I will finish by introducing some of the tools currently being developed in Red Hen, such as a visual lexicon for Aztec hieroglyphs and the integration of Praat for the analysis of acoustic features.
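As a toy example of what machine-assisted keypoint annotation enables (hypothetical coordinate format; real pose estimators such as OpenPose emit many keypoints per frame with confidence scores), gesture movement can be quantified directly from keypoint trajectories:

```python
import math

def wrist_travel(frames):
    """Total distance travelled by a wrist keypoint across video frames --
    the kind of quantity machine-assisted gesture annotation makes cheap
    to compute at scale. `frames` is a list of (x, y) wrist coordinates,
    one per frame (hypothetical simplified format)."""
    return sum(
        math.dist(frames[i], frames[i + 1])
        for i in range(len(frames) - 1)
    )

# Hypothetical wrist positions over four frames
coords = [(0.0, 0.0), (3.0, 4.0), (3.0, 4.0), (6.0, 8.0)]
print(wrist_travel(coords))  # 10.0
```

Aggregated over thousands of clips, such trajectory measures support the large-scale gesture analyses described in the first section.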
Speaker Bio: Dr Daniel Alcaraz-Carrión is a postdoctoral researcher in the Department of Psychology at the University of Wisconsin-Madison. He obtained his PhD in Linguistics and English Language at Lancaster University in 2018. His research uses large multimodal databases to examine different aspects of multimodal communication, including language, gesture and other visual representations. He is particularly interested in how people communicate about highly abstract concepts such as time and number, and how language and gesture can vary cross-linguistically. He is a member and collaborator of several international research groups, including the Red Hen Lab, the Daedalus Lab and the Cognitive Development and Communication Lab.
Speaker: Prof Thomas Mandl, University of Hildesheim
Title: Computer Vision Meets Portrait Research
Abstract: Digital Humanities research focuses on enriching scholarship in the Humanities and Cultural Studies by employing digital methods for collecting, preserving and analysing artefacts. The paradigm of Distant Reading has proven especially productive. Since the Iconic Turn, research with images and visual material has established itself within the Humanities beyond the classic image sciences. For Digital Humanities, the development of appropriate tools and methods for Distant Viewing, the automatic analysis of large amounts of visual data with AI algorithms, is still an emerging research field. In recent years, considerable progress has been made in image processing, especially through so-called Deep Learning approaches. Photographs can now be classified by algorithms that learn not only the classification itself but also which aspects of the pictures need to be analysed. A prototypical system is a Convolutional Neural Network (CNN), which combines many simple neurons as processors into complex architectures. This talk will briefly introduce CNNs. A review of Distant Viewing approaches and systems will be given. Then, the talk will report on experiences from working with image collections in two projects. One concerns a collection of 32,000 early modern portraits; the other deals with collections of pedagogical images, mainly from children’s and youth literature. The goals include print type classification, object detection, publisher similarity and face recognition on portraits. A discussion will introduce the challenges of processing historical data and working with concepts from the humanities.
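The building block of a CNN, the convolution operation, can be demonstrated in plain Python (a toy example with a hypothetical edge-detection kernel, not a full network):

```python
def conv2d(image, kernel):
    """Valid 2-D convolution (strictly, cross-correlation, as in CNN layers):
    slide the kernel over the image and sum elementwise products.
    In a CNN, the kernel weights are learned rather than hand-designed."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# A vertical-edge detector applied to a tiny image with a bright right half
img = [[0, 0, 9, 9]] * 3
edge = [[-1, 1]] * 3  # hypothetical 3x2 vertical-edge kernel
print(conv2d(img, edge))  # responds only at the dark-to-bright boundary
```

A CNN stacks many such learned filters with nonlinearities and pooling, which is what lets it discover which aspects of a picture matter for the classification task.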
Speaker Bio: Thomas Mandl is a Professor of Information Science at the University of Hildesheim. He received his doctorate in information science at the University of Hildesheim in 2000 and was appointed as an extraordinary professor at the same university in 2010. He is well known for his work in human-machine interaction (usability, method research, international aspects) and user-oriented evaluation in information retrieval. He has also been the lead organiser of the HASOC shared task (Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages) since 2019. His most recent work applying computer vision to portrait graphics has created a new direction in digital humanities research.