Thursday 19/05/2022 10:00 – 11:30 (UK time)
Speaker: Dr Ahmed Hamdi, La Rochelle University
Title: Content Analysis of Digital Text with Special Focus on Named Entity Recognition and Linking
Abstract: Digital humanities institutions are steadily contributing an increasing number of digital documents, either born-digital or digitised. Billions of documents are scanned and archived as images, which represent a substantial resource for natural language processing (NLP) tasks. Analysing these documents therefore requires text extraction using optical character recognition (OCR) systems. Several studies have shown that named entities (NEs) are widely used to index documents, since they are the first point of entry in a search system for document retrieval. For this reason, NEs can be given a higher semantic value than other words. To improve the quality of user searches in a system, it is thus necessary to ensure the quality of these particular terms. However, most digital documents are indexed through their OCRed version, which includes numerous errors that may hinder access to them. In my talk, I will discuss named entity recognition (NER) and entity linking (EL) for digital text, the impact of OCR errors on the performance of NER and EL systems, and existing strategies and solutions for dealing with OCR noise.
Speaker Bio: Ahmed Hamdi is a lecturer-researcher at the L3i laboratory, La Rochelle University, France. He received his PhD from Aix-Marseille University, France. He is well known for his work in automatic language processing, information extraction and document analysis. He has published in top conferences such as SIGIR and CoNLL. He has worked on several projects related to digital humanities, such as NewsEye (https://www.newseye.eu/), where he used different machine learning techniques to process historical newspapers. More details are available at https://pageperso.univ-lr.fr/ahmed.hamdi/.