Seminar: Taxonomies for semantic tagging: how large do they need to be?

Speaker: Dr Paul Rayson, Lancaster University
Title: Taxonomies for semantic tagging: how large do they need to be?
Date and time: Tuesday Feb 9th, 2pm
Room: MI301, City Campus

Abstract: In this presentation, I will describe joint research carried out in the recently completed Samuels project (www.gla.ac.uk/samuels/), in which we have applied automatic semantic analysis to two very large corpora of around 1-2 billion words each: (a) the Early English Books Online Text Creation Partnership (EEBO-TCP), consisting of over 53,000 transcribed books published between 1473 and 1700, and (b) two hundred years of UK Parliamentary Hansard, made up of over 7 million speeches. We have adopted the Historical Thesaurus of English (HTE) taxonomy (developed at the University of Glasgow over 44 years), which is directly linked to the Oxford English Dictionary, thus helping us to improve methods for the automatic semantic analysis of historical texts. The Historical Thesaurus contains 793,742 word forms arranged into 225,131 semantic categories. In addition, we have assigned a set of around 4,000 thematic codes reduced from the HTE (by Marc Alexander and Christian Kay), and have also used the existing UCREL Semantic Analysis System (USAS), which has 232 tags in its hierarchy. On top of the challenges of historical spelling variation, for which we developed the VARD (Variant Detector) software, the sheer size of the corpora and of the HTE taxonomy poses significant computational challenges, but also provides opportunities for contextual semantic disambiguation. I will focus on our new Historical Thesaurus Semantic Tagger (HTST) and on the effects of the relative sizes of our three taxonomies on tagging accuracy and sense differentiation.
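
To give a rough feel for the kind of taxonomy lookup and contextual disambiguation the abstract refers to, here is a minimal toy sketch in Python. The lexicon, category names, and context-overlap heuristic below are invented for illustration only; they do not reproduce the HTST, USAS, or HTE resources or algorithms.

    # Toy illustration of taxonomy-based semantic tagging with contextual
    # disambiguation. The lexicon, tag codes, and scoring heuristic are
    # invented for illustration; a real thesaurus-based tagger draws on
    # hundreds of thousands of categories rather than a handful.

    from collections import Counter

    # Candidate semantic categories per word form, at a very coarse granularity.
    LEXICON = {
        "bank":    ["MONEY", "LANDSCAPE"],
        "river":   ["LANDSCAPE"],
        "loan":    ["MONEY"],
        "deposit": ["MONEY", "LANDSCAPE"],
        "water":   ["LANDSCAPE"],
    }

    def tag_sentence(tokens, window=3):
        """Assign one category per token, preferring the candidate that is
        best supported by unambiguous neighbours within the window."""
        tags = []
        for i, tok in enumerate(tokens):
            candidates = LEXICON.get(tok.lower(), ["UNMATCHED"])
            if len(candidates) == 1:
                tags.append(candidates[0])
                continue
            # Count categories of unambiguous words in the surrounding context.
            context = Counter()
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j == i:
                    continue
                neighbour = LEXICON.get(tokens[j].lower(), [])
                if len(neighbour) == 1:
                    context[neighbour[0]] += 1
            # Pick the candidate most frequent in context (first one on ties).
            tags.append(max(candidates, key=lambda c: context.get(c, 0)))
        return list(zip(tokens, tags))

    if __name__ == "__main__":
        print(tag_sentence("the bank of the river held little water".split()))
        print(tag_sentence("the bank approved the loan".split()))

The sketch also hints at the size trade-off discussed in the talk: a coarse tag set makes context matches easy to find but blurs senses, while a very fine-grained taxonomy separates senses sharply but makes disambiguation harder and more computationally expensive.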

About the speaker: Dr Rayson is director of the UCREL research centre and a Reader in the School of Computing and Communications, based in the InfoLab21 building at Lancaster University, UK. His research interests centre on applications of corpus-based natural language processing to significant challenges in a range of areas. He is a member of the CREME (Corpus Research in Early Modern English) interdisciplinary research group and of the multidisciplinary centre Security Lancaster.