Dr. Yuval Pinter, Ben-Gurion University of the Negev, Israel
Challenging and Adapting NLP Models to Lexical Phenomena
12 October 2021
Over the last few years, deep neural models have taken over the field of natural language processing (NLP), delivering great improvements on many of its sequence-level tasks. But the end-to-end nature of these models makes it hard to determine whether the way they represent individual words aligns with how language builds itself from the bottom up, or how lexical variation across register and domain can affect the untested aspects of such representations.
In this talk, I will present NYTWIT, a dataset created to challenge large language models at the lexical level, tasking them with identifying the processes that lead to the formation of novel English words, as well as with segmenting novel blends and recovering their source words. I will then present XRayEmb, a method which alleviates the difficulty of processing these novelties by fitting a character-level encoder to existing models’ subword tokenizers; and I will conclude with a discussion of the drawbacks of current tokenizers’ vocabulary creation schemes.
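To illustrate the lexical challenge the abstract describes, the sketch below implements a toy greedy longest-match subword segmenter (WordPiece-style) with an invented miniature vocabulary — it is not the tokenizer of any actual model, nor the XRayEmb method itself. A word already covered by the vocabulary segments into meaningful pieces, while a novel blend of "covid" and "idiot" shatters into fragments that carry none of the blend's source-word structure:

```python
def tokenize(word, vocab):
    """Greedy longest-match subword segmentation (WordPiece-style toy)."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking until a match.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No piece matches: emit an unknown marker for one character.
            tokens.append("[UNK]")
            i += 1
    return tokens

# Hypothetical miniature vocabulary; real subword vocabularies hold tens
# of thousands of pieces but still lack newly coined words.
vocab = {"cover", "age", "co", "vi", "di", "ot", "c", "o", "v", "i", "d", "t"}

print(tokenize("coverage", vocab))  # ['cover', 'age'] — clean segmentation
print(tokenize("covidiot", vocab))  # ['co', 'vi', 'di', 'ot'] — the blend fragments
```

Neither segment of the novel blend recovers "covid" or "idiot", which is the kind of mismatch between subword units and lexical structure that a character-level encoder can sidestep.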
Yuval Pinter is a Senior Lecturer in the Department of Computer Science at Ben-Gurion University of the Negev, focusing on NLP. Yuval received his PhD from the Georgia Institute of Technology's School of Interactive Computing as a Bloomberg Data Science PhD Fellow. Before that, he worked as a Research Engineer at Yahoo Labs and as a Computational Linguist at Ginger Software, and obtained an MA in Linguistics and a BSc in CS and Mathematics, both from Tel Aviv University. Yuval blogs (in Hebrew) about language matters on Dagesh Kal.