Machine Learning/Deep Learning

Dr Burcu Can, University of Wolverhampton

How to Represent Words?

4 December 2020

Abstract:

Agglutinating languages build words from sequences of morphemes. Although this morphemic structure enables productive word formation that encodes both syntactic and semantic information, that same productivity causes sparsity in the language, which poses one of the most challenging problems in natural language processing.

The sparsity problem persists even with the rise of representation learning, which lets us represent each word in a low-dimensional space using its distributional features in a large corpus. But if a word is absent from the corpus, or not frequent enough, how should we represent it in the same space? Most recent work handles this problem by processing each word as a sequence of characters, so that the word's representation is built from its characters. Here I will describe our recent model, morph2vec, and ask whether a word should instead be represented by its morphemes rather than its characters. How should we represent words in agglutinating languages?
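As a rough illustration of the morpheme-based composition the abstract alludes to, here is a minimal sketch. This is not the morph2vec model itself (which learns both the morpheme representations and how to compose them); the embedding table, dimensionality, and segmentation below are all hypothetical. The point is only that a word unseen in training remains representable as long as its morphemes have been observed, which is how subword composition mitigates sparsity.

```python
import numpy as np

# Hypothetical morpheme embedding table; in practice these vectors
# would be learned from a large corpus rather than sampled at random.
dim = 4
rng = np.random.default_rng(0)
morpheme_vecs = {
    "ev": rng.random(dim),    # Turkish root: "house"
    "ler": rng.random(dim),   # plural suffix
    "im": rng.random(dim),    # 1st-person possessive suffix
    "de": rng.random(dim),    # locative case suffix
}

def compose(morphemes):
    """Compose a word vector from its morpheme vectors (mean composition).

    A learned model would use a trained composition function instead of
    a simple average; the mean is used here only to keep the sketch short.
    """
    vecs = [morpheme_vecs[m] for m in morphemes if m in morpheme_vecs]
    if not vecs:
        return np.zeros(dim)  # back off for a fully unknown segmentation
    return np.mean(vecs, axis=0)

# "evlerimde" = ev+ler+im+de ("in my houses"): even if this surface form
# never occurred in the corpus, its vector can be composed from morphemes.
print(compose(["ev", "ler", "im", "de"]))
```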