Minimally Constrained Multilingual Word Embeddings via Artificial Code Switching

Michael Wick, Pallika Kanani, Adam Pocock

14 February 2016

We present a method that consumes a large corpus of multilingual text and produces a single, unified word embedding in which the word vectors generalize across languages. In contrast to current approaches that require language identification, our method is agnostic about the languages in which the documents in the corpus are written, and does not rely on parallel corpora to constrain the spaces. Instead, we utilize a small set of human-provided word translations, which are often freely and readily available. We can encode such word translations as hard constraints in the model’s objective function; however, we find that we can more naturally constrain the space by allowing words in one language to borrow distributional statistics from context words in another language. We achieve this via a process we term artificial code-switching. As the name suggests, we induce code-switching so that words across multiple languages appear in contexts together. Not only do embedding models trained on code-switched data learn common cross-lingual structure, but this shared structure also allows an NLP model trained in a source language to generalize to multiple target languages (achieving up to 80% of the accuracy of models trained with target-language data).

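The abstract only sketches the mechanism, so here is a minimal, hypothetical Python illustration of artificial code-switching: each token in a monolingual document is replaced, with some probability, by one of its dictionary translations, so that words from different languages end up sharing contexts. The function name, the dictionary format, and the swap_prob value are illustrative assumptions, not details taken from the paper.

```python
import random

def artificial_code_switch(tokens, translation_dict, swap_prob=0.25, seed=None):
    """Randomly replace tokens with dictionary translations so that
    words from multiple languages appear in shared contexts.

    tokens:           list of word strings from one monolingual document
    translation_dict: maps a word to a list of translations in other
                      languages, e.g. {"dog": ["perro", "chien"]}
    swap_prob:        chance of replacing a translatable token
                      (an illustrative value, not from the paper)
    """
    rng = random.Random(seed)
    switched = []
    for tok in tokens:
        candidates = translation_dict.get(tok)
        if candidates and rng.random() < swap_prob:
            # Borrow a context word from another language.
            switched.append(rng.choice(candidates))
        else:
            switched.append(tok)
    return switched

# Example: an English sentence borrows Spanish/French context words.
lexicon = {"dog": ["perro", "chien"], "house": ["casa", "maison"]}
print(artificial_code_switch(
    "the dog sleeps in the house".split(), lexicon, swap_prob=0.5, seed=0))
# A possible output: ['the', 'perro', 'sleeps', 'in', 'the', 'casa']
```

The code-switched text can then be fed unchanged into any standard embedding trainer (such as skip-gram word2vec); no language identification or parallel corpora are needed, which is the point of the method.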

Venue: AAAI 2016

External Link: https://people.cs.umass.edu/~mwick/MikeWeb/Publications_files/wick16minimally.pdf