Minimally Constrained Multilingual Word Embeddings via Artificial Code Switching
06 December 2015
We present a method that consumes a large corpus of multilingual text and produces a single, unified word embedding space in which the word vectors generalize across languages. Our method is agnostic to the languages in which the documents in the corpus are written, and does not rely on parallel corpora to constrain the spaces. Instead, we use a small set of human-provided word translations to artificially induce code switching, allowing words in multiple languages to appear in contexts together and share distributional information. We evaluate the embeddings on a new multilingual word analogy dataset. We also find that our embeddings allow an NLP model trained in one language to generalize to another, achieving up to 80% of the accuracy of an in-language model.
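The core preprocessing idea admits a short sketch. The Python snippet below is a minimal illustration of artificial code switching, not the authors' implementation: the translation table entries and the replacement probability `p` are hypothetical placeholders for the small human-provided dictionary the abstract describes. The switched corpus could then be fed to any standard distributional trainer (e.g., a word2vec-style skip-gram model).

```python
import random

# Toy human-provided translation dictionary (hypothetical entries);
# the method assumes only a small set of such word pairs.
translations = {
    "dog": ["perro", "chien"],
    "house": ["casa", "maison"],
}

def code_switch(tokens, table, p=0.25, rng=random):
    """Randomly replace words with a dictionary translation so that
    words from different languages share contexts during training.
    p is an assumed replacement probability, not a value from the paper."""
    out = []
    for tok in tokens:
        if tok in table and rng.random() < p:
            out.append(rng.choice(table[tok]))
        else:
            out.append(tok)
    return out

# Example: preprocess a (tokenized) corpus once before embedding training.
corpus = [["the", "dog", "sleeps", "in", "the", "house"]]
switched = [code_switch(sent, translations) for sent in corpus]
print(switched)
```

Because replacement is random and per-occurrence, a word and its translations each retain monolingual contexts while also sharing mixed-language contexts, which is what pulls their vectors together in the shared space.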
Venue: NIPS Workshop on Multi-Task and Transfer Learning 2016.