06 December 2015
We present a method that consumes a large corpus of multilingual text and produces a single, unified word embedding in which the word vectors generalize across languages. Our method is agnostic about the languages in which the documents in the corpus are written, and does not rely on parallel corpora to constrain the embedding spaces. Instead, we use a small set of human-provided word translations to artificially induce code switching, allowing words in multiple languages to appear in shared contexts and share distributional information. We evaluate the embeddings on a new multilingual word analogy dataset. We also find that our embeddings allow an NLP model trained in one language to generalize to another, achieving up to 80% of the accuracy of an in-language model.
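The core corpus-augmentation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `induce_code_switching`, the replacement probability `p`, and the toy dictionary are all assumptions. It randomly substitutes words that have a known translation, so that words from both languages co-occur in shared contexts before a standard embedding model (e.g., skip-gram) is trained on the result.

```python
import random

def induce_code_switching(sentences, dictionary, p=0.5, seed=0):
    """Randomly replace words that have an entry in a small bilingual
    dictionary, artificially inducing code switching in the corpus.

    sentences  -- list of tokenized sentences (lists of words)
    dictionary -- word -> translation map (the human-provided seed set)
    p          -- probability of replacing an eligible word
    """
    rng = random.Random(seed)
    switched = []
    for sentence in sentences:
        switched.append([
            dictionary[word] if word in dictionary and rng.random() < p
            else word
            for word in sentence
        ])
    return switched

# Toy example: an English sentence with one dictionary entry.
corpus = [["the", "cat", "sat", "on", "the", "mat"]]
seed_translations = {"cat": "chat"}  # hypothetical English->French pair
print(induce_code_switching(corpus, seed_translations, p=1.0))
```

With `p=1.0` every dictionary word is replaced, so "chat" now appears in the same context window as "sat" and "mat"; a monolingual embedding model trained on such mixed sentences will place "cat" and "chat" near each other without any parallel corpus.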
Venue: NIPS Workshop on Multi-Task and Transfer Learning 2016.
External Link: https://fc3696b9-a-62cb3a1a-s-sites.googlegroups.com/site/tlworkshop2015/Paper_20.pdf?attachauth=ANoY7cp-uNwV7xUFuIW2YO4VvmJqsyowjGUb3elhgJMH6ah6NjlEM-NXdG1J_9zd1VmHswAfw22Iu16KcN-OfNavXCS3vpfp0EsRDfFt_w0UD26rBCv9x-PG_z4KfcEFlR4BEv9NZFZ3CVhFkTT_2QGpv