06 December 2015
We present a method that consumes a large corpus of multilingual text and produces a single, unified word embedding in which the word vectors generalize across languages. Our method is agnostic about the languages in which the documents in the corpus are written, and does not rely on parallel corpora to constrain the embedding spaces. Instead, we use a small set of human-provided word translations to artificially induce code switching, allowing words in multiple languages to appear in shared contexts and share distributional information. We evaluate the embeddings on a new multilingual word analogy dataset. We also find that our embeddings allow an NLP model trained in one language to generalize to another, achieving up to 80% of the accuracy of an in-language model.
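The core corpus-augmentation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `induce_code_switching`, the replacement probability `p`, and the toy dictionary are all assumptions. It randomly substitutes words that have a known translation, so that words from both languages co-occur in shared contexts before a standard embedding model (e.g., skip-gram) is trained on the result.

```python
import random

def induce_code_switching(sentences, dictionary, p=0.5, seed=0):
    """Randomly replace words that have an entry in a small bilingual
    dictionary, artificially inducing code switching in the corpus.

    sentences  -- list of tokenized sentences (lists of words)
    dictionary -- word -> translation map (the human-provided seed set)
    p          -- probability of replacing an eligible word
    """
    rng = random.Random(seed)
    switched = []
    for sentence in sentences:
        switched.append([
            dictionary[word] if word in dictionary and rng.random() < p
            else word
            for word in sentence
        ])
    return switched

# Toy example: an English sentence with one dictionary entry.
corpus = [["the", "cat", "sat", "on", "the", "mat"]]
seed_translations = {"cat": "chat"}  # hypothetical English->French pair
print(induce_code_switching(corpus, seed_translations, p=1.0))
```

With `p=1.0` every dictionary word is replaced, so "chat" now appears in the same context window as "sat" and "mat"; a monolingual embedding model trained on such mixed sentences will place "cat" and "chat" near each other without any parallel corpus.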
Venue: NIPS Workshop on Multi-Task and Transfer Learning 2016.
External Link: https://fc3696b9-a-62cb3a1a-s-sites.googlegroups.com/site/tlworkshop2015/Paper_20.pdf?attachauth=ANoY7cp-uNwV7xUFuIW2YO4VvmJqsyowjGUb3elhgJMH6ah6NjlEM-NXdG1J_9zd1VmHswAfw22Iu16KcN-OfNavXCS3vpfp0EsRDfFt_w0UD26rBCv9x-PG_z4KfcEFlR4BEv9NZFZ3CVhFkTT_2QGpv