This work focuses on the task of finding latent vector representations of the words in a corpus. In particular, we address the issue of what to do when there are multiple languages in the corpus. Prior work has, among other techniques, used canonical correlation analysis to project pre-trained vectors in two languages into a common space. We propose a simple and scalable method that is inspired by the notion that the learned vector representations should be invariant to translation between languages. We show empirically that our method outperforms prior work on multilingual tasks, matches the performance of prior work on monolingual tasks, and scales linearly with the size of the input data (and thus the number of languages being embedded).
|Original language||English (US)|
|Title of host publication||Conference Proceedings - EMNLP 2015|
|Subtitle of host publication||Conference on Empirical Methods in Natural Language Processing|
|Publisher||Association for Computational Linguistics (ACL)|
|Number of pages||5|
|State||Published - 2015|
|Event||Conference on Empirical Methods in Natural Language Processing, EMNLP 2015 - Lisbon, Portugal (Sep 17 2015 → Sep 21 2015)|
Bibliographical note
Publisher Copyright: © 2015 Association for Computational Linguistics.