Facebook MUSE: a Python library for multilingual word embeddings now open sourced!

Facebook has open-sourced MUSE, Multilingual Unsupervised and Supervised Embeddings. It is a Python library to align embedding spaces in a supervised or unsupervised way. The supervised method uses a bilingual dictionary or identical character strings. The unsupervised approach does not use any parallel data. Instead, it builds a bilingual dictionary between two languages by aligning monolingual word embedding spaces in an unsupervised way.

Facebook MUSE provides state-of-the-art multilingual word embeddings for over 30 languages, based on fastText. fastText is a library for efficient learning of word representations and sentence classification. It can train word embeddings with the skip-gram or CBOW (Continuous Bag of Words) models, both introduced with word2vec, and it also supports text classification.

To download the English (en) and Spanish (es) embeddings, run:

# English fastText Wikipedia embeddings
curl -Lo data/wiki.en.vec https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.vec
# Spanish fastText Wikipedia embeddings
curl -Lo data/wiki.es.vec https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.es.vec
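The downloaded .vec files are plain text: a header line with the vocabulary size and the vector dimension, followed by one word and its vector per line. As a minimal sketch of how such a file could be read with NumPy (the load_vec helper and its max_vocab parameter are illustrative names, not part of MUSE or fastText), consider:

```python
import io
import numpy as np

def load_vec(stream, max_vocab=200000):
    """Read embeddings in the fastText text (.vec) format from a text stream.

    Hypothetical helper for illustration; MUSE ships its own loader.
    """
    header = stream.readline().split()
    dim = int(header[1])  # header is "<number of words> <dimension>"
    words, vectors = [], []
    for line in stream:
        if len(words) >= max_vocab:
            break
        token, *values = line.rstrip().split(" ")
        if len(values) != dim:
            continue  # skip malformed rows
        words.append(token)
        vectors.append([float(v) for v in values])
    return words, np.array(vectors, dtype=np.float32)

# A tiny in-memory stand-in for a real file such as data/wiki.en.vec
sample = io.StringIO(
    "3 4\n"
    "the 0.1 0.2 0.3 0.4\n"
    "cat 0.5 0.6 0.7 0.8\n"
    "dog 0.9 1.0 1.1 1.2\n"
)
words, emb = load_vec(sample)
```

For a real file, the same function can be called on open("data/wiki.en.vec", encoding="utf-8"); capping the vocabulary keeps memory use manageable for the full Wikipedia vectors.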

Facebook MUSE also provides 110 large-scale, high-quality ground-truth bilingual dictionaries to ease the development and evaluation of cross-lingual word embeddings and multilingual NLP. These dictionaries were created using an internal translation tool, and they handle the polysemy of words (the coexistence of several possible meanings for one word) well.

As mentioned earlier, MUSE has two ways to obtain cross-lingual word embeddings.

The supervised approach uses a training bilingual dictionary (or identical character strings as anchor points) to learn a mapping from the source to the target space using Procrustes alignment.
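To illustrate the "identical character strings" option, the sketch below pairs up words that are spelled the same in both vocabularies and treats them as anchor points. The function name and the toy vocabularies are assumptions made for this example; MUSE's own implementation may filter or select anchors differently.

```python
def identical_string_pairs(src_words, tgt_words):
    """Return (source index, target index) pairs for words spelled identically.

    Hypothetical helper: a simple way to get anchor points without a dictionary.
    """
    tgt_index = {word: j for j, word in enumerate(tgt_words)}
    return [(i, tgt_index[word]) for i, word in enumerate(src_words) if word in tgt_index]

# Toy vocabularies (illustrative only)
src_words = ["the", "taxi", "hotel", "cat"]
tgt_words = ["el", "hotel", "gato", "taxi"]
anchors = identical_string_pairs(src_words, tgt_words)
```

Here only "taxi" and "hotel" appear in both lists, so they become the anchor pairs.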

To learn a mapping between the source and the target space, simply run:

python supervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_iter 5 --dico_train default
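Procrustes alignment has a closed-form solution: given matched source and target vectors, the best orthogonal mapping comes from a singular value decomposition. The following is a minimal NumPy sketch of that idea on synthetic data, not the code in supervised.py; the procrustes function name and the random test matrices are assumptions made for illustration.

```python
import numpy as np

def procrustes(src, tgt):
    """Return the orthogonal matrix W minimizing ||src @ W.T - tgt|| (Frobenius norm).

    src and tgt are (n_pairs, dim) arrays whose rows are dictionary-matched vectors.
    """
    u, _, vt = np.linalg.svd(tgt.T @ src)
    return u @ vt

# Synthetic check: rotate random "source" vectors with a known orthogonal matrix
rng = np.random.default_rng(0)
dim = 5
src = rng.standard_normal((50, dim))
true_w, _ = np.linalg.qr(rng.standard_normal((dim, dim)))  # random orthogonal matrix
tgt = src @ true_w.T

w = procrustes(src, tgt)  # should recover true_w
```

Because the mapping is constrained to be orthogonal, it preserves distances and angles within the source space, so the monolingual quality of the embeddings is unchanged by the alignment.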

The unsupervised approach learns a mapping from the source to the target space using adversarial training and Procrustes refinement, without any parallel data or anchor points.

To learn a mapping using adversarial training and iterative Procrustes refinement, run:

python unsupervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec
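The refinement stage can be pictured as a loop: use the current mapping to induce a synthetic dictionary, then re-solve Procrustes on those pairs. The sketch below covers only that loop, starting from a slightly perturbed mapping instead of an adversarially trained one. It selects pairs by mutual nearest neighbors under plain cosine similarity, which is a simplifying assumption and may differ from the criterion unsupervised.py uses; all function names are illustrative.

```python
import numpy as np

def normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def mutual_nearest_neighbors(src, tgt, w):
    """Pairs (i, j) where tgt[j] is nearest to mapped src[i] and vice versa."""
    sims = normalize(src @ w.T) @ normalize(tgt).T
    s2t = sims.argmax(axis=1)
    t2s = sims.argmax(axis=0)
    return [(i, int(j)) for i, j in enumerate(s2t) if t2s[j] == i]

def refine(src, tgt, w, n_iter=5):
    """Alternate between dictionary induction and an orthogonal Procrustes solve."""
    for _ in range(n_iter):
        pairs = mutual_nearest_neighbors(src, tgt, w)
        s_idx = [i for i, _ in pairs]
        t_idx = [j for _, j in pairs]
        u, _, vt = np.linalg.svd(tgt[t_idx].T @ src[s_idx])
        w = u @ vt
    return w

# Synthetic data: the target space is an exact rotation of the source space
rng = np.random.default_rng(0)
dim, n = 20, 50
src = rng.standard_normal((n, dim))
true_w, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
tgt = src @ true_w.T

# Stand-in for the adversarial result: a perturbed, re-orthogonalized mapping
u0, _, vt0 = np.linalg.svd(true_w + 0.01 * rng.standard_normal((dim, dim)))
w_refined = refine(src, tgt, u0 @ vt0)
```

The design point is that no bilingual dictionary is supplied at any step; the pairs used by Procrustes are induced from the embeddings themselves.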

Facebook MUSE also has a simple script to evaluate the quality of monolingual or cross-lingual word embeddings on several tasks:

Monolingual

python evaluate.py --src_lang en --src_emb data/wiki.en.vec --max_vocab 200000

Cross-lingual

python evaluate.py --src_lang en --tgt_lang es --src_emb data/wiki.en-es.en.vec --tgt_emb data/wiki.en-es.es.vec --max_vocab 200000
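One common way to score cross-lingual embeddings is word translation accuracy: for each source word, retrieve the nearest target word and check it against a ground-truth dictionary. The sketch below computes precision at 1 with cosine similarity on hand-made toy vectors. It is an illustration of the idea, not a description of what evaluate.py does internally; the function name, the vectors, and the small dictionary are all assumptions.

```python
import numpy as np

def precision_at_1(src_words, src_emb, tgt_words, tgt_emb, gold):
    """Fraction of source words whose nearest target word is an accepted translation.

    gold maps a source word to a set of valid translations, so polysemous
    words can have more than one correct answer.
    """
    src_n = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt_n = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    index = {word: i for i, word in enumerate(src_words)}
    queries = [word for word in gold if word in index]
    hits = 0
    for word in queries:
        best = int((tgt_n @ src_n[index[word]]).argmax())
        hits += tgt_words[best] in gold[word]
    return hits / len(queries)

# Toy aligned embeddings (illustrative values only)
src_words = ["cat", "dog", "house", "home"]
src_emb = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [0.2, 1.0, 0.0]])
tgt_words = ["gato", "perro", "casa"]
tgt_emb = np.array([[0.9, 0.1, 0.0],
                    [0.1, 0.9, 0.0],
                    [0.0, 0.8, 0.2]])
gold = {"cat": {"gato"}, "dog": {"perro"}, "house": {"casa"}, "home": {"casa", "hogar"}}

score = precision_at_1(src_words, src_emb, tgt_words, tgt_emb, gold)
```

In this toy setup "home" is retrieved as "perro", so three of the four queries are correct and the score is 0.75.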

To learn more about what this library can do and to download other resources, visit the official MUSE GitHub repository.

Sugandha Lahoti

Content Marketing Editor at Packt Hub. I blog about new and upcoming tech trends ranging from Data science, Web development, Programming, Cloud & Networking, IoT, Security and Game development.
