Yesterday, Facebook’s AI research team announced that it has expanded and enhanced its LASER (Language-Agnostic SEntence Representations) toolkit to work with more than 90 languages, written in 28 different alphabets. The aim is to accelerate the transfer of natural language processing (NLP) applications to many more languages. The team is now open-sourcing LASER, presenting it as the first publicly shared exploration of massively multilingual sentence representations. Currently, 93 languages have been incorporated into LASER.
LASER achieves these results by embedding all languages jointly in a single shared space. The team is also making the multilingual encoder and PyTorch code freely available and providing a multilingual test set for more than 100 languages.
The Facebook post reads, “The 93 languages incorporated into LASER include languages with subject-verb-object (SVO) order (e.g., English), SOV order (e.g., Bengali and Turkic), VSO order (e.g., Tagalog and Berber), and even VOS order (e.g., Malagasy).”
LASER maps a sentence in any language to a point in a high-dimensional space such that the same sentence in any language ends up in the same neighborhood. This representation can be seen as a kind of universal language in a semantic vector space. The Facebook post reads, “We have observed that the distance in that space correlates very well to the semantic closeness of the sentences.” The sentence embeddings are used to initialize the decoder LSTM through a linear transformation and are also concatenated to its input embeddings at every time step.
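To make the "same neighborhood" idea concrete, here is a minimal sketch that looks up the nearest sentence in a shared space using cosine similarity. The 4-dimensional vectors are invented for illustration and do not come from LASER, and the helper names `cosine_similarity` and `nearest` are hypothetical. Cosine similarity is also an assumption here, since the article does not say which distance measure is used.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings keyed by (language, sentence).
# These numbers are made up; they are not real LASER outputs.
toy_embeddings = {
    ("en", "The cat sleeps."): [0.9, 0.1, 0.0, 0.2],
    ("fr", "Le chat dort."): [0.8, 0.2, 0.1, 0.2],
    ("en", "Stocks fell sharply."): [0.0, 0.9, 0.7, 0.1],
}

def nearest(query_key, embeddings):
    """Return the key of the sentence most similar to the query (excluding itself)."""
    query = embeddings[query_key]
    candidates = [
        (cosine_similarity(query, vector), key)
        for key, vector in embeddings.items()
        if key != query_key
    ]
    return max(candidates)[1]

closest = nearest(("en", "The cat sleeps."), toy_embeddings)
print(closest)
```

In this toy setup the English sentence about the cat lands next to its French translation rather than next to the unrelated English sentence, which is the behavior the quoted observation describes.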
The approach behind this project is based on neural machine translation, an encoder/decoder approach which is also known as sequence-to-sequence processing. LASER uses one shared encoder for all input languages and a shared decoder for generating the output language.
LASER uses a 1,024-dimension fixed-size vector for representing the input sentence. The decoder is instructed about which language needs to be generated. As the encoder has no explicit signal for indicating the input language, this method encourages it to learn language-independent representations.
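The data flow described above can be sketched in plain Python. This is a shape-level illustration only, not the LASER implementation. The toy dimensions, the random weights, and the helpers `linear` and `decoder_inputs` are all hypothetical. Passing the target language as an extra embedding concatenated to each decoder input is also an assumption, since the article only says the decoder is "instructed" about the language.

```python
import random

def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def decoder_inputs(sentence_emb, target_word_embs, lang_emb):
    """Build one decoder input per time step: [word ; sentence ; language]."""
    return [word + sentence_emb + lang_emb for word in target_word_embs]

# Toy sizes (hypothetical); the real sentence vector has 1,024 dimensions.
SENT_DIM, HIDDEN_DIM, WORD_DIM, LANG_DIM = 6, 4, 3, 2
rng = random.Random(0)

# Fixed-size sentence vector produced by the shared encoder (random stand-in).
sentence_emb = [rng.uniform(-1, 1) for _ in range(SENT_DIM)]

# Linear transformation that maps the sentence vector to the decoder's initial state.
W = [[rng.uniform(-1, 1) for _ in range(SENT_DIM)] for _ in range(HIDDEN_DIM)]
b = [0.0] * HIDDEN_DIM
initial_hidden = linear(W, b, sentence_emb)

# Five target-side word embeddings and a language code for the output language.
target_word_embs = [[rng.uniform(-1, 1) for _ in range(WORD_DIM)] for _ in range(5)]
lang_emb = [1.0, 0.0]

steps = decoder_inputs(sentence_emb, target_word_embs, lang_emb)
print(len(initial_hidden), len(steps), len(steps[0]))
```

Note that only the decoder side receives `lang_emb`. Nothing about the input language is passed to the encoder, which matches the article's point that the encoder is pushed toward language-independent representations.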
The team at Facebook AI Research trained its system on 223 million sentences of public parallel data, aligned with either English or Spanish. Because a shared BPE (byte-pair encoding) vocabulary was trained on the concatenation of all languages, low-resource languages could benefit from high-resource languages of the same family.
LASER achieves excellent results in cross-lingual natural language inference (NLI). The team considers the zero-shot setting: it trains the NLI classifier on English and then applies it to all target languages with no fine-tuning or target-language resources.
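To illustrate what a shared BPE vocabulary means, the following is a minimal sketch of learning byte-pair-encoding merges over the concatenation of two word lists. It follows the standard BPE merge procedure rather than LASER's actual preprocessing. The function name `learn_bpe`, the end-of-word marker `</w>`, and the toy word lists are assumptions made for this example.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merge rules from a list of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Toy stand-ins for a high-resource and a low-resource language (hypothetical data).
high_resource_words = ["nacional", "nacional", "internacional", "racional"]
low_resource_words = ["nacional"]

# One vocabulary learned on the concatenation of both word lists.
shared_merges = learn_bpe(high_resource_words + low_resource_words, num_merges=10)
print(shared_merges)
```

Because the merges are learned on the combined data, subword units that are frequent in the larger corpus are also available for segmenting words in the smaller one.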
LASER can also be used to mine parallel sentences from large collections of monolingual text. The distances between all sentence pairs are calculated and the closest ones are selected. For more precision, the margin between the closest sentence and the other nearest neighbors is considered. This search is performed using Facebook’s FAISS library.
The team outperformed the state of the art on the BUCC shared task by a large margin. The team improved the F1 score from 85.5 to 96.2 for German/English, from 81.5 to 93.9 for French/English, from 81.3 to 93.3 for Russian/English, and from 77.5 to 92.3 for Chinese/English.
To learn more about LASER, check out the official post by Facebook.
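The selection step can be sketched as a brute-force search in plain Python. This is an illustration under stated assumptions, not the method used by the team. It does not use FAISS, the margin is computed as the best similarity minus the mean similarity of the next `k` neighbors (the article does not give the exact formula), and the names `mine_pairs`, `k`, and `threshold` as well as the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mine_pairs(src_embs, tgt_embs, k=2, threshold=0.1):
    """Pair each source vector with its closest target, keeping the pair only
    if the best match stands out from the next k nearest neighbors."""
    pairs = []
    for i, src in enumerate(src_embs):
        # Compare against every target (a library such as FAISS would speed this up).
        sims = sorted(((cosine(src, tgt), j) for j, tgt in enumerate(tgt_embs)), reverse=True)
        best_sim, best_j = sims[0]
        others = [sim for sim, _ in sims[1:k + 1]]
        if not others:
            continue  # no neighbors to compare against, so no margin can be computed
        margin = best_sim - sum(others) / len(others)
        if margin >= threshold:
            pairs.append((i, best_j, margin))
    return pairs

# Toy 3-dimensional embeddings (made up for illustration).
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
tgt = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.6, 0.6, 0.5]]

for i, j, margin in mine_pairs(src, tgt):
    print(f"source {i} <-> target {j} (margin {margin:.3f})")
```

In this toy data the third source vector is almost equally close to all three targets, so its margin falls below the threshold and no pair is emitted for it. That is the kind of ambiguous match the margin criterion is meant to filter out.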