Yesterday, the team at Facebook AI Research announced that they have expanded and enhanced their LASER (Language-Agnostic SEntence Representations) toolkit to work with more than 90 languages, written in 28 different alphabets. This accelerates the transfer of natural language processing (NLP) applications to many more languages. The team is now open-sourcing LASER, making it the first exploration of multilingual sentence representations to be shared openly. Currently, 93 languages have been incorporated into LASER.
LASER achieves these results by embedding all languages together in a single shared space. The team is also making the multilingual encoder and PyTorch code freely available, along with a multilingual test set for more than 100 languages.
The Facebook post reads, “The 93 languages incorporated into LASER include languages with subject-verb-object (SVO) order (e.g., English), SOV order (e.g., Bengali and Turkic), VSO order (e.g., Tagalog and Berber), and even VOS order (e.g., Malagasy).”
Features of LASER
- Enables zero-shot transfer of NLP models from one language, such as English, to scores of others including languages where training data is limited.
- Handles low-resource languages and dialects.
- Delivers strong accuracy for 13 out of the 14 languages in the XNLI corpus, as well as strong results in cross-lingual document classification (MLDoc corpus).
- LASER’s sentence embeddings are strong at parallel corpus mining, establishing a new state of the art in the shared task of the 2018 BUCC workshop on Building and Using Comparable Corpora for three of its four language pairs.
- Delivers fast performance, processing up to 2,000 sentences per second on GPU.
- PyTorch has been used to implement the sentence encoder with minimal external dependencies.
- LASER supports the use of multiple languages in one sentence.
- LASER’s performance improves as new languages are added, with the system learning to recognize the characteristics of language families.
LASER maps a sentence in any language to a point in a high-dimensional space such that the same sentence in any language will end up in the same neighborhood. This representation can be seen as a kind of universal language in a semantic vector space. The Facebook post reads, “We have observed that the distance in that space correlates very well to the semantic closeness of the sentences.” The sentence embeddings are used for initializing the decoder LSTM through a linear transformation and are also concatenated to its input embeddings at every time step.
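The “same neighborhood” idea can be sketched with cosine similarity between sentence vectors. The vectors below are random stand-ins; in practice each 1,024-dimension embedding would come from the shared LASER encoder:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two sentence embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for LASER embeddings: real 1,024-dimension vectors would
# come from the shared encoder, one per sentence.
rng = np.random.default_rng(0)
emb_en = rng.normal(size=1024)                       # a sentence in English
emb_fr = emb_en + rng.normal(scale=0.1, size=1024)   # the same sentence in French
emb_other = rng.normal(size=1024)                    # an unrelated sentence

print(cosine_similarity(emb_en, emb_fr))     # translations: high similarity
print(cosine_similarity(emb_en, emb_other))  # unrelated: near zero
```

Because distance correlates with semantic closeness, nearest-neighbor search in this space finds translations without any language-specific machinery.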
The encoder/decoder approach
The approach behind this project is based on neural machine translation: an encoder/decoder approach, also known as sequence-to-sequence processing. LASER uses one shared encoder for all input languages and a shared decoder for generating the output language.
LASER uses a 1,024-dimension fixed-size vector to represent the input sentence. The decoder is told which language to generate. Because the encoder receives no explicit signal indicating the input language, this method encourages it to learn language-independent representations.
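A minimal PyTorch sketch of such an encoder, assuming a BiLSTM whose hidden states are max-pooled into one fixed-size vector (the vocabulary size, embedding dimension, and layer count here are illustrative, not LASER’s actual configuration):

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Toy LASER-style encoder: a bidirectional LSTM whose hidden states
    are max-pooled over time into one fixed-size sentence vector."""
    def __init__(self, vocab_size=1000, embed_dim=320, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional, so concatenated states are 2 * 512 = 1,024-dim.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True,
                            batch_first=True)

    def forward(self, token_ids):
        states, _ = self.lstm(self.embed(token_ids))  # (batch, seq, 1024)
        # Max-pool over the time dimension -> fixed-size sentence vector.
        return states.max(dim=1).values               # (batch, 1024)

encoder = SentenceEncoder()
tokens = torch.randint(0, 1000, (2, 7))  # batch of 2 sentences, 7 tokens each
embeddings = encoder(tokens)
print(embeddings.shape)                  # torch.Size([2, 1024])
```

Note that the encoder never sees a language ID; only the decoder is given a token specifying the output language, which is what pushes the encoder toward language-independent representations.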
The team at Facebook AI Research has trained the system on 223 million sentences of public parallel data, aligned with either English or Spanish. By using a shared BPE vocabulary trained on the concatenation of all languages, low-resource languages are able to benefit from high-resource languages of the same family.
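The shared byte-pair-encoding (BPE) vocabulary works by repeatedly merging the most frequent pair of adjacent symbols across the whole training corpus, so related languages end up sharing subword units. A toy sketch of that merge-learning loop (the corpus and number of merges are illustrative):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word starting as a tuple of characters.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(3):                         # learn three merges
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged:", pair)
```

In LASER’s setting the corpus is the concatenation of all 93 languages, so a merge learned mostly from a high-resource language is immediately available to its low-resource relatives.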
Zero-shot, cross-lingual natural language inference
LASER achieves excellent results in cross-lingual natural language inference (NLI). Facebook’s AI research team considers the zero-shot setting: the NLI classifier is trained on English and then applied to all target languages with no fine-tuning or target-language resources.
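Zero-shot transfer works because the classifier only ever sees points in the shared embedding space, never the language itself. A minimal sketch with synthetic stand-in embeddings and a nearest-centroid classifier (LASER’s actual classifier differs; the dimensions, class names, and data here are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for LASER embeddings: two NLI-style classes separated along
# synthetic directions. In practice each vector would be a sentence-pair
# feature built from 1,024-dim encoder outputs.
center = {"entailment": rng.normal(size=32), "contradiction": rng.normal(size=32)}

def sample(label, n):
    return center[label] + 0.3 * rng.normal(size=(n, 32))

# "Train" on English only: a nearest-centroid classifier.
train_en = {lab: sample(lab, 50) for lab in center}
centroids = {lab: x.mean(axis=0) for lab, x in train_en.items()}

def classify(x):
    return min(centroids, key=lambda lab: np.linalg.norm(x - centroids[lab]))

# Zero-shot "transfer": because the embedding space is shared, points for
# another language (simulated here with fresh samples) need no retraining.
test_other = sample("entailment", 1)[0]
print(classify(test_other))   # -> "entailment"
```

No target-language labels, fine-tuning, or resources are involved; the only requirement is that translations of a sentence land near each other in the shared space.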
For mining parallel sentences, the distances between all sentence pairs are calculated and the closest ones are selected. For greater precision, the margin between the closest sentence and the other nearest neighbors is considered. This search is performed using Facebook’s FAISS library.
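A brute-force sketch of that margin criterion, assuming a ratio-style score where each pair’s cosine similarity is divided by the average similarity to its k nearest neighbors on both sides (FAISS performs the same neighbor search efficiently at scale; the data here is synthetic):

```python
import numpy as np

def cosine_matrix(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def margin_scores(src, tgt, k=4):
    """Ratio-margin scoring for parallel-sentence mining: each candidate
    pair's cosine similarity is divided by the average similarity to the
    k nearest neighbors on both sides."""
    sim = cosine_matrix(src, tgt)
    # Average similarity to each row's / each column's k nearest neighbors.
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)
    return sim / ((knn_src[:, None] + knn_tgt[None, :]) / 2)

rng = np.random.default_rng(2)
src = rng.normal(size=(5, 16))
tgt = src + 0.05 * rng.normal(size=(5, 16))   # tgt[i] "translates" src[i]
best = margin_scores(src, tgt).argmax(axis=1)
print(best)   # each source sentence matched to its translation
```

Dividing by the local neighborhood density is what gives the extra precision: a pair only scores highly if it is close in absolute terms *and* markedly closer than either sentence’s other neighbors.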
The team outperformed the state of the art on the shared BUCC task by a large margin. The team improved the F1 score from 85.5 to 96.2 for German/English, from 81.5 to 93.9 for French/English, from 81.3 to 93.3 for Russian/English, and from 77.5 to 92.3 for Chinese/English.
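For reference, F1 is the harmonic mean of precision and recall over the mined sentence pairs. The precision/recall pair below is purely illustrative; the article reports only the final F1 scores:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Illustrative only: an F1 of ~96.2 could arise from, e.g.,
# precision 97.0 and recall 95.4.
print(round(100 * f1(0.97, 0.954), 1))   # -> 96.2
```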
To know more about LASER, check out the official post by Facebook.