CMU and Google researchers present XLNet: a new pre-training method for language modeling that outperforms BERT on 20 tasks

Last week, researchers from Carnegie Mellon University (CMU) and Google presented a paper, XLNet: Generalized Autoregressive Pretraining for Language Understanding, which introduces the XLNet model.

https://twitter.com/quocleix/status/1141511813709717504

In the paper, the researchers explain how XLNet uses a permutation language modeling objective to combine the advantages of autoregressive (AR) and autoencoding (AE) methods. They compare XLNet with BERT and report that XLNet surpasses BERT on 20 tasks, including tasks from the RACE, SQuAD and GLUE datasets.

The need for XLNet

Among unsupervised pre-training objectives, autoregressive (AR) language modeling and autoencoding (AE) have been the two most successful. AR language modeling estimates the probability distribution of a text corpus with an autoregressive model. Because such a model is trained to encode only a uni-directional context, it is not effective at modeling deep bidirectional contexts.

Downstream language understanding tasks, however, usually need bidirectional context information, which results in a gap between AR language modeling and effective pretraining. In contrast, AE-based pretraining does not perform explicit density estimation; instead, it aims to reconstruct the original data from corrupted input.

Because density estimation is not part of the objective, BERT, a notable AE-based model, can use bidirectional contexts for reconstruction, which closes the bidirectional information gap in AR language modeling and improves performance.

BERT (Bidirectional Encoder Representations from Transformers) achieves better performance than pretraining approaches based on autoregressive language modeling. However, it relies on corrupting the input with masks, so it neglects the dependency between the masked positions. It also suffers from a pretrain-finetune discrepancy, because the artificial [MASK] symbols used during pretraining are absent from real data at finetuning time.

Considering these pros and cons, the researchers from CMU and Google proposed XLNet, a generalized autoregressive pretraining method that:

(1) enables learning bidirectional contexts by simply maximizing the expected likelihood over all permutations of the factorization order and

(2) overcomes the limitations of BERT because of its autoregressive formulation.
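The idea behind point (1) can be illustrated with a minimal sketch. It assumes a toy sentence with unique tokens and enumerates every factorization order, which is only feasible for tiny inputs; the function names are hypothetical, and training in practice samples orders rather than enumerating them. The sketch shows that a fixed left-to-right order only ever conditions a token on its left context, while across all orders every token is eventually conditioned on every other token.

```python
from itertools import permutations

def left_to_right_contexts(tokens):
    # Fixed forward factorization: each token only ever sees tokens to its left.
    return {tok: set(tokens[:i]) for i, tok in enumerate(tokens)}

def all_order_contexts(tokens):
    # Union, over every factorization order, of the tokens each target
    # is conditioned on (everything that precedes it in that order).
    seen = {tok: set() for tok in tokens}
    for order in permutations(tokens):
        for step, tok in enumerate(order):
            seen[tok].update(order[:step])
    return seen

tokens = ["New", "York", "is", "a", "city"]  # illustrative; assumes unique tokens
forward = left_to_right_contexts(tokens)
permuted = all_order_contexts(tokens)

print(sorted(forward["is"]))   # ['New', 'York']
print(sorted(permuted["is"]))  # ['New', 'York', 'a', 'city']
```

Note that only the factorization order changes here; the tokens keep their original positions in the sentence.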

XLNet also integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. It outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks, including question answering, sentiment analysis, natural language inference, and document ranking.

The researchers observed that naively applying a Transformer(-XL) architecture to permutation-based language modeling does not work, because the factorization order is arbitrary and the prediction target is ambiguous. To solve this, they proposed reparameterizing the Transformer(-XL) network to remove the ambiguity.

https://twitter.com/rsalakhu/status/1141539269565132800?s=19

XLNet comparison with BERT

In comparing the two models, the researchers observed that both BERT and XLNet perform partial prediction, that is, they predict only a subset of the tokens in a sequence. This is a necessity for BERT because, if all the tokens were masked, it would be impossible to make any meaningful predictions.

Partial prediction plays a role in reducing optimization difficulty for both BERT and XLNet by predicting tokens with sufficient context.
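A minimal sketch of partial prediction on the XLNet side, assuming that only the last tokens of a sampled factorization order are used as targets and that a hyperparameter K selects roughly 1/K of the tokens. The function name, the value of K, the example sentence and the rounding rule are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def split_targets(order, K):
    # Split a factorization order into a context part and a target part.
    # About 1/K of the tokens, taken from the END of the order, become targets,
    # so every target is predicted with a reasonably long context before it.
    num_targets = max(1, math.ceil(len(order) / K))
    cut = len(order) - num_targets
    return order[:cut], order[cut:]

tokens = ["New", "York", "is", "a", "city", "in", "the", "US"]  # illustrative
rng = random.Random(0)   # fixed seed so the sketch is repeatable
order = tokens[:]
rng.shuffle(order)       # one sampled factorization order

context, targets = split_targets(order, K=4)  # K=4 is a hypothetical setting
print(len(context), len(targets))  # 6 2
```

Because the targets sit at the end of the order, each one is conditioned on at least the six context tokens in this example.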

XLNet also improves the architectural design for pretraining, which improves performance on tasks involving longer text sequences. Because XLNet does not rely on data corruption, it does not suffer from the pretrain-finetune discrepancy that affects BERT.

The autoregressive objective provides a natural way to use the product rule for factorizing the joint probability of the predicted tokens. This eliminates the independence assumption made in BERT.

XLNet maximizes the expected log likelihood of a sequence with respect to all possible permutations of the factorization order instead of using a fixed forward or backward factorization order.
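Written out, the objective takes the following form. The notation is an assumption borrowed from the paper rather than defined in this article: Z_T is the set of all permutations of the index sequence [1, ..., T], z_t is the t-th element of a permutation z, and z_{<t} denotes its first t-1 elements.

```latex
\max_{\theta} \;
\mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[ \sum_{t=1}^{T} \log p_{\theta}\!\left( x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}} \right) \right]
```

For a fixed forward order z = [1, ..., T], this reduces to the standard left-to-right language modeling objective.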

According to the researchers, given a text sequence x = [x1, ..., xT], “BERT factorizes the joint conditional probability p(x̄ | x̂) based on an independence assumption that all masked tokens x̄ are separately reconstructed.” Here x̂ denotes the corrupted (masked) version of the input. According to the researchers, this independence assumption prevents BERT from modeling the dependency between targets.

The researchers explain the difference between XLNet and BERT with an example, “Let’s consider a concrete example [New, York, is, a, city]. Suppose both BERT and XLNet select the two tokens [New, York] as the prediction targets and maximize log p(New York | is a city).

Also suppose that XLNet samples the factorization order [is, a, city, New, York]. In this case, BERT and XLNet respectively reduce to the following objectives:

J_BERT = log p(New | is a city) + log p(York | is a city),

J_XLNet = log p(New | is a city) + log p(York | New, is a city).

Notice that XLNet is able to capture the dependency between the pair (New, York), which is omitted by BERT.”

In the above example, BERT does learn some dependency pairs, such as (New, city) and (York, city). However, the researchers conclude that XLNet always learns more dependency pairs given the same targets, so its objective contains “denser” effective training signals, which they link to better performance.
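The comparison in this example can be reproduced with a minimal sketch that lists, for each objective, which (target, context token) pairs appear in a conditional. The helper names are hypothetical, and tokens are assumed to be unique so they can stand in for positions.

```python
def bert_terms(targets, visible):
    # BERT: every masked target is predicted independently,
    # conditioned only on the unmasked tokens.
    return [(t, tuple(visible)) for t in targets]

def xlnet_terms(order, targets):
    # XLNet: every target is conditioned on all tokens that precede it
    # in the sampled factorization order, including earlier targets.
    return [(tok, tuple(order[:step]))
            for step, tok in enumerate(order) if tok in targets]

def dependency_pairs(terms):
    # Flatten conditionals into (target, context token) pairs.
    return {(target, ctx) for target, context in terms for ctx in context}

order = ["is", "a", "city", "New", "York"]   # factorization order from the example
targets = ["New", "York"]
visible = [t for t in order if t not in targets]

bert = dependency_pairs(bert_terms(targets, visible))
xlnet = dependency_pairs(xlnet_terms(order, targets))

print(len(bert), len(xlnet))  # 6 7
print(xlnet - bert)           # {('York', 'New')}
```

The only pair XLNet adds here is York conditioned on New, which is the dependency the quoted passage says BERT omits.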

XLNet comparison with language models

According to the researchers, a standard AR language model like GPT (Generative Pre-Training) is only able to cover the dependency (x = York, U = {New}) but not (x = New, U = {York}), where x is the target token and U is the set of context tokens it is conditioned on.

On the other hand, XLNet is able to cover both in expectation over all factorization orders. This limitation of AR language modeling can be a critical issue in real-world applications.

The researchers concluded that a standard AR language model cannot cover such a dependency, in which the context comes after the target, whereas XLNet is able to cover all dependencies in expectation.
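A minimal sketch of this coverage argument, assuming the same toy sentence and a hypothetical helper `covers`. A factorization order covers a dependency when the context token comes before the target in that order. The fixed left-to-right order covers only one direction, while half of all permutations cover the other.

```python
from itertools import permutations

def covers(order, target, context_token):
    # An order covers (target | context_token) when the context token
    # is predicted, and therefore available, before the target.
    return order.index(context_token) < order.index(target)

tokens = ("New", "York", "is", "a", "city")  # illustrative; assumes unique tokens

# Standard left-to-right language model: one fixed order.
ltr_york_given_new = covers(tokens, "York", "New")
ltr_new_given_york = covers(tokens, "New", "York")
print(ltr_york_given_new, ltr_new_given_york)  # True False

# Permutation objective: look at all factorization orders.
orders = list(permutations(tokens))
frac_new_given_york = sum(covers(o, "New", "York") for o in orders) / len(orders)
print(frac_new_given_york)  # 0.5
```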

There has long been a gap between language modeling and pretraining, because standard language models lack the capability of bidirectional context modeling. XLNet generalizes language modeling and bridges this gap.

Implementation and conclusion

The researchers used the BooksCorpus and English Wikipedia as part of their pre-training data, which together contain 13GB of plain text. They experimented on four datasets: RACE, SQuAD, ClueWeb09-B, and GLUE. They further studied three major aspects:

“The effectiveness of the permutation language modeling objective, especially compared to the denoising auto-encoding objective used by BERT.

The importance of using Transformer-XL as the backbone neural architecture and employing segment-level recurrence (i.e. using memory).

The necessity of some implementation details including span-based prediction, the bidirectional input pipeline, and next-sentence prediction.”

The researchers concluded that XLNet is a generalized AR pre-training method that uses a permutation language modeling objective to combine the advantages of AR and AE methods.

According to them, the neural architecture of XLNet is developed to work seamlessly with the AR objective, including the integration of Transformer-XL. XLNet also achieves state-of-the-art results on various tasks, with substantial improvement.

The paper reads, “In the future, we envision applications of XLNet to a wider set of tasks such as vision and reinforcement learning.”

A lot of users seem to be excited about this news and they think it can get even better. One of the users commented on Reddit, “The authors are currently trying to see the text generation capability of XLNet. If they confirm that it's on par with left-to-right model (hence better than BERT), then their work would be even more impressive.”

A few others think it would be better if the researchers used more diverse datasets for their experiments. Another user commented, “The result seems to me as if the substantial improvement in this setting is coming mostly from the use of Transformer-XL (i.e. larger context size). Probably using more data and greater context size (and more diverse dataset) is far more important than doing anything else proposed in the paper.”

Many others are excited about this research and think that XLNet is better than BERT.

https://twitter.com/eturner303/status/1143174828804857856

https://twitter.com/ST4Good/status/1143182779460608001

https://twitter.com/alex_conneau/status/1141489936022953984

To learn more, check out the paper, XLNet: Generalized Autoregressive Pretraining for Language Understanding.
