Jarvis+ builds its intelligent community brain FRIDAY on MASS
Since 2018, pre-training has without a doubt become one of the hottest research topics in Natural Language Processing (NLP).
By leveraging generalized language models such as BERT, GPT, and XLNet, great breakthroughs have been achieved in natural language understanding. However, in sequence-to-sequence language generation tasks, the popular pre-training methods have not delivered significant improvements. Now, researchers from Microsoft Research Asia have introduced MASS, a new pre-training method that achieves better results than BERT and GPT.
MASS: A New General Pre-training Framework
MASS has an important hyperparameter k, the length of the masked fragment. By adjusting k, MASS can incorporate both the masked language modeling of BERT and the standard language modeling of GPT, which makes MASS a general pre-training framework.
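To make the fragment-masking idea concrete, here is a minimal sketch of the scheme described above. The function name mass_mask, the [MASK] placeholder string, and the toy sentence are illustrative assumptions, not the authors' code: a contiguous fragment of length k is masked on the encoder side, and the decoder is asked to predict exactly that fragment, seeing only the fragment itself (shifted right) as its own input.

```python
from typing import List, Tuple

MASK = "[MASK]"

def mass_mask(tokens: List[str], start: int, k: int) -> Tuple[List[str], List[str], List[str]]:
    """Sketch of MASS-style fragment masking (illustrative, not the official implementation).

    Masks k consecutive tokens starting at `start` on the encoder side;
    the decoder must predict that masked fragment, with the fragment
    shifted right as its own input (teacher forcing).
    """
    fragment = tokens[start:start + k]                       # the span the decoder must predict
    encoder_input = tokens[:start] + [MASK] * len(fragment) + tokens[start + k:]
    decoder_input = [MASK] + fragment[:-1]                   # shifted-right fragment
    decoder_target = fragment
    return encoder_input, decoder_input, decoder_target

# Example with k = 3 on an 8-token sentence
tokens = ["the", "cat", "sat", "on", "the", "mat", "all", "day"]
enc, dec_in, dec_out = mass_mask(tokens, start=2, k=3)
print(enc)      # ['the', 'cat', '[MASK]', '[MASK]', '[MASK]', 'mat', 'all', 'day']
print(dec_in)   # ['[MASK]', 'sat', 'on']
print(dec_out)  # ['sat', 'on', 'the']
```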
When k=1, according to the design of MASS, a single token on the encoder side is masked and the decoder predicts this masked token, as shown in Figure 3. The decoder side has no useful input information, so MASS is equivalent to the masked language model in BERT.
Figure 3: k=1. One token on the encoder side is masked; the decoder predicts the masked token.
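Running the sketch above with k=1 makes this reduction concrete: the decoder receives only a mask as input and must recover a single token from the encoder context, which mirrors BERT's masked-token prediction (the sentence and positions are again illustrative).

```python
# k = 1: one encoder token masked, decoder predicts just that token (BERT-like)
enc, dec_in, dec_out = mass_mask(tokens, start=4, k=1)
print(enc)      # ['the', 'cat', 'sat', 'on', '[MASK]', 'mat', 'all', 'day']
print(dec_in)   # ['[MASK]']  -> no useful decoder-side information
print(dec_out)  # ['the']
```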
When k=m (where m is the length of the sequence), all tokens on the encoder side are masked and the decoder predicts all tokens, as shown in Figure 4. Since the decoder cannot extract any information from the encoder side, MASS is equivalent to the standard language model in GPT.
Figure 4: k=m. All tokens on the encoder side are masked; the decoder predicts all tokens, just as in GPT.
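The other extreme, k=m, can be run through the same sketch: the encoder sees only masks and contributes nothing, so the decoder predicts every token from its own left-to-right context, as in GPT's standard language modeling (again an illustrative call, not the authors' code).

```python
# k = m: every encoder token masked, decoder predicts the full sequence (GPT-like)
m = len(tokens)
enc, dec_in, dec_out = mass_mask(tokens, start=0, k=m)
print(enc)      # all '[MASK]' -> the encoder carries no usable information
print(dec_in)   # ['[MASK]', 'the', 'cat', 'sat', 'on', 'the', 'mat', 'all']
print(dec_out)  # ['the', 'cat', 'sat', 'on', 'the', 'mat', 'all', 'day']
```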