MacBERT: A Transformer-Based Model for Chinese NLP with Modified Masking Strategy

If you're interested in natural language processing (NLP) or machine learning for languages other than English, you may have heard of BERT (Bidirectional Encoder Representations from Transformers), a model originally developed by Google AI. BERT is a pre-trained NLP model built on the Transformer architecture that set state-of-the-art results on a wide range of NLP tasks. However, BERT was originally pre-trained on English text, and adapting it to other languages, such as Chinese, requires additional work. That's where MacBERT comes in.

What is MacBERT?

MacBERT is a Transformer-based model for Chinese NLP that builds on RoBERTa (a robustly optimized variant of BERT). Specifically, MacBERT alters the masking strategy used during pre-training for the Masked Language Modeling (MLM) task, in which the model learns to predict missing words in a sentence. RoBERTa, like BERT, masks words with the artificial [MASK] token, but this token never appears during fine-tuning, creating a mismatch between the two stages that limits its usefulness. MacBERT instead replaces masked words with similar words, obtained with the Synonyms toolkit, which computes similarity from word2vec embeddings, keeping the pre-training input closer to real text and allowing for more accurate predictions.
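To make the replacement idea concrete, here is a minimal sketch of a similar-word lookup. It assumes the open-source Synonyms Python package and its nearby() function, which ranks candidate words by word2vec similarity; the exact toolkit configuration used in MacBERT's pre-training pipeline may differ.

```python
# Minimal sketch: find a word2vec-similar word to use in place of [MASK].
# Assumes the open-source Synonyms toolkit (pip install synonyms); the exact
# setup used for MacBERT's pre-training may differ.
import synonyms


def similar_word(word: str) -> str:
    """Return the word ranked most similar to `word`, or `word` if none is found."""
    candidates, scores = synonyms.nearby(word)
    # The word itself is usually returned as the top candidate, so filter it out.
    candidates = [w for w in candidates if w != word]
    if candidates:
        return candidates[0]
    # Rare case: no similar word is available; the caller then falls back to a
    # random-word replacement (here we simply return the original word).
    return word


print(similar_word("喜欢"))  # e.g. a near-synonym such as "喜爱"
```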

Modifications to MLM Task

MacBERT shares its pre-training tasks with BERT, but with several modifications. For the MLM task, MacBERT uses whole word masking and n-gram masking to select tokens for masking, choosing word-level unigrams through 4-grams with probabilities of 40%, 30%, 20%, and 10%, respectively. Instead of masking with the [MASK] token, MacBERT substitutes similar words: of the words selected for masking, 80% are replaced by a similar word, 10% by a random word, and the remaining 10% are kept as the original word. When an n-gram is selected, a similar word is found for each token individually, and in the rare case where no similar word exists, random word replacement is used as a fallback; a sketch of this selection and replacement logic follows below.
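The following is an illustrative simplification of that logic, not the original pre-training code: it assumes a pre-segmented sentence (a list of Chinese words) and a similar_word() lookup such as the Synonyms-based one sketched above, and it ignores details like the overall masking budget per sequence.

```python
# Sketch of MacBERT-style n-gram selection and the 80/10/10 replacement rule.
import random

NGRAM_PROBS = [(1, 0.40), (2, 0.30), (3, 0.20), (4, 0.10)]  # unigram .. 4-gram


def sample_ngram_length() -> int:
    """Pick an n-gram length with 40/30/20/10 probabilities."""
    r, cum = random.random(), 0.0
    for n, p in NGRAM_PROBS:
        cum += p
        if r < cum:
            return n
    return 1


def mask_ngram(words, start, n, similar_word):
    """Apply the 80/10/10 rule to each token of the selected n-gram."""
    masked = list(words)
    for i in range(start, min(start + n, len(words))):
        r = random.random()
        if r < 0.8:
            masked[i] = similar_word(words[i])   # similar-word replacement
        elif r < 0.9:
            masked[i] = random.choice(words)     # random-word replacement
        # else: keep the original word unchanged
    return masked


# Toy usage with a placeholder similarity function.
words = ["我", "非常", "喜欢", "这部", "电影"]
n = sample_ngram_length()
start = random.randrange(len(words))
print(mask_ngram(words, start, n, similar_word=lambda w: w + "*"))
```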

Performance and Impact

MacBERT has shown significant improvements on various Chinese NLP tasks, such as named entity recognition and sentiment analysis, compared with RoBERTa and other Chinese pre-trained models. It also achieved state-of-the-art performance on the DuReader dataset, a large-scale Chinese machine reading comprehension benchmark. MacBERT's modified masking strategy yields better predictions and greater robustness during fine-tuning, making it a valuable tool for Chinese NLP researchers and practitioners.

Conclusion

MacBERT is a Transformer-based model for Chinese NLP that modifies RoBERTa's masking strategy to improve performance on a range of NLP tasks. Its use of similar words in place of the [MASK] token allows for more accurate predictions and greater robustness during fine-tuning. MacBERT's state-of-the-art results on various Chinese NLP tasks highlight its potential impact on the field, and its availability to researchers and practitioners should lead to further improvements in Chinese NLP.
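For readers who want to try it, a short usage sketch is shown below. It assumes the publicly released hfl/chinese-macbert-base checkpoint on the Hugging Face Hub and the transformers library; MacBERT is designed as a drop-in replacement for BERT at fine-tuning time, so the checkpoint is typically loaded with the ordinary BERT classes.

```python
# Usage sketch: load a MacBERT checkpoint and encode a Chinese sentence.
# Assumes the hfl/chinese-macbert-base checkpoint on the Hugging Face Hub.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-macbert-base")
model = BertModel.from_pretrained("hfl/chinese-macbert-base")

inputs = tokenizer("我非常喜欢这部电影", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```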
