BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Nikhil Verma
2 min read · Dec 4, 2021

BERT is a language representation model that pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context. This sets it apart from context-free models (word2vec), shallowly bidirectional contextual models (ELMo), and unidirectional contextual models (OpenAI GPT). The motivation behind BERT is to build a model that is pre-trained on a large existing corpus and can then be fine-tuned for many different downstream tasks.
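
As a rough illustration of that pre-train/fine-tune workflow, here is a minimal sketch using the Hugging Face `transformers` library (an assumption on my part; the original paper ships its own TensorFlow implementation). The model name, the two-label sentiment task, and the example sentence are all hypothetical choices for illustration.

```python
# Sketch: load pre-trained BERT weights, then fine-tune a task-specific
# classification head on labelled data (assumes `transformers` and `torch`).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# One labelled example for a hypothetical sentiment task (1 = positive).
inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
labels = torch.tensor([1])

# The forward pass returns the task loss; backpropagating it updates all of
# BERT's parameters, not just the new classification head.
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
```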

BERT consists of stacked Transformer encoders, with a different number of layers for each variant (12 for BERT-Base, 24 for BERT-Large). Because the encoders attend bidirectionally, BERT cannot be pre-trained the way GPT is, with a left-to-right language modelling objective: each word would indirectly "see itself", causing target leakage. Defining a prediction goal when training language models is therefore a challenge, since predicting the next word in the sequence is a directional technique that limits context learning. To overcome this problem, BERT uses the following approaches during pre-training. Pre-training can be understood as the stage where BERT tries to learn "What is language? What is the context?"
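
To make the stacked-encoder idea concrete, the sketch below builds an encoder stack with PyTorch's generic `nn.TransformerEncoder` using BERT-Base-sized hyperparameters (12 layers, hidden size 768, 12 attention heads). This is only an approximation of the real BERT encoder, not the exact implementation: it omits the token, segment, and position embeddings and the original weight initialisation.

```python
# Sketch of a BERT-Base-sized encoder stack (BERT-Large would use
# 24 layers, hidden size 1024 and 16 heads).
import torch
import torch.nn as nn

hidden_size, num_layers, num_heads = 768, 12, 12  # BERT-Base configuration

encoder_layer = nn.TransformerEncoderLayer(
    d_model=hidden_size,
    nhead=num_heads,
    dim_feedforward=4 * hidden_size,  # BERT's feed-forward size is 4x the hidden size
    activation="gelu",
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

# A batch of already-embedded token sequences: (batch, seq_len, hidden).
token_embeddings = torch.randn(2, 128, hidden_size)
contextual_states = encoder(token_embeddings)  # same shape, now contextualised in both directions
print(contextual_states.shape)  # torch.Size([2, 128, 768])
```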

  • Masked Language Modelling (MLM): 15% of the words in each input sequence are masked with the [MASK] token, and the model predicts the ids of these masked words using the surrounding context…
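
As a rough illustration of this masking step, here is a minimal sketch of the 80/10/10 rule from the BERT paper (of the selected 15% of positions, 80% become [MASK], 10% become a random token, 10% are left unchanged), applied to a list of token ids. The helper function and the specific ids are hypothetical; 103 is the [MASK] id and 30522 the vocabulary size of bert-base-uncased.

```python
# Sketch of the MLM masking rule applied to a sequence of token ids.
import random

MASK_ID, VOCAB_SIZE = 103, 30522  # values from the bert-base-uncased vocabulary

def mask_tokens(token_ids, mask_prob=0.15):
    # -100 is the common "ignore this position in the loss" convention.
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                                # model must predict the original id
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                        # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)   # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels
```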

