Components of Transformer Architecture
Sequence modelling is popularly done with Recurrent Neural Networks (RNNs) or their extensions such as gated RNNs and Long Short-Term Memory (LSTM) networks. Processing tokens one after another hinders parallelisation, and when sequences grow long the model can forget long-range dependencies in the input or mix up positional content.
The attention mechanism addresses this by modelling dependencies without regard to their distance in the input or output sequences, so long-range relationships are no longer forgotten. The Transformer, the neural sequence transduction model introduced in the paper "Attention Is All You Need", is built entirely on self-attention, without any sequence-aligned recurrent architecture.
The key components covered in this article are:
- Scaled dot-product attention (a minimal sketch follows this list)
- Multi-head attention
- Positional encoding
- Encoder-decoder architecture
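
To make the first component concrete, here is a minimal NumPy sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. The function names and toy shapes are illustrative choices rather than code from the paper; in a real model Q, K and V come from learned linear projections of the token embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    # Q: (..., seq_q, d_k), K: (..., seq_k, d_k), V: (..., seq_k, d_v)
    # mask: optional boolean array broadcastable to (..., seq_q, seq_k);
    #       True marks positions that must not be attended to.
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (..., seq_q, seq_k)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)       # block masked positions
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ V, weights

# Toy self-attention over 4 tokens with model dimension 8 (Q = K = V = x).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, weights = scaled_dot_product_attention(x, x, x)
print(out.shape, weights.shape)  # (4, 8) (4, 4)
```

The division by √d_k keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with vanishingly small gradients.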
The encoder and decoder are each stacks of 6 identical layers, where every layer has a multi-head self-attention sublayer followed by a simple position-wise fully connected feed-forward network. Each sublayer is wrapped in a residual connection followed by layer normalisation. The decoder has an additional first multi-head attention sublayer that is masked so it cannot attend to subsequent positions, since we don't want to look into the future of the target sequence when predicting.
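
A rough sketch of the two ideas in this paragraph, reusing scaled_dot_product_attention and rng from the snippet above: a causal (look-ahead) mask for the decoder's first attention sublayer, and the residual-plus-layer-normalisation wrapper applied around every sublayer. The learnable gain and bias of layer normalisation and the learned projections are omitted to keep the sketch short; this is illustrative, not the paper's reference implementation.

```python
def causal_mask(seq_len):
    # True above the diagonal: position i must not attend to positions j > i.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def layer_norm(x, eps=1e-6):
    # Per-position layer normalisation (learnable gain/bias omitted).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer(x, fn):
    # Residual connection around a sublayer, then layer normalisation:
    # LayerNorm(x + Sublayer(x)).
    return layer_norm(x + fn(x))

# Masked self-attention over a 4-token target sequence: each position only
# attends to itself and earlier positions.
y = rng.normal(size=(4, 8))
masked = sublayer(y, lambda t: scaled_dot_product_attention(
    t, t, t, mask=causal_mask(t.shape[0]))[0])
print(masked.shape)  # (4, 8)
```

During training this masking lets the decoder process the whole target sequence in parallel while still behaving autoregressively at each position.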