Components of Transformer Architecture

Sequence modelling has traditionally been done with Recurrent Neural Networks (RNNs) or their refinements, such as gated RNNs and Long Short-Term Memory (LSTM) networks. Processing tokens one at a time prevents parallelisation, and when sequences are very long the model may forget long-range dependencies in the input or conflate positional information.
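The sequential bottleneck can be seen directly in the recurrence itself. Below is a minimal NumPy sketch of a vanilla RNN step (all names, sizes, and initialisations are illustrative, not from any particular library): each hidden state depends on the previous one, so the time loop cannot be parallelised the way the Transformer's attention over a whole sequence can.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, T = 4, 8, 10          # input size, hidden size, sequence length (toy values)

W_xh = rng.normal(scale=0.1, size=(d_h, d_in))  # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))   # hidden-to-hidden weights
b_h = np.zeros(d_h)

x = rng.normal(size=(T, d_in))   # a toy input sequence of T steps
h = np.zeros(d_h)                # initial hidden state

# Strictly sequential: step t needs the hidden state from step t-1,
# so the T iterations cannot run in parallel.
for t in range(T):
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b_h)

print(h.shape)
```

Because information from early steps must survive T repeated applications of `tanh(W_hh @ h + ...)`, long-range dependencies are squeezed through many multiplications, which is one intuition for why very long sequences are hard for RNNs.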