Understanding the Attention mechanism
Language modelling is the task of generating the next word of a sequence given the previous words, like the autocomplete feature on our mobile phones.
Mathematically, it can be written as y* = argmax_{y_t} P(y_t | y_<t)
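For concreteness, here is a tiny sketch (not from any particular library) of greedy next-word selection under this rule; the vocabulary and the stub model below are purely illustrative:

```python
import numpy as np

# Toy vocabulary and a stand-in "model"; a real language model would be
# an RNN/Transformer forward pass returning P(y_t | y_<t).
vocab = ["I", "am", "singing", "a", "song", "<eos>"]

def next_word_distribution(prefix):
    logits = np.random.randn(len(vocab))        # stub scores, for illustration only
    probs = np.exp(logits)
    return probs / probs.sum()                  # softmax over the vocabulary

# Greedy decoding implements y* = argmax_{y_t} P(y_t | y_<t)
prefix = ["I", "am"]
probs = next_word_distribution(prefix)
y_star = vocab[int(np.argmax(probs))]
print(prefix + [y_star])
```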
Sequence-to-Sequence is a similar task where we generate a response sentence in one language given a query sentence in another language, such as machine translation through Google Assistant. This can also be seen as a language model conditioned on another language. For example, when translating from English to Hindi, we generate word tokens in Hindi given the actual sentence in English.
To do this translation, one of the earlier approaches was to compute a context vector for the input sentence and use it to produce the translated sentence with RNN/LSTM/GRU-style models.
The encoder reads the sentence only once and encodes it as a context vector. But passing the context in this manner to produce the translated sentence has some disadvantages.
This is not how we humans translate from one language to another, because it is difficult to retain all the information of the original sentence in a single vector. One fix is to pass the encoded context at every decoder time step. But then one needs to ask: is every encoded word equally important at each time step? The answer is "No".
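To make the bottleneck concrete, here is a minimal NumPy sketch (illustrative shapes and random weights, not a trained model) of the fixed-context encoder-decoder described above:

```python
import numpy as np

np.random.seed(0)
d = 8                                           # hidden/embedding size (illustrative)
src = [np.random.randn(d) for _ in range(5)]    # embedded source words

# Encoder: a simple RNN reads the sentence once and keeps only
# its final hidden state as the context vector.
W_enc, U_enc = np.random.randn(d, d), np.random.randn(d, d)
h = np.zeros(d)
for x in src:
    h = np.tanh(W_enc @ h + U_enc @ x)
context = h                                     # the whole sentence squeezed into one vector

# Decoder: every output word is produced from this single fixed context,
# which is exactly the bottleneck discussed above.
W_dec, C_dec = np.random.randn(d, d), np.random.randn(d, d)
s = np.zeros(d)
outputs = []
for t in range(4):
    s = np.tanh(W_dec @ s + C_dec @ context)
    outputs.append(s)                           # in practice s would be projected to the vocabulary
```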
As an illustration, to translate from Hindi ("Main Gana Ga Raha Hoon") to English ("I am singing a song"), I need to pay attention to selected input words for each translated word, and mask vectors give the weightage of each input word.
But there is no oracle to tell me how important each word is for the translation. So instead I can take a weighted average of the input words and feed it to the decoder.
Converting this notion into equations, we can calculate a quantity that captures the importance of the j'th input word for decoding the t'th output word
e_jt = f_att ( s_t-1, h_j )
And then we can normalize these scores over the input positions j using Softmax
⍺_jt = Softmax_j(e_jt) = exp(e_jt) / ∑_k exp(e_kt)
And there can be various ways to define e_jt, which is a function of the previous decoder state s_t-1 and the j'th input state h_j (some common choices are listed at the end).
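Putting these three equations together, here is a minimal sketch of one decoder step of attention, assuming a plain dot product for f_att (names and shapes are illustrative):

```python
import numpy as np

def attention_step(s_prev, H):
    """One decoder time step of attention.
    s_prev: previous decoder state s_t-1, shape (d,)
    H: encoder states h_1 ... h_n stacked row-wise, shape (n, d)
    """
    e = H @ s_prev                      # e_jt = f_att(s_t-1, h_j), here a dot product
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                # ⍺_jt = Softmax over j of e_jt
    context = alpha @ H                 # weighted average of the input representations
    return context, alpha

# Toy usage: 5 encoder states of size 8 and a random previous decoder state.
np.random.seed(0)
H = np.random.randn(5, 8)
s_prev = np.random.randn(8)
context, weights = attention_step(s_prev, H)
print(weights.round(3), weights.sum())  # weights are non-negative and sum to 1
```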
Attention Mechanism
Till now we have seen attention as a way to improve the Seq2Seq model, but attention can be used in many architectures for many tasks. Attention can be defined as
Attention: the Query attends to the Values
Given a set of value vectors and a query vector, attention is a technique to compute a weighted sum of the values, dependent on the query
The intuition behind this definition is that
- The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on
- Attention is a way to obtain a fixed-size representation of an arbitrary set of representations (the Values), dependent on some other representation (the Query)
The algorithm for general attention calculation is
Input: Values (h_1 … h_n) and a Query (s_t-1), where n is the number of input words
Compute attention scores: e_jt
Take a softmax to get the attention distribution: ⍺_jt
Use the attention distribution to take a weighted sum of the values
and obtain the attention output
a_t = ∑_j ⍺_jt * h_j
Output: attention output a_t
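A sketch of this general recipe in NumPy, again assuming dot-product scores; the second half shows attention used outside Seq2Seq, to pool a variable-length set of word vectors into a fixed-size summary (all vectors here are random placeholders for learned representations):

```python
import numpy as np

def attend(query, values):
    """General attention: a query attends to a set of values.
    query: shape (d,); values: shape (n, d). Returns a fixed-size summary, shape (d,).
    """
    scores = values @ query                 # attention scores (basic dot product)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # attention distribution
    return weights @ values                 # attention output: weighted sum of values

# Example beyond Seq2Seq: pool a variable-length set of word vectors into
# one fixed-size sentence vector using a "summary" query vector.
np.random.seed(1)
word_vectors = np.random.randn(7, 16)       # any number of words works
summary_query = np.random.randn(16)         # would be a learned vector in practice
sentence_vector = attend(summary_query, word_vectors)
print(sentence_vector.shape)                # (16,) regardless of sentence length
```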
And there are multiple ways of calculating attention scores, famously known as basic dot-product attention, multiplicative attention and additive attention.
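For reference, here is a minimal sketch of those three score functions as they are usually written (e.g. in CS224N); W, W1, W2 and v stand for learned parameters and are random placeholders here:

```python
import numpy as np

np.random.seed(2)
d = 8                                           # illustrative dimension
s, h = np.random.randn(d), np.random.randn(d)   # decoder state s_t-1 and encoder state h_j
W = np.random.randn(d, d)                       # multiplicative (bilinear) weight matrix
W1, W2 = np.random.randn(d, d), np.random.randn(d, d)
v = np.random.randn(d)                          # additive-attention parameters

# Basic dot-product attention: e_jt = s_t-1 . h_j   (needs matching dimensions)
e_dot = s @ h

# Multiplicative attention: e_jt = s_t-1^T W h_j
e_mul = s @ W @ h

# Additive attention: e_jt = v^T tanh(W1 h_j + W2 s_t-1)
e_add = v @ np.tanh(W1 @ h + W2 @ s)

print(e_dot, e_mul, e_add)
```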
References
- Deep Learning NPTEL, IIT Madras: https://www.youtube.com/playlist?list=PL3pGy4HtqwD2kwldm81pszxZDJANK3uGV
- NLP Stanford CS224N: https://www.youtube.com/playlist?list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z
Keep Learning, Keep Hustling