Understanding the Attention mechanism
Language modelling is the task of generating the next word of a sequence given the previous words, like the autocomplete feature on our mobile phones.
Mathematically, it can be written as y* = argmax_{y_t} P(y_t | y_<t)
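For concreteness, here is a tiny sketch (not from any particular library) of greedy next-word selection under this rule; the vocabulary and the stub model below are purely illustrative:

```python
import numpy as np

# Toy vocabulary and a stand-in "model"; a real language model would be
# an RNN/Transformer forward pass returning P(y_t | y_<t).
vocab = ["I", "am", "singing", "a", "song", "<eos>"]

def next_word_distribution(prefix):
    logits = np.random.randn(len(vocab))        # stub scores, for illustration only
    probs = np.exp(logits)
    return probs / probs.sum()                  # softmax over the vocabulary

# Greedy decoding implements y* = argmax_{y_t} P(y_t | y_<t)
prefix = ["I", "am"]
probs = next_word_distribution(prefix)
y_star = vocab[int(np.argmax(probs))]
print(prefix + [y_star])
```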
Sequence-to-Sequence is a similar task where we generate a response sentence in one language given a query sentence in another language, such as machine translation through Google Assistant. This can also be seen as a language model conditioned on another language. For example, when translating from English to Hindi, we generate word tokens in Hindi given the actual sentence in English.
To do this translation, one of the earlier approaches was to compute a context vector for the input sentence and use it to produce the translated sentence with RNN/LSTM/GRU-style models.
The encoder reads the sentence only once and encodes it as a context vector. But passing the context in this manner to produce the translated sentence has some disadvantages.
This is not how we humans translate from one language to another, because it is difficult to retain all the information of the original sentence in a single vector. One fix is to pass the encoded context at every decoder time step. But then one needs to ask: is every encoded word equally important at each time step? The answer is "No".
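To make the bottleneck concrete, here is a minimal NumPy sketch (illustrative shapes and random weights, not a trained model) of the fixed-context encoder-decoder described above:

```python
import numpy as np

np.random.seed(0)
d = 8                                           # hidden/embedding size (illustrative)
src = [np.random.randn(d) for _ in range(5)]    # embedded source words

# Encoder: a simple RNN reads the sentence once and keeps only
# its final hidden state as the context vector.
W_enc, U_enc = np.random.randn(d, d), np.random.randn(d, d)
h = np.zeros(d)
for x in src:
    h = np.tanh(W_enc @ h + U_enc @ x)
context = h                                     # the whole sentence squeezed into one vector

# Decoder: every output word is produced from this single fixed context,
# which is exactly the bottleneck discussed above.
W_dec, C_dec = np.random.randn(d, d), np.random.randn(d, d)
s = np.zeros(d)
outputs = []
for t in range(4):
    s = np.tanh(W_dec @ s + C_dec @ context)
    outputs.append(s)                           # in practice s would be projected to the vocabulary
```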
As an illustration, to translate from Hindi ("Main Gana Ga Raha Hoon") to English ("I am singing a song"), I need to pay attention to selected input words for each translated word, and mask vectors give the weightage of each input word.
But there is no oracle to tell me how important each word is for the translation. So instead I can take a weighted average of the input words and feed it to the decoder.
Converting this notion into equations, we can calculate a quantity that captures the importance of the j'th input word for decoding the t'th output word
e_jt = f_att ( s_t-1, h_j )
And then we can normalize these scores over the input positions j using Softmax
⍺_jt = Softmax_j(e_jt) = exp(e_jt) / ∑_k exp(e_kt)
And there can be various ways to define e_jt, which is a function of the previous decoder state s_t-1 and the j'th input state h_j (some common choices are listed at the end).
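Putting these three equations together, here is a minimal sketch of one decoder step of attention, assuming a plain dot product for f_att (names and shapes are illustrative):

```python
import numpy as np

def attention_step(s_prev, H):
    """One decoder time step of attention.
    s_prev: previous decoder state s_t-1, shape (d,)
    H: encoder states h_1 ... h_n stacked row-wise, shape (n, d)
    """
    e = H @ s_prev                      # e_jt = f_att(s_t-1, h_j), here a dot product
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                # ⍺_jt = Softmax over j of e_jt
    context = alpha @ H                 # weighted average of the input representations
    return context, alpha

# Toy usage: 5 encoder states of size 8 and a random previous decoder state.
np.random.seed(0)
H = np.random.randn(5, 8)
s_prev = np.random.randn(8)
context, weights = attention_step(s_prev, H)
print(weights.round(3), weights.sum())  # weights are non-negative and sum to 1
```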
Attention Mechanism
Till now we have seen attention as a way to improve the Seq2Seq model, but attention can be used in many architectures for many tasks. Attention can be defined as
Attention: the Query attends to the Values
Given a set of value vectors and a query vector, attention is a technique to compute a weighted sum of the values, dependent on the query
The intuition behind this definition is that
- The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on
- Attention is a way to obtain a fixed-size representation of an arbitrary set of representations (the Values), dependent on some other representation (the Query)
The algorithm for general attention calculation is
Input: Values (h_1 … h_n) and a Query (s_t-1), where n is the number of input words
Compute attention scores: e_jt
Take a softmax to get the attention distribution: ⍺_jt
Use the attention distribution to take a weighted sum of the values
and obtain the attention output
a_t = ∑_j ⍺_jt * h_j
Output: attention output a_t
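A sketch of this general recipe in NumPy, again assuming dot-product scores; the second half shows attention used outside Seq2Seq, to pool a variable-length set of word vectors into a fixed-size summary (all vectors here are random placeholders for learned representations):

```python
import numpy as np

def attend(query, values):
    """General attention: a query attends to a set of values.
    query: shape (d,); values: shape (n, d). Returns a fixed-size summary, shape (d,).
    """
    scores = values @ query                 # attention scores (basic dot product)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # attention distribution
    return weights @ values                 # attention output: weighted sum of values

# Example beyond Seq2Seq: pool a variable-length set of word vectors into
# one fixed-size sentence vector using a "summary" query vector.
np.random.seed(1)
word_vectors = np.random.randn(7, 16)       # any number of words works
summary_query = np.random.randn(16)         # would be a learned vector in practice
sentence_vector = attend(summary_query, word_vectors)
print(sentence_vector.shape)                # (16,) regardless of sentence length
```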
And there are multiple ways of calculating attention scores, famously known as basic dot-product attention, multiplicative attention and additive attention.
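For reference, here is a minimal sketch of those three score functions as they are usually written (e.g. in CS224N); W, W1, W2 and v stand for learned parameters and are random placeholders here:

```python
import numpy as np

np.random.seed(2)
d = 8                                           # illustrative dimension
s, h = np.random.randn(d), np.random.randn(d)   # decoder state s_t-1 and encoder state h_j
W = np.random.randn(d, d)                       # multiplicative (bilinear) weight matrix
W1, W2 = np.random.randn(d, d), np.random.randn(d, d)
v = np.random.randn(d)                          # additive-attention parameters

# Basic dot-product attention: e_jt = s_t-1 . h_j   (needs matching dimensions)
e_dot = s @ h

# Multiplicative attention: e_jt = s_t-1^T W h_j
e_mul = s @ W @ h

# Additive attention: e_jt = v^T tanh(W1 h_j + W2 s_t-1)
e_add = v @ np.tanh(W1 @ h + W2 @ s)

print(e_dot, e_mul, e_add)
```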
References
- Deep Learning NPTEL, IIT Madras: https://www.youtube.com/playlist?list=PL3pGy4HtqwD2kwldm81pszxZDJANK3uGV
- NLP Stanford CS224N: https://www.youtube.com/playlist?list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z
Keep Learning, Keep Hustling