Finding the Attention Sink of LLMs: StreamingLLM

Nikhil Verma
4 min read · Oct 16, 2023
Image source: Google Photos (modified). Disclaimer: The views and opinions expressed here are my own and do not reflect the positions or perspectives of my employer.

Although large language models are spreading their wings in directions such as search engines, question answering, chat assistance and document summarization, there are still many choices to make when laying down their architecture for a particular use case, such as the modelling architecture, the decoding strategy, the handling of long-range dependencies, and the loss function specific to the task and learning objective. The architecture behind most such models is the vanilla Transformer or one of its components, such as an encoder-only or decoder-only model.

It is very challenging for LLMs to generalize to longer sequence lengths than the ones they have been pretrained on

The internal operation powering these components is scaled dot-product attention, which has a time complexity of O(n²) for an input of length n. This dense operation limits the model to a finite number of input tokens. This limit, the context length, is typically 512 for BERT, 1K for T5 and approximately 4K for LLaMA and most GPT-family models.
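To see where the quadratic cost comes from, here is a minimal sketch of scaled dot-product attention in PyTorch (illustrative only, not any particular model's implementation); for a sequence of n tokens it materializes an n x n score matrix, which is exactly the O(n²) bottleneck.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (seq_len, d_model). The score matrix is (seq_len, seq_len),
    # which is where the O(n^2) time and memory cost comes from.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (n, n)
    weights = F.softmax(scores, dim=-1)
    return weights @ v

n, d = 4096, 64                     # roughly a LLaMA-like 4K context
q, k, v = (torch.randn(n, d) for _ in range(3))
out = scaled_dot_product_attention(q, k, v)   # builds a 4096 x 4096 score matrix
```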

Window attention: by caching only the key/value states of the most recent tokens, it ensures constant memory usage and decoding speed once the cache is initially filled
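As a rough illustration of that caching behaviour, the sketch below (my own simplification, not the paper's code) keeps only the most recent `window` key/value pairs so memory stays constant; the `n_sink` parameter is a hypothetical extra showing how StreamingLLM additionally retains a few initial "attention sink" tokens instead of evicting them.

```python
import torch

class WindowKVCache:
    """Keep at most `n_sink + window` key/value pairs.
    n_sink = 0 gives plain window attention; a small n_sink (e.g. 4)
    mimics StreamingLLM's idea of preserving the first few sink tokens."""

    def __init__(self, window: int, n_sink: int = 0):
        self.window = window
        self.n_sink = n_sink
        self.keys, self.values = [], []

    def append(self, k: torch.Tensor, v: torch.Tensor):
        self.keys.append(k)
        self.values.append(v)
        # Evict the oldest non-sink entry once the cache exceeds its budget.
        if len(self.keys) > self.n_sink + self.window:
            del self.keys[self.n_sink]
            del self.values[self.n_sink]

    def tensors(self):
        return torch.stack(self.keys), torch.stack(self.values)

cache = WindowKVCache(window=1024, n_sink=4)    # sizes are illustrative
for _ in range(3000):                           # stream in 3000 tokens
    cache.append(torch.randn(64), torch.randn(64))
k, v = cache.tensors()
print(k.shape)                                  # torch.Size([1028, 64])
```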


Nikhil Verma

Knowledge shared is knowledge squared | My Portfolio https://lihkinverma.github.io/portfolio/ | My blogs are living document, updated as I receive comments