Finding the Attention Sinks of LLMs: StreamingLLM
Although large language models are spreading their wings in many directions, such as search engines, question answering, chat assistants, and document summarization, there are still many choices to be made when laying down their architecture for a particular use case, such as the modelling architecture, decoding strategy, handling of long-range dependencies, and a loss function specific to the task and learning objective. The architecture behind most such models is the vanilla Transformer or one of its variants, such as an encoder-only or decoder-only model.
It is very challenging for LLMs to generalize to sequence lengths longer than those they were pretrained on.
The internal operation powering these components is scaled dot-product attention, which carries a time complexity of O(n²) for an input of length n. This dense operation limits the model to a finite number of input tokens. This limit (the context length) is typically 512 for BERT, 1K for T5, and approximately 4K for LLaMA and most GPT-family models.
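To see where the quadratic cost comes from, here is a minimal NumPy sketch of scaled dot-product attention. The shapes and variable names are illustrative assumptions, not taken from any particular implementation; the point is that the score matrix is n × n, so compute and memory grow with the square of the sequence length.

```python
# Minimal sketch of scaled dot-product attention (illustrative only;
# shapes and names are assumptions, not a specific library's API).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (n, d) for a sequence of n tokens."""
    d = Q.shape[-1]
    # The score matrix has shape (n, n) -- this is the O(n^2) term
    # that ties cost to the square of the sequence length.
    scores = Q @ K.T / np.sqrt(d)
    # Row-wise softmax over the scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

n, d = 1024, 64                     # sequence length, head dimension
Q = np.random.randn(n, d)
K = np.random.randn(n, d)
V = np.random.randn(n, d)
out = scaled_dot_product_attention(Q, K, V)  # output: (n, d); score matrix was (n, n)
```

Doubling n quadruples the size of the score matrix, which is why dense attention becomes impractical well before arbitrarily long inputs.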