Long-Range Transformer with Unlimited Length Input
Pretrained transformers generally have a context window of 512 tokens (e.g. BERT, T5) or 1024 tokens (e.g. BART), which is sufficient for many current conditional-generation datasets. But vanilla transformers cannot simply scale up, because the naïve self-attention operation has quadratic complexity in the input length. Tasks that involve long narratives, such as book summarization, where inputs may exceed 500K tokens, therefore cannot be handled by such an architecture.
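To make the quadratic blow-up concrete, here is a back-of-the-envelope sketch (not from the paper) of how the self-attention score matrix grows with input length. The memory estimate assumes float32 scores for a single head in a single layer, which is an illustrative simplification:

```python
def attention_matrix_size(n_tokens):
    """Self-attention compares every token with every other token,
    so the score matrix has n_tokens * n_tokens entries."""
    return n_tokens * n_tokens

for n in (512, 1024, 8192, 500_000):
    entries = attention_matrix_size(n)
    # 4 bytes per float32 entry, per attention head, per layer
    print(f"{n:>7} tokens -> {entries:,} scores (~{entries * 4 / 1e9:.1f} GB per head/layer)")
```

At 500K tokens the score matrix alone would need on the order of a terabyte per head per layer, which is why simply enlarging the context window is not an option.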
The authors of Unlimiformer [1] introduced a retrieval-based approach that accepts inputs of unbounded length at test time. Its main components include (see the sketch after this list):
- A long input sequence
- A k-nearest-neighbor (kNN) index over the input tokens' hidden states, produced by the transformer encoder
- Cross-attention over only the top-k retrieved input tokens in the transformer decoder
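The following is a minimal, dependency-light sketch of this idea, not the authors' implementation: the encoder (stubbed out here with a random `encode_chunk` function) processes the long input in fixed-size chunks, all hidden states are stacked into a flat index, and at decoding time a brute-force kNN search returns the top-k encoder states for a given cross-attention query. The real system uses a pretrained encoder and a FAISS index; the chunk size and dimensions below are illustrative assumptions.

```python
import numpy as np

# Hypothetical stand-in for a pretrained encoder: maps a chunk of token ids
# to per-token hidden states of dimension d_model. In practice this would be
# e.g. a BART encoder with its usual 1024-token window.
def encode_chunk(chunk_ids, d_model=16, seed=0):
    rng = np.random.default_rng(seed + len(chunk_ids))
    return rng.standard_normal((len(chunk_ids), d_model)).astype(np.float32)

def build_index(token_ids, chunk_size=1024, d_model=16):
    """Encode a long input chunk by chunk and stack all hidden states into one index."""
    states = []
    for start in range(0, len(token_ids), chunk_size):
        chunk = token_ids[start:start + chunk_size]
        states.append(encode_chunk(chunk, d_model))
    return np.concatenate(states, axis=0)           # shape: (total_tokens, d_model)

def topk_cross_attention_keys(index, query, k=8):
    """Brute-force kNN: return the k encoder states closest to the decoder query."""
    # Inner-product similarity between the cross-attention query and every
    # indexed encoder state; a FAISS index would avoid this exhaustive scan.
    scores = index @ query                           # (total_tokens,)
    top_ids = np.argsort(-scores)[:k]
    return top_ids, index[top_ids]                   # positions and retrieved states

# Toy usage: a 5000-"token" input, far beyond a single encoder window.
token_ids = list(range(5000))
index = build_index(token_ids)
query = np.random.default_rng(1).standard_normal(16).astype(np.float32)
positions, keys = topk_cross_attention_keys(index, query)
print(positions.shape, keys.shape)                   # (8,) (8, 16)
```

The key design point this sketch tries to convey is that the decoder never attends to all input tokens at once; each cross-attention head only sees the k states retrieved for its own query, so decoding cost stays bounded regardless of input length.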
Unlimiformer can be injected into any existing encoder-decoder transformer to permit unbounded inputs. Having grasped the problem from a high-level perspective, let's dive into the model architecture and the intricacies of the approach.