Long-Range Transformer with Unlimited Length Input

Nikhil Verma
3 min read · Aug 25, 2023
ChatGPT unable to generate a 4,000-word essay due to its context window limitation

Pretrained transformers generally have a context window of 512 tokens (e.g. BERT, T5) or 1024 tokens (e.g. BART), which is sufficient for many current conditional generation datasets. But vanilla transformers cannot simply scale this window up, because naïve self-attention has quadratic complexity in the input length. Tasks that involve long narratives, such as book summarization, where inputs may exceed 500K tokens, therefore cannot be handled by such an architecture.
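To see why the quadratic cost bites, here is a back-of-the-envelope sketch (my own illustration, assuming 4-byte floats and a single attention score matrix) of the memory vanilla self-attention needs just to hold its n × n score matrix:

```python
def attn_matrix_gib(n_tokens, bytes_per_float=4):
    # Vanilla self-attention materializes an n x n score matrix,
    # so memory (and compute) grow quadratically with input length.
    return n_tokens * n_tokens * bytes_per_float / 2**30

for n in (512, 1024, 16_000, 500_000):
    print(f"{n:>7} tokens -> {attn_matrix_gib(n):,.3f} GiB per attention score matrix")
```

At 512 tokens this is about a megabyte; at 500K tokens it is close to a terabyte, which is why the window cannot simply be widened.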

The authors of Unlimiformer [1] introduced a retrieval-based approach that accepts inputs of unbounded length at test time. Its main components are:

  • The long input sequence
  • A k-nearest-neighbor (kNN) index over the input tokens processed by the Transformer encoder
  • Cross-attention over the top-k retrieved input tokens in the Transformer decoder (sketched in code after this list)
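Below is a minimal sketch of that retrieval idea, not the authors' implementation: it uses an exact dot-product search in place of a real kNN index (the paper uses an approximate index such as Faiss and reformulates the attention dot-product so a single index can serve every head and layer), and helper names like `build_datastore` and `knn_cross_attention` are my own.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_datastore(encoder, input_ids, chunk_len=512):
    """Encode an arbitrarily long input chunk by chunk and stack every
    encoder hidden state into one retrieval datastore."""
    states = []
    for start in range(0, input_ids.size(1), chunk_len):
        chunk = input_ids[:, start:start + chunk_len]
        hidden = encoder(chunk).last_hidden_state      # (1, chunk_len, d)
        states.append(hidden.squeeze(0))
    return torch.cat(states, dim=0)                    # (n_input_tokens, d)

def knn_cross_attention(query, datastore, k=16):
    """For one decoder query vector, retrieve the k most similar encoder
    states and attend only to them, instead of to the whole input."""
    scores = datastore @ query                         # (n_input_tokens,)
    top_scores, top_idx = scores.topk(k)               # "kNN" via exact dot product
    weights = F.softmax(top_scores, dim=-1)            # (k,)
    return weights @ datastore[top_idx]                # (d,) attended output
```

The point of the retrieval step is that each decoder position attends to a fixed number of tokens (k), so cross-attention cost no longer grows with the input length, while the datastore itself only grows linearly.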

Unlimiformer can be injected into any existing encoder-decoder transformer to permit unbounded inputs. Having grasped the problem at a high level, let's dive into the model architecture and the intricacies of the approach.
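As a rough picture of what such an injection could look like in practice, the glue code below loads a standard pretrained BART checkpoint with Hugging Face transformers and builds the datastore from the sketch above; the flow and comments are my own simplification, not the released Unlimiformer API.

```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

long_document = "..."  # stand-in for a book-length input
input_ids = tokenizer(long_document, return_tensors="pt",
                      truncation=False).input_ids     # no 1024-token truncation

# Reuses build_datastore from the earlier sketch.
datastore = build_datastore(model.get_encoder(), input_ids)

# At generation time, each cross-attention head would call something like
# knn_cross_attention(query, datastore, k=16) instead of attending to a
# fixed-length encoder output -- patching in that swap is what makes the
# wrapper work with an unmodified pretrained model.
```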

Model Architecture
