Long-Range Transformer with Unlimited Length Input
Pretrained transformers generally have a context window of 512 tokens (e.g. BERT, T5) or 1024 tokens (e.g. BART), which is sufficient for many current conditional-generation datasets. But vanilla transformers cannot simply scale up, because the naïve self-attention operation has quadratic complexity in the input length. Tasks that involve long narratives, such as book summarization, where inputs may exceed 500K tokens, therefore cannot be handled by such an architecture.
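To make the quadratic blow-up concrete, here is a back-of-the-envelope sketch (not from the paper) of how the self-attention score matrix grows with input length. The memory estimate assumes float32 scores for a single head in a single layer, which is an illustrative simplification:

```python
def attention_matrix_size(n_tokens):
    """Self-attention compares every token with every other token,
    so the score matrix has n_tokens * n_tokens entries."""
    return n_tokens * n_tokens

for n in (512, 1024, 8192, 500_000):
    entries = attention_matrix_size(n)
    # 4 bytes per float32 entry, per attention head, per layer
    print(f"{n:>7} tokens -> {entries:,} scores (~{entries * 4 / 1e9:.1f} GB per head/layer)")
```

At 500K tokens the score matrix alone would need on the order of a terabyte per head per layer, which is why simply enlarging the context window is not an option.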
The authors of Unlimiformer [1] introduced a retrieval-based approach that accepts inputs of unbounded length at test time. Its main components include (see the sketch after this list):
- A long input sequence
- A k-nearest-neighbor (kNN) index over the input tokens' hidden states, produced by the transformer encoder
- Cross-attention over only the top-k retrieved input tokens in the transformer decoder
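The following is a minimal, dependency-light sketch of this idea, not the authors' implementation: the encoder (stubbed out here with a random `encode_chunk` function) processes the long input in fixed-size chunks, all hidden states are stacked into a flat index, and at decoding time a brute-force kNN search returns the top-k encoder states for a given cross-attention query. The real system uses a pretrained encoder and a FAISS index; the chunk size and dimensions below are illustrative assumptions.

```python
import numpy as np

# Hypothetical stand-in for a pretrained encoder: maps a chunk of token ids
# to per-token hidden states of dimension d_model. In practice this would be
# e.g. a BART encoder with its usual 1024-token window.
def encode_chunk(chunk_ids, d_model=16, seed=0):
    rng = np.random.default_rng(seed + len(chunk_ids))
    return rng.standard_normal((len(chunk_ids), d_model)).astype(np.float32)

def build_index(token_ids, chunk_size=1024, d_model=16):
    """Encode a long input chunk by chunk and stack all hidden states into one index."""
    states = []
    for start in range(0, len(token_ids), chunk_size):
        chunk = token_ids[start:start + chunk_size]
        states.append(encode_chunk(chunk, d_model))
    return np.concatenate(states, axis=0)           # shape: (total_tokens, d_model)

def topk_cross_attention_keys(index, query, k=8):
    """Brute-force kNN: return the k encoder states closest to the decoder query."""
    # Inner-product similarity between the cross-attention query and every
    # indexed encoder state; a FAISS index would avoid this exhaustive scan.
    scores = index @ query                           # (total_tokens,)
    top_ids = np.argsort(-scores)[:k]
    return top_ids, index[top_ids]                   # positions and retrieved states

# Toy usage: a 5000-"token" input, far beyond a single encoder window.
token_ids = list(range(5000))
index = build_index(token_ids)
query = np.random.default_rng(1).standard_normal(16).astype(np.float32)
positions, keys = topk_cross_attention_keys(index, query)
print(positions.shape, keys.shape)                   # (8,) (8, 16)
```

The key design point this sketch tries to convey is that the decoder never attends to all input tokens at once; each cross-attention head only sees the k states retrieved for its own query, so decoding cost stays bounded regardless of input length.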
Unlimiformer can be injected into any existing encoder-decoder transformer to permit unbounded inputs. Having grasped the problem from a high-level perspective, let's dive into the model architecture and the intricacies of the approach.