Long-Range Transformer with Unlimited Length Input
Aug 25, 2023
Pretrained transformers generally have a context window of 512 tokens (e.g. BERT, T5) or 1024 tokens (e.g. BART), which is sufficient for many current conditional generation datasets. But vanilla transformers cannot simply scale up, because the naïve self-attention operation has quadratic time and memory complexity in the input length. So tasks that involve long narratives, such as book summarization, which may contain inputs exceeding 500K…
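To see where the quadratic cost comes from, here is a minimal single-head self-attention sketch in PyTorch; the learned query/key/value projections are omitted for brevity (so Q = K = V = x), and the function name is just for illustration:

```python
import torch

def naive_self_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head self-attention with Q = K = V = x (projections omitted)."""
    d = x.size(-1)
    # This (seq_len x seq_len) score matrix is the quadratic bottleneck:
    # doubling the input length quadruples its size.
    scores = (x @ x.transpose(-2, -1)) / d ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return weights @ x

x = torch.randn(512, 64)        # 512 tokens: a 512 x 512 score matrix, no problem
out = naive_self_attention(x)
# At 500K tokens the score matrix alone would hold 2.5e11 entries,
# roughly 1 TB in fp32, far beyond any single accelerator's memory.
```

That score matrix is exactly the cost that long-range attention variants (sparse, local, or chunked attention) are designed to avoid.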