Google researchers recently released a paper that claims to give large language models (LLMs) the ability to handle text of effectively infinite length. The technique they introduce, Infini-attention, extends a model's context window while keeping its memory and compute requirements bounded.
The context window is the maximum number of tokens a model can handle at once. If a conversation with ChatGPT, for example, extends past this limit, performance degrades sharply and the model may discard the tokens from the start of the conversation.
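As a rough illustration, a chat application has to fit the running conversation into that fixed token budget, and a common fallback is to drop the oldest messages first. The snippet below is a minimal sketch of that behavior; the 4,096-token budget and the whitespace "tokenizer" are stand-ins chosen for illustration, not the behavior of any specific model or API.

```python
# Minimal sketch of context-window truncation (illustrative only).
MAX_TOKENS = 4096  # hypothetical context-window budget

def count_tokens(message: str) -> int:
    return len(message.split())  # crude proxy for a real tokenizer

def fit_to_context(messages: list[str], budget: int = MAX_TOKENS) -> list[str]:
    kept, used = [], 0
    # Walk backwards so the most recent messages are kept;
    # the earliest ones are the first to be discarded.
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```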
Organizations have taken to tailoring LLMs to their applications by adding custom documents and knowledge to prompts, which makes longer context windows one way to improve model effectiveness and gain an edge over rivals.
Experiments conducted by Google Research indicate that models using Infini-attention can maintain quality across one million tokens without requiring additional memory, and the researchers expect this to hold at even longer lengths.
The Transformer, the deep learning architecture employed in LLMs, has quadratic complexity in both memory footprint and computation time. This means that as the input grows, memory and compute requirements grow with the square of the input size rather than linearly: going from 1,000 tokens to 2,000 tokens does not double the processing needs but quadruples them.
This quadratic relationship can be explained by the self-attention mechanism found in transformers, which compares each element from an input sequence with every other one in turn.
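The sketch below makes this concrete with a plain NumPy version of scaled dot-product self-attention: the score matrix holds one entry per pair of tokens, so a sequence of length n produces an n-by-n matrix, and doubling n quadruples both the memory it occupies and the work needed to fill it. The dimensions and the identity projections are simplifications for illustration.

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Plain scaled dot-product self-attention over an (n, d) sequence."""
    n, d = x.shape
    q, k, v = x, x, x                        # identity projections for brevity
    scores = q @ k.T / np.sqrt(d)            # (n, n): one score per pair of tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                       # (n, d)

for n in (1_000, 2_000):                     # doubling the sequence length...
    out = self_attention(np.random.randn(n, 64))
    print(f"{n} tokens -> {n * n:,} pairwise attention scores")
    # 1,000,000 scores vs 4,000,000 scores: twice the input, four times the work
```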
In recent years, researchers have developed various techniques to reduce the cost of extending the context length of LLMs. One such technique, Infini-attention, combines a long-term compressive memory with local causal attention to model both long- and short-range contextual dependencies efficiently.
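In broad strokes, the input is processed in fixed-size segments: each segment attends to itself with ordinary causal attention, while a small fixed-size memory summarizes everything seen so far and is read at every step. The sketch below is only a schematic of that segment-plus-memory pattern, not the authors' implementation; the segment length, the positive feature map, the memory update rule, and the fixed mixing coefficient are all simplifying assumptions (the paper, for instance, uses a learned gate to combine the two attention outputs).

```python
import numpy as np

def causal_attention(q, k, v):
    """Standard causal attention within one segment (quadratic only in the segment length)."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores[np.triu_indices(n, k=1)] = -np.inf            # mask out future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def infini_attention_sketch(x, seg_len=128, beta=0.5):
    """Schematic segment-wise attention with a fixed-size compressive memory."""
    n, d = x.shape
    memory = np.zeros((d, d))   # compressive memory: size independent of sequence length
    norm = np.zeros(d)          # running normalization term for memory reads
    outputs = []
    for start in range(0, n, seg_len):
        seg = x[start:start + seg_len]
        q, k, v = seg, seg, seg                              # identity projections for brevity
        local = causal_attention(q, k, v)                    # short-range dependencies
        sigma_q = np.maximum(q, 0.0) + 1.0                   # simple positive feature map (assumption)
        denom = sigma_q @ norm + 1e-6
        from_memory = (sigma_q @ memory) / denom[:, None]    # long-range retrieval from memory
        outputs.append(beta * from_memory + (1.0 - beta) * local)  # fixed mix (assumption)
        sigma_k = np.maximum(k, 0.0) + 1.0
        memory += sigma_k.T @ v                              # compress this segment into memory
        norm += sigma_k.sum(axis=0)
    return np.vstack(outputs)

out = infini_attention_sketch(np.random.randn(1_024, 64))
print(out.shape)  # (1024, 64); the memory stayed a fixed 64x64 matrix throughout
```

Because the memory remains a fixed-size matrix no matter how many segments precede it, the per-segment cost stays constant, which is what allows the context to grow without a matching growth in memory and compute.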