Context Size and the Power of RAG
When using Large Language Models (LLMs) there is one big and important factor: context size. That's the total amount of content you can present to the LLM, and it dictates how much information the LLM can consider when generating a response. Depending on your use case, the context size can be quite a limiting factor. There is a smart workaround: RAG. In this post I will explain a few things about context size and how RAG overcomes its limitations. Reading this post requires only a little knowledge about LLMs in general.
The Context Window
The context window size of prominent LLMs these days hovers between 192 and 300 pages of text. I’ve included 5 models as an example in the chart below.

That amount of context covers a lot of scenarios, especially when you’re only asking general questions and assume the LLM has an answer. Anytime you ask a follow-up question, the content in your context window grows, as the LLM usually maintains your chat history, which includes your questions and previous answers. Still, the available context window sizes of LLMs should be just fine for most people.

Sometimes you want to include additional information that the LLM can’t know: very recent facts that weren’t available when the model was trained, a text you wrote yourself, or some code you want to examine. Nothing a typical context window can’t handle these days, given that it can hold between 192 and 300 pages of text.

The moment you think about providing entire knowledge bases or a document management system, like Atlassian’s Confluence or Microsoft SharePoint, or your product database, you have a problem: you probably have more content than you can squeeze into the context window.

RAG
That’s when Retrieval-Augmented Generation, RAG for short, comes into play. It’s the bridge between your limited context window and your vast repository of knowledge.
The technique in a nutshell: since the context window is limited, you don’t try to fit all of your knowledge into it.

(1) Instead, you begin with your raw question. (2) With that, you query your knowledge database and retrieve what’s relevant to the given question. (3) This yields a package of relevant documents, or excerpts of them. (4) These fit nicely into the context window together with the original question.
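To make the four steps concrete, here is a minimal sketch in Python. The `retriever` and `llm` objects and their methods (`search`, `complete`) are hypothetical placeholders for whatever retrieval backend and model client you use, not a specific library’s API:

```python
# Minimal sketch of the four RAG steps above.
# retriever and llm are hypothetical placeholders, not a specific library's API.

def answer_with_rag(question: str, retriever, llm, top_k: int = 5) -> str:
    # (2) Query the knowledge base with the raw question.
    documents = retriever.search(question, top_k=top_k)

    # (3) Keep only the relevant documents or excerpts.
    context = "\n\n".join(doc.text for doc in documents)

    # (4) Pack the excerpts and the original question into one prompt
    #     that fits comfortably into the context window.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm.complete(prompt)
```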
While many people associate RAG exclusively with a vector database, that’s not a requirement. You can use whatever works to retrieve relevant documents. In practice, a keyword-based approach is often used, usually BM25, a ranking algorithm found in traditional full-text search engines. You can even combine both approaches into a hybrid RAG, where the semantic part finds a set of candidate documents and you re-rank them against a second, keyword-based list of results.
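Here is a rough sketch of such a hybrid setup, assuming a hypothetical `embed()` function for the semantic part and the small `rank_bm25` package for the keyword scores; a production setup would typically use a proper vector database and reranker instead:

```python
# Hybrid retrieval sketch: semantic search proposes candidates,
# BM25 keyword scores re-rank them. embed() is a hypothetical
# embedding function supplied by the caller.
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_search(question, documents, embed, top_k=20, final_k=5):
    # Semantic step: rank all documents by cosine similarity to the question.
    doc_vecs = np.array([embed(d) for d in documents])
    q_vec = embed(question)
    sims = doc_vecs @ q_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec)
    )
    candidates = np.argsort(sims)[::-1][:top_k]

    # Keyword step: re-rank the semantic candidates with BM25 scores.
    bm25 = BM25Okapi([documents[i].lower().split() for i in candidates])
    scores = bm25.get_scores(question.lower().split())
    reranked = [candidates[i] for i in np.argsort(scores)[::-1][:final_k]]
    return [documents[i] for i in reranked]
```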
Very Large Context Window Sizes
The context window sizes are growing larger and larger. While GPT-1 (2018) had a context size of 512 tokens, GPT-2 had 1,024 tokens, GPT-3 2,048 tokens, and so on. Right now Gemini Pro 1.5/2.0 reach 2 million tokens, and a Google paper (”Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context”) suggests they have already tested up to 10 million tokens without degrading performance. That’s roughly 15,000 pages of text.

This leads some people to the provocative idea that RAG might be obsolete, given a context window big enough to hold quite sizeable knowledge bases. The idea is simple: just squeeze everything you know into the prompt (an approach even called “prompt stuffing”).
Even if you could squeeze your entire knowledge base into a context window, it would be very inefficient to encode everything into tokens: it wastes precious computation time, it is surely more expensive and slower, and in the end it may even be less accurate than a RAG approach.
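A quick back-of-the-envelope way to see the cost difference is to simply count tokens for both approaches. The sketch below uses the `tiktoken` tokenizer; the two document lists are made-up placeholders standing in for your full knowledge base and a retrieved subset:

```python
# Rough illustration of the cost argument: count how many tokens
# "prompt stuffing" would consume compared to a small retrieved subset.
# The document lists below are placeholders, not real data.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def total_tokens(texts: list[str]) -> int:
    return sum(len(enc.encode(t)) for t in texts)

knowledge_base = ["... every page of your wiki ..."] * 10_000      # prompt stuffing
retrieved_excerpts = ["... the five most relevant excerpts ..."] * 5  # RAG

print("prompt stuffing:", total_tokens(knowledge_base), "tokens")
print("RAG excerpts:   ", total_tokens(retrieved_excerpts), "tokens")
```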
Multimodal Tasks
The case for larger context window sizes is a different one. While you can transform a page of text into roughly 375 tokens, you need vastly more tokens for multimodal content like video or audio.

Let’s assume you speak 150 words per minute; a page then takes more than 3 minutes to read aloud. While that page of text costs you only 375 tokens, you need roughly fifteen times as many tokens for audio (5,760 tokens) and nearly seventy times as many for video (25,248 tokens). The motivation behind bigger context sizes therefore mostly serves the multimodal experience of modern LLMs.
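Taking the per-page figures above at face value (a page corresponds to roughly 3 minutes of speech), a small back-of-the-envelope calculation shows how quickly an hour of content adds up; these are the rough estimates from above, not model-specific token counts:

```python
# Back-of-the-envelope math using the per-page figures above.
# A "page" corresponds to roughly 3 minutes of spoken content.
TOKENS_PER_PAGE = {"text": 375, "audio": 5_760, "video": 25_248}
MINUTES_PER_PAGE = 3

def tokens_for_minutes(minutes: float, tokens_per_page: int) -> int:
    return round(minutes / MINUTES_PER_PAGE * tokens_per_page)

for modality, rate in TOKENS_PER_PAGE.items():
    print(f"60 minutes as {modality}: ~{tokens_for_minutes(60, rate):,} tokens")
```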
Conclusion
Context window sizes have been a limiting factor of LLMs in the past and gave rise to the RAG methodology. With sizes beyond 1M tokens, the context window is no longer the limiting factor for text documents, but that doesn’t mean RAG is obsolete. RAG still offers a practical and scalable solution for leveraging vast amounts of information by intelligently retrieving only the context relevant to a given query. In fact, RAG feels more like a fundamental approach that should be paired with an LLM whenever you prompt for knowledge.
When speaking of multimodal content, the context size is still a limiting factor: a 120-minute movie still consumes something close to 2M tokens, so we’re far from throwing a set of videos at an LLM to ask questions about them (ignoring the computational effort of doing that).
Lastly, Google has very recently released a Recall feature for their Gemini Advanced chat, which allows it to include knowledge from previous chats in newly asked questions. Although they haven’t disclosed the underlying mechanism, which could be either a massive context window or a RAG infrastructure, this already demonstrates the kind of innovation enabled by larger context sizes.