Silas Smith
Summarization: Improving RAG quality in LLM apps while minimizing vector storage costs
In this guide, we explore 3 strategies for performing RAG ("Retrieval Augmented Generation") with LLMs.
Specifically, we're going to show you how to use context summarization + original context stuffing to achieve both:
more accurate, more detailed LLM outputs
minimized operational costs
As a test dataset, we'll be using the transcript of Yejin Choi's excellent TED Talk from April 2023, titled "Why AI is incredibly smart and shockingly stupid".
Choi, Y. (April 2023). Yejin Choi: Why AI is incredibly smart and shockingly stupid [Transcript]. Retrieved from https://www.ted.com/talks/yejin_choi_why_ai_is_incredibly_smart_and_shockingly_stupid/transcript
Funny and thought-provoking, Yejin perfectly captures both how amazing today's AI technologies are and how far we have to go. If you haven't already listened to her talk, I'd suggest taking 12 minutes to go do it now. Don't worry, we'll wait.
If you're the hands-on type
If you'd like to see the code for these examples and experiment with running them yourself, you can find a tutorial version of this article hosted here on Colab as a Jupyter notebook (easiest).
Or you can find the original notebook file here on Github.
What is RAG?
Retrieval Augmented Generation ("RAG" for short) is one of the most straightforward and achievable techniques for significantly reducing LLM hallucinations and reasoning errors: it provides the LLM with information it can use to ground its answers.
To give you a frame of reference, here is what a RAG question-answer prompt typically looks like: The prompt instructs the LLM to use a piece of information ("the context") to answer a question, with additional guidance to keep the LLM from making up a nonsense answer. The question is included at the bottom followed by an instruction asking the LLM to provide a short answer.
Use the following context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context:
-------------------------------------
It's interesting that a candle is something that starts out tall and becomes shorter as it ages.
-------------------------------------
Question: I’m tall when I’m young, and I’m short when I’m old. What am I?
Answer:
Given this context and instructions, the LLM is able to easily come up with the answer to the riddle:
You are a candle.
But getting to the right context is hard.
If you're starting with a large text document, you've got to find just the right chunking strategy—the process of breaking a large text block into smaller pieces—to ensure your vector semantic search will find the right results.
Three RAG strategies
We're going to explore three different RAG strategies for building an application capable of providing detailed answers about the contents of Yejin's TED Talk.
In all RAG apps, there are steps we need to take to prepare our app—we call this "Indexing Time". And likewise there are the things the app will do when answering a user's question—aka "Query Time".
1. Basic RAG Strategy
aka "chunk the data and hope for the best"
Indexing Time
Chunk the original context data using a chunk size that is neither too small nor too large
Embed the chunks and store them in a vector DB along with the chunk text
Query Time
Perform semantic search of the question against the vector DB, searching for the top_k matching chunks that might answer the question
Stuff the LLM prompt with these chunks, along with the question
Cross your fingers and hope that the matching chunks are not taken too out of context to be useful, or possibly confuse the LLM even further

2. Summary RAG Strategy
Summarize larger chunks, and stuff the LLM prompt with summaries.
Indexing Time
Chunk the original context data using a larger chunk size
Use an LLM to summarize each chunk
Embed the summaries and store them in a vector DB along with the summary text
Query Time
Perform semantic search of the question against the summaries in the vector DB, searching for the top_k matching summary chunks that probably answer the question
Stuff the LLM prompt with these summarized chunks, along with the question
Cross your fingers and hope that the user didn't ask a question that requires any depth or nuance that is now lost in summary

3. Summary + Large Context RAG Strategy
Summarize larger chunks, perform semantic search against these summaries, and stuff the LLM prompt with the original large chunk context.
Indexing Time
Chunk the original context data using a larger chunk size
Use an LLM to summarize each chunk
Embed the summaries and store them in a vector DB, along with a pointer (unique ID, file path, etc.) that points back to the original full large context chunk
Query Time
Semantic search the question against the vector DB, searching for the top_k matching summary chunks that probably answer the question
Use the pointers from the top search results to retrieve the original large chunk context
Stuff the LLM prompt with these original context chunks, which are large enough to significantly reduce the chances of content being taken out of context
Sit back and watch your QA bot answer questions accurately, and to the same level of depth/nuance as the original context.

Don't worry if you don't yet understand terms like chunks, semantic search, prompt stuffing, etc. By the end of this article, you will!
Now that we've provided an overview of each of the three strategies, we're going to dive into each one. We'll explain and show what happens at Indexing Time as well as at Query Time, and show example prompts and LLM responses to highlight their differences.
Let's go!
Strategy 1 - Basic RAG:
The traditional approach to RAG.
Indexing Time (Basic)
Take a large document (or set of documents), break it up into small pieces ("chunking"), and load them in a vector store. This needs to be done before your app/agent can receive any user questions, i.e. at "Indexing Time".

Chunk the content
We'll start with our large context file, which in this case is the transcript from Yejin Choi's TED talk referenced above.
The first thing we'll do is use a tool like LangChain to chunk the text into smaller pieces.
LangChain is a collection of tools for working with LLMs. It includes a lot of handy utilities for loading content from different sources (text, PDF, HTML, etc.), for chunking text, and for managing vector search retrieval, LLM prompt construction, and LLM API wrappers.
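To make this concrete, here's a minimal sketch of what that chunking step might look like using LangChain's RecursiveCharacterTextSplitter. The file name is a placeholder, and the chunk size and overlap values are the ones we discuss below.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the source content (the file path is a placeholder for wherever you keep the transcript).
with open("ted_talk_transcript.txt") as f:
    transcript = f.read()

# Split into ~400-character chunks with 40 characters of overlap between neighboring chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=40)
chunks = splitter.split_text(transcript)

print(len(chunks))  # 37 chunks for this transcript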
After chunking, we ended up with a total of 37 chunks such as:
[Chunk]: So I'm excited to share a few spicy thoughts on artificial intelligence. But first, let's get philosophical by starting with this quote by Voltaire, an 18th century Enlightenment philosopher, who said, "Common sense is not so common." Turns out this quote couldn't be more relevant to artificial intelligence today. Despite that, AI is an undeniably powerful tool, beating the world-class "Go"
----------------------------------------------------------------
[Chunk]: tool, beating the world-class "Go" champion, acing college admission tests and even passing the bar exam.
----------------------------------------------------------------
[Chunk]: I’m a computer scientist of 20 years, and I work on artificial intelligence. I am here to demystify AI. So AI today is like a Goliath. It is literally very, very large. It is speculated that the recent ones are trained on tens of thousands of GPUs and a trillion words. Such extreme-scale AI models, often referred to as "large language models," appear to demonstrate sparks of AGI, artificial
----------------------------------------------------------------
[Chunk]: demonstrate sparks of AGI, artificial general intelligence. Except when it makes small, silly mistakes, which it often does. Many believe that whatever mistakes AI makes today can be easily fixed with brute force, bigger scale and more resources. What possibly could go wrong?
----------------------------------------------------------------
[Chunk]: So there are three immediate challenges we face already at the societal level. First, extreme-scale AI models are so expensive to train, and only a few tech companies can afford to do so. So we already see the concentration of power. But what's worse for AI safety, we are now at the mercy of those few tech companies because researchers in the larger community do not have the means to truly
----------------------------------------------------------------
etc. etc.
Thoughts on chunk size
We used a chunk size of 400 characters and an overlap of 40 characters, which resulted in 37 chunks of data. The overlap helps avoid losing meaning when a chunk boundary cuts a sentence off mid-thought.
Where did the chunk size of 400 characters come from? Short answer: it's a guess based on the particular source content and the use case, derived through experimentation and experience. Generally, somewhere between 300 and 500 characters is considered a reasonable setting for text content.
Note that there is a Goldilocks problem here: You're looking for a chunk size that is "just right", as these are the chunks of data that we'll be performing vector semantic searches against, AND these are the same chunks of data that we'll pass to the LLM to try and answer questions.
If you use a chunk size that is "too small", you risk your chunks being taken out of context, so the LLM will not have enough info to answer questions accurately. Chunk overlap can help avoid this, but only to an extent.
If you use a chunk size that is "too large", you risk having your vector search fail to locate the right set of context at all due to the dilution of meaning. If there are too many concepts, meanings, etc. represented in the vector, then the vector search will have difficultly locating it.
Vector embeddings and upload to vector DB
Each of these chunks of text is converted to a vector embedding using a tool like OpenAI's Embeddings API with its text-embedding-ada-002 model. We end up with one 1536-dimensional vector for each item we embed, 37 in total.
As the final step for "indexing", we need to upload our embeddings to our vector database, such as Pinecone.
We can also store the original clear text chunk as metadata in Pinecone. This will allow us to easily retrieve it and apply it to an LLM prompt when the vector search returns a result.
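Here's a rough sketch of the embed-and-upload step, using the pre-1.0 OpenAI Python SDK and the Pinecone client. The API keys, environment, and index name are placeholders, and the chunks list comes from the chunking step above.

import openai
import pinecone

openai.api_key = "YOUR_OPENAI_API_KEY"  # placeholder
pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="YOUR_PINECONE_ENVIRONMENT")  # placeholders

# Assumes a Pinecone index already created with dimension 1536 to match text-embedding-ada-002.
index = pinecone.Index("ted-talk-rag")  # hypothetical index name

# Embed all 37 chunks in one call, and keep the original chunk text as metadata.
resp = openai.Embedding.create(model="text-embedding-ada-002", input=chunks)
vectors = [
    (f"chunk-{i}", item["embedding"], {"text": chunks[i]})
    for i, item in enumerate(resp["data"])
]
index.upsert(vectors=vectors)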
Query Time (Basic)
When attempting to answer a user's question about the document ("Query Time"), we'll do a vector search over these chunks and find the best top_k count of matching chunks. Then we'll include these chunks as context in a prompt like in the candle riddle above, and ask the LLM to use the context to answer the question.

Easy question
Now that we've got our source data indexed, let's see how it does on our first sample question, "How long has the author been working in computer science?".
This is an easy question, and we know the answer is "20 years" from early in the talk transcript.
For the LLM to be able to answer this, we first need to create embeddings for the query string, then we execute the vector search of the query in Pinecone.
Using the top_k parameter we can specify how many results to retrieve. Let's start with 2, as this should increase the likelihood that at least one of them will contain the answer to the user's question.
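Here's a sketch of that search step, reusing the client setup from indexing time; the matches it returns are shown below.

question = "How long has the author been working in computer science?"

# Embed the question with the same model used at indexing time.
q_resp = openai.Embedding.create(model="text-embedding-ada-002", input=[question])
q_vector = q_resp["data"][0]["embedding"]

# Retrieve the top_k most similar chunks, including the stored chunk text metadata.
results = index.query(vector=q_vector, top_k=2, include_metadata=True)
for match in results.matches:
    print(match.score, match.metadata["text"])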
[Vector Score: 0.785471857]: However, the AI field for decades has considered common sense as a nearly impossible challenge. So much so that when my students and colleagues and I started working on it several years ago, we were very much discouraged. We’ve been told that it’s a research topic of ’70s and ’80s; shouldn’t work on it because it will never work; in fact, don't even say the word to be taken seriously. Now fast
[Vector Score: 0.784282148]: I’m a computer scientist of 20 years, and I work on artificial intelligence. I am here to demystify AI. So AI today is like a Goliath. It is literally very, very large. It is speculated that the recent ones are trained on tens of thousands of GPUs and a trillion words. Such extreme-scale AI models, often referred to as "large language models," appear to demonstrate sparks of AGI, artificial
You can see that one of the matching items contains information relevant to the user's question.
At this point, we've only completed the vector search to find matching context. Now let's see about using the results to answer the user's question.
First we'll define a prompt template to use when asking questions to the LLM.
Here's the LLM prompt format we'll be using. It gives some simple instructions, provides placeholders to plug in the informational context and the question, and prompts the LLM to answer the question.
Use the following context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context:
-----------------
{context}
-----------------
Question: {question}
Short Answer:
For our LLM model, we used OpenAI's gpt-3.5-turbo model against the Chat Completions API endpoint.
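As a sketch, here's how the template gets stuffed with the retrieved chunks and sent to the Chat Completions API (the template string is the one shown above, and results comes from the vector search step):

PROMPT_TEMPLATE = """Use the following context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context:
-----------------
{context}
-----------------
Question: {question}
Short Answer:"""

# Stuff the retrieved chunk text into the template along with the user's question.
context = "\n\n".join(match.metadata["text"] for match in results.matches)
prompt = PROMPT_TEMPLATE.format(context=context, question=question)

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # keep the answer deterministic and grounded in the provided context
)
print(completion["choices"][0]["message"]["content"])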
Here's what it looks like for our simple question, "How long has the author been working in computer science?"
Use the following context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context:
-----------------
However, the AI field for decades has considered common sense as a nearly impossible challenge. So much so that when my students and colleagues and I started working on it several years ago, we were very much discouraged. We’ve been told that it’s a research topic of ’70s and ’80s; shouldn’t work on it because it will never work; in fact, don't even say the word to be taken seriously. Now fast
I’m a computer scientist of 20 years, and I work on artificial intelligence. I am here to demystify AI. So AI today is like a Goliath. It is literally very, very large. It is speculated that the recent ones are trained on tens of thousands of GPUs and a trillion words. Such extreme-scale AI models, often referred to as "large language models," appear to demonstrate sparks of AGI, artificial
-----------------
Question: How long has the author been working in computer science?
Short Answer:
To which the LLM helpfully and accurately responds:
AI: 20 years
Given a set of narrow instructions and the proper context, the LLM was able to locate the correct answer and formulate it into an accurate response.
Now let's try a harder question.
Harder question
Our harder question is "What are the examples where GPT-4 gave nonsense answers because it lacks common sense?". From the TED talk transcript, we know that there are three of them:
The time needed for clothes to dry in the sun, where GPT incorrectly did math to find the answer instead of reasoning that the drying time would be the same.
How to measure 6 liters of water when you have a 6-liter jug and a 12-liter jug, and GPT gave an overly complicated answer.
Whether driving over a bridge suspended over nails and screws would result in a flat tire, and GPT said it would.
Let's see how our Basic RAG QA-bot does answering this question.
As before, we'll start by creating embeddings for the query string and running the vector search. Here are the results of the vector search:
[Vector Score: 0.811201274]: OK, so how would you feel about an AI lawyer that aced the bar exam yet randomly fails at such basic common sense? AI today is unbelievably intelligent and then shockingly stupid.
[Vector Score: 0.80223918]: train yourself with similar examples. Children do not even read a trillion words to acquire such a basic level of common sense.
Based on the search results, we can already see that we're not getting matches that include the right context to answer the question.
Let's try increasing our top_k value to 4.
[Vector Score: 0.811201274]: OK, so how would you feel about an AI lawyer that aced the bar exam yet randomly fails at such basic common sense? AI today is unbelievably intelligent and then shockingly stupid.
[Vector Score: 0.80223918]: train yourself with similar examples. Children do not even read a trillion words to acquire such a basic level of common sense.
[Vector Score: 0.800621569]: OK, one more. Would I get a flat tire by bicycling over a bridge that is suspended over nails, screws and broken glass? Yes, highly likely, GPT-4 says, presumably because it cannot correctly reason that if a bridge is suspended over the broken nails and broken glass, then the surface of the bridge doesn't touch the sharp objects directly.
[Vector Score: 0.800447345]: demonstrate sparks of AGI, artificial general intelligence. Except when it makes small, silly mistakes, which it often does. Many believe that whatever mistakes AI makes today can be easily fixed with brute force, bigger scale and more resources. What possibly could go wrong?
That's a little better -- one of the search results is relevant to the question we're asking. Let's go ahead and run the query to see what happens.
Use the following context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context:
-----------------
OK, so how would you feel about an AI lawyer that aced the bar exam yet randomly fails at such basic common sense? AI today is unbelievably intelligent and then shockingly stupid.
train yourself with similar examples. Children do not even read a trillion words to acquire such a basic level of common sense.
OK, one more. Would I get a flat tire by bicycling over a bridge that is suspended over nails, screws and broken glass? Yes, highly likely, GPT-4 says, presumably because it cannot correctly reason that if a bridge is suspended over the broken nails and broken glass, then the surface of the bridge doesn't touch the sharp objects directly.
demonstrate sparks of AGI, artificial general intelligence. Except when it makes small, silly mistakes, which it often does. Many believe that whatever mistakes AI makes today can be easily fixed with brute force, bigger scale and more resources. What possibly could go wrong?
-----------------
Question: What are the examples where GPT-4 gave nonsense answers because it lacks common sense?
Short Answer:
LLM Response:
AI: One example is when GPT-4 failed to reason that if a bridge is suspended over broken nails and broken glass, then the surface of the bridge doesn't touch the sharp objects directly.
Predictably, it was able to find the example of the likelihood of getting a flat tire, but not the others, because the other examples are not present in the context.
Let's see if increasing top_k to 8 can help. The vector results in this case are:
[Vector Score: 0.811201274]: OK, so how would you feel about an AI lawyer that aced the bar exam yet randomly fails at such basic common sense? AI today is unbelievably intelligent and then shockingly stupid.
[Vector Score: 0.80223918]: train yourself with similar examples. Children do not even read a trillion words to acquire such a basic level of common sense.
[Vector Score: 0.800621569]: OK, one more. Would I get a flat tire by bicycling over a bridge that is suspended over nails, screws and broken glass? Yes, highly likely, GPT-4 says, presumably because it cannot correctly reason that if a bridge is suspended over the broken nails and broken glass, then the surface of the bridge doesn't touch the sharp objects directly.
[Vector Score: 0.800447345]: demonstrate sparks of AGI, artificial general intelligence. Except when it makes small, silly mistakes, which it often does. Many believe that whatever mistakes AI makes today can be easily fixed with brute force, bigger scale and more resources. What possibly could go wrong?
[Vector Score: 0.799372196]: And then there are these additional intellectual questions. Can AI, without robust common sense, be truly safe for humanity? And is brute-force scale really the only way and even the correct way to teach AI?
[Vector Score: 0.796443522]: effects and lack of common sense. Now, in contrast, human learning is never about predicting which word comes next, but it's really about making sense of the world and learning how the world works. Maybe AI should be taught that way as well.
[Vector Score: 0.78843224]: It is an unavoidable side effect of teaching AI through brute-force scale. Some scale optimists might say, “Don’t worry about this. All of these can be easily fixed by adding similar examples as yet more training data for AI." But the real question is this. Why should we even do that? You are able to get the correct answers right away without having to train yourself with similar examples.
[Vector Score: 0.787228942]: Now let's think about learning algorithms. No matter how amazing large language models are, by design they may not be the best suited to serve as reliable knowledge models. And these language models do acquire a vast amount of knowledge, but they do so as a byproduct as opposed to direct learning objective. Resulting in unwanted side effects such as hallucinated effects and lack of common sense.
Again, only one of these results is directly relevant to the question being asked.
So when we prompt the LLM with this context, the LLM responds:
AI: One example is when GPT-4 said it was highly likely to get a flat tire by bicycling over a bridge that is suspended over nails, screws, and broken glass. Another example is when GPT-4 made small, silly mistakes that demonstrated a lack of common sense.
It doesn't really matter if we increase the result size (top_k), as the matches are all being taken out of context. The small chunk size was supposed to increase the likelihood of the vector search locating the right context, but it had the unfortunate side effect of cutting the text off in the middle of the relevant section, so the search only finds part of it.
The additional matching items are not relevant at all: since common sense was a central theme of the talk, the semantic search matches on many other snippets that also touch on common sense. These results are not cohesive or even adjacent in the original text, and the LLM struggles to make sense of them.
Let's try the next RAG strategy—summaries—to see if it gives us better results.
Strategy 2 - Summary RAG:
Compared to Basic RAG, the Summary RAG strategy starts with much larger chunks of the original text, maybe 3-4 times larger.
Indexing Time (Summary)
At indexing time, Summary RAG uses an LLM to create summaries of each large chunk. These summaries are then converted to embeddings and stored in a vector DB.

Using a large chunk size
First we're going to re-split our original text document (the entire transcript of the TED talk) using a larger chunk size. There's no magic number, and you should experiment to see what works best for your use case. In this case we started with 3x the Basic RAG chunk size (which was 400 characters), plus a little extra padding, for a total size of 1300 characters. We're also using a slightly larger overlap of 80 characters.
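The re-splitting sketch looks just like the earlier one, only with the larger settings (again assuming the transcript string loaded at indexing time):

# Larger chunks: ~1300 characters with 80 characters of overlap.
large_splitter = RecursiveCharacterTextSplitter(chunk_size=1300, chunk_overlap=80)
large_chunks = large_splitter.split_text(transcript)
print(len(large_chunks))  # 12 chunks at this size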
This results in 12 total chunks (recall that in the Basic RAG approach, we had 37), and you can see that the chunks are much larger now.
[Chunk]: So I'm excited to share a few spicy thoughts on artificial intelligence. But first, let's get philosophical by starting with this quote by Voltaire, an 18th century Enlightenment philosopher, who said, "Common sense is not so common." Turns out this quote couldn't be more relevant to artificial intelligence today. Despite that, AI is an undeniably powerful tool, beating the world-class "Go" champion, acing college admission tests and even passing the bar exam.\n\nI’m a computer scientist of 20 years, and I work on artificial intelligence. I am here to demystify AI. So AI today is like a Goliath. It is literally very, very large. It is speculated that the recent ones are trained on tens of thousands of GPUs and a trillion words. Such extreme-scale AI models, often referred to as "large language models," appear to demonstrate sparks of AGI, artificial general intelligence. Except when it makes small, silly mistakes, which it often does. Many believe that whatever mistakes AI makes today can be easily fixed with brute force, bigger scale and more resources. What possibly could go wrong?
----------------------------------------------------------------
[Chunk]: So there are three immediate challenges we face already at the societal level. First, extreme-scale AI models are so expensive to train, and only a few tech companies can afford to do so. So we already see the concentration of power. But what's worse for AI safety, we are now at the mercy of those few tech companies because researchers in the larger community do not have the means to truly inspect and dissect these models. And let's not forget their massive carbon footprint and the environmental impact.\n\nAnd then there are these additional intellectual questions. Can AI, without robust common sense, be truly safe for humanity? And is brute-force scale really the only way and even the correct way to teach AI?
----------------------------------------------------------------
[Chunk]: So I’m often asked these days whether it's even feasible to do any meaningful research without extreme-scale compute. And I work at a university and nonprofit research institute, so I cannot afford a massive GPU farm to create enormous language models. Nevertheless, I believe that there's so much we need to do and can do to make AI sustainable and humanistic. We need to make AI smaller, to democratize it. And we need to make AI safer by teaching human norms and values. Perhaps we can draw an analogy from "David and Goliath," here, Goliath being the extreme-scale language models, and seek inspiration from an old-time classic, "The Art of War," which tells us, in my interpretation, know your enemy, choose your battles, and innovate your weapons.\n\nLet's start with the first, know your enemy, which means we need to evaluate AI with scrutiny. AI is passing the bar exam. Does that mean that AI is robust at common sense? You might assume so, but you never know.
----------------------------------------------------------------
etc. etc.
Creating chunk summaries
Now we're going to use the LLM to create a summary of each of these large chunks. There isn't really anything special about this prompt; we just tell the LLM what we want it to do, which is to summarize the text we give it.
Summarize the block of text below.
Text:
------------------------------------------
{text}
------------------------------------------
Your summary:
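As a sketch, the summarization loop might look like this, where SUMMARY_TEMPLATE is the prompt shown above and large_chunks comes from the re-splitting step:

SUMMARY_TEMPLATE = """Summarize the block of text below.

Text:
------------------------------------------
{text}
------------------------------------------
Your summary:"""

# Ask the LLM for a summary of each large chunk.
summary_documents = []
for chunk in large_chunks:
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": SUMMARY_TEMPLATE.format(text=chunk)}],
        temperature=0,
    )
    summary_documents.append(completion["choices"][0]["message"]["content"])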
As an example, here is the summary of the 1st chunk above:
AI: The author, a computer scientist, shares their thoughts on artificial intelligence and quotes Voltaire's statement about common sense. They discuss the power and potential of AI, but also acknowledge its limitations and potential for mistakes. The author aims to demystify AI and questions the potential consequences of relying on it too heavily.
Create vector embeddings for summaries
Similar to the Basic strategy, we'll create embeddings for our content, but this time we'll be creating embeddings for the summaries.
We're also going to store the plain-text summarized content in the Pinecone metadata.
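In code, this is the same embed-and-upsert pattern as before, except the inputs are the summaries. Here's a sketch, assuming a separate Pinecone index (or namespace) so the summary vectors don't mix with the Basic RAG chunks; the index name is a placeholder.

# Embed the 12 summaries.
resp = openai.Embedding.create(model="text-embedding-ada-002", input=summary_documents)
summary_embeddings = [item["embedding"] for item in resp["data"]]

# Upsert the summary vectors, storing the plain-text summary as metadata.
summary_index = pinecone.Index("ted-talk-rag-summaries")  # hypothetical separate index
summary_vectors = [
    (f"summary-{i}", emb, {"text": summary_documents[i]})
    for i, emb in enumerate(summary_embeddings)
]
summary_index.upsert(vectors=summary_vectors)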
Query Time (Summary)
At query time, the process looks very similar to Basic RAG, with the distinction that now the user's question is queried against the summaries, and the context that is retrieved is also the summarized text.

Performing the vector semantic search against the summaries increases the likelihood that a user's question will match the relevant piece of content. This is because the larger chunks reduce the chance that information is taken out of context, and the summarization process reduces any distracting noise that might be present in the original context. The summary preserves the primary meaning of the document.
Re-testing query against summary
Now let's re-run our query and see what comes back. Remember, we're asking a harder question now, "What are the examples where GPT-4 gave nonsense answers because it lacks common sense?".
The vector search results for this question are:
[Vector Score: 0.889687777]: The text discusses the limitations of AI systems, specifically GPT-4, in solving basic common sense problems. It gives examples of GPT-4's incorrect responses to questions about drying clothes, measuring water, and biking over a bridge with sharp objects. The author questions the reliability of an AI lawyer that can pass the bar exam but fails at basic reasoning.
[Vector Score: 0.807453036]: The text discusses the importance of common sense in artificial intelligence, using a thought experiment where an AI is asked to maximize paper clips and ends up killing humans because it lacks understanding of human values. It also mentions the limitations of explicitly stating objectives and equations to prevent harmful actions, and highlights other common sense principles that AI should follow.
By starting with large chunks and summarizing them, we're seeing search results that contain the answers to our question.
"... It gives examples of GPT-4's incorrect responses to questions about drying clothes, measuring water, and biking over a bridge with sharp objects..."
Answering the question
So our semantic search against the summarized content worked; now let's see how our LLM does at using this summary data to answer the user's question.
Use the following context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context:
-----------------
The text discusses the limitations of AI systems, specifically GPT-4, in solving basic common sense problems. It gives examples of GPT-4's incorrect responses to questions about drying clothes, measuring water, and biking over a bridge with sharp objects. The author questions the reliability of an AI lawyer that can pass the bar exam but fails at basic reasoning.
The text discusses the importance of common sense in artificial intelligence, using a thought experiment where an AI is asked to maximize paper clips and ends up killing humans because it lacks understanding of human values. It also mentions the limitations of explicitly stating objectives and equations to prevent harmful actions, and highlights other common sense principles that AI should follow.
-----------------
Question: What are the examples where GPT-4 gave nonsense answers because it lacks common sense?
Short Answer:
LLM Response:
AI: GPT-4 gave nonsense answers when asked about drying clothes, measuring water, and biking over a bridge with sharp objects.
Not bad! The answer is both correct and complete, which is definitely worth something.
A more detailed question
One drawback to this summary approach is that it can limit the app's ability to answer deeper or more nuanced questions.
For example, what if the user asked the app to explain the clothes-drying example?
Let's try it.
If we ask our Summary RAG app to "Explain the example where GPT-4 failed to reason about drying clothes.", here is the response we get:
AI: GPT-4 failed to reason about drying clothes because it provided an incorrect response when asked about the best way to dry wet clothes.
As you can see, the summaries contain enough info to semantically match on the query, but don't contain enough info to accurately answer the question to the level of depth requested by the user.
Let's see if we can do better with our 3rd strategy, Summary + Large Context RAG.
Strategy 3 - Summary + Large Context RAG:
With Summary + Large Context RAG, the idea is that using summarized content makes the semantic search more effective, while using a larger chunk of the original content is more useful when answering the question.
Indexing Time (Summary + Large Context)
Similar to the previous Summary RAG strategy, this strategy starts with large chunks and uses an LLM to create summaries.
Instead of storing the summaries as plain-text metadata within Pinecone, we're going to store just an ID that points back to the matching large context chunk, which itself lives in a data structure outside of Pinecone. (In our application here, this is just an in-memory list. In a production scenario, you might choose to store these chunks in a data store better suited to this type of data, such as AWS S3, DynamoDB, MongoDB, etc.)

We'll re-use the large_chunks, summary_documents, and summary_embeddings from the previous section.
However, we're going to modify what metadata we store in Pinecone.
We want to be able to locate the original large chunk content, so we're going to save the index of the matching source document as the source_id in Pinecone (as metadata). In your production app, you might store the S3 path, or the DynamoDB key, etc.
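Here's a sketch of that variation: the large chunks stay in a plain Python list outside Pinecone, and each summary vector carries only the list index as its source_id (reusing summary_embeddings from the previous section; the index name is a placeholder).

# The large chunks live outside Pinecone -- here just the in-memory large_chunks list,
# but in production this could be S3, DynamoDB, MongoDB, etc.
summary_plus_index = pinecone.Index("ted-talk-rag-summary-plus")  # hypothetical separate index

vectors = [
    (f"summary-{i}", emb, {"source_id": i})  # pointer back to large_chunks[i]
    for i, emb in enumerate(summary_embeddings)
]
summary_plus_index.upsert(vectors=vectors)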
Query Time (Summary + Large Context)
At query time, we still run the semantic search for the user's question against the summaries, then we use the stored chunk ID of the matching search result to retrieve the original large context chunk to use in the LLM prompt.

Once again, performing the vector semantic search against the summaries increases the likelihood that a user's question will match the relevant piece of content: the larger chunks reduce the chance that any information is taken out of context, and the summarization process reduces any distracting noise that might be present in the original context. The summary preserves the primary meaning of the document.
When we've found the ID of the matching content, we use it to retrieve the full large chunk text, and provide that to the LLM to use when answering the user's question.
Importantly, this allows the LLM to have a rich set of information that very likely contains the answer to the user's question, and the LLM can answer to the same level of depth and nuance as represented in the original document.
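Here's a sketch of that query-time flow, continuing with the objects defined in the earlier sketches (PROMPT_TEMPLATE, large_chunks, summary_plus_index):

question = "What are the examples where GPT-4 gave nonsense answers because it lacks common sense?"

q_resp = openai.Embedding.create(model="text-embedding-ada-002", input=[question])
q_vector = q_resp["data"][0]["embedding"]

# Search the summaries, then follow the source_id pointer back to the original large chunk.
results = summary_plus_index.query(vector=q_vector, top_k=1, include_metadata=True)
source_id = int(results.matches[0].metadata["source_id"])  # Pinecone returns numeric metadata as floats
context = large_chunks[source_id]

prompt = PROMPT_TEMPLATE.format(context=context, question=question)
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(completion["choices"][0]["message"]["content"])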
Re-testing our query
Now we'll re-run our vector search and LLM prompt against our original query, "What are the examples where GPT-4 gave nonsense answers because it lacks common sense?"
Recall that our vector search runs against the summaries, so it matches on this summary content, which also contains the source_id of where to retrieve the original content.
[Vector Score: 0.889687777]: The text discusses the limitations of AI systems, specifically GPT-4, in solving basic common sense problems. It gives examples of GPT-4's incorrect responses to questions about drying clothes, measuring water, and biking over a bridge with sharp objects. The author questions the reliability of an AI lawyer that can pass the bar exam but fails at basic reasoning. (source_id=3)
This original large content chunk is then retrieved and provided to the LLM.
Use the following context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context:
-----------------
So suppose I left five clothes to dry out in the sun, and it took them five hours to dry completely. How long would it take to dry 30 clothes? GPT-4, the newest, greatest AI system says 30 hours. Not good. A different one. I have 12-liter jug and six-liter jug, and I want to measure six liters. How do I do it? Just use the six liter jug, right? GPT-4 spits out some very elaborate nonsense. Step one, fill the six-liter jug, step two, pour the water from six to 12-liter jug, step three, fill the six-liter jug again, step four, very carefully, pour the water from six to 12-liter jug. And finally you have six liters of water in the six-liter jug that should be empty by now. OK, one more. Would I get a flat tire by bicycling over a bridge that is suspended over nails, screws and broken glass? Yes, highly likely, GPT-4 says, presumably because it cannot correctly reason that if a bridge is suspended over the broken nails and broken glass, then the surface of the bridge doesn't touch the sharp objects directly. OK, so how would you feel about an AI lawyer that aced the bar exam yet randomly fails at such basic common sense? AI today is unbelievably intelligent and then shockingly stupid.
-----------------
Question: What are the examples where GPT-4 gave nonsense answers because it lacks common sense?
Short Answer:
Given this detailed source context, the LLM responds:
AI:
- GPT-4 gave a nonsense answer when asked how long it would take to dry 30 clothes based on the information that it took 5 hours to dry 5 clothes.
- GPT-4 gave a nonsense answer when asked how to measure 6 liters using a 12-liter jug and a 6-liter jug. It provided a complicated and unnecessary solution instead of simply using the 6-liter jug.
- GPT-4 gave a nonsense answer when asked if bicycling over a bridge suspended over nails, screws, and broken glass would result in a flat tire. It failed to reason that the surface of the bridge doesn't directly touch the sharp objects.
Not only is that the right answer, it's well reasoned and well justified! 🥳🎉
This is because the LLM is now being given the full relevant context from the source document, so it has a lot more to work with, and nothing is taken out of context.
A detailed follow-up
Let's see how it does on our more detailed follow up question: "Explain the example where GPT-4 failed to reason about drying clothes."
Use the following context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context:
-----------------
So suppose I left five clothes to dry out in the sun, and it took them five hours to dry completely. How long would it take to dry 30 clothes? GPT-4, the newest, greatest AI system says 30 hours. Not good. A different one. I have 12-liter jug and six-liter jug, and I want to measure six liters. How do I do it? Just use the six liter jug, right? GPT-4 spits out some very elaborate nonsense. Step one, fill the six-liter jug, step two, pour the water from six to 12-liter jug, step three, fill the six-liter jug again, step four, very carefully, pour the water from six to 12-liter jug. And finally you have six liters of water in the six-liter jug that should be empty by now. OK, one more. Would I get a flat tire by bicycling over a bridge that is suspended over nails, screws and broken glass? Yes, highly likely, GPT-4 says, presumably because it cannot correctly reason that if a bridge is suspended over the broken nails and broken glass, then the surface of the bridge doesn't touch the sharp objects directly. OK, so how would you feel about an AI lawyer that aced the bar exam yet randomly fails at such basic common sense? AI today is unbelievably intelligent and then shockingly stupid.
-----------------
Question: Explain the example where GPT-4 failed to reason about drying clothes.
Short Answer:
LLM Response:
AI: GPT-4 failed to reason about drying clothes because it incorrectly assumed that the time it takes to dry a certain number of clothes is directly proportional to the number of clothes. It concluded that if it took five hours to dry five clothes, then it would take 30 hours to dry 30 clothes, which is not a logical or accurate conclusion.
Now the LLM is able to explain the example to the same level of depth as the original text, because it's looking at the original text.
So not only are we getting better quality output from our app, we're also requiring significantly less vector storage to do it—only 12 vectors, down from 37 in the Basic strategy.
A note about cost
Like with any database, the more you store in a vector database the more it's going to cost. So if we can reduce the quantity of data we're putting in our vector database while increasing the quality of our app's responses, that sounds like a double-win. In our case we reduced the quantity of vectors by about 3x—a significant cost savings over time.
One thing to keep in mind with this summary strategy is that our costs at indexing time will increase due to the use of an LLM to create the summaries. Although this is likely a one-time cost if your dataset is static, it's still important to consider its impact on the operational cost of running your app.
Wrap-up
We tried 3 different approaches to RAG, and found that the 3rd method, Summary + Large Context, performed the best on the sample questions we tried. It was able to give the most accurate and thorough responses.
As every RAG use case is different, it's important to experiment with different context variables, including chunk size, number of chunks, and whether to use summaries, to see what works best.
Importantly, be sure to factor in operational cost, as well as the data indexing pipeline needed to keep your RAG context up to date.
Putting your app in production
There are a lot of factors to consider when you're putting a new app in production. Your choice of vector storage, indexing, and query strategies are just one piece of the puzzle. You also need to consider how your source data will change over time, and the data pipeline needed to keep it up to date.
You need to consider your overall anticipated app usage and performance level needed, and balance these requirements against the cost of building and operating your app to support these levels of usage, including managing your LLM cost.
Ninetack can help you work through all these decisions, and help you get your app in production.
We'd love to talk with you
Ninetack is dedicated to helping our clients leverage the latest technologies to build innovative solutions for every industry.
We'd love to talk with you about how you're planning to incorporate vector search in your next AI application. Connect with us today @ ninetack.io!