Building Complex RAG Systems
During the development of Petals, we've been battling the tradeoff between maximising context windows and building complex retrieval augmented generation (RAG) systems. As frontier models continue to improve, context windows are becoming wider and models are using them more consistently.
Recently, Nick Baumann at Cline published a blog post called 'Why Cline doesn't index your codebase (and why that's a good thing)', which attracted a lot of attention on social media - it's worth reading in full. Nick outlined how, unlike Cursor, Cline opts to be more selective with retrieval - leaning on large context windows rather than indexing your codebase locally and performing RAG against it. Given their use case, this seems a reasonable decision, particularly considering their argument that codebases can't be chunked in a way that effectively maintains relevant context. There is, of course, a tradeoff with this method: increased cost and token throughput - though we can expect both to decrease with time.
While the 'RAG is dead' rhetoric is becoming increasingly popular, particularly following the release of Google's Gemini 2.5 Pro (with its 1 million token context window), we at Petals strongly believe that RAG remains an important building block for scalable, context-aware AI systems. However, the question still remains: what information does an LLM need to be given, and how, to produce relevant, context-aware responses?
The Problem
I've long said that the ability to utilise LLMs for certain problems in your application hinges strongly on your ability to guide users into providing relevant context around the problem you're trying to solve. This introduces UI/UX challenges - some creativity is needed to gather this additional context effectively.
This has been an issue we've faced when designing features that require a high level of accuracy. You could one-shot an LLM and pray that it returns the response you're looking for, or you could design a system that leaves the LLM the smallest possible margin (if any) for inaccuracy.
At Petals, we've built agents that query a user's system data via third-party integrations or existing warehouses and databases. The idea is that a user, technical or not, can retrieve information with a simple text prompt in minutes, instead of the hours, days or even weeks it can take in some cases.
[ VIDEO DEMO ]
However, as mentioned, the need for high accuracy here is paramount. This raises some technical challenges: users may have custom definitions of data or tables that don't align with the agent's understanding; the agent might misinterpret the user's request and/or make incorrect assumptions; and, depending on your system prompt, the agent will often perform expensive 'safety checks' to ensure it has a) written correct SQL, b) retrieved the correct data and c) actually understood the schema.
[ DIAGRAM ]
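To make the cost of those checks concrete, here's a rough sketch of the kind of naive loop described above, written in TypeScript against the OpenAI SDK. Everything in it - the prompts, the runReadOnlyQuery helper and the flow itself - is illustrative rather than our production agent.

import OpenAI from "openai";

const client = new OpenAI();

// Hypothetical helper that executes SQL against the user's warehouse.
declare function runReadOnlyQuery(sql: string): Promise<unknown[]>;

async function ask(prompt: string): Promise<string> {
  const res = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
  });
  return res.choices[0].message.content ?? "";
}

export async function answerQuestion(question: string, rawSchema: string): Promise<string> {
  // Write SQL from a raw schema dump.
  const sql = await ask(`Schema:\n${rawSchema}\n\nWrite a SQL query to answer: ${question}`);

  // 'Safety check' (a): ask the model to review its own SQL against the schema.
  const review = await ask(`Schema:\n${rawSchema}\n\nDoes this SQL answer "${question}" correctly?\n${sql}`);

  // Execute, then checks (b) and (c): confirm the rows answer the question and the schema was understood.
  const rows = await runReadOnlyQuery(sql);
  return ask(
    `Question: ${question}\nReview: ${review}\nRows: ${JSON.stringify(rows).slice(0, 4000)}\n` +
      `Answer the question, or explain why the data does not support an answer.`
  );
}

Every step re-sends the schema and the intermediate outputs, which is exactly where the kind of per-query token spend described below comes from.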

This becomes particularly frustrating for a user who just wants to interact with their data. On average, with gpt-4o, we were spending 60,000 tokens per query to return an answer the user deemed 'useful' - not only laughably poor in terms of latency, but also unsustainably expensive. To fix this, we created a set of definitions describing tables and their functions for effective retrieval and analysis.
Context Engineering
When designing an abstract mechanism for providing data warehouse context, we had to consider what the user's data was actually for and how it was being used. Bear in mind that we were working with companies that specifically had not yet built out their data function and usually had no 'data people' in their organisation, making this problem particularly difficult to solve on a case-by-case basis.
In theory, a single document - which we call LLMS-DB.txt - would sit on top of the user's warehouse, wrapping the entire schema while remaining abstract enough to direct an agent deeper.
[ DIAGRAM ]

We're focusing on the lowest layer because the LLMS-DB.txt document offers the greatest performance improvement and is agent and use-case agnostic.
The documents are typically around 5,000-10,000 tokens in size, and are semantically chunked and embedded as 1,024-dimension vectors for cost-effective retrieval. They're structured as follows:
[ table_name ]
[ table_description_short ]
[ columns ] : [ column_purpose ]
[ example_sql ]
[ example_sample ]
[ foreign_keys ]
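For illustration, a single entry for a hypothetical orders table might look something like this (the table, columns and values are invented for this post):

[ orders ]
[ One row per customer order, including cancelled orders; monetary amounts are stored in pence ]
[ id ] : [ Primary key, unique order identifier ]
[ customer_id ] : [ The customer who placed the order ]
[ status ] : [ One of 'pending', 'paid', 'cancelled', 'refunded' ]
[ total_pence ] : [ Order total in pence, before refunds ]
[ created_at ] : [ UTC timestamp when the order was placed ]
[ example_sql: SELECT status, COUNT(*) FROM orders GROUP BY status ]
[ example_sample: id=1042, customer_id=87, status='paid', total_pence=4599 ]
[ foreign_keys: orders.customer_id -> customers.id ]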
This design is the result of testing multiple formats for performance: the document maintains the relevant details while staying short enough not to saturate an agent's context window. With this implementation in place, we had to find the best way to actually get this data from a user's warehouse - without requiring a data engineer to fill it in themselves, but still giving them the power to tweak or enhance it if needed.
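As a rough sketch of the semantic chunking mentioned above, the document can be split into one chunk per table entry and embedded as 1,024-dimension vectors. The model choice, the blank-line split and the in-memory return value are all assumptions for illustration, not a description of our pipeline.

import OpenAI from "openai";

const client = new OpenAI();

export async function embedLlmsDb(doc: string): Promise<{ chunk: string; vector: number[] }[]> {
  // Assume one blank line separates each table entry in the document.
  const chunks = doc.split(/\n\s*\n/).map((c) => c.trim()).filter(Boolean);

  const res = await client.embeddings.create({
    model: "text-embedding-3-large",
    input: chunks,
    dimensions: 1024, // shorter vectors keep storage and similarity search cheap
  });

  return res.data.map((d, i) => ({ chunk: chunks[i], vector: d.embedding }));
}

The resulting vectors would then be upserted into whatever vector store backs the retrieval step, so the agent only ever pulls the handful of table entries relevant to a given question.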
Prompt Engineering
Perhaps the simplest way to increase the performance of your agent is to tighten the system prompt you give it. Often this means defining the problem scope more clearly. For example:
[ PETALS CHAT ANALYSIS ]
This is an example of a poorly designed system prompt: it doesn't clearly or succinctly outline the role of the agent. With this prompt we saw accuracy of [ ACCURACY ]. While the role has been described, the windows for error are very wide, and even trying to evaluate the generated outputs is futile - the agent's scope is simply too broad to produce an output worth noting.
Ensuring that our agents were locked onto the task at hand was of utmost importance to us — despite most of our agents not being directly addressable by a user.
When designing the system prompt for the agent responsible for generating an LLMS-DB.txt entry for each table in the user's warehouse, we had to ensure that every output followed our desired specification, with low levels of variance and randomness.
We took a lot from Anthropic's guide to prompt engineering [ Anthropic's system prompt design ], following their recommended structure and opting for XML tags to describe the agent's purpose, task, example, context and output structure.
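As a rough illustration (the wording and the {{ ... }} placeholders are ours for this post, not the exact production prompt), the skeleton looks something like:

<purpose>You generate a single LLMS-DB.txt entry for one table in a data warehouse.</purpose>
<task>Using the raw schema and sample rows in <context>, produce one entry that follows the structure in <output> exactly.</task>
<context>{{ raw_schema_and_sample_rows }}</context>
<example>{{ one_worked_entry_for_a_similar_table }}</example>
<output>
[ table_name ]
[ table_description_short ]
[ columns ] : [ column_purpose ]
[ example_sql ]
[ example_sample ]
[ foreign_keys ]
</output>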
Part of testing whether this was working correctly was utilising the agent evaluations Mastra provides as a neat API (Mastra.evaluate_agent_performance). We designed evaluations to measure the relevance of agent outputs as well as their semantic correctness. This was a trial-and-error process, but it garnered useful results and helped us tighten our prompt to the point where 99.8% of LLMS-DB.txt outputs are now desirable.
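To give a feel for the shape of these checks, here is a generic LLM-as-judge sketch - not Mastra's actual evaluation API - with an invented scoring prompt and threshold.

import OpenAI from "openai";

const client = new OpenAI();

interface EvalCase {
  rawSchema: string; // what the generation agent was given
  output: string;    // the LLMS-DB.txt entry it produced
}

async function scoreEntry(c: EvalCase): Promise<number> {
  const res = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{
      role: "user",
      content:
        `Schema:\n${c.rawSchema}\n\nGenerated entry:\n${c.output}\n\n` +
        `On a scale of 0 to 1, how faithfully and relevantly does the entry describe this schema? Reply with the number only.`,
    }],
  });
  return parseFloat(res.choices[0].message.content ?? "0");
}

export async function runEvals(cases: EvalCase[], threshold = 0.9): Promise<void> {
  const scores = await Promise.all(cases.map(scoreEntry));
  const passed = scores.filter((s) => s >= threshold).length;
  console.log(`${passed}/${cases.length} entries at or above ${threshold}`);
}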