
Evaluating RAG Agents

This cookbook shows a minimal RAG evaluation setup using a mocked in-memory knowledge base and the current Patronus Python SDK.

Setup

Install dependencies:

pip install patronus openai

Set environment variables:

export PATRONUS_API_KEY=<YOUR_API_KEY>
export OPENAI_API_KEY=<YOUR_OPENAI_KEY>
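
Optionally, a quick check from Python that both variables are visible to your process before running anything:

import os

for var in ("PATRONUS_API_KEY", "OPENAI_API_KEY"):
    assert os.environ.get(var), f"{var} is not set"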

1. Create a Dataset

Use Patronus dataset fields (task_input, gold_answer) so evaluators work out of the box.

dataset = [
    {
        "task_input": "What historical event did President Biden reference at the beginning of his 2024 State of the Union address?",
        "gold_answer": "President Biden referenced Franklin Roosevelt's address to Congress in 1941 during World War II.",
    },
    {
        "task_input": "How did President Biden describe the current threat to democracy in his 2024 State of the Union address?",
        "gold_answer": "He said the threat to democracy was unprecedented since the Civil War.",
    },
    {
        "task_input": "What stance did President Biden take on NATO in his 2024 State of the Union address?",
        "gold_answer": "He said NATO is stronger than ever and reaffirmed support for the alliance.",
    },
    {
        "task_input": "What is President Biden's message regarding assistance to Ukraine?",
        "gold_answer": "He urged continued support for Ukraine and said Ukraine can stop Putin with help.",
    },
    {
        "task_input": "What did President Biden propose regarding the cost of insulin?",
        "gold_answer": "He proposed capping insulin at $35 per month for all Americans.",
    },
]
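
The list above is defined inline for clarity. If you keep questions in a file, the same structure can be built with the standard library. A minimal sketch, assuming a hypothetical questions.jsonl with one object per line containing task_input and gold_answer:

import json

# Hypothetical file: one JSON object per line with task_input and gold_answer keys.
with open("questions.jsonl") as f:
    dataset = [json.loads(line) for line in f if line.strip()]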

2. Mock a Knowledge Base

Instead of using a vector database, create a simple list of documents and a tiny retriever.

knowledge_base = [
    {
        "id": "kb_1",
        "title": "SOTU 2024 - Opening",
        "content": "President Biden referenced Franklin Roosevelt's 1941 wartime address in his opening remarks.",
    },
    {
        "id": "kb_2",
        "title": "SOTU 2024 - Democracy",
        "content": "He described today's threat to democracy as unprecedented since the Civil War.",
    },
    {
        "id": "kb_3",
        "title": "SOTU 2024 - NATO",
        "content": "He said NATO is stronger than ever with expanded membership and unity.",
    },
    {
        "id": "kb_4",
        "title": "SOTU 2024 - Ukraine",
        "content": "He urged Congress to continue aid, arguing Ukraine can stop Putin with support.",
    },
    {
        "id": "kb_5",
        "title": "SOTU 2024 - Insulin",
        "content": "He called for a $35 monthly insulin cap for all Americans.",
    },
]
 
 
def retrieve_context(query: str, top_k: int = 2) -> list[dict]:
    # Naive keyword retriever: rank documents by word overlap with the query.
    query_terms = set(query.lower().split())

    scored = []
    for doc in knowledge_base:
        # Score each document by how many lowercase words it shares with the query.
        doc_terms = set(doc["content"].lower().split())
        overlap = len(query_terms.intersection(doc_terms))
        scored.append((overlap, doc))

    # Highest overlap first; drop documents that share no words at all.
    scored.sort(key=lambda x: x[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]
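
Before wiring the retriever into a task, you can sanity-check it on one of the dataset questions. The results depend entirely on the naive word-overlap scoring above, which is easily skewed by stopwords and punctuation; surfacing that weakness is part of what the evaluation below is for.

hits = retrieve_context(dataset[0]["task_input"])
for doc in hits:
    print(doc["id"], "-", doc["title"])
# With the mock KB above, the opening-remarks document should score highest for this question.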

3. Define the RAG Task

Wrap retrieval and generation in a task function that takes a Row, passes the retrieved context into the model prompt, and returns a TaskResult carrying the output and the context chunks.

from openai import OpenAI
from patronus import init
from patronus.datasets import Row
from patronus.experiments.types import TaskResult
 
init()  # uses PATRONUS_API_KEY
oai = OpenAI()
 
 
def build_rag_task(model_name: str):
    def rag_task(row: Row, **kwargs) -> TaskResult:
        retrieved_docs = retrieve_context(row.task_input, top_k=2)
        context_chunks = [d["content"] for d in retrieved_docs]
 
        context_text = "\n".join([f"- {c}" for c in context_chunks]) or "No context found."
 
        response = oai.chat.completions.create(
            model=model_name,
            temperature=0,
            messages=[
                {
                    "role": "system",
                    "content": "You are a RAG assistant. Answer using only the provided context. If context is insufficient, say so.",
                },
                {
                    "role": "user",
                    "content": f"Question: {row.task_input}\n\nRetrieved context:\n{context_text}",
                },
            ],
        )
 
        output = response.choices[0].message.content
 
        return TaskResult(
            output=output,
            context=context_chunks,
            metadata={"model": model_name, "retriever": "word-overlap", "top_k": 2},
            tags={"kb_type": "mock_in_memory"},
        )
 
    return rag_task
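
To spot-check a single call before running the full experiment, you can invoke the task directly. The experiment runner normally constructs Row objects from the dataset for you; the SimpleNamespace stand-in below is an ad-hoc testing shortcut (an illustration, not an SDK type), since the task only reads row.task_input.

from types import SimpleNamespace

# Minimal stand-in exposing the one attribute the task reads.
sample = SimpleNamespace(task_input=dataset[0]["task_input"])
result = build_rag_task("gpt-4o-mini")(sample)
print(result.output)
print(result.context)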

4. Run an Experiment

Run two remote evaluators: fuzzy-match scores answer correctness against gold_answer, and Lynx flags hallucinations, i.e. output that is not grounded in the retrieved context.

from patronus.evals import RemoteEvaluator
from patronus.experiments import run_experiment
 
fuzzy_match = RemoteEvaluator("judge", "patronus:fuzzy-match")
hallucination = RemoteEvaluator("lynx", "patronus:hallucination")
 
experiment = run_experiment(
    dataset=dataset,
    task=build_rag_task("gpt-4o-mini"),
    evaluators=[fuzzy_match, hallucination],
    tags={"dataset_name": "state-of-the-union-questions", "model": "gpt-4o-mini"},
    project_name="Cookbooks",
    experiment_name="RAG Mock KB - gpt-4o-mini",
)
 
print(experiment.summary())

5. Compare Model Variants

Re-run the same dataset and evaluators with a different model to compare results side by side.

experiment_gpt4o = run_experiment(
    dataset=dataset,
    task=build_rag_task("gpt-4o"),
    evaluators=[fuzzy_match, hallucination],
    tags={"dataset_name": "state-of-the-union-questions", "model": "gpt-4o"},
    project_name="Cookbooks",
    experiment_name="RAG Mock KB - gpt-4o",
)
 
print(experiment_gpt4o.summary())

You can compare both runs in the Experiments UI and inspect row-level failures to improve retrieval logic or prompts.
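
If row-level results show retrieval misses, a small amount of normalization is an easy first iteration on the word-overlap retriever. A minimal sketch, with an arbitrary stopword list chosen for illustration:

import re

STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "for", "did", "what", "how", "his", "he", "is", "and", "with"}


def normalize(text: str) -> set[str]:
    # Lowercase, strip punctuation, and drop stopwords before computing overlap.
    tokens = re.findall(r"[a-z0-9$]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def retrieve_context_v2(query: str, top_k: int = 2) -> list[dict]:
    query_terms = normalize(query)
    scored = [(len(query_terms & normalize(doc["content"])), doc) for doc in knowledge_base]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

Swap it into the task, re-run the experiment, and compare the fuzzy-match and hallucination scores across runs.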
