
Run an Experiment (Python)

Learn how to run experiments with the Patronus Python SDK

Note: For comprehensive API documentation and more detailed examples, please refer to the Patronus Python SDK documentation.

Before you can start using the Patronus evaluation framework, you'll need to create an account here.

Additionally, you'll need an API Key. After signing in to the platform, you can generate one here.

Install Patronus SDK

To start using Experiments, you'll need to have Python 3.8 or higher installed on your machine. To install the Patronus library:

pip install patronus

Write Your First Evaluation Script

Below is a simple "Hello World" example of how to use Patronus to evaluate a model's output.

import os
from patronus.evals import RemoteEvaluator
from patronus.experiments import run_experiment
 
# Define a simple task function that processes each dataset row
def my_task(row, **kwargs):
    return f"{row.task_input} World"
 
# Run the experiment with a remote evaluator
experiment = run_experiment(
    dataset=[
        {
            "task_input": "Hello",
            "gold_answer": "Hello World"
        }
    ],
    task=my_task,
    evaluators=[
        RemoteEvaluator("judge", "patronus:fuzzy-match")
    ],
    project_name="Tutorial Project",
    experiment_name="Hello World Experiment",
    # You can pass API key directly if not set via environment variable
    api_key=os.environ.get("PATRONUS_API_KEY")
)

Explanation of the script

  • The my_task function processes each row in the dataset. It takes a row parameter (and additional keyword arguments) and returns a string result.

  • The run_experiment() function brings everything together:

    • dataset: A list of examples to process
    • task: The function that processes each example
    • evaluators: A list of evaluators to assess the outputs (must be structured evaluators or adapted evaluators)
    • project_name and experiment_name: Help organize your experiments in the Patronus platform
    • api_key: Can be passed directly to the function if not set via environment variable

Note that unlike other Patronus SDK functions, the experiment framework does not require an explicit patronus.init() call.
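If your API key is already available as the PATRONUS_API_KEY environment variable (you will set it in the next section), a minimal sketch of the same experiment can drop the api_key argument; here it is assumed that the SDK picks the key up from the environment:

from patronus.evals import RemoteEvaluator
from patronus.experiments import run_experiment
 
# Assumes PATRONUS_API_KEY is set in the environment,
# so no api_key argument is passed to run_experiment().
experiment = run_experiment(
    dataset=[{"task_input": "Hello", "gold_answer": "Hello World"}],
    task=lambda row, **kwargs: f"{row.task_input} World",
    evaluators=[RemoteEvaluator("judge", "patronus:fuzzy-match")],
    project_name="Tutorial Project",
    experiment_name="Hello World Experiment",
)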

Running the script

Before you run the script, provide your API key as an environment variable:

export PATRONUS_API_KEY="your_api_key_here"

Now you can execute the Python file:

python hello_world_evaluation.py

The output should look similar to this:

==================================
Experiment  Tutorial-Project/root-1729600247: 100%|██████████| 1/1 [00:00<00:00, 507.91sample/s]
 
patronus:fuzzy-match (judge)
---------------------------
Count     : 1
Pass rate : 1
Mean      : 0.95
Min       : 0.95
25%       : 0.95
50%       : 0.95
75%       : 0.95
Max       : 0.95
 
Score distribution
Score Range          Count      Histogram
0.00 - 0.20          0          
0.20 - 0.40          0          
0.40 - 0.60          0          
0.60 - 0.80          0          
0.80 - 1.00          1          ####################
 
https://app.patronus.ai/experiments/111247673728424740

You'll also be able to see the results of your evaluation in the Patronus Platform UI through the provided link.

A More Comprehensive Example

Let's create a more realistic example that evaluates a RAG (Retrieval-Augmented Generation) system using OpenAI's API.

First, install the required packages:

pip install pandas openai openinference-instrumentation-openai

Then create the evaluation script:
from typing import Optional
import os
 
from patronus.evals import evaluator, RemoteEvaluator, EvaluationResult, StructuredEvaluator
from patronus.experiments import run_experiment, FuncEvaluatorAdapter, Row, TaskResult
from openai import OpenAI
from openinference.instrumentation.openai import OpenAIInstrumentor
 
# Initialize OpenAI client
oai = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
 
# Option 1: Define a structured evaluator for content matching
class KeyTermsEvaluator(StructuredEvaluator):
    def evaluate(self, row: Row, task_result: TaskResult, **kwargs) -> Optional[EvaluationResult]:
        if not row.gold_answer or not task_result:
            return None
 
        gold_answer = row.gold_answer.lower()
        response = task_result.output.lower()
 
        key_terms = [term.strip() for term in gold_answer.split(',')]
        matches = sum(1 for term in key_terms if term in response)
        match_ratio = matches / len(key_terms) if key_terms else 0
 
        # Return a score between 0-1 indicating match quality
        return EvaluationResult(
            pass_=match_ratio > 0.7,
            score=match_ratio,
            explanation=f"Found {matches}/{len(key_terms)} key terms in the response."
        )
 
# Option 2: Define a function-based evaluator (must be wrapped for experiments)
@evaluator()
def fuzzy_match(row: Row, task_result: TaskResult, **kwargs) -> Optional[EvaluationResult]:
    if not row.gold_answer or not task_result:
        return None
 
    gold_answer = row.gold_answer.lower()
    response = task_result.output.lower()
 
    key_terms = [term.strip() for term in gold_answer.split(',')]
    matches = sum(1 for term in key_terms if term in response)
    match_ratio = matches / len(key_terms) if key_terms else 0
 
    # Return a score between 0-1 indicating match quality
    return EvaluationResult(
        pass_=match_ratio > 0.7,
        score=match_ratio,
    )
 
# Define a task that calls the OpenAI API
def rag_task(row, **kwargs):
    # In a real RAG system, this would retrieve context before calling the LLM
    prompt = f"""
    Based on the following context, answer the question.
 
    Context:
    {row.task_context}
 
    Question: {row.task_input}
 
    Answer:
    """
 
    # Call OpenAI to generate a response
    response = oai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You are a helpful assistant that answers questions based only on the provided context."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3,
        max_tokens=150
    )
 
    return response.choices[0].message.content
 
# Create a test dataset
test_data = [
    {
        "task_input": "What is the main impact of climate change on coral reefs?",
        "task_context": """
        Climate change affects coral reefs through several mechanisms. Rising sea temperatures can cause coral bleaching,
        where corals expel their symbiotic algae and turn white, often leading to death. Ocean acidification, caused by
        increased CO2 absorption, makes it harder for corals to build their calcium carbonate structures. Sea level rise
        can reduce light availability for photosynthesis. More frequent and intense storms damage reef structures. The
        combination of these stressors is devastating to coral reef ecosystems worldwide.
        """,
        "gold_answer": "coral bleaching, ocean acidification, reduced calcification, habitat destruction"
    },
    {
        "task_input": "How do quantum computers differ from classical computers?",
        "task_context": """
        Classical computers process information in bits (0s and 1s), while quantum computers use quantum bits or qubits.
        Qubits can exist in multiple states simultaneously thanks to superposition, allowing quantum computers to process
        vast amounts of information in parallel. Quantum entanglement enables qubits to be correlated in ways impossible
        for classical bits. While classical computers excel at everyday tasks, quantum computers potentially have advantages
        for specific problems like cryptography, simulation of quantum systems, and certain optimization tasks. However,
        quantum computers face significant challenges including qubit stability, error correction, and scaling up to useful sizes.
        """,
        "gold_answer": "qubits instead of bits, superposition, entanglement, parallel processing"
    }
]
 
# Set up evaluators - using both approaches
evaluators = [
    KeyTermsEvaluator(),  # Structured evaluator
    FuncEvaluatorAdapter(fuzzy_match),  # Adapted function evaluator
    RemoteEvaluator("answer-relevance", "patronus:answer-relevance")  # Remote evaluator
]
 
# Run the experiment with OpenInference instrumentation
print("Running RAG evaluation experiment...")
experiment = run_experiment(
    dataset=test_data,
    task=rag_task,
    evaluators=evaluators,
    tags={"system": "rag-prototype", "model": "gpt-3.5-turbo"},
    integrations=[OpenAIInstrumentor()]
)
 
# Export results to dataframe for analysis
df = experiment.to_dataframe()
print(f"\nAverage key term match score: {df['KeyTermsEvaluator.score'].mean():.2f}")

This more comprehensive example demonstrates:

  1. Creating evaluators in two ways:

    • A proper structured evaluator by extending the StructuredEvaluator class
    • A function-based evaluator adapted for use in experiments with FuncEvaluatorAdapter
  2. Using a real RAG task that leverages the OpenAI API

  3. Setting up a realistic dataset with context and expected answers

  4. Combining custom evaluators with Patronus remote evaluators

  5. Adding instrumentation to capture details of OpenAI API calls

  6. Exporting and analyzing results using the DataFrame export

Best Practices

When working with the Patronus experimentation framework:

  1. Choose the right evaluator type:

    • Use function-based evaluators for simple cases (but remember to adapt them for experiments)
    • Use structured evaluators for more complex evaluation logic
    • Use remote evaluators for sophisticated evaluations without writing code
  2. Structure your dataset consistently: Use standard field names like task_input, task_context, and gold_answer

  3. Handle edge cases: Make your evaluators robust to missing or unexpected data (the sketch after this list shows one way to do this)

  4. Add instrumentation: Use integrations like OpenInference to capture detailed traces

  5. Tag your experiments: Add metadata tags to help organize and filter your experiments

  6. Export results for analysis: Use the DataFrame or CSV export for deeper analysis
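To make items 3 and 6 concrete, here is a short sketch that combines a defensive custom evaluator with a DataFrame/CSV export. The project name, experiment name, tag values, and CSV file name are illustrative; only APIs already used in this tutorial (run_experiment, FuncEvaluatorAdapter, @evaluator(), EvaluationResult, to_dataframe) plus standard pandas calls are assumed:

from typing import Optional
 
from patronus.evals import evaluator, EvaluationResult
from patronus.experiments import run_experiment, FuncEvaluatorAdapter, Row, TaskResult
 
# A defensive evaluator: returns None (no verdict) when data is missing,
# instead of raising on unexpected rows.
@evaluator()
def exact_match(row: Row, task_result: TaskResult, **kwargs) -> Optional[EvaluationResult]:
    if not row.gold_answer or not task_result or not task_result.output:
        return None
    matched = task_result.output.strip().lower() == row.gold_answer.strip().lower()
    return EvaluationResult(pass_=matched, score=1.0 if matched else 0.0)
 
# Dataset rows use the standard field names task_input and gold_answer.
experiment = run_experiment(
    dataset=[{"task_input": "Hello", "gold_answer": "Hello World"}],
    task=lambda row, **kwargs: f"{row.task_input} World",
    evaluators=[FuncEvaluatorAdapter(exact_match)],
    tags={"stage": "tutorial"},  # illustrative tag
    project_name="Tutorial Project",
    experiment_name="Best Practices Sketch",  # illustrative name
)
 
# Export results for deeper analysis; the CSV file name is illustrative.
df = experiment.to_dataframe()
df.to_csv("experiment_results.csv", index=False)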

Next Steps

Now that you understand the basics of running experiments with Patronus, you can explore the rest of the framework. For more detailed API documentation and advanced features, please refer to the Patronus Python SDK documentation.
