Experiments

Note: For comprehensive API documentation and more detailed examples, please refer to the Patronus Python SDK documentation.

At a high level, an Experiment is a structured evaluation of your AI application's performance across multiple samples. Experiments allow you to run batched evaluations to compare performance across different configurations, models, and datasets, so that you can make informed decisions to optimize your AI applications.

Patronus provides an intuitive Experimentation Framework to help you continuously improve your AI applications. Whether you're developing RAG apps, fine-tuning your own models, or iterating on prompts, this framework provides the tools you need to set up, execute, and analyze experiments efficiently.

Key Components

An experiment in Patronus consists of several components:

  1. Dataset: A collection of examples to evaluate, which can be:

    • A list of dictionaries in your code
    • A CSV or JSON file
    • A Pandas DataFrame
    • Data from a Patronus dataset
  2. Task (Optional): A function that processes each example, typically:

    • Takes input from the dataset
    • Calls an LLM or other AI system
    • Returns an output for evaluation
    • If your dataset already contains outputs, you can skip defining a task
  3. Evaluators: One or more evaluators that assess the quality of outputs (one of each style is sketched after this list):

    • Class-based: Extend StructuredEvaluator for more complex logic
    • Function-based: Simple functions wrapped with FuncEvaluatorAdapter
    • Remote: Patronus-hosted evaluators accessible via RemoteEvaluator
  4. Configuration: Additional options to customize experiment behavior:

    • Project and experiment names for organization
    • Tags for filtering and categorization
    • Concurrency settings for performance
    • Instrumentation options for tracing
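
To make the evaluator styles concrete, here is a minimal, illustrative sketch of one evaluator of each kind. StructuredEvaluator, FuncEvaluatorAdapter, and RemoteEvaluator are the classes named above; the @evaluator decorator, the EvaluationResult fields (pass_, score), and the keyword arguments the framework passes to evaluators (row, task_result, task_output) are assumptions based on common SDK patterns, so verify them against the Patronus Python SDK reference.

from patronus.evals import EvaluationResult, RemoteEvaluator, StructuredEvaluator, evaluator
from patronus.experiments import FuncEvaluatorAdapter

# Function-based: a plain function, decorated with @evaluator and wrapped
# with FuncEvaluatorAdapter when passed to an experiment.
@evaluator()
def exact_match(row, task_result, **kwargs):
    # Compare the task's output with the expected answer from the dataset row.
    return task_result.output.strip() == row.gold_answer.strip()

# Class-based: extend StructuredEvaluator for more complex or stateful logic.
class StartsWithSummary(StructuredEvaluator):
    def evaluate(self, *, task_output: str, **kwargs) -> EvaluationResult:
        passed = task_output.startswith("Summary:")
        return EvaluationResult(pass_=passed, score=1.0 if passed else 0.0)

# Remote: a Patronus-hosted evaluator referenced by evaluator and criteria name.
semantic_judge = RemoteEvaluator("judge", "patronus:semantic-similarity")

# All three styles can be mixed in a single evaluators list.
evaluators = [
    FuncEvaluatorAdapter(exact_match),
    StartsWithSummary(),
    semantic_judge,
]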

A Simple Experiment

Here's a basic example of running an experiment:

from patronus.evals import RemoteEvaluator
from patronus.experiments import run_experiment
 
# Define a simple task
def summarize_task(row, **kwargs):
    return f"Summary: {row.task_input}"
 
# Run the experiment
experiment = run_experiment(
    dataset=[
        {"task_input": "AI is improving rapidly.", "gold_answer": "AI technology is advancing quickly."},
        {"task_input": "The market is volatile.", "gold_answer": "Market conditions are unstable."}
    ],
    task=summarize_task,
    evaluators=[
        RemoteEvaluator("judge", "patronus:semantic-similarity")
    ],
    experiment_name="Basic Summarization Test"
)
 
# View results summary
print(experiment.summary())
 
# Export results for detailed analysis
df = experiment.to_dataframe()
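
For reference, the configuration options listed under Key Components are passed as keyword arguments to run_experiment. The sketch below is illustrative only: the parameter names project_name, tags, and max_concurrency are assumptions based on common SDK usage, so confirm them against the Patronus Python SDK reference. Because to_dataframe() returns a standard pandas DataFrame, the results can also be inspected and saved with ordinary pandas calls.

# A hedged sketch: the configuration parameter names below may differ in your SDK version.
experiment = run_experiment(
    dataset=[
        {"task_input": "AI is improving rapidly.", "gold_answer": "AI technology is advancing quickly."}
    ],
    task=summarize_task,
    evaluators=[RemoteEvaluator("judge", "patronus:semantic-similarity")],
    project_name="docs-examples",          # assumed: groups experiments under a project
    experiment_name="Basic Summarization Test v2",
    tags={"stage": "development"},         # assumed: key-value tags for filtering
    max_concurrency=10,                    # assumed: limits parallel task/evaluator calls
)

# experiment.to_dataframe() returns a plain pandas DataFrame, so standard
# pandas operations work for inspection and export.
df = experiment.to_dataframe()
print(df.columns.tolist())                 # discover the available result columns
df.to_csv("summarization_results.csv", index=False)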

Benefits of the Experimentation Framework

Using the Patronus Experimentation Framework offers several advantages:

  • Standardization: Consistent evaluation methodology across different models and datasets
  • Reproducibility: Easily rerun experiments with the same configuration
  • Efficiency: Parallel execution for faster evaluation of large datasets
  • Visibility: Detailed metrics and visualizations in the Patronus platform
  • Integration: Seamless connection with logging and tracing for end-to-end visibility

Getting Started with Experiments

Ready to dive in? Check out these detailed guides:

  1. Running your first experiment with Python
  2. Working with different types of evaluators
  3. Managing and creating datasets
  4. Advanced experiment configurations

For comprehensive API references and additional examples, please refer to the Patronus Python SDK documentation.
