Experiments

Note: For comprehensive API documentation and more detailed examples, please refer to the Patronus Python SDK documentation.

At a high level, an Experiment is a structured evaluation of your AI application's performance across multiple samples. Experiments allow you to run batched evaluations to compare performance across different configurations, models, and datasets, so that you can make informed decisions to optimize your AI applications.

Patronus provides an intuitive Experimentation Framework to help you continuously improve your AI applications. Whether you're developing RAG apps, fine-tuning your own models, or iterating on prompts, this framework provides the tools you need to set up, execute, and analyze experiments efficiently.

Key Components

An experiment in Patronus consists of several components:

  1. Dataset: A collection of examples to evaluate, which can be:

    • A list of dictionaries in your code
    • A CSV or JSON file
    • A Pandas DataFrame
    • Data from a Patronus dataset
  2. Task (Optional): A function that processes each example, typically:

    • Takes input from the dataset
    • Calls an LLM or other AI system
    • Returns an output for evaluation
    • If your dataset already contains outputs, you can skip defining a task
  3. Evaluators: One or more evaluators that assess the quality of outputs (one of each style is sketched after this list):

    • Class-based: Extend StructuredEvaluator for more complex logic
    • Function-based: Simple functions wrapped with FuncEvaluatorAdapter
    • Remote: Patronus-hosted evaluators accessible via RemoteEvaluator
  4. Configuration: Additional options to customize experiment behavior:

    • Project and experiment names for organization
    • Tags for filtering and categorization
    • Concurrency settings for performance
    • Instrumentation options for tracing
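
To make the evaluator styles concrete, here is a minimal, illustrative sketch of one evaluator of each kind. StructuredEvaluator, FuncEvaluatorAdapter, and RemoteEvaluator are the classes named above; the @evaluator decorator, the EvaluationResult fields (pass_, score), and the keyword arguments the framework passes to evaluators (row, task_result, task_output) are assumptions based on common SDK patterns, so verify them against the Patronus Python SDK reference.

from patronus.evals import EvaluationResult, RemoteEvaluator, StructuredEvaluator, evaluator
from patronus.experiments import FuncEvaluatorAdapter

# Function-based: a plain function, decorated with @evaluator and wrapped
# with FuncEvaluatorAdapter when passed to an experiment.
@evaluator()
def exact_match(row, task_result, **kwargs):
    # Compare the task's output with the expected answer from the dataset row.
    return task_result.output.strip() == row.gold_answer.strip()

# Class-based: extend StructuredEvaluator for more complex or stateful logic.
class StartsWithSummary(StructuredEvaluator):
    def evaluate(self, *, task_output: str, **kwargs) -> EvaluationResult:
        passed = task_output.startswith("Summary:")
        return EvaluationResult(pass_=passed, score=1.0 if passed else 0.0)

# Remote: a Patronus-hosted evaluator referenced by evaluator and criteria name.
semantic_judge = RemoteEvaluator("judge", "patronus:semantic-similarity")

# All three styles can be mixed in a single evaluators list.
evaluators = [
    FuncEvaluatorAdapter(exact_match),
    StartsWithSummary(),
    semantic_judge,
]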

A Simple Experiment

Here's a basic example of running an experiment:

from patronus.evals import RemoteEvaluator
from patronus.experiments import run_experiment
 
# Define a simple task
def summarize_task(row, **kwargs):
    return f"Summary: {row.task_input}"
 
# Run the experiment
experiment = run_experiment(
    dataset=[
        {"task_input": "AI is improving rapidly.", "gold_answer": "AI technology is advancing quickly."},
        {"task_input": "The market is volatile.", "gold_answer": "Market conditions are unstable."}
    ],
    task=summarize_task,
    evaluators=[
        RemoteEvaluator("judge", "patronus:semantic-similarity")
    ],
    experiment_name="Basic Summarization Test"
)
 
# View results summary
print(experiment.summary())
 
# Export results for detailed analysis
df = experiment.to_dataframe()
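
For reference, the configuration options listed under Key Components are passed as keyword arguments to run_experiment. The sketch below is illustrative only: the parameter names project_name, tags, and max_concurrency are assumptions based on common SDK usage, so confirm them against the Patronus Python SDK reference. Because to_dataframe() returns a standard pandas DataFrame, the results can also be inspected and saved with ordinary pandas calls.

# A hedged sketch: the configuration parameter names below may differ in your SDK version.
experiment = run_experiment(
    dataset=[
        {"task_input": "AI is improving rapidly.", "gold_answer": "AI technology is advancing quickly."}
    ],
    task=summarize_task,
    evaluators=[RemoteEvaluator("judge", "patronus:semantic-similarity")],
    project_name="docs-examples",          # assumed: groups experiments under a project
    experiment_name="Basic Summarization Test v2",
    tags={"stage": "development"},         # assumed: key-value tags for filtering
    max_concurrency=10,                    # assumed: limits parallel task/evaluator calls
)

# experiment.to_dataframe() returns a plain pandas DataFrame, so standard
# pandas operations work for inspection and export.
df = experiment.to_dataframe()
print(df.columns.tolist())                 # discover the available result columns
df.to_csv("summarization_results.csv", index=False)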

Benefits of the Experimentation Framework

Using the Patronus Experimentation Framework offers several advantages:

  • Standardization: Consistent evaluation methodology across different models and datasets
  • Reproducibility: Easily rerun experiments with the same configuration
  • Efficiency: Parallel execution for faster evaluation of large datasets
  • Visibility: Detailed metrics and visualizations in the Patronus platform
  • Integration: Seamless connection with logging and tracing for end-to-end visibility

Getting Started with Experiments

Ready to dive in? Check out these detailed guides:

  1. Running your first experiment with Python
  2. Working with different types of evaluators
  3. Managing and creating datasets
  4. Advanced experiment configurations

For comprehensive API references and additional examples, please refer to the Patronus Python SDK documentation.
