Batch Evaluation Script

🚧

This is a more advanced script, and it can be hard to follow what it's doing, especially if you're not familiar with concurrency. We'd recommend working through the tutorials and using the other scripts before trying this one.

This script demonstrates a powerful paradigm for using the Patronus API: concurrent calls, where many requests to our API are in flight at the same time. By parallelizing your API requests, you can get through large volumes of data in far less time.

For more information on running concurrent calls in Python, see the standard library documentation for the concurrent.futures module.
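Here is a minimal sketch of the same pattern before diving into the full script. The fetch_status helper and example URLs are placeholders for illustration, not part of the Patronus API:

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_status(url: str) -> int:
    # One unit of work; each call runs in its own worker thread
    return requests.get(url).status_code

urls = [f"https://example.com/items/{i}" for i in range(20)]
with ThreadPoolExecutor(max_workers=5) as executor:
    # Submit all tasks up front; up to max_workers run at the same time
    futures = [executor.submit(fetch_status, url) for url in urls]
    for future in as_completed(futures):
        # Results arrive in completion order, not submission order
        print(future.result())

The full script below follows exactly this structure, with the addition of retries and CSV output.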

import datetime
import argparse
import csv
import requests
import time
from typing import Dict
from concurrent.futures import ThreadPoolExecutor, as_completed

retrieval_samples = [
    {
        "evaluated_model_input": "What is one of the biggest benefits of Gen AI?",
        "retrieved_context": [
            "One of AI's biggest benefits, some believe, is that it will free up our precious time to pursue higher ideals. Although deep generative models are very promising, their objective is to mimic a dataset, and as we know, similarity isn't enough if you truly want to innovate. Just because we'll be using the same systems doesn't mean we'll be generating the same outputs. AI, in fact, suggests that we should challenge ourselves to do otherwise and make each result our own.",
            "As engineers and designers, we often don't want to rehash a design that's already out there. What is exciting is that everyone will use generative AI differently, which means each person's experiments can generate unique innovations or value.",
            "ChatGPT holds up the mirror to humanity. It might be considered clever and be able to produce incredible art, literature and music – but only we can burst into tears at the sight of sheer beauty or brilliance. We will always be the best at being humans than any other machine, computer or robot that we could ever create. If this works out as planned, our species could be prompted into being the best versions of ourselves. Imagine that.",
        ],
        "model_output": "Gen AI will free up humanity's time so that we humans can focus on more purposeful ideals.",
    },
    {
        "evaluated_model_input": "What is one of the biggest benefits of Gen AI?",
        "retrieved_context": [
            "One of AI's biggest benefits, some believe, is that it will free up our precious time to pursue higher ideals. Although deep generative models are very promising, their objective is to mimic a dataset, and as we know, similarity isn't enough if you truly want to innovate. Just because we'll be using the same systems doesn't mean we'll be generating the same outputs. AI, in fact, suggests that we should challenge ourselves to do otherwise and make each result our own.",
            "As engineers and designers, we often don't want to rehash a design that's already out there. What is exciting is that everyone will use generative AI differently, which means each person's experiments can generate unique innovations or value.",
            "ChatGPT holds up the mirror to humanity. It might be considered clever and be able to produce incredible art, literature and music – but only we can burst into tears at the sight of sheer beauty or brilliance. We will always be the best at being humans than any other machine, computer or robot that we could ever create. If this works out as planned, our species could be prompted into being the best versions of ourselves. Imagine that.",
        ],
        "model_output": "Gen AI will generate massive amounts of capital for any company clever enough to hop on the train early.",
    },
    {
        "evaluated_model_input": "What is one of the biggest benefits of Gen AI?",
        "retrieved_context": [
            "One of AI's biggest benefits, some believe, is that it will free up our precious time to pursue higher ideals. Although deep generative models are very promising, their objective is to mimic a dataset, and as we know, similarity isn't enough if you truly want to innovate. Just because we'll be using the same systems doesn't mean we'll be generating the same outputs. AI, in fact, suggests that we should challenge ourselves to do otherwise and make each result our own.",
            "As engineers and designers, we often don't want to rehash a design that's already out there. What is exciting is that everyone will use generative AI differently, which means each person's experiments can generate unique innovations or value.",
            "ChatGPT holds up the mirror to humanity. It might be considered clever and be able to produce incredible art, literature and music – but only we can burst into tears at the sight of sheer beauty or brilliance. We will always be the best at being humans than any other machine, computer or robot that we could ever create. If this works out as planned, our species could be prompted into being the best versions of ourselves. Imagine that.",
        ],
        "model_output": "Gen AI will make engineers more artistic and give them more free time to pursue art.",
    },
    {
        "evaluated_model_input": "What is one of the biggest benefits of Gen AI?",
        "retrieved_context": [
            "One of AI's biggest benefits, some believe, is that it will free up our precious time to pursue higher ideals. Although deep generative models are very promising, their objective is to mimic a dataset, and as we know, similarity isn't enough if you truly want to innovate. Just because we'll be using the same systems doesn't mean we'll be generating the same outputs. AI, in fact, suggests that we should challenge ourselves to do otherwise and make each result our own.",
            "As engineers and designers, we often don't want to rehash a design that's already out there. What is exciting is that everyone will use generative AI differently, which means each person's experiments can generate unique innovations or value.",
            "ChatGPT holds up the mirror to humanity. It might be considered clever and be able to produce incredible art, literature and music – but only we can burst into tears at the sight of sheer beauty or brilliance. We will always be the best at being humans than any other machine, computer or robot that we could ever create. If this works out as planned, our species could be prompted into being the best versions of ourselves. Imagine that.",
        ],
        "model_output": "Generative artificial intelligence (generative AI) is a type of AI that can create new content and ideas, including conversations, stories, images, videos, and music. AI technologies attempt to mimic human intelligence in nontraditional computing tasks like image recognition, natural language processing (NLP), and translation. Generative AI is the next step in artificial intelligence. You can train it to learn human language, programming languages, art, chemistry, biology, or any complex subject matter. It reuses training data to solve new problems. For example, it can learn English vocabulary and create a poem from the words it processes. Your organization can use generative AI for various purposes, like chatbots, media creation, and product development and design.",
    },
]

retrieval_samples = (
    retrieval_samples * 100
)  # Repeat the samples to simulate a larger batch


def save_data_in_csv(output_file: str, data: Dict[str, str]):
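    """Append one result row to the CSV, writing the header only when the file is new/empty."""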
    with open(output_file, mode="a", encoding="utf-8-sig", newline="") as csvfile:
        fieldnames = [
            "sample_id",
            "model_input",
            "retrieved_context",
            "model_output",
            "retrieval-answer-relevance-large-2024-05-31",
            "explanation-retrieval-answer-relevance-large-2024-05-31",
            "retrieval-context-relevance-large-2024-05-31",
            "explanation-retrieval-context-relevance-large-2024-05-31",
            "retrieval-hallucination-lynx-large-2024-07-16",
            "explanation-retrieval-hallucination-lynx-large-2024-07-16",
        ]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if csvfile.tell() == 0:
            writer.writeheader()
        writer.writerow(data)


def call_api(api_key: str, model_input: str, retrieved_context: str, model_output: str):
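    """Send a single evaluation request to the Patronus API.

    Returns the list of per-criterion results on success, or False on any
    error so the caller can retry.
    """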
    headers = {
        "Content-Type": "application/json",
        "X-API-KEY": f"{api_key}",
    }
    data = {
        "criteria": [
            {"criterion_id": "retrieval-answer-relevance-large-2024-05-31"},
            {"criterion_id": "retrieval-context-relevance-large-2024-05-31"},
            {
                "criterion_id": "retrieval-hallucination-lynx-large-2024-07-16",
                "explain_strategy": "always",
            },
        ],
        "input": model_input,
        "retrieved_context": retrieved_context,
        "output": model_output,
        "capture": "none",
    }
    # NOTE: It is important to catch exceptions in the block below so we can retry as needed
    try:
        response = requests.post(
            "https://api.patronus.ai/v1/evaluate", headers=headers, json=data
        )
        if response.status_code == 200:
            results = response.json().get("results", [])
            _ = [
                bool(result["result"]["pass"]) for result in results
            ]  # Will raise an exception if the result was not returned properly
            return results
        else:
            print(f"Error: Received status code {response.status_code}")
            print(response.text)
            return False
    except Exception as e:
        print(f"Error: {e}")
        return False


def process_sample(api_key, sample_id, sample, output_file):
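    """Evaluate one sample with retries and append the results to the CSV.

    Returns the sample_id on success, or None once all retries are exhausted.
    """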
    retries = 0
    number_of_retries = 5

    while True:
        results = call_api(
            api_key,
            sample["evaluated_model_input"],
            sample["retrieved_context"],
            sample["model_output"],
        )
        if isinstance(results, bool):
            retries += 1
            sleep_time = 2
            print(f"Retrying for {sample_id} in {sleep_time} seconds. Retry: {retries}")
            time.sleep(sleep_time)
            if retries > number_of_retries:
                print(f"Failed to get results for {sample_id}. Exiting.")
                return None
        else:
            break

    output_data = {
        "sample_id": sample_id,
        "model_input": sample["evaluated_model_input"],
        "retrieved_context": sample["retrieved_context"],
        "model_output": sample["model_output"],
    }
    for result in results:
        criterion_id = result["criterion_id"]
        passed = bool(result["result"]["pass"])
        output_data[criterion_id] = "PASS" if passed else "FAIL"
        # Only record the explanation when the evaluation failed
        output_data[f"explanation-{criterion_id}"] = (
            result["result"]["additional_info"]["explanation"] if not passed else ""
        )

    save_data_in_csv(output_file, output_data)
    return sample_id


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Concurrent API Script for Retrieval Evaluation"
    )
    parser.add_argument("-k", "--api_key", type=str, help="API Key for Patronus API")
    parser.add_argument(
        "-o",
        "--output",
        type=str,
        default="evaluation_results.csv",
        help="Output CSV File",
    )
    parser.add_argument(
        "-w", "--max-workers", type=int, default=5, help="Maximum number of workers"
    )
    args = parser.parse_args()

    output_file = args.output
    max_workers = args.max_workers
    api_key = args.api_key

    count = 0
    start_time = time.time()

    print(f"Start: {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(process_sample, api_key, sample_id, sample, output_file)
            for sample_id, sample in enumerate(retrieval_samples)
        ]

        for future in as_completed(futures):
            result = future.result()
            if result is not None:
                count += 1
                if count % 10 == 0:
                    print(
                        f"Processed {count} samples in {(time.time() - start_time) / 60:.2f} minutes TS: {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
                    )
                else:
                    print(".", end="", flush=True)

    print(
        f"Total samples processed: {count} in {(time.time() - start_time) / 60:.2f} minutes TS: {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
    )