Confidence Intervals

This document describes confidence intervals, which estimate the expected range of scores returned by an evaluator.

🚧

This feature is in Beta and the output fields in confidence_interval may change. Exercise caution if taking a dependency on these fields.

In statistics, a confidence interval (CI) is an interval that is expected to contain the parameter being estimated. More specifically, given a confidence level α (typically 95% or 99%), a CI is a random interval that contains the parameter being estimated in a fraction α of repeated samples.
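
The definition above can be written compactly, keeping this document's convention that α denotes the confidence level. The symbols θ (the parameter being estimated) and L, U (the random lower and upper endpoints) are introduced here for illustration only and do not appear in the API.

```latex
\[
  P\bigl(L \le \theta \le U\bigr) = \alpha
\]
```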

Score Types

All confidence intervals are calculated from historical scores returned by evaluators. If an evaluator returns only a score of 0 or 1, the score is equal to the fail/pass flag. The interpretation of the CI therefore depends on the score type:

  • Binary: If an evaluator returns only 0 or 1, the confidence interval represents a range of expected probabilities for which an evaluator will return 1, given a confidence level α.
  • Continuous: If an evaluator returns a continuous range of values, the confidence interval represents a range of expected values, given a confidence level α.

Evaluator Family                 Score Type
custom                           Binary
exact-match                      Binary
phi                              Binary
pii                              Binary
retrieval-hallucination          Binary
retrieval-answer-relevance       Binary
retrieval-context-relevance      Binary
retrieval-context-sufficiency    Binary
toxicity                         Continuous
metrics                          Continuous

Calculation Methods

For every evaluator family, there is an option to calculate confidence intervals using a selected CI calculation strategy defined by the parameter confidence_interval_strategy. Currently, there are two available options:

  • none: A strategy that does not produce any confidence interval.
  • full-history: A strategy that creates a confidence interval based on the latest 1000 historical evaluations. At least 2 evaluations are required to generate a confidence interval.

📘

Good to Know

Here's how we produce confidence intervals under the full-history strategy:

  • Use data from the following sources:
    • Evaluation runs executed from the context of the application
    • Evaluation API requests made with the strategy "capture": "always"
  • Scope data down to your Account ID, evaluator ID, and profile name
  • Calculate intervals using Monte-Carlo-like bootstrapping on the score_raw response parameter
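
The steps above can be illustrated with a minimal percentile-bootstrap sketch. This is not the service's implementation: the choice of the mean as the resampled statistic, the number of resamples, the nearest-rank quantile lookup, and the function name bootstrap_ci are all assumptions made for illustration. Only the 1000-evaluation window, the 2-evaluation minimum, and the output field names come from this document.

```python
import random
import statistics


def bootstrap_ci(scores, alpha=0.95, n_resamples=2000, seed=0):
    """Percentile-bootstrap interval for the mean of historical raw scores.

    Assumptions (not confirmed by the documentation): the resampled
    statistic is the mean, and 2000 resamples are drawn.
    """
    # full-history uses the latest 1000 evaluations; scores are assumed
    # to be ordered oldest to newest.
    history = list(scores)[-1000:]
    if len(history) < 2:
        return None  # at least 2 evaluations are required

    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(history, k=len(history)))
        for _ in range(n_resamples)
    )

    def quantile(q):
        # Nearest-rank lookup in the sorted bootstrap means.
        index = min(len(means) - 1, max(0, round(q * (len(means) - 1))))
        return means[index]

    tail = (1 - alpha) / 2
    return {
        "strategy": "full-history",
        "alpha": alpha,
        "lower": quantile(tail),
        "median": quantile(0.5),
        "upper": quantile(1 - tail),
    }


# Hypothetical binary history: two passes and two fails.
interval = bootstrap_ci([0, 1, 1, 0])
```

With a binary history, every resampled mean is a pass rate, so the resulting interval can be read as a range of probabilities that the evaluator returns 1.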

Request and Response

The confidence interval calculation returns five fields in the confidence_interval object of the response:

  • strategy: CI calculation strategy
  • alpha: Confidence level α
  • lower: Lower bound, the ((1 - α) / 2) quantile of values produced by the evaluator
  • median: Expected median of values produced by the evaluator
  • upper: Upper bound, the (1 - (1 - α) / 2) quantile of values produced by the evaluator

Generated Confidence Interval = [lower, upper]
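
As a quick arithmetic check of the lower and upper formulas, the following sketch converts a confidence level into the two percentile levels. The helper name tail_quantiles is hypothetical and not part of the API.

```python
def tail_quantiles(alpha):
    """Return the (lower, upper) quantile levels for confidence level alpha."""
    tail = (1 - alpha) / 2
    return tail, 1 - tail


for alpha in (0.95, 0.99):
    low, high = tail_quantiles(alpha)
    print(f"alpha={alpha}: lower={low * 100:.1f}th, upper={high * 100:.1f}th percentile")
# prints:
# alpha=0.95: lower=2.5th, upper=97.5th percentile
# alpha=0.99: lower=0.5th, upper=99.5th percentile
```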

Here is an example request body for the /v1/evaluate endpoint:

{
  "capture": "all",
  "app": "default",
  "evaluators": [
    {
      // continuous score type
      "evaluator": "toxicity"
    },
    {
      // binary score type
      "evaluator": "custom",
      "profile_name": "no-comma"
    }
  ],
  "evaluated_model_input": "Question?",
  "evaluated_model_output": "Answer.",
  "explain": true,
  "confidence_interval_strategy": "full-history",
  "tags": null
}
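
Because strict JSON does not permit // comments, the annotations in the sample above need to be removed before the body is sent. One way to avoid this is to assemble the body in code, as in the sketch below. The helper name build_evaluate_payload is hypothetical, and the transport details (base URL, authentication headers) are not covered in this document, so they are left out.

```python
import json


def build_evaluate_payload(model_input, model_output, evaluators,
                           strategy="full-history"):
    """Assemble a request body shaped like the example above."""
    # Only the two strategies listed in this document are accepted here.
    if strategy not in ("none", "full-history"):
        raise ValueError(f"unsupported confidence_interval_strategy: {strategy}")
    return {
        "capture": "all",
        "app": "default",
        "evaluators": evaluators,
        "evaluated_model_input": model_input,
        "evaluated_model_output": model_output,
        "explain": True,
        "confidence_interval_strategy": strategy,
        "tags": None,
    }


payload = build_evaluate_payload(
    "Question?",
    "Answer.",
    [
        {"evaluator": "toxicity"},  # continuous score type
        {"evaluator": "custom", "profile_name": "no-comma"},  # binary score type
    ],
)
body = json.dumps(payload)  # JSON text suitable for a POST body
```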

The endpoint returns a response similar to the following:

{
    "results": [
        {
            "evaluator_id": "toxicity-2024-05-16",
            "profile_name": "system:detect-all-toxicity",
            "status": "success",
            ...
            "evaluation_result": {
                ...
                "additional_info": {
                    ...
                    "evaluator_family": "toxicity",
                    "confidence_interval": {
                        "strategy": "full-history",
                        "alpha": 0.95,
                        "lower": 0.2061775791977413,
                        "median": 0.22602017580167078,
                        "upper": 0.24472213619434405
                    }
                }
            }
        },
        {
            "evaluator_id": "custom-large-2024-05-16",
            "profile_name": "no-comma",
            "status": "success",
            ...
            "evaluation_result": {
                ...
                "additional_info": {
                    ...
                    "evaluator_family": "custom",
                    "confidence_interval": {
                        "strategy": "full-history",
                        "alpha": 0.95,
                        "lower": 0.25,
                        "median": 0.5,
                        "upper": 1.0
                    }
                }
            }
        }
    ]
}

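Both results include lower, median, and upper values, but their interpretation depends on the score type. The toxicity evaluator is continuous, so 95% of the time you can expect the raw score to fall between 0.2061775791977413 and 0.24472213619434405. The custom evaluator is binary, so its interval of 0.25 to 1.0 is the range of expected probabilities that the evaluator returns 1.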
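
A response shaped like the example can be read programmatically. The sketch below walks the results list and labels each interval by score type, using the evaluator family table from this document. The function name summarize_intervals is hypothetical, and skipping results that carry no confidence_interval object is an assumption about how the "none" strategy or insufficient history appears in the response.

```python
# Score types per evaluator family, copied from the table in this document.
SCORE_TYPES = {
    "custom": "Binary",
    "exact-match": "Binary",
    "phi": "Binary",
    "pii": "Binary",
    "retrieval-hallucination": "Binary",
    "retrieval-answer-relevance": "Binary",
    "retrieval-context-relevance": "Binary",
    "retrieval-context-sufficiency": "Binary",
    "toxicity": "Continuous",
    "metrics": "Continuous",
}


def summarize_intervals(response):
    """Return one human-readable line per result that has a confidence interval."""
    summaries = []
    for result in response.get("results", []):
        info = (result.get("evaluation_result") or {}).get("additional_info") or {}
        ci = info.get("confidence_interval")
        if not ci:
            continue  # assumed: no interval was produced for this result
        score_type = SCORE_TYPES.get(info.get("evaluator_family"), "Unknown")
        if score_type == "Binary":
            meaning = "probability of returning 1"
        else:
            meaning = "raw score"
        summaries.append(
            f"{result.get('evaluator_id')}: {meaning} in "
            f"[{ci['lower']}, {ci['upper']}] at alpha={ci['alpha']}"
        )
    return summaries
```

For the binary result in the example response, this yields "custom-large-2024-05-16: probability of returning 1 in [0.25, 1.0] at alpha=0.95".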