
Confidence Intervals

This feature is in Beta and the output fields in confidence_interval may change. Exercise caution if taking a dependency on these fields.

In statistics, a confidence interval (CI) is an interval that is expected to contain the parameter being estimated. More specifically, given a confidence level α (95% and 99% are typical values), a CI is a random interval that contains the parameter being estimated with probability α.

Score Types

All confidence intervals are calculated from historical scores returned by evaluators. If an evaluator only ever returns a score of 0 or 1, that score is equivalent to the fail/pass flag. Because of this, the interpretation of the confidence interval depends on the type of score the evaluator returns, as the table and sketch below illustrate:

  • Binary: If an evaluator returns only 0 or 1, the confidence interval represents a range of expected probabilities for which an evaluator will return 1, given a confidence level α.
  • Continuous: If an evaluator returns a continuous range of values, the confidence interval represents a range of expected values, given a confidence level α.
Evaluator Family                 Score Type
custom                           Binary
exact-match                      Binary
phi                              Binary
pii                              Binary
retrieval-hallucination         Binary
retrieval-answer-relevance       Binary
retrieval-context-relevance      Binary
retrieval-context-sufficiency    Binary
toxicity                         Continuous
metrics                          Continuous
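
To make the two interpretations concrete, here is a small illustrative sketch in Python with hypothetical historical scores (the data and the simple averages are for illustration only; they are not how the service computes the interval):

# Hypothetical historical score_raw values (illustration only).
binary_scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]        # e.g. custom, exact-match, pii
continuous_scores = [0.21, 0.24, 0.19, 0.23, 0.22]    # e.g. toxicity, metrics

# Binary: the interval brackets the probability that the evaluator returns 1 (pass).
print(sum(binary_scores) / len(binary_scores))          # 0.7 -> CI is a range around this probability

# Continuous: the interval brackets the expected raw score itself.
print(sum(continuous_scores) / len(continuous_scores))  # 0.218 -> CI is a range around this value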

Calculation Methods

For every evaluator family, you can choose how confidence intervals are calculated by setting the confidence_interval_strategy parameter. Currently, there are two available options:

  • none: A strategy that does not produce any confidence interval.
  • full-history: A strategy that creates a confidence interval based on the latest 1000 historical evaluations. At least 2 evaluations are required to generate a confidence interval.

Good to Know

Here's how we produce confidence intervals under the full-history strategy:

  • Use data from the following sources:
    • Evaluation runs executed from the context of the application,
    • Evaluation API responses from requests run with the setting "capture": "always"
  • Scope data down to your Account ID, evaluator ID, and profile name
  • Calculate intervals using Monte-Carlo-like bootstrapping on the score_raw response parameter (see the sketch below)
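
The exact implementation is not published; the following is a rough, illustrative sketch of percentile bootstrapping over historical score_raw values in Python (the resample count and the use of the mean as the bootstrapped statistic are assumptions, not the service's actual parameters):

import numpy as np

def bootstrap_confidence_interval(scores, alpha=0.95, n_resamples=10_000, seed=0):
    """Percentile bootstrap over historical score_raw values (illustrative only)."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample with replacement and record the mean of each resample.
    means = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                      for _ in range(n_resamples)])
    lower_q, upper_q = (1 - alpha) / 2, 1 - (1 - alpha) / 2
    return {
        "alpha": alpha,
        "lower": float(np.quantile(means, lower_q)),
        "median": float(np.quantile(means, 0.5)),
        "upper": float(np.quantile(means, upper_q)),
    }

# Example: historical binary pass/fail scores from a custom evaluator.
print(bootstrap_confidence_interval([1, 0, 1, 1, 0, 1, 1, 1]))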

Request

The confidence interval calculation returns five fields in the response:

  • strategy: CI calculation strategy
  • alpha: Confidence level
  • lower: Lower ((1 - α) / 2) percentile of values produced by the evaluator
  • median: Expected median of values produced by the evaluator
  • upper: Upper (1 - (1 - α) / 2) percentile of values produced by the evaluator

Generated Confidence Interval = [lower, upper]
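
For example, with α = 0.95 the lower bound is the (1 - 0.95) / 2 = 0.025 quantile (the 2.5th percentile) and the upper bound is the 1 - 0.025 = 0.975 quantile (the 97.5th percentile), so [lower, upper] spans the central 95% of values the evaluator is expected to produce.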

Here is an example request body for the /v1/evaluate endpoint:

{
    "capture": "all",
    "app": "default",
    "evaluators": [
        {
            // continuous score type
            "evaluator": "toxicity"
        },
        {
            // binary score type
            "evaluator": "custom",
            "profile_name": "no-comma"
        }
    ],
    "evaluated_model_input": "Question?",
    "evaluated_model_output": "Answer.",
    "explain": true,
    "confidence_interval_strategy": "full-history",
    "tags": null
}

You can expect a response like the following:

{
    "results": [
        {
            "evaluator_id": "toxicity-2024-05-16",
            "profile_name": "system:detect-all-toxicity",
            "status": "success",
            ...
            "evaluation_result": {
                ...
                "additional_info": {
                    ...
                    "evaluator_family": "toxicity",
                    "confidence_interval": {
                        "strategy": "full-history",
                        "alpha": 0.95,
                        "lower": 0.2061775791977413,
                        "median": 0.22602017580167078,
                        "upper": 0.24472213619434405
                    }
                },
            },
        },
        {
            "evaluator_id": "custom-large-2024-05-16",
            "profile_name": "no-comma",
            "status": "success",
            ...
            "evaluation_result": {
                ...
                "additional_info": {
                    ...
                    "evaluator_family": "custom",
                    "confidence_interval": {
                        "strategy": "full-history",
                        "alpha": 0.95,
                        "lower": 0.25,
                        "median": 0.5,
                        "upper": 1.0
                    }
                },
            },
        }
    ]
}

Notice that the toxicity evaluator returns a continuous score, so its confidence interval includes a lower, median, and upper value interpreted as a range of expected raw scores: 95% of the time, you can expect score_raw to fall between 0.2061775791977413 and 0.24472213619434405. For the binary custom evaluator, the interval [0.25, 1.0] is instead interpreted as the range of expected probabilities that the evaluator returns 1 (pass).
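
If you are calling the REST API directly, here is a minimal sketch in Python of sending the request above and reading the interval back. The base URL, the auth header name, and the API_KEY environment variable are assumptions for illustration; confirm the exact values against the API reference.

import os
import requests

# Assumptions for illustration: base URL and auth header name; check the API reference.
API_URL = "https://api.example.com/v1/evaluate"
headers = {"X-API-KEY": os.environ["API_KEY"]}

payload = {
    "app": "default",
    "evaluators": [
        {"evaluator": "toxicity"},
        {"evaluator": "custom", "profile_name": "no-comma"},
    ],
    "evaluated_model_input": "Question?",
    "evaluated_model_output": "Answer.",
    "confidence_interval_strategy": "full-history",
}

response = requests.post(API_URL, json=payload, headers=headers, timeout=30)
response.raise_for_status()

for result in response.json()["results"]:
    ci = result["evaluation_result"]["additional_info"]["confidence_interval"]
    print(result["evaluator_id"], ci["lower"], ci["median"], ci["upper"])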
