
Confidence Intervals

This feature is in Beta and the output fields in confidence_interval may change. Exercise caution if taking a dependency on these fields.

In statistics, a confidence interval (CI) is an interval that is expected to contain the parameter being estimated. More specifically, given a confidence level α (95% and 99% are typical values), a CI is a random interval that contains the parameter being estimated with probability α.

Score Types

All confidence intervals are calculated from historical scores returned by evaluators. If an evaluator only ever returns a score of 0 or 1, that score is equivalent to the fail/pass flag. Because of this, the interpretation of the confidence interval depends on the type of score the evaluator returns, as the table and sketch below illustrate:

  • Binary: If an evaluator returns only 0 or 1, the confidence interval represents a range of expected probabilities for which an evaluator will return 1, given a confidence level α.
  • Continuous: If an evaluator returns a continuous range of values, the confidence interval represents a range of expected values, given a confidence level α.
Evaluator Family                 Score Type
custom                           Binary
exact-match                      Binary
phi                              Binary
pii                              Binary
retrieval-hallucination         Binary
retrieval-answer-relevance       Binary
retrieval-context-relevance      Binary
retrieval-context-sufficiency    Binary
toxicity                         Continuous
metrics                          Continuous
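
To make the two interpretations concrete, here is a small illustrative sketch in Python with hypothetical historical scores (the data and the simple averages are for illustration only; they are not how the service computes the interval):

# Hypothetical historical score_raw values (illustration only).
binary_scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]        # e.g. custom, exact-match, pii
continuous_scores = [0.21, 0.24, 0.19, 0.23, 0.22]    # e.g. toxicity, metrics

# Binary: the interval brackets the probability that the evaluator returns 1 (pass).
print(sum(binary_scores) / len(binary_scores))          # 0.7 -> CI is a range around this probability

# Continuous: the interval brackets the expected raw score itself.
print(sum(continuous_scores) / len(continuous_scores))  # 0.218 -> CI is a range around this value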

Calculation Methods

For every evaluator family, you can choose how confidence intervals are calculated by setting the confidence_interval_strategy parameter. Currently, there are two available options:

  • none: A strategy that does not produce any confidence interval.
  • full-history: A strategy that creates a confidence interval based on the latest 1000 historical evaluations. At least 2 evaluations are required to generate a confidence interval.

Good to Know

Here's how we produce confidence intervals under the full-history strategy:

  • Use data from the following sources:
    • Evaluation runs executed from the context of the application,
    • Evaluation API responses from requests run with the setting "capture": "always"
  • Scope data down to your Account ID, evaluator ID, and profile name
  • Calculate intervals using Monte-Carlo-like bootstrapping on the score_raw response parameter (see the sketch below)
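
The exact implementation is not published; the following is a rough, illustrative sketch of percentile bootstrapping over historical score_raw values in Python (the resample count and the use of the mean as the bootstrapped statistic are assumptions, not the service's actual parameters):

import numpy as np

def bootstrap_confidence_interval(scores, alpha=0.95, n_resamples=10_000, seed=0):
    """Percentile bootstrap over historical score_raw values (illustrative only)."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample with replacement and record the mean of each resample.
    means = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                      for _ in range(n_resamples)])
    lower_q, upper_q = (1 - alpha) / 2, 1 - (1 - alpha) / 2
    return {
        "alpha": alpha,
        "lower": float(np.quantile(means, lower_q)),
        "median": float(np.quantile(means, 0.5)),
        "upper": float(np.quantile(means, upper_q)),
    }

# Example: historical binary pass/fail scores from a custom evaluator.
print(bootstrap_confidence_interval([1, 0, 1, 1, 0, 1, 1, 1]))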

Request

The confidence interval calculation returns five fields in the response:

  • strategy: CI calculation strategy
  • alpha: Confidence level
  • lower: Lower ((1 - α) / 2) percentile of values produced by the evaluator
  • median: Expected median of values produced by the evaluator
  • upper: Upper (1 - (1 - α) / 2) percentile of values produced by the evaluator

Generated Confidence Interval = [lower, upper]
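
For example, with α = 0.95 the lower bound is the (1 - 0.95) / 2 = 0.025 quantile (the 2.5th percentile) and the upper bound is the 1 - 0.025 = 0.975 quantile (the 97.5th percentile), so [lower, upper] spans the central 95% of values the evaluator is expected to produce.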

Here is an example request body for the /v1/evaluate endpoint:

{
    "capture": "all",
    "app": "default",
    "evaluators": [
        {
            // continuous score type
            "evaluator": "toxicity"
        },
        {
            // binary score type
            "evaluator": "custom",
            "profile_name": "no-comma"
        }
    ],
    "evaluated_model_input": "Question?",
    "evaluated_model_output": "Answer.",
    "explain": true,
    "confidence_interval_strategy": "full-history",
    "tags": null
}

You can expect a response like the following:

{
    "results": [
        {
            "evaluator_id": "toxicity-2024-05-16",
            "profile_name": "system:detect-all-toxicity",
            "status": "success",
            ...
            "evaluation_result": {
                ...
                "additional_info": {
                    ...
                    "evaluator_family": "toxicity",
                    "confidence_interval": {
                        "strategy": "full-history",
                        "alpha": 0.95,
                        "lower": 0.2061775791977413,
                        "median": 0.22602017580167078,
                        "upper": 0.24472213619434405
                    }
                },
            },
        },
        {
            "evaluator_id": "custom-large-2024-05-16",
            "profile_name": "no-comma",
            "status": "success",
            ...
            "evaluation_result": {
                ...
                "additional_info": {
                    ...
                    "evaluator_family": "custom",
                    "confidence_interval": {
                        "strategy": "full-history",
                        "alpha": 0.95,
                        "lower": 0.25,
                        "median": 0.5,
                        "upper": 1.0
                    }
                },
            },
        }
    ]
}

Notice that the toxicity evaluator returns a continuous score, so its confidence interval includes a lower, median, and upper value interpreted as a range of expected raw scores: 95% of the time, you can expect score_raw to fall between 0.2061775791977413 and 0.24472213619434405. For the binary custom evaluator, the interval [0.25, 1.0] is instead interpreted as the range of expected probabilities that the evaluator returns 1 (pass).
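
If you are calling the REST API directly, here is a minimal sketch in Python of sending the request above and reading the interval back. The base URL, the auth header name, and the API_KEY environment variable are assumptions for illustration; confirm the exact values against the API reference.

import os
import requests

# Assumptions for illustration: base URL and auth header name; check the API reference.
API_URL = "https://api.example.com/v1/evaluate"
headers = {"X-API-KEY": os.environ["API_KEY"]}

payload = {
    "app": "default",
    "evaluators": [
        {"evaluator": "toxicity"},
        {"evaluator": "custom", "profile_name": "no-comma"},
    ],
    "evaluated_model_input": "Question?",
    "evaluated_model_output": "Answer.",
    "confidence_interval_strategy": "full-history",
}

response = requests.post(API_URL, json=payload, headers=headers, timeout=30)
response.raise_for_status()

for result in response.json()["results"]:
    ci = result["evaluation_result"]["additional_info"]["confidence_interval"]
    print(result["evaluator_id"], ci["lower"], ci["median"], ci["upper"])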
