Confidence Intervals (Beta)
This document describes confidence intervals that estimate an expected range of values taken by an evaluator.
This feature is in Beta and the output fields in confidence_interval may change. Exercise caution if taking a dependency on these fields.
In statistics, a confidence interval (CI) is an interval that is expected to contain the parameter being estimated. More specifically, given a confidence level α (typical values are 95% and 99%, that is, α = 0.95 and α = 0.99), a CI is a random interval that contains the parameter being estimated with probability α.
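As a compact restatement of the definition above, the coverage property can be written as follows. The symbols L, U, and θ are notation introduced here for illustration (lower endpoint, upper endpoint, and the estimated parameter); they do not appear in the API.

```latex
\[
P\bigl(L \le \theta \le U\bigr) = \alpha,
\qquad \text{e.g. } \alpha = 0.95 .
\]
```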
Score Types
All confidence intervals are calculated using historical scores returned by evaluators. If an evaluator returns only a score of 0 or 1, this value is equal to the fail/pass flag. The interpretation of a CI therefore depends on the type of score:
- Binary: If an evaluator returns only 0 or 1, the confidence interval represents a range of expected probabilities that the evaluator will return 1, given a confidence level α.
- Continuous: If an evaluator returns a continuous range of values, the confidence interval represents a range of expected values, given a confidence level α.
Evaluator Family | Score Type |
---|---|
custom | Binary |
exact-match | Binary |
phi | Binary |
pii | Binary |
retrieval-hallucination | Binary |
retrieval-answer-relevance | Binary |
retrieval-context-relevance | Binary |
retrieval-context-sufficiency | Binary |
toxicity | Continuous |
metrics | Continuous |
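The table above can be turned into a small lookup so that client code words an interval correctly for each score type. This is only a sketch: SCORE_TYPES and describe_interval are hypothetical helper names, not part of the API, and the mapping simply mirrors the table.

```python
# Hypothetical client-side lookup mirroring the table above.
SCORE_TYPES = {
    "custom": "binary",
    "exact-match": "binary",
    "phi": "binary",
    "pii": "binary",
    "retrieval-hallucination": "binary",
    "retrieval-answer-relevance": "binary",
    "retrieval-context-relevance": "binary",
    "retrieval-context-sufficiency": "binary",
    "toxicity": "continuous",
    "metrics": "continuous",
}


def describe_interval(family, lower, upper, alpha=0.95):
    """Return a human-readable reading of [lower, upper] for an evaluator family."""
    score_type = SCORE_TYPES[family]  # raises KeyError for unknown families
    if score_type == "binary":
        return (f"At confidence level {alpha}, the probability of a score of 1 "
                f"is expected to lie in [{lower}, {upper}].")
    return (f"At confidence level {alpha}, the raw score is expected "
            f"to lie in [{lower}, {upper}].")


print(describe_interval("custom", 0.25, 1.0))
print(describe_interval("toxicity", 0.206, 0.245))
```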
Calculation Methods
For every evaluator family, there is an option to calculate confidence intervals using a selected CI calculation strategy defined by the parameter confidence_interval_strategy. Currently, there are two available options:
- none: A strategy that does not produce any confidence interval.
- full-history: A strategy that creates a confidence interval based on the latest 1000 historical evaluations. At least 2 evaluations are required to generate a confidence interval.
Good to Know
Here's how we produce confidence intervals under the full-history strategy:
- Use data from the following sources:
  - Evaluation runs executed from the context of the application,
  - Evaluation API responses that were run with the strategy "capture": "always"
- Scope data down to your Account ID, evaluator ID, and profile name
- Calculate intervals using Monte-Carlo-like bootstrapping on the score_raw response parameter
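To make the bootstrapping step more concrete, here is a minimal sketch of a percentile bootstrap over a list of historical scores. It is not the service's actual implementation: the resampled statistic (the mean), the number of resamples, the fixed seed, and the function name bootstrap_ci are all assumptions made for illustration. Only the 1000-evaluation window, the 2-evaluation minimum, and the output field names come from this page.

```python
import random
import statistics


def bootstrap_ci(scores, alpha=0.95, n_resamples=2000, max_history=1000, seed=0):
    """Percentile-bootstrap interval for the mean of the most recent scores.

    Assumption: the mean is the resampled statistic; the page does not say
    which statistic the service uses.
    """
    history = list(scores)[-max_history:]  # keep only the latest evaluations
    if len(history) < 2:
        return None  # at least 2 evaluations are required

    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(history, k=len(history)))
        for _ in range(n_resamples)
    )

    def pick(q):
        # Empirical quantile of the sorted bootstrap means.
        return means[min(int(q * n_resamples), n_resamples - 1)]

    lower_q = (1 - alpha) / 2
    return {
        "strategy": "full-history",
        "alpha": alpha,
        "lower": pick(lower_q),
        "median": pick(0.5),
        "upper": pick(1 - lower_q),
    }


# Binary scores: the interval is a range for the probability of a 1.
print(bootstrap_ci([0, 1, 1, 0, 1, 1, 0, 1]))
# Too little history: no interval is produced.
print(bootstrap_ci([0.3]))
```

With only a handful of binary scores the interval is wide, which is consistent with the idea that more history narrows the estimate.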
Request
There are 5 fields returned as the response from the confidence interval calculation:
- strategy: CI calculation strategy
- alpha: Confidence level
- lower: Lower ((1 - α) / 2) percentile of values produced by the evaluator
- median: Expected median of values produced by the evaluator
- upper: Upper (1 - (1 - α) / 2) percentile of values produced by the evaluator
Generated Confidence Interval = [lower, upper]
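As a quick worked example of the two percentile formulas above, the snippet below evaluates them for α = 0.95. The helper name percentile_bounds is hypothetical and exists only for this illustration.

```python
def percentile_bounds(alpha):
    """Return the (lower, upper) quantile levels for a confidence level alpha."""
    lower_q = (1 - alpha) / 2
    upper_q = 1 - (1 - alpha) / 2
    return lower_q, upper_q


lower_q, upper_q = percentile_bounds(0.95)
print(round(lower_q, 3), round(upper_q, 3))  # 0.025 0.975
```

In other words, at α = 0.95 the interval spans the 2.5th to the 97.5th percentile.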
Here is an example parameter list for a sample request to the /v1/evaluate endpoint:
{
  "capture": "all",
  "app": "default",
  "evaluators": [
    {
      // continuous score type
      "evaluator": "toxicity"
    },
    {
      // binary score type
      "evaluator": "custom",
      "profile_name": "no-comma"
    }
  ],
  "evaluated_model_input": "Question?",
  "evaluated_model_output": "Answer.",
  "explain": true,
  "confidence_interval_strategy": "full-history",
  "tags": null
}
You can expect a response like the following:
{
  "results": [
    {
      "evaluator_id": "toxicity-2024-05-16",
      "profile_name": "system:detect-all-toxicity",
      "status": "success",
      ...
      "evaluation_result": {
        ...
        "additional_info": {
          ...
          "evaluator_family": "toxicity",
          "confidence_interval": {
            "strategy": "full-history",
            "alpha": 0.95,
            "lower": 0.2061775791977413,
            "median": 0.22602017580167078,
            "upper": 0.24472213619434405
          }
        },
      },
    },
    {
      "evaluator_id": "custom-large-2024-05-16",
      "profile_name": "no-comma",
      "status": "success",
      ...
      "evaluation_result": {
        ...
        "additional_info": {
          ...
          "evaluator_family": "custom",
          "confidence_interval": {
            "strategy": "full-history",
            "alpha": 0.95,
            "lower": 0.25,
            "median": 0.5,
            "upper": 1.0
          }
        },
      },
    }
  ]
}
Notice how the confidence interval for toxicity is continuous and thus includes a lower, median, and upper score. The interpretation is that 95% of the time, you can expect the raw score to fall between 0.2061775791977413 and 0.24472213619434405. For the binary custom evaluator, the interval instead describes the probability of returning 1, here between 0.25 and 1.0.
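On the client side, the interval can be read out of a parsed response roughly as sketched below. The function name extract_confidence_intervals is hypothetical, and the sample dictionary keeps only the fields shown in the response above. Because the confidence_interval fields are in Beta, the sketch looks them up defensively and skips results that lack them.

```python
def extract_confidence_intervals(response):
    """Map evaluator_id -> (lower, median, upper) for results that carry a CI."""
    intervals = {}
    for result in response.get("results", []):
        info = result.get("evaluation_result", {}).get("additional_info", {})
        ci = info.get("confidence_interval")
        if not ci:
            continue  # strategy "none", too little history, or field absent
        intervals[result.get("evaluator_id")] = (
            ci.get("lower"),
            ci.get("median"),
            ci.get("upper"),
        )
    return intervals


# Trimmed-down copy of the example response above.
sample_response = {
    "results": [
        {
            "evaluator_id": "toxicity-2024-05-16",
            "evaluation_result": {
                "additional_info": {
                    "evaluator_family": "toxicity",
                    "confidence_interval": {
                        "strategy": "full-history",
                        "alpha": 0.95,
                        "lower": 0.2061775791977413,
                        "median": 0.22602017580167078,
                        "upper": 0.24472213619434405,
                    },
                }
            },
        },
        {
            "evaluator_id": "custom-large-2024-05-16",
            "evaluation_result": {
                "additional_info": {
                    "evaluator_family": "custom",
                    "confidence_interval": {
                        "strategy": "full-history",
                        "alpha": 0.95,
                        "lower": 0.25,
                        "median": 0.5,
                        "upper": 1.0,
                    },
                }
            },
        },
    ]
}

for evaluator_id, (lower, median, upper) in extract_confidence_intervals(sample_response).items():
    print(evaluator_id, lower, median, upper)
```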