Confidence Intervals
This feature is in Beta and the output fields in `confidence_interval` may change. Exercise caution if taking a dependency on these fields.
In statistics, a confidence interval (CI) is an interval that is expected to contain the parameter being estimated. More specifically, given a confidence level α (95% and 99% are typical values), a CI is a random interval containing the parameter being estimated α% of the time.
Score Types
All confidence intervals are calculated using historical scores returned by evaluators. If an evaluator returns only a score of `0` or `1`, this value is equal to the fail/pass flag. As a result, the interpretation of a CI depends on the type of score:
- Binary: If an evaluator returns only `0` or `1`, the confidence interval represents a range of expected probabilities for which an evaluator will return `1`, given a confidence level α.
- Continuous: If an evaluator returns a continuous range of values, the confidence interval represents a range of expected values, given a confidence level α.
Evaluator Family | Score Type |
---|---|
custom | Binary |
exact-match | Binary |
phi | Binary |
pii | Binary |
retrieval-hallucination | Binary |
retrieval-answer-relevance | Binary |
retrieval-context-relevance | Binary |
retrieval-context-sufficiency | Binary |
toxicity | Continuous |
metrics | Continuous |
Calculation Methods
For every evaluator family, there is an option to calculate confidence intervals using a selected CI calculation strategy defined by the parameter `confidence_interval_strategy`. Currently, there are two available options:
- `none`: A strategy that does not produce any confidence interval.
- `full-history`: A strategy that creates a confidence interval based on the latest 1000 historical evaluations. At least 2 evaluations are required to generate a confidence interval.
Good to Know
Here's how we produce confidence intervals under the `full-history` strategy:
- Use data from the following sources:
  - Evaluation runs executed from the context of the application
  - Evaluation API responses that were run with the strategy `"capture": "always"`
- Scope data down to your Account ID, evaluator ID, and profile name
- Calculate intervals using Monte-Carlo-like bootstrapping on the `score_raw` response parameter
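The bootstrapping step can be illustrated with a minimal sketch. This is not the service's actual implementation: the function name `bootstrap_ci`, the number of resamples, the choice of the mean as the resampled statistic, and the assumption that scores are ordered oldest to newest are all hypothetical. Only the 1000-evaluation window, the 2-evaluation minimum, and the output field names come from this page.

```python
import random
import statistics


def bootstrap_ci(scores, alpha=0.95, n_resamples=2000, max_history=1000, seed=0):
    """Percentile-bootstrap interval for the mean of historical raw scores.

    Illustrative only: the statistic (mean) and n_resamples are assumptions.
    """
    # Keep only the latest evaluations (assumes scores are ordered oldest to newest).
    history = list(scores)[-max_history:]
    if len(history) < 2:
        return None  # at least 2 evaluations are required

    rng = random.Random(seed)
    # Resample the history with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(history, k=len(history)))
        for _ in range(n_resamples)
    )

    tail = (1 - alpha) / 2
    return {
        "strategy": "full-history",
        "alpha": alpha,
        "lower": means[int(tail * (n_resamples - 1))],
        "median": statistics.median(means),
        "upper": means[int((1 - tail) * (n_resamples - 1))],
    }


# Binary scores: the interval approximates a range of pass probabilities.
binary_ci = bootstrap_ci([0, 1, 1, 0, 1, 1, 1, 0, 1, 1])
# A single evaluation is not enough to produce an interval.
too_short = bootstrap_ci([0.5])
```

With binary scores the resampled mean is a pass rate, which matches the Binary interpretation above; with continuous scores the same procedure yields a range of expected values.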
Request
There are 5 fields returned as the response from the confidence interval calculation:
- `strategy`: CI calculation strategy
- `alpha`: Confidence level
- `lower`: Lower (`(1 - α) / 2`) percentile of values produced by the evaluator
- `median`: Expected median of values produced by the evaluator
- `upper`: Upper (`1 - (1 - α) / 2`) percentile of values produced by the evaluator
Generated Confidence Interval = [lower, upper]
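As a quick arithmetic check of the two percentile formulas, the sketch below computes the lower and upper levels for a given confidence level. It assumes α is expressed as a fraction (for example `0.95` for 95%); the helper name `percentile_bounds` is hypothetical and not part of the API.

```python
def percentile_bounds(alpha):
    """Return the (lower, upper) percentile levels for a confidence level alpha in (0, 1)."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be a fraction between 0 and 1")
    tail = (1 - alpha) / 2  # probability mass left outside the interval on each side
    return tail, 1 - tail


lower_q, upper_q = percentile_bounds(0.95)
print(f"{lower_q:.3f}, {upper_q:.3f}")  # 0.025, 0.975
```

For a 95% confidence level, the interval therefore spans the 2.5th to the 97.5th percentile.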
Here is an example parameter list for a sample request to the `/v1/evaluate` endpoint:
You can expect the following response back out:
Notice how the confidence interval for `toxicity` is continuous and thus includes a `lower`, `median`, and `upper` score. The interpretation is that 95% of the time, you can expect the raw score to fall between `0.2061775791977413` and `0.24472213619434405`.
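One way to use these bounds is to flag raw scores that fall outside the interval. The following is a minimal sketch; the helper `score_in_interval` is hypothetical, and the dictionary layout is assumed from the field names listed above rather than taken from an actual response.

```python
def score_in_interval(confidence_interval, score_raw):
    """Return True when score_raw lies inside [lower, upper]."""
    return confidence_interval["lower"] <= score_raw <= confidence_interval["upper"]


# Bounds copied from the toxicity example above; the dict layout is an assumption.
toxicity_ci = {"lower": 0.2061775791977413, "upper": 0.24472213619434405}

inside = score_in_interval(toxicity_ci, 0.22)   # within the interval
outside = score_in_interval(toxicity_ci, 0.60)  # well above the upper bound
```

Because these output fields are in Beta, code like this should tolerate the field names changing.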