Confidence Intervals
This feature is in Beta and the output fields in `confidence_interval` may change. Exercise caution if taking a dependency on these fields.
In statistics, a confidence interval (CI) is an interval that is expected to contain the parameter being estimated. More specifically, given a confidence level α (95% and 99% are typical values), a CI is a random interval that contains the parameter being estimated with probability α.
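Expressed as a formula (the standard textbook formulation, not anything specific to this API), a CI [lower, upper] for a parameter θ satisfies:

$$\Pr(\text{lower} \le \theta \le \text{upper}) = \alpha$$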
Score Types
All confidence intervals are calculated using historical scores returned by evaluators. If an evaluator returns only scores of 0 or 1, the score is equivalent to the fail/pass flag. Because of that, the interpretation of the CI depends on the type of score:
- Binary: If an evaluator returns only `0` or `1`, the confidence interval represents a range of expected probabilities that the evaluator will return `1`, given a confidence level α.
- Continuous: If an evaluator returns a continuous range of values, the confidence interval represents a range of expected values, given a confidence level α.
| Evaluator Family | Score Type |
|---|---|
| custom | Binary |
| exact-match | Binary |
| phi | Binary |
| pii | Binary |
| retrieval-hallucination | Binary |
| retrieval-answer-relevance | Binary |
| retrieval-context-relevance | Binary |
| retrieval-context-sufficiency | Binary |
| toxicity | Continuous |
| metrics | Continuous |
Calculation Methods
For every evaluator family, there is an option to calculate confidence intervals using a CI calculation strategy, selected via the `confidence_interval_strategy` parameter. Currently, there are two available options:
- `none`: A strategy that does not produce any confidence interval.
- `full-history`: A strategy that creates a confidence interval based on the latest 1000 historical evaluations. At least 2 evaluations are required to generate a confidence interval.
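Opting into CI calculation therefore amounts to adding one parameter to the request body (a minimal fragment; a fuller sample request appears later on this page):

```json
{
  "confidence_interval_strategy": "full-history"
}
```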
Good to Know
Here's how we produce confidence intervals under the `full-history` strategy:
- Use data from the following sources:
  - Evaluation runs executed from the context of the application
  - Evaluation API responses that were run with `"capture": "always"`
- Scope data down to your `Account ID`, `evaluator ID`, and `profile name`
- Calculate intervals using Monte-Carlo-like bootstrapping on the `score_raw` response parameter (see the sketch after this list)
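To make the bootstrapping step concrete, here is a minimal Python sketch of a Monte-Carlo-like percentile bootstrap over historical `score_raw` values. This is an illustration under stated assumptions, not the production implementation: the resample count, estimator, and tie-handling are undocumented, and `full_history_ci` is a hypothetical helper name.

```python
import random
import statistics

def full_history_ci(scores, alpha=0.95, n_resamples=10_000, seed=0):
    """Sketch of a Monte-Carlo-like percentile bootstrap over historical scores."""
    if len(scores) < 2:
        raise ValueError("at least 2 historical evaluations are required")
    rng = random.Random(seed)
    pooled = []
    for _ in range(n_resamples):
        # One Monte Carlo draw: resample the history with replacement.
        pooled.extend(rng.choices(scores, k=len(scores)))
    pooled.sort()
    tail = (1 - alpha) / 2

    def percentile(q):
        # Index into the sorted pooled draws at quantile q.
        return pooled[min(int(q * len(pooled)), len(pooled) - 1)]

    return {
        "strategy": "full-history",
        "alpha": alpha,
        "lower": percentile(tail),        # (1 - α) / 2 percentile
        "median": statistics.median(pooled),
        "upper": percentile(1 - tail),    # 1 - (1 - α) / 2 percentile
    }

# Usage: a binary evaluator's history of 0/1 scores yields a CI on its pass probability.
print(full_history_ci([0, 1, 1, 0, 1, 1, 1, 0, 1, 1]))
```

For a binary evaluator the historical scores are all `0` or `1`, so the resulting interval reads as a range of pass probabilities, matching the Binary interpretation above.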
Request and Response
There are 5 fields returned in the response from the confidence interval calculation:
- `strategy`: CI calculation strategy
- `alpha`: Confidence level
- `lower`: Lower ((1 - α) / 2) percentile of values produced by the evaluator
- `median`: Expected median of values produced by the evaluator
- `upper`: Upper (1 - (1 - α) / 2) percentile of values produced by the evaluator
Generated Confidence Interval = [lower, upper]
Here is an example parameter list for a sample request to the /v1/evaluate endpoint:
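The payload below is a sketch rather than the authoritative schema: the evaluator name and the input/output field names are illustrative assumptions, while `confidence_interval_strategy` and `"capture": "always"` are the parameters described above.

```json
{
  "evaluators": [
    { "evaluator": "toxicity" }
  ],
  "evaluated_model_input": "Tell me about confidence intervals.",
  "evaluated_model_output": "A confidence interval is a range of plausible values for a parameter.",
  "capture": "always",
  "confidence_interval_strategy": "full-history"
}
```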
You can expect the following response back out:
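This is likewise a sketch of the relevant part of the response; the `confidence_interval` object carries the five fields listed above. The `alpha`, `lower`, and `upper` values match the interpretation below, while `score_raw` and `median` are illustrative placeholders.

```json
{
  "results": [
    {
      "evaluator_id": "toxicity",
      "score_raw": 0.22,
      "confidence_interval": {
        "strategy": "full-history",
        "alpha": 0.95,
        "lower": 0.2061775791977413,
        "median": 0.225,
        "upper": 0.24472213619434405
      }
    }
  ]
}
```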
Notice how the confidence interval for toxicity is continuous and thus includes a lower, median, and upper score. The interpretation is that 95% of the time, you can expect the raw score to fall between 0.2061775791977413 and 0.24472213619434405.
