# Confidence Intervals (Beta)

This document describes confidence intervals that estimate an expected range of values taken by an evaluator.

This feature is in Beta, and the output fields in `confidence_interval` may change. Exercise caution if taking a dependency on these fields.

In statistics, a confidence interval (CI) is an interval that is expected to contain the parameter being estimated. More specifically, given a confidence level α (0.95 and 0.99 are typical values), a CI is a random interval that contains the parameter being estimated with probability α (for example, 95% of the time when α = 0.95).

## Score Types

All confidence intervals are calculated using historical scores returned by evaluators. If an evaluator returns only a score of `0` or `1`, this value is equal to the `fail/pass` flag. Because of that, the interpretation of the CI depends on the type of score:

- **Binary**: If an evaluator returns only `0` or `1`, the confidence interval represents a range of expected probabilities that the evaluator will return `1`, given a confidence level α.
- **Continuous**: If an evaluator returns a continuous range of values, the confidence interval represents a range of expected values, given a confidence level α.

| Evaluator Family | Score Type |
|---|---|
| custom | Binary |
| exact-match | Binary |
| phi | Binary |
| pii | Binary |
| retrieval-hallucination | Binary |
| retrieval-answer-relevance | Binary |
| retrieval-context-relevance | Binary |
| retrieval-context-sufficiency | Binary |
| toxicity | Continuous |
| metrics | Continuous |

## Calculation Methods

For every evaluator family, you can calculate confidence intervals by selecting a CI calculation strategy with the `confidence_interval_strategy` parameter. Currently, there are two available options:

- `none`: A strategy that does not produce any confidence interval.
- `full-history`: A strategy that creates a confidence interval based on the latest 1000 historical evaluations. At least 2 evaluations are required to generate a confidence interval.

Good to Know

Here's how we produce confidence intervals under the `full-history` strategy:

- Use data from the following sources:
  - Evaluation runs executed from the context of the application
  - Evaluation API responses that were run with the strategy `"capture": "always"`
- Scope data down to your `Account ID`, `evaluator ID`, and `profile name`
- Calculate intervals using Monte-Carlo-like bootstrapping on the `score_raw` response parameter
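The steps above can be sketched as a percentile bootstrap. This is a minimal illustration under stated assumptions, not the service's actual implementation: the resampled statistic (the mean), the number of resamples, the seed, and the function name `bootstrap_ci` are all hypothetical. Only the 1000-evaluation window, the 2-evaluation minimum, and the use of raw scores come from this page.

```python
import random
import statistics


def bootstrap_ci(scores, alpha=0.95, n_resamples=2000, window=1000, seed=0):
    """Percentile-bootstrap interval for the mean of historical raw scores.

    Assumptions (not confirmed by this page): the resampled statistic is the
    mean, and n_resamples / seed are illustrative defaults.
    """
    history = list(scores)[-window:]  # keep only the latest `window` evaluations
    if len(history) < 2:
        return None  # at least 2 evaluations are required

    rng = random.Random(seed)
    # Resample the history with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(history, k=len(history)))
        for _ in range(n_resamples)
    )

    def percentile(q):
        # Nearest-rank lookup into the sorted bootstrap means.
        return means[min(len(means) - 1, int(q * len(means)))]

    tail = (1 - alpha) / 2
    return {
        "strategy": "full-history",
        "alpha": alpha,
        "lower": percentile(tail),
        "median": percentile(0.5),
        "upper": percentile(1 - tail),
    }


# Binary evaluator: scores are 0/1, so the interval bounds a pass probability.
binary_ci = bootstrap_ci([1, 0, 1, 1, 0, 1, 1, 1])
# Continuous evaluator: the interval bounds the expected raw score.
continuous_ci = bootstrap_ci([0.21, 0.25, 0.19, 0.23, 0.22, 0.24])
```

With only a handful of historical scores the interval is wide, which matches the intuition that more evaluations give a tighter estimate.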

## Request

The confidence interval calculation returns 5 fields in the response:

- `strategy`: CI calculation strategy
- `alpha`: Confidence level
- `lower`: Lower (`(1 - α) / 2`) percentile of values produced by the evaluator
- `median`: Expected median of values produced by the evaluator
- `upper`: Upper (`1 - (1 - α) / 2`) percentile of values produced by the evaluator

`Generated Confidence Interval = [lower, upper]`
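As a quick worked example of the two percentile formulas, the sketch below evaluates them for α = 0.95. The helper name `percentile_bounds` is hypothetical and used only for illustration.

```python
def percentile_bounds(alpha):
    """Return the (lower, upper) percentile ranks for a confidence level."""
    tail = (1 - alpha) / 2  # probability mass left out on each side
    return tail, 1 - tail


lower_q, upper_q = percentile_bounds(0.95)
print(f"{lower_q:.3f} {upper_q:.3f}")  # prints "0.025 0.975"
```

In other words, at α = 0.95 the `lower` and `upper` fields correspond to the 2.5th and 97.5th percentiles.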

Here is an example parameter list for a sample request to the `/v1/evaluate` endpoint:

```
{
  "capture": "all",
  "app": "default",
  "evaluators": [
    {
      // continuous score type
      "evaluator": "toxicity"
    },
    {
      // binary score type
      "evaluator": "custom",
      "profile_name": "no-comma"
    }
  ],
  "evaluated_model_input": "Question?",
  "evaluated_model_output": "Answer.",
  "explain": true,
  "confidence_interval_strategy": "full-history",
  "tags": null
}
```
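The same request can be built from Python with the standard library. This is a sketch under assumptions: the host in `BASE_URL`, the `X-API-KEY` header name, and the key value are placeholders that do not come from this page, so substitute whatever your account actually uses. The request is constructed but not sent.

```python
import json
import urllib.request

# Hypothetical placeholders: substitute your real API host and key.
BASE_URL = "https://api.example.com"
API_KEY = "YOUR_API_KEY"

payload = {
    "capture": "all",
    "app": "default",
    "evaluators": [
        {"evaluator": "toxicity"},  # continuous score type
        {"evaluator": "custom", "profile_name": "no-comma"},  # binary score type
    ],
    "evaluated_model_input": "Question?",
    "evaluated_model_output": "Answer.",
    "explain": True,
    "confidence_interval_strategy": "full-history",
    "tags": None,
}

request = urllib.request.Request(
    f"{BASE_URL}/v1/evaluate",
    data=json.dumps(payload).encode("utf-8"),
    # The authentication header name is an assumption, not taken from this page.
    headers={"Content-Type": "application/json", "X-API-KEY": API_KEY},
    method="POST",
)

# Uncomment to send the request once BASE_URL and API_KEY are real:
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
```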

You can expect a response like the following:

```
{
  "results": [
    {
      "evaluator_id": "toxicity-2024-05-16",
      "profile_name": "system:detect-all-toxicity",
      "status": "success",
      ...
      "evaluation_result": {
        ...
        "additional_info": {
          ...
          "evaluator_family": "toxicity",
          "confidence_interval": {
            "strategy": "full-history",
            "alpha": 0.95,
            "lower": 0.2061775791977413,
            "median": 0.22602017580167078,
            "upper": 0.24472213619434405
          }
        }
      }
    },
    {
      "evaluator_id": "custom-large-2024-05-16",
      "profile_name": "no-comma",
      "status": "success",
      ...
      "evaluation_result": {
        ...
        "additional_info": {
          ...
          "evaluator_family": "custom",
          "confidence_interval": {
            "strategy": "full-history",
            "alpha": 0.95,
            "lower": 0.25,
            "median": 0.5,
            "upper": 1.0
          }
        }
      }
    }
  ]
}
```
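Because these fields are in Beta, it may be safer to read them defensively. The sketch below pulls `confidence_interval` out of a parsed response using the nesting shown above. The helper name `extract_confidence_intervals` is hypothetical, and treating a missing or null field as `None` is an assumption about how responses without an interval look.

```python
def extract_confidence_intervals(response):
    """Map each evaluator_id to its confidence_interval dict, or None."""
    intervals = {}
    for result in response.get("results", []):
        evaluation = result.get("evaluation_result") or {}
        info = evaluation.get("additional_info") or {}
        # Assumption: the field may be missing or null when no interval exists.
        intervals[result.get("evaluator_id")] = info.get("confidence_interval")
    return intervals


# Trimmed copy of the toxicity result from the example response.
sample = {
    "results": [
        {
            "evaluator_id": "toxicity-2024-05-16",
            "evaluation_result": {
                "additional_info": {
                    "confidence_interval": {
                        "strategy": "full-history",
                        "alpha": 0.95,
                        "lower": 0.2061775791977413,
                        "median": 0.22602017580167078,
                        "upper": 0.24472213619434405,
                    }
                }
            },
        },
        # A result without any interval information.
        {"evaluator_id": "no-interval", "evaluation_result": None},
    ]
}

intervals = extract_confidence_intervals(sample)
ci = intervals["toxicity-2024-05-16"]
```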

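Notice how the confidence interval for `toxicity` is continuous and thus includes a `lower`, `median`, and `upper` score. The interpretation is that `95%` of the time, you can expect the raw score to fall between `0.2061775791977413` and `0.24472213619434405`. For the binary `custom` evaluator, the interval from `0.25` to `1.0` is instead a range of expected probabilities that the evaluator returns `1`.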
