Evaluate

Requires either input or output field to be specified. Absence of both leads to an HTTP_422 (Unprocessable Entity) error.

POST

/v1/evaluate

Request Body

application/jsonRequired

`evaluators`
Required
Evaluators

List of evaluators to evaluate against.

`system_prompt`System Prompt

The system prompt provided to the LLM.

`task_context`Task Context

Optional context retrieved from vector database. This is a list of strings, with the following restrictions:

Number of items must be less/equal than 50.
The sum of tokens in all elements must be less/equal than 120000, using o200k_base tiktoken encoding

`task_input`Task Input

The input (prompt) provided to LLM.

`task_output`Task Output

LLM's response to the given input.

`gold_answer`Gold Answer

Gold answer for given evaluated model input

`capture`CaptureOptions

Capture evaluation result based on given option, default is none:

all captures the result of all evaluations (pass + failed).
fails-only captures the evaluation result when evaluation failed.
none does not capture evaluation result

Default: "all"Value in: "all" | "fails-only" | "none"

`project_id`Project Id

Attach project with given ID to the evaluation.

Note: This parameter is ignored in case project_name or experiment_id is provided.

`project_name`Project Name

Attach project with given name to the evaluation. If project with given name doesn't exist, one will be created.

Note: This parameter is ignored in case experiment_id is provided.

Note: This parameter takes precedence over project_id.

`app`App

Assigns evaluation results to the app.

app cannot be used together with experiment_id.
If app and experiment_id is omitted, app is set automatically to "default" on capture.
Automatically creates an app if it doesn't exist.
Only relevant for captured results. If will capture the results under given app.

`experiment_id`Experiment Id

Assign evaluation results to the experiment.

experiment_id cannot be used together with app.
Only relevant for captured results. If will capture the results under experiment.

`dataset_id`Dataset Id

The ID of the dataset from which the evaluated sample originates. This field serves as metadata for the evaluation. This endpoint does not ensure data consistency for this field. There is no guarantee that the dataset with the given ID is present in the Patronus AI platform, as this is a self-reported value.

`dataset_sample_id`Dataset Sample Id

The ID of the sample within the dataset. This field serves as metadata for the evaluation. This endpoint does not ensure data consistency for this field. There is no guarantee that the dataset and sample are present in the Patronus AI platform, as this is a self-reported value.

`tags`Tags

Tags are key-value pairs used to label resources

`confidence_interval_strategy`ConfidenceIntervalStrategies

Create confidence intervals based on one of the following strategies:

'none': returns None
'full-history': calculates upper boundary, median, and lower boundary of confidence interval based on all available historic records.
'generated': calculates upper boundary, median, and lower boundary of confidence interval based on on-flight generated sample of evaluations.

Default: "none"Value in: "none" | "full-history"

`evaluated_model_attachments`Evaluated Model Attachments

Optional list of attachments to be associated with the evaluation sample. This will be added to all evaluation results in this request. Each attachment is a dictionary with the following keys:

url: URL of the attachment.
media_type: Media type of the attachment (e.g., "image/jpeg", "image/png").
usage_type: Type of the attachment (e.g., "evaluated_model_system_prompt", "evaluated_model_input").

`trace_id`Trace Id

`span_id`Span Id

`log_id`Log Id

curl -X POST "https://api.patronus.ai/v1/evaluate" \
  -H "X-API-KEY: <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "evaluators": [
      {
        "evaluator": "string",
        "criteria": "string",
        "explain_strategy": "never"
      }
    ],
    "system_prompt": "string",
    "task_context": [
      "string"
    ],
    "task_input": "string",
    "task_output": "string",
    "gold_answer": "string",
    "capture": "all",
    "project_id": "405d8375-3514-403b-8c43-83ae74cfe0e9",
    "project_name": "string",
    "app": "string",
    "experiment_id": "string",
    "dataset_id": "string",
    "dataset_sample_id": "string",
    "tags": {},
    "confidence_interval_strategy": "none",
    "evaluated_model_attachments": [
      {
        "url": "string",
        "media_type": "image/jpeg",
        "usage_type": "evaluated_model_system_prompt"
      }
    ],
    "trace_id": "string",
    "span_id": "string",
    "log_id": "string"
  }'

Successful Response

{
  "results": [
    {
      "evaluator_id": "string",
      "status": "string",
      "error_message": "string",
      "evaluation_result": {
        "id": "string",
        "log_id": "14b5977f-7a80-40ca-bb79-eca6c2abdb34",
        "app": "string",
        "project_id": "405d8375-3514-403b-8c43-83ae74cfe0e9",
        "experiment_id": "string",
        "created_at": "2019-08-24T14:15:22Z",
        "evaluator_id": "string",
        "profile_name": "string",
        "criteria_revision": 0,
        "evaluated_model_system_prompt": "string",
        "evaluated_model_retrieved_context": [
          "string"
        ],
        "evaluated_model_input": "string",
        "evaluated_model_output": "string",
        "evaluated_model_gold_answer": "string",
        "evaluated_model_attachments": [
          {
            "url": "string",
            "media_type": "string",
            "usage_type": "string"
          }
        ],
        "explain_strategy": "never",
        "pass": true,
        "score_raw": 0,
        "text_output": "string",
        "additional_info": {
          "positions": [
            [
              0
            ]
          ],
          "extra": {},
          "confidence_interval": {
            "strategy": "string",
            "alpha": 0,
            "lower": 0,
            "median": 0,
            "upper": 0
          }
        },
        "evaluation_metadata": {},
        "explanation": "string",
        "evaluation_duration": "string",
        "explanation_duration": "string",
        "evaluation_run_id": 0,
        "evaluator_family": "string",
        "evaluator_profile_public_id": "fe6c9202-ffdf-40e1-8f9b-304d0cb5a8db",
        "evaluated_model_id": "string",
        "evaluated_model_name": "string",
        "evaluated_model_provider": "string",
        "evaluated_model_params": {},
        "evaluated_model_selected_model": "string",
        "dataset_id": "string",
        "dataset_sample_id": 0,
        "tags": {
          "property1": "string",
          "property2": "string"
        },
        "external": true,
        "favorite": true,
        "evaluation_feedback": true,
        "usage_tokens": 0,
        "metric_name": "string",
        "metric_description": "string",
        "evaluation_type": "string",
        "annotation_criteria_id": "e5d4b10b-c239-4b00-9620-5b9e8428bf29"
      }
    }
  ]
}

Evaluate

Authorization

`X-API-KEY`<token>

Request Body

`evaluators`
Required
Evaluators

`system_prompt`System Prompt

`task_context`Task Context

`task_input`Task Input

`task_output`Task Output

`gold_answer`Gold Answer

`capture`CaptureOptions

`project_id`Project Id

`project_name`Project Name

`app`App

`experiment_id`Experiment Id

`dataset_id`Dataset Id

`dataset_sample_id`Dataset Sample Id

`tags`Tags

`confidence_interval_strategy`ConfidenceIntervalStrategies

`evaluated_model_attachments`Evaluated Model Attachments

`trace_id`Trace Id

`span_id`Span Id

`log_id`Log Id

Evaluate

Authorization

X-API-KEY<token>

Request Body

evaluatorsRequiredEvaluators

EvaluateRequestEvaluator

system_promptSystem Prompt

task_contextTask Context

Object 1

task_inputTask Input

task_outputTask Output

gold_answerGold Answer

captureCaptureOptions

project_idProject Id

project_nameProject Name

appApp

experiment_idExperiment Id

dataset_idDataset Id

dataset_sample_idDataset Sample Id

tagsTags

Attributes

confidence_interval_strategyConfidenceIntervalStrategies

evaluated_model_attachmentsEvaluated Model Attachments

Object 1

trace_idTrace Id

span_idSpan Id

log_idLog Id

Response

TypeScript

`X-API-KEY`<token>

`evaluators`
Required
Evaluators

`system_prompt`System Prompt

`task_context`Task Context

`task_input`Task Input

`task_output`Task Output

`gold_answer`Gold Answer

`capture`CaptureOptions

`project_id`Project Id

`project_name`Project Name

`app`App

`experiment_id`Experiment Id

`dataset_id`Dataset Id

`dataset_sample_id`Dataset Sample Id

`tags`Tags

`confidence_interval_strategy`ConfidenceIntervalStrategies

`evaluated_model_attachments`Evaluated Model Attachments

`trace_id`Trace Id

`span_id`Span Id

`log_id`Log Id