NLP Metrics Evaluators
Currently we support `bleu` and `rouge` as NLP metrics in the `metrics-v1` evaluator. These are common metrics in NLP. For more background on these metrics, you can read this blog.
To specify an NLP metric, pass `system:compute-bleu` or `system:compute-rouge` in the `profile_name` field.
Here's an example API request:
```shell
curl --location 'https://api.patronus.ai/v1/evaluate' \
--header 'accept: application/json' \
--header 'content-type: application/json' \
--header 'X-API-KEY: ••••••' \
--data '{
    "evaluators": [
        {
            "evaluator": "metrics",
            "profile_name": "system:compute-bleu"
        }
    ],
    "output": "hello there general kenobi",
    "label": "hello there general kenobi I am doing great today!"
}'
```
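If you'd rather work from Python, the same request can be sketched like this. The payload and headers mirror the curl example above; actually sending it requires your real API key (shown here as a placeholder):

```python
import json

# Same body as the curl example above
payload = {
    "evaluators": [
        {"evaluator": "metrics", "profile_name": "system:compute-bleu"}
    ],
    "output": "hello there general kenobi",
    "label": "hello there general kenobi I am doing great today!",
}
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "X-API-KEY": "<your API key>",  # placeholder
}

# Send with any HTTP client, e.g.:
#   requests.post("https://api.patronus.ai/v1/evaluate",
#                 headers=headers, json=payload)
print(json.dumps(payload, indent=2))
```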
And the example response:
```json
{
    "results": [
        {
            "evaluator_id": "metrics-2024-05-16",
            "profile_name": "system:compute-bleu",
            "status": "success",
            ...
            "evaluation_result": {
                ...
                "id": "112825104711650870",
                "app": "default",
                "created_at": "2024-08-08T21:25:10.549203Z",
                "evaluator_id": "metrics-2024-05-16",
                "profile_name": "system:compute-bleu",
                "evaluated_model_system_prompt": null,
                "evaluated_model_retrieved_context": null,
                "evaluated_model_input": null,
                "evaluated_model_output": "hello there general kenobi",
                "evaluated_model_gold_answer": "hello there general kenobi I am doing great today!",
                "explain": false,
                "explain_strategy": "never",
                "pass": true,
                "score_raw": 0.22,
                "score_normalized": -1.0,
                "additional_info": {
                    "score_raw": 0.22,
                    "positions": null,
                    "extra": null,
                    "confidence_interval": null
                },
                "evaluation_duration": "PT0.216S",
                "evaluator_family": "metrics",
                "evaluator_profile_public_id": "99c73df3-a3b7-4599-a201-0442c4815778",
                "external": false
            }
        }
    ]
}
```
The metric score is returned in `score_raw`. In this case, we have a BLEU score of 0.22.
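For example, given a heavily trimmed-down response shaped like the one above (fields elided with `...` in the docs are omitted here), the score can be read out of the nested `evaluation_result` object:

```python
# A trimmed version of the example response above
response = {
    "results": [
        {
            "evaluator_id": "metrics-2024-05-16",
            "profile_name": "system:compute-bleu",
            "status": "success",
            "evaluation_result": {"score_raw": 0.22, "pass": True},
        }
    ]
}

score = response["results"][0]["evaluation_result"]["score_raw"]
print(score)  # → 0.22
```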
Note that we return `pass=true` by default for this evaluator's system profiles (e.g. `system:compute-bleu` or `system:compute-rouge`). You can create your own evaluator profiles for the `metrics` family and specify the pass threshold.
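Conceptually, a custom profile's pass threshold turns the raw score into a pass/fail decision. Here's a hypothetical sketch of that rule (the real threshold is configured on the evaluator profile in the platform, not in client code):

```python
def passes(score_raw: float, threshold: float) -> bool:
    """Hypothetical pass rule: pass when the raw metric score
    meets or exceeds a profile's configured threshold."""
    return score_raw >= threshold

# The example BLEU score of 0.22 against a hypothetical 0.5 threshold:
print(passes(0.22, 0.5))  # → False
```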