Multimodal Evaluations

Overview

Patronus multimodal evaluations score the image and audio outputs of LLM-based applications. Outputs from popular multimodal models like GPT-4o, Claude Opus, and Google’s Gemini can be evaluated for hallucinations, image-to-caption relevance, and other criteria. Our multimodal evaluations support applications such as e-commerce platforms, digital-first retailers and marketers, graphic design and creative software, food delivery websites, and OCR-based image-to-text companies in medicine and finance.

Currently, we support the evaluation of image inputs against text inputs. The evaluators return text explanations along with the other metrics shown below.

How multimodal evaluations work

Explanations and confidence scores, as shown in the image below, provide insight into the evaluation.
Outputs from different models or prompts can be evaluated as new models are released and benchmarked. Example use cases include evaluating whether captions accurately describe images, whether user queries surface the most relevant product screenshots, whether OCR extraction of tabular data is accurate, and whether AI-generated brand images, logos, and listings are accurate.

The example log below evaluates whether the image caption accurately describes the primary object in the image. The model output shown in the "Task Output" field hallucinated details of the image, referring to a farmhouse while the image shows a landscape. Although the image is semantically similar to the caption description, the caption did not accurately describe the primary object, a landscape.

[Image: multimodal_log]

Use natural language explanations auto-generated by Patronus to augment your understanding of the evaluation.

[Image: multimodal_explanation]

View the results summary from multiple multimodal evaluators.

[Image: multimodal_summary]

Add annotations to the results to build a ground truth dataset and validate the evaluators.

[Image: add_annotation_multimodal]

Details

Currently, images must be provided as URLs; local file paths are not supported.

Supported media types:

    - image/jpeg
    - image/png

Upcoming media types (not yet supported):

    - audio/flac
    - audio/mp3
    - audio/mp4
    - audio/mpeg
    - audio/wav
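
The constraints above (URL-only images, JPEG or PNG media types) can be checked on the client before an attachment is sent. The following is a minimal sketch, not part of the Patronus SDK: `make_image_attachment` and `SUPPORTED_IMAGE_TYPES` are hypothetical names, and accepting both `http` and `https` schemes is an assumption. The returned dictionary mirrors the `task_attachments` entries used in the example later on this page.

```python
from urllib.parse import urlparse

# Media types listed as supported on this page.
SUPPORTED_IMAGE_TYPES = {"image/jpeg", "image/png"}


def make_image_attachment(url: str, media_type: str) -> dict:
    """Build one task_attachments entry (hypothetical helper, not an SDK function)."""
    parsed = urlparse(url)
    # Reject local file paths and anything that is not a web URL.
    # Assumption: both http and https URLs are acceptable.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"expected an http(s) URL, got: {url!r}")
    # Reject media types that are not currently supported (for example, audio).
    if media_type not in SUPPORTED_IMAGE_TYPES:
        raise ValueError(f"unsupported media type: {media_type!r}")
    return {"media_type": media_type, "url": url}
```

Failing early with a `ValueError` keeps an unsupported path or media type from reaching the evaluator at all.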

Caption hallucination example

The following example demonstrates how to evaluate an image against a caption using the remote Patronus evaluators built for this task. These evaluators build a ground truth snapshot of the image by scanning for text and its location, grid structure, the spatial orientation of objects and text, and object identification and description. Based on this snapshot, the evaluators judge, among other criteria, whether the caption accurately describes the primary and non-primary objects and their locations. Below is the full set of out-of-the-box criteria.

[Image: multimodal_evaluators]

Running an evaluation

# Import the remote evaluator wrapper from the Patronus SDK.
from patronus.evals import RemoteEvaluator

# Each evaluator pairs the "judge-image" evaluator with one caption criterion.
caption_hallucination = RemoteEvaluator("judge-image", "patronus:caption-hallucination")
caption_describes_primary_object = RemoteEvaluator("judge-image", "patronus:caption-describes-primary-object")
caption_describes_non_primary_objects = RemoteEvaluator("judge-image", "patronus:caption-describes-non-primary-objects")
caption_mentions_primary_object_location = RemoteEvaluator("judge-image", "patronus:caption-mentions-primary-object-location")
caption_hallucination_strict = RemoteEvaluator("judge-image", "patronus:caption-hallucination-strict")

# Evaluate a caption (task_output) against the image referenced by URL.
caption_hallucination.evaluate(
    task_output="""A skilled gymnast soars in a flawless arch above the beam, her body elegantly extended in a full rotation. Her concentrated gaze and tensed muscles reflect years of discipline and training, as she transitions from takeoff to a poised landing. With arms gracefully outstretched, she demonstrates both precision and artistry, highlighting the athletic mastery required to execute such a breathtaking maneuver. The narrow beam beneath her contrasts sharply with her fluid, controlled motion, underscoring the delicate balance and unwavering focus it takes to perform at this elite level.""",
    task_attachments=[
        {
            "media_type": "image/jpeg",
            "url": "https://d2z00kf51ll94q.cloudfront.net/archive/2019/large/OS_FI19007S_12.jpg",
        }
    ]
)
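
To compare several criteria on the same caption and image, the call above can be repeated for each evaluator. The sketch below assumes only what the example shows: each evaluator exposes `evaluate(task_output=..., task_attachments=...)`. `run_evaluators` and `StubEvaluator` are hypothetical names introduced for illustration, and the stub's return value is made up so the snippet runs without network access; it says nothing about what the real evaluators return.

```python
from typing import Any, Dict, List


def run_evaluators(
    evaluators: Dict[str, Any],
    task_output: str,
    task_attachments: List[dict],
) -> Dict[str, Any]:
    """Call evaluate() on every evaluator and collect the results by name."""
    results = {}
    for name, evaluator in evaluators.items():
        results[name] = evaluator.evaluate(
            task_output=task_output,
            task_attachments=task_attachments,
        )
    return results


class StubEvaluator:
    """Hypothetical stand-in for a RemoteEvaluator, used only for this demo."""

    def __init__(self, verdict: bool):
        self.verdict = verdict

    def evaluate(self, task_output: str, task_attachments: List[dict]) -> dict:
        # Made-up result shape; the real SDK result type may differ.
        return {"pass": self.verdict, "n_attachments": len(task_attachments)}


if __name__ == "__main__":
    attachments = [{"media_type": "image/png", "url": "https://example.com/x.png"}]
    stubs = {"hallucination": StubEvaluator(False), "primary_object": StubEvaluator(True)}
    for name, result in run_evaluators(stubs, "A caption.", attachments).items():
        print(name, result)
```

With the evaluators defined earlier, the dictionary could instead map names such as `"caption_hallucination"` and `"caption_hallucination_strict"` to the corresponding `RemoteEvaluator` objects.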
