Multimodal Evaluations
Overview
Patronus multimodal evaluations enable scoring of LLM-based image and audio outputs. Outputs from popular multimodal models such as GPT-4o, Claude Opus, and Google’s Gemini can be evaluated for hallucinations, image-to-caption relevance, and other criteria. Our multimodal evaluations support applications such as e-commerce platforms, digital-first retailers and marketers, graphic design and creative software, food service delivery websites, and OCR-based image-to-text companies in medicine and finance.
Currently, we support evaluating image inputs against text inputs. The evaluators output text explanations along with the other metrics shown below.
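As a concrete starting point, here is a minimal sketch of submitting an image and its caption for evaluation over the Patronus REST API. The endpoint path, evaluator name, and request fields below are illustrative assumptions; consult the API reference for the exact schema.

```python
import os

import requests

# Assumed endpoint path and request schema -- verify against the API reference.
API_URL = "https://api.patronus.ai/v1/evaluate"

payload = {
    "evaluators": [{"evaluator": "judge-image"}],  # hypothetical evaluator name
    "evaluated_model_input": "Describe the primary object in the image.",
    "evaluated_model_output": "A weathered farmhouse at dusk.",  # caption under test
    "evaluated_model_attachments": [
        {
            "url": "https://example.com/images/landscape.png",  # remote URL, not a local path
            "media_type": "image/png",
            "usage_type": "evaluated_model_input",
        }
    ],
}

resp = requests.post(
    API_URL,
    json=payload,
    headers={"X-API-KEY": os.environ["PATRONUS_API_KEY"]},  # assumed auth header
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```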
How multimodal evaluations work
Explanations and confidence scores, as shown in the image below, provide insight into the evaluation.
Outputs from different models or prompts can be evaluated as new models are released and benchmarked.
Example use cases include evaluating whether captions accurately describe images, whether user queries surface the most relevant product screenshots, whether OCR extraction of tabular data is accurate, and whether AI-generated brand images, logos, and listings are accurate.
An example log evaluating whether an image caption accurately describes the primary object in the image. In the example below, the model output shown in the "Task Output" field hallucinated details of the image, referring to a farmhouse while the image shows a landscape. Although the caption is semantically similar to the image, the model did not accurately describe the primary object, a landscape.
Use natural-language explanations auto-generated by Patronus to augment your understanding of the evaluation (see the sketch after this list).
View the results summary from multiple multimodal evaluators.
Add annotations to the results to build a ground truth dataset and validate the evaluators.
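To illustrate what these fields look like in practice, the snippet below pulls the pass/fail verdict, score, and explanation out of an evaluation response. The response layout here is an assumption for illustration; the real schema is documented in the API reference.

```python
# Hypothetical response layout, shaped after the fields described above.
sample_response = {
    "results": [
        {
            "evaluator_id": "judge-image",  # hypothetical evaluator name
            "pass": False,
            "score": 0.12,
            "explanation": (
                "The caption refers to a farmhouse, but the image shows "
                "a landscape with no buildings."
            ),
        }
    ]
}

for result in sample_response["results"]:
    verdict = "PASS" if result["pass"] else "FAIL"
    print(f"{result['evaluator_id']}: {verdict} (score={result['score']})")
    print(f"  explanation: {result['explanation']}")
```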
Details
Currently, we support image inputs supplied as URLs; local file paths are not supported.
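Because only remote URLs are accepted, a quick client-side check can catch local paths before a request is made. This helper is illustrative, not part of any Patronus SDK.

```python
from urllib.parse import urlparse

def is_remote_url(path: str) -> bool:
    """Return True if `path` looks like an http(s) URL rather than a local file path."""
    return urlparse(path).scheme in ("http", "https")

assert is_remote_url("https://example.com/images/landscape.png")
assert not is_remote_url("/home/user/images/landscape.png")
```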
Supported media types:
Caption hallucination example
The following example demonstrates how to evaluate an image against a caption; a request sketch follows this paragraph. All of the remote Patronus evaluators that assist with this task are listed below. These evaluators build a ground-truth snapshot of the image by scanning for the presence and location of text, grid structure, the spatial orientation of objects and text, and object identification and description. Based on this snapshot, the evaluators judge whether the text describing the image accurately describes the primary and non-primary objects and their locations, among other criteria. The full set of out-of-the-box criteria appears after the sketch.
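As a sketch, the request from the overview can be extended to run several image-caption criteria against the same image/caption pair in one call. The criterion names below are illustrative assumptions; the real names come from the out-of-the-box list that follows.

```python
# Hypothetical criterion names; substitute the out-of-the-box criteria listed below.
payload = {
    "evaluators": [
        {"evaluator": "judge-image", "criteria": "caption-describes-primary-object"},
        {"evaluator": "judge-image", "criteria": "caption-describes-non-primary-objects"},
        {"evaluator": "judge-image", "criteria": "caption-hallucination"},
    ],
    "evaluated_model_output": "A weathered farmhouse at dusk.",  # caption under test
    "evaluated_model_attachments": [
        {
            "url": "https://example.com/images/landscape.png",  # remote URL, not a local path
            "media_type": "image/png",
            "usage_type": "evaluated_model_input",
        }
    ],
}
```

Submitted to the same endpoint as before, a payload like this would return one result per criterion, each with its own score and explanation.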