
LLM Judges

Creating evaluators powered by language models

Note: For comprehensive API documentation and detailed examples, please refer to the Python SDK Documentation.

LLM Judges are evaluators that leverage language models to assess the quality of AI outputs. They're particularly useful for subjective evaluations like helpfulness, relevance, or creativity.

Creating an LLM Judge

You can create a custom LLM judge using the StructuredEvaluator class:

from patronus import init
from patronus.evals import StructuredEvaluator, EvaluationResult
import json
from openai import OpenAI
 
init()
oai = OpenAI()
 
class LLMJudge(StructuredEvaluator):
    """LLM-based evaluator that assesses answers against reference answers."""
    
    def evaluate(self, question: str, answer: str, reference: str | None = None) -> EvaluationResult:
        """
        Evaluate an answer using GPT-4o-mini.
        
        Args:
            question: The original question
            answer: The model's answer to evaluate
            reference: Optional reference answer for comparison
        """
        # Build the evaluation prompt based on available information
        if reference:
            evaluation_prompt = f"""
            Given the QUESTION and REFERENCE ANSWER, is the ANSWER correct?
            Score from 0.0 to 1.0, where 1.0 is completely correct.
            Return a JSON object with "score" and "explanation" fields.
            
            QUESTION: {question}
            ANSWER: {answer}
            REFERENCE ANSWER: {reference}
            """
        else:
            evaluation_prompt = f"""
            Given the QUESTION, evaluate if the ANSWER is correct, helpful, and relevant.
            Score from 0.0 to 1.0, where 1.0 is excellent.
            Return a JSON object with "score" and "explanation" fields.
            
            QUESTION: {question}
            ANSWER: {answer}
            """
 
        # Call the LLM for evaluation
        response = oai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are an expert evaluator assessing the quality of answers."},
                {"role": "user", "content": evaluation_prompt}
            ],
            temperature=0,
            response_format={"type": "json_object"}
        )
        
        # Parse the response
        try:
            result = json.loads(response.choices[0].message.content)
            score = float(result.get("score", 0))
            explanation = result.get("explanation", "No explanation provided")
        except (json.JSONDecodeError, ValueError):
            # Fallback if parsing fails
            score = 0
            explanation = "Failed to parse LLM response"
        
        return EvaluationResult(
            score=score,
            pass_=score >= 0.7,  # Consider 0.7 as the passing threshold
            explanation=explanation
        )

Using the LLM Judge

Once defined, you can use your LLM judge directly in your code:

# Create the evaluator
answer_quality = LLMJudge()
 
# Evaluate an answer
result = answer_quality.evaluate(
    question="What is the capital of France?",
    answer="Paris is the capital of France.",
    reference="The capital of France is Paris."
)
 
print(f"Score: {result.score}")
print(f"Passed: {result.pass_}")
print(f"Explanation: {result.explanation}")

Using with Tracing

LLM judges integrate seamlessly with Patronus tracing:

from patronus import init, traced
from patronus.evals import StructuredEvaluator, EvaluationResult
import json
from openai import OpenAI
 
init()
oai = OpenAI()
 
class LLMJudge(StructuredEvaluator):
    # Implementation as shown above
    ...
 
@traced()
def answer_question(question: str) -> str:
    """Generate an answer for a question."""
    # In a real application, this would call an LLM
    return "Paris is the capital of France and known for the Eiffel Tower."
 
@traced()
def process_query(query: str):
    """Process a user query and evaluate the response."""
    answer = answer_question(query)
    
    # Use LLM judge to evaluate the response
    judge = LLMJudge()
    result = judge.evaluate(
        question=query,
        answer=answer
    )
    
    return {
        "query": query,
        "answer": answer,
        "quality_score": result.score,
        "quality_passed": result.pass_,
        "evaluation_explanation": result.explanation
    }
 
# Process a query
response = process_query("What is the capital of France?")
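
The answer_question stub above returns a hard-coded string. In a real application it would call an LLM itself. Below is a minimal sketch of what that might look like, reusing the oai client and traced decorator from the block above; the model and system prompt are illustrative choices, not a required setup.

@traced()
def answer_question(question: str) -> str:
    """Generate an answer for a question by calling an LLM."""
    response = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant. Answer concisely."},
            {"role": "user", "content": question}
        ],
        temperature=0
    )
    return response.choices[0].message.content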

Advanced Customization

You can enhance your LLM judge with features like the following (a combined sketch appears after this list):

  1. Custom Rubrics: Provide detailed scoring guidelines in your prompt.
  2. Multi-criteria Evaluation: Assess multiple aspects like accuracy, clarity, and completeness.
  3. Model Selection: Use different models for different evaluation needs.
  4. Calibration: Include few-shot examples to improve consistency.
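
For example, the sketch below combines a custom rubric, multi-criteria scoring, a swappable judge model, and a few-shot calibration example. The rubric wording, criterion names, equal weighting, and the 0.7 passing threshold are all illustrative choices, and the sketch assumes the same OpenAI client pattern as the examples above rather than any additional Patronus API.

from patronus import init
from patronus.evals import StructuredEvaluator, EvaluationResult
import json
from openai import OpenAI
 
init()
oai = OpenAI()
 
# Model selection: swap this constant to use a different judge model.
JUDGE_MODEL = "gpt-4o-mini"
 
# Custom rubric with multiple criteria (criteria and scale are illustrative).
RUBRIC = """
Score the ANSWER on each criterion from 0.0 to 1.0:
- accuracy: the answer is factually correct and consistent with the question
- clarity: the answer is easy to read and unambiguous
- completeness: the answer addresses every part of the question
Return a JSON object with "accuracy", "clarity", "completeness", and "explanation" fields.
"""
 
# Calibration: a worked example included in the prompt to anchor the scoring scale.
FEW_SHOT_EXAMPLE = """
Example:
QUESTION: What is 2 + 2?
ANSWER: 4
Expected JSON: {"accuracy": 1.0, "clarity": 1.0, "completeness": 1.0, "explanation": "Correct, clear, and complete."}
"""
 
class RubricJudge(StructuredEvaluator):
    """Multi-criteria LLM judge using a custom rubric and a few-shot calibration example."""
 
    def evaluate(self, question: str, answer: str) -> EvaluationResult:
        # Build the prompt from the rubric, the calibration example, and the inputs
        prompt = f"{RUBRIC}\n{FEW_SHOT_EXAMPLE}\nQUESTION: {question}\nANSWER: {answer}"
 
        response = oai.chat.completions.create(
            model=JUDGE_MODEL,
            messages=[
                {"role": "system", "content": "You are an expert evaluator assessing the quality of answers."},
                {"role": "user", "content": prompt}
            ],
            temperature=0,
            response_format={"type": "json_object"}
        )
 
        try:
            result = json.loads(response.choices[0].message.content)
            # Aggregate per-criterion scores; equal weights here, adjust as needed
            criteria = ("accuracy", "clarity", "completeness")
            score = sum(float(result.get(c, 0.0)) for c in criteria) / len(criteria)
            explanation = result.get("explanation", "No explanation provided")
        except (json.JSONDecodeError, ValueError):
            score = 0.0
            explanation = "Failed to parse LLM response"
 
        return EvaluationResult(
            score=score,
            pass_=score >= 0.7,
            explanation=explanation
        )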

For more detailed examples and best practices, refer to the Python SDK Documentation.
