LLM Judges
Creating evaluators powered by language models
Note: For comprehensive API documentation and detailed examples, please refer to the Python SDK Documentation.
LLM Judges are evaluators that leverage language models to assess the quality of AI outputs. They're particularly useful for subjective evaluations like helpfulness, relevance, or creativity.
You can create a custom LLM judge using the StructuredEvaluator class:
```python
from patronus import init
from patronus.evals import StructuredEvaluator, EvaluationResult
import json
from openai import OpenAI

init()
oai = OpenAI()


class LLMJudge(StructuredEvaluator):
    """LLM-based evaluator that assesses answers against reference answers."""

    def evaluate(self, question: str, answer: str, reference: str = None) -> EvaluationResult:
        """
        Evaluate an answer using GPT-4o-mini.

        Args:
            question: The original question
            answer: The model's answer to evaluate
            reference: Optional reference answer for comparison
        """
        # Build the evaluation prompt based on available information
        if reference:
            evaluation_prompt = f"""
            Given the QUESTION and REFERENCE ANSWER, is the ANSWER correct?
            Score from 0.0 to 1.0, where 1.0 is completely correct.
            Return a JSON object with "score" and "explanation" fields.

            QUESTION: {question}
            ANSWER: {answer}
            REFERENCE ANSWER: {reference}
            """
        else:
            evaluation_prompt = f"""
            Given the QUESTION, evaluate if the ANSWER is correct, helpful, and relevant.
            Score from 0.0 to 1.0, where 1.0 is excellent.
            Return a JSON object with "score" and "explanation" fields.

            QUESTION: {question}
            ANSWER: {answer}
            """

        # Call the LLM for evaluation
        response = oai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are an expert evaluator assessing the quality of answers."},
                {"role": "user", "content": evaluation_prompt},
            ],
            temperature=0,
            response_format={"type": "json_object"},
        )

        # Parse the response
        try:
            result = json.loads(response.choices[0].message.content)
            score = float(result.get("score", 0))
            explanation = result.get("explanation", "No explanation provided")
        except (json.JSONDecodeError, ValueError):
            # Fallback if parsing fails
            score = 0
            explanation = "Failed to parse LLM response"

        return EvaluationResult(
            score=score,
            pass_=score >= 0.7,  # Consider 0.7 as the passing threshold
            explanation=explanation,
        )
```
Once defined, you can use your LLM judge directly in your code:
```python
# Create the evaluator
answer_quality = LLMJudge()

# Evaluate an answer
result = answer_quality.evaluate(
    question="What is the capital of France?",
    answer="Paris is the capital of France.",
    reference="The capital of France is Paris.",
)

print(f"Score: {result.score}")
print(f"Passed: {result.pass_}")
print(f"Explanation: {result.explanation}")
```
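The same judge can be reused across a batch of examples with plain Python. Below is a minimal sketch; the `examples` list is a hypothetical hand-built dataset used only for illustration:

```python
# Hypothetical examples for illustration; in practice these would come from
# your own dataset or application logs.
examples = [
    {
        "question": "What is the capital of France?",
        "answer": "Paris is the capital of France.",
        "reference": "The capital of France is Paris.",
    },
    {
        "question": "What is the tallest mountain above sea level?",
        "answer": "Mount Everest is the tallest mountain above sea level.",
        "reference": "Mount Everest.",
    },
]

judge = LLMJudge()
results = [
    judge.evaluate(
        question=ex["question"],
        answer=ex["answer"],
        reference=ex["reference"],
    )
    for ex in examples
]

# Simple aggregate metrics over the batch
pass_rate = sum(r.pass_ for r in results) / len(results)
average_score = sum(r.score for r in results) / len(results)
print(f"Pass rate: {pass_rate:.0%}, average score: {average_score:.2f}")
```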
LLM judges integrate seamlessly with Patronus tracing:
```python
from patronus import init, traced
from patronus.evals import StructuredEvaluator, EvaluationResult
import json
from openai import OpenAI

init()
oai = OpenAI()


class LLMJudge(StructuredEvaluator):
    # Implementation as shown above
    ...


@traced()
def answer_question(question: str) -> str:
    """Generate an answer for a question."""
    # In a real application, this would call an LLM
    return "Paris is the capital of France and known for the Eiffel Tower."


@traced()
def process_query(query: str):
    """Process a user query and evaluate the response."""
    answer = answer_question(query)

    # Use LLM judge to evaluate the response
    judge = LLMJudge()
    result = judge.evaluate(
        question=query,
        answer=answer,
    )

    return {
        "query": query,
        "answer": answer,
        "quality_score": result.score,
        "quality_passed": result.pass_,
        "evaluation_explanation": result.explanation,
    }


# Process a query
response = process_query("What is the capital of France?")
```
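Because `process_query` returns the judge's verdict alongside the answer, the score can also gate what your application does with a response. Here is a minimal sketch under that assumption; the `respond` function and the fallback message are hypothetical and not part of the SDK:

```python
FALLBACK_ANSWER = "I'm not confident in that answer. Could you rephrase the question?"


def respond(query: str) -> str:
    """Return the generated answer only if the LLM judge passes it."""
    result = process_query(query)
    if result["quality_passed"]:
        return result["answer"]
    # Hypothetical fallback path for answers the judge scored below the 0.7 threshold
    return FALLBACK_ANSWER


print(respond("What is the capital of France?"))
```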