Nomic Atlas
Patronus is a great way to determine whether the outputs from your LLM are correct. We can run those outputs through a series of evaluators and return automated pass/fail results to you. This helps you quantify correctness and is the first step toward improving the performance of your model.
If you want more actionable insights, one approach would be to export those results and dive deeper into what's going on. A great tool for that is Nomic Atlas which helps you visualize those results in an embedding map to understand semantic similarity between questions and answers. Our friends at Nomic did exactly that on our open-sourced hallucination benchmarking dataset HaluBench and wrote a great blog post about it here.
This quick tutorial uses a much simpler trivia dataset and walks you step by step through generating results in Patronus, filtering them in LLM Monitoring, exporting them to CSV, and uploading them to Nomic Atlas to leverage its capabilities.
Generating Results
For the purpose of this demo, we will be running the script below using the associated dataset of 100 questions in JSONL format. These questions are taken from a FlexiQuiz blog post posted here. Each row has a prompt that we will feed into Llama 3.1 8B using the Together API, a golden answer provided by FlexiQuiz, and a question category for tagging purposes. We then call the Patronus API and invoke the system:is-similar-to-gold-answer evaluator to check if the model output is similar in meaning to the golden answer stored as label.
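To give a feel for what the script does before you download it, here is a minimal sketch of the core loop. The dataset field names (prompt, label, category), the Together model ID, and the exact shape of the Patronus evaluation payload are assumptions for illustration; defer to the full script and the Patronus API reference for the real details.

```python
# Minimal sketch of the evaluation loop -- not the full script. The dataset
# field names (prompt, label, category), the Together model ID, and the
# Patronus endpoint/payload shape below are assumptions for illustration.
import json
import os

import requests
from together import Together

together_client = Together(api_key=os.environ["TOGETHER_API_KEY"])

with open("100-trivia-questions.jsonl") as f:
    rows = [json.loads(line) for line in f]

for row in rows:
    # Ask Llama 3.1 8B the trivia question via the Together API.
    completion = together_client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",  # assumed model ID
        messages=[{"role": "user", "content": row["prompt"]}],
    )
    model_output = completion.choices[0].message.content

    # Ask Patronus whether the output is similar in meaning to the golden
    # answer. Endpoint and payload are approximated from the field names used
    # in this tutorial -- check the Patronus API reference for the exact schema.
    evaluation = requests.post(
        "https://api.patronus.ai/v1/evaluate",
        headers={"X-API-KEY": os.environ["PATRONUS_API_KEY"]},
        json={
            "evaluators": [{"evaluator": "system:is-similar-to-gold-answer"}],
            "evaluated_model_input": row["prompt"],
            "evaluated_model_output": model_output,
            "evaluated_model_gold_answer": row["label"],
            "tags": {"category": row["category"]},
        },
        timeout=60,
    )
    print(evaluation.json())
```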
Be sure to download the dataset before trying to run the script, since it's a dependency, and save it in the same directory as the script under the name 100-trivia-questions.jsonl.
With the dataset and the script ready to go, we can execute it. Just remember to swap in two API keys (the PATRONUS_API_KEY and the TOGETHER_API_KEY) and log in to the Nomic CLI using nomic login NOMIC_API_KEY. You of course do not need to use Together AI if you have another model you'd like to work with. After running the script, you should be able to see the experiment in the Experiments tab.
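If you prefer to stay in Python, a quick sanity check along these lines (assuming all three keys are stored as environment variables, which is an assumption of this sketch) can save you a failed run; nomic.login does the same thing as the CLI command.

```python
# Fail fast if either key is missing before spending time on model calls.
import os

import nomic

for key in ("PATRONUS_API_KEY", "TOGETHER_API_KEY"):
    assert os.environ.get(key), f"{key} is not set"

# Equivalent to running `nomic login NOMIC_API_KEY` on the command line;
# storing the key in a NOMIC_API_KEY environment variable is an assumption.
nomic.login(os.environ["NOMIC_API_KEY"])
```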
Within an experiment, you can download the results to CSV.
Option #1: Use the Python Nomic SDK to Upload the Dataset
Now let's use Nomic Atlas to do Failure Mode Clustering. We can leverage Atlas' capabilities to visualize the prompts we fed into our model and understand if there are specific types of questions that Llama 3.1 8B tends to fail on. We'll plot points using the embedding space around the evaluated_model_input field, since this will give us an idea of whether similar topics of questions tend to trip up our model more than others.
The script creates a Pandas DataFrame as the results come in from Patronus' API. Those results can then be uploaded directly to Nomic Atlas using the Nomic Python SDK as shown in the code. It's pretty seamless and you can then view the results via the Nomic Atlas Dashboard.
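Here is a minimal sketch of that upload step, assuming results_records is the list of per-question dicts the script collects from the Patronus API responses, and that your installed nomic version exposes atlas.map_data (older releases use atlas.map_text with the same indexed_field idea); the dataset identifier is just an example name.

```python
# Sketch of the Option #1 upload. `results_records` is a hypothetical name for
# the list of per-question result dicts collected in the evaluation loop.
import pandas as pd
from nomic import atlas

df = pd.DataFrame(results_records)

dataset = atlas.map_data(
    data=df,
    indexed_field="evaluated_model_input",  # the column Atlas embeds and maps
    identifier="patronus-trivia-results",   # hypothetical dataset name
)
print(dataset)  # the dataset then appears in your Nomic Atlas Dashboard
```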
Option #2: Upload the Dataset on the Dashboard with the Results CSV
The first step is to access your Dashboard and create a new dataset. You'll be asked to upload your data in whatever format suits you best, including CSV. That's great news for us, since the data we exported from Patronus is already in CSV and can be uploaded directly into Nomic Atlas without a single format modification.
Be sure to choose evaluated_model_input as the embedding column if you want to follow along. You can of course select something else if you'd like to investigate a different question. It's often worth visualizing the same dataset with a few different embedding fields to better understand what's going on.
Viewing the Nomic Atlas Map
Nomic will take a bit of time to process the uploaded dataset and create a map based on the embedding column we gave it. You'll get an email once it's ready to go. At that point, you can click into the Map and see what's going on. We color the points based on the pass field because we're trying to understand whether questions that our model failed to answer correctly (i.e., where the model output differed in meaning from the golden label we provided) fall on similar topics. Below is the visualization we get, with False results colored in green.
The 100-question trivia dataset visualized in Nomic Atlas here, with True in orange and False in green
Now it looks like we've got some failures sprinkled across the embedding space, but most of the results tend to be correct. That's good news for the LLM. There is one area in particular, though, where our model does seem to be struggling. We can use the Lasso tool to focus on that area and see what kinds of questions are problematic for our model. The selected points are brought into focus and help us get to the bottom of this high-failure-rate cluster.
A Few Failed Examples
Let's pull out a few examples of failures to see what's going on.
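If you'd rather do this filtering locally instead of in the map, something along these lines works on the exported CSV; the filename and the evaluated_model_output column name are assumptions based on the fields we've referenced so far, so adjust them to match your export.

```python
import pandas as pd

# "patronus-experiment-results.csv" is a hypothetical filename for the CSV
# exported from the experiment.
results = pd.read_csv("patronus-experiment-results.csv")

# The pass column may come through as booleans or as "True"/"False" strings,
# so normalize before filtering.
failures = results[results["pass"].astype(str).str.lower() == "false"]

print(f"{len(failures)} of {len(results)} questions failed")
print(failures[["evaluated_model_input", "evaluated_model_output", "label"]].head())
```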
Looking at the first question, it seems our model answered Kyle Schwarber although the correct answer is actually Luke Voit. It looks like the model is hallucinating, and we caught it, which is good. We should think about how to reduce hallucinations, either by feeding in context or through some other approach.
The second question is a bit confusing. It asks us to identify which team holds a record for career strikeouts, which is inherently an individual's record, so it's not clear how valuable an answer to it truly is. The other issue is that the golden answer states that the New York Yankees are that team and then provides 27 championships as additional information, which is unrelated to the question and therefore not useful. This seems like more of a data quality issue than a model mistake.
This highlights the importance of having incredibly high data quality when using benchmarking datasets, so you can actually rely on the computed performance metrics. The good news is that's something Patronus can help with, so you don't need to pull random trivia question datasets from blog sites on the Internet.
The third question is more specific and therefore easier to verify. The first issue here is that the golden answer is wrong, since the longest MLB streak that season was held by the St. Louis Cardinals with 17 games won in a row. The second issue is that the question doesn't specify which sport we're talking about, and without that context the model assumed basketball instead of baseball.
Conclusion
In conclusion, it seems the model isn't actually to blame for most of these failures. We copied a problematic benchmarking dataset from an unverified information source and then identified inconsistencies and inaccuracies within it.
By combining the power of Patronus evaluators with Nomic Atlas' visualization capabilities, we've come to this realization and can now decide on the right next steps. It seems we need to fix our dataset first, rerun the failure analysis, and then decide where our model's weaknesses still lie. We did confirm that the model hallucinated on at least one occasion, so it's worth keeping an eye out for more of those.