Selecting The Right AI Evals Tool
Over the past year, I’ve focused heavily on AI Evals, both in my consulting work and teaching. A question I get constantly is, “What’s the best tool for evals?” I’ve always resisted answering directly for two reasons. First, people focus too much on tools instead of the process, expecting the tool to be an off-the-shelf solution when it rarely is. Second, the tools change so quickly that comparisons become outdated almost immediately.
Having used many of the popular eval tools, I can genuinely say that no single one is superior in every dimension. The “best” tool depends on your team’s skillset, technical stack, and maturity.
Instead of a feature-by-feature comparison, I think it’s more valuable to show you how a panel of data scientists skilled in evals assesses these tools. As part of my AI Evals course, we had three of the most dominant vendors (Langsmith, Braintrust, and Arize Phoenix) complete the same homework assignment. This gave us a unique opportunity to see how they tackle the exact same challenge.
We recorded the entire process with live commentary, which is available below. We think this might be helpful for learning about the kinds of things you should consider when selecting a tool for your team.
Thanks to Shreya Shankar and Bryan Bischof for serving as the panelists (alongside me).
Langsmith
With Harrison Chase, CEO of LangChain.
Braintrust
With Wayde Gilliam, former developer relations at Braintrust.
Arize Phoenix
With SallyAnn DeLucia, Technical AI Product Leader at Arize.
Criteria for Assessing AI Evals Tools
Here are themes that consistently surfaced during our review.
1. Workflow and Developer Experience
Reducing friction is more important than any single feature. Concretely, you should be mindful of the time it takes to go from observing a failure to iterating on a solution. For example, we appreciated the ability to go from viewing a single trace to experimenting with that same trace in a playground. For some teams with data-science backgrounds, a notebook-centric workflow is ideal as it provides transparency and control. This happens to be my preferred workflow as well.
When considering a notebook-centric workflow, it’s important to pay attention to the ergonomics of the SDK. This often boils down to the quality of the documentation and integration with existing data tools.
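To make the notebook-centric workflow concrete, here is a minimal sketch of the kind of quick aggregate it enables: pulling a JSONL trace export (a format most eval tools can produce; the field names here are hypothetical, not any specific vendor’s schema) into Python and counting failure modes.

```python
import json
from collections import Counter

# Hypothetical JSONL export: one trace per line. Field names are
# illustrative assumptions, not a real vendor schema.
raw_export = """\
{"trace_id": "t1", "input": "refund policy?", "output": "...", "error_mode": "hallucination"}
{"trace_id": "t2", "input": "order status", "output": "...", "error_mode": null}
{"trace_id": "t3", "input": "cancel order", "output": "...", "error_mode": "wrong_tool_call"}
"""

# Parse each line into a dict.
traces = [json.loads(line) for line in raw_export.splitlines()]

# Count failure modes across traces -- the kind of one-liner analysis
# that transparent, notebook-friendly data access makes trivial.
failure_counts = Counter(
    t["error_mode"] for t in traces if t["error_mode"] is not None
)
print(failure_counts.most_common())
```

The point is not the specific fields but the friction: with a good SDK, going from raw traces to an aggregate like this takes a few lines; with a walled-garden UI, it can take many clicks.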
2. Human-in-the-Loop Support
The best tools don’t try to automate away the human; they empower them. Since error analysis is the highest ROI activity in AI engineering, a tool’s ability to support efficient human review is paramount. Prioritize tools with first-class support for manual annotation and error analysis. As of this writing, one thing that is missing from many tools is axial coding.
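Since axial coding may be unfamiliar, here is a minimal sketch of the idea: open codes from manual error review are grouped under broader categories so you can see which failure theme dominates. The codes and the mapping below are hypothetical examples, not outputs from any tool.

```python
from collections import Counter

# Open codes assigned during manual error analysis (hypothetical labels).
open_codes = [
    "cited nonexistent doc", "made up price", "ignored user constraint",
    "made up price", "wrong date format", "ignored user constraint",
]

# Axial coding: collapse open codes into broader categories
# (this mapping is an assumption made by the analyst, not automated).
axial_map = {
    "cited nonexistent doc": "hallucination",
    "made up price": "hallucination",
    "ignored user constraint": "instruction_following",
    "wrong date format": "formatting",
}

category_counts = Counter(axial_map[code] for code in open_codes)
print(category_counts.most_common())
# The most frequent category is the highest-leverage place to start fixing.
```

A tool with first-class support for this would let you attach open codes during review and roll them up into categories without exporting anything; absent that, the fallback is exactly this kind of notebook snippet.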
3. Transparency and Control vs. “Magic”
Be deeply skeptical of features that promise full automation without human validation, as these can create a powerful and dangerous illusion of confidence. For example, be wary of features where an AI agent both creates an evaluation rubric and then immediately scores the outputs. This “stacking of abstractions” often hides flaws behind a high score. Favor tools that give you control and visibility.
4. Ecosystem Integration vs. Walled Gardens
An eval tool should fit your stack, not force you to fit its stack. Assess how well a tool integrates with your existing technologies. Also, beware of proprietary DSLs as they can add friction. Finally, the ability to export data into common formats for analysis in a variety of environments is a must-have.
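As a concrete test of the export requirement, a quick check is whether records pulled from a tool round-trip cleanly into a lowest-common-denominator format like CSV. The records below are hypothetical stand-ins for what an eval tool’s API might return.

```python
import csv
import io

# Hypothetical records pulled from an eval tool's API (illustrative only).
records = [
    {"trace_id": "t1", "score": 0.0, "label": "fail"},
    {"trace_id": "t2", "score": 1.0, "label": "pass"},
]

# Write to CSV, a format nearly every analysis environment
# (spreadsheets, pandas, R) can ingest without friction.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["trace_id", "score", "label"])
writer.writeheader()
writer.writerows(records)

csv_text = buf.getvalue()
print(csv_text)
```

If getting your own data out of a tool requires more ceremony than this, treat that as a red flag for lock-in.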
Conclusion
The right choice of tool depends on your team’s workflow, skillset, and specific needs. I hope seeing how our panel approached this evaluation provides a better framework for making your own decision.
Personally, I tend to use these tools as a backend data store and rely on Jupyter notebooks, along with my own custom-built annotation interfaces, for most of my needs.
You should take these notes with a grain of salt. I recommend watching the videos above to get a sense of how we applied these criteria and where you might differ according to your needs.
Langsmith Evaluation Notes
Overall Sentiment
The overall workflow is intuitive, especially for those new to formal evaluation processes. The UI guides you through creating datasets, running experiments, and annotating results.
Positive Feedback / What We Liked
Critiques and Areas for Improvement
Braintrust Evaluation Notes
Overall Sentiment
The panel had a generally positive view of Braintrust, highlighting its clean UI and structured approach to evaluations. The tool’s emphasis on human-in-the-loop workflows was a significant strength.
Positive Feedback / What We Liked
Critiques and Areas for Improvement
Arize Phoenix Evaluation Notes
Overall Sentiment
The panel had a generally positive view of Phoenix, with one panelist calling it one of his “favorite open source eval tools.” The tool is positioned as a developer-first, notebook-centric platform.
Positive Feedback / What We Liked
Critiques and Areas for Improvement