Databricks adds customizable evaluation tools to boost AI agent accuracy

“Enterprises value domain experts’ input to ensure accurate evaluations that reflect unique contexts, business needs, or compliance standards,” said Robert Kramer, principal analyst at Moor Strategy and Insights. “When Agent Bricks was initially introduced, many enterprises welcomed the ability to automate the evaluation and assessment of agents based on quality. As these agents transitioned from prototypes to a more demanding production environment, the limitations of generic evaluation logic became evident,” Kramer added.

Tunable Judges was the result of customer feedback, specifically on capturing subject matter expertise accurately and letting enterprises define what “correctness” is applicable to their agents, Wiley said.

Tunable Judges could be used in ensuring that clinical summaries don’t omit contraindications in healthcare, or in enforcing compliant language in portfolio recommendations, and evaluating tone, de-escalation accuracy, or even in policy adherence in customer support.