Evaluating quality is hard; doing so automatically with LLMs is...
March 10th, 2025
I've been working on using LLMs to help evaluate quality for a couple of years, and I've noticed something critical: LLMs are good at objective judgments when given strict constraints.

When trying to judge something subjective, though, LLMs quickly go off the rails.

For example, using an LLM to evaluate the quality of a relatively small piece of code is straightforward:

- Does it match general style guides for that language?
- Does it compile?
- Does it, given sample inputs, produce the expected output?

Or take a more subjective but still tractable example: is a particular search result on topic for the given query? Especially with some of the more recent, larger models, such a judgment can align nicely with what a human would decide.

However, using an LLM to judge an LLM-powered chat agent, ensuring each of its messages meets a specific quality bar, is nearly intractable.

A lot of work is ongoing across the industry to make such evaluations more tractable. Still, even when the agent is constrained to specific types of conversations, these evaluations remain tough. I'm very curious, though, to see where this goes over the next few months and years!
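To make the "strict constraints" point concrete, here's a minimal sketch of a rubric-constrained relevance judge for the search example above. Everything in it is illustrative: `call_llm` is a hypothetical stand-in for whichever provider client you actually use, and the prompt wording is just one way to pin the model down.

```python
# A minimal sketch of a tightly constrained LLM judge for search relevance.
# `call_llm` is a hypothetical placeholder: wire it to your provider of
# choice (OpenAI, Anthropic, a local model, etc.).

JUDGE_PROMPT = """You are grading search relevance.
Query: {query}
Result title: {title}
Result snippet: {snippet}

Answer with exactly one word: on_topic or off_topic."""


def call_llm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to an LLM and return its reply."""
    raise NotImplementedError("replace with a real LLM client call")


def judge_result(query: str, title: str, snippet: str) -> bool:
    """Return True when the judge deems the result on topic.

    The strict two-token output contract is what keeps the judgment
    objective and trivially checkable; anything else is treated as a
    failed evaluation rather than silently coerced.
    """
    reply = call_llm(
        JUDGE_PROMPT.format(query=query, title=title, snippet=snippet)
    ).strip().lower()
    if reply not in ("on_topic", "off_topic"):
        raise ValueError(f"judge broke the output contract: {reply!r}")
    return reply == "on_topic"
```

The design choice worth noticing: forcing the model to pick between two tokens turns a fuzzy judgment into a checkable one. The moment you let the judge write free-form critique, you're back in the subjective territory where, as above, things go off the rails.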
Original post on LinkedIn