Discussion about this post

Neural Foundry

The chronometer analogy is brilliant. Most teams treat LLM judges like demo toys, optimizing for cherry-picked examples instead of real-world durability. The three-metric approach makes total sense because each captures a different failure mode that matters in production. I've seen systems with high Pearson r but low QWK, exactly that "directionally fine but catastrophically wrong sometimes" pattern. The recalibration loop you outlined is what separates toy prompts from production-grade systems. Updating the gold set when playbooks change seems obvious in hindsight, but most teams freeze their eval sets and wonder why drift happens.
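For readers who want to see that Pearson r vs. QWK divergence concretely, here is a minimal sketch (the labels are hypothetical, and it assumes scipy and scikit-learn are available) of computing both metrics for the same judge outputs against a human-labeled gold set:

```python
# Minimal sketch (hypothetical scores): comparing Pearson r and quadratic
# weighted kappa (QWK) for an LLM judge scored against a human gold set.
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Human gold labels on a 1-5 rubric (hypothetical).
gold =  [1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 2, 3, 4, 5, 1, 2, 3, 4, 5]

# Judge labels that track the overall trend but drift on some items (hypothetical).
judge = [1, 2, 3, 3, 3, 4, 4, 5, 4, 5, 4, 2, 3, 5, 5, 2, 2, 4, 4, 4]

# Pearson r measures linear association only; it is insensitive to a
# systematic offset (a judge that runs consistently one point high can
# still score near 1.0).
r, _ = pearsonr(gold, judge)

# QWK rewards exact agreement and penalizes disagreements by the squared
# rating distance, so occasional large misses drag it down sharply.
qwk = cohen_kappa_score(gold, judge, weights="quadratic")

print(f"Pearson r: {r:.3f}")
print(f"Quadratic weighted kappa: {qwk:.3f}")
```

Running both on the same judge outputs is what surfaces the "directionally fine but catastrophically wrong sometimes" pattern: correlation alone can look healthy while weighted agreement reveals the costly misses.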
