LLM-as-a-Judge Offline Evaluation for Ordinal Scores
Pearson r + QWK + ±1
For a long time, sailors could figure out latitude (north–south) by looking at the Sun and stars, but longitude (east–west) was a different story. Longitude required knowing the time difference between the local time on the ship (based on the Sun’s position) and the time at a fixed reference location (Greenwich).
The bottleneck was clocks. The best clocks of the era were pendulum clocks: great on land, terrible on a moving ship.
When a ship is tossed around by waves, the pendulum's motion gets corrupted. And when your clock is wrong, you don't just get the time wrong; you can end up navigating to the wrong place.
In 1735, everything changed: John Harrison built his first marine chronometer, a clock that could keep accurate time at sea. Mechanically, the key shift was moving from pendulum regulation to balance-and-spring regulation (the same core concept behind mechanical watches).

The story seems familiar in the context of AI today.
A lot of LLM systems behave like pendulum clocks: impressive in calm conditions, unreliable when the environment shifts. Small changes in phrasing, missing context, messy inputs, or edge-case scenarios can swing outputs from spot-on to wildly off.
If we want AI apps that people can trust, we need the equivalent of Harrison’s chronometer: a mechanism that works at sea, not just in the lab.
The chronometer wasn’t a nicer map; it was a repeatable standard. In AI, evals are that standard: a way to measure reliability across rough conditions, not just demo conditions.
Evaluations turn model behavior into something we can measure, compare, and improve. Instead of trusting a prompt because it “sounds good,” we run it against a fixed gold set, quantify performance, and track whether changes actually make the system more consistent.
In other words: evals make reliability an engineering loop.
Below, I'll show how I built that loop for the LLM-as-a-judge that produces ordinal coaching scores in my sales coach app (https://github.com/iamademar/llm-as-a-judge-sales-coach), and why I evaluate it with Pearson r, Quadratic Weighted Kappa, and ±1 accuracy.
The production scenario
The concrete scenario here is a sales coaching app I built recently.
It takes a sales call transcript and produces two outputs:
Coaching feedback for the conversation (what went well, what to improve, suggested next moves)
Ordinal 1–5 scores for each SPIN dimension (Situation, Problem, Implication, Need-Payoff)
Under the hood, that assessment is produced by a prompt-based LLM judge: we pass the transcript plus a structured rubric/template, and the model returns scores + rationale. Here's the demo of how it works in production:
[Demo: the judge scoring a sales call in the production app]
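To make the judge's contract concrete, here's a sketch of the kind of structured output it's asked to return. The field names and the use of Pydantic are illustrative assumptions, not the app's exact schema:
from pydantic import BaseModel, Field

class SPINScores(BaseModel):
    # Ordinal 1-5 score per SPIN dimension (illustrative field names).
    situation: int = Field(ge=1, le=5)
    problem: int = Field(ge=1, le=5)
    implication: int = Field(ge=1, le=5)
    need_payoff: int = Field(ge=1, le=5)

class JudgeOutput(BaseModel):
    scores: SPINScores
    rationale: str          # why each score was given
    coaching_feedback: str  # what went well, what to improve, suggested next moves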
This kind of workflow shows up well beyond sales coaching, too: any time you're using an LLM to grade, rank, extract, or decide, whether that's routing tickets, scoring leads, auditing calls, classifying documents, or prioritizing follow-ups.
And in all of those cases, the real product isn’t “LLM output.”
The product is consistent decisions.
The reliability problem
Prompt-based judging is powerful, but it’s also easy to believe your own demo.
A prompt can look great on a few handpicked examples and still drift badly when:
the transcript style changes,
the call gets messy,
the scenario is ambiguous,
the customer isn’t cooperative,
or the rep switches strategy mid-call.
This is where many LLM features die in production: not because the model is “bad,” but because reliability becomes unpredictable. When the output affects something important like coaching, performance reviews, or decision-making…
Unpredictable = Unshippable
So the real question isn’t “does the prompt work?”
It’s:
Does this prompt produce consistent, calibrated scores that match how a human coach would rate the same conversation?
And if the market shifts (new playbook, new ICP, new sales motion), can we recalibrate without guessing?
Because the implication of getting this wrong isn’t just a “wrong score.” It’s downstream damage:
managers stop trusting the tool and revert to manual reviews,
reps feel unfairly graded,
enablement focuses on the wrong behaviors,
and a promising AI feature gets rolled back.
The calibration loop
The fix is to treat the judge like an instrument that needs calibration.
We build a small gold set: a curated set of transcripts that clearly fit the SPIN framework, each labeled with “gold standard” SPIN scores (and ideally short notes explaining why).
Then we run the evaluator prompt over that same dataset and compare:
model scores vs. gold scores (per SPIN dimension)
aggregate performance across the dataset
failure cases (where it disagrees and why)
If the metrics show the model is close enough and the failure cases are understandable, we treat the prompt as calibrated.
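Here's a minimal sketch of what that calibration loop can look like in code. The gold-set format, the judge callable, and the reporting are assumptions for illustration; the repo's actual harness differs in the details:
import json
from statistics import mean

DIMENSIONS = ["situation", "problem", "implication", "need_payoff"]

def run_calibration_eval(gold_path, judge):
    # `judge` is any callable that takes a transcript string and returns a
    # dict of 1-5 scores keyed by SPIN dimension (hypothetical interface).
    with open(gold_path) as f:
        gold_set = json.load(f)  # list of {"transcript": ..., "scores": {...}}

    per_dim = {dim: {"gold": [], "pred": []} for dim in DIMENSIONS}
    failures = []

    for example in gold_set:
        predicted = judge(example["transcript"])
        for dim in DIMENSIONS:
            g, p = example["scores"][dim], predicted[dim]
            per_dim[dim]["gold"].append(g)
            per_dim[dim]["pred"].append(p)
            if abs(g - p) > 1:  # collect failure cases for manual review
                failures.append((example["transcript"][:80], dim, g, p))

    # Aggregate per dimension; swap in Pearson r / QWK / ±1 here (see below).
    for dim, series in per_dim.items():
        gap = mean(abs(g - p) for g, p in zip(series["gold"], series["pred"]))
        print(f"{dim}: mean absolute gap {gap:.2f} over {len(series['gold'])} examples")

    return per_dim, failures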
From there, every prompt change is tested the same way. That gives you two production-grade properties:
Confidence to ship: you can say “this version is better” with evidence
Regression detection: you catch degradations before users do
At this point, reliability becomes a repeatable loop.
Now we can talk about how to measure it correctly. The judge outputs 1–5 coaching scores, and those labels are ordinal: they're ordered, and how far off you are matters.
Why “accuracy” is misleading for 1–5 coaching scores
Exact-match accuracy treats these two predictions as equally wrong, even though they're intuitively very different:
Gold = 5, Pred = 4 ✅ (close enough for coaching)
Gold = 5, Pred = 1 ❌ (way off)
Both are simply “wrong” under accuracy.
But in rubric scoring, distance matters. That’s exactly the use case for weighted kappa (where disagreements are penalized based on severity), and why QWK shows up in ordinal scoring benchmarks like automated essay scoring.
So instead of one brittle metric, I intentionally use three complementary “views” of quality…
Why I use three metrics (not one)
Prompt-based scoring isn’t something you “get right” once. It’s something you calibrate, then keep calibrated.
And because these scores are ordinal (1–5), you want metrics that answer three different production questions:
Are we moving in the right direction when the gold label moves? (trend alignment)
Do we agree like a second rater would? (ordinal agreement)
Are we close enough that the coaching experience still feels right? (product tolerance)
That’s why I use:
Pearson r → checks trend alignment
Quadratic Weighted Kappa (QWK) → checks ordinal agreement and punishes big misses
±1 accuracy → checks “close enough” from a coaching / UX standpoint
Together, they give you a reliability signal that’s much harder to game than plain accuracy.
Metric 1: Pearson r (trend alignment)
In backend/app/services/evaluation_metrics.py, pearson_r(...) implements Pearson correlation with two practical “production-friendly” choices:
If there's only one data point, return 0.0
If either series has zero variance (constant predictions or constant gold labels), return 0.0
# Handle single element case
if n == 1:
    return 0.0
# Handle zero variance case (constant array)
if var_true == 0.0 or var_pred == 0.0:
    return 0.0
Source code: https://github.com/iamademar/llm-as-a-judge-sales-coach/blob/main/backend/app/services/evaluation_metrics.py
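Putting those guards together, a from-scratch version of the same function can look roughly like this (a sketch for illustration; the repo's pearson_r is the source of truth):
import math

def pearson_r(y_true, y_pred):
    n = len(y_true)

    # A single data point has no spread, so correlation is undefined; return 0.0.
    if n == 1:
        return 0.0

    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n

    # Sums of squared deviations (the n factors cancel in the final ratio).
    var_true = sum((yt - mean_true) ** 2 for yt in y_true)
    var_pred = sum((yp - mean_pred) ** 2 for yp in y_pred)

    # Constant gold labels or constant predictions: avoid dividing by zero.
    if var_true == 0.0 or var_pred == 0.0:
        return 0.0

    cov = sum((yt - mean_true) * (yp - mean_pred)
              for yt, yp in zip(y_true, y_pred))
    return cov / math.sqrt(var_true * var_pred)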
Why this is useful in production
This prevents eval runs from exploding on edge cases and makes failure modes obvious.
If your judge collapses into “everything is a 3,” Pearson r doesn’t politely pretend you’re doing fine — it drops toward 0.
What Pearson r catches
“When gold goes up, predictions go up too.”
Broad trend / rank alignment across transcripts
What Pearson r can miss
Calibration drift: you can be consistently off by +1 or −1 and still get decent correlation
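For example (illustrated with numpy, not repo code), predictions that run exactly one point above the gold labels still correlate perfectly:
import numpy as np

gold = np.array([1, 2, 3, 4])
pred = np.array([2, 3, 4, 5])   # consistently one point too generous

print(np.corrcoef(gold, pred)[0, 1])   # 1.0: perfect correlation despite the +1 bias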
Translation: Pearson r tells you if the judge has the right shape — not whether it’s well calibrated.
Metric 2: Quadratic Weighted Kappa (ordinal agreement)
In the same file, quadratic_weighted_kappa(...) treats the LLM judge like a second rater.
Two details matter here.
1) Quadratic penalty by distance
The weight matrix is explicitly quadratic:
weight_mat[i][j] = ((i - j) ** 2) / ((n_labels - 1) ** 2)
So:
off by 1 → small penalty
off by 2+ → much larger penalty
That matches how humans interpret rubric scores: a “4 vs 5” disagreement is a rounding error; “1 vs 5” is a broken judge.
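To make that concrete: with 1–5 labels the denominator is (5 − 1)² = 16, so plugging label values into the same formula gives penalties like these (plain arithmetic, not repo code):
def penalty(i, j, n_labels=5):
    # Quadratic weight for a disagreement between labels i and j.
    return ((i - j) ** 2) / ((n_labels - 1) ** 2)

print(penalty(5, 4))   # 0.0625: "4 vs 5" barely registers
print(penalty(5, 3))   # 0.25
print(penalty(5, 1))   # 1.0: the maximum penalty, a broken judge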
2) Sensible handling of degenerate cases
If there’s no variation in labels (n_labels == 1), the function returns 1.0:
if n_labels == 1:
    return 1.0
This avoids the annoying situation where your eval run fails just because a small slice of data happens to be uniform.
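For readers who want the whole computation in one place, here's a sketch of a standard quadratic weighted kappa with those same two details, the quadratic weight matrix and the single-label guard (a from-scratch illustration, not a copy of the repo's function):
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred):
    # Label space seen by either rater (gold or judge).
    labels = sorted(set(y_true) | set(y_pred))
    n_labels = len(labels)

    # Degenerate case: a uniform slice of data counts as perfect agreement.
    if n_labels == 1:
        return 1.0

    index = {label: i for i, label in enumerate(labels)}

    # Observed agreement: confusion matrix between gold and judge.
    observed = np.zeros((n_labels, n_labels))
    for yt, yp in zip(y_true, y_pred):
        observed[index[yt], index[yp]] += 1

    # Expected agreement under chance: outer product of the two marginals,
    # scaled so it sums to the same total count as `observed`.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / len(y_true)

    # Quadratic weights: penalty grows with the squared distance between labels.
    weights = np.array([
        [((i - j) ** 2) / ((n_labels - 1) ** 2) for j in range(n_labels)]
        for i in range(n_labels)
    ])

    # 1 minus (weighted observed disagreement / weighted expected disagreement).
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()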
What QWK catches
“We agree like raters would.”
Big misses are heavily penalized — exactly what you want for coaching credibility
What QWK helps prevent (this is the production killer)
A judge that’s “directionally okay” but still frequently off by 2–4 points
The kind of disagreement that makes reps say: “This is random.”
Translation: QWK tells you whether the judge is trustworthy as a rater, not just correlated.
Metric 3: ±1 accuracy (coaching tolerance)
The repo’s plus_minus_one_accuracy(...) is a pragmatic product metric:
n_correct = sum(
    1 for yt, yp in zip(y_true, y_pred)
    if abs(yt - yp) <= 1
)
return n_correct / len(y_true)
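As a quick, made-up illustration of what that returns (the import path is assumed from the repo layout):
from app.services.evaluation_metrics import plus_minus_one_accuracy  # hypothetical import path

gold = [5, 3, 2, 4]
pred = [4, 3, 5, 4]   # three scores within one point, one badly off

print(plus_minus_one_accuracy(gold, pred))   # 0.75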
This answers the question your users actually feel:
How often is the score close enough that the coaching still makes sense?
Why it matters
In coaching workflows, “4 vs 5” usually yields similar guidance. Even if the exact number differs, the actionable feedback is often basically the same.
So ±1 accuracy is the metric that maps closest to perceived reliability.
What ±1 accuracy catches
“Close enough” performance that preserves a good UX
Whether prompt changes are making the judge more usable
What it helps prevent
Quiet drift that technically looks “fine” on correlation but feels worse to users
The slow slide into “this tool is kind of flaky”
Translation: ±1 is your product guardrail metric.
How to interpret the three together
Here’s the simplest way I’ve found to read these in practice:
Pearson r high, QWK low → good direction, poor agreement (calibration/labeling mismatch)
QWK high, ±1 low → rare but catastrophic misses are happening (bad failure modes)
±1 high, QWK mediocre → UX feels okay, but big misses exist (risk of trust collapse)
All three improving together → your judge is getting reliably better, not just “better on paper”
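Here's a small sketch of how I read them together in one place, assuming the three functions are importable from the repo's evaluation_metrics module (the import path and thresholds are illustrative, not the repo's):
from app.services.evaluation_metrics import (  # hypothetical import path
    pearson_r,
    quadratic_weighted_kappa,
    plus_minus_one_accuracy,
)

def reliability_report(gold, pred):
    # Three complementary views over the same gold/prediction pair.
    report = {
        "pearson_r": pearson_r(gold, pred),
        "qwk": quadratic_weighted_kappa(gold, pred),
        "plus_minus_one": plus_minus_one_accuracy(gold, pred),
    }

    # Illustrative reading of the combinations described above.
    if report["pearson_r"] >= 0.7 and report["qwk"] < 0.4:
        report["flag"] = "right direction, poor agreement: check calibration / labeling"
    elif report["plus_minus_one"] >= 0.9 and report["qwk"] < 0.6:
        report["flag"] = "feels fine in the product, but big misses exist"
    else:
        report["flag"] = "ok"
    return report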
That’s the real point of mixing metrics: you’re not optimizing a number, you’re making a judge that stays reliable when transcripts get messy and the world changes.
Bringing it back to the chronometer
Harrison didn't make the ocean calmer. He built an instrument that stayed accurate anyway, and navigators trusted it because it held up under real conditions.
Offline evaluation is our chronometer: the mechanism that keeps an LLM judge reliable.
And when the sales team shifts direction in response to the market, you don’t throw the system away. You update the gold set to reflect the new playbook and real examples, re-run the evals, and keep the judge aligned.
You don’t control the sea.
You recalibrate the instrument so it still tells the truth when things get rough.


