LLM-as-a-Judge for Sales Coaching
Designing an Auditable Evaluator Pipeline
LLM-as-a-Judge has matured!
We’re finally past the phase where “LLMs can sort of grade things”!
Today’s models are increasingly capable of acting as automated evaluators, and there’s now a growing body of research studying when (and how) LLM judgments align with human evaluation:
Zheng et al. (2023) — Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Liu et al. (2023) — G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Kim et al. (2024) — Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Chen et al. (2024) — Humans or LLMs as the Judge? A Study on Judgement Bias
Huang et al. (2025) — An Empirical Study of LLM-as-a-Judge for LLM Evaluation
What better way to test this out than in sales coaching? Sales coaching is already a “judge” workflow.
Here’s a real-world scenario:
A sales coach listens to a call recording, then fills out a scorecard. They grade the rep against a playbook, say SPIN (Situation, Problem, Implication, Need-Payoff) or your internal rubric, and add notes like:
“They asked too many surface-level questions.”
“They didn’t push on impact.”
“Great discovery, but weak closing.”
The problem is that judgment is expensive, time-consuming, and possibly inconsistent. Two coaches can score the same call differently. Depending on mood, the same coach might score it differently a week later. And when a sales rep challenges the score and asks a question like “Why did I get a 2 on Implication?”, the explanation is usually buried in messy notes.
That’s why sales coaching is a perfect proving ground for LLM-as-a-Judge: it forces you to treat evaluation as a system, not a one-off prompt.
If the model is going to “judge” a call, it needs to do what good coaches do:
apply a consistent rubric,
produce structured scores, and
justify those scores with evidence from the call (transcript).
An AI coaching system that can't explain why a score was given, or reproduce that score the next week, isn't an effective AI system.
In this post, I’ll walk through a production-grade LLM-as-a-Judge architecture for sales coaching. I’ll show how transcripts are evaluated, validated, stored, and continuously calibrated, using my open-source reference implementation: https://github.com/iamademar/llm-as-a-judge-sales-coach/
A demo app is deployed on Azure.
The Design: A Four-Stage Pipeline
The assessment pipeline follows this sequence:
Transcript → Judge → Validation → Persistence
Stage 1: Transcript Ingestion
Every assessment starts with a raw conversation transcript. The frontend captures or uploads conversation data, which flows through our FastAPI backend:
# 1) Persist transcript first
transcript = Transcript(
representative_id=req.metadata.get("representative_id"),
buyer_id=req.metadata.get("buyer_id"),
call_metadata=req.metadata,
transcript=req.transcript,
)
db.add(transcript)
db.flush() # Get ID before scoring
Why persist before scoring? This ensures that if the LLM call fails, the data is not lost. The transcript is the source of truth, and assessments are derived artifacts.
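To make that failure isolation concrete, here's a rough sketch of what the assess endpoint can look like. This is not a copy of the repo's assess.py: the route path, the AssessRequest schema, the get_db dependency, req.organization_id, and the error handling are all illustrative assumptions; only Transcript, Assessment, and score_transcript come from the code shown in this post.
# Illustrative sketch only -- the real handler lives in backend/app/routers/assess.py
from fastapi import APIRouter, Depends, HTTPException
from sqlalchemy.orm import Session

router = APIRouter()

@router.post("/assess")
def assess(req: AssessRequest, db: Session = Depends(get_db)):
    # 1) Persist the transcript first so a failed LLM call never loses data
    transcript = Transcript(
        representative_id=req.metadata.get("representative_id"),
        buyer_id=req.metadata.get("buyer_id"),
        call_metadata=req.metadata,
        transcript=req.transcript,
    )
    db.add(transcript)
    db.flush()  # transcript.id is now available

    try:
        # 2) Score the call; bad LLM output raises ValueError/AssertionError
        data, model_name, prompt_version = score_transcript(
            req.transcript, organization_id=req.organization_id, db=db
        )
    except (ValueError, AssertionError) as exc:
        db.commit()  # keep the transcript even though scoring failed
        raise HTTPException(status_code=502, detail=f"Scoring failed: {exc}")

    # 3) Persist the derived assessment alongside its metadata
    assessment = Assessment(
        transcript_id=transcript.id,
        scores=data["scores"],
        coaching=data["coaching"],
        model_name=model_name,
        prompt_version=prompt_version,
    )
    db.add(assessment)
    db.commit()
    return {"transcript_id": transcript.id, "assessment_id": assessment.id, **data}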
Stage 2: The Judge
The scoring service orchestrates the LLM evaluation. Here's the complete flow from scorer.py:
def score_transcript(
transcript: str,
organization_id: Optional[uuid.UUID] = None,
db: Optional[Session] = None,
) -> tuple[dict, str, str]:
"""
Score a sales transcript using SPIN framework via LLM.
This is the main scoring pipeline that:
1. Fetches organization's active prompt template from database
2. Builds calibrated prompt (system + user) using the template
3. Calls LLM to generate assessment
4. Parses and validates JSON response
5. Validates score ranges and required keys
6. Returns assessment data with metadata
Args:
transcript: Sales conversation transcript with speaker tags (Rep:/Buyer:)
organization_id: Organization UUID for fetching prompt template and LLM credentials
db: Database session for template and credential lookup
Returns:
Tuple of (assessment_data, model_name, prompt_version) where:
- assessment_data: dict with "scores" and "coaching" keys
- model_name: LLM model identifier used
- prompt_version: Prompt template version identifier
Raises:
ValueError: If transcript is empty, no active template found, or JSON parsing fails
AssertionError: If scores are out of range [1, 5] or required keys missing
Examples:
>>> os.environ["MOCK_LLM"] = "true"
>>> # Note: Requires organization with active template in test DB
>>> data, model, version = score_transcript("Rep: Hi\\nBuyer: Hello", org_id, db)
>>> "scores" in data and "coaching" in data
True
"""
# Step 1: Fetch organization's active prompt template
if not organization_id or not db:
raise ValueError("organization_id and db are required for scoring")
template = template_crud.get_active_for_org(db, organization_id)
if not template:
raise ValueError(
"No active prompt template found for organization. "
"Please ensure a default template was created during setup."
)
# Step 2: Build calibrated prompt using org's template
system, user = build_prompt(
transcript,
system_prompt=template.system_prompt,
user_template=template.user_template,
)
prompt_version = template.version
# Step 3: Call LLM (mock or real provider)
raw_json = call_json(
system,
user,
model=MODEL_NAME,
temperature=0.0,
response_format_json=True,
organization_id=organization_id,
db=db,
)
# Step 4: Parse JSON with guardrails
data = parse_json_strict(raw_json)
# Step 5: Validate required keys
assert "scores" in data, "Missing required key 'scores' in LLM response"
assert "coaching" in data, "Missing required key 'coaching' in LLM response"
# Step 6: Validate score bounds (all must be 1-5)
scores = data["scores"]
required_score_keys = [
"situation", "problem", "implication", "need_payoff",
"flow", "tone", "engagement"
]
for key in required_score_keys:
assert key in scores, f"Missing required score key '{key}'"
score_value = scores[key]
assert isinstance(score_value, int), \
f"Score '{key}' must be integer, got {type(score_value).__name__}"
assert 1 <= score_value <= 5, \
f"Score '{key}' must be in range [1, 5], got {score_value}"
# Step 7: Validate coaching structure
coaching = data["coaching"]
required_coaching_keys = ["summary", "wins", "gaps", "next_actions"]
for key in required_coaching_keys:
assert key in coaching, f"Missing required coaching key '{key}'"
# Return assessment with metadata for tracking
return data, MODEL_NAME, prompt_version
Key design decision: the assessment is returned together with its metadata (model_name, prompt_version). This enables reproducibility and A/B testing of different prompt strategies.
Stage 3: Validation with Guardrails
LLMs rarely return clean JSON on the first try. The guardrails handle the common failure modes:
def parse_json_strict(raw: str) -> dict:
"""
Parse JSON with fallback strategies for LLM-generated content.
Attempts multiple parsing strategies:
1. Direct json.loads() on raw input
2. Strip code fences (``` or ```json) and whitespace, then retry
3. Raise ValueError with debugging context if all strategies fail
Args:
raw: Raw string that may contain JSON (possibly wrapped in code fences)
Returns:
Parsed JSON as a dictionary
Raises:
ValueError: If JSON cannot be parsed after all strategies, includes
first ~120 chars of input for debugging
Examples:
>>> parse_json_strict('{"key": "value"}')
{'key': 'value'}
>>> parse_json_strict('```json\\n{"key": "value"}\\n```')
{'key': 'value'}
>>> parse_json_strict('```\\n{"key": "value"}\\n```')
{'key': 'value'}
"""
# Strategy 1: Try parsing as-is
try:
return json.loads(raw)
except (json.JSONDecodeError, TypeError):
pass
# Strategy 2: Strip code fences and whitespace
# Remove leading/trailing whitespace and newlines
cleaned = raw.strip()
# Pattern to match code fences with optional language specifier
# Matches: ```json ... ``` or ``` ... ```
fence_pattern = r'^```(?:json)?\s*\n?(.*?)\n?```$'
match = re.match(fence_pattern, cleaned, re.DOTALL)
if match:
# Extract content between fences
cleaned = match.group(1).strip()
else:
# Even without full fence pattern, strip any leading/trailing backticks
cleaned = cleaned.strip('`').strip()
# Try parsing cleaned version
try:
return json.loads(cleaned)
except (json.JSONDecodeError, TypeError) as e:
# Strategy 3: Raise with context
preview = raw[:120] if len(raw) > 120 else raw
raise ValueError(
f"Failed to parse JSON after cleanup strategies. "
f"Error: {str(e)}. "
f"Input preview (first 120 chars): {preview!r}"
)
This parser handles:
Markdown wrapping: ```json {…}``` code fences
Extra whitespace: Leading/trailing newlines and spaces
Debugging failures: Provides the first 120 chars of raw output when all strategies fail (see the example below)
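The doctests above cover the happy paths. For completeness, here's a quick hypothetical example of Strategy 3 firing when the model returns prose instead of JSON, which is roughly what you'd see while triaging bad outputs:
# Hypothetical failure case: the model answered in prose, not JSON
try:
    parse_json_strict("Here's my assessment: the rep did well on discovery.")
except ValueError as err:
    print(err)
    # Roughly: Failed to parse JSON after cleanup strategies.
    # Error: Expecting value: line 1 column 1 (char 0).
    # Input preview (first 120 chars): "Here's my assessment: ..."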
Stage 4: Persistence
Once validated, the assessment is persisted with full metadata:
class Assessment(Base):
"""
SPIN framework assessment model.
Stores LLM-generated scores and coaching feedback for a transcript.
Tracks model name and prompt version for reproducibility and evaluation.
"""
__tablename__ = "assessments"
id = Column(Integer, primary_key=True, index=True, autoincrement=True)
transcript_id = Column(
Integer,
ForeignKey("transcripts.id", ondelete="CASCADE"),
nullable=False,
index=True,
comment="Reference to parent transcript"
)
scores = Column(
JSON,
nullable=False,
comment="SPIN scores: {situation, problem, implication, need_payoff, flow, tone, engagement}"
)
coaching = Column(
JSON,
nullable=False,
comment="Coaching feedback: {summary, wins, gaps, next_actions}"
)
model_name = Column(
String,
nullable=False,
index=True,
comment="LLM model identifier (e.g., 'gpt-4o-mini', 'claude-3-sonnet')"
)
prompt_version = Column(
String,
nullable=False,
index=True,
comment="Prompt template version (e.g., 'spin_v1', 'spin_v2')"
)
latency_ms = Column(
Integer,
nullable=True,
comment="LLM call latency in milliseconds"
)
created_at = Column(
DateTime(timezone=True),
server_default=func.now(),
nullable=False,
comment="Timestamp when assessment was created"
)
# Relationship to transcript
transcript_ref = relationship("Transcript", back_populates="assessments")
def __repr__(self):
return (
f"<Assessment(id={self.id}, transcript_id={self.transcript_id}, "
f"model={self.model_name!r}, prompt_version={self.prompt_version!r})>"
)
Why store model_name and prompt_version? This enables you to:
Compare different LLM providers (GPT-4 vs Claude vs Gemini)
Track prompt evolution and measure quality improvements
Debug regressions when scores drift
Run historical analyses on assessment quality
Note: You can see this in the demo, where each assessment is labelled with the LLM used. A small query sketch follows.
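For example, comparing how a single scoring dimension shifts across models and prompt versions can be a small aggregation over the assessments table. This is a hedged sketch rather than a helper from the repo: it assumes a SQLAlchemy session db and the Assessment model above, and does the aggregation in Python because scores is a JSON column.
from collections import defaultdict
from statistics import mean

def average_score_by_version(db, dimension: str = "implication") -> dict[str, float]:
    """Average one SPIN dimension per (model_name, prompt_version) pair."""
    buckets = defaultdict(list)
    rows = db.query(
        Assessment.model_name, Assessment.prompt_version, Assessment.scores
    ).all()
    for model_name, prompt_version, scores in rows:
        buckets[(model_name, prompt_version)].append(scores[dimension])
    return {
        f"{model} / {version}": round(mean(values), 2)
        for (model, version), values in buckets.items()
    }

# e.g. {"gpt-4o-mini / spin_v1": 3.1, "gpt-4o-mini / spin_v2": 3.4}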
Reducing Drift with Rubrics and Schemas
What keeps the system reliable are the explicit agreements between the LLM and our application about output format and scoring criteria. We enforce these both in the prompt and through Pydantic validators.
The Rubric: Calibrated Scoring Criteria
Every LLM evaluation uses a detailed rubric that defines each score on a 1-5 scale. Here’s an excerpt from the default prompt template used on the app:
SCORING RUBRIC (1-5 scale):
**situation** (1-5): Quality of situation questions
- 1: No situation questions; jumps to pitch
- 2: Minimal context gathering; superficial questions
- 3: Adequate situation questions covering basic context
- 4: Good situation questions establishing clear current state
- 5: Excellent situation questions; thorough understanding of buyer's environment
**problem** (1-5): Quality of problem questions
- 1: No problem identification; ignores pain points
- 2: Weak problem exploration; misses key issues
- 3: Identifies some problems but lacks depth
- 4: Good problem identification with clear pain points
- 5: Exceptional problem discovery; uncovers hidden issues
**implication** (1-5): Quality of implication questions
- 1: No exploration of consequences; stays surface-level
- 2: Minimal urgency building; weak consequence exploration
- 3: Some implication questions but lacks impact
- 4: Good implication development building urgency
- 5: Outstanding implication questions creating compelling urgency
**need_payoff** (1-5): Quality of need-payoff questions
- 1: No connection between solution and buyer value
- 2: Weak value proposition; generic benefits
- 3: Adequate need-payoff with some value connection
- 4: Strong need-payoff linking solution to specific pains
- 5: Exceptional need-payoff; buyer articulates own value
**flow** (1-5): Adherence to SPIN sequence (S→P→I→N)
- 1: Random questioning; no discernible structure
- 2: Poor flow; jumps between stages inconsistently
- 3: Follows SPIN loosely; some stage mixing
- 4: Good SPIN sequence with clear progression
- 5: Excellent SPIN flow; natural and purposeful transitions
**tone** (1-5): Professional, empathetic, confident, adaptive communication
- 1: Pitchy, monologue-style; ignores buyer cues
- 2: Inconsistent tone; occasional empathy gaps
- 3: Mixed empathy and clarity; adequate professionalism
- 4: Strong tone; professional, warm, and responsive
- 5: Exceptional tone; adaptive, empathetic, confident, concise
**engagement** (1-5): Active listening, reflection, buyer talk-time
- 1: Dominates conversation; no active listening
- 2: Limited listening; minimal buyer participation
- 3: Adequate engagement; balanced talk-time
- 4: Good engagement; actively listens and reflects
- 5: Outstanding engagement; buyer-led insights and high talk-time
Note: This prompt can be tweaked, so when the sales team learns better ways to approach sales conversations, the "judgement" can be recalibrated.
This rubric serves multiple purposes:
Anchors the LLM’s judgment to specific observable behaviors
Reduces prompt sensitivity by explicitly defining what each score means
Enables inter-rater reliability when comparing LLM vs human evaluations
Documents the evaluation criteria for stakeholders and audits
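The rubric (and the JSON schema shown next) live inside the stored user template, so build_prompt only has to drop the transcript into it. The repo's implementation isn't reproduced in this post, but based on the call site in scorer.py and the {transcript} placeholder convention noted on the PromptTemplate model, a minimal version might look like this:
# Minimal sketch of build_prompt under the {transcript} placeholder assumption;
# the real implementation lives in the repository's prompt module.
def build_prompt(
    transcript: str,
    system_prompt: str,
    user_template: str,
) -> tuple[str, str]:
    if not transcript or not transcript.strip():
        raise ValueError("Transcript must not be empty")
    # The template already carries the rubric and JSON schema;
    # only the transcript varies from call to call.
    return system_prompt, user_template.format(transcript=transcript)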
The Schema: Enforcing Structure
Alongside the rubric, we embed a JSON schema directly in the prompt:
JSON SCHEMA (your response must match this exactly):
{{
"type": "object",
"properties": {{
"scores": {{
"type": "object",
"properties": {{
"situation": {{"type": "integer", "minimum": 1, "maximum": 5}},
"problem": {{"type": "integer", "minimum": 1, "maximum": 5}},
"implication": {{"type": "integer", "minimum": 1, "maximum": 5}},
"need_payoff": {{"type": "integer", "minimum": 1, "maximum": 5}},
"flow": {{"type": "integer", "minimum": 1, "maximum": 5}},
"tone": {{"type": "integer", "minimum": 1, "maximum": 5}},
"engagement": {{"type": "integer", "minimum": 1, "maximum": 5}}
}},
"required": ["situation", "problem", "implication", "need_payoff", "flow", "tone", "engagement"]
}},
"coaching": {{
"type": "object",
"properties": {{
"summary": {{"type": "string"}},
"wins": {{"type": "array", "items": {{"type": "string"}}}},
"gaps": {{"type": "array", "items": {{"type": "string"}}}},
"next_actions": {{"type": "array", "items": {{"type": "string"}}}}
}},
"required": ["summary", "wins", "gaps", "next_actions"]
}}
}},
"required": ["scores", "coaching"]
}}
We also reinforce this with Pydantic models on the application side:
class AssessmentScores(BaseModel):
"""
SPIN scoring model with additional conversation quality metrics.
All scores must be integers in the range [1, 5] where:
- 1 = Poor
- 2 = Below Average
- 3 = Average
- 4 = Good
- 5 = Excellent
"""
situation: int = Field(..., ge=1, le=5, description="Quality of situation questions")
problem: int = Field(..., ge=1, le=5, description="Quality of problem questions")
implication: int = Field(..., ge=1, le=5, description="Quality of implication questions")
need_payoff: int = Field(..., ge=1, le=5, description="Quality of need-payoff questions")
flow: int = Field(..., ge=1, le=5, description="Overall conversation flow")
tone: int = Field(..., ge=1, le=5, description="Tone and professionalism")
engagement: int = Field(..., ge=1, le=5, description="Customer engagement level")
@field_validator(
"situation", "problem", "implication", "need_payoff",
"flow", "tone", "engagement",
mode="before"
)
@classmethod
def validate_score_range(cls, v, info):
"""Ensure all scores are integers in valid range [1, 5]"""
if not isinstance(v, int):
raise ValueError(f"{info.field_name} must be an integer, got {type(v).__name__}")
if v < 1 or v > 5:
raise ValueError(f"{info.field_name} must be between 1 and 5, got {v}")
return v
Why this dual approach?
The prompt-embedded schema guides the LLM’s output generation
The Pydantic validators catch any deviations before persistence
Together, they form a contract that dramatically reduces drift; a short usage sketch follows
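As a usage sketch (not repo code), this is roughly how the application-side half of the contract behaves with the AssessmentScores model above:
from pydantic import ValidationError

# Well-formed scores pass validation
AssessmentScores(
    situation=4, problem=3, implication=2, need_payoff=3,
    flow=4, tone=5, engagement=4,
)

# Out-of-range values are rejected before anything reaches the database
try:
    AssessmentScores.model_validate({
        "situation": 6, "problem": 3, "implication": 2, "need_payoff": 3,
        "flow": 4, "tone": 5, "engagement": 4,
    })
except ValidationError as err:
    print(err)  # reports: situation must be between 1 and 5, got 6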
Failure Modes: What We Learned the Hard Way
Building this system revealed three major categories of failure. I've listed each below, along with its solution.
1. Inconsistency: Score Drift Over Time
Problem: Early versions showed score inflation—the LLM would gradually shift toward higher scores without calibration.
Solution:
Explicit rubric anchoring (see above)
Temperature set to 0.0 for deterministic outputs
Tracking prompt_version to detect when drift correlates with prompt changes
# Deterministic LLM calls
raw_json = call_json(
system,
user,
model=MODEL_NAME,
temperature=0.0, # No randomness
response_format_json=True,
organization_id=organization_id,
db=db,
)
2. Verbosity: When LLMs Over-Explain
Problem: LLMs would add commentary like “Here’s my assessment:” or wrap JSON in markdown, breaking parsers.
Solution: Multi-strategy parsing with guardrails (see Stage 3 above) and explicit instructions:
# Immutable system prompt - instructs LLM behavior
SYSTEM = """You are a senior sales coach specializing in the SPIN (Situation, Problem, Implication, Need-Payoff) selling methodology.
Your task is to evaluate sales conversations and provide scoring and coaching feedback.
CRITICAL INSTRUCTIONS:
- Return STRICT JSON that exactly matches the provided JSON Schema
- Do NOT include any extra keys beyond those specified in the schema
- Do NOT wrap your response in markdown code blocks
- Ensure all scores are integers between 1 and 5 (inclusive)
- Base your assessment on evidence from the conversation transcript"""
The system prompt is immutable and enforces strict JSON mode.
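Strict JSON mode comes from the response_format_json flag seen in the scoring call. The repo's call_json also handles multiple providers and per-organization credentials, which isn't shown here; as a sketch of what the OpenAI path might look like (the wrapper name and the omission of provider routing are assumptions), it can be as simple as:
# Sketch of an OpenAI-only call_json-style wrapper; the real client in
# backend/app/services/llm_client.py also supports other providers,
# per-organization credentials, and a MOCK_LLM mode.
from openai import OpenAI

def call_json_openai(system: str, user: str, model: str, temperature: float = 0.0) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        response_format={"type": "json_object"},  # native strict JSON mode
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response.choices[0].message.content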
3. Prompt Sensitivity: Small Changes, Big Impact
Problem: Tweaking prompt wording would cause score distributions to shift unexpectedly.
Solution:
Store prompts as versioned database records (not hardcoded strings)
Track prompt_version on every assessment
Run evaluation datasets before promoting new prompt versions
class PromptTemplate(Base):
"""
Prompt template for SPIN assessments.
Each organization has at least one template (v0 default created automatically).
Only one template per organization can be active at a time.
"""
__tablename__ = "prompt_templates"
id = Column(
UUID(),
primary_key=True,
default=uuid.uuid4,
index=True,
comment="Unique template identifier",
)
organization_id = Column(
UUID(),
ForeignKey("organizations.id", ondelete="CASCADE"),
nullable=False,
index=True,
comment="Organization this template belongs to",
)
name = Column(
String(100),
nullable=False,
comment="Human-readable template name",
)
version = Column(
String(20),
nullable=False,
default="v0",
comment="Version identifier (e.g., 'v0', 'v1', 'custom_v2')",
)
system_prompt = Column(
Text,
nullable=False,
comment="System prompt defining LLM behavior",
)
user_template = Column(
Text,
nullable=False,
comment="User prompt template (must contain {transcript} placeholder)",
)
is_active = Column(
Boolean,
default=False,
server_default="false",
nullable=False,
comment="Only one template per org can be active",
)
created_at = Column(
DateTime(timezone=True),
server_default=func.now(),
nullable=False,
comment="Template creation timestamp",
)
updated_at = Column(
DateTime(timezone=True),
server_default=func.now(),
onupdate=func.now(),
nullable=False,
comment="Template last update timestamp",
)
# Relationships
organization = relationship("Organization", back_populates="prompt_templates")
evaluation_runs = relationship(
"EvaluationRun",
back_populates="prompt_template",
cascade="all, delete-orphan",
order_by="desc(EvaluationRun.created_at)",
)
@property
def latest_evaluation(self):
"""Get the most recent evaluation run for this template."""
return self.evaluation_runs[0] if self.evaluation_runs else None
@property
def best_qwk_score(self):
"""Get the best QWK score across all evaluations."""
if not self.evaluation_runs:
return None
valid_runs = [run.macro_qwk for run in self.evaluation_runs if run.macro_qwk is not None]
return max(valid_runs) if valid_runs else None
def __repr__(self):
return (
f"<PromptTemplate(id={self.id}, "
f"org={self.organization_id}, "
f"name={self.name!r}, "
f"version={self.version}, "
f"is_active={self.is_active})>"
)
This design enables A/B testing of prompt variants, the ability to roll back if a new prompt degrades quality, and historical analysis of how prompts impact assessment quality.
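The macro_qwk referenced above is the macro-averaged quadratic weighted kappa between the LLM's scores and human labels on a golden set. The repo's evaluation code isn't shown in this post; as a hedged sketch of the metric itself, assuming scikit-learn and a list of human-labelled calls, it can be computed like this:
# Sketch of macro QWK for a candidate prompt version against human labels.
# Assumes `golden` holds human scores and `llm_scores` holds the judge's
# scores for the same calls, in the same order.
from sklearn.metrics import cohen_kappa_score

DIMENSIONS = [
    "situation", "problem", "implication", "need_payoff",
    "flow", "tone", "engagement",
]

def macro_qwk(golden: list[dict], llm_scores: list[dict]) -> float:
    """Average quadratic weighted kappa across all SPIN dimensions."""
    kappas = []
    for dim in DIMENSIONS:
        human = [call["scores"][dim] for call in golden]
        model = [scores[dim] for scores in llm_scores]
        kappas.append(
            cohen_kappa_score(human, model, weights="quadratic", labels=[1, 2, 3, 4, 5])
        )
    return sum(kappas) / len(kappas)

# Promote a new prompt version only if its macro QWK doesn't regress
# against the template's best_qwk_score.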
System Architecture
If you're looking at the code and need a map of how to read it, here's the complete sequence diagram showing how a transcript flows through the system:
┌──────────┐
│ Frontend │
│ (Next.js)│
└────┬─────┘
│ POST /assess
│ { transcript, metadata }
▼
┌─────────────────────┐
│ FastAPI Router │
│ (assess.py) │
└────┬────────────────┘
│ 1) Persist transcript
▼
┌─────────────────────┐
│ PostgreSQL │
│ transcripts table │
└────┬────────────────┘
│ transcript_id
▼
┌─────────────────────┐
│ Scoring Service │
│ (scorer.py) │
└────┬────────────────┘
│ 2) Fetch active prompt template
▼
┌─────────────────────┐
│ PostgreSQL │
│ prompt_templates │
└────┬────────────────┘
│ system_prompt, user_template
▼
┌─────────────────────┐
│ Prompt Builder │
│ (build_prompt) │
└────┬────────────────┘
│ 3) Build prompts with rubric + schema
▼
┌─────────────────────┐
│ LLM Client │
│ (llm_client.py) │
└────┬────────────────┘
│ 4) Call OpenAI/Anthropic/Google
▼
┌─────────────────────┐
│ External LLM API │
│ (GPT-4, Claude, etc)│
└────┬────────────────┘
│ raw JSON response
▼
┌─────────────────────┐
│ JSON Guardrails │
│ (parse_json_strict) │
└────┬────────────────┘
│ 5) Parse & strip code fences
▼
┌─────────────────────┐
│ Schema Validator │
│ (scorer.py) │
└────┬────────────────┘
│ 6) Validate scores [1-5], keys, types
▼
┌─────────────────────┐
│ PostgreSQL │
│ assessments table │
└────┬────────────────┘
│ assessment_id, scores, coaching
▼
┌─────────────────────┐
│ FastAPI Response │
│ (AssessResponse) │
└────┬────────────────┘
│ 7) Return to frontend
▼
┌──────────┐
│ Frontend │
│ Dashboard│
└──────────┘
Key Takeaways
If you’re building LLM-as-a-judge systems, these takeaways might be helpful:
Embed schemas in prompts — Don’t rely on post-hoc parsing alone
Version your prompts — Treat them like code, store in DB with metadata
Build guardrails at every layer — Prompt, parsing, application validation. Something like this:
# Layer 1: Prompt schema (in system prompt)
# Layer 2: Parsing guardrails
data = parse_json_strict(raw_json)
# Layer 3: Application validation
assert "scores" in data, "Missing required key 'scores'"
for key in required_score_keys:
    assert key in scores, f"Missing score key '{key}'"
    assert 1 <= scores[key] <= 5, f"Score out of range: {scores[key]}"
Note: For full details, see the code on GitHub.
Track metadata religiously — model_name, prompt_version, latency_ms enable debugging. Really helpful later on!
Evaluate before deploying — Use golden datasets with human labels for QA. In this scenario, these are conversations from sales reps that closed deals and are considered really good sales calls.
Set temperature to 0.0 — Determinism reduces drift for structured outputs
LLMs are powerful judges, but they need strong contracts and validation to become reliable production systems.
Code References
All code examples in this post are from the open-source codebase:
Scoring service: backend/app/services/scorer.py
JSON guardrails: backend/app/utils/json_guardrails.py
Assessment model: backend/app/models/assessment.py
Assessment router: backend/app/routers/assess.py
Prompt templates: backend/app/prompts/prompt_templates.py
LLM client: backend/app/services/llm_client.py
Prompt template model: backend/app/models/prompt_template.py
Assessment schemas: backend/app/schemas/assessment.py
The full source is available in the repository:
https://github.com/iamademar/llm-as-a-judge-sales-coach
Questions or comments? I'd love to hear about your experiences building LLM judge systems. What failure modes have you encountered? What guardrails work best for your use case?