Rubric Design: The Work Nobody Talks About
A bad rubric is worse than no rubric. A vague rubric produces vague scores, which produce vague feedback, which produces vague revisions. Here is what separates rubrics that work from rubrics that sound good in theory:
Each criterion must have a binary pass/fail condition, not a quality spectrum. "The writing is clear" is not a criterion. "Every sentence has a subject and a verb" is a criterion. "No sentence exceeds 35 words" is a criterion. If the grader has to make a judgment call about quality, the criterion is not specific enough.
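To make the "No sentence exceeds 35 words" criterion concrete, here is a minimal sketch of it as a deterministic check. The function names and the naive sentence splitter are illustrative assumptions, not part of any library; a real pipeline would likely want a proper sentence tokenizer.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def check_max_sentence_length(text: str, max_words: int = 35) -> dict:
    """Binary criterion: pass only if no sentence exceeds max_words."""
    offenders = [s for s in split_sentences(text) if len(s.split()) > max_words]
    return {
        "id": "max_sentence_length",
        "passed": not offenders,
        "offending_sentences": offenders,
    }
```

The result is pass or fail with the offending sentences attached, so there is no judgment call left for the grader and the feedback is already specific.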
Criteria must map to the writer's known failure modes. The best rubrics are built from postmortems. What went wrong in the last 10 output failures? Each failure mode becomes a rubric criterion. A rubric built this way will catch 80%+ of real errors because it was designed around real errors, not theoretical ones.
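One way to run that postmortem, sketched under assumptions: each past failure has been hand-tagged with a failure mode, and every mode seen more than once becomes a draft criterion. The tags, the `min_count` threshold, and the function name are all hypothetical.

```python
from collections import Counter

# Hypothetical postmortem log: one tag per observed output failure.
postmortem_tags = [
    "unsupported_claim", "missing_section", "unsupported_claim",
    "overlong_sentence", "unsupported_claim", "missing_section",
]


def criteria_from_postmortems(tags: list[str], min_count: int = 2) -> list[dict]:
    """Return one draft criterion per failure mode seen at least min_count times."""
    counts = Counter(tags)
    return [
        {"id": tag, "observed_failures": n, "fail_condition": "TODO: write a binary check"}
        for tag, n in counts.most_common()
        if n >= min_count
    ]


drafts = criteria_from_postmortems(postmortem_tags)
```

The drafts come out ordered by frequency, which gives a rough priority order for writing the actual fail conditions.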
Weight criteria by consequence, not difficulty. A missing citation is not as important as a factually wrong claim. A structural format error is not as important as a missing required section. Build your rubric with explicit weights or tiers, and make failing criteria cause escalation rather than revision when the consequence is high enough.
Here is a rubric template for a content artifact (blog post, briefing, report):
{
  "rubric": {
    "blocking_criteria": [
      {
        "id": "factual_accuracy",
        "description": "No claim is made that contradicts the source data provided",
        "fail_condition": "Any claim in the artifact contradicts the provided source data",
        "fail_action": "escalate"
      },
      {
        "id": "required_sections",
        "description": "All required sections are present and non-empty",
        "fail_condition": "Any required section is missing or contains only placeholder text",
        "fail_action": "escalate"
      }
    ],
    "revision_criteria": [
      {
        "id": "citation_coverage",
        "description": "80%+ of factual claims have inline citations",
        "fail_condition": "More than 20% of factual claims lack citations",
        "fail_action": "retry",
        "max_retries": 2
      },
      {
        "id": "specificity",
        "description": "Quantitative claims include specific numbers, not ranges",
        "fail_condition": "More than 2 quantitative claims use ranges instead of specific values",
        "fail_action": "retry",
        "max_retries": 2
      }
    ]
  }
}
Blocking criteria escalate to human review if failed — no amount of automated retry will fix a factually wrong artifact. Revision criteria trigger retry with feedback — these are fixable errors that the writer can address with the right guidance.
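The escalate-versus-retry split can be sketched as a small routing function over the template's two tiers. This assumes the grader reports failures as a list of criterion ids; `decide_action` and the trimmed `example_rubric` are hypothetical names for illustration.

```python
def decide_action(rubric: dict, failed_ids: list[str]) -> str:
    """Return 'pass', 'retry', or 'escalate' for a set of failed criterion ids."""
    if not failed_ids:
        return "pass"
    body = rubric["rubric"]
    actions = {
        c["id"]: c["fail_action"]
        for c in body["blocking_criteria"] + body["revision_criteria"]
    }
    # Unknown ids are treated as blocking: an assumption chosen to fail safe.
    if any(actions.get(cid, "escalate") == "escalate" for cid in failed_ids):
        return "escalate"
    return "retry"


# Trimmed version of the template above, keeping only the fields this sketch reads.
example_rubric = {
    "rubric": {
        "blocking_criteria": [{"id": "factual_accuracy", "fail_action": "escalate"}],
        "revision_criteria": [{"id": "citation_coverage", "fail_action": "retry"}],
    }
}
```

A single blocking failure wins over any number of fixable ones, which matches the idea that retrying cannot repair a factually wrong artifact.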
Implementation Variant 1: 50-Line Python (Simplest)
This implementation is deliberately minimal — no agent framework, no orchestration library, just the Anthropic Python SDK and standard control flow:
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder prompts (assumptions; the originals are not shown here). The grader
# prompt must request JSON only, with keys "passed", "failed_criteria" (each item
# carrying an "action" of "retry" or "escalate"), and "feedback".
WRITER_SYSTEM_PROMPT = "You write content artifacts to the user's specification."
GRADER_SYSTEM_PROMPT = "You grade an artifact against a rubric. Return JSON only."


def run_with_grading(writer_prompt: str, rubric: dict, max_retries: int = 3) -> dict:
    grader_feedback = None
    artifact = None
    attempts = 0
    while attempts < max_retries:
        # Writer turn: on a retry, replay the previous draft plus the grader's feedback
        writer_messages = [{"role": "user", "content": writer_prompt}]
        if grader_feedback:
            writer_messages.append({"role": "assistant", "content": artifact})
            writer_messages.append({
                "role": "user",
                "content": f"The grader found these issues:\n{grader_feedback}\nRevise.",
            })
        writer_response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            system=WRITER_SYSTEM_PROMPT,
            messages=writer_messages,
        )
        artifact = writer_response.content[0].text

        # Grader turn: a separate API call with no writer history
        rubric_json = json.dumps(rubric, indent=2)
        grader_response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=1024,
            system=GRADER_SYSTEM_PROMPT,
            messages=[{
                "role": "user",
                "content": f"Grade this artifact:\n{artifact}\nRubric:\n{rubric_json}",
            }],
        )
        # Raises json.JSONDecodeError if the grader returns anything but bare JSON
        grade = json.loads(grader_response.content[0].text)
        attempts += 1

        if grade["passed"]:
            return {"artifact": artifact, "grade": grade, "attempts": attempts}

        # Blocking failures go to a human; retrying cannot fix them
        if any(c["action"] == "escalate" for c in grade["failed_criteria"]):
            raise ValueError(f"Blocking criterion failed: {grade['failed_criteria']}")

        grader_feedback = grade["feedback"]

    raise RuntimeError(f"Max retries ({max_retries}) exceeded")
Notice the fresh client.messages.create call for the grader with no conversation history from the writer session. That is the entire implementation of context separation — a new API call. No framework required.
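One fragile spot in a loop like this is parsing the grader's reply: models sometimes wrap JSON in extra prose, and a missing key only surfaces later as a KeyError. Below is a hedged sketch of a stricter parser. The helper name `parse_grade` is hypothetical, and the required keys are simply the ones the loop above reads.

```python
import json

# Keys the grading loop reads from the grader's reply.
REQUIRED_KEYS = {"passed", "failed_criteria", "feedback"}


def parse_grade(raw: str) -> dict:
    """Extract and validate the grader's JSON object from raw model text."""
    # Take the span from the first "{" to the last "}" to tolerate surrounding prose.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("grader output contains no JSON object")
    grade = json.loads(raw[start : end + 1])
    missing = REQUIRED_KEYS - set(grade)
    if missing:
        raise ValueError(f"grader output missing keys: {sorted(missing)}")
    return grade
```

Swapping this in for the bare json.loads call would turn a malformed grade into one clear ValueError, which the caller can treat as a failed grading attempt rather than a crash.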
Implementation Variant 2: TypeScript with Typed Results
import Anthropic from '@anthropic-ai/sdk'

// Placeholder prompts (assumptions; the originals are not shown here). The grader
// prompt must request JSON matching GradeResult, including its camelCase keys.
const WRITER_SYSTEM_PROMPT = "You write content artifacts to the user's specification."
const GRADER_SYSTEM_PROMPT = 'You grade an artifact against a rubric. Return JSON only.'

interface RubricCriterion {
  id: string
  description: string
  failAction: 'retry' | 'escalate'
  maxRetries?: number
}

interface GradeResult {
  passed: boolean
  score: number
  failedCriteria: Array<{ id: string; reason: string; action: 'retry' | 'escalate' }>
  feedback: string
}

interface GradedArtifact {
  artifact: string
  grade: GradeResult
  attempts: number
}

async function runWithGrading(
  writerPrompt: string,
  rubric: RubricCriterion[],
  maxRetries = 3
): Promise<GradedArtifact> {
  const client = new Anthropic()
  let graderFeedback: string | null = null
  let attempts = 0

  while (attempts < maxRetries) {
    // Writer turn: on a retry, prepend the grader's feedback to the original request
    const writerMessages: Anthropic.MessageParam[] = [
      {
        role: 'user',
        content: graderFeedback
          ? [
              'Previous attempt was rejected.',
              'Grader feedback:',
              graderFeedback,
              'Original request:',
              writerPrompt,
            ].join('\n')
          : writerPrompt,
      },
    ]

    const writerResponse = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 4096,
      system: WRITER_SYSTEM_PROMPT,
      messages: writerMessages,
    })
    const writerBlock = writerResponse.content[0]
    const artifact = writerBlock?.type === 'text' ? writerBlock.text : ''

    // Fresh context for grader: a separate call with no writer history
    const graderResponse = await client.messages.create({
      model: 'claude-haiku-4-5-20251001',
      max_tokens: 1024,
      system: GRADER_SYSTEM_PROMPT,
      messages: [
        {
          role: 'user',
          content: [
            'Grade this artifact against the rubric. Return JSON only.',
            'Artifact:',
            artifact,
            'Rubric criteria:',
            JSON.stringify(rubric, null, 2),
          ].join('\n'),
        },
      ],
    })
    const graderBlock = graderResponse.content[0]
    const graderText = graderBlock?.type === 'text' ? graderBlock.text : '{}'
    // The cast is unchecked: JSON.parse throws on invalid JSON but does not validate shape
    const grade = JSON.parse(graderText) as GradeResult
    attempts++

    if (grade.passed) {
      return { artifact, grade, attempts }
    }

    const hasBlockingFailure = grade.failedCriteria.some(c => c.action === 'escalate')
    if (hasBlockingFailure) {
      throw new Error(`Blocking criterion failed: ${JSON.stringify(grade.failedCriteria)}`)
    }

    graderFeedback = grade.feedback
  }

  throw new Error(`Max retries exceeded after ${maxRetries} attempts`)
}
Implementation Variant 3: LangGraph Eval Node
For teams already using LangGraph, the self-grading pattern maps cleanly to a conditional edge with an eval node:
from typing import Optional, TypedDict

from langgraph.graph import StateGraph, END


class AgentState(TypedDict):
    prompt: str
    artifact: Optional[str]
    grade: Optional[dict]
    attempts: int
    feedback: Optional[str]


# Placeholders (assumptions): replace with real model calls and a real handoff.
def run_writer(prompt: str, feedback: Optional[str]) -> str:
    raise NotImplementedError("call the writer model and return the artifact")


def run_grader(artifact: str) -> dict:
    raise NotImplementedError("call the grader model and return the parsed grade")


def escalate_node(state: AgentState) -> AgentState:
    raise NotImplementedError("hand the artifact and grade to human review")


def writer_node(state: AgentState) -> AgentState:
    # Writer uses state["feedback"] if available
    artifact = run_writer(state["prompt"], state.get("feedback"))
    return {**state, "artifact": artifact, "attempts": state["attempts"] + 1}


def grader_node(state: AgentState) -> AgentState:
    # Grader gets ONLY the artifact: state["prompt"] and state["feedback"] are not forwarded
    grade = run_grader(state["artifact"])
    return {**state, "grade": grade, "feedback": grade.get("feedback")}


def routing_function(state: AgentState) -> str:
    if state["grade"]["passed"]:
        return "done"
    if state["attempts"] >= 3:
        return "escalate"
    if any(c["action"] == "escalate" for c in state["grade"]["failed_criteria"]):
        return "escalate"
    return "retry"


builder = StateGraph(AgentState)
builder.add_node("writer", writer_node)
builder.add_node("grader", grader_node)
builder.add_node("escalate", escalate_node)
builder.set_entry_point("writer")
builder.add_edge("writer", "grader")
builder.add_conditional_edges("grader", routing_function, {
    "done": END,
    "retry": "writer",
    "escalate": "escalate",
})
builder.add_edge("escalate", END)  # escalation is terminal
graph = builder.compile()
# Invoke with attempts initialized, e.g. graph.invoke({"prompt": "...", "attempts": 0})
The key implementation detail: the grader node forwards state["artifact"] only. LangGraph hands every node the full state, so the separation is enforced by what grader_node passes to run_grader, not by the graph itself. If the grader model saw state["prompt"], it would read the original instruction and develop sympathy for the writer's choices.
Implementation Variant 4: Managed Agents (Production-Grade)
For production agents that run on a schedule and need audit trails, Anthropic's Managed Agents platform adds session management, Outcomes tracking, and webhook delivery on top of the base pattern. See the companion article on ORACLE PRIME for the full implementation — the architecture is identical to Variant 2 except API calls go to the managed sessions endpoints and outcomes are written explicitly.
Common Failure Modes
The grader that reads the room. If you pass the original writer prompt to the grader (even as context "so the grader understands the task"), the grader adjusts its standards based on the difficulty of the task. "Given how complex the requirement was, this is actually quite good." Fix: pass only the artifact and the rubric to the grader. Never explain the context.
The rubric that always passes. Rubrics that are too lenient produce high pass rates and low quality. If your self-grading loop has a 95%+ first-attempt pass rate, your rubric is too weak. Good production rubrics have 60-70% first-attempt pass rates. The retries are where quality improvement happens.
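A quick way to check where a rubric sits is to compute the first-attempt pass rate from run records. This sketch assumes each record carries the `attempts` count and a final `passed` flag; the sample data is made up for illustration, and the 95% threshold comes from the rule of thumb above.

```python
def first_attempt_pass_rate(runs: list[dict]) -> float:
    """Fraction of runs whose artifact passed on attempt 1."""
    if not runs:
        raise ValueError("no runs to summarize")
    first_try = sum(1 for r in runs if r["passed"] and r["attempts"] == 1)
    return first_try / len(runs)


# Made-up run records for illustration.
runs = [
    {"attempts": 1, "passed": True},
    {"attempts": 2, "passed": True},
    {"attempts": 1, "passed": True},
    {"attempts": 3, "passed": False},
]

rate = first_attempt_pass_rate(runs)
rubric_looks_too_lenient = rate >= 0.95
```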
The feedback loop that diverges. If the grader's feedback on attempt 2 contradicts its feedback on attempt 1, the writer has no stable target to hit. This happens when rubric criteria are ambiguous enough that the grader interprets them differently across sessions. Fix: make criteria deterministic enough that the same artifact always produces the same grade.
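One way to detect this before it bites is to grade the same artifact several times and compare verdicts. The sketch below takes the grader as a plain callable so it can be tested without an API call. `grading_is_stable` and both stub graders are hypothetical, and the grade shape is assumed to match the earlier examples.

```python
from itertools import cycle
from typing import Callable


def grading_is_stable(grade_fn: Callable[[str], dict], artifact: str, trials: int = 3) -> bool:
    """Grade the same artifact several times; True if every verdict matches."""
    verdicts = set()
    for _ in range(trials):
        grade = grade_fn(artifact)
        failed = frozenset(c["id"] for c in grade["failed_criteria"])
        verdicts.add((grade["passed"], failed))
    return len(verdicts) == 1


# Stub graders standing in for real model calls (hypothetical).
def steady_grader(artifact: str) -> dict:
    return {"passed": False, "failed_criteria": [{"id": "specificity"}]}


_alternating = cycle([
    {"passed": False, "failed_criteria": [{"id": "specificity"}]},
    {"passed": True, "failed_criteria": []},
])


def flaky_grader(artifact: str) -> dict:
    return next(_alternating)
```

Only the pass/fail verdict and the set of failed ids are compared, since feedback wording will vary between calls even when the grade is the same.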
Using the same model for writer and grader. This is less a failure than a missed saving. The pattern works within a single model family, because the key is separate context windows, not separate models. But there is a practical benefit to using different model tiers: the grader does not need creative generation capability. Use Haiku for grading and Sonnet or Opus for writing. Depending on the tiers involved and current pricing, this can cut grading cost several-fold, and binary criteria leave little room for quality loss.
When Not to Use This Pattern
Self-grading adds latency and cost. For many agent tasks, that trade-off is not worth it:
- Interactive chat — Users do not want to wait 45 seconds for a graded response. Real-time applications need different quality assurance approaches.
- Simple lookups or transformations — If the task is "what is the capital of France" or "convert this JSON to CSV," grading is overkill. Save the pattern for complex, judgment-heavy artifacts.
- Tasks with external ground truth — If you can run unit tests, linters, or validators on the output, use those instead. Deterministic verification beats probabilistic grading every time.
- Low-stakes content — Internal notes, drafts, brainstorms. Self-grading is for artifacts that ship to users or drive business decisions.
The pattern earns its cost when: the artifact is complex enough that quality varies significantly across attempts, errors have meaningful consequences (wrong information ships to users, bad code goes to production), and the artifact cannot be verified by deterministic tools.
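For contrast, here is what the "external ground truth" case looks like for the JSON-to-CSV example: a deterministic check that needs no grader at all. This is a sketch assuming the JSON is a flat list of records; `csv_matches_json` is a hypothetical helper, and column order is not compared.

```python
import csv
import io
import json


def csv_matches_json(json_text: str, csv_text: str) -> bool:
    """True if the CSV has the same column names and row values as the JSON records."""
    records = json.loads(json_text)
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # CSV cells are always strings, so compare against stringified JSON values.
    expected = [{k: str(v) for k, v in rec.items()} for rec in records]
    return rows == expected
```

A check like this returns the same answer every time for the same input, which is why it beats a model-based grade whenever it is available.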
Building the Feedback Dataset
Every graded artifact is a training signal. A writer artifact that required 2 revisions before passing contains exactly the information you need to improve your writer system prompt — the original artifact shows what the model does by default, and the grader's feedback shows what was missing.
Over time, a self-grading loop generates a dataset of (prompt, artifact, grade, feedback, revised artifact) tuples. This is the raw material for fine-tuning, for improving system prompts, and for identifying patterns in what your writer agent consistently gets wrong.
At a minimum, log every grading cycle with: the session type, attempt number, criteria that failed, and final pass/fail. After 100 runs, review the most frequently failing criteria. Those are your writer agent's systematic weaknesses — and they are fixable by updating the writer's system prompt with explicit instructions about those specific failure modes.
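The minimum logging described above fits in a JSON Lines file and two small functions. This is a sketch under assumptions: the field names, the file layout, and both function names are illustrative, not a prescribed schema.

```python
import json
from collections import Counter
from pathlib import Path


def log_cycle(path: Path, session_type: str, attempt: int,
              failed_ids: list[str], passed: bool) -> None:
    """Append one grading cycle to a JSON Lines log."""
    record = {
        "session_type": session_type,
        "attempt": attempt,
        "failed_criteria": failed_ids,
        "passed": passed,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def most_common_failures(path: Path, top_n: int = 5) -> list[tuple[str, int]]:
    """Count how often each criterion failed across all logged cycles."""
    counts: Counter = Counter()
    with path.open(encoding="utf-8") as f:
        for line in f:
            counts.update(json.loads(line)["failed_criteria"])
    return counts.most_common(top_n)
```

Calling most_common_failures on the log after a batch of runs yields the ranked list of criteria to address in the writer's system prompt.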
The self-grading pattern is the lowest-effort thing you can do to improve agent quality immediately, and the highest-leverage thing you can do to systematically improve it over time. Those two properties rarely coexist. Use them.