TL;DR

How I rebuilt ORACLE PRIME on Claude Managed Agents with Outcomes — a self-grading architecture that refuses to ship until a separate AI grader confirms every rubric criterion. Full code + cost math.

For the last eight months I've been running a local Claude Code skill called ORACLE PRIME — a weekly competitive intelligence scanner that reads 40+ sources, identifies market shifts, and produces a structured briefing. It worked well enough. But it had no way to know when it produced a mediocre briefing versus an exceptional one. The output quality varied by 40% depending on which sources were available and what Claude happened to focus on. There was no feedback loop, no quality floor, no way to block a weak report from shipping.

The self-grading pattern, now running on Anthropic's Managed Agents platform with Outcomes, fixes that. The agent refuses to ship until a separate grader model reads the artifact against a rubric and returns a score above the threshold. If the score is too low, the writer retries with the grader's feedback. Three tries maximum. If it still fails, it escalates to a human rather than shipping garbage.

This post covers the full architecture: the 6-API-call session flow, the coordinator system prompt, rubric design, the session trigger script, the webhook handler, and the cost math. Total cost per full scan: $2.36. That number is the most important detail in this article — because it is what makes the self-grading loop financially viable.

Why Local Skills Hit a Quality Ceiling

CLAUDE.md skills are powerful. They encode institutional memory, enforce conventions, and turn complex multi-step workflows into single slash commands. But they have a structural weakness: the writer and the judge are the same model instance in the same context window. When the writer finishes, it is in a state of completion bias — it has been building toward a goal for the last 20,000 tokens and is psychologically (if that word applies to language models) committed to the output it just produced.

Ask that same model to review its own work and it will find minor issues, rephrase a sentence or two, and declare the output acceptable. It will not find the 3 sections that are shallow, the 2 sources it missed, or the conclusion that contradicts the evidence in section 4. Human writers have the same problem — which is why every serious editorial process separates the writer from the editor.

The Managed Agents platform makes separation architecturally trivial. Each agent call is a fresh session with its own context window, its own system prompt, and no memory of what the previous agent produced except what you explicitly pass through the API. The grader genuinely does not know what the writer was "trying" to do — it only sees the artifact and the rubric.

The 6-API-Call Architecture

Here is the exact flow ORACLE PRIME uses on each weekly scan trigger:

Session Create (writer) — POST to /v1/agents/sessions with the coordinator system prompt and the week's source list. Returns a session ID.
Tool calls (research) — The writer agent runs 8-12 tool calls: web search, competitor pricing APIs, GSC data via MCP, GitHub trending, Hacker News top stories. Each returns structured JSON.
Artifact generation — The writer produces the briefing artifact: 1,200-1,800 word structured markdown with 5 fixed sections (Market Shifts, Competitor Moves, Audience Signals, Risk Flags, 3 Ship-Now Actions).
Session Create (grader) — New session, completely separate context. System prompt is the rubric. Input is the artifact. No other context.
Grader evaluation — The grader scores each of the 8 rubric criteria on a 1-5 scale and returns structured JSON. If any criterion scores below 3, the overall score fails.
Outcome write — POST to /v1/agents/sessions/{id}/outcome with the final score, pass/fail status, and the artifact. This is what the Outcomes API is designed for.

If the grader returns a fail, the feedback JSON is passed back to the writer as a new user turn: "The grader found these issues. Revise the artifact." Maximum 3 revision cycles. After 3 failures, the Outcomes API records a status: "escalated" event and a Telegram alert fires to my phone.

Criterion	Pass Condition	Weight
Citation coverage	Every Market Shift item has a URL or named source + date	High
Competitor specificity	At least one pricing number or specific feature name per competitor mentioned	High
Action completeness	All 3 Ship-Now Actions have OWNER + DEADLINE + EXPECTED IMPACT filled	High
Section completeness	No section is empty or says "no data available" without explanation	Medium
No speculation language	Zero instances of "might," "could suggest," "appears to," "seems"	Medium
Freshness	At least 60% of cited sources are within the last 14 days	Medium
Actionability	Ship-Now Actions are completable in under 4 hours by one person	Low
No repetition	No item is identical to an item from the previous briefing	Low

API Call	Model	Input Tokens	Output Tokens	Cost
Writer session (with tools)	Sonnet 4.6	~22,000	~2,400	$1.10
Grader session	Haiku 4.5	~3,500	~600	$0.07
Outcome write	N/A	—	—	$0.001
Tool calls (web search × 8)	N/A	—	—	$0.80

Why Local Skills Hit a Quality Ceiling

The 6-API-Call Architecture

Try Our Free Tools

JSON Formatter & Validator

cURL to Code Converter

More from AI Tools & Tutorials

Imagen 3 & 4 Shut Down June 24: Migrate to Gemini Image (2026)

The Coordinator System Prompt

Rubric Design: The 8 Criteria

The Session Trigger Script

The Webhook Handler

Cost Math: $2.36 Per Full Scan

Why Outcomes Is the Real Story

Sources

Ready to ship faster?

One insight, every Monday. 7am IST. Zero fluff.

Comments · 0

Article stats

Regex Playground

Base64 Encoder / Decoder

UUID Generator

Grok Build Agent Dashboard: Run 8 Parallel Coding Agents From One Screen

Build an MCP Server in TypeScript (2026): Claude Code Guide

Income Tax Calculator India 2025-26: Complete Guide

OpenAI Codex Goal Mode Is Now GA — Multi-Hour Autonomous Coding Sessions

GitHub Copilot Token Billing Week 1: What Developers Are Actually Paying