The Coordinator System Prompt
The coordinator is not a separate agent — it is the logic layer I wrote in TypeScript that orchestrates the two agents. Here is the writer agent's system prompt, which is the single most important configuration decision in the whole architecture:
You are ORACLE PRIME, a competitive intelligence analyst for wowhow.cloud.
## Mission
Produce a weekly briefing that a solo developer founder can act on within 24 hours.
Every section must be grounded in the source data provided. No speculation.
Every "Ship-Now Action" must be specific enough to assign to a person (or agent).
## Output Format (STRICT — the grader will check this)
## Market Shifts (3-5 items, each with source citation)
## Competitor Moves (pricing changes, feature launches, positioning shifts)
## Audience Signals (forum threads, Reddit discussions, support tickets that reveal pain)
## Risk Flags (anything that could hurt traffic, revenue, or reputation in 30 days)
## 3 Ship-Now Actions (format: ACTION | OWNER | DEADLINE | EXPECTED IMPACT)
## Quality Standards
- Each Market Shift item must cite a specific source (URL or named publication + date)
- Competitor Moves must include specific pricing numbers, not ranges
- Ship-Now Actions must be completable in under 4 hours by one person
- No section may be empty — if data is unavailable, say so explicitly
## What I Will NOT Do
- Speculate beyond the source data
- Use phrases like "it appears" or "might suggest"
- Report the same shift two weeks in a row without noting new developments
The constraint that every Market Shift must cite a specific source is what separates this from a hallucination-prone summary. The grader checks for citations. If the writer produces an uncited claim, the grader fails that criterion and the writer must revise.
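Because the citation rule is mechanical, part of it could also be checked deterministically before the model grader runs. The following is a minimal sketch under stated assumptions, not the author's grader: it assumes each Market Shift item is a "- " bullet under the "## Market Shifts" heading, and it approximates "URL or named source + date" as "contains a URL or an ISO date". The function name `findUncitedShifts` and both patterns are hypothetical.

```typescript
// Hypothetical pre-check: return Market Shift bullets that contain neither a URL
// nor an ISO date (YYYY-MM-DD). Both patterns are illustrative approximations.
const URL_PATTERN = /https?:\/\/\S+/
const DATE_PATTERN = /\b\d{4}-\d{2}-\d{2}\b/

function findUncitedShifts(briefing: string): string[] {
  const uncited: string[] = []
  let inMarketShifts = false
  for (const line of briefing.split('\n')) {
    // Every "## " heading either opens or closes the Market Shifts section.
    if (line.startsWith('## ')) {
      inMarketShifts = line.startsWith('## Market Shifts')
      continue
    }
    const item = line.trim()
    if (inMarketShifts && item.startsWith('- ')) {
      const cited = URL_PATTERN.test(item) || DATE_PATTERN.test(item)
      if (!cited) uncited.push(item)
    }
  }
  return uncited
}
```

A non-empty result could be fed back to the writer directly, saving a grader call on briefings that would fail the citation criterion anyway.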
Rubric Design: The 8 Criteria
The rubric is the most important piece of the self-grading pattern, and the one nobody talks about. A vague rubric produces vague scores. These 8 criteria are designed to be measurable, so a grader can determine pass/fail without ambiguity:
| Criterion | Pass Condition | Weight |
|---|---|---|
| Citation coverage | Every Market Shift item has a URL or named source + date | High |
| Competitor specificity | At least one pricing number or specific feature name per competitor mentioned | High |
| Action completeness | All 3 Ship-Now Actions have OWNER + DEADLINE + EXPECTED IMPACT filled | High |
| Section completeness | No section is empty or says "no data available" without explanation | Medium |
| No speculation language | Zero instances of "might," "could suggest," "appears to," "seems" | Medium |
| Freshness | At least 60% of cited sources are within the last 14 days | Medium |
| Actionability | Ship-Now Actions are completable in under 4 hours by one person | Low |
| No repetition | No item is identical to an item from the previous briefing | Low |
The three "High" criteria are blocking — a fail on any one of them fails the whole briefing regardless of how well the other criteria score. The "Medium" and "Low" criteria use a 1-5 scale; the briefing needs a combined score of 28/40 on those five criteria to pass.
This two-tier structure came from trial and error. My first rubric had all 8 criteria equally weighted. The result was that a briefing with citation coverage of 90% and three empty sections could still pass if the other criteria were perfect. That produced exactly the kind of quality floor I was trying to eliminate — technically passing but actually unacceptable.
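The two-tier rule can be sketched as a small function: blocking criteria act as a gate, and the remaining criteria contribute a weighted sum that is compared to a threshold. This is an illustration under assumptions, not the author's grader. In particular, the text does not spell out how five 1-5 scores map onto a 40-point scale, so the per-criterion `weight` is a hypothetical parameter (doubling the three Medium criteria and counting the two Low criteria once happens to give a 40-point maximum). The names `evaluateRubric`, `ScaledCriterion`, and `RubricVerdict` are invented for this sketch.

```typescript
// Sketch of the two-tier pass rule. Weights and threshold are caller-supplied
// assumptions, not values confirmed by the article.
interface ScaledCriterion {
  name: string
  score: number  // 1-5, as assigned by the grader
  weight: number // hypothetical multiplier, e.g. 2 for Medium and 1 for Low
}

interface RubricVerdict {
  passed: boolean
  weightedTotal: number
  maxTotal: number
  failedBlocking: string[]
}

function evaluateRubric(
  blocking: Record<string, boolean>,
  scaled: ScaledCriterion[],
  threshold: number,
): RubricVerdict {
  // Tier 1: any failed "High" criterion fails the briefing outright.
  const failedBlocking = Object.keys(blocking).filter((name) => !blocking[name])
  // Tier 2: weighted sum of the 1-5 scores, compared against the threshold.
  const weightedTotal = scaled.reduce((sum, c) => sum + c.score * c.weight, 0)
  const maxTotal = scaled.reduce((sum, c) => sum + 5 * c.weight, 0)
  return {
    passed: failedBlocking.length === 0 && weightedTotal >= threshold,
    weightedTotal,
    maxTotal,
    failedBlocking,
  }
}
```

Keeping the gate separate from the sum is what prevents a strong score on the lower tiers from compensating for a missing citation or an incomplete action.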
The Session Trigger Script
import Anthropic from '@anthropic-ai/sdk'

const client = new Anthropic()

// WRITER_SYSTEM_PROMPT, GRADER_SYSTEM_PROMPT, the three tool definitions,
// extractArtifact, parseGraderResponse, and sendTelegramAlert are not shown
// here and must be defined elsewhere.

interface OracleRunResult {
  sessionId: string
  graderSessionId: string
  artifact: string
  score: number
  passed: boolean
  attempts: number
  outcomeId: string
}

async function runOraclePrimeScan(sourceList: string[]): Promise<OracleRunResult> {
  const MAX_ATTEMPTS = 3
  let attempts = 0
  let graderFeedback: string | null = null

  // Create writer session (reused across attempts so revisions keep their context)
  const writerSession = await client.beta.managedAgents.sessions.create({
    system_prompt: WRITER_SYSTEM_PROMPT,
    tools: [webSearchTool, gscTool, pricingApiTool],
    model: 'claude-sonnet-4-6',
  })

  // Build initial user message
  let userMessage = `Run a full ORACLE PRIME scan for the week of ${new Date().toISOString().slice(0, 10)}.
Sources to analyze:
${sourceList.join('\n')}`

  while (attempts < MAX_ATTEMPTS) {
    // On a retry, replace the prompt with the grader's feedback
    if (graderFeedback) {
      userMessage = `The grader rejected your previous artifact. Specific issues:
${graderFeedback}
Revise the artifact to address every point above.`
    }

    // Run writer
    const writerResponse = await client.beta.managedAgents.sessions.run(
      writerSession.id,
      { message: userMessage }
    )
    const artifact = extractArtifact(writerResponse)

    // Grade the artifact in a fresh session
    const graderSession = await client.beta.managedAgents.sessions.create({
      system_prompt: GRADER_SYSTEM_PROMPT,
      model: 'claude-haiku-4-5-20251001', // Cheaper model for grading
    })
    const graderResponse = await client.beta.managedAgents.sessions.run(
      graderSession.id,
      {
        message: `Grade this briefing artifact against the rubric:
${artifact}`
      }
    )
    const gradeResult = parseGraderResponse(graderResponse)
    attempts++

    if (gradeResult.passed) {
      // Write outcome
      const outcome = await client.beta.managedAgents.sessions.outcomes.create(
        writerSession.id,
        {
          status: 'success',
          score: gradeResult.score,
          artifact,
          metadata: { attempts, gradeResult },
        }
      )
      return {
        sessionId: writerSession.id,
        graderSessionId: graderSession.id,
        artifact,
        score: gradeResult.score,
        passed: true,
        attempts,
        outcomeId: outcome.id,
      }
    }

    graderFeedback = gradeResult.feedback
  }

  // Escalate after max attempts
  await client.beta.managedAgents.sessions.outcomes.create(writerSession.id, {
    status: 'escalated',
    score: 0,
    metadata: { attempts, lastFeedback: graderFeedback },
  })
  await sendTelegramAlert(`ORACLE PRIME failed after ${MAX_ATTEMPTS} attempts. Manual review required.`)
  throw new Error(`ORACLE PRIME failed after ${MAX_ATTEMPTS} grading attempts`)
}
Two things to notice. First, the grader runs on Haiku, not Sonnet. Grading a structured document against a checklist is a mechanical task — Haiku handles it well and costs 5x less. This is the model tiering decision that makes the self-grading loop financially viable. Second, the outcome write happens inside the loop, not outside it. The outcome captures the attempt count, the final score, and the artifact — which makes the Outcomes API useful as an audit trail, not just a success/failure flag.
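The loop above depends on `parseGraderResponse` turning the grader's reply into a pass/fail flag, a score, and feedback, but that helper is not shown. One way it might look, assuming the grader is instructed to reply with a JSON object of the shape `{ "passed": boolean, "score": number, "feedback": string }`: the shape, the name `parseGradeJson`, and the fail-closed behavior are all assumptions of this sketch, and it takes the reply text as a plain string rather than whatever response object the SDK returns.

```typescript
// Hypothetical parser for the grader's reply. Any reply that cannot be
// validated is treated as a failed grade so a broken grader never passes a briefing.
interface GradeResult {
  passed: boolean
  score: number
  feedback: string
}

function parseGradeJson(raw: string): GradeResult {
  const failClosed = (reason: string): GradeResult => ({ passed: false, score: 0, feedback: reason })

  // Take the outermost {...} span so prose around the JSON does not break parsing.
  const start = raw.indexOf('{')
  const end = raw.lastIndexOf('}')
  if (start === -1 || end <= start) return failClosed('Grader returned no JSON object')

  let parsed: unknown
  try {
    parsed = JSON.parse(raw.slice(start, end + 1))
  } catch {
    return failClosed('Grader returned malformed JSON')
  }
  if (typeof parsed !== 'object' || parsed === null) return failClosed('Grader returned a non-object value')

  const obj = parsed as Record<string, unknown>
  if (typeof obj.passed !== 'boolean' || typeof obj.score !== 'number' || typeof obj.feedback !== 'string') {
    return failClosed('Grader JSON is missing passed, score, or feedback')
  }
  return { passed: obj.passed, score: obj.score, feedback: obj.feedback }
}
```

Failing closed means a malformed grader reply consumes one of the three attempts instead of silently approving the artifact.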
The Webhook Handler
Managed Agent sessions can run asynchronously. For a scan that takes 4-7 minutes, blocking a cron job thread is wasteful. Here is the webhook handler that receives the completion event:
// src/app/api/webhooks/oracle-prime/route.ts
import { NextRequest, NextResponse } from 'next/server'
import crypto from 'node:crypto'

// redis, sendTelegramBriefing, sendTelegramAlert, and the AgentOutcomeEvent type
// are not shown here and must be imported from elsewhere.

export async function POST(req: NextRequest): Promise<NextResponse> {
  const signature = req.headers.get('x-anthropic-signature') ?? ''
  const body = await req.text()

  // Verify signature: HMAC-SHA256 of the raw body, hex-encoded
  const expected = crypto
    .createHmac('sha256', process.env.ANTHROPIC_WEBHOOK_SECRET ?? '')
    .update(body)
    .digest('hex')
  const received = Buffer.from(signature)
  const wanted = Buffer.from(expected)
  // timingSafeEqual throws on inputs of different lengths, so check length first
  if (received.length !== wanted.length || !crypto.timingSafeEqual(received, wanted)) {
    return NextResponse.json({ error: 'Invalid signature' }, { status: 401 })
  }

  const event = JSON.parse(body) as AgentOutcomeEvent
  if (event.type === 'agent.session.outcome') {
    const { status, artifact, score, metadata } = event.data
    if (status === 'success') {
      // Store briefing in Redis with 7-day TTL
      await redis.setex(`oracle:briefing:${event.data.week}`, 604800, artifact)
      // Post to internal Telegram channel
      await sendTelegramBriefing(artifact, score, metadata.attempts)
    } else if (status === 'escalated') {
      await sendTelegramAlert(
        `ORACLE PRIME escalated after ${metadata.attempts} attempts. Check /admin/oracle for details.`
      )
    }
  }

  return NextResponse.json({ received: true })
}
The timing-safe comparison on the signature is load-bearing. Comparing signatures with ordinary string equality can reveal, through response timing, how many leading characters matched, and that is a routine finding in security audits. The crypto.timingSafeEqual call adds negligible overhead and removes that side channel. One caveat: it throws if its two inputs differ in length, so the lengths have to be checked before the call.
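The verification step can be pulled out into a helper that is easy to test on its own. This is a minimal sketch assuming the same scheme the handler above uses (HMAC-SHA256 over the raw body, hex-encoded); the function name `verifySignature` is hypothetical, and whether the sender actually signs this way is an assumption carried over from the handler.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto'

// Sketch: verify a hex-encoded HMAC-SHA256 signature over the raw request body.
// Returns false, rather than throwing, for malformed or wrong-length signatures.
function verifySignature(body: string, signature: string, secret: string): boolean {
  // Fail closed if the secret is unset; an empty key would be trivially forgeable.
  if (secret.length === 0) return false

  const expected = createHmac('sha256', secret).update(body).digest()
  // Invalid hex decodes to a shorter buffer, which the length check then rejects.
  const received = Buffer.from(signature, 'hex')

  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  if (received.length !== expected.length) return false
  return timingSafeEqual(received, expected)
}
```

The length comparison leaks nothing useful, since the digest length of SHA-256 is public. Rejecting an empty secret is a design choice of this sketch: it turns a missing environment variable into rejected requests instead of an endpoint that accepts signatures computed with an empty key.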
Cost Math: $2.36 Per Full Scan
Here is the token breakdown for a typical successful scan (first attempt pass, which happens about 65% of the time):
| API Call | Model | Input Tokens | Output Tokens | Cost |
|---|---|---|---|---|
| Writer session (with tools) | Sonnet 4.6 | ~22,000 | ~2,400 | $1.10 |
| Grader session | Haiku 4.5 | ~3,500 | ~600 | $0.07 |
| Outcome write | N/A | — | — | $0.001 |
| Tool calls (web search × 8) | N/A | — | — | $0.80 |
First-attempt pass: ~$1.97. For scans that require one revision cycle (about 30% of runs), add another $0.85 for the second writer pass and second grader session. Weighted average across all run types: $2.36.
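The weighted average is an expected-value calculation over run types, which can be sketched as below. The first two run types use the figures from the text. The remaining roughly 5% of runs is not itemized, so its per-run cost is a hypothetical placeholder: $4.67 is simply the value that reproduces the quoted $2.36, a back-calculation and not a measured figure. The names `RunType` and `expectedCostPerScan` are invented for this sketch.

```typescript
// Expected cost per scan as a share-weighted sum over run types.
interface RunType {
  share: number // fraction of all runs, between 0 and 1
  cost: number  // dollars per run of this type
}

function expectedCostPerScan(runTypes: RunType[]): number {
  const totalShare = runTypes.reduce((sum, r) => sum + r.share, 0)
  if (Math.abs(totalShare - 1) > 1e-9) {
    throw new Error('Run-type shares must sum to 1')
  }
  return runTypes.reduce((sum, r) => sum + r.share * r.cost, 0)
}

// Figures from the cost table: writer + grader + outcome write + tool calls.
const firstPassCost = 1.10 + 0.07 + 0.001 + 0.80 // about $1.97
const oneRevisionCost = firstPassCost + 0.85

// ASSUMPTION: back-calculated placeholder for the runs the text does not itemize.
const remainingRunCost = 4.67

const averageCost = expectedCostPerScan([
  { share: 0.65, cost: firstPassCost },
  { share: 0.30, cost: oneRevisionCost },
  { share: 0.05, cost: remainingRunCost },
])
console.log(averageCost.toFixed(2)) // prints "2.36"
```

Swapping in measured shares and costs from your own runs shows how sensitive the average is to the first-attempt pass rate.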
The web search tool calls dominate the non-model cost. Each Serper API call costs $0.10, and a full scan runs 8 searches. This is the single biggest optimization opportunity — caching source content across the week and only re-fetching changed sources would cut tool costs by ~60%.
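The caching idea comes down to deciding, per source, whether a stored copy can be reused or a paid fetch is needed. A minimal sketch of that decision, using a time-to-live as a stand-in for real change detection (the text does not say how changed sources would be identified); `CachedSource`, `splitByFreshness`, and the in-memory `Map` are hypothetical.

```typescript
// Sketch: partition source URLs into those served from cache and those to re-fetch.
// A TTL approximates "unchanged"; real change detection is left unspecified.
interface CachedSource {
  content: string
  fetchedAt: number // epoch milliseconds
}

function splitByFreshness(
  urls: string[],
  cache: Map<string, CachedSource>,
  nowMs: number,
  ttlMs: number,
): { reuse: string[]; refetch: string[] } {
  const reuse: string[] = []
  const refetch: string[] = []
  for (const url of urls) {
    const entry = cache.get(url)
    // Reuse only entries that exist and are younger than the TTL.
    if (entry !== undefined && nowMs - entry.fetchedAt < ttlMs) {
      reuse.push(url)
    } else {
      refetch.push(url)
    }
  }
  return { reuse, refetch }
}
```

Only the `refetch` list would incur search costs, so the saving scales with the fraction of sources that land in `reuse`.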
Compare this to the old local-skill approach: no grading, no Outcomes, no retry loop, but also no quality floor. A bad scan cost the same as a good scan — approximately $1.80 in model tokens. The self-grading loop adds $0.56 on average and eliminates the 15-20% of runs that previously produced unusable output that I would catch only on manual review hours later.
Why Outcomes Is the Real Story
The self-grading pattern is the mechanism. Outcomes is the architecture. Here is why that distinction matters.
Without Outcomes, you have a session that ran, produced an artifact, and exited. You know it finished. You do not know if it succeeded by any meaningful measure. You cannot filter runs by quality. You cannot build a feedback dataset. You cannot track improvement over time. You cannot set automated alerts based on quality thresholds.
With Outcomes, every session has a structured result that includes: a status (success, failure, escalated), a numeric score, the final artifact, arbitrary metadata, and a timestamp. You can query the Outcomes API to find all sessions that scored below 30/40 in the last month. You can plot quality over time. You can train on the high-scoring artifacts. You can trigger different downstream actions based on the outcome status rather than just "did the session complete."
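The "below 30/40 in the last month" query can be illustrated without assuming anything about the Outcomes API's own query parameters, by filtering outcome records that have already been fetched or stored locally. The record shape below mirrors the fields listed above, but the field names, `OutcomeRecord`, and `lowScoringSince` are assumptions of this sketch.

```typescript
// Sketch: filter locally held outcome records by score and recency.
// Field names are illustrative, not the API's confirmed schema.
interface OutcomeRecord {
  sessionId: string
  status: 'success' | 'failure' | 'escalated'
  score: number
  createdAt: string // ISO 8601 timestamp
}

function lowScoringSince(
  records: OutcomeRecord[],
  scoreBelow: number,
  sinceMs: number,
): OutcomeRecord[] {
  return records.filter(
    (r) => r.score < scoreBelow && Date.parse(r.createdAt) >= sinceMs,
  )
}
```

Calling `lowScoringSince(records, 30, Date.now() - 30 * 24 * 60 * 60 * 1000)` would return the sessions worth reviewing by hand, and the same records could feed a quality-over-time chart.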
For production agents that run unsupervised, Outcomes is the difference between a system you can trust and a system you have to babysit. The self-grading pattern is what produces meaningful outcomes. But the Outcomes API is what makes those outcomes actionable beyond the immediate session.
The full architecture — ORACLE PRIME CLAUDE.md template, the TypeScript trigger script, the grader system prompt, the webhook handler, and the rubric JSON schema — is available in the WOWHOW product catalog. The AI API Cost Calculator can model your own self-grading loop costs before you build.