PRD-103 Outline: Verification & Quality
Type: Research + Design
Status: Outline (Loop 0)
Depends On: PRD-100 (Research Master), PRD-101 (Mission Schema — success_criteria JSONB), PRD-102 (Coordinator Architecture — verify step in lifecycle)
Blocks: PRD-106 (Outcome Telemetry — verifier_score feeds telemetry)
Section 1: Problem Statement
Why This PRD Exists
Automatos has no automated output verification. Today:
agent_reports with 1-5 star grading (grade SMALLINT)
Human-only, manual, after-the-fact
recipe_quality_service.py — 5-dimension scoring (completeness, accuracy, efficiency, reliability, cost) → 0-1.0 → A-F grade
Recipe-scoped only, not wired to mission tasks
quality_assessor.py (Stage 7) — 5-dimension weighted scoring (completeness, coherence, accuracy, professionalism, clarity)
Per-execution, not against success criteria
FutureAGI live traffic scoring — completeness, is_helpful, is_concise via agent-opt-worker
Prompt-quality eval, not task-outcome verification
RAG quality scorer — avg_similarity, source_diversity, coverage, freshness
RAG-specific, not general-purpose
heartbeat_results JSONB — findings[], actions_taken[], tokens_used, cost
Captures what happened, not whether it was good
BoardTask.result (TEXT) + error_message
Free-text, no structured quality signal
The Verification Gap
Missions (PRD-100 Section 3) require a verify step between task execution and human review. The coordinator (PRD-102) needs a signal: "Did this agent's output actually satisfy the success criteria?" Without this:
Human bottleneck — every task output requires manual review with no pre-screening
No quality gate — bad outputs flow to dependent tasks, cascading failures
No learning signal — PRD-106 telemetry has no verifier_score to learn from
Coordinator is blind — cannot decide whether to retry, continue, or escalate
What This PRD Delivers
A VerificationService that:
Takes a task output + its success_criteria JSONB (from PRD-101's mission_tasks)
Evaluates output against each criterion using LLM-as-judge
Produces a structured score (per-criterion + aggregate)
Returns a pass/fail/partial verdict with confidence
Feeds results to the coordinator for decision-making and to telemetry for learning
Section 2: Prior Art Research Targets
Systems to Study (each gets dedicated research)
LLM-as-Judge (MT-Bench/Arena)
Zheng et al. 2023 (arxiv:2306.05685); Arena-Hard (LMSYS 2024)
Pairwise vs absolute scoring, position bias mitigation, 80%+ human agreement on objective tasks, 50-70% on subjective
Should we use rubric-based absolute scoring (scales to single outputs) or pairwise comparison?
Constitutional AI Critique
Anthropic 2022 (arxiv:2212.08073)
Principle-based self-critique, revision cycles, constitutional principles as evaluation rubric
Should success criteria be expressed as constitutional principles that the verifier checks?
OpenAI Evals Framework
openai/evals GitHub
Modular evaluators, deterministic + model-graded composition, different model for grading vs generation
Should we compose deterministic checks (format, length) with LLM checks (quality, accuracy)?
DeepEval
confident-ai/deepeval GitHub
G-Eval (simple custom metrics), DAG (decision-tree evaluation), 14+ pre-built metrics, Pytest-like test framework
Should we adopt DeepEval's DAG pattern for structured evaluation pipelines?
RAGAS
explodinggradients/ragas GitHub
Faithfulness (95% human agreement), answer relevancy, context precision/recall, rubric-based criteria scoring
Should we use RAGAS metrics for RAG-heavy mission tasks?
FutureAGI (existing)
orchestrator/core/services/futureagi_service.py
Already integrated: assess, optimize, safety, live scoring via agent-opt-worker. Extensible via new worker endpoints
How do we extend FutureAGI's existing /score pattern for mission task verification?
Existing Quality Scoring
recipe_quality_service.py, quality_assessor.py, report_service.py
3 scoring systems already built with overlapping dimensions. Report grading (1-5 stars) is human-only
Should verification unify these scoring approaches or add a fourth?
Key Patterns Discovered in Research
Rubric-based absolute scoring is the right default (Zheng et al. 2023, Raschka 2024): Pairwise comparison requires O(n²) comparisons and doesn't work for evaluating single outputs. Rubric-based scoring (1-5 Likert per dimension) is more stable (~9% score flip rate vs 35% for pairwise under manipulation) and scales to evaluating individual task outputs. MT-Bench achieved 80%+ agreement with human evaluators on objective tasks using this approach.
Self-preference bias requires cross-model verification (arxiv:2410.21819, 2024): LLMs systematically rate their own outputs higher due to lower perplexity of self-generated text. Mission verification MUST use a different model than the one that executed the task. Using a smaller/cheaper model as judge reduces correlation with the generator AND saves cost.
Deterministic-first, LLM-second (OpenAI Evals pattern): Many success criteria can be checked deterministically: word count, format compliance, required sections present, URL validity, JSON schema conformance. LLM judge should only evaluate what deterministic checks cannot: quality, accuracy, completeness of reasoning. This dramatically reduces verification cost.
DeepEval's DAG is the right evaluation architecture: Decision-tree evaluation with conditional branching maps directly to success criteria checking: check format → check completeness → check accuracy → check quality. Each node can be deterministic or LLM-based. Failed early checks skip expensive later checks.
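The deterministic-first, DAG-style gating described above can be sketched as a short pipeline: cheap checks run in order and short-circuit before any LLM call is made. This is a minimal illustration, not the planned VerificationService API; the check names and the `llm_judge` callable are assumptions.

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""

def check_format(output: str) -> CheckResult:
    # Deterministic node: required section headers must be present.
    required = ["Executive Summary", "Findings", "Recommendations"]
    missing = [s for s in required if s not in output]
    return CheckResult("format", not missing, f"missing: {missing}" if missing else "")

def check_length(output: str, min_words: int = 50) -> CheckResult:
    # Deterministic node: zero LLM cost.
    words = len(output.split())
    return CheckResult("length", words >= min_words, f"{words} words")

def verify(output: str, llm_judge) -> list[CheckResult]:
    results = []
    for check in (check_format, check_length):  # deterministic nodes first
        result = check(output)
        results.append(result)
        if not result.passed:
            return results          # failed early check skips the expensive LLM node
    results.append(llm_judge(output))           # LLM node runs only if all gates pass
    return results
```

A malformed output fails the format gate immediately and never incurs an LLM call, which is where the bulk of the cost reduction comes from.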
Verification cost budget: 10-30% of generation cost (industry benchmarks): Single LLM judge run adds ~10-15% of generation cost. Robust evaluation (position-swapped double-check) adds ~20-30%. Ensemble judging (3+ models) adds 50-100%. For missions, single-judge with escalation is the right tradeoff.
Section 3: Verification Taxonomy
Types of Verification (by what they check)
Format Compliance — output matches required structure. Method: deterministic (regex, JSON schema, section headers). Example: "Report must have Executive Summary, Findings, Recommendations sections"
Completeness — all required elements present. Method: hybrid (deterministic count + LLM assessment). Example: "Must cover all 5 EU AI Act risk categories"
Factual Accuracy — claims are grounded in sources. Method: LLM-as-judge with source verification. Example: "All cited regulations must exist and be correctly described"
Success Criteria Match — output satisfies each stated criterion. Method: LLM-as-judge against criteria rubric. Example: "Must identify at least 3 compliance gaps with severity ratings"
Cross-Agent Consistency — multiple agents' outputs don't contradict. Method: LLM comparison of related task outputs. Example: "Research findings and compliance report must agree on risk levels"
Quality Threshold — output meets minimum quality bar. Method: LLM scoring on dimensions (clarity, depth, professionalism). Example: "Writing quality score ≥ 3.5/5"
Verification Granularity
Per-criterion — scope: individual success criterion. Runs after task execution. Cost: 1 LLM call per criterion
Per-task — scope: aggregate across all criteria for one task. Runs after per-criterion checks. Cost: 0 additional (aggregation)
Cross-task — scope: consistency between related task outputs. Runs after dependent tasks complete. Cost: 1 LLM call per pair
Per-mission — scope: overall mission quality assessment. Runs before human review. Cost: 1 LLM call (summary)
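The per-criterion and per-task granularity levels above suggest result shapes like the following. Field names are illustrative assumptions, not the PRD-101 schema; note that per-task aggregation is pure arithmetic with no additional LLM call.

```python
from dataclasses import dataclass

@dataclass
class CriterionScore:              # per-criterion level: one LLM call each
    criterion: str
    score: float                   # normalized 0.0-1.0
    confidence: float              # judge's self-reported confidence
    must_pass: bool = False        # blocker criterion

@dataclass
class TaskVerification:            # per-task level: aggregation only, 0 extra LLM calls
    task_id: str
    criteria: list[CriterionScore]

    @property
    def aggregate(self) -> float:
        return sum(c.score for c in self.criteria) / len(self.criteria)
```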
Section 4: Key Design Questions
Q1: Same Model or Different Model for Verification?
Answer (from research): ALWAYS different model.
Self-preference bias is empirically demonstrated (arxiv:2410.21819). Options:
Cheaper model judges expensive model — GPT-4o-mini verifies Claude output. Saves cost, reduces correlation.
Same-tier different provider — Claude verifies GPT-4o output (or vice versa). Maximizes independence.
Configurable per mission — user picks verifier model, defaults to cheaper cross-provider.
Design question for PRD: Should mission_runs have a verifier_model field, or should this be workspace-level config?
Q2: Scoring Rubric Design
Options:
Fixed rubric — same 5 dimensions for every task (completeness, accuracy, relevance, clarity, format)
Dynamic rubric — generated from success_criteria JSONB per task
Hybrid — fixed quality dimensions + per-task criteria dimensions
Recommendation: Hybrid. Fixed quality dimensions provide baseline comparability across missions. Per-task criteria dimensions ensure specific requirements are checked.
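A hybrid rubric could be built by merging the fixed quality dimensions with dimensions generated from success_criteria. The JSONB shape assumed here (a list of objects with description and must_pass) is a sketch, not the finalized PRD-101 schema.

```python
FIXED_DIMENSIONS = ["completeness", "accuracy", "relevance", "clarity", "format"]

def build_rubric(success_criteria: list[dict]) -> list[dict]:
    """Fixed quality dimensions (cross-mission comparability) + per-task criteria."""
    rubric = [{"dimension": d, "source": "fixed", "scale": "1-5"} for d in FIXED_DIMENSIONS]
    for crit in success_criteria:
        rubric.append({
            "dimension": crit["description"],
            "source": "task",
            "scale": "1-5",
            "must_pass": crit.get("must_pass", False),
        })
    return rubric
```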
Q3: Pass/Fail Threshold
Options:
Hard threshold — score ≥ 0.7 passes (configurable per mission)
Per-criterion threshold — each criterion must individually pass
Weighted aggregate — some criteria are must-pass (blockers), others are nice-to-have
Confidence-gated — low confidence scores → human review, high confidence → auto-pass/fail
Design question for PRD: What is the default threshold? Should it be per-workspace, per-mission, or per-task?
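The options above can be combined, as a sketch: must-pass blockers checked first, then confidence gating to human review, then a plain aggregate threshold. The threshold values are placeholders, not PRD defaults.

```python
def verdict(scores: list[dict], threshold: float = 0.7, min_confidence: float = 0.6) -> str:
    """Each score: {"score": 0-1, "confidence": 0-1, "must_pass": bool}."""
    if any(s["must_pass"] and s["score"] < threshold for s in scores):
        return "fail"                 # a failed blocker fails the task regardless of aggregate
    if any(s["confidence"] < min_confidence for s in scores):
        return "human_review"         # low confidence escalates instead of auto-deciding
    aggregate = sum(s["score"] for s in scores) / len(scores)
    return "pass" if aggregate >= threshold else "fail"
```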
Q4: Human Override Flow
When verification fails:
Auto-retry — coordinator retries with feedback from verifier (continuation, not retry — Symphony pattern)
Escalate to human — flag for review with verifier's reasoning
Accept with warning — proceed but mark as low-confidence
Design question for PRD: How many auto-retries before escalation? Should the coordinator modify the prompt based on verifier feedback?
Q5: Verification Cost Control
Verification adds 10-30% to task cost. For a 5-task mission at $2, that's $0.20-0.60.
Options:
Always verify — every task gets verified
Smart verification — only verify tasks above cost threshold or with downstream dependencies
Sample verification — verify a random subset for learning, verify all for high-stakes missions
Tiered — deterministic checks always, LLM verification only for tasks with subjective criteria
Design question for PRD: Should verification be opt-out (default on) or opt-in (default off)?
Q6: Batch vs Inline Verification
Inline — verify immediately after each task completes. Blocks next task if dependency. Enables fast retry.
Batch — verify all tasks after mission completes. Cheaper (can batch LLM calls). Delays feedback.
Hybrid — verify critical-path tasks inline, verify leaf tasks in batch.
Recommendation: Inline for tasks with dependents, batch for terminal tasks.
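The recommendation reduces to a one-line dependency-graph check, sketched here under an assumed adjacency-map representation:

```python
def verification_mode(task_id: str, dependencies: dict[str, list[str]]) -> str:
    """dependencies maps task_id -> list of task_ids it depends on.
    Tasks with dependents are verified inline; leaf tasks are batched."""
    has_dependents = any(task_id in deps for deps in dependencies.values())
    return "inline" if has_dependents else "batch"
```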
Section 5: FutureAGI Integration
What Already Exists
FutureAGI is integrated via futureagi_service.py → agent-opt-worker HTTP proxy:
Prompt assessment (POST /assess) — ✅ Production. Reusable for verification: partially — assesses prompt quality, not task output quality
Live traffic scoring (POST /score) — ✅ Production. Reusable: yes — same pattern: input + output + metrics → scores
Prompt optimization (POST /optimize) — ✅ Production. Reusable: no — optimizes prompts, does not evaluate outputs
Safety check (POST /safety) — ✅ Production. Reusable: yes — safety is one verification dimension
Extension Strategy
New worker endpoint: POST /verify-task
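A possible request/response shape for this endpoint, loosely following the existing /score pattern. Every field name here is an assumption for discussion, not a finalized contract.

```python
# Hypothetical POST /verify-task request body.
verify_task_request = {
    "task_id": "task-123",
    "output": "agent output text",
    "success_criteria": [
        {"description": "identify at least 3 compliance gaps", "must_pass": True},
    ],
    "executor_model": "anthropic/claude-sonnet-4",  # lets the worker enforce cross-model judging
}

# Hypothetical response body.
verify_task_response = {
    "verdict": "pass",               # pass | fail | partial
    "aggregate_score": 0.82,
    "confidence": 0.9,
    "criteria": [
        {"description": "identify at least 3 compliance gaps", "score": 0.8, "confidence": 0.9},
    ],
    "verifier_model": "openai/gpt-4o-mini",
    "reasoning": "chain-of-thought summary",
}
```

Returning per-criterion scores alongside the aggregate keeps the response usable by both the coordinator (verdict) and PRD-106 telemetry (verifier_score).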
Key design: the verifier prompt
The verifier prompt is the most critical design artifact. It must:
Present success criteria as a rubric (not a checklist — rubrics produce better LLM judgments)
Instruct the verifier to reason before scoring (chain-of-thought improves accuracy)
Explicitly instruct against position bias and verbosity bias
Request confidence alongside scores
Use a structured output format (JSON) for reliable parsing
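A minimal template applying the five requirements above (rubric presentation, reason-before-score, bias instructions, confidence, structured JSON output). The wording is illustrative, not the shipped prompt.

```python
def build_verifier_prompt(output: str, rubric: list[str]) -> str:
    # Present criteria as a rubric, not a checklist.
    rubric_text = "\n".join(f"- {dim}: rate 1-5" for dim in rubric)
    return f"""You are verifying a task output against a rubric.

Rubric:
{rubric_text}

Output to verify:
{output}

Instructions:
1. Reason step by step about each rubric dimension BEFORE assigning scores.
2. Do not reward length or position; judge substance only.
3. For each dimension, give a 1-5 score and a 0.0-1.0 confidence.

Respond with JSON only:
{{"scores": {{"<dimension>": {{"score": 1-5, "confidence": 0.0-1.0}}}}, "reasoning": "..."}}"""
```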
Data Flow
Section 6: Acceptance Criteria for Full PRD
Must Have
Should Have
Nice to Have
Section 7: Risks & Dependencies
Risks
Risk 1: LLM-as-judge unreliability on subjective tasks (Likelihood: Medium, Impact: High). Mitigation: use rubric-based scoring (9% flip rate vs 35% pairwise); default to human review for low-confidence scores; track human override rate to measure reliability.
Risk 2: Verification cost exceeding task cost (Likelihood: Medium, Impact: Medium). Mitigation: deterministic checks first (zero LLM cost); skip LLM verification for cheap tasks; single-judge default, ensemble only for high-stakes. Budget: verification ≤ 30% of task cost.
Risk 3: Self-preference bias corrupting scores (Likelihood: High, Impact: High). Mitigation: ENFORCE cross-model verification; never use the same model for execution and verification. Default: cheaper model from a different provider.
Risk 4: False positives (bad output passes) (Likelihood: High, Impact: Medium). Mitigation: multiple criteria checked independently; any "must-pass" criterion failing = task fails regardless of aggregate; human review for missions above cost threshold.
Risk 5: False negatives (good output fails) (Likelihood: Medium, Impact: Medium). Mitigation: confidence scores + human escalation path; track false negative rate via human override data; adjust thresholds based on telemetry (PRD-106).
Risk 6: Verifier prompt engineering is hard (Likelihood: High, Impact: High). Mitigation: start with a simple rubric template; iterate based on human override data; Constitutional AI principles as fallback; leverage FutureAGI's existing eval expertise.
Risk 7: Three existing quality scoring systems create confusion (Likelihood: Medium, Impact: Medium). Mitigation: verification is the new canonical quality signal for missions. Do NOT unify existing scoring systems — they serve different scopes: recipe_quality_service = recipe-scoped rolling average (stays as-is), quality_assessor = per-execution pipeline stage (stays as-is), report grading = human 1-5 stars on agent reports (stays as-is). Verification (PRD-103) is new and separate — it evaluates mission task outputs against explicit success_criteria. Merging these would break existing consumers and conflate different evaluation contexts.
Dependencies
success_criteria JSONB schema (PRD-101) — verification input; must be structured enough for rubric generation
Coordinator verify step (PRD-102) — where in the lifecycle verification is called, and what the coordinator does with results
verifier_score telemetry field (PRD-106) — verification scores must be stored in a format telemetry can aggregate
agent-opt-worker extensibility (existing infra) — new /verify-task endpoint on the worker service
Cross-model API access (existing infra, OpenRouter) — must be able to call a different model for verification than was used for execution
Cross-PRD Connections
PRD-104 (Ephemeral Agents): Contractor agents have no memory/personality — verification is even more important as quality signal since there's no agent reputation to rely on
PRD-105 (Budget): Verification cost must be included in mission budget estimation. Budget enforcement must account for verification retries.
PRD-107 (Context Interface): Verifier may need access to shared context to assess cross-agent consistency. Context interface must support read-only verification access.