PRD-103 — Verification & Quality
Version: 1.0
Type: Research + Design
Status: Complete — Ready for Peer Review
Priority: P0
Dependencies: PRD-100 (Research Master), PRD-101 (Mission Schema — success_criteria JSONB), PRD-102 (Coordinator Architecture — verify step in lifecycle)
Blocks: PRD-106 (Outcome Telemetry — verifier_score feeds telemetry)
Author: Gerard Kavanagh + Claude
Date: 2026-03-15
1. Problem Statement
1.1 The Gap
Automatos has no automated output verification. The platform can execute tasks, but it cannot answer: "Did this agent's output actually satisfy the success criteria?" Without verification, the coordinator (PRD-102) is blind — it cannot decide whether to retry, continue, or escalate.
1.2 Existing Quality Systems
| System | Location | Scope | Limitation |
| --- | --- | --- | --- |
| recipe_quality_service.py — 5-dimension scoring (completeness, accuracy, efficiency, reliability, cost) → A-F grade | orchestrator/core/services/ | Recipe executions only | Not wired to mission tasks, no success criteria matching |
| quality_assessor.py (Stage 7) — 5-dimension weighted scoring (completeness, coherence, accuracy, professionalism, clarity) | orchestrator/modules/orchestrator/stages/ | Per-execution quality | Assesses general quality, not against specific success criteria |
| FutureAGI live traffic scoring — completeness, is_helpful, is_concise via agent-opt-worker | orchestrator/core/services/futureagi_service.py | Prompt quality eval | Evaluates prompts, not task outcomes |
| RAG quality scorer — avg_similarity, source_diversity, coverage, freshness | orchestrator/modules/search/ | RAG-specific metrics | Only for retrieval-augmented tasks |
| agent_reports with 1-5 star grading (grade SMALLINT) | orchestrator/core/models/core.py | Human manual grading | After-the-fact, not automated, not mission-integrated |
| BoardTask.result (TEXT) + error_message | orchestrator/core/models/board.py | Free-text result | No structured quality signal |
Decision: Verification is a new, separate system. It does NOT unify existing scoring. Existing systems serve different scopes: recipe_quality_service = recipe rolling averages (stays), quality_assessor = pipeline stage quality (stays), report grading = human stars on reports (stays). Verification evaluates mission task outputs against explicit success_criteria. Merging these would break existing consumers and conflate different evaluation contexts.
1.3 What This PRD Delivers
A VerificationService that:
Takes a task output + its success_criteria JSONB (from PRD-101's orchestration_tasks)
Runs deterministic checks first (format, length, required sections) — zero LLM cost
Evaluates remaining criteria using LLM-as-judge with a different model than the executor
Produces a structured score (per-criterion + aggregate + confidence)
Returns a pass/fail/partial verdict to the coordinator for decision-making
Feeds results to telemetry (PRD-106) for learning
2. Prior Art: Verification Patterns
2.1 Overview
Six systems and patterns were studied to inform verification design. The core challenge: how do you reliably assess whether an LLM's output satisfies requirements, without hallucinating quality?
2.2 System-by-System Analysis
LLM-as-Judge (Zheng et al. 2023, MT-Bench/Arena; Raschka 2024)
MT-Bench achieved 80%+ agreement with human evaluators on objective tasks using rubric-based absolute scoring. The key findings:
Rubric-based absolute scoring (1-5 Likert per dimension) is more stable than pairwise comparison: ~9% score flip rate vs 35% for pairwise under prompt manipulation (Raschka 2024).
Position bias exists: LLMs prefer the first option in pairwise comparisons. Absolute scoring eliminates this.
Verbosity bias: LLMs rate longer outputs higher. Counter with explicit rubric instructions.
Self-preference bias (arxiv:2410.21819, 2024): LLMs rate their own outputs higher due to lower perplexity of self-generated text.
What we adopt: Rubric-based absolute scoring (not pairwise). Each success criterion becomes a rubric item with a 1-5 scale. This scales to single outputs (no reference output needed) and produces stable scores.
What we reject: Pairwise comparison (requires O(n^2) comparisons, doesn't work for single-output evaluation).
Constitutional AI Critique (Anthropic 2022, arxiv:2212.08073)
Constitutional AI evaluates outputs against a set of principles (the "constitution"). The critic identifies violations and suggests revisions. The key insight: principles as evaluation rubric — each success criterion can be framed as a constitutional principle the output must satisfy.
What we adopt: The principle-based evaluation framing. Success criteria are expressed as principles: "The output MUST cover all 6 EU AI Act risk categories" becomes a constitutional check.
What we reject: The revision cycle (Constitutional AI revises the output; we only evaluate, the agent retries if needed).
OpenAI Evals Framework (openai/evals)
OpenAI Evals composes evaluators: deterministic checks (exact match, regex, JSON schema) + model-graded checks. The key pattern: deterministic-first, LLM-second. Many criteria can be checked without LLM: word count, format compliance, required sections present, URL validity.
What we adopt: Deterministic-first pipeline. Check format, length, schema, required sections BEFORE calling the LLM judge. This dramatically reduces verification cost (deterministic checks are free) and catches obvious failures immediately.
What we reject: The full Evals framework infrastructure (we integrate with FutureAGI instead).
DeepEval (confident-ai/deepeval)
DeepEval's DAG evaluation pattern uses decision-tree evaluation with conditional branching: check format → check completeness → check accuracy → check quality. Each node can be deterministic or LLM-based. Failed early checks skip expensive later checks.
What we adopt: DAG evaluation pipeline. If format check fails, skip LLM quality assessment (no point evaluating quality of a malformed output). This is the "short-circuit" pattern.
What we reject: DeepEval's Pytest-like test framework (over-engineered for our in-process verification).
RAGAS (explodinggradients/ragas)
RAGAS provides specialized metrics for RAG tasks: faithfulness (95% human agreement), answer relevancy, context precision/recall. The key insight: task-type-specific verification dimensions.
What we adopt: For RAG-heavy mission tasks (task_type = "research"), add faithfulness and source grounding as verification dimensions alongside generic quality dimensions.
What we reject: Using RAGAS as the verification framework (too specialized; we need general-purpose verification).
FutureAGI (Existing — futureagi_service.py)
FutureAGI is already integrated via the agent-opt-worker HTTP proxy:
| Capability | Endpoint | Status | Reusable for verification? |
| --- | --- | --- | --- |
| Prompt assessment | POST /assess | Production | Partially — prompt quality, not task output |
| Live traffic scoring | POST /score | Production | Yes — same pattern: input + output + metrics → scores |
| Prompt optimization | POST /optimize | Production | No — optimization, not evaluation |
| Safety check | POST /safety | Production | Yes — safety is one verification dimension |
What we adopt: Extend the agent-opt-worker with a new POST /verify-task endpoint. Same infrastructure, same HTTP proxy pattern, new verification logic.
2.3 Architectural Decisions Summary
| Decision | Choice | Basis | Rationale |
| --- | --- | --- | --- |
| Scoring method | Rubric-based absolute scoring (1-5 Likert) | Zheng et al. 2023; Raschka 2024 | 80%+ human agreement; 9% flip rate (vs 35% pairwise); scales to single outputs |
| Judge model | Always a different model from the executor | arxiv:2410.21819 | Self-preference bias empirically demonstrated; cross-model eliminates the correlation |
| Evaluation pipeline | Deterministic → LLM (DAG with short-circuit) | OpenAI Evals, DeepEval | Free deterministic checks first; skip the expensive LLM judge if format fails |
| Criteria framing | Constitutional principles from success_criteria | Anthropic 2022 | Structured, evaluable, maps directly to the task definition |
| Infrastructure | FutureAGI worker extension (POST /verify-task) | Existing integration | Zero new infrastructure; same proxy pattern |
| Cost target | Verification ≤ 15% of task generation cost | Industry benchmarks | Single judge (not ensemble) with deterministic pre-filtering |
3. VerificationService Interface
3.1 Core Interface
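The interface did not survive export, so the sketch below reconstructs it from the behaviors listed in 1.3. Only VerificationService and success_criteria come from this PRD; VerificationResult, CriterionScore, Verdict, and verify_task are illustrative names, not the shipped API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class Verdict(str, Enum):
    PASS = "pass"
    PARTIAL = "partial"
    FAIL = "fail"


@dataclass
class CriterionScore:
    criterion_id: str
    score: float          # 1-5 Likert (LLM judge) or 1.0/5.0 (deterministic)
    deterministic: bool   # True if no LLM call was needed
    reasoning: str = ""


@dataclass
class VerificationResult:
    verdict: Verdict
    aggregate_score: float                  # weighted mean over criteria, 1-5
    confidence: float                       # judge self-reported, 0-1
    criteria: list[CriterionScore] = field(default_factory=list)


class VerificationService:
    """Evaluates a mission task output against its success_criteria JSONB."""

    def verify_task(self, output: str,
                    success_criteria: list[dict[str, Any]]) -> VerificationResult:
        raise NotImplementedError
```

The structured result (per-criterion scores plus an aggregate and confidence) is what the coordinator consumes in section 7 and what telemetry (PRD-106) records.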
3.2 Verification Pipeline
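A minimal sketch of the deterministic-first, LLM-second pipeline with the short-circuit pattern from 2.2. The run_pipeline function, the criterion dict shape, and the injected llm_judge callable are assumptions for illustration.

```python
def run_pipeline(output, criteria, deterministic_checks, llm_judge):
    """deterministic_checks: {check_name: fn(output, params) -> bool}
    llm_judge: fn(output, criteria) -> list of (criterion, score 1-5).
    Returns (criterion_id, score, was_deterministic) tuples."""
    results = []
    llm_criteria = []
    for c in criteria:
        check = deterministic_checks.get(c.get("check"))
        if check is not None:
            passed = check(output, c.get("params", {}))
            results.append((c["id"], 5.0 if passed else 1.0, True))
            if not passed and c.get("blocking", True):
                return results  # short-circuit: never pay for the LLM judge
        else:
            llm_criteria.append(c)  # subjective criterion, defer to the judge
    if llm_criteria:
        results += [(c["id"], s, False) for c, s in llm_judge(output, llm_criteria)]
    return results
```

A blocking deterministic failure (e.g. malformed output) returns immediately, which is how the free checks keep verification cost near the 8.2 figures.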
4. Deterministic Check Registry
4.1 Check Types
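The check types named in section 2.2 (word count, format compliance, required sections, regex, JSON validity) could be registered as plain callables; the registry name and parameter shapes below are assumptions.

```python
import json
import re


def _is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


# Illustrative registry: each check is fn(output, params) -> bool, zero LLM cost.
DETERMINISTIC_CHECKS = {
    "min_words":         lambda out, p: len(out.split()) >= p["min"],
    "max_words":         lambda out, p: len(out.split()) <= p["max"],
    "regex_match":       lambda out, p: re.search(p["pattern"], out) is not None,
    "valid_json":        lambda out, p: _is_json(out),
    "required_sections": lambda out, p: all(s.lower() in out.lower()
                                            for s in p["sections"]),
}
```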
4.2 Example: Required Sections Check
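One possible implementation of the required-sections check, assuming markdown-heading outputs; the real check's heading-matching rules may differ. Returning the missing sections gives the retry-with-feedback protocol (7.2) something concrete to inject.

```python
import re


def check_required_sections(output: str,
                            sections: list[str]) -> tuple[bool, list[str]]:
    """Pass iff every required section appears as a markdown heading.
    Returns (passed, missing_sections)."""
    headings = {m.group(1).strip().lower()
                for m in re.finditer(r"^#{1,6}\s+(.+)$", output, re.MULTILINE)}
    missing = [s for s in sections if s.lower() not in headings]
    return (not missing, missing)
```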
5. LLM Verification Protocol
5.1 Model Selection
Rule: verifier model MUST differ from executor model.
Self-preference bias (arxiv:2410.21819) is empirically demonstrated — LLMs systematically rate their own outputs higher due to lower perplexity of self-generated text. Cross-model verification eliminates this correlation.
Cost rationale: GPT-4o-mini and Claude Haiku are ~10-20x cheaper than their full counterparts. Using them as judges keeps verification cost at ~10-15% of task generation cost. MT-Bench showed cheaper models are adequate judges for rubric-based evaluation.
5.2 Verifier Prompt Template
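The original template did not survive export. The sketch below is a reconstruction consistent with the rubric-based protocol in 2.3 (absolute 1-5 scoring, anti-verbosity instruction, structured JSON response); placeholder names are assumptions.

```python
VERIFIER_PROMPT = """\
You are a strict evaluator. You did NOT produce this output; judge it only
against the rubric below. Do not reward length or verbosity.

TASK DESCRIPTION:
{task_description}

OUTPUT TO EVALUATE:
{output}

RUBRIC (score each criterion 1-5, where 5 = fully satisfied):
{criteria_rubric}

Respond with JSON only:
{{"scores": [{{"criterion_id": "...", "score": <1-5>, "reasoning": "..."}}],
 "confidence": <0.0-1.0>}}
"""
```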
5.4 Verdict Computation
6. FutureAGI Worker Extension
6.1 New Endpoint: POST /verify-task
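A sketch of the request/response shape, shown as Python literals. All field names here are assumptions kept consistent with the service interface in section 3; the actual contract is defined by the worker implementation.

```python
# Illustrative POST /verify-task request (field names assumed, not confirmed).
EXAMPLE_REQUEST = {
    "task_id": "task-123",
    "task_description": "Summarize EU AI Act risk categories",
    "output": "...",
    "success_criteria": [
        {"id": "sections", "check": "required_sections",
         "params": {"sections": ["Summary", "Risk Categories"]}},
        {"id": "accuracy", "description": "Covers all 6 EU AI Act risk categories"},
    ],
    "executor_model": "claude-sonnet",  # lets the worker pick a different judge
}

# Illustrative response mirroring the structured result in 3.1.
EXAMPLE_RESPONSE = {
    "verdict": "partial",
    "aggregate_score": 3.6,
    "confidence": 0.82,
    "criteria": [
        {"criterion_id": "sections", "score": 5.0, "deterministic": True},
        {"criterion_id": "accuracy", "score": 3.0, "deterministic": False,
         "reasoning": "Only 5 of 6 risk categories covered."},
    ],
}
```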
6.2 Worker Implementation
The agent-opt-worker service gets a new route that follows the existing pattern:
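A self-contained handler sketch. The stubs and names (JUDGE_POOL, pick_judge_model, handle_verify_task, the injected llm_judge callable) stand in for the worker's real internals; only the cross-model rule and the verdict thresholds come from this PRD.

```python
JUDGE_POOL = ["gpt-4o-mini", "claude-haiku"]  # cheap judges, per section 5.1


def pick_judge_model(executor_model: str) -> str:
    """Enforce the 5.1 rule: the judge MUST differ from the executor."""
    return [m for m in JUDGE_POOL if m != executor_model][0]


def handle_verify_task(request: dict, llm_judge) -> dict:
    """llm_judge(model, request) -> (scores list, confidence); a stub here,
    an LLM call behind the HTTP proxy in the real worker."""
    judge = pick_judge_model(request["executor_model"])
    scores, confidence = llm_judge(judge, request)
    aggregate = sum(scores) / len(scores)
    verdict = ("pass" if aggregate >= 4.0
               else "fail" if aggregate < 2.5 else "partial")
    return {"verdict": verdict, "aggregate_score": aggregate,
            "confidence": confidence, "judge_model": judge}
```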
7. Coordinator Integration
7.1 How Verification Drives Coordinator Decisions
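The verdict-to-action mapping can be sketched as below. This is an assumption about the policy shape; the authoritative decision logic belongs to PRD-102's coordinator.

```python
def coordinator_action(verdict: str, retries_left: int,
                       escalation_enabled: bool = True) -> str:
    """Map a verification verdict to a coordinator decision (illustrative)."""
    if verdict == "pass":
        return "continue"             # unblock dependent tasks
    if verdict == "partial":
        return "continue_with_flag"   # proceed, but record the quality gap
    if retries_left > 0:
        return "retry_with_feedback"  # section 7.2
    return "escalate" if escalation_enabled else "fail_mission"
```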
7.2 Retry-with-Feedback Protocol
When verification fails but retries remain, the verifier's reasoning is injected into the agent's next prompt:
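A minimal sketch of that injection, assuming the structured result shape from 3.1; the function name and the 4.0 "failed criterion" cutoff are assumptions.

```python
def build_retry_prompt(original_prompt: str, verifier_result: dict) -> str:
    """Append the verifier's reasoning for failed criteria to the retry prompt."""
    failed = [c for c in verifier_result["criteria"] if c["score"] < 4.0]
    feedback = "\n".join(f"- {c['criterion_id']}: {c['reasoning']}" for c in failed)
    return (f"{original_prompt}\n\n"
            "A previous attempt failed verification. Address this feedback:\n"
            f"{feedback}")
```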
8. Verification Timing Strategy
8.1 Inline vs Batch
| Scenario | Timing | Rationale |
| --- | --- | --- |
| Task with dependents (critical path) | Inline — verify immediately | Blocks next task; fast feedback enables quick retry |
| Terminal task (no dependents) | Inline — verify immediately | Still needed for mission completion assessment |
| Cross-task consistency | Batch — after all related tasks complete | Requires multiple outputs to compare |
Decision: All per-task verification is inline. The 2-3 second latency of an LLM verification call is negligible compared to the minutes a task takes to execute. Batch is only for cross-task consistency (optional, post-completion).
8.2 Verification Cost Model
| Check | Cost | When |
| --- | --- | --- |
| Deterministic checks | $0.00 | Always (before LLM) |
| Single LLM judge call | ~$0.003-0.01 per task | When deterministic checks pass |
| Cross-task consistency | ~$0.005 per pair | Optional, after related tasks complete |
For a typical 4-task mission ($2-4 total):
Verification cost: ~$0.012-0.04 (4 judge calls)
As % of mission cost: ~1-2%
With retries (assume 1 retry): ~2-4%
This is well within the 10-30% industry benchmark, primarily because deterministic checks filter out failures that would have required expensive LLM evaluation.
8.3 Verification Bypass Rules
Task cost < $0.05
Skip LLM verification, deterministic only
Cost of verification would exceed task cost
task_type = "simple"
Deterministic only
Simple tasks (formatting, routing) don't need LLM quality assessment
All criteria are deterministic
Skip LLM entirely
No subjective criteria to evaluate
Mission config skip_verification = true
Skip entirely
User explicitly opts out (autonomy mode with high trust)
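Applied in order, the bypass rules above reduce to a small mode-selection function; the function name, mode strings, and the "a criterion is deterministic iff it names a check" convention are assumptions.

```python
def verification_mode(task_cost: float, task_type: str,
                      criteria: list[dict], skip_verification: bool) -> str:
    """Select 'none', 'deterministic_only', or 'full' per the 8.3 bypass rules."""
    if skip_verification:
        return "none"                 # explicit user opt-out
    all_deterministic = all("check" in c for c in criteria)
    if task_cost < 0.05 or task_type == "simple" or all_deterministic:
        return "deterministic_only"   # skip the LLM judge
    return "full"                     # deterministic checks + LLM judge
```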
9. Configurable Thresholds
9.1 Threshold Hierarchy
9.2 Default Configuration
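The defaults did not survive export; the values below are a sketch that stays consistent with the thresholds and bypass rules used elsewhere in this PRD, with key names assumed. The resolver illustrates the 9.1 hierarchy: most specific override wins.

```python
# Illustrative defaults (key names assumed). Resolution order per 9.1:
# per-criterion override > task override > mission override > these globals.
DEFAULT_VERIFICATION_CONFIG = {
    "pass_threshold": 4.0,             # aggregate 1-5 score required for "pass"
    "fail_threshold": 2.5,             # below this -> "fail"; between -> "partial"
    "min_confidence": 0.6,             # low-confidence judgments flag for review
    "max_retries": 2,
    "skip_llm_below_task_cost": 0.05,  # USD, section 8.3
    "judge_models": ["gpt-4o-mini", "claude-haiku"],
}


def resolve_setting(key: str, *overrides: dict):
    """Walk override dicts, most specific first, before the global default."""
    for o in overrides:
        if key in o:
            return o[key]
    return DEFAULT_VERIFICATION_CONFIG[key]
```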
10. Cross-Task Consistency Checking
10.1 When to Check
Cross-task consistency is checked when two or more tasks share a topic or produce related outputs. The coordinator identifies related task pairs based on dependency edges and task descriptions.
10.2 Consistency Verifier Prompt
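The prompt did not survive export; the sketch below is a reconstruction for a pairwise contradiction check, with placeholder names as assumptions.

```python
CONSISTENCY_PROMPT = """\
You are checking two related task outputs for contradictions.

OUTPUT A ({task_a_id}):
{output_a}

OUTPUT B ({task_b_id}):
{output_b}

List every factual claim on which the outputs disagree. Respond with JSON only:
{{"consistent": true/false,
  "conflicts": [{{"claim": "...", "a_says": "...", "b_says": "..."}}]}}
"""
```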
Last updated

