PRD-103 Outline: Verification & Quality

Type: Research + Design
Status: Outline (Loop 0)
Depends On: PRD-100 (Research Master), PRD-101 (Mission Schema — success_criteria JSONB), PRD-102 (Coordinator Architecture — verify step in lifecycle)
Blocks: PRD-106 (Outcome Telemetry — verifier_score feeds telemetry)


Section 1: Problem Statement

Why This PRD Exists

Automatos has no automated output verification. Today:

  • agent_reports with 1-5 star grading (grade SMALLINT) — human-only, manual, after-the-fact

  • recipe_quality_service.py — 5-dimension scoring (completeness, accuracy, efficiency, reliability, cost) → 0-1.0 → A-F grade; recipe-scoped only, not wired to mission tasks

  • quality_assessor.py (Stage 7) — 5-dimension weighted scoring (completeness, coherence, accuracy, professionalism, clarity); per-execution, not against success criteria

  • FutureAGI live traffic scoring — completeness, is_helpful, is_concise via agent-opt-worker; prompt-quality eval, not task-outcome verification

  • RAG quality scorer — avg_similarity, source_diversity, coverage, freshness; RAG-specific, not general-purpose

  • heartbeat_results JSONB — findings[], actions_taken[], tokens_used, cost; captures what happened, not whether it was good

  • BoardTask.result (TEXT) + error_message — free-text, no structured quality signal

The Verification Gap

Missions (PRD-100 Section 3) require a verify step between task execution and human review. The coordinator (PRD-102) needs a signal: "Did this agent's output actually satisfy the success criteria?" Without this:

  1. Human bottleneck — every task output requires manual review with no pre-screening

  2. No quality gate — bad outputs flow to dependent tasks, cascading failures

  3. No learning signal — PRD-106 telemetry has no verifier_score to learn from

  4. Coordinator is blind — cannot decide whether to retry, continue, or escalate

What This PRD Delivers

A VerificationService that:

  1. Takes a task output + its success_criteria JSONB (from PRD-101's mission_tasks)

  2. Evaluates output against each criterion using LLM-as-judge

  3. Produces a structured score (per-criterion + aggregate)

  4. Returns a pass/fail/partial verdict with confidence

  5. Feeds results to the coordinator for decision-making and to telemetry for learning
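The structured score and verdict in steps 3-5 could look like the following minimal sketch. All class and field names here (CriterionResult, VerificationResult, verifier_score) are illustrative assumptions, not the shipped schema:

```python
from dataclasses import dataclass, field

@dataclass
class CriterionResult:
    """Outcome of checking one success criterion (names are illustrative)."""
    criterion_id: str
    score: float          # 0.0-1.0, normalized from the judge's 1-5 rubric
    passed: bool
    confidence: float     # verifier's self-reported confidence, 0.0-1.0
    reasoning: str = ""   # chain-of-thought kept for human review

@dataclass
class VerificationResult:
    """Aggregate verdict returned to the coordinator and to telemetry."""
    task_id: str
    criteria: list[CriterionResult] = field(default_factory=list)

    @property
    def verifier_score(self) -> float:
        """Mean per-criterion score; the value PRD-106 telemetry would store."""
        if not self.criteria:
            return 0.0
        return sum(c.score for c in self.criteria) / len(self.criteria)

    @property
    def verdict(self) -> str:
        """pass / partial / fail, derived from per-criterion outcomes."""
        if not self.criteria:
            return "fail"
        if all(c.passed for c in self.criteria):
            return "pass"
        return "partial" if any(c.passed for c in self.criteria) else "fail"
```

The coordinator would branch on `verdict`, while telemetry stores the numeric `verifier_score`.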


Section 2: Prior Art Research Targets

Systems to Study (each gets dedicated research)

  • LLM-as-Judge (MT-Bench/Arena) — Source: Zheng et al. 2023 (arxiv:2306.05685); Arena-Hard (LMSYS 2024). Focus: pairwise vs absolute scoring, position bias mitigation, 80%+ human agreement on objective tasks, 50-70% on subjective. Key question: should we use rubric-based absolute scoring (scales to single outputs) or pairwise comparison?

  • Constitutional AI Critique — Source: Anthropic 2022 (arxiv:2212.08073). Focus: principle-based self-critique, revision cycles, constitutional principles as evaluation rubric. Key question: should success criteria be expressed as constitutional principles that the verifier checks?

  • OpenAI Evals Framework — Source: openai/evals GitHub. Focus: modular evaluators, deterministic + model-graded composition, different model for grading vs generation. Key question: should we compose deterministic checks (format, length) with LLM checks (quality, accuracy)?

  • DeepEval — Source: confident-ai/deepeval GitHub. Focus: G-Eval (simple custom metrics), DAG (decision-tree evaluation), 14+ pre-built metrics, Pytest-like test framework. Key question: should we adopt DeepEval's DAG pattern for structured evaluation pipelines?

  • RAGAS — Source: explodinggradients/ragas GitHub. Focus: faithfulness (95% human agreement), answer relevancy, context precision/recall, rubric-based criteria scoring. Key question: should we use RAGAS metrics for RAG-heavy mission tasks?

  • FutureAGI (existing) — Source: orchestrator/core/services/futureagi_service.py. Focus: already integrated (assess, optimize, safety, live scoring via agent-opt-worker); extensible via new worker endpoints. Key question: how do we extend FutureAGI's existing /score pattern for mission task verification?

  • Existing quality scoring — Source: recipe_quality_service.py, quality_assessor.py, report_service.py. Focus: 3 scoring systems already built with overlapping dimensions; report grading (1-5 stars) is human-only. Key question: should verification unify these scoring approaches or add a fourth?

Key Patterns Discovered in Research

Rubric-based absolute scoring is the right default (Zheng et al. 2023, Raschka 2024): Pairwise comparison requires O(n²) comparisons and doesn't work for evaluating single outputs. Rubric-based scoring (1-5 Likert per dimension) is more stable (~9% score flip rate vs 35% for pairwise under manipulation) and scales to evaluating individual task outputs. MT-Bench achieved 80%+ agreement with human evaluators on objective tasks using this approach.

Self-preference bias requires cross-model verification (arxiv:2410.21819, 2024): LLMs systematically rate their own outputs higher due to lower perplexity of self-generated text. Mission verification MUST use a different model than the one that executed the task. Using a smaller/cheaper model as judge reduces correlation with the generator AND saves cost.

Deterministic-first, LLM-second (OpenAI Evals pattern): Many success criteria can be checked deterministically: word count, format compliance, required sections present, URL validity, JSON schema conformance. LLM judge should only evaluate what deterministic checks cannot: quality, accuracy, completeness of reasoning. This dramatically reduces verification cost.
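The deterministic tier can be ordinary, zero-LLM-cost code. A minimal sketch, with function names and criteria of our own invention (not an existing module):

```python
import json
import re

def check_word_count(output: str, minimum: int) -> bool:
    """Deterministic: enforce a length floor at zero LLM cost."""
    return len(output.split()) >= minimum

def check_required_sections(output: str, sections: list[str]) -> bool:
    """Deterministic: every required heading appears somewhere in the output."""
    return all(re.search(re.escape(s), output, re.IGNORECASE) for s in sections)

def check_json_parses(output: str) -> bool:
    """Deterministic: output is well-formed JSON (schema validation would follow)."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False
```

Only outputs that clear these free checks would be forwarded to the LLM judge.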

DeepEval's DAG is the right evaluation architecture: Decision-tree evaluation with conditional branching maps directly to success criteria checking: check format → check completeness → check accuracy → check quality. Each node can be deterministic or LLM-based. Failed early checks skip expensive later checks.
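The short-circuit behavior described above can be sketched as an ordered chain of checks, where a failed cheap node skips every downstream (possibly LLM-backed) node. Node names and the tuple shape are illustrative, not DeepEval's API:

```python
from typing import Callable

Check = Callable[[str], bool]

def run_check_chain(output: str, nodes: list[tuple[str, bool, Check]]) -> dict:
    """Run (name, is_expensive, check) nodes in order; stop at first failure.

    Unreached nodes stay None, so expensive LLM nodes are never invoked
    when a cheap deterministic node has already failed.
    """
    results: dict[str, bool | None] = {name: None for name, _, _ in nodes}
    for name, _expensive, check in nodes:
        results[name] = check(output)
        if not results[name]:
            break  # short-circuit: skip all downstream checks
    return results

nodes = [
    ("format", False, lambda o: o.startswith("# Report")),
    ("completeness", False, lambda o: "Findings" in o),
    ("quality", True, lambda o: True),  # placeholder for an LLM-as-judge call
]
```

A real DAG adds conditional branching rather than a single chain, but the cost-saving property is the same: failed early checks mean the expensive nodes never run.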

Verification cost budget: 10-30% of generation cost (industry benchmarks): Single LLM judge run adds ~10-15% of generation cost. Robust evaluation (position-swapped double-check) adds ~20-30%. Ensemble judging (3+ models) adds 50-100%. For missions, single-judge with escalation is the right tradeoff.


Section 3: Verification Taxonomy

Types of Verification (by what they check)

  • Format Compliance — checks that output matches the required structure. Method: deterministic (regex, JSON schema, section headers). Example: "Report must have Executive Summary, Findings, Recommendations sections."

  • Completeness — checks that all required elements are present. Method: hybrid (deterministic count + LLM assessment). Example: "Must cover all 5 EU AI Act risk categories."

  • Factual Accuracy — checks that claims are grounded in sources. Method: LLM-as-judge with source verification. Example: "All cited regulations must exist and be correctly described."

  • Success Criteria Match — checks that output satisfies each stated criterion. Method: LLM-as-judge against a criteria rubric. Example: "Must identify at least 3 compliance gaps with severity ratings."

  • Cross-Agent Consistency — checks that multiple agents' outputs don't contradict. Method: LLM comparison of related task outputs. Example: "Research findings and compliance report must agree on risk levels."

  • Quality Threshold — checks that output meets a minimum quality bar. Method: LLM scoring on dimensions (clarity, depth, professionalism). Example: "Writing quality score ≥ 3.5/5."

Verification Granularity

  • Per-criterion — scope: individual success criterion. When: after task execution. Cost: 1 LLM call per criterion.

  • Per-task — scope: aggregate across all criteria for one task. When: after per-criterion checks. Cost: 0 additional (aggregation).

  • Cross-task — scope: consistency between related task outputs. When: after dependent tasks complete. Cost: 1 LLM call per pair.

  • Per-mission — scope: overall mission quality assessment. When: before human review. Cost: 1 LLM call (summary).


Section 4: Key Design Questions

Q1: Same Model or Different Model for Verification?

Answer (from research): ALWAYS different model.

Self-preference bias is empirically demonstrated (arxiv:2410.21819). Options:

  • Cheaper model judges expensive model — GPT-4o-mini verifies Claude output. Saves cost, reduces correlation.

  • Same-tier different provider — Claude verifies GPT-4o output (or vice versa). Maximizes independence.

  • Configurable per mission — user picks verifier model, defaults to cheaper cross-provider.

Design question for PRD: Should mission_runs have a verifier_model field, or should this be workspace-level config?

Q2: Scoring Rubric Design

Options:

  • Fixed rubric — same 5 dimensions for every task (completeness, accuracy, relevance, clarity, format)

  • Dynamic rubric — generated from success_criteria JSONB per task

  • Hybrid — fixed quality dimensions + per-task criteria dimensions

Recommendation: Hybrid. Fixed quality dimensions provide baseline comparability across missions. Per-task criteria dimensions ensure specific requirements are checked.
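The hybrid assembly could work as follows. This sketch assumes an illustrative shape for PRD-101's success_criteria JSONB (id, description, must_pass); the real schema is PRD-101's to define:

```python
# Fixed baseline dimensions shared by every task (the Q2 hybrid option).
FIXED_DIMENSIONS = ["completeness", "accuracy", "relevance", "clarity", "format"]

def build_rubric(success_criteria: list[dict]) -> list[dict]:
    """Merge fixed quality dimensions with per-task success criteria.

    `success_criteria` is assumed to look like
    [{"id": "c1", "description": "...", "must_pass": true}, ...].
    """
    # Fixed dimensions give cross-mission comparability; none are blockers.
    rubric = [
        {"id": f"fixed:{d}",
         "description": f"Rate the output's {d} (1-5).",
         "must_pass": False}
        for d in FIXED_DIMENSIONS
    ]
    # Per-task criteria carry the mission-specific requirements and flags.
    for c in success_criteria:
        rubric.append({
            "id": f"criterion:{c['id']}",
            "description": c["description"],
            "must_pass": c.get("must_pass", False),
        })
    return rubric
```

Every rubric thus has the same five baseline rows plus a variable tail, so scores stay comparable across missions while specific requirements are still checked.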

Q3: Pass/Fail Threshold

Options:

  • Hard threshold — score ≥ 0.7 passes (configurable per mission)

  • Per-criterion threshold — each criterion must individually pass

  • Weighted aggregate — some criteria are must-pass (blockers), others are nice-to-have

  • Confidence-gated — low confidence scores → human review, high confidence → auto-pass/fail

Design question for PRD: What is the default threshold? Should it be per-workspace, per-mission, or per-task?
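The options above are not mutually exclusive; a sketch combining must-pass blockers, a weighted aggregate, and a confidence gate (the 0.7 and 0.6 defaults are illustrative, not decided):

```python
def decide(scores: list[dict], threshold: float = 0.7,
           min_confidence: float = 0.6) -> str:
    """Combine the Q3 options into one verdict.

    Each entry: {"score": 0-1, "weight": float, "must_pass": bool,
    "confidence": 0-1}. All shapes and defaults are assumptions.
    """
    # Confidence gate: an unsure verifier escalates instead of deciding.
    if any(s["confidence"] < min_confidence for s in scores):
        return "human_review"
    # Blockers: any failed must-pass criterion fails the task outright,
    # regardless of how good the aggregate looks.
    if any(s["must_pass"] and s["score"] < threshold for s in scores):
        return "fail"
    # Weighted aggregate over the remaining (nice-to-have) signal.
    total_weight = sum(s["weight"] for s in scores)
    aggregate = sum(s["score"] * s["weight"] for s in scores) / total_weight
    return "pass" if aggregate >= threshold else "fail"
```

Note the ordering: the confidence gate runs first so that low-confidence failures escalate rather than auto-fail.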

Q4: Human Override Flow

When verification fails:

  1. Auto-retry — coordinator retries with feedback from verifier (continuation, not retry — Symphony pattern)

  2. Escalate to human — flag for review with verifier's reasoning

  3. Accept with warning — proceed but mark as low-confidence

Design question for PRD: How many auto-retries before escalation? Should the coordinator modify the prompt based on verifier feedback?

Q5: Verification Cost Control

Verification adds 10-30% to task cost. For a 5-task mission at $2, that's $0.20-0.60.

Options:

  • Always verify — every task gets verified

  • Smart verification — only verify tasks above cost threshold or with downstream dependencies

  • Sample verification — verify a random subset for learning, verify all for high-stakes missions

  • Tiered — deterministic checks always, LLM verification only for tasks with subjective criteria

Design question for PRD: Should verification be opt-out (default on) or opt-in (default off)?

Q6: Batch vs Inline Verification

  • Inline — verify immediately after each task completes. Blocks next task if dependency. Enables fast retry.

  • Batch — verify all tasks after mission completes. Cheaper (can batch LLM calls). Delays feedback.

  • Hybrid — verify critical-path tasks inline, verify leaf tasks in batch.

Recommendation: Inline for tasks with dependents, batch for terminal tasks.
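The recommendation reduces to a one-line decision over the task dependency graph. The `dependents` mapping is an assumed structure derived from PRD-101's task dependencies:

```python
def verification_mode(task_id: str, dependents: dict[str, list[str]]) -> str:
    """Inline-verify tasks other tasks consume; batch-verify terminal tasks.

    `dependents` maps task_id -> list of task_ids that depend on its
    output (an illustrative structure, not an existing API).
    """
    return "inline" if dependents.get(task_id) else "batch"

# "research" feeds "report", so it must be verified before "report" starts;
# "report" is terminal, so its verification can wait for the batch pass.
dependents = {"research": ["report"], "report": []}
```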


Section 5: FutureAGI Integration

What Already Exists

FutureAGI is integrated via futureagi_service.py → agent-opt-worker HTTP proxy:

  • Prompt assessment (POST /assess) — ✅ Production. Reusable for verification? Partially — assesses prompt quality, not task output quality.

  • Live traffic scoring (POST /score) — ✅ Production. Reusable? Yes — same pattern: input + output + metrics → scores.

  • Prompt optimization (POST /optimize) — ✅ Production. Reusable? No — optimizes prompts, doesn't evaluate outputs.

  • Safety check (POST /safety) — ✅ Production. Reusable? Yes — safety is one verification dimension.

Extension Strategy

New worker endpoint: POST /verify-task
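A hypothetical request/response shape for the new endpoint, mirroring the existing /score pattern (input + output + metrics → scores). Every field name here is an assumption to be settled in the full PRD:

```python
# Hypothetical POST /verify-task request body (field names are assumptions).
request = {
    "task_id": "task-123",
    "task_output": "(the agent's output text)",
    "success_criteria": [            # PRD-101 mission_tasks.success_criteria
        {"id": "c1",
         "description": "Identify at least 3 compliance gaps",
         "must_pass": True},
    ],
    "verifier_model": "gpt-4o-mini", # must differ from the executing model
}

# Hypothetical response the coordinator and PRD-106 telemetry would consume.
response = {
    "verdict": "pass",               # pass | partial | fail
    "verifier_score": 0.85,          # aggregate, 0-1
    "criteria": [
        {"id": "c1", "score": 0.85, "passed": True, "confidence": 0.9,
         "reasoning": "Found 4 gaps, each with a severity rating."},
    ],
}
```

Keeping the response per-criterion (not just an aggregate) is what lets the coordinator give targeted retry feedback and lets telemetry learn per-criterion reliability.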

Key design: the verifier prompt

The verifier prompt is the most critical design artifact. It must:

  1. Present success criteria as a rubric (not a checklist — rubrics produce better LLM judgments)

  2. Instruct the verifier to reason before scoring (chain-of-thought improves accuracy)

  3. Explicitly instruct against position bias and verbosity bias

  4. Request confidence alongside scores

  5. Use a structured output format (JSON) for reliable parsing
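A first-draft template embodying all five requirements might look like this. The wording is a starting-point sketch, not a tuned prompt:

```python
# Illustrative verifier prompt: rubric framing (1), reason-before-score (2),
# anti-bias instructions (3), confidence (4), JSON-only output (5).
VERIFIER_PROMPT = """You are verifying a task output against success criteria.
You did NOT produce this output. Judge it impartially.

Success criteria (a rubric; rate each 1-5, do not treat it as a checklist):
{rubric}

Task output:
{output}

Instructions:
1. For each criterion, reason step by step BEFORE assigning a score.
2. Ignore the output's length and its position in this prompt; verbosity
   is not quality.
3. Report a confidence (0.0-1.0) for each score.
4. Respond with JSON only: {{"criteria": [{{"id": "...", "reasoning": "...",
   "score": 1-5, "confidence": 0.0-1.0}}]}}
"""

prompt = VERIFIER_PROMPT.format(
    rubric="- c1: identifies at least 3 compliance gaps",
    output="(task output here)",
)
```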

Data Flow


Section 6: Acceptance Criteria for Full PRD

Must Have

Should Have

Nice to Have


Section 7: Risks & Dependencies

Risks

  1. LLM-as-judge unreliability on subjective tasks (impact: medium; likelihood: high). Mitigation: use rubric-based scoring (9% flip rate vs 35% pairwise); default to human review for low-confidence scores; track human override rate to measure reliability.

  2. Verification cost exceeding task cost (impact: medium; likelihood: medium). Mitigation: deterministic checks first (zero LLM cost); skip LLM verification for cheap tasks; single-judge default, ensemble only for high-stakes; budget: verification ≤ 30% of task cost.

  3. Self-preference bias corrupting scores (impact: high; likelihood: high). Mitigation: ENFORCE cross-model verification; never use the same model for execution and verification; default to a cheaper model from a different provider.

  4. False positives — bad output passes (impact: high; likelihood: medium). Mitigation: check multiple criteria independently; any failed "must-pass" criterion fails the task regardless of aggregate; human review for missions above a cost threshold.

  5. False negatives — good output fails (impact: medium; likelihood: medium). Mitigation: confidence scores + human escalation path; track false-negative rate via human override data; adjust thresholds based on telemetry (PRD-106).

  6. Verifier prompt engineering is hard (impact: high; likelihood: high). Mitigation: start with a simple rubric template; iterate based on human override data; Constitutional AI principles as fallback; leverage FutureAGI's existing eval expertise.

  7. Three existing quality scoring systems create confusion (impact: medium; likelihood: medium). Mitigation: verification is the new canonical quality signal for missions. Do NOT unify the existing systems — they serve different scopes: recipe_quality_service = recipe-scoped rolling average (stays as-is); quality_assessor = per-execution pipeline stage (stays as-is); report grading = human 1-5 stars on agent reports (stays as-is). Verification (PRD-103) is new and separate: it evaluates mission task outputs against explicit success_criteria. Merging these would break existing consumers and conflate different evaluation contexts.

Dependencies

  • success_criteria JSONB schema (PRD-101) — the verification input; must be structured enough for rubric generation.

  • Coordinator verify step (PRD-102) — where in the lifecycle verification is called, and what the coordinator does with results.

  • verifier_score telemetry field (PRD-106) — verification scores must be stored in a format telemetry can aggregate.

  • agent-opt-worker extensibility (existing infra) — new /verify-task endpoint on the worker service.

  • Cross-model API access (existing infra, OpenRouter) — must be able to call a different model for verification than was used for execution.

Cross-PRD Connections

  • PRD-104 (Ephemeral Agents): Contractor agents have no memory/personality — verification is even more important as quality signal since there's no agent reputation to rely on

  • PRD-105 (Budget): Verification cost must be included in mission budget estimation. Budget enforcement must account for verification retries.

  • PRD-107 (Context Interface): Verifier may need access to shared context to assess cross-agent consistency. Context interface must support read-only verification access.
