PRD-103 — Verification & Quality

Version: 1.0
Type: Research + Design
Status: Complete — Ready for Peer Review
Priority: P0
Dependencies: PRD-100 (Research Master), PRD-101 (Mission Schema — success_criteria JSONB), PRD-102 (Coordinator Architecture — verify step in lifecycle)
Blocks: PRD-106 (Outcome Telemetry — verifier_score feeds telemetry)
Author: Gerard Kavanagh + Claude
Date: 2026-03-15


1. Problem Statement

1.1 The Gap

Automatos has no automated output verification. The platform can execute tasks, but it cannot answer: "Did this agent's output actually satisfy the success criteria?" Without verification, the coordinator (PRD-102) is blind — it cannot decide whether to retry, continue, or escalate.

1.2 Existing Quality Systems

| System | Location | Scope | Limitation |
| --- | --- | --- | --- |
| recipe_quality_service.py — 5-dimension scoring (completeness, accuracy, efficiency, reliability, cost) → A-F grade | orchestrator/core/services/ | Recipe executions only | Not wired to mission tasks, no success criteria matching |
| quality_assessor.py (Stage 7) — 5-dimension weighted scoring (completeness, coherence, accuracy, professionalism, clarity) | orchestrator/modules/orchestrator/stages/ | Per-execution quality | Assesses general quality, not against specific success criteria |
| FutureAGI live traffic scoring — completeness, is_helpful, is_concise via agent-opt-worker | orchestrator/core/services/futureagi_service.py | Prompt quality eval | Evaluates prompts, not task outcomes |
| RAG quality scorer — avg_similarity, source_diversity, coverage, freshness | orchestrator/modules/search/ | RAG-specific metrics | Only for retrieval-augmented tasks |
| agent_reports with 1-5 star grading (grade SMALLINT) | orchestrator/core/models/core.py | Human manual grading | After-the-fact, not automated, not mission-integrated |
| BoardTask.result (TEXT) + error_message | orchestrator/core/models/board.py | Free-text result | No structured quality signal |

Decision: Verification is a new, separate system. It does NOT unify existing scoring. Existing systems serve different scopes: recipe_quality_service = recipe rolling averages (stays), quality_assessor = pipeline stage quality (stays), report grading = human stars on reports (stays). Verification evaluates mission task outputs against explicit success_criteria. Merging these would break existing consumers and conflate different evaluation contexts.

1.3 What This PRD Delivers

A VerificationService that:

  1. Takes a task output + its success_criteria JSONB (from PRD-101's orchestration_tasks)

  2. Runs deterministic checks first (format, length, required sections) — zero LLM cost

  3. Evaluates remaining criteria using LLM-as-judge with a different model than the executor

  4. Produces a structured score (per-criterion + aggregate + confidence)

  5. Returns a pass/fail/partial verdict to the coordinator for decision-making

  6. Feeds results to telemetry (PRD-106) for learning


2. Prior Art: Verification Patterns

2.1 Overview

Six systems and patterns were studied to inform verification design. The core challenge: how do you reliably assess whether an LLM's output satisfies requirements, without hallucinating quality?

2.2 System-by-System Analysis

LLM-as-Judge (Zheng et al. 2023, MT-Bench/Arena; Raschka 2024)

MT-Bench achieved 80%+ agreement with human evaluators on objective tasks using rubric-based absolute scoring. The key findings:

  • Rubric-based absolute scoring (1-5 Likert per dimension) is more stable than pairwise comparison: ~9% score flip rate vs 35% for pairwise under prompt manipulation (Raschka 2024).

  • Position bias exists: LLMs prefer the first option in pairwise comparisons. Absolute scoring eliminates this.

  • Verbosity bias: LLMs rate longer outputs higher. Counter with explicit rubric instructions.

  • Self-preference bias (arxiv:2410.21819, 2024): LLMs rate their own outputs higher due to lower perplexity of self-generated text.

What we adopt: Rubric-based absolute scoring (not pairwise). Each success criterion becomes a rubric item with a 1-5 scale. This scales to single outputs (no reference output needed) and produces stable scores.

What we reject: Pairwise comparison (requires O(n^2) comparisons, doesn't work for single-output evaluation).
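The rubric framing can be sketched as follows. This is an illustrative sketch only: the field names are assumptions, not the PRD-101 success_criteria schema.

```python
# Illustrative: one success_criteria entry framed as a 1-5 Likert rubric item.
criterion = {
    "id": "covers_risk_categories",
    "description": "The output MUST cover all 6 EU AI Act risk categories",
    "kind": "llm",           # deterministic criteria never reach the judge
    "weight": 1.0,
}

def to_rubric_item(criterion: dict) -> dict:
    """Turn a criterion into a rubric item for the judge prompt."""
    return {
        "criterion_id": criterion["id"],
        "instruction": (
            f"Rate 1-5 how well the output satisfies: {criterion['description']}. "
            "5 = fully satisfied, 1 = not addressed. Do not reward length."
        ),
        "weight": criterion.get("weight", 1.0),
    }
```

The "Do not reward length" instruction counters the verbosity bias noted above.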

Constitutional AI Critique (Anthropic 2022, arxiv:2212.08073)

Constitutional AI evaluates outputs against a set of principles (the "constitution"). The critic identifies violations and suggests revisions. The key insight: principles as evaluation rubric — each success criterion can be framed as a constitutional principle the output must satisfy.

What we adopt: The principle-based evaluation framing. Success criteria are expressed as principles: "The output MUST cover all 6 EU AI Act risk categories" becomes a constitutional check.

What we reject: The revision cycle (Constitutional AI revises the output; we only evaluate, the agent retries if needed).

OpenAI Evals Framework (openai/evals)

OpenAI Evals composes evaluators: deterministic checks (exact match, regex, JSON schema) + model-graded checks. The key pattern: deterministic-first, LLM-second. Many criteria can be checked without LLM: word count, format compliance, required sections present, URL validity.

What we adopt: Deterministic-first pipeline. Check format, length, schema, required sections BEFORE calling the LLM judge. This dramatically reduces verification cost (deterministic checks are free) and catches obvious failures immediately.

What we reject: The full Evals framework infrastructure (we integrate with FutureAGI instead).

DeepEval (confident-ai/deepeval)

DeepEval's DAG evaluation pattern uses decision-tree evaluation with conditional branching: check format → check completeness → check accuracy → check quality. Each node can be deterministic or LLM-based. Failed early checks skip expensive later checks.

What we adopt: DAG evaluation pipeline. If format check fails, skip LLM quality assessment (no point evaluating quality of a malformed output). This is the "short-circuit" pattern.

What we reject: DeepEval's Pytest-like test framework (over-engineered for our in-process verification).

RAGAS (explodinggradients/ragas)

RAGAS provides specialized metrics for RAG tasks: faithfulness (95% human agreement), answer relevancy, context precision/recall. The key insight: task-type-specific verification dimensions.

What we adopt: For RAG-heavy mission tasks (task_type = "research"), add faithfulness and source grounding as verification dimensions alongside generic quality dimensions.

What we reject: Using RAGAS as the verification framework (too specialized; we need general-purpose verification).

FutureAGI (Existing — futureagi_service.py)

FutureAGI is already integrated via the agent-opt-worker HTTP proxy:

| Capability | Endpoint | Status | Reusable? |
| --- | --- | --- | --- |
| Prompt assessment | POST /assess | Production | Partially — prompt quality, not task output |
| Live traffic scoring | POST /score | Production | Yes — same pattern: input + output + metrics → scores |
| Prompt optimization | POST /optimize | Production | No — optimization, not evaluation |
| Safety check | POST /safety | Production | Yes — safety is one verification dimension |
What we adopt: Extend the agent-opt-worker with a new POST /verify-task endpoint. Same infrastructure, same HTTP proxy pattern, new verification logic.

2.3 Architectural Decisions Summary

| Decision | Choice | Source | Rationale |
| --- | --- | --- | --- |
| Scoring method | Rubric-based absolute scoring (1-5 Likert) | Zheng et al. 2023, Raschka 2024 | 80%+ human agreement; 9% flip rate (vs 35% pairwise); scales to single outputs |
| Judge model | Always different model from executor | arxiv:2410.21819 | Self-preference bias empirically demonstrated; cross-model eliminates correlation |
| Evaluation pipeline | Deterministic → LLM (DAG with short-circuit) | OpenAI Evals, DeepEval | Free deterministic checks first; skip expensive LLM if format fails |
| Criteria framing | Constitutional principles from success_criteria | Anthropic 2022 | Structured, evaluable, maps directly to task definition |
| Infrastructure | FutureAGI worker extension (POST /verify-task) | Existing integration | Zero new infrastructure; same proxy pattern |
| Cost target | Verification ≤ 15% of task generation cost | Industry benchmarks | Single judge (not ensemble) with deterministic pre-filtering |


3. VerificationService Interface

3.1 Core Interface
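A minimal sketch of the interface implied by the deliverables in Section 1.3. All names here (VerificationResult, CriterionScore, verify_task) are assumptions, not the final API.

```python
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class CriterionScore:
    criterion_id: str
    score: float            # 1-5 Likert (LLM) or 0/1 (deterministic)
    method: Literal["deterministic", "llm"]
    reasoning: str = ""

@dataclass
class VerificationResult:
    verdict: Literal["pass", "fail", "partial"]
    aggregate_score: float          # weighted mean, normalized to 0-1
    confidence: float               # judge's self-reported confidence
    criterion_scores: list[CriterionScore] = field(default_factory=list)

class VerificationService:
    async def verify_task(
        self,
        task_output: str,
        success_criteria: list[dict],   # JSONB from orchestration_tasks
        executor_model: str,            # used to pick a *different* judge
    ) -> VerificationResult:
        ...
```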

3.2 Verification Pipeline
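A minimal sketch of the deterministic-first, short-circuit flow adopted from OpenAI Evals and DeepEval (Section 2.2). The criterion shape and the score-4 pass cutoff are illustrative assumptions.

```python
def run_pipeline(output: str, criteria: list[dict], judge) -> dict:
    """Free deterministic checks first; call the LLM judge only if they pass."""
    deterministic = [c for c in criteria if c["kind"] == "deterministic"]
    subjective = [c for c in criteria if c["kind"] == "llm"]

    # Short-circuit: a malformed output never reaches the judge.
    for c in deterministic:
        if not c["check"](output):
            return {"verdict": "fail", "failed": c["id"], "llm_called": False}

    # LLM judge only for the remaining subjective criteria.
    scores = judge(output, subjective) if subjective else {}
    if not scores:
        return {"verdict": "pass", "scores": {}, "llm_called": False}
    if all(s >= 4 for s in scores.values()):
        verdict = "pass"
    elif any(s >= 4 for s in scores.values()):
        verdict = "partial"
    else:
        verdict = "fail"
    return {"verdict": verdict, "scores": scores, "llm_called": True}
```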


4. Deterministic Check Registry

4.1 Check Types
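The deterministic criteria named in Section 2.2 (word count, format compliance, required sections, URL validity, exact match / regex, JSON schema) suggest a small registry. The enum and decorator below are an illustrative sketch, not the shipped module.

```python
from enum import Enum

class CheckType(str, Enum):
    WORD_COUNT = "word_count"
    FORMAT = "format"                    # e.g. markdown / JSON well-formedness
    JSON_SCHEMA = "json_schema"
    REQUIRED_SECTIONS = "required_sections"
    URL_VALIDITY = "url_validity"
    REGEX = "regex"                      # exact-match / pattern checks

CHECK_REGISTRY: dict[CheckType, callable] = {}

def register(check_type: CheckType):
    """Decorator that registers a deterministic check implementation."""
    def wrap(fn):
        CHECK_REGISTRY[check_type] = fn
        return fn
    return wrap
```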

4.2 Example: Required Sections Check
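One possible implementation of the required-sections check. Matching markdown-style headings is an assumption about the output format.

```python
import re

def check_required_sections(output: str, required: list[str]) -> dict:
    """Pass only if every required section heading appears in the output."""
    missing = [
        section for section in required
        if not re.search(
            rf"^#{{1,6}}\s*{re.escape(section)}\b",
            output,
            re.IGNORECASE | re.MULTILINE,
        )
    ]
    return {"passed": not missing, "missing_sections": missing}
```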


5. LLM Verification Protocol

5.1 Model Selection

Rule: verifier model MUST differ from executor model.

Self-preference bias (arxiv:2410.21819) is empirically demonstrated — LLMs systematically rate their own outputs higher due to lower perplexity of self-generated text. Cross-model verification eliminates this correlation.

Cost rationale: GPT-4o-mini and Claude Haiku are ~10-20x cheaper than their full counterparts. Using them as judges keeps verification cost at ~10-15% of task generation cost. MT-Bench showed cheaper models are adequate judges for rubric-based evaluation.
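The cross-model rule could be enforced with a simple pairing table. The specific model names below are assumptions; the invariant is that the judge always differs from the executor.

```python
# Illustrative executor-to-judge pairing; cheap models serve as judges.
JUDGE_FOR_EXECUTOR = {
    "claude-sonnet": "gpt-4o-mini",
    "claude-haiku": "gpt-4o-mini",
    "gpt-4o": "claude-haiku",
    "gpt-4o-mini": "claude-haiku",
}

def pick_judge(executor_model: str) -> str:
    judge = JUDGE_FOR_EXECUTOR.get(executor_model, "gpt-4o-mini")
    if judge == executor_model:      # unmapped executor happened to match
        judge = "claude-haiku"
    return judge
```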

5.2 Verifier Prompt Template

"""

5.4 Verdict Computation
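A plausible sketch of the computation, assuming 1-5 Likert per-criterion scores, optional weights, and illustrative thresholds (Section 9 makes the thresholds configurable).

```python
def compute_verdict(scores: dict[str, float], weights: dict[str, float],
                    pass_threshold: float = 0.8,
                    partial_threshold: float = 0.6) -> tuple[str, float]:
    """Weighted mean of 1-5 scores, normalized to 0-1, mapped to a verdict."""
    total_w = sum(weights.get(c, 1.0) for c in scores)
    aggregate = sum(s * weights.get(c, 1.0) for c, s in scores.items()) / total_w
    normalized = (aggregate - 1) / 4          # 1-5 Likert -> 0-1
    if normalized >= pass_threshold:
        return "pass", normalized
    if normalized >= partial_threshold:
        return "partial", normalized
    return "fail", normalized
```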


6. FutureAGI Worker Extension

6.1 New Endpoint: POST /verify-task
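Illustrative request and response payloads for the new endpoint; the field names are assumptions modeled on the existing /assess and /score proxy payloads.

```python
import json

verify_request = {
    "task_description": "Summarize the EU AI Act risk categories",
    "task_output": "(agent output here)",
    "success_criteria": [
        {"id": "covers_risk_categories", "kind": "llm", "weight": 1.0},
        {"id": "has_summary_section", "kind": "deterministic"},
    ],
    "executor_model": "claude-sonnet",
}

verify_response = {
    "verdict": "partial",                 # "pass" | "fail" | "partial"
    "aggregate_score": 0.72,              # weighted mean, normalized 0-1
    "confidence": 0.9,                    # judge's self-reported confidence
    "criterion_scores": {                 # per-criterion, normalized 0-1
        "covers_risk_categories": 0.75,
        "has_summary_section": 1.0,
    },
    "judge_model": "gpt-4o-mini",         # always != executor_model
}

# Both payloads are plain JSON and round-trip cleanly.
assert json.loads(json.dumps(verify_request)) == verify_request
```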

6.2 Worker Implementation

The agent-opt-worker service gets a new route that follows the existing pattern:
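A minimal sketch of that route, assuming the worker's routes are async callables that take and return JSON dicts (the real framework wiring is omitted). Field names and the judge fallback are assumptions.

```python
import asyncio

async def verify_task_route(payload: dict) -> dict:
    required = {"task_output", "success_criteria", "executor_model"}
    missing = required - payload.keys()
    if missing:
        return {"status": 400, "error": f"missing fields: {sorted(missing)}"}

    criteria = payload["success_criteria"]
    # Deterministic-first: any failed free check short-circuits the judge.
    # (The "passed" flag stands in for real check-registry dispatch.)
    for c in (c for c in criteria if c.get("kind") == "deterministic"):
        if not c.get("passed", True):
            return {"status": 200, "verdict": "fail",
                    "failed": c["id"], "llm_called": False}

    subjective = [c for c in criteria if c.get("kind") == "llm"]
    # Judge model must differ from the executor (self-preference bias).
    executor = payload["executor_model"]
    judge = "gpt-4o-mini" if executor != "gpt-4o-mini" else "claude-haiku"
    # ... call the judge on `subjective` here, then compute the verdict ...
    return {"status": 200, "verdict": "pass",
            "judge_model": judge, "llm_called": bool(subjective)}

result = asyncio.run(verify_task_route({
    "task_output": "(output)", "success_criteria": [],
    "executor_model": "claude-sonnet",
}))
```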


7. Coordinator Integration

7.1 How Verification Drives Coordinator Decisions
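The coordinator's retry / continue / escalate choice (PRD-102) could map from the verdict as sketched below; the handling of "partial" with no retries left is an assumption about policy, not a settled decision.

```python
def coordinator_action(verdict: str, retries_left: int) -> str:
    """Illustrative verdict-to-action mapping for the coordinator."""
    if verdict == "pass":
        return "continue"
    if retries_left > 0:
        return "retry_with_feedback"     # see the protocol in 7.2
    # No retries left: a partial result may still be usable; a fail escalates.
    return "continue_with_warning" if verdict == "partial" else "escalate"
```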

7.2 Retry-with-Feedback Protocol

When verification fails but retries remain, the verifier's reasoning is injected into the agent's next prompt:
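A sketch of that injection; the template wording is illustrative.

```python
RETRY_FEEDBACK = """Your previous attempt did not pass verification.

FAILED CRITERIA:
{failed_criteria}

VERIFIER FEEDBACK:
{verifier_reasoning}

Revise your output to address every point above. The original task follows.
"""

def build_retry_prompt(original_prompt: str, failed: list[str],
                       reasoning: str) -> str:
    """Prepend the verifier's findings to the agent's original prompt."""
    feedback = RETRY_FEEDBACK.format(
        failed_criteria="\n".join(f"- {c}" for c in failed),
        verifier_reasoning=reasoning,
    )
    return feedback + "\n" + original_prompt
```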


8. Verification Timing Strategy

8.1 Inline vs Batch

| Task Position | Strategy | Rationale |
| --- | --- | --- |
| Task with dependents (critical path) | Inline — verify immediately | Blocks next task; fast feedback enables quick retry |
| Terminal task (no dependents) | Inline — verify immediately | Still needed for mission completion assessment |
| Cross-task consistency | Batch — after all related tasks complete | Requires multiple outputs to compare |

Decision: All per-task verification is inline. The 2-3 second latency of an LLM verification call is negligible compared to the minutes a task takes to execute. Batch is only for cross-task consistency (optional, post-completion).

8.2 Verification Cost Model

| Component | Cost | When |
| --- | --- | --- |
| Deterministic checks | $0.00 | Always (before LLM) |
| Single LLM judge call | ~$0.003-0.01 per task | When deterministic checks pass |
| Cross-task consistency | ~$0.005 per pair | Optional, after related tasks complete |

For a typical 4-task mission ($2-4 total):

  • Verification cost: ~$0.012-0.04 (4 judge calls)

  • As % of mission cost: ~1-2%

  • With retries (assume 1 retry): ~2-4%

This is well within the 10-30% industry benchmark, primarily because deterministic checks filter out failures that would have required expensive LLM evaluation.

8.3 Verification Bypass Rules

| Condition | Action | Rationale |
| --- | --- | --- |
| Task cost < $0.05 | Skip LLM verification, deterministic only | Cost of verification would exceed task cost |
| task_type = "simple" | Deterministic only | Simple tasks (formatting, routing) don't need LLM quality assessment |
| All criteria are deterministic | Skip LLM entirely | No subjective criteria to evaluate |
| Mission config skip_verification = true | Skip entirely | User explicitly opts out (autonomy mode with high trust) |


9. Configurable Thresholds

9.1 Threshold Hierarchy

9.2 Default Configuration
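An illustrative default set, assuming the hierarchy in 9.1 means platform-wide defaults that a mission can override. The names and numbers are assumptions, not shipped defaults.

```python
DEFAULT_VERIFICATION_CONFIG = {
    "pass_threshold": 0.8,              # normalized aggregate >= this -> pass
    "partial_threshold": 0.6,           # in [0.6, 0.8) -> partial
    "max_retries": 1,
    "judge_temperature": 0.0,           # deterministic judging
    "skip_llm_below_task_cost": 0.05,   # bypass rule from 8.3
    "skip_verification": False,         # mission-level opt-out (8.3)
}

def effective_config(mission_overrides=None) -> dict:
    """Mission-level overrides win over platform defaults."""
    return {**DEFAULT_VERIFICATION_CONFIG, **(mission_overrides or {})}
```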


10. Cross-Task Consistency Checking

10.1 When to Check

Cross-task consistency is checked when two or more tasks share a topic or produce related outputs. The coordinator identifies related task pairs based on dependency edges and task descriptions.

10.2 Consistency Verifier Prompt

"""
