PRD-82B — Mission Intelligence Layer
Version: 1.0
Type: Implementation
Status: Draft
Priority: P1
Depends On: PRD-82A (Sequential Mission Coordinator)
Research Base: PRDs 102 (Coordinator), 103 (Verification), 104 (Model Selection), 105 (Budget Governance), 106 (Telemetry)
Author: Gerard Kavanagh + Claude
Date: 2026-03-16
1. Goal
Make missions smarter. 82A proved the lifecycle works (plan → assign → execute → verify → review). 82B adds the intelligence: learn from past missions, decompose with templates, replan on failure, query telemetry, and convert successful missions into reusable routines. Still sequential. Still roster agents only. The goal is quality and reliability, not scale.
2. 82A Production Learnings (What This PRD Fixes)
These are real bugs and gaps hit running missions in production on 2026-03-16:
| Problem | Root Cause | 82A Hotfix | 82B Feature |
|---|---|---|---|
| Wrong agents assigned (ATLAS to research, QUILL to analysis) | Skill matching was too loose; tool_coverage gave a 1.0 freebie when no tools were required | Rebalanced weights, token-level synonym matching | History scoring: agents that verified well on similar tasks score higher |
| SCRIBE (gpt-4 8k) failed task 5 with upstream context overflow | No model context awareness for task position | Added _LARGE_CONTEXT_MODELS set | Model selection per task position: later tasks auto-prefer 128k+ models |
| Verification death loop (3 retries, all fail) | must_pass=true default too strict; LLM verification has ~15% false-fail rate | Changed default to must_pass=false | Adaptive thresholds: adjust based on historical override rate per task type |
| Mission fails → dead end | No replanning after max retries exhausted | N/A — mission enters failed | Replanning: redecompose failed subtree with learned context |
| LLM decomposition quality varies wildly | No templates, no structural guidance for common patterns | N/A — LLM-only | Template library: pre-built decomposition templates for research, analysis, content |
| No visibility into mission performance | Events captured but no query layer | N/A — write-only | Telemetry API + dashboard data: query events, agent scores, cost tracking |
| Good mission patterns lost | No way to reuse a successful mission structure | N/A | Save as routine: convert completed mission → reusable recipe |
3. What Ships
| Feature | Research Base | Description |
|---|---|---|
| History-based agent scoring | 102 §6.2 | Wire AgentMatcher._score_history() using accumulated verifier_score data |
| Decomposition template library | 102 §5.3 | Template dataclasses + keyword/embedding lookup + fallback to LLM |
| Replanning on failure | 102 §5.2 | MissionPlanner.replan() — redecompose failed subtree with failure context |
| Cross-task consistency verification | 103 §3.1 | Post-all-verified pass: check output coherence across tasks |
| Telemetry query API | 106 | Events by run/agent/type/time, agent score aggregates, cost breakdowns |
| Verification result caching | 103 | Don't re-verify unchanged output on retry (hash-based) |
| Adaptive verification thresholds | 103 | Track human override rate → auto-adjust quality_threshold |
| Save-as-routine conversion | 82A deferred | Completed mission → recipe with task template |
| orchestration_archive table | 101 | Move terminal runs older than 30d to archive |
4. What Does NOT Ship (Deferred)
| Feature | Target | Reason |
|---|---|---|
| Parallel dispatch (max_concurrent > 1) | 82C | Sequential works; parallel adds race conditions |
| Contractor/ephemeral agents | 82C | Needs AgentFactory changes |
| Budget enforcement gates | 82C | Sequential is cheap |
| Multi-judge verification ensemble | 82C | Single cross-model judge works for v1 |
| Complexity detection ("this should be a mission") | 82D | Needs telemetry data first |
| Prompt coaching | 82D | Needs outcome data first |
| Mission planning conversation in chat | 82C | Modal creation works; conversational flow needs prompt pipeline changes |
5. History-Based Agent Scoring
5.1 Problem
82A's AgentMatcher has a placeholder history_score = 0.5 for all agents. The _score_history() method exists in PRD-102 spec but was deferred. Without history, the same wrong agent gets picked repeatedly.
5.2 Design
Wire _score_history() to query OrchestrationTask for completed tasks by the same agent with similar task_type or agent_role. Use the average verifier_score as the history signal.
5.3 Integration
In AgentMatcher._score_agent(), replace history_score = 0.5 with the DB query. Batch the query for all candidate agents to avoid N+1.
5.4 Cold Start
New agents or agents with < 3 verified tasks get 0.5 (neutral). This means established agents with good track records naturally float to the top. New agents still get selected via skill/tool/model scores.
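Under these rules, the scoring step can be sketched as a single batched aggregation — one query's worth of rows scored in one pass, avoiding the N+1 pattern §5.3 warns about. This is a sketch, not the shipped `_score_history()`: the function and field names (`agent_id`, `verifier_score`) are assumptions about the OrchestrationTask shape.

```python
from collections import defaultdict

NEUTRAL_SCORE = 0.5
MIN_DATAPOINTS = 3  # mirrors COORDINATOR_HISTORY_MIN_DATAPOINTS (§14)

def score_history(completed_tasks, candidate_agent_ids):
    """Batch history scoring. `completed_tasks` is assumed to be the result
    of one OrchestrationTask query scoped to the lookback window and the
    candidate agents; no per-agent queries are issued here."""
    by_agent = defaultdict(list)
    for task in completed_tasks:
        if task.verifier_score is not None:
            by_agent[task.agent_id].append(task.verifier_score)

    scores = {}
    for agent_id in candidate_agent_ids:
        datapoints = by_agent.get(agent_id, [])
        if len(datapoints) < MIN_DATAPOINTS:
            scores[agent_id] = NEUTRAL_SCORE  # cold start: neutral
        else:
            scores[agent_id] = sum(datapoints) / len(datapoints)
    return scores
```

New agents never score below neutral here, so they remain selectable via the skill/tool/model components, exactly as §5.4 intends.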
5.5 Files Changed
| File | Change |
|---|---|
| modules/coordination/agent_matcher.py | Wire _score_history(), add batch query |
| modules/coordination/agent_matcher.py | Accept db parameter in _score_agent() |
6. Decomposition Template Library
6.1 Problem
LLM-only decomposition quality varies. "Research top AI frameworks" sometimes produces 3 tasks, sometimes 7. Task descriptions are inconsistent. Agent roles are made up ("data collector" doesn't match any roster agent).
6.2 Template Dataclass
From PRD-102 §5.3:
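The dataclass itself was not reproduced in this draft. A minimal sketch of the shape it implies (field names are assumptions, not the PRD-102 spec):

```python
from dataclasses import dataclass, field

@dataclass
class TaskTemplate:
    """One task slot in a decomposition template (illustrative fields)."""
    title: str                  # e.g. "Deep Research"
    agent_role: str             # must match a roster agent role
    depends_on: list[int] = field(default_factory=list)  # indices into tasks

@dataclass
class DecompositionTemplate:
    name: str           # e.g. "research_and_report"
    keywords: list[str] # goal keywords that trigger this template
    tasks: list[TaskTemplate]
```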
6.3 Built-in Templates
| Template | Tasks | Structure | Trigger |
|---|---|---|---|
| research_and_report | 5 | Search → Deep Research → Analyse → Synthesise → Report | Goal contains "research", "compare", "evaluate" |
| content_pipeline | 4 | Research → Outline → Draft → Edit | Goal contains "write", "blog", "article", "content" |
| competitive_analysis | 4 | Identify Players → Research Each → Compare → Recommendation | Goal contains "competitive", "market analysis", "compare companies" |
| data_investigation | 3 | Gather Data → Analyse → Report Findings | Goal contains "investigate", "audit", "diagnose", "track" |
6.4 Template Matching
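The keyword half of the lookup (§3 lists "keyword/embedding lookup") can be sketched as a most-hits-wins scan over the registry; the embedding path is omitted here, and the function name is an assumption:

```python
def match_template(goal, templates):
    """Keyword-based template lookup: return the template whose trigger
    keywords appear most often in the goal, or None to fall back to
    LLM-only decomposition."""
    goal_lower = goal.lower()
    best, best_hits = None, 0
    for template in templates:
        hits = sum(1 for kw in template.keywords if kw in goal_lower)
        if hits > best_hits:
            best, best_hits = template, hits
    return best  # None when no keyword matched
```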
6.5 Planner Integration
MissionPlanner.decompose() pipeline becomes:
1. Template match — keyword lookup against the goal
2. If matched: use the template structure; the LLM fills in specifics (task titles, descriptions, tool requirements)
3. If no match: fall back to the current LLM-only decomposition (82A behavior)
4. Structural validation — same as 82A (DAG validation, agent role validation)
The template constrains task count, agent roles, and output format. The LLM customizes within those constraints.
6.6 Files
modules/coordination/templates.py
NEW — Template dataclasses + registry + match function
modules/coordination/planner.py
Add template matching step before LLM decomposition
7. Replanning on Failure
7.1 Problem
82A: task fails after max_retries → task enters failed → mission enters failed. Dead end. User has to create a new mission from scratch.
7.2 Design
When a task fails after max retries, the coordinator can replan the remaining subtree:
1. Gather failure context: what went wrong, which task, what the agent tried
2. Call MissionPlanner.replan() with the failure context + completed task outputs
3. Generate replacement tasks for the failed task and any dependent tasks
4. Insert new tasks into the DAG, mark old failed tasks as skipped
5. Resume execution from the new tasks
7.3 Triggers
| Trigger | Source | Action |
|---|---|---|
| task_failed_max_retries | Reconciler | Auto-replan if config.auto_replan = true, else transition mission to failed |
| human_replan_request | API: POST /missions/{id}/replan | User requests replan with optional new instructions |
| budget_warning | Coordinator | Log warning, continue (no auto-replan on budget) |
7.4 Constraints
- Max replans per mission: 2 (configurable via config.COORDINATOR_MAX_REPLANS)
- Replan scope: only the failed task + its downstream dependents. Completed/verified tasks are locked.
- Context injection: the replan prompt includes the original goal, completed task summaries, the failure reason, and user notes (if any)
7.5 Interface
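The interface body was not reproduced in this draft. One concrete piece of it, the replan-scope rule from §7.4 (failed task plus its transitive downstream dependents, upstream tasks locked), can be sketched as a pure graph walk; the function name and the `task_id → dependency ids` mapping are assumptions:

```python
def replan_scope(tasks, failed_id):
    """Return the set of task ids a replan may replace: the failed task
    plus everything downstream of it. `tasks` maps task_id -> list of
    dependency task_ids (the DAG edges)."""
    # Invert dependencies: parent -> children.
    children = {tid: [] for tid in tasks}
    for tid, deps in tasks.items():
        for dep in deps:
            children[dep].append(tid)
    # BFS downstream from the failed task.
    scope, frontier = {failed_id}, [failed_id]
    while frontier:
        nxt = []
        for tid in frontier:
            for child in children[tid]:
                if child not in scope:
                    scope.add(child)
                    nxt.append(child)
        frontier = nxt
    return scope
```

Everything outside the returned set is locked, which is what guarantees completed/verified work survives a replan.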
7.6 State Machine Addition
New RunState: replanning (ACTIVE type)
7.7 Files
| File | Change |
|---|---|
| core/models/orchestration_enums.py | Add replanning to RunState |
| modules/coordination/planner.py | Implement replan() method |
| modules/coordination/reconciler.py | On task failure after max retries, trigger replan instead of failing mission |
| services/coordinator_service.py | Add replan_mission() method + API wiring |
| api/missions.py | Add POST /missions/{id}/replan endpoint |
8. Cross-Task Consistency Verification
8.1 Problem
82A verifies each task independently. Task 3 might contradict Task 1's findings. With sequential execution the risk is lower (each task sees upstream outputs), but contradictions still happen when agents interpret the same data differently.
8.2 Design
After all tasks are verified, run a single cross-task consistency check:
8.3 Prompt Template
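The prompt itself was not reproduced in this draft. A hedged sketch of what it might look like — the wording, the `###` output delimiting, and the JSON response shape are all assumptions, not the shipped template:

```python
# Illustrative only: not the shipped prompt.
CONSISTENCY_PROMPT = """\
You are reviewing the outputs of a multi-task mission for consistency.

Mission goal: {goal}

Task outputs:
{task_outputs}

Check for contradictions across tasks: conflicting facts, figures, or
recommendations. Respond with JSON of the form:
{{"consistent": true, "issues": [{{"tasks": [], "severity": "low|medium|high", "description": ""}}]}}
"""

def build_consistency_prompt(goal, outputs):
    """outputs: list of (task_title, output_text) pairs."""
    joined = "\n\n".join(f"### {title}\n{text}" for title, text in outputs)
    return CONSISTENCY_PROMPT.format(goal=goal, task_outputs=joined)
```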
8.4 Integration
In reconciler.py, after the last task is verified:
1. Collect all verified task outputs
2. Call verify_cross_task_consistency()
3. If consistent → proceed to awaiting_human (or completed if auto-approve)
4. If inconsistent with high-severity issues → flag for human review with the issues
8.5 Files
| File | Change |
|---|---|
| modules/coordination/verification.py | Implement verify_cross_task_consistency() |
| modules/coordination/reconciler.py | Wire cross-task check after all tasks verified |
9. Telemetry Query API
9.1 Problem
82A writes orchestration_events on every state transition but there's no way to read them. No dashboards, no cost tracking, no agent performance visibility.
9.2 Endpoints
| Endpoint | Returns |
|---|---|
| GET /api/missions/{id}/events | Events for a mission run, paginated, filterable by event_type |
| GET /api/missions/{id}/cost | Token usage breakdown by task + total |
| GET /api/missions/stats | Aggregate stats: success rate, avg duration, avg cost, top agents |
| GET /api/agents/{id}/mission-history | Agent's mission performance: tasks completed, avg verifier score, failure rate |
9.3 Stats Response
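The response schema was not reproduced in this draft. A plausible shape for GET /api/missions/stats, with illustrative field names and placeholder values (drawn from the §9.2 description and the §16 metrics, not a fixed contract):

```json
{
  "total_missions": 42,
  "success_rate": 0.86,
  "avg_duration_seconds": 412,
  "avg_cost_usd": 0.31,
  "top_agents": [
    {"agent_id": "SCOUT", "tasks_completed": 58, "avg_verifier_score": 0.84}
  ]
}
```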
9.4 Files
| File | Change |
|---|---|
| api/missions.py | Add 4 new endpoints |
| services/coordinator_service.py | Add query methods for stats/events |
10. Adaptive Verification Thresholds
10.1 Problem
82A uses a static quality_threshold = 0.6 for all verification. Some task types need higher bars (final reports), some need lower (intermediate data gathering). Human overrides (approving despite verification failure) indicate the threshold is too strict.
10.2 Design
Track the human override rate per agent_role / task type:
- If humans approve > 30% of verification failures for a role → lower the threshold by 0.05
- If humans reject > 20% of verification passes → raise the threshold by 0.05
- Bounds: [0.4, 0.85]
- Store adjusted thresholds in a new verification_thresholds table or in config.
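The rules above reduce to a small adjustment function. A sketch (in phase 1 the output is only logged as a recommendation; giving the lower-threshold rule priority when both rates trip is an assumption, since the PRD doesn't specify the tie-break):

```python
def recommend_threshold(current, approve_over_fail_rate, reject_over_pass_rate,
                        lo=0.4, hi=0.85, step=0.05):
    """Recommend an adjusted quality_threshold for a role/task type.
    approve_over_fail_rate: share of verification *failures* humans approved.
    reject_over_pass_rate: share of verification *passes* humans rejected."""
    if approve_over_fail_rate > 0.30:      # too strict → lower
        current -= step
    elif reject_over_pass_rate > 0.20:     # too lax → raise
        current += step
    return min(hi, max(lo, round(current, 2)))
```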
10.3 Implementation
Phase 1 (82B): Track override rates. Log recommendations but don't auto-adjust. Phase 2 (82C): Auto-adjust with human confirmation.
10.4 Files
| File | Change |
|---|---|
| modules/coordination/verification.py | Query override rate, log threshold recommendation |
| services/coordinator_service.py | Capture human review decisions (approve/reject) with verification context |
11. Verification Result Caching
11.1 Problem
When a task retries (verification fail → retry with feedback), the agent might produce the same or very similar output. Re-running the full LLM verification wastes tokens.
11.2 Design
Hash the task output (sha256(output_text)). Before calling the LLM verifier, check if we've already verified this exact output. If so, return the cached result.
- Cache key: (task_id, output_hash)
- Cache TTL: duration of the mission run (not persisted beyond run completion)
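The whole mechanism fits in a few lines. A sketch of an in-memory per-run cache (class and method names are illustrative, not the shipped API):

```python
import hashlib

class VerificationCache:
    """In-memory, per-run verification cache, keyed on
    (task_id, sha256(output_text)). Discarded when the run completes."""

    def __init__(self):
        self._cache = {}

    @staticmethod
    def output_hash(output_text: str) -> str:
        return hashlib.sha256(output_text.encode("utf-8")).hexdigest()

    def get(self, task_id, output_text):
        """Return the cached verification result, or None on miss."""
        return self._cache.get((task_id, self.output_hash(output_text)))

    def put(self, task_id, output_text, result):
        self._cache[(task_id, self.output_hash(output_text))] = result
```

Before calling the LLM verifier, check `cache.get(task_id, output)`; on a miss, verify and `cache.put(...)` the result so an unchanged retry output skips the LLM call entirely.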
11.3 Files
| File | Change |
|---|---|
| modules/coordination/verification.py | Add hash check before LLM call |
12. Save as Routine
12.1 Problem
User runs a successful "Research top AI frameworks" mission. Next week they want to do "Research top LLM providers" — same structure, different topic. Currently they have to start from scratch.
12.2 Design
After a mission completes successfully, offer a "Save as Routine" action:
1. Extract the task graph structure (roles, dependencies, verification criteria)
2. Templatize: replace specific content with {goal}/{topic} placeholders
3. Save as a recipe in the existing recipes system
4. Next time a similar goal is detected, suggest the saved template
12.3 API
Response: the created recipe ID.
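The request body was not reproduced in this draft. One possible shape — the endpoint path follows the other mission endpoints, but the path, field names, and status code are assumptions; only "response is the created recipe ID" comes from the PRD:

```
POST /api/missions/{id}/save-as-routine
Content-Type: application/json

{"name": "weekly-research", "description": "optional label for the recipe"}

→ 201 {"recipe_id": "<new recipe id>"}
```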
12.4 Frontend
Add a "Save as Routine" button on the mission detail page (only visible when mission is completed).
12.5 Files
| Component | Change |
|---|---|
| api/missions.py | Add save-as-routine endpoint |
| services/coordinator_service.py | Extract template from completed run |
| Mission detail page | Add button + confirmation dialog |
13. Orchestration Archive
13.1 Problem
orchestration_runs and orchestration_tasks will grow unbounded. Terminal runs (completed, failed, cancelled) older than 30 days should be archived.
13.2 Design
- New table: orchestration_archive — same schema as orchestration_runs + JSON blob of all tasks and events
- Cron job: daily, moves terminal runs older than config.COORDINATOR_ARCHIVE_AFTER_DAYS (default 30)
- Archive is read-only, queryable via GET /api/missions/archive
- Active queries (list, stats) exclude archived runs
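The selection half of archive_old_runs() is simple to pin down; a sketch of just that pure step, with the actual move (copy to orchestration_archive + delete) left out as DB-specific, and the `state`/`finished_at` field names assumed:

```python
from datetime import datetime, timedelta, timezone

TERMINAL_STATES = {"completed", "failed", "cancelled"}

def select_archivable(runs, now, archive_after_days=30):
    """Pick terminal runs whose finish timestamp is older than the cutoff.
    Non-terminal runs are never archived, regardless of age."""
    cutoff = now - timedelta(days=archive_after_days)
    return [r for r in runs
            if r.state in TERMINAL_STATES and r.finished_at < cutoff]
```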
13.3 Files
| Component | Change |
|---|---|
| core/models/orchestration.py | Add OrchestrationArchive model |
| Alembic migration | Create orchestration_archive table |
| services/coordinator_service.py | Add archive_old_runs() method |
| Cron job or reconciler tick | Trigger archival |
14. Configuration
New config entries for 82B:
| Config | Default | Description |
|---|---|---|
| COORDINATOR_MAX_REPLANS | 2 | Max replan attempts per mission |
| COORDINATOR_AUTO_REPLAN | false | Auto-replan on task failure (vs fail mission) |
| COORDINATOR_HISTORY_LOOKBACK_DAYS | 30 | Days of history for agent scoring |
| COORDINATOR_HISTORY_MIN_DATAPOINTS | 3 | Min verified tasks before using history score |
| COORDINATOR_ARCHIVE_AFTER_DAYS | 30 | Days before terminal runs are archived |
| COORDINATOR_CONSISTENCY_CHECK | true | Run cross-task consistency verification |
15. Implementation Order
| Phase | Feature | Size | Notes |
|---|---|---|---|
| 1 | History-based agent scoring | S | Needs accumulated verifier_score data from 82A runs |
| 2 | Decomposition templates | M | Standalone — improves plan quality immediately |
| 3 | Telemetry query API | S | Standalone — unlocks visibility |
| 4 | Verification caching + adaptive thresholds | S | Standalone |
| 5 | Cross-task consistency | S | After telemetry (to measure impact) |
| 6 | Replanning on failure | L | Needs state machine addition + planner changes |
| 7 | Save as routine | M | Needs recipe system integration |
| 8 | Orchestration archive | S | Can land anytime |
Recommended order: 2 → 1 → 3 → 4 → 5 → 6 → 7 → 8
Templates (phase 2) have the highest immediate impact — every mission benefits from better decomposition. History scoring (phase 1) depends on having enough data from 82A runs, so it naturally comes after some missions have been executed.
16. Success Criteria
| Metric | Target | Measurement |
|---|---|---|
| Template match rate | > 60% of missions use a template | Planner logs template_used field |
| Verification pass rate (first attempt) | > 70% (up from ~55% in 82A) | Query orchestration_events for verification outcomes |
| Agent selection accuracy | < 10% human overrides of agent assignment | Track agent reassignment rate |
| Mission success rate | > 85% (up from ~80% in 82A) | Terminal state ratio |
| Replan success rate | > 50% of replanned missions complete | Track replan → completed transitions |
| Avg mission duration | < 10 min for 5-task missions | Event timestamps |
17. Non-Goals
This PRD does not add parallel execution — that's 82C
This PRD does not add contractor/ephemeral agents — that's 82C
This PRD does not add budget enforcement — sequential missions are cheap
This PRD does not change the frontend mission detail page (except Save as Routine button)
This PRD does not add the conversational mission planning flow in chat