PRD-102 Outline: Coordinator Architecture

Type: Research + Design
Status: Outline (Loop 0)
Depends On: PRD-100 (Research Master), PRD-101 (Mission Schema)
Blocks: PRD-103 (Verification), PRD-104 (Ephemeral Agents), PRD-107 (Context Interface)


Section 1: Problem Statement

Why This PRD Exists

Automatos has no coordination layer. The closest existing component is heartbeat_service.py:_orchestrator_tick_llm() (line 382), which runs a 5-iteration tool loop with an 8,000-token budget and dispatcher_only tools — it does health checks and reporting, not goal decomposition or agent dispatch.

The Coordination Gap

| What Exists | What's Missing |
| --- | --- |
| _orchestrator_tick_llm() — LLM tool loop for workspace health checks | Goal decomposition: breaking "Research EU AI Act compliance" into subtasks |
| AgentFactory.execute_with_prompt() — per-agent execution with 10-iteration tool loop | Parallel dispatch: running independent subtasks concurrently via asyncio.gather |
| AgentCommunicationProtocol — Redis pub/sub messaging (built, not wired to heartbeat ticks) | Cross-task data flow: passing Task 1's output as input to Task 2 |
| BoardTask with assigned_agent_id — manual task assignment | Automatic agent selection: matching task requirements to agent capabilities |
| SharedContextManager — in-process shared state with Redis backing (2h TTL) | Mission state machine: tracking plan → execute → verify → review lifecycle |
| TaskReconciler — stall detection for recipe_executions only | Mission-scoped stall detection, dependency-aware retry, escalation on failure |
| ContextMode.HEARTBEAT_ORCHESTRATOR — 8k tokens, 5 sections, dispatcher tools | ContextMode.COORDINATOR — full tools, mission context section, no token cap |

What This PRD Delivers

The architecture for a CoordinatorService that:

  1. Takes a natural language goal + autonomy settings

  2. Decomposes it into a dependency graph of 3-20 tasks (using PRD-101's mission_tasks schema)

  3. Assigns each task to a roster agent or contractor agent

  4. Dispatches tasks respecting dependency ordering

  5. Monitors execution, handles failures (continuation vs retry)

  6. Triggers verification (PRD-103) and human review gates

  7. Detects mission completion and offers "save as routine"
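The seven steps above imply a mission lifecycle with a fixed set of legal transitions; making them explicit keeps the coordinator from making illegal jumps (e.g. planning straight to complete). A minimal sketch — the state names follow this outline's vocabulary, but the enum values and transition table are assumptions, not the PRD-101 schema:

```python
from enum import Enum

# Lifecycle states drawn from the steps above; names are illustrative.
class MissionState(Enum):
    PLANNING = "planning"
    AWAITING_APPROVAL = "awaiting_approval"
    EXECUTING = "executing"
    VERIFYING = "verifying"
    AWAITING_REVIEW = "awaiting_review"
    COMPLETE = "complete"
    FAILED = "failed"

# Legal transitions; VERIFYING/AWAITING_REVIEW can loop back to EXECUTING
# for rework, matching the review gates in Section 3.5.
TRANSITIONS = {
    MissionState.PLANNING: {MissionState.AWAITING_APPROVAL, MissionState.EXECUTING},
    MissionState.AWAITING_APPROVAL: {MissionState.EXECUTING, MissionState.FAILED},
    MissionState.EXECUTING: {MissionState.VERIFYING, MissionState.FAILED},
    MissionState.VERIFYING: {MissionState.AWAITING_REVIEW, MissionState.EXECUTING},
    MissionState.AWAITING_REVIEW: {MissionState.COMPLETE, MissionState.EXECUTING},
}

def advance(current: MissionState, target: MissionState) -> MissionState:
    """Reject illegal lifecycle jumps; coordinator calls this on every transition."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Encoding the transitions as data rather than scattered `if` checks also gives verification and telemetry one table to audit against.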


Section 2: Prior Art Research Targets

Systems to Study (each gets dedicated research)

| System/Pattern | Source | Focus Areas | Key Question |
| --- | --- | --- | --- |
| Blackboard Architecture | Nii 1986 (AI Magazine); LbMAS (arxiv:2507.01701, 2025) | Shared state as coordination medium, knowledge source preconditions, event-driven activation, conflict resolution | Should the mission state object act as a blackboard that agents read/write to? |
| HTN Planning | Nau et al. JAIR 2003 (SHOP2); ChatHTN (arxiv:2505.11814, 2025); Hsiao et al. (arxiv:2511.07568, 2025) | Compound→primitive decomposition, method libraries, partial-order task networks, LLM as decomposition engine | Should we maintain decomposition templates that the LLM fills gaps in (ChatHTN hybrid)? |
| BDI Agents | Rao & Georgeff ICMAS 1995; ChatBDI (AAMAS 2025) | Belief-Desire-Intention cycle, intention commitment prevents thrashing, plan failure propagation, bold vs cautious reconsideration | Should the coordinator use BDI's intention model to prevent premature replanning? |
| Symphony | openai/symphony SPEC.md | WORKFLOW.md policy-as-code, reconciliation loop (dispatch + reconcile phases), continuation vs retry, workpad as progress checkpoint | Should we adopt the two-phase tick (dispatch new + reconcile running) and continuation vs retry distinction? |
| CrewAI | crewAIInc/crewAI | Sequential vs hierarchical process, context=[task_a, task_b] dependency declaration, guardrail validation pattern, async_execution + join | Should we adopt CrewAI's explicit context= dependency pattern for data flow between tasks? |
| AutoGen | microsoft/autogen | GroupChat turn-based coordination, Swarm handoff-based routing, termination composition (\| / &), nested execution isolation | Should agents explicitly hand off to the next agent (Swarm pattern) or should the coordinator always decide? |
| LangGraph | LangChain ecosystem | Typed state schema, deterministic conditional routing, checkpointing at every superstep, interrupt() for human review, Send API for dynamic parallelism | Should we adopt LangGraph's typed state + checkpoint-per-step model for mission durability? |
| Automatos Codebase | heartbeat_service.py, inter_agent.py, context/service.py, agent_factory.py, task_reconciler.py | What exists today that the coordinator builds on vs replaces | — |

Key Patterns Discovered in Research

Blackboard as mission state (Nii 1986, LbMAS 2025): The mission state object (PRD-101's mission_runs + mission_tasks) acts as a blackboard — agents write results to it, the coordinator reads it to decide next actions. LbMAS (2025) showed 5% improvement over static multi-agent systems using this pattern with LLMs. Key adoption: event-driven activation (agent activates when its dependencies complete on the blackboard) over polling.

HTN decomposition with LLM gap-filling (ChatHTN 2025): The coordinator maintains a library of decomposition templates for known mission types. For novel goals, the LLM generates a decomposition. ChatHTN proved this hybrid is provably sound — the symbolic structure validates the LLM's output. Hsiao et al. (2025) showed hand-coded HTNs enable 20-70B models to outperform 120B baselines, confirming structure improves LLM planning.

BDI intention commitment (Rao & Georgeff 1995): Once the coordinator commits to a plan, it should not replan on every tick — only when a significant belief change occurs (task failure, budget exceeded, new user input). The bold/cautious spectrum from Kinny & Georgeff maps to the autonomy toggle: approve mode = cautious (human gates), autonomous mode = bolder (replan only on failure).

Two-phase reconciliation tick (Symphony): Every coordinator tick runs: (1) dispatch phase — find tasks whose dependencies are met and assign them; (2) reconcile phase — check running tasks for stalls, external state changes, or completion. This separation is cleaner than a single monolithic loop.

Continuation vs retry (Symphony): A task that completed normally but the mission isn't done → continuation (near-zero delay, resume from workspace). A task that failed → retry (exponential backoff). The attempt_count on mission_tasks tracks retries separately from continuations. Critical distinction for AI agents where "done with my part" ≠ "mission complete."

Typed state + checkpointing (LangGraph): The coordinator's state should be a typed schema (the mission_runs + mission_tasks tables from PRD-101) with a checkpoint after every state transition. This enables crash recovery — coordinator restarts, reads last state from DB, resumes.

Explicit dependency declarations (CrewAI): task_inputs JSONB (from PRD-101) maps to CrewAI's context=[task_a, task_b] — explicit, declarative, queryable. The scheduler resolves "which tasks are ready?" by checking task_inputs references against completed task IDs.
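The "which tasks are ready?" resolution described above can be sketched in a few lines. The dict shape stands in for mission_tasks rows and their task_inputs parent references; the field names are illustrative, not PRD-101's actual columns:

```python
def ready_tasks(tasks):
    """Return ids of pending tasks whose declared parents have all succeeded.

    `tasks` maps task_id -> {"status": ..., "parents": [...]}, standing in
    for mission_tasks rows with task_inputs dependency references.
    """
    done = {tid for tid, t in tasks.items() if t["status"] == "success"}
    return [
        tid for tid, t in tasks.items()
        if t["status"] == "pending" and set(t["parents"]) <= done
    ]
```

Because the declarations are data, the same check works equally well as a SQL query against the tables, which is what makes the scheduler testable without an LLM in the loop.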


Section 3: Coordinator Responsibilities

3.1 Plan Decomposition

The coordinator takes a natural language goal and produces a task graph:

Decomposition strategy (HTN-inspired hybrid):

  1. Check template library for matching mission type (exact match or semantic similarity)

  2. If template found → use it, let LLM customize parameters (agent assignments, specific instructions)

  3. If no template → LLM generates full decomposition from scratch

  4. Validate decomposition: no cycles in dependency graph, all referenced agents exist, budget estimate within limits
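Step 4's structural validation can be sketched with Kahn's algorithm for cycle detection plus the agent and budget checks. The input shape and field names are assumptions for illustration:

```python
from collections import deque

def validate_plan(tasks, known_agents, budget_limit):
    """Structural checks from step 4. `tasks` maps
    task_id -> {"parents": [...], "agent": str, "est_cost": float}.
    Returns a list of human-readable problems; empty list = plan accepted.
    """
    problems = []
    children = {tid: [] for tid in tasks}
    indegree = {tid: 0 for tid in tasks}
    for tid, t in tasks.items():
        for p in t["parents"]:
            if p not in tasks:
                problems.append(f"task {tid}: unknown parent {p!r}")
            else:
                children[p].append(tid)
                indegree[tid] += 1
    # Kahn's algorithm: if not every task can be topologically ordered,
    # the dependency graph contains a cycle.
    queue = deque(tid for tid, d in indegree.items() if d == 0)
    ordered = 0
    while queue:
        ordered += 1
        for child in children[queue.popleft()]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if ordered != len(tasks):
        problems.append("dependency graph contains a cycle")
    # All referenced agents must exist; budget estimate must fit.
    for tid, t in tasks.items():
        if t["agent"] not in known_agents:
            problems.append(f"task {tid}: unknown agent {t['agent']!r}")
    if sum(t["est_cost"] for t in tasks.values()) > budget_limit:
        problems.append("estimated cost exceeds budget limit")
    return problems
```

Running this over every plan — template-derived or LLM-generated — is what makes the hybrid sound: the symbolic validator catches structurally impossible output regardless of which path produced it.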

Key design question: How much planning capability do current LLMs actually have? Research must benchmark decomposition quality across models (cheap models for simple missions, expensive models for complex ones).

3.2 Agent Assignment

For each task in the plan:

| Assignment Strategy | When Used |
| --- | --- |
| Roster match | Task requirements match a roster agent's skills/tools. Preferred — agent has memory, personality, history. |
| Contractor spawn | No roster agent matches, or task needs a specialist model. Ephemeral — mission-scoped lifecycle (PRD-104). |
| User override | In approve mode, user can reassign agents before execution starts. |

Matching algorithm: Compare task requirements (tools needed, model preference, domain) against agent capabilities from DB (agents.skills, agent_tools, agents.model). Score and rank. Deterministic, not LLM-based — CrewAI's "LLM-as-manager" approach is non-deterministic and untestable.
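A deterministic scorer of this kind might look like the following sketch. The weights, field names, and disqualification rule are illustrative assumptions, not the codebase's actual matching logic:

```python
def score_agent(task_req, agent):
    """Deterministic capability match. Returns None if the agent lacks a
    required tool (hard disqualification); otherwise a score, higher = better.
    Field names ("tools", "domains", "skills", "model") are illustrative.
    """
    if not set(task_req["tools"]) <= set(agent["tools"]):
        return None  # agent must expose every tool the task needs
    score = 0.0
    score += 2.0 * len(set(task_req["domains"]) & set(agent["skills"]))
    if task_req.get("model_pref") and agent["model"] == task_req["model_pref"]:
        score += 1.0
    return score

def assign(task_req, roster):
    """Pick the best roster agent; None means spawn a contractor (PRD-104)."""
    scored = [(s, name) for name, a in roster.items()
              if (s := score_agent(task_req, a)) is not None]
    return max(scored)[1] if scored else None
```

Because the function is pure and deterministic, assignment behavior can be unit-tested and audited — the property the LLM-as-manager approach lacks.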

3.3 Progress Monitoring

The coordinator monitors via the two-phase tick (Symphony pattern):

Phase A — Dispatch:

  1. Query mission_tasks where status = 'pending'

  2. For each: check if all task_inputs.__parents__ tasks are in terminal success state

  3. If ready: transition to scheduled and dispatch directly via AgentFactory.execute_with_prompt() — do NOT create a BoardTask and wait for the agent's heartbeat tick to pick it up. Direct dispatch gives the coordinator control over timing, retry, and result collection. A BoardTask is created for visibility (kanban tracking) but is NOT the dispatch mechanism.

  4. Respect concurrency limits (configurable per mission)

Design clarification: The coordinator always dispatches directly. Board tasks exist for human visibility on the kanban, not for agent scheduling. The heartbeat tick path (_agent_tick()) remains for routine/recipe work only — missions bypass it entirely.

Phase B — Reconcile:

  1. Query mission_tasks where status = 'running'

  2. Check for stalls (elapsed > stall timeout) → handle per continuation/retry logic

  3. Check for completed tasks → emit TASK_COMPLETED event, update mission state

  4. Check if all tasks done → advance mission to verifying phase

  5. Check budget → if approaching limit, emit BUDGET_WARNING event
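The two phases can be sketched against an in-memory task table. A real tick would query mission_tasks in the DB and dispatch via AgentFactory.execute_with_prompt(); the dict fields and the stall timeout value here are assumptions for illustration:

```python
STALL_TIMEOUT_S = 600  # assumed value; configurable per mission in the real design

def coordinator_tick(tasks, now, concurrency_limit=3):
    """One two-phase coordinator tick (Symphony pattern).

    `tasks` maps task_id -> {"status", "parents", "started_at", "finished"}.
    Returns (dispatched, completed, stalled) lists of task ids.
    """
    done = {tid for tid, t in tasks.items() if t["status"] == "success"}
    running = [tid for tid, t in tasks.items() if t["status"] == "running"]

    # Phase A — dispatch: pending tasks whose parents have all succeeded.
    dispatched = []
    for tid, t in tasks.items():
        if len(running) + len(dispatched) >= concurrency_limit:
            break  # respect the per-mission concurrency limit
        if t["status"] == "pending" and set(t["parents"]) <= done:
            t["status"] = "running"
            t["started_at"] = now
            dispatched.append(tid)

    # Phase B — reconcile: sweep tasks that were already running at tick start.
    completed, stalled = [], []
    for tid in running:
        t = tasks[tid]
        if t.get("finished"):
            t["status"] = "success"
            completed.append(tid)
        elif now - t["started_at"] > STALL_TIMEOUT_S:
            stalled.append(tid)  # hand off to continuation/retry logic (3.4)
    return dispatched, completed, stalled
```

Note that a task completed in Phase B only unblocks its dependents on the next tick; event-driven activation (the blackboard pattern from Section 2) would collapse that latency.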

3.4 Failure Handling

Continuation vs retry (Symphony-inspired):

| Scenario | Action | Delay |
| --- | --- | --- |
| Agent completed normally, mission not done | Continuation — dispatch next dependent tasks | Immediate |
| Agent failed (error, timeout, tool crash) | Retry — same agent, exponential backoff | min(10s × 2^(attempt-1), 5min) |
| Agent failed, max retries exhausted | Escalate — try different agent or model | Immediate, different assignment |
| All alternatives exhausted | Mission failed — notify user | — |
| Budget exceeded mid-task | Pause mission — notify user for budget increase or cancellation | — |
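The retry backoff from the table is a one-liner:

```python
def retry_delay(attempt: int) -> float:
    """Exponential backoff per the table: min(10s * 2^(attempt-1), 5min)."""
    return min(10 * 2 ** (attempt - 1), 300)
```

So attempt 1 waits 10s, attempt 2 waits 20s, and the delay caps at 5 minutes from attempt 6 onward. Continuations bypass this entirely and resume immediately.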

BDI-inspired reconsideration policy:

  • Do NOT replan on every tick (bold agent behavior for stable missions)

  • Replan triggers: task failure after all retries, user sends new instructions, budget warning

  • Replanning increments mission_runs.plan_version and emits PLAN_REVISED event

3.5 Human Review Gates

Two human interaction points:

  1. Plan approval (approve mode): After decomposition, coordinator presents plan to user. User can approve, modify, or reject. Mission stays in awaiting_approval until human acts.

  2. Result review (all modes): After verification (PRD-103), mission enters awaiting_review. User accepts, rejects (with feedback for specific tasks), or sends back for rework.

3.6 Mission Completion

A mission is complete when:

  1. All mission_tasks are in terminal state (verified or human_accepted)

  2. Verification (PRD-103) has run and scored all outputs

  3. Human has reviewed (or autonomy mode and all verifications passed)

  4. Budget accounting is finalized

  5. User is offered "save as routine?" → creates workflow_recipe from mission structure


Section 4: Key Design Questions

Q1: LLM-Driven vs Rule-Based Planning?

Options:

  • Pure LLM: Coordinator sends goal + available agents to LLM, gets back a task graph. Flexible but non-deterministic.

  • Pure rule-based: Predefined templates for every mission type. Deterministic but brittle — can't handle novel goals.

  • Hybrid (recommended — ChatHTN pattern): Template library for known patterns + LLM for novel goals + LLM for customizing templates. Validate all plans against structural rules (no cycles, valid agents, budget estimate).

Research needed: Benchmark decomposition quality. Give 10 mission goals to GPT-4o, Claude Sonnet, DeepSeek, Qwen — measure: task count, dependency correctness, instruction clarity, time to plan.

Q2: Stateful vs Stateless Coordinator?

Options:

  • Stateful (in-process): Coordinator holds mission state in memory, writes to DB periodically. Fast but lost on crash.

  • Stateless (DB-driven, recommended): Coordinator reads state from DB on every tick, writes back after actions. Slower but crash-recoverable. Matches LangGraph's checkpoint model and Symphony's "restart recovery via tracker + filesystem."

Recommendation: Stateless. The mission_runs/mission_tasks tables from PRD-101 ARE the state. Coordinator reconstructs its understanding on every tick by querying them. This is why PRD-101's schema design is critical.

Q3: How Does the Coordinator Use ContextService?

New context mode needed: ContextMode.COORDINATOR

| Section | Content |
| --- | --- |
| identity | Coordinator agent identity (role: mission coordinator) |
| mission_context (NEW) | Current mission: goal, plan, task statuses, agent assignments, budget status |
| agent_roster (NEW) | Available agents with their skills, tools, models, recent success rates |
| platform_actions | Full platform tools including new mission management tools |
| task_context | Current tick's focus: which tasks need dispatch, which are stalled |
| datetime_context | Current time for scheduling decisions |

Token budget: No cap (or 128k+ cap). Coordinator needs to see full mission context to make good decisions.
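The mode's configuration might be shaped like the following sketch. The section list comes from the table above, but the config structure is an assumption about how modes.py lays out MODE_CONFIGS, not its actual format:

```python
# Hypothetical MODE_CONFIGS entry for the new mode (structure assumed).
COORDINATOR_MODE = {
    "sections": [
        "identity",
        "mission_context",   # NEW: goal, plan, task statuses, budget
        "agent_roster",      # NEW: skills, tools, models, success rates
        "platform_actions",
        "task_context",
        "datetime_context",
    ],
    "token_budget": None,    # no cap — full mission context needed
    "tools": "full",
}
```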

Q4: Coordinator Prompt Design

The coordinator prompt must encode:

  • Role: "You are a mission coordinator. Your job is to decompose goals, assign agents, and monitor execution."

  • Available actions: Structured tool definitions for mission management

  • Current state: Injected via mission_context section

  • Decision framework: When to dispatch, when to wait, when to replan, when to escalate

Research needed: Test prompt designs. The WORKFLOW.md pattern (Symphony) of state-specific instructions is compelling — coordinator prompt could have sections for each mission state (planning, executing, verifying, reviewing).

Q5: Replanning Triggers

When should the coordinator revise its plan?

| Trigger | Action |
| --- | --- |
| Task fails after max retries | Replan: remove failed task, find alternative path or substitute agent |
| User sends new instructions mid-mission | Replan: incorporate new requirements, may add/remove tasks |
| Budget warning (>80% spent) | Replan: cut remaining tasks to essentials, use cheaper models |
| Verification rejects a task output | Replan: retry with different instructions or different agent |
| Agent discovers new information | Replan: add tasks discovered during execution (dynamic task creation) |

Key constraint: Replanning must not discard completed work. Only pending/scheduled tasks can be modified. Running tasks continue unless explicitly cancelled.

Q6: Where Does the Coordinator Live in the Module Hierarchy?

Options:

  • orchestrator/services/coordinator_service.py — alongside heartbeat_service.py and task_reconciler.py

  • orchestrator/modules/coordination/coordinator.py — new module

Recommendation: orchestrator/services/coordinator_service.py as the service, with supporting classes in orchestrator/modules/coordination/ (planner, dispatcher, reconciler). The service registers its tick on the shared UnifiedScheduler like heartbeat does.


Section 5: Integration Points

How the Coordinator Calls Existing Components

| Existing Component | How Coordinator Uses It |
| --- | --- |
| AgentFactory.execute_with_prompt() | Dispatches each mission task to its assigned agent. Coordinator passes context_mode=ContextMode.TASK_EXECUTION, prompt=task_instructions. |
| ContextService.build_context() | Coordinator builds its own context with ContextMode.COORDINATOR. Also used when building agent context for task dispatch. |
| get_tools_for_agent() (tool_router.py:140) | Resolves tools for task agents. Coordinator may need its own tool set (mission management tools). |
| UnifiedToolExecutor.execute_tool() | Coordinator's own tool loop uses this for platform actions (create board task, update mission status). |
| BoardTask model (core/models/board.py) | Coordinator creates board tasks with source_type='mission', source_id=mission_run_id. Links via mission_tasks.board_task_id. |
| HeartbeatService._agent_tick() | Not used for mission dispatch — missions bypass the heartbeat tick entirely (see 3.3). Agent ticks continue to serve routine/recipe work only. |
| TaskReconciler | Extended to watch mission_tasks alongside recipe_executions. Coordinator handles escalation on max-retry failure. |
| AgentCommunicationProtocol | Coordinator broadcasts mission context updates to assigned agents via Redis pub/sub. Optional — only if agents need real-time coordination during execution. |
| SharedContextManager | Stores mission-scoped shared context (accumulated results from completed tasks). Agents read it to get sibling task outputs. |
| workflow_recipes table | "Save as routine" converts mission structure to recipe steps. |

New Components the Coordinator Introduces

| Component | Purpose |
| --- | --- |
| CoordinatorService | Main service: tick loop, plan generation, dispatch, reconciliation |
| MissionPlanner | LLM-powered decomposition: goal → task graph. Template matching + LLM generation. |
| MissionDispatcher | Resolves ready tasks, assigns agents, dispatches via execute_with_prompt (board tasks created for visibility only, per 3.3) |
| MissionReconciler | Extends TaskReconciler pattern for mission-scoped stall detection and dependency-aware retry |
| ContextMode.COORDINATOR | New context mode with mission_context and agent_roster sections |
| ContextMode.VERIFIER | New context mode for verification agents (PRD-103) |
| platform_create_mission | Platform tool: user creates mission from chat |
| platform_approve_plan | Platform tool: user approves/modifies coordinator's plan |
| platform_mission_status | Platform tool: user checks mission progress |
| API endpoints | POST /missions, GET /missions/{id}, POST /missions/{id}/approve, POST /missions/{id}/review |

Files That Must Be Modified

| File | Change |
| --- | --- |
| orchestrator/modules/context/modes.py | Add COORDINATOR and VERIFIER to ContextMode enum and MODE_CONFIGS |
| orchestrator/modules/context/service.py | Add mission_context and agent_roster section renderers |
| orchestrator/services/task_reconciler.py | Extend _tick to query mission_tasks alongside recipe_executions |
| orchestrator/modules/tools/platform_actions.py | Register mission management action definitions |
| orchestrator/modules/tools/execution/platform_executor.py | Add handlers for mission tools |
| orchestrator/core/models/core.py (or new mission.py) | Import mission models (defined in PRD-101) |
| orchestrator/api/ | New missions.py router for mission API endpoints |
| alembic/versions/ | Migration for any coordinator-specific columns (most schema is PRD-101) |


Section 6: Acceptance Criteria for Full PRD-102

The complete PRD-102 is done when:


Section 7: Risks & Dependencies

Risks

| # | Risk | Impact | Mitigation |
| --- | --- | --- | --- |
| 1 | Coordinator complexity — too many responsibilities in one service | High | Split into focused classes: Planner, Dispatcher, Reconciler. Coordinator is the orchestrator, not the doer. |
| 2 | LLM planning reliability — decomposition quality varies by model and prompt | High | Template library for common patterns (ChatHTN hybrid). Validate all plans structurally before execution. Benchmark decomposition quality across models. |
| 3 | Cost of coordination calls — coordinator LLM calls add overhead per mission | Medium | Use cheap models for coordination (Haiku-class). Coordinator prompt should be concise. Template matching avoids the LLM call entirely for known patterns. |
| 4 | Tick frequency tradeoff — too fast = wasted cycles, too slow = delayed dispatch | Medium | Start with 5-second tick (Symphony default). Make configurable. Consider event-driven activation for specific transitions (task completion → immediate dispatch of dependent tasks). |
| 5 | Parallel dispatch race conditions — two tasks complete simultaneously, both trigger dependent task | Medium | Use DB-level locking or SELECT ... FOR UPDATE when transitioning task status. Only one dispatch per tick per task. |
| 6 | Replanning destroys progress — bad replan discards valid completed work | High | Immutable completed tasks. Replanning only modifies pending/scheduled tasks. plan_version increments on every replan for audit trail. |
| 7 | Agent unavailability — assigned agent is offline or overloaded | Medium | Coordinator checks agent availability before dispatch. Fallback: reassign to different agent or spawn contractor. Stall detection catches unresponsive agents. |
| 8 | Circular dependencies in task graph — LLM generates impossible plan | Low | Validate DAG structure (topological sort) before accepting any plan. Reject plans with cycles. |
| 9 | Coordinator becomes single point of failure | Medium | Stateless design (DB-driven) means any instance can take over. No in-process state to lose. |
| 10 | Over-engineering the first version | High | PRD-100 Risk #3: "Start sequential-only. No parallel, no dynamic replanning. Get lifecycle right first." Phase the implementation: sequential missions first (82A/B), then parallel + replanning (82C). |

Dependencies

| Dependency | Direction | Notes |
| --- | --- | --- |
| PRD-101 (Mission Schema) | Blocked by 101 | Coordinator reads/writes mission_runs, mission_tasks, mission_events. Cannot build coordinator without schema. |
| PRD-103 (Verification) | Blocks 103 | Coordinator triggers verification phase. Verification PRD needs to know coordinator's handoff interface. |
| PRD-104 (Ephemeral Agents) | Blocks 104 | Coordinator spawns contractor agents. Contractor PRD needs coordinator's spawn interface. |
| PRD-105 (Budget) | Uses 105 | Coordinator enforces budget limits defined in PRD-105. Can start with simple budget checks, enhance later. |
| PRD-106 (Telemetry) | Feeds 106 | Coordinator emits mission_events that telemetry queries. Event schema must support telemetry aggregation. |
| PRD-107 (Context Interface) | Blocks 107 | Context interface must abstract how coordinator gets/sets context. Coordinator is the primary consumer. |
| Existing HeartbeatService | Integration | Coordinator registers its tick alongside heartbeat. Must not conflict with heartbeat's scheduling. |
| Existing AgentFactory | Integration | Coordinator dispatches via execute_with_prompt(). No changes needed to AgentFactory. |
| Existing TaskReconciler | Extension | Must extend to cover mission tasks. Could be a new MissionReconciler or an extension of the existing class. |
| Existing ContextService | Extension | Must add COORDINATOR mode and mission_context section. Non-breaking — adds a new mode, doesn't modify existing ones. |


Appendix: Research Summary Matrix

| Aspect | Blackboard | HTN Planning | BDI | Symphony | CrewAI | AutoGen | LangGraph |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Coordination model | Shared state + event-driven KS activation | Hierarchical decomposition of compound tasks into primitives | Belief-Desire-Intention deliberation cycle | Reconciliation loop (dispatch + reconcile) with policy-as-code | Sequential or hierarchical (LLM-as-manager) process | Turn-based group chat with LLM speaker selection | Typed state graph with deterministic conditional edges |
| State management | Blackboard data structure (shared, hierarchical) | World state updated at each primitive step | Belief base (agent's model of world) | External tracker (Linear) + workspace filesystem | In-memory crew state; Flows add SQLite persistence | In-memory message list (ephemeral) | Typed schema + pluggable checkpointers (Postgres, SQLite) |
| Planning approach | Opportunistic — no predetermined path | Method library for known decompositions; backtracking for alternatives | Plan library indexed by triggering events; LLM can generate plans dynamically | No planning — work comes from external tracker | LLM-as-manager in hierarchical mode; AgentPlanner pre-generates steps | No planning — conversation-driven emergence | Graph defined at compile time; conditional routing for branching |
| Failure handling | KS produces competing hypotheses; control resolves conflicts | Backtrack and try alternative method | Plan failure propagation with alternative plan selection; bold/cautious reconsideration | Continuation (1s) vs retry (exponential backoff); workspace preserved | Guardrail retry loop (max 3); soft failure — proceeds with bad output | No built-in failure handling | Checkpoint enables resume from last successful step |
| Human review | Not built-in | Not built-in | Not built-in (agent is autonomous) | PR review is the human gate; no mid-execution review | human_input=True per task; @human_feedback in Flows | human_input_mode on UserProxyAgent | interrupt() pauses execution; resume with human input |
| What we adopt | Mission state as blackboard; event-driven task activation; explicit conflict resolution | Template library + LLM gap-filling (ChatHTN); partial-order task networks for parallelism | Intention commitment (don't replan every tick); bold/cautious spectrum maps to autonomy toggle; plan failure propagation | Two-phase tick (dispatch + reconcile); continuation vs retry; WORKFLOW.md state-specific instructions | context=[] dependency declarations; guardrail validation pattern; async_execution + join | Swarm handoff pattern; termination condition composition | Typed state schema; checkpoint per step; interrupt() for human review; Send API for dynamic parallelism |
| What we reject | BB1 control blackboard (overkill for 3-20 tasks); distributed blackboard partitioning | Full formal HTN domain model (too rigid); hand-authored methods only | Static plan library (LLM replaces); symbolic brittleness (LLM handles fuzzy preconditions) | Linear-specific coupling; single-agent-per-task; no multi-agent coordination | LLM-as-manager for delegation (non-deterministic); soft guardrail failure; no dynamic task creation | LLM-based speaker selection per turn (expensive, non-deterministic); magic-string termination; ephemeral state | Full boilerplate burden; static graph compilation; LangSmith lock-in |
