PRD-101 — Mission Schema & Data Model
Version: 1.0 · Type: Research + Design · Status: Complete — Ready for Peer Review · Priority: P0 · Dependencies: PRD-100 (Research Master) · Author: Gerard Kavanagh + Claude · Date: 2026-03-14
1. Problem Statement
1.1 The Gap
Automatos has a production-grade foundation: 340 LLMs, 850 tools, 11 channel adapters, 5-layer memory, a Kanban board, agent reports, scheduled heartbeats, and a recipe engine. Users can do single-agent Tasks and scheduled Routines today. What they cannot do is describe a complex goal — "Research EU AI Act compliance for our product" — and have the system decompose it into subtasks, assign agents, execute with verification, and track everything on the board.
This is the Mission gap identified in PRD-100 (Section 3). Every piece of infrastructure exists except the data layer that makes coordinated multi-agent execution persistent, traceable, and recoverable.
Specifically, the platform has no:
orchestration_runs table — no way to record a mission's goal, plan, config, cost, or status
orchestration_tasks table — no way to track subtasks with dependencies, assignment, and verification
orchestration_events table — no audit trail for state transitions; debugging requires log-grepping
Task dependency graph — no DAG, no topological sort, no "what's ready to run?" query
State machine for task lifecycle — board tasks have 5 statuses; missions need 10+ with defined transitions
Integration contracts — no defined mapping from orchestration tables to board_tasks, agent_reports, recipes
Without this data layer, the Coordinator (PRD-102), Verifier (PRD-103), Ephemeral Agents (PRD-104), Budget Enforcement (PRD-105), and Telemetry (PRD-106) have nothing to read from or write to. The schema is the foundation everything else stands on.
1.2 What This PRD Delivers
This document is the research and design specification for the mission data layer. It delivers:
1.3 What This PRD Does NOT Cover
These are explicitly out of scope — each has its own research PRD:
How the coordinator decomposes goals into tasks — PRD-102 (Coordinator Architecture)
How verification/scoring works — PRD-103 (Verification & Quality)
Ephemeral "contractor" agent lifecycle — PRD-104 (Ephemeral Agents & Model Selection)
Budget enforcement and approval gates — PRD-105 (Budget & Governance)
Outcome telemetry queries and learning — PRD-106 (Outcome Telemetry & Learning Foundation)
Context interface for Phase 3 swap — PRD-107 (Context Interface Abstraction)
API endpoints and service layer — PRD-82A (Implementation PRD)
Frontend changes beyond existing board — PRD-82A (Implementation PRD)
This PRD designs the tables, state machine, and integration contracts. PRD-82A will take these designs and produce the Alembic migration, SQLAlchemy models, API endpoints, and service layer code.
1.4 Design Philosophy
Five principles guided every decision in this document:
Research before building. Every architectural choice in Sections 3-7 cites a production system that validated the pattern at scale. No design-by-vibes.
DB-authoritative. The database is the single source of truth for mission state. The coordinator re-derives "what's ready to run?" from DB on every tick. No in-memory state to lose on crash.
Dual-write for state. Current state is denormalized on the row for O(1) dashboard queries. Every transition is also appended to orchestration_events for the audit trail. Both writes happen in one transaction.
Additive-only integration. Three new tables. Zero changes to existing table structures. New optional columns on existing tables, with indexes added via CREATE INDEX CONCURRENTLY. No downtime. No backfill.
Schema enables, implementation decides. The schema supports parallel execution, budget enforcement, and verification scoring — but whether those features are built sequentially or in parallel is an implementation decision for PRD-82A through 82D.
2. Prior Art: DAG Execution Patterns
2.1 Overview
Mission orchestration requires storing runs (missions), tasks (subtasks), dependencies between tasks, and state transitions. Before designing our schema, we studied five production systems that solve related problems at scale. Each takes a fundamentally different approach to the same core challenge: how do you persist the state of a multi-step execution with dependencies?
The systems studied:
Temporal — workflow-as-code with deterministic replay over an append-only event history
Prefect — flow/task runs with denormalized state + append-only state history tables
Apache Airflow — DAG runs with DB-authoritative scheduling and trigger rules
Dagster — event-sourced execution with the event log as the system of record
OpenAI Symphony — tracker-as-coordinator with in-memory orchestration and policy-as-code
2.2 Comparison Table
Run model
Temporal: executions table; state stored as serialized protobuf blob; identity = (namespace, workflow_id, run_id)
Prefect: flow_run table; state denormalized inline (state_type, state_name) + FK to flow_run_state history
Airflow: dag_run table; state column (queued/running/success/failed); unique on (dag_id, run_id)
Dagster: runs table; run_body TEXT = full serialized DagsterRun JSON; denormalized start_time/end_time
Symphony: no persistent run table — Linear issue is the run record; in-memory claim state only
Task model
Temporal: activity_info_maps table; one row per in-flight activity; serialized PendingActivityInfo proto blob
Prefect: task_run table; explicit columns for state, timing, retry policy, cache, task_inputs
Airflow: task_instance table; composite key (dag_id, run_id, task_id, map_index); no FK to DAG model (by design)
Dagster: no task table — step state derived from event_logs entries keyed by (run_id, step_key)
Symphony: no task table — each Linear issue = one task; agent workspace = execution context
Dependency model
Temporal: implicit in workflow code — replay-derived; no stored DAG
Prefect: task_inputs JSON column — typed refs to upstream TaskRunResult / Parameter / Constant
Airflow: serialized in serialized_dag table; trigger rules evaluate upstream states at scheduling time
Dagster: declared in GraphDefinition dependency dict; stored in serialized JobSnapshot
Symphony: Linear issue dependencies (if any); mostly single-task-per-issue model
State machine
Temporal: workflow RUNNING → COMPLETED/FAILED/CANCELED/TERMINATED/CONTINUED_AS_NEW/TIMED_OUT/PAUSED; activity SCHEDULED → STARTED → (completed via event)
Prefect: 9 StateTypes: SCHEDULED → PENDING → RUNNING → COMPLETED/FAILED/CRASHED/CANCELLED; sub-states: AwaitingRetry, Cached, Late, Suspended
Airflow: DAG queued → running → success/failed; task scheduled → queued → running → success/failed/skipped/upstream_failed + up_for_retry, deferred, removed
Dagster: run NOT_STARTED → QUEUED → STARTING → STARTED → SUCCESS/FAILURE/CANCELED; steps event-derived: STEP_START → STEP_SUCCESS/STEP_FAILURE/STEP_SKIPPED
Symphony: claim Unclaimed → Claimed → Running → RetryQueued/Released; work phases PreparingWorkspace → BuildingPrompt → LaunchingAgent → StreamingTurn → Succeeded/Failed/TimedOut/Stalled
Failure handling
Temporal: 4 timeout types (schedule-to-close, schedule-to-start, start-to-close, heartbeat); retry policy with exponential backoff; heartbeat checkpointing for long activities
Prefect: retries with configurable delay + jitter; CRASHED state for infrastructure failures (distinct from FAILED); cache-based skip on re-execution
Airflow: retries + retry_delay per task; up_for_retry state; trigger rules cascade failure downstream (upstream_failed); retry_exponential_backoff flag
Dagster: retry via RetryRequested event; step re-enters execution; IO managers externalize data so retries don't lose upstream results
Symphony: continuation (clean exit) = 1s delay, same workspace, no attempt increment; retry (failure) = exponential backoff 10s × 2^(attempt-1), fresh branch possible
Inter-task data
Temporal: activity results stored as events in history; workflow code reads them via SDK await
Prefect: results stored as Artifact records; task_run_state._data JSON or external storage via result_artifact_id
Airflow: XCom table: (dag_id, run_id, task_id, key) → value; small values only; pluggable backend for large payloads
Dagster: IO managers — handle_output() serializes to external storage, load_input() deserializes; no in-process passing by default
Symphony: no inter-task data passing — single-agent-per-issue model; agent reads its own prior commits/workpad
Event/audit log
Temporal: history_node table — append-only event history; all state reconstructable via deterministic replay
Prefect: flow_run_state / task_run_state tables — append-only state history per entity
Airflow: log table for task logs; state changes tracked on task_instance row directly (mutable)
Dagster: event_logs table — THE system of record; all state derived from this append-only stream
Symphony: no persistent log — in-memory token accounting per session; Linear comments serve as audit trail
2.3 System-by-System Analysis
Temporal
Temporal's defining architectural choice is deterministic replay over explicit state storage. The database schema is deliberately opaque — most domain state lives in serialized protobuf blobs in the executions table, not in queryable columns. Dependencies between activities are never stored; they're encoded implicitly in deterministic workflow code. When a workflow needs to resume, Temporal replays the entire event history (history_node table) against the same code to reconstruct exact execution state.
This works brilliantly for Temporal's use case (long-running business processes with complex branching) but is explicitly wrong for our needs. We need queryable dependency structure ("show me all tasks blocked by task X"), human-readable state ("what's the mission doing right now?"), and dashboard visibility — all of which require explicit, denormalized columns rather than opaque blobs.
What we adopt: The execution chain concept (workflow_id persists across retries/continuations while run_id is unique per attempt) maps well to our mission model — a mission ID persists while individual task attempts get their own IDs. The four distinct timeout types (schedule-to-close, schedule-to-start, start-to-close, heartbeat) inform our timeout design for agent tasks.
What we reject: Serialized blob storage, deterministic replay, implicit dependency encoding. Our users need to see and query mission state directly.
Source: PostgreSQL schema at temporalio/temporal/schema/postgresql/v12/temporal/schema.sql; proto definitions in temporal/api/workflow/v1/message.proto and temporal/api/history/v1/message.proto.
Prefect
Prefect's key innovation is dual-write state tracking: current state is denormalized inline on the run/task row (state_type, state_name, state_timestamp) for O(1) query performance, while every state transition is also written as an immutable row in flow_run_state / task_run_state history tables. This gives you both fast current-state queries and a complete audit trail.
The dependency model stores task inputs as a JSON column (task_inputs) with typed references to upstream task runs, parameters, or constants — rather than a separate edges table. This is compact and self-contained per task but makes "find all downstream tasks of X" require scanning all tasks' task_inputs columns.
Prefect's 9-state model with sub-states (Cached, AwaitingRetry, Late, Suspended) is the richest among the systems studied. The distinction between FAILED (code error) and CRASHED (infrastructure failure) is particularly useful — it answers "should we retry?" differently based on failure type.
What we adopt: The dual-write pattern (denormalized current state + append-only event log) is the strongest architectural pattern across all systems studied. The CRASHED vs FAILED distinction maps directly to our agent execution — an agent hitting a timeout is different from an agent producing wrong output. The empirical_policy JSON for retry configuration per task is a clean pattern.
What we reject: JSON-encoded dependency graph in task_inputs — at our scale it works, but a join table is more queryable for "find blocked tasks." The dynamic_key uniqueness approach is over-engineered for our use case.
Source: ORM models at PrefectHQ/prefect/src/prefect/server/database/orm_models.py; state definitions at src/prefect/server/schemas/states.py.
Apache Airflow
Airflow's defining principle is the database is the single source of truth. The scheduler holds no authoritative state in memory — it reads and writes task_instance rows for every scheduling decision, using pessimistic locking (SELECT FOR UPDATE) to coordinate multiple schedulers in HA mode. Dependencies exist only in the serialized DAG definition, and the scheduler re-evaluates trigger rules against current DB state on every tick.
Airflow's trigger rule system is the most sophisticated dependency model studied. Beyond simple "all predecessors must succeed," it supports one_success (fire on first upstream success without waiting), none_failed (tolerates skips but not failures), all_done (fires regardless of upstream state), and 9 other rules. This enables complex conditional execution patterns.
The XCom mechanism for inter-task data is deliberately constrained — small values only, with pluggable backends for large payloads. This separation of metadata passing (XCom) from bulk data transfer (external storage) is a pattern worth adopting.
What we adopt: DB-authoritative scheduling — our coordinator should re-derive "what's ready to run" from DB state, not trust in-memory queues. The trigger rule concept (though we'll start with just all_success and all_done). The XCom pattern — task outputs stored separately from task metadata, with the task row pointing to the output location. The deliberate omission of ORM foreign keys between high-write tables to avoid lock contention.
What we reject: The logical_date / data interval concept is specific to batch processing, not mission orchestration. The pool / pool_slots resource scheduling is more complexity than we need initially.
Source: Models at apache/airflow/airflow-core/src/airflow/models/dagrun.py and taskinstance.py; trigger rules documented at airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html.
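Airflow's DB-authoritative scheduling is the pattern our coordinator adopts: on each tick it re-derives "what's ready to run?" purely from database state. A minimal sketch of that query, using the table and column names this PRD proposes (orchestration_tasks, orchestration_task_dependencies — final names are decided in PRD-82A) and simplified to the all_success trigger rule only:

```python
import sqlite3

# Hypothetical "ready to run" query the coordinator re-derives each tick.
# A pending task is ready when no upstream dependency is in a state other
# than 'completed'. (The real version would also honor trigger rules and
# treat 'skipped' per-rule; this sketch assumes all_success only.)
READY_TASKS_SQL = """
SELECT t.id
FROM orchestration_tasks t
WHERE t.run_id = :run_id
  AND t.state_name = 'pending'
  AND NOT EXISTS (
      SELECT 1
      FROM orchestration_task_dependencies d
      JOIN orchestration_tasks up ON up.id = d.depends_on_task_id
      WHERE d.task_id = t.id
        AND up.state_name <> 'completed'
  )
"""

def ready_task_ids(conn: sqlite3.Connection, run_id: int) -> list[int]:
    """Return IDs of tasks whose dependencies are all satisfied."""
    return [row[0] for row in conn.execute(READY_TASKS_SQL, {"run_id": run_id})]
```

Because the query reads only durable rows, a crashed coordinator loses nothing: the next tick recomputes the same answer.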
Dagster
Dagster's architecture is the purest event-sourcing model studied. The event_logs table is the sole system of record — there is no steps or ops table. Step state is entirely derived from the sequence of events for a given (run_id, step_key) pair. The runs table stores a run_body TEXT column containing the full serialized DagsterRun object, with denormalized columns (status, start_time, end_time) maintained as query-performance shortcuts.
IO managers decouple inter-op data passing from execution logic — outputs are serialized to external storage (S3, database, filesystem) and deserialized for downstream consumers. This means no data flows between ops in-process, making retry and re-execution safe by default.
The asset materialization model — tracking what data assets were produced, when, and by which run — is a novel concept that maps to our mission outputs. A completed research task produces a "research artifact" that downstream tasks consume.
What we adopt: The event log as an append-only audit trail (though not as the sole system of record — we'll use Prefect's dual-write pattern). The concept of separating task output storage from task metadata. The run_tags key-value table for flexible metadata without schema changes.
What we reject: Full event sourcing as the primary state model — it makes simple queries ("which tasks are currently running?") require event stream scanning. The serialized run_body blob is the same anti-pattern as Temporal's approach for our needs.
Source: Schema at dagster-io/dagster/python_modules/dagster/dagster/_core/storage/runs/schema.py and event_log/schema.py; DagsterRunStatus enum at dagster/_core/storage/dagster_run.py.
OpenAI Symphony
Symphony takes the most radical approach: no persistent orchestration database at all. The Linear issue tracker is the coordinator — Symphony polls it for eligible issues, claims them in memory, and dispatches coding agents. All durable state lives in Linear (issue status, comments, PR links) and Git (branches, commits, workpad files). The orchestrator is deliberately stateless and recovers from restart by re-polling.
The continuation vs retry distinction is Symphony's most valuable contribution. A clean agent exit (task still in progress) triggers immediate continuation — same workspace, same thread, no backoff. An abnormal exit (failure, timeout, stall) triggers exponential backoff retry with a fresh branch. This prevents thrashing on failures while keeping normal multi-turn work fast. The attempt counter is passed to the agent via the WORKFLOW.md template so the agent knows whether it's continuing or retrying.
The WORKFLOW.md policy-as-code pattern — runtime configuration (concurrency, timeouts, active states, hooks) in YAML front matter, agent prompt template in Markdown body — is elegant for teams that want version-controlled orchestration policy.
What we adopt: The continuation vs retry distinction — our coordinator should handle "agent needs more turns" differently from "agent failed." The concept of passing attempt context to agents so they can adapt behavior. Lifecycle hooks (before_run, after_run) with asymmetric failure semantics — pre-hooks abort, post-hooks are best-effort. Stall detection via elapsed time since last event.
What we reject: No persistent storage — we need queryable mission history, cost tracking, and dashboard visibility. Linear-as-coordinator — we have our own board and need the coordinator to be a first-class service. In-memory-only orchestration state.
Source: openai/symphony/SPEC.md for architecture; openai/symphony/elixir/WORKFLOW.md for policy-as-code reference implementation.
2.4 Architectural Decisions Informed by Prior Art
Based on this analysis, our mission schema adopts the following patterns:
State storage
Dual-write: denormalized current state on row + append-only event log
Prefect
Fast queries for dashboards AND complete audit trail for debugging
Dependency storage
Explicit join table (orchestration_task_dependencies)
Airflow (conceptual)
Queryable in both directions: "what blocks X?" and "what does X block?"
Scheduling authority
DB-authoritative — coordinator re-derives ready tasks from DB state each tick
Airflow
Crash-safe; no in-memory state to lose; supports future HA coordinator
Failure classification
Distinguish infrastructure failure from output quality failure
Prefect (CRASHED vs FAILED)
Different retry strategies: infra failure → retry same task; quality failure → retry with different model or escalate
Continuation vs retry
Separate continuation (more turns needed) from retry (something broke)
Symphony
Prevents backoff on normal multi-turn agent work while protecting against failure loops
Inter-task data
Separate output storage from task metadata; task row references output location
Dagster (IO managers), Airflow (XCom)
Keeps task table lean; outputs can be large (research reports, analysis docs)
Timeout model
Multiple timeout types per task (total deadline, per-attempt, stall detection)
Temporal (4 timeouts), Symphony (stall detection)
Different failure modes need different timeouts
Run identity
Mission ID persists across retries; task attempts get unique IDs
Temporal (execution chains)
Users track missions by stable ID; system tracks individual attempts for cost/debugging
Flexible metadata
Tags/labels as key-value pairs in a separate table or JSONB column
Dagster (run_tags), Airflow (conf)
Extensible without schema migration; supports filtering, grouping, search
Trigger rules
Start with all_success (default) and all_done; add more later
Airflow
Simple cases first; the framework supports richer rules when needed
2.5 What We Explicitly Avoid
Serialized blob storage (Temporal, Dagster run_body) — our users need to query mission state from dashboards and APIs without deserialization
Full event sourcing as primary state model (Dagster) — adds query complexity for common operations; we use events as audit trail, not source of truth
Implicit dependency encoding (Temporal) — we can't replay LLM calls deterministically; dependencies must be explicit and queryable
In-memory-only orchestration (Symphony) — we need persistent mission history for cost tracking, learning (PRD-106), and user review
Tracker-as-coordinator (Symphony) — our board is a visibility layer, not the control plane; the coordinator service owns execution logic
3. State Machine Design
3.1 Design Philosophy
The state machine must serve three audiences simultaneously:
The coordinator — needs to know what's ready to run, what's blocked, and what failed
The dashboard — needs human-readable status that maps to the existing board_tasks UI
The debugger — needs a complete transition history to answer "what happened?"
We adopt a two-level state model inspired by Prefect's architecture: a small, stable StateType enum drives orchestration logic, while a richer StateName provides user-facing detail. This lets us add display states (e.g., awaiting_payment) without touching coordinator code.
We also adopt the dual-write pattern validated in Section 2: every state transition updates the denormalized current-state column on the row (fast queries) AND appends an immutable event to orchestration_events (audit trail). Both writes occur in a single database transaction. This is the same pattern Prefect uses at significantly larger scale than our target (~100-500 concurrent runs).
Why not full event sourcing? We don't need deterministic replay (Temporal's use case). Our agents are non-deterministic LLMs — replaying orchestration code wouldn't reproduce the same results. Event sourcing adds projection maintenance, snapshot management, and eventual consistency complexity that isn't justified at our scale. The hybrid approach gives us O(1) current-state queries and a complete audit trail without the overhead.
Why not pure CRUD? Airflow's mutable-only approach makes debugging "why did this task get stuck?" require grepping application logs. We need structured transition history for mission observability, telemetry (PRD-106), and human review.
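The dual-write pattern described above can be sketched in a few lines. This is illustrative only — table and column names follow this PRD's proposals, and the final implementation lands in PRD-82A:

```python
import json
import sqlite3

def transition_run(conn: sqlite3.Connection, run_id: int,
                   from_state: str, to_state: str, payload: dict) -> bool:
    """Dual-write in one transaction: update the denormalized state column
    AND append an immutable event row. If either write fails, both roll
    back, so current state and audit trail can never diverge."""
    with conn:  # sqlite3 context manager: commit on success, rollback on error
        cur = conn.execute(
            "UPDATE orchestration_runs SET state_name = :to_state "
            "WHERE id = :run_id AND state_name = :from_state",
            {"to_state": to_state, "run_id": run_id, "from_state": from_state},
        )
        if cur.rowcount != 1:
            return False  # another transition won the race; write nothing
        conn.execute(
            "INSERT INTO orchestration_events (run_id, event_type, payload) "
            "VALUES (:run_id, :event_type, :payload)",
            {"run_id": run_id,
             "event_type": f"run_state:{from_state}->{to_state}",
             "payload": json.dumps(payload)},
        )
        return True
```

Note the guard on the WHERE clause: the UPDATE only succeeds if the row is still in the expected from-state, which doubles as a cheap conflict check (Section 3.9 adds version-based locking on top).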
3.2 State Definitions
Run States (orchestration_runs)
Format: state_name [StateType] — description. Set by: actor. States with StateType TERMINAL are terminal; all others are not.
pending [PENDING] — Run created, plan not yet approved. Set by: System (on mission creation)
planning [PENDING] — Coordinator is decomposing the goal into tasks. Set by: Coordinator
awaiting_approval [PENDING] — Plan ready, waiting for human to approve. Set by: Coordinator (after planning)
running [RUNNING] — Tasks are being executed. Set by: Human (approves plan) or System (autonomy mode)
paused [PAUSED] — Human paused execution. Set by: Human
budget_exceeded [PAUSED] — Hard budget cap hit, waiting for human decision. Set by: System (budget check)
completed [TERMINAL] — All tasks passed verification, human accepted. Set by: Human (accepts) or System (auto-accept mode)
failed [TERMINAL] — Unrecoverable failure (max retries exhausted, human rejected). Set by: System or Human
cancelled [TERMINAL] — Human cancelled the mission. Set by: Human
Task States (orchestration_tasks)
pending [PENDING] — Task created, dependencies not yet met. Set by: Coordinator (during planning)
queued [PENDING] — Dependencies met, waiting for agent slot. Set by: Dependency resolver
awaiting_retry [PENDING] — Failed, scheduled for retry after backoff. Set by: System (retry logic)
assigned [RUNNING] — Agent selected, execution starting. Set by: Coordinator
running [RUNNING] — Agent actively working (LLM calls in progress). Set by: Agent
continuing [RUNNING] — Agent exited cleanly, needs more turns. Set by: Agent (clean exit)
verifying [PAUSED] — Output submitted, verifier evaluating. Set by: Agent (submits output)
awaiting_human [PAUSED] — Verifier or coordinator requested human review. Set by: Verifier or Coordinator
completed [TERMINAL] — Passed verification (or human accepted). Set by: Verifier or Human
failed [TERMINAL] — Max retries exhausted or human rejected. Set by: System or Human
cancelled [TERMINAL] — Parent run cancelled or human cancelled task. Set by: Run state change or Human
skipped [TERMINAL] — Dependency failed under the all_success trigger rule; task not needed. Set by: Dependency resolver
StateType Mapping
Orchestration code switches on StateType (4 values, stable). Display and logging use StateName (extensible).
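As a concrete sketch, the two-level model and the task-state mapping above might look like the following (names are illustrative; the real enums are defined in PRD-82A):

```python
from enum import Enum

class StateType(str, Enum):
    """Small, stable set the coordinator switches on."""
    PENDING = "pending"
    RUNNING = "running"
    PAUSED = "paused"
    TERMINAL = "terminal"

# Extensible display states (StateName) -> stable orchestration type,
# per the task-state table above. New display states can be added here
# without touching coordinator logic.
TASK_STATE_TYPE = {
    "pending": StateType.PENDING,
    "queued": StateType.PENDING,
    "awaiting_retry": StateType.PENDING,
    "assigned": StateType.RUNNING,
    "running": StateType.RUNNING,
    "continuing": StateType.RUNNING,
    "verifying": StateType.PAUSED,
    "awaiting_human": StateType.PAUSED,
    "completed": StateType.TERMINAL,
    "failed": StateType.TERMINAL,
    "cancelled": StateType.TERMINAL,
    "skipped": StateType.TERMINAL,
}

def is_terminal(state_name: str) -> bool:
    """Coordinator-side check driven purely by the stable StateType."""
    return TASK_STATE_TYPE[state_name] is StateType.TERMINAL
```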
3.3 Transition Diagrams
Run State Transitions
Task State Transitions
3.4 Transition Tables
Run Transitions
pending → planning — Mission created (Coordinator). Side effects: emit run_started event
planning → awaiting_approval — Plan decomposition complete (Coordinator). Side effects: create orchestration_tasks rows; emit plan_ready event
planning → running — Plan complete + autonomy mode (Coordinator). Side effects: create tasks + begin execution; emit plan_ready + run_started
awaiting_approval → running — Human approves plan (Human, API). Side effects: begin task execution; emit human_approved event
awaiting_approval → failed — Human rejects plan (Human, API). Side effects: emit human_rejected event
running → completed — All tasks terminal + all passed (Coordinator). Side effects: set completed_at; emit run_completed; offer "save as routine"
running → failed — Unrecoverable task failure or budget exceeded without override (Coordinator). Side effects: set completed_at; emit run_failed; cancel remaining tasks
running → paused — Human pauses (Human, API). Side effects: pause all non-terminal tasks; emit run_paused
running → budget_exceeded — Cost exceeds hard cap (System budget check). Side effects: pause all non-terminal tasks; emit budget_exceeded
paused → running — Human resumes (Human, API). Side effects: resume paused tasks; emit run_resumed
budget_exceeded → running — Human increases budget (Human, API). Side effects: resume tasks; emit budget_increased
budget_exceeded → cancelled — Human cancels (Human, API). Side effects: cancel all tasks; emit run_cancelled
Any non-terminal → cancelled — Human cancels (Human, API). Side effects: cancel all non-terminal tasks; emit run_cancelled
Task Transitions
pending → queued — All dependencies in terminal success state (Dependency resolver). Side effects: emit task_queued
pending → skipped — Dependency failed + trigger rule = all_success (Dependency resolver). Side effects: emit task_skipped
queued → assigned — Agent selected by coordinator (Coordinator). Side effects: set agent_id; create board_task; emit task_assigned
assigned → running — Agent begins execution (Agent). Side effects: set started_at; update board_task → in_progress; emit task_started
running → continuing — Agent exits cleanly, needs more turns (Agent, clean exit). Side effects: emit task_continuing; schedule continuation (1s delay, same attempt)
continuing → running — Continuation dispatched (System timer). Side effects: emit task_resumed
running → verifying — Agent submits output (Agent). Side effects: store output reference; update board_task → review; emit task_output_submitted
verifying → completed — Verifier passes output (Verifier agent). Side effects: set verifier_score; update board_task → done; emit verification_passed
verifying → awaiting_human — Verifier score below threshold or verifier uncertain (Verifier agent). Side effects: emit human_review_requested
verifying → awaiting_retry — Verifier fails output + retries remaining (Verifier agent). Side effects: set verifier_score; emit verification_failed; schedule retry with backoff
verifying → failed — Verifier fails output + no retries remaining (Verifier agent). Side effects: set verifier_score; update board_task → done (with error); emit task_failed
awaiting_human → completed — Human approves (Human, API). Side effects: update board_task → done; emit human_approved
awaiting_human → awaiting_retry — Human rejects + retries remaining (Human, API). Side effects: emit human_rejected; schedule retry
awaiting_human → failed — Human rejects + no retries (Human, API). Side effects: update board_task → done (with error); emit human_rejected + task_failed
awaiting_retry → assigned — Backoff timer expires (System timer). Side effects: increment attempt_count; emit task_retrying
running → failed — Infrastructure failure (timeout, crash, OOM) with no retries remaining (System reconciler). Side effects: update board_task → done (with error); emit task_crashed
running → awaiting_retry — Infrastructure failure + retries remaining (System reconciler). Side effects: emit task_crashed; schedule retry with backoff
Any non-terminal → cancelled — Parent run cancelled (Run state change). Side effects: update board_task → done (with error); emit task_cancelled
3.5 Continuation vs Retry (from Symphony)
The distinction between continuation and retry is critical for AI agent tasks. An agent researching a topic may need 5 LLM turns — each "exit" between turns is a continuation, not a failure.
Trigger — Continuation: agent exits cleanly, work incomplete. Retry: infrastructure failure, verification failure, or timeout
Attempt counter — Continuation: unchanged (same attempt). Retry: incremented
Delay — Continuation: 1 second (fixed). Retry: exponential backoff, min(10s × 2^(attempt-1), 5min)
Agent context — Continuation: same agent, workspace preserved, prior output available. Retry: same or different agent, fresh prompt with attempt number
State sequence — Continuation: running → continuing → running. Retry: running → awaiting_retry → assigned → running
Board task status — Continuation: stays in_progress. Retry: briefly shows retry status, then back to in_progress
Budget impact — Continuation: counts toward task budget. Retry: counts toward task budget (coordinator may switch to cheaper model on retry)
Limit — Continuation: max turns configurable per task (default: 10, matching existing AgentFactory tool loop). Retry: max attempts configurable per task (default: 3)
Backoff progression for retries:
Attempt 1 — delay 10s (cumulative 10s)
Attempt 2 — delay 20s (cumulative 30s)
Attempt 3 — delay 40s (cumulative 70s)
Attempt 4 — delay 80s (cumulative 150s, 2.5min)
Attempt 5 — delay 160s (cumulative 310s, ~5.2min)
Attempt 6+ — delay 300s (cap); cumulative grows by 5min per attempt
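The backoff formula is small enough to state directly. A sketch matching the progression above (function name is an assumption, not final API):

```python
def retry_delay_seconds(attempt: int, base: int = 10, cap: int = 300) -> int:
    """Exponential backoff for failure retries: min(10s * 2^(attempt-1), 5min).
    Continuations bypass this entirely and use a fixed 1-second delay."""
    return min(base * 2 ** (attempt - 1), cap)
```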
3.6 Failure Classification
Following Prefect's CRASHED vs FAILED distinction, adapted for AI agent execution:
Infrastructure failure (CRASHED equivalent) — Trigger: agent timeout, OOM, network error, provider outage. Retryable: yes (auto). Strategy: same task, exponential backoff. Example: OpenRouter returns 503; agent process killed
Quality failure (FAILED equivalent) — Trigger: verifier rejects output, wrong format, incomplete work. Retryable: yes (auto). Strategy: same or different model, with failure context in prompt. Example: research report missing 2 of 5 required sections
Human rejection — Trigger: human reviews and rejects. Retryable: conditional — only if human chooses "retry" vs "fail". Example: human says "this analysis is wrong, try again"
Budget exhaustion — Trigger: task cost exceeds per-task or per-run budget. Retryable: no (requires human). Strategy: human must increase budget. Example: task used $5 of $3 budget
Dependency failure — Trigger: upstream task failed with all_success trigger. Retryable: no. Strategy: task skipped. Example: research task failed → analysis task can't proceed
Cancellation — Trigger: human or system cancels. Retryable: no. Strategy: N/A. Example: user abandons mission
Key design decision: Infrastructure failures bypass verification (no point judging output from a crashed agent). Quality failures always go through verification. This matches Prefect's pattern where CRASHED bypasses orchestration rules via force=True.
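The classification above reduces to a small policy lookup the coordinator can consult. A sketch (names and structure are assumptions for illustration, not the final API):

```python
# failure class -> (auto_retryable, handling note), per the table above
RETRY_POLICY: dict[str, tuple[bool, str]] = {
    "infrastructure": (True, "retry same task with exponential backoff; skip verification"),
    "quality": (True, "retry with failure context in prompt; may switch model"),
    "human_rejection": (False, "human chooses retry vs fail"),
    "budget": (False, "pause; human must raise budget"),
    "dependency": (False, "skip task"),
    "cancelled": (False, "no action"),
}

def is_auto_retryable(failure_class: str) -> bool:
    """Only infrastructure and quality failures retry without a human."""
    return RETRY_POLICY[failure_class][0]
```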
3.7 Stall Detection
Adapted from Symphony's reconciliation loop and the existing task_reconciler.py:
Task in running with no heartbeat/event — Timeout: TASK_STALL_TIMEOUT (default: 5 min). Action: transition to awaiting_retry (if retries remain) or failed
Task in assigned with no start event — Timeout: TASK_ASSIGN_TIMEOUT (default: 2 min). Action: re-queue; transition back to queued for reassignment
Task in verifying with no verdict — Timeout: VERIFY_TIMEOUT (default: 3 min). Action: escalate to awaiting_human
Run in running with all tasks terminal but not resolved — Timeout: RUN_RESOLVE_TIMEOUT (default: 1 min). Action: coordinator re-evaluates run completion
Implementation: Extend the existing task_reconciler.py pattern. The reconciler runs on a tick (via APScheduler, matching the existing heartbeat infrastructure) and queries for stalled entities using the denormalized state column + updated_at timestamp. This is the DB-authoritative scheduling pattern from Airflow — the reconciler re-derives "what needs attention" from DB state each tick, with no in-memory state to lose on crash.
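A minimal sketch of that reconciler scan, assuming the orchestration_tasks table carries the denormalized state_name and an updated_at timestamp (exact schema is defined in PRD-82A; the real reconciler would also cover the run-level check):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Illustrative defaults from the stall-detection table above, in seconds
STALL_TIMEOUTS = {"running": 300, "assigned": 120, "verifying": 180}

def find_stalled_tasks(conn: sqlite3.Connection, now: datetime) -> list:
    """DB-authoritative stall scan: re-derive 'what needs attention' from
    the denormalized state column + updated_at on each tick. No in-memory
    state, so a reconciler crash loses nothing."""
    stalled = []
    for state, timeout_s in STALL_TIMEOUTS.items():
        cutoff = (now - timedelta(seconds=timeout_s)).isoformat()
        rows = conn.execute(
            "SELECT id FROM orchestration_tasks "
            "WHERE state_name = :state AND updated_at < :cutoff",
            {"state": state, "cutoff": cutoff},
        ).fetchall()
        stalled += [(row[0], state) for row in rows]
    return stalled
```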
3.8 Board Task Status Mapping
The existing board_tasks table has 5 statuses: inbox, assigned, in_progress, review, done. Every orchestration task creates a corresponding board_task for UI visibility. The mapping:
pending — (no board_task yet); board task created on assignment
queued — (no board_task yet); board task created on assignment
awaiting_retry — assigned; waiting to be re-dispatched
assigned — assigned; agent selected
running — in_progress; sets started_at
continuing — in_progress; stays in progress during multi-turn
verifying — review; output under evaluation
awaiting_human — review; human decision needed
completed — done; sets completed_at
failed — done; sets completed_at + error_message
cancelled — done; sets completed_at + error_message
skipped — done; sets completed_at + result = "skipped: dependency failed"
Integration mechanism: Board tasks are linked via source_type='orchestration' and source_id=<orchestration_run_id> (existing fields on board_tasks). The orchestration_tasks table holds a board_task_id FK for direct reference. State synchronization is performed as a side effect of the transition_task() function — every orchestration state change updates the corresponding board_task status in the same transaction.
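The state mapping itself is a plain lookup that transition_task() would consult when syncing the board. A sketch (dict and function names are illustrative):

```python
from typing import Optional

# Orchestration task state -> existing board_tasks status, per the
# mapping table above. pending/queued have no board_task yet.
BOARD_STATUS_FOR_STATE = {
    "awaiting_retry": "assigned",
    "assigned": "assigned",
    "running": "in_progress",
    "continuing": "in_progress",
    "verifying": "review",
    "awaiting_human": "review",
    "completed": "done",
    "failed": "done",
    "cancelled": "done",
    "skipped": "done",
}

def board_status(task_state: str) -> Optional[str]:
    """Board status for an orchestration state, or None when no
    board_task exists yet (pending / queued)."""
    return BOARD_STATUS_FOR_STATE.get(task_state)
```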
3.9 Concurrency Safety
State transitions must be safe under concurrent access. Two scenarios matter:
Coordinator and agent racing on the same task — coordinator tries to cancel while agent submits output
Reconciler and agent racing — reconciler detects stall while agent is about to report completion
Approach: Optimistic locking with version_id_col
Every UPDATE includes WHERE version_id = <loaded_value> and increments the version. If another transaction changed the row, SQLAlchemy raises StaleDataError. The transition function catches this and returns a conflict result rather than silently corrupting state.
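A minimal sketch of the compare-and-swap this produces, in plain Python standing in for the ORM (`StaleDataError` here is a local stand-in for SQLAlchemy's exception of the same name):

```python
class StaleDataError(Exception):
    """Local stand-in for sqlalchemy.orm.exc.StaleDataError."""

def transition_with_version(row, loaded_version, new_state):
    """Compare-and-swap on version_id, as the ORM does with version_id_col.

    The SQL equivalent is:
      UPDATE orchestration_tasks
         SET state = :new, version_id = version_id + 1
       WHERE id = :id AND version_id = :loaded
    where a zero rowcount means another writer got there first.
    """
    if row["version_id"] != loaded_version:
        raise StaleDataError(
            f"expected version {loaded_version}, found {row['version_id']}"
        )
    row["state"] = new_state
    row["version_id"] += 1
    return row
```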
For claim-style operations (assigning an agent to a queued task), use SELECT FOR UPDATE SKIP LOCKED to prevent two coordinators from claiming the same task:
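A sketch of the claim query, assuming the column names from Section 5.2 (the CTE shape is illustrative, not the final implementation):

```sql
-- Claim one queued task without blocking on rows another
-- coordinator already holds.
WITH next_task AS (
    SELECT id FROM orchestration_tasks
    WHERE run_id = :run_id AND state = 'queued'
    ORDER BY sequence_number
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
UPDATE orchestration_tasks t
SET state = 'assigned', agent_id = :agent_id
FROM next_task
WHERE t.id = next_task.id
RETURNING t.id;
```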
3.10 Transition Enforcement
No external library needed. The transition rules are a ~30-line dict. Python state machine libraries (pytransitions, python-statemachine) don't integrate with SQLAlchemy and would add a dependency for ~10 states. We enforce transitions in application code via the transition_task() / transition_run() functions. All state changes must go through these functions — never set .state directly.
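A sketch of what that dict and guard look like. The state names follow Section 3.2, but the edge set shown here is illustrative — the authoritative transition table lives in the coordinator code:

```python
# Illustrative transition table; terminal states have no outgoing edges.
ALLOWED = {
    "pending": {"queued", "cancelled", "skipped"},
    "queued": {"assigned", "cancelled"},
    "assigned": {"running", "queued", "cancelled"},
    "running": {"continuing", "verifying", "awaiting_retry", "cancelled"},
    "continuing": {"running", "cancelled"},
    "verifying": {"completed", "awaiting_retry", "awaiting_human"},
    "awaiting_retry": {"queued", "failed", "cancelled"},
    "awaiting_human": {"completed", "awaiting_retry", "failed"},
    "completed": set(), "failed": set(), "cancelled": set(), "skipped": set(),
}

def check_transition(current, target):
    """Guard called by transition_task() before any state write."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

The real `transition_task()` wraps this guard together with the event append and board_task sync in one transaction.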
3.11 Key Design Decisions Summary
| Decision | Chosen | Rejected | Why |
|---|---|---|---|
| State tracking | Hybrid dual-write (CRUD + event log) | Pure event sourcing (Temporal/Dagster), pure CRUD (Airflow) | O(1) queries + audit trail, no projection maintenance overhead. Validated by Prefect at larger scale. |
| State model | Two-level (StateType + StateName) | Flat enum, hierarchical states | Stable orchestration code (4 StateTypes) + extensible display (add states without touching coordinator). Inspired by Prefect's StateType/state_name pattern. |
| Continuation vs retry | Distinct paths with different semantics | Single retry mechanism for both | AI agents frequently need multiple turns (continuation). Conflating this with failure retry causes unnecessary backoff and attempt inflation. Adopted from Symphony. |
| Failure classification | Infrastructure vs quality, separate handling | Single "failed" state | Infrastructure failure → auto-retry same config. Quality failure → retry with different model or escalate to human. Adapted from Prefect's CRASHED vs FAILED. |
| Concurrency control | Optimistic locking (version_id_col) + SELECT FOR UPDATE for claims | Pessimistic locking everywhere, eventual consistency | Low contention (state changes are seconds apart). Optimistic = no lock held during slow operations. Pessimistic only for claim-style dequeuing. |
| Transition enforcement | Application-level dict + function | DB triggers, state machine library | ~100 lines, no dependency, testable, integrated with dual-write and board_task sync. Libraries don't integrate with SQLAlchemy. |
| Board task mapping | Orchestration owns lifecycle, syncs to board_task as side effect | Board task as source of truth, separate UI table | Existing UI gets mission visibility for free. No new frontend work needed for basic mission tracking. |
| Stall detection | DB-authoritative reconciler on tick (extending existing task_reconciler pattern) | In-memory timeouts, heartbeat-only | Crash-safe — reconciler re-derives state from DB each tick. Matches existing infrastructure (APScheduler + task_reconciler.py). |
4. Data Model: orchestration_runs
The orchestration_runs table is the top-level record for every mission. It stores the user's original goal, the coordinator's decomposition plan, execution configuration, and aggregate tracking metrics. One row = one mission attempt.
4.1 Design Principles
- Denormalized current state — `state` column for O(1) dashboard queries (dual-write pattern from Section 2.4)
- Immutable goal, mutable plan — the user's original `goal` never changes; the `plan` JSONB evolves during planning
- JSONB for extensible config — autonomy level, budget caps, model preferences stored as structured JSON, not as N columns that require migrations for every new setting
- Workspace isolation — every query must filter by `workspace_id` (FK → `workspaces.id`)
- Match existing patterns — UUID primary key, `server_default=func.now()` timestamps, `ondelete='CASCADE'` for workspace FK (consistent with `board_tasks`, `agent_reports`)
4.2 Column Definitions
| Column | Type | Nullable | Default | Constraints | Description |
|---|---|---|---|---|---|
| id | UUID | No | gen_random_uuid() | PK | Stable mission identifier; persists across retries |
| workspace_id | UUID | No | — | FK → workspaces.id ON DELETE CASCADE | Multi-tenant isolation |
| title | VARCHAR(500) | No | — | — | Human-readable mission title (coordinator-generated or user-provided) |
| description | TEXT | Yes | NULL | — | Optional extended description |
| goal | TEXT | No | — | — | Original user input, verbatim. Never modified after creation. |
| state | VARCHAR(30) | No | 'pending' | — | Current RunState value (see Section 3.2). Denormalized for fast queries. |
| state_type | VARCHAR(10) | No | 'pending' | — | Current StateType value. Stable enum for coordinator logic. |
| plan | JSONB | Yes | NULL | — | Coordinator's decomposition — task list with descriptions, dependencies, agent assignments. Populated during planning state. Schema in Section 4.3. |
| config | JSONB | No | '{}' | — | Mission configuration — autonomy level, budget, model preferences, timeout overrides. Schema in Section 4.4. |
| result_summary | TEXT | Yes | NULL | — | Coordinator-generated summary of mission outcome (for completed missions) |
| error_message | TEXT | Yes | NULL | — | Failure reason (for failed/cancelled missions) |
| created_by | VARCHAR(255) | No | — | — | User ID (Clerk) or 'system' for auto-triggered missions. String type matches board_tasks.created_by_id pattern. |
| coordinator_agent_id | INTEGER | Yes | NULL | FK → agents.id ON DELETE SET NULL | Roster agent acting as coordinator, or NULL if using system coordinator |
| total_tokens | INTEGER | No | 0 | CHECK (total_tokens >= 0) | Aggregate token usage across all tasks |
| total_cost | NUMERIC(10,6) | No | 0 | CHECK (total_cost >= 0) | Aggregate cost in USD across all tasks |
| task_count | INTEGER | No | 0 | CHECK (task_count >= 0) | Total tasks in this mission (denormalized for dashboard) |
| tasks_completed | INTEGER | No | 0 | CHECK (tasks_completed >= 0) | Tasks in terminal success state (denormalized) |
| tasks_failed | INTEGER | No | 0 | CHECK (tasks_failed >= 0) | Tasks in terminal failure state (denormalized) |
| started_at | TIMESTAMPTZ | Yes | NULL | — | When first task began execution (state → running) |
| completed_at | TIMESTAMPTZ | Yes | NULL | — | When mission reached terminal state |
| duration_ms | INTEGER | Yes | NULL | — | completed_at - started_at in milliseconds. Computed on completion. |
| version_id | INTEGER | No | 1 | — | Optimistic locking counter (SQLAlchemy version_id_col) |
| created_at | TIMESTAMPTZ | No | NOW() | — | Row creation timestamp |
| updated_at | TIMESTAMPTZ | No | NOW() | — | Last modification timestamp (auto-updated) |
Why NUMERIC(10,6) for cost? LLM API calls cost fractions of a cent. FLOAT introduces rounding errors on aggregation (SUM of 1000 tasks at $0.003 each). NUMERIC is exact. 10 digits with 6 decimal places supports up to $9,999.999999 per mission — more than sufficient.
Why denormalized task counts? Dashboard queries like "show all running missions with progress" would otherwise require JOIN + GROUP BY on potentially large task tables. The coordinator updates these counters atomically when task states change (same transaction as the dual-write event).
4.3 Plan JSONB Schema
The plan column stores the coordinator's task decomposition. It's populated during the planning state and serves as the blueprint for task creation.
Design notes:
- `temp_id` is a coordinator-assigned identifier used during planning. Real `orchestration_tasks.id` UUIDs replace these after approval.
- `depends_on` references `temp_id` values (resolved to real task IDs on task creation).
- `suggested_agent` and `suggested_model` are hints — the coordinator may override based on availability or budget.
- `strategy` is informational: `"sequential"`, `"parallel"`, or `"mixed"`. The actual execution order is determined by dependency resolution.
- The plan is immutable after approval. Re-planning creates a new version (increment `version`), logged as an event.
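An illustrative plan document following these notes — the field names come from the notes above, while the titles and model name are invented examples:

```json
{
  "version": 1,
  "strategy": "mixed",
  "tasks": [
    {
      "temp_id": "t1",
      "title": "Collect EU AI Act source texts",
      "task_type": "research",
      "depends_on": [],
      "suggested_agent": "researcher",
      "suggested_model": "claude-sonnet"
    },
    {
      "temp_id": "t2",
      "title": "Summarize compliance obligations",
      "task_type": "synthesis",
      "depends_on": ["t1"],
      "suggested_agent": null,
      "suggested_model": null
    }
  ]
}
```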
4.4 Config JSONB Schema
The config column stores mission-level settings. Modeled after workflow_recipes.execution_config — same pattern of structured JSON for runtime configuration.
Autonomy levels:
"approve" (default)
Coordinator shows plan → human approves → execution begins
"autonomous"
Plan auto-approved if estimated cost ≤ auto_approve_threshold. Otherwise, falls back to approve.
"full_auto"
No human gates. System runs to completion or budget exhaustion. Requires explicit opt-in.
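An illustrative config document. Only the autonomy levels and `auto_approve_threshold` are defined above — the remaining keys are hypothetical examples of the kind of settings this column is meant to absorb without migrations:

```json
{
  "autonomy": "approve",
  "auto_approve_threshold": 5.00,
  "budget": {"soft_limit": 10.00, "hard_limit": 25.00},
  "model_preferences": {"default": "claude-sonnet"},
  "timeouts": {"task_assign_sec": 120, "verify_sec": 180}
}
```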
Why JSONB instead of columns? Config evolves faster than schema. Adding "notification preferences" or "priority scheduling" shouldn't require an Alembic migration. The trade-off is weaker type enforcement at the DB level — mitigated by Pydantic validation on the API layer (same pattern used by workflow_recipes.execution_config and agents.configuration).
4.5 Indexes
Partial indexes (WHERE state_type != 'terminal') keep the index small — most runs will be terminal over time. Active runs (the ones queried by dashboards and reconcilers) stay in a compact index.
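A sketch of the partial indexes described here (index names are illustrative):

```sql
-- Active-run lookups by workspace and state.
CREATE INDEX idx_orch_runs_active
    ON orchestration_runs (workspace_id, state)
    WHERE state_type != 'terminal';

-- Reconciler stall scan on updated_at.
CREATE INDEX idx_orch_runs_updated
    ON orchestration_runs (updated_at)
    WHERE state_type != 'terminal';
```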
4.6 Example INSERT
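A minimal insert, relying on the column defaults from Section 4.2 (the UUID and user ID are placeholders):

```sql
INSERT INTO orchestration_runs (workspace_id, title, goal, config, created_by)
VALUES (
    'c0ffee00-0000-4000-8000-000000000001',
    'EU AI Act compliance research',
    'Research EU AI Act compliance for our product',
    '{"autonomy": "approve"}',
    'user_2abc'
)
RETURNING id;
```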
4.7 Alembic Migration
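The migration reduces to DDL along these lines (abridged sketch — see Section 4.2 for the full column list):

```sql
CREATE TABLE orchestration_runs (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    workspace_id UUID NOT NULL REFERENCES workspaces(id) ON DELETE CASCADE,
    title        VARCHAR(500) NOT NULL,
    goal         TEXT NOT NULL,
    state        VARCHAR(30) NOT NULL DEFAULT 'pending',
    state_type   VARCHAR(10) NOT NULL DEFAULT 'pending',
    plan         JSONB,
    config       JSONB NOT NULL DEFAULT '{}',
    total_cost   NUMERIC(10,6) NOT NULL DEFAULT 0 CHECK (total_cost >= 0),
    version_id   INTEGER NOT NULL DEFAULT 1,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```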
4.8 Design Decisions
| Decision | Chosen | Rejected | Why |
|---|---|---|---|
| Primary key type | UUID | Integer (SERIAL) | Missions may be created from multiple sources (API, chatbot, scheduler). UUID avoids coordination for ID generation. Matches workspaces.id and agent_reports.id patterns. |
| Goal storage | Immutable TEXT column | Part of config JSONB | The goal is the user's contract with the system. It should never be buried in JSON or accidentally modified. Separate column enables full-text search. |
| Plan storage | JSONB column on runs table | Separate orchestration_plans table | Plans are 1:1 with runs and always loaded together. A separate table adds a JOIN for every plan read with no queryability benefit (we never query across plans). |
| Cost tracking | NUMERIC(10,6) | FLOAT, INTEGER (cents) | FLOAT accumulates rounding errors. Integer cents loses sub-cent precision (common in LLM billing). NUMERIC is exact and supports both. |
| Denormalized counters | task_count, tasks_completed, tasks_failed | Computed via COUNT(*) query | Dashboard shows "3/7 tasks complete" — this query runs on every page load. Denormalized counters avoid a JOIN + GROUP BY on every render. Updated atomically with task state transitions. |
| Config extensibility | JSONB with Pydantic validation | Typed columns | Config changes (new autonomy levels, new notification types) shouldn't require migrations. Same pattern as agents.configuration and workflow_recipes.execution_config. |
| created_by type | VARCHAR(255) | INTEGER FK → users.id | Matches board_tasks.created_by_id pattern. Clerk user IDs are strings (user_2abc...). Supports 'system' for auto-triggered missions without a nullable FK. |
| coordinator_agent_id | Nullable FK → agents.id | Required FK, separate coordinator_type column | Most missions use the system coordinator (no specific agent). When a roster agent coordinates, reference it. SET NULL on agent deletion — the run record survives. |
5. Data Model: orchestration_tasks
The orchestration_tasks table records every subtask within a mission. Each row tracks assignment, execution, verification, and result for a single unit of work. Tasks reference their parent run, their assigned agent, and their corresponding board task (for UI visibility). Dependencies between tasks are stored in a separate join table (orchestration_task_dependencies) — not as an array column — for referential integrity and clean scheduling queries.
5.1 Design Principles
- Join table for dependencies — PostgreSQL's own documentation warns that "searching for specific array elements can be a sign of database misdesign" and recommends a separate table. A `task_dependencies` join table gives us FK enforcement, B-tree indexes for both directions (upstream/downstream), and trivial addition of edge metadata (dependency_type). At our scale (5-50 tasks per mission), the extra JOIN is negligible — and the "find ready tasks" query is cleaner than `unnest()` or `jsonb_array_elements()`.
- Board task bridge — every `orchestration_task` creates a `board_task` with `source_type='orchestration'` and `source_id` set to the run ID. This gives us free dashboard visibility without new UI components. The `board_task_id` FK on `orchestration_tasks` links back for updates.
- Two-level state — `state` (rich display) and `state_type` (stable orchestration logic) from Section 3.2, same pattern as `orchestration_runs`.
- Continuation vs retry — `attempt_number` tracks retry attempts (backoff, fresh start). `continuation_count` tracks clean continuation turns (same attempt, 1s delay). Both capped by `config.retry` on the parent run.
- Output stored externally — large task outputs go to `output_ref` (workspace file path or report ID), not inline. Only `output_summary` (≤2000 chars) is stored on the row for dashboard display. This follows the pattern all 5 studied systems use — none store large outputs on the task row.
- Match existing conventions — UUID primary key (matches `orchestration_runs`), `NUMERIC(10,6)` for cost, `TIMESTAMPTZ` timestamps with `server_default`, optimistic locking via `version_id`.
5.2 Column Definitions
| Column | Type | Nullable | Default | Constraints | Description |
|---|---|---|---|---|---|
| id | UUID | No | gen_random_uuid() | PK | Stable task identifier |
| run_id | UUID | No | — | FK → orchestration_runs.id ON DELETE CASCADE | Parent mission |
| workspace_id | UUID | No | — | FK → workspaces.id ON DELETE CASCADE | Denormalized for query efficiency (avoids JOIN to runs for workspace filtering) |
| sequence_number | SMALLINT | No | — | — | Position in plan order (1-based). Stable after planning — used for display, not execution order. |
| title | VARCHAR(500) | No | — | — | Human-readable task title (from coordinator plan) |
| description | TEXT | Yes | NULL | — | Detailed task description / instructions for the agent |
| task_type | VARCHAR(30) | No | — | — | TaskType enum: research, analysis, writing, coding, verification, review, synthesis, other |
| state | VARCHAR(30) | No | 'pending' | — | Current TaskState value (Section 3.2) |
| state_type | VARCHAR(10) | No | 'pending' | — | Current StateType value. Stable enum for coordinator logic. |
| trigger_rule | VARCHAR(30) | No | 'all_success' | — | When this task becomes ready. Values: all_success (default), all_done, none_failed, always. Inspired by Airflow's trigger rules — we adopt the 4 most relevant for agent orchestration. |
| agent_id | INTEGER | Yes | NULL | FK → agents.id ON DELETE SET NULL | Assigned roster agent. NULL if contractor or unassigned. |
| agent_type | VARCHAR(20) | Yes | NULL | — | roster (permanent agent) or contractor (ephemeral, mission-scoped). NULL when unassigned. |
| model_override | VARCHAR(255) | Yes | NULL | — | LLM model override for this task. NULL = use run-level config.model_preferences or agent default. |
| tools_requested | JSONB | Yes | NULL | — | Array of tool names the coordinator wants available for this task. Hint, not enforcement — agent's assigned tools take precedence. |
| success_criteria | TEXT | Yes | NULL | — | Plain-text description of what constitutes success. Used by the verifier (PRD-103). |
| output_summary | VARCHAR(2000) | Yes | NULL | — | Truncated output for dashboard display. Written by agent or coordinator on completion. |
| output_ref | VARCHAR(500) | Yes | NULL | — | Reference to full output: workspace file path (/reports/{agent}/{slug}.md) or agent_reports.id. |
| verifier_score | NUMERIC(3,2) | Yes | NULL | CHECK (verifier_score >= 0 AND verifier_score <= 1) | Verification quality score (0.00–1.00). Written by verifier agent (PRD-103). |
| verified_by | VARCHAR(255) | Yes | NULL | — | Who verified: agent ID, 'human', or 'auto'. |
| error_message | TEXT | Yes | NULL | — | Failure reason (for failed/cancelled tasks) |
| attempt_number | SMALLINT | No | 1 | CHECK (attempt_number >= 1) | Current retry attempt (incremented on retry, not on continuation) |
| continuation_count | SMALLINT | No | 0 | CHECK (continuation_count >= 0) | Number of continuation turns within current attempt |
| tokens_used | INTEGER | No | 0 | CHECK (tokens_used >= 0) | Total tokens consumed across all attempts |
| cost | NUMERIC(10,6) | No | 0 | CHECK (cost >= 0) | Total cost in USD across all attempts |
| board_task_id | INTEGER | Yes | NULL | FK → board_tasks.id ON DELETE SET NULL | Corresponding board task for UI visibility. Created when task is planned. |
| started_at | TIMESTAMPTZ | Yes | NULL | — | When agent began execution (state → running) |
| completed_at | TIMESTAMPTZ | Yes | NULL | — | When task reached terminal state |
| duration_ms | INTEGER | Yes | NULL | — | completed_at - started_at in milliseconds |
| version_id | INTEGER | No | 1 | — | Optimistic locking counter (SQLAlchemy version_id_col) |
| created_at | TIMESTAMPTZ | No | NOW() | — | Row creation timestamp |
| updated_at | TIMESTAMPTZ | No | NOW() | — | Last modification timestamp |
Why denormalize workspace_id? The dashboard query "show all tasks for my workspace" would otherwise require a JOIN to orchestration_runs. Since workspace_id never changes for a task, denormalizing avoids the JOIN on every task list render.
Why SMALLINT for attempt/continuation? A task that retries hundreds of times or continues thousands of turns has a bug, not a workload. SMALLINT (2 bytes, max 32,767) is more than sufficient and saves 2 bytes per row vs INTEGER.
Why NUMERIC(3,2) for verifier_score? Scores are 0.00 to 1.00. NUMERIC(3,2) stores exactly two decimal places with no floating-point rounding. FLOAT would work but invites 0.6999... display issues.
5.3 Trigger Rules
Inspired by Airflow's trigger rule system (13 rules), we adopt the 4 most relevant for LLM agent orchestration. Each rule defines when a task's dependencies are considered "met" and the task can transition from pending to queued.
| Rule | Dependencies are met when | Typical use |
|---|---|---|
| all_success (default) | All upstream tasks must be in completed state | Standard pipeline: next agent runs only after all prerequisites succeed |
| all_done | All upstream tasks must be in any terminal state (completed, failed, cancelled, skipped) | Join/aggregation nodes that collect results regardless of individual success |
| none_failed | All upstream tasks must be terminal AND none may be failed (skipped/cancelled are OK) | Parallel fan-out where some branches are optional but hard failures should block |
| always | Skip dependency evaluation entirely — task is immediately queued when created | Cleanup, notification, or cost-tracking tasks that must always run |
Why only 4 rules? Airflow's ONE_SUCCESS, ONE_FAILED, ALL_FAILED, etc. are designed for complex ETL branching with thousands of tasks. Our missions have 5-50 tasks with human oversight. Four rules cover every pattern we need:
- Sequential pipeline → `all_success`
- Parallel research with synthesis → `all_success` on the synthesis task
- Error-tolerant aggregation → `all_done`
- Optional branches → `none_failed`
- Guaranteed cleanup → `always`
If we discover a need for one_success (race pattern) or others, adding them requires only a new enum value and a case in the trigger rule evaluator — no schema change.
5.4 Task Dependencies (Join Table)
Dependency types:
| Type | Meaning |
|---|---|
| data (default) | Downstream task consumes upstream task's output. The coordinator passes the output reference to the downstream agent's context. |
| ordering | Downstream task must wait for upstream to complete, but does not consume its output. Used for side-effect ordering (e.g., "write to DB before reading from DB"). |
Why a join table instead of an array column?
| Criterion | Join Table (chosen) | UUID[] Array | JSONB Array |
|---|---|---|---|
| FK enforcement | ✅ DB-enforced | ❌ None | ❌ None |
| Self-referencing cycle prevention | ✅ CHECK (task_id != depends_on_id) | ❌ App-only | ❌ App-only |
| Edge metadata | ✅ Add columns | ❌ Requires schema change | ⚠️ Add JSON keys |
| "Find ready tasks" query | ✅ Standard NOT EXISTS + JOIN | ⚠️ unnest() + JOIN | ⚠️ jsonb_array_elements() + JOIN |
| Index type | B-tree (cheap) | GIN (overkill at scale) | GIN (overkill at scale) |
| "What blocks task X?" | ✅ Single index lookup | ❌ Full-table scan | ❌ Full-table scan |
| "What does task X block?" | ✅ Single index lookup | ⚠️ ANY() scan | ⚠️ jsonb_path_query() |
| PostgreSQL recommendation | ✅ Preferred | ❌ "Sign of misdesign" | ❌ Not for relational edges |
Cycle detection happens at planning time in Python (via graphlib.TopologicalSorter.prepare()) before rows are inserted. The CHECK (task_id != depends_on_id) constraint catches self-references at the DB level; multi-node cycles are caught by the topological sort.
5.5 Dependency Resolution Algorithm
We use Python's graphlib.TopologicalSorter (stdlib since 3.9), which implements Kahn's algorithm internally with incremental update support.
At planning time — validate the DAG:
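A sketch of the planning-time check, assuming the plan is represented as a mapping from `temp_id` to its dependencies:

```python
from graphlib import TopologicalSorter, CycleError

def validate_plan(tasks):
    """Validate the planned DAG before any rows are inserted.

    `tasks` maps temp_id -> list of temp_ids it depends on.
    Raises ValueError naming the cycle on a circular dependency.
    """
    ts = TopologicalSorter()
    for temp_id, deps in tasks.items():
        ts.add(temp_id, *deps)
    try:
        ts.prepare()  # runs cycle detection (Kahn's algorithm)
    except CycleError as e:
        cycle = " -> ".join(e.args[1])
        raise ValueError(f"circular dependency: {cycle}") from e
    return ts
```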
At runtime — find ready tasks and react to completions:
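The runtime half can be rebuilt from DB state on every tick. A sketch under the `all_success` rule (loading of edges and completed-task IDs from the tables is elided):

```python
from graphlib import TopologicalSorter

def rebuild_resolver(edges, completed):
    """Rebuild dependency state from DB rows (crash-safe by construction).

    edges: {task_id: set of upstream task_ids} from the join table.
    completed: task_ids already in a terminal success state.
    Returns (ready, sorter): tasks dispatchable right now, plus the
    sorter for incremental updates (call sorter.done(t) as tasks finish).
    """
    sorter = TopologicalSorter(edges)
    sorter.prepare()
    done, ready, progress = set(completed), [], True
    while progress:
        progress = False
        for node in sorter.get_ready():
            if node in done:
                sorter.done(node)   # replay prior completions
                progress = True
            else:
                ready.append(node)  # newly dispatchable
    return ready, sorter
```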
Crash safety: The resolver is reconstructed from DB state on every coordinator tick (same pattern as Airflow's DB-authoritative scheduling). No in-memory state survives across restarts. The coordinator queries:
Then builds the resolver with completed tasks excluded. This is O(N) where N = number of tasks in the mission (5-50). Rebuilding from scratch on every tick is trivially fast at this scale.
Edge cases:
| Edge case | Behavior |
|---|---|
| Empty mission (0 tasks) | get_ready() returns empty tuple. is_complete() returns True immediately. |
| Single task, no deps | Task is immediately ready. |
| Fully parallel (no deps) | All tasks returned by first get_ready(). |
| Diamond (A→B, A→C, B→D, C→D) | A first, then B+C in parallel, then D after both complete. |
| Self-reference | Caught by DB constraint CHECK (task_id != depends_on_id). |
| Multi-node cycle | Caught by TopologicalSorter.prepare() at planning time. |
5.6 Trigger Rule Evaluation
The "find ready tasks" query combines dependency resolution with trigger rule evaluation. For all_success (the default and most common), a task is ready when all its upstream dependencies have state = 'completed'. Other rules evaluate different terminal state combinations.
In practice, the coordinator uses the Python DependencyResolver (Section 5.5) rather than this SQL for the common all_success case. The SQL version is provided for:
Reconciler/stall detection (runs on APScheduler, independent of coordinator)
Debugging ("why isn't this task running?")
Dashboard queries ("show me blocked tasks")
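For the default `all_success` rule, the query sketches as follows (table and column names follow Sections 5.2 and 5.4):

```sql
-- A pending task is ready when no upstream dependency
-- is in anything other than 'completed'.
SELECT t.id
FROM orchestration_tasks t
WHERE t.run_id = :run_id
  AND t.state = 'pending'
  AND NOT EXISTS (
      SELECT 1
      FROM orchestration_task_dependencies d
      JOIN orchestration_tasks up ON up.id = d.depends_on_id
      WHERE d.task_id = t.id
        AND up.state != 'completed'
  );
```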
Cascade states: When a task fails and downstream tasks have trigger_rule = 'all_success', the coordinator cascades them to skipped state. This is done in Python (loop over downstream tasks, check trigger rule, set state + emit event) rather than as a DB trigger — keeping side effects explicit and debuggable.
5.7 Indexes
Partial indexes on active states keep indexes compact. Terminal tasks accumulate over time but are rarely queried for scheduling.
5.8 Example: Creating Tasks from a Plan
After the coordinator generates a plan and the human approves, tasks are created in a single transaction:
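A sketch of the temp_id resolution step — the real version also inserts the dependency edges and board_tasks in the same transaction:

```python
import uuid

def materialize_plan(plan):
    """Turn plan JSONB tasks (temp_id / depends_on) into row dicts.

    Returns (task_rows, dep_rows) ready for bulk insert. Only a subset
    of the Section 5.2 columns is shown.
    """
    # Assign real UUIDs to replace planning-time temp_ids.
    id_map = {t["temp_id"]: uuid.uuid4() for t in plan["tasks"]}
    task_rows, dep_rows = [], []
    for seq, t in enumerate(plan["tasks"], start=1):
        task_rows.append({
            "id": id_map[t["temp_id"]],
            "sequence_number": seq,
            "title": t["title"],
            "state": "pending",
        })
        for dep in t.get("depends_on", []):
            dep_rows.append({
                "task_id": id_map[t["temp_id"]],
                "depends_on_id": id_map[dep],  # temp_id -> real UUID
            })
    return task_rows, dep_rows
```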
5.9 Board Task Mapping
Every orchestration_task creates a corresponding board_task for UI visibility. The mapping:
| orchestration_tasks field | board_tasks field | Notes |
|---|---|---|
| title | title | Direct copy |
| description | description | Direct copy |
| workspace_id | workspace_id | Same FK |
| agent_id | assigned_agent_id | Roster agent; NULL for contractors |
| — | source_type | Always 'orchestration' |
| run_id (as string) | source_id | Links board task back to mission |
| — | created_by_type | 'orchestration' |
| — | created_by_id | 'system' |
| state → mapped | status | See mapping below |
State → Board Status mapping:
| Orchestration state(s) | Board status |
|---|---|
| pending, queued, awaiting_retry | inbox |
| assigned | assigned |
| running, continuing | in_progress |
| verifying, awaiting_human | review |
| completed | done |
| failed, cancelled, skipped | done (with error_message set) |
The coordinator updates the board task status atomically with the orchestration task state change (same transaction as the dual-write event pattern from Section 3.1).
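The mapping in Section 5.9 is small enough to live as a plain dict next to `transition_task()`. A sketch (terminal states additionally set `completed_at` / `error_message` as noted in the table):

```python
# Orchestration state -> board_tasks.status, per the Section 5.9 table.
BOARD_STATUS = {
    "pending": "inbox", "queued": "inbox", "awaiting_retry": "inbox",
    "assigned": "assigned",
    "running": "in_progress", "continuing": "in_progress",
    "verifying": "review", "awaiting_human": "review",
    "completed": "done", "failed": "done",
    "cancelled": "done", "skipped": "done",
}
```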
5.10 Alembic Migration
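Abridged DDL for the dependencies join table this migration creates (see Section 5.4; the `orchestration_tasks` DDL follows the same shape as Section 4.7):

```sql
CREATE TABLE orchestration_task_dependencies (
    task_id         UUID NOT NULL
        REFERENCES orchestration_tasks(id) ON DELETE CASCADE,
    depends_on_id   UUID NOT NULL
        REFERENCES orchestration_tasks(id) ON DELETE CASCADE,
    dependency_type VARCHAR(20) NOT NULL DEFAULT 'data',
    PRIMARY KEY (task_id, depends_on_id),
    CHECK (task_id != depends_on_id)  -- self-reference guard
);
```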
5.11 Design Decisions
| Decision | Chosen | Rejected | Why |
|---|---|---|---|
| Dependency storage | Join table (orchestration_task_dependencies) | UUID[] array column, JSONB array | PostgreSQL docs recommend against arrays for relationship storage. Join table gives FK enforcement, B-tree indexes, edge metadata, and cleaner queries. At 5-50 tasks per mission, the extra JOIN is free. |
| Dependency direction | task_id depends on depends_on_id | Reverse (downstream_id, upstream_id) | "This task depends on that task" reads naturally. Matches graphlib.TopologicalSorter.add(node, *predecessors) convention. |
| Trigger rules | 4 rules (all_success, all_done, none_failed, always) | Full Airflow set (13 rules) | 4 rules cover all agent orchestration patterns. Adding more is a code change, not a schema change. YAGNI. |
| Output storage | output_summary (2000 chars) + output_ref (path) | Full output as TEXT column | All 5 studied systems avoid inline output storage. Large agent outputs (reports, code) go to workspace files or reports table. Summary is for dashboard display only. |
| Board task integration | FK board_task_id on orchestration_tasks | Reverse FK on board_tasks | orchestration_tasks owns the relationship. Board tasks are created first, then linked. SET NULL on board task deletion preserves orchestration history. |
| workspace_id denormalization | Denormalized on tasks | JOIN to runs | Avoids JOIN on every workspace-filtered task query. workspace_id is immutable — no consistency risk. |
| Cycle detection | Python graphlib.TopologicalSorter at planning time | DB trigger, recursive CTE | Cycles are a planning error, not a runtime condition. Detecting at planning time with a clear error message ("circular dependency: A → B → C → A") is better UX than a DB constraint error. |
| Dependency resolution at runtime | Python DependencyResolver rebuilt from DB state | Pure SQL query | Python resolver uses graphlib incremental updates (O(out-degree) per completion). SQL query is provided for reconciler/debugging. Both derive from the same DB state — crash-safe. |
| attempt_number vs continuation_count | Separate columns | Single attempts counter | Continuation (clean exit, 1s delay, same workspace) and retry (failure, backoff, fresh start) are fundamentally different operations (Symphony research). Conflating them makes it impossible to distinguish "agent needed 5 turns" from "agent failed 5 times." |
| Verifier score type | NUMERIC(3,2) | FLOAT, INTEGER (0-100) | Consistent with NUMERIC(10,6) for cost. Exact decimal arithmetic. 0.00–1.00 is a standard scoring range that avoids the "is 7/10 good or bad?" ambiguity of integer scales. |
6. Event Log & Audit Trail
The orchestration_events table is the append-only audit trail for every state change, decision, and notable occurrence in a mission's lifecycle. It is the second half of the dual-write pattern established in Section 2.4: every state transition writes the new state to the entity row (fast queries) AND appends an immutable event (complete history). Both writes occur in a single database transaction.
6.1 Design Philosophy
Lightweight event sourcing, not full event sourcing. We log events for observability, debugging, and telemetry — not for state reconstruction. The orchestration_runs and orchestration_tasks tables hold the authoritative current state. Events answer "what happened and when?" without being the source of truth for "what is the current state?"
This is the same trade-off Prefect makes with its flow_run_state / task_run_state history tables alongside denormalized state on the run row. It avoids the projection maintenance and snapshot management overhead of full event sourcing (Dagster/Temporal) while giving us a complete audit trail that pure CRUD (Airflow's mutable task_instance) cannot provide.
Why not just application logs? Structured events in PostgreSQL are queryable, joinable, and indexable. "Show me every failure in the last 24 hours with the failing agent and retry count" is a SQL query, not a log grep. Events also serve as the raw data feed for PRD-106 (Outcome Telemetry & Learning Foundation) — every event is a data point for pattern analysis.
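That example query sketches as follows (payload key names follow the Section 6.2 taxonomy):

```sql
-- Every failure in the last 24 hours, with reason and attempt count.
SELECT e.created_at,
       e.task_id,
       e.payload->>'reason'         AS reason,
       e.payload->>'total_attempts' AS attempts
FROM orchestration_events e
WHERE e.event_type IN ('task_failed', 'task_crashed')
  AND e.created_at > NOW() - INTERVAL '24 hours'
ORDER BY e.created_at DESC;
```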
Existing patterns in Automatos: The codebase already has three audit log tables that inform our design:
| Table | Shape | What we adopt |
|---|---|---|
| skill_audit_log (PRD-22) | action + action_details JSON + status + execution_time_ms | Structured action with flexible JSON payload; timing metadata |
| permission_audit_logs (PRD-17) | action + details JSON + user_id + timestamp | Actor identification; indexed action column |
| heartbeat_results | findings[] + actions_taken[] JSONB arrays + source_type/source_id | Append-only event accumulation; source attribution |
Our orchestration_events table combines the best of these: typed events (from audit logs), flexible JSONB payload (from all three), actor tracking (from permission logs), and append-only semantics (from heartbeat_results).
6.2 Event Type Taxonomy
Event types follow the {entity}_{lifecycle} naming convention from Temporal (e.g., ActivityTaskScheduled, ActivityTaskCompleted), adapted to our domain. Types are grouped by entity for readability but stored as a flat enum.
Run Events
| Event type | Trigger | Payload fields |
|---|---|---|
| run_created | Mission submitted by user or system | goal, config_summary, autonomy_level |
| run_planning_started | Coordinator begins decomposition | coordinator_model |
| run_plan_ready | Coordinator produces task plan | task_count, estimated_cost, strategy |
| run_approved | Human approves plan (or auto-approved) | approved_by, modifications (if human edited plan) |
| run_rejected | Human rejects plan | rejected_by, reason |
| run_started | First task begins execution | — |
| run_paused | Human pauses mission | paused_by, reason |
| run_resumed | Human resumes mission | resumed_by |
| run_budget_warning | Cost exceeds soft limit | current_cost, soft_limit, percent_used |
| run_budget_exceeded | Cost exceeds hard limit | current_cost, hard_limit |
| run_budget_increased | Human increases budget cap | old_limit, new_limit, increased_by |
| run_completed | All tasks terminal + success | total_cost, total_tokens, duration_ms, tasks_completed, tasks_failed |
| run_failed | Unrecoverable failure | reason, failing_task_id, total_cost |
| run_cancelled | Human cancels mission | cancelled_by, tasks_remaining |
Task Events
| Event type | Trigger | Payload fields |
|---|---|---|
| task_created | Coordinator creates task from plan | task_type, trigger_rule, depends_on (task IDs) |
| task_queued | Dependencies met, ready for assignment | unblocked_by (task ID that completed last) |
| task_assigned | Coordinator assigns agent | agent_id, agent_type, model, board_task_id |
| task_started | Agent begins execution | attempt_number |
| task_continuing | Agent exits cleanly, needs more turns | continuation_count, tokens_this_turn |
| task_resumed | Continuation dispatched | continuation_count |
| task_output_submitted | Agent submits result | output_ref, output_summary_length, tokens_used |
| task_verification_started | Verifier begins evaluation | verifier_agent_id, verifier_model |
| task_verification_passed | Verifier approves output | score, verifier_feedback |
| task_verification_failed | Verifier rejects output | score, verifier_feedback, retries_remaining |
| task_human_review_requested | Escalated to human | reason, score |
| task_human_approved | Human approves output | approved_by |
| task_human_rejected | Human rejects output | rejected_by, reason, retries_remaining |
| task_retrying | Retry scheduled after failure | attempt_number, backoff_seconds, failure_type (infrastructure or quality) |
| task_crashed | Infrastructure failure detected | error_type, error_message, duration_ms |
| task_failed | Max retries exhausted or unrecoverable | reason, total_attempts, total_cost |
| task_skipped | Dependency failed + trigger rule prevents execution | skipped_because, failed_dependency_id |
| task_cancelled | Parent run cancelled | cancelled_by |
System Events
| Event type | Trigger | Payload fields |
|---|---|---|
| stall_detected | Reconciler finds stalled task/run | entity_type, entity_id, stalled_state, stalled_since, action_taken |
| model_fallback | Primary model unavailable, falling back | task_id, requested_model, fallback_model, reason |
| cost_snapshot | Periodic cost aggregation | run_id, total_cost, total_tokens, by_task (breakdown) |
Why 30+ event types instead of a generic "state_changed"? Typed events enable:
- Targeted queries — `WHERE event_type = 'task_crashed'` is faster than `WHERE payload->>'type' = 'crash'`
- Payload validation — each event type has a known payload schema (enforceable via Pydantic on write)
- Downstream processing — PRD-106 telemetry can subscribe to specific event types without parsing payloads
- Dashboard widgets — "recent failures" widget queries `task_failed` + `task_crashed` directly
Why not Temporal's 59 event types? Temporal models internal execution machinery (workflow task scheduling, timer management, deterministic replay checkpoints). We don't have an execution replay engine — our events track business-level lifecycle changes visible to users and the coordinator.
6.3 Column Definitions
| Column | Type | Nullable | Default | Constraints | Description |
|---|---|---|---|---|---|
| id | BIGSERIAL | No | auto-increment | PK | Sequential event ID. BIGINT not UUID — events are high-volume append-only where sequential IDs are cheaper and provide natural ordering. |
| run_id | UUID | No | — | FK → orchestration_runs.id ON DELETE CASCADE | Which mission this event belongs to |
| task_id | UUID | Yes | NULL | FK → orchestration_tasks.id ON DELETE CASCADE | Which task (NULL for run-level events like run_created, run_completed) |
| event_type | VARCHAR(50) | No | — | — | Event type from taxonomy (Section 6.2). Indexed for filtering. |
| payload | JSONB | No | '{}' | — | Event-specific data. Schema varies by event_type (see payload fields in Section 6.2). |
| actor_type | VARCHAR(20) | No | 'system' | — | Who triggered the event: system, coordinator, agent, verifier, human, reconciler |
| actor_id | VARCHAR(255) | Yes | NULL | — | ID of the actor: agent ID, user ID (Clerk), or NULL for system-triggered events |
| created_at | TIMESTAMPTZ | No | NOW() | — | When the event occurred. Indexed for time-range queries. |
Why BIGSERIAL instead of UUID?

- Events are append-only, never referenced by ID from other tables — no need for globally-unique identifiers
- Sequential `BIGINT` is 8 bytes vs UUID's 16 — saves 8 bytes per row at potentially millions of rows
- B-tree indexes on `BIGINT` are more compact and cache-friendly
- Natural ordering: `ORDER BY id` = insertion order without consulting `created_at`
- Matches Dagster's `event_logs` (integer PK) and Airflow's `log` table (serial PK)
Why VARCHAR(50) instead of a PostgreSQL ENUM for event_type? Adding new event types to a PostgreSQL ENUM requires ALTER TYPE ... ADD VALUE — a DDL operation that can't be rolled back in a transaction. VARCHAR with application-layer validation (Pydantic enum) is simpler to evolve. Same rationale as the state columns on runs/tasks.
Why no workspace_id? Events always belong to a run, and runs have workspace_id. For workspace-filtered event queries, JOIN to orchestration_runs. This avoids denormalizing workspace_id onto every event row (unlike tasks, where the denormalization saves a frequent JOIN). Event queries are less frequent and typically already filtered by run_id.
6.4 Event Immutability Contract
Events are append-only. Once written, an event row is never updated or deleted by application code. This contract enables:
Trustworthy audit trail — "what happened" is never rewritten after the fact
Safe concurrent reads — no locking needed for event queries
Simple replication — append-only tables replicate cleanly to read replicas or analytics databases
PRD-106 compatibility — telemetry pipelines can process events exactly once with a high-water-mark cursor (last processed `id`)
Enforcement: No UPDATE or DELETE statements against orchestration_events in application code. The retention policy (Section 6.7) is the only mechanism that removes rows, and it operates at the DBA/cron level, not application level.
No updated_at column. Unlike runs and tasks, events have no mutable state. A single created_at timestamp is sufficient. Adding updated_at would signal that updates are expected — the opposite of our intent.
6.5 Event Creation Pattern
Events are created as a side effect of state transitions, inside the same database transaction. This pattern is already implemented in the transition_task() function from Section 3.9.
Integration with transition_task() (from Section 3.9):
The transition_task() function already emits events via OrchestrationEvent(...). The emit_event() helper standardizes this pattern with enum validation and consistent actor tracking. The existing `event_type=f"task_{to_state}"` line in Section 3.9 is replaced with explicit EventType enum values, because some state transitions emit events that don't map 1:1 to the state name (e.g., entering failed from running emits task_crashed, but entering failed from verifying emits task_failed).
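A minimal sketch of the emit_event() pattern. The EventType enum below is a subset of the Section 6.2 taxonomy, and the session handling is illustrative — in the real helper the row would be staged as an OrchestrationEvent ORM object inside the caller's transaction:

```python
import enum
from datetime import datetime, timezone
from typing import Any, Optional


class EventType(str, enum.Enum):
    """Illustrative subset of the Section 6.2 taxonomy."""
    TASK_FAILED = "task_failed"
    TASK_SKIPPED = "task_skipped"
    TASK_CANCELLED = "task_cancelled"
    STALL_DETECTED = "stall_detected"
    MODEL_FALLBACK = "model_fallback"
    COST_SNAPSHOT = "cost_snapshot"


def emit_event(
    session: Optional[Any],
    run_id: str,
    event_type,
    payload: Optional[dict] = None,
    task_id: Optional[str] = None,
    actor_type: str = "system",
    actor_id: Optional[str] = None,
) -> dict:
    """Validate and stage one event row inside the caller's transaction."""
    etype = EventType(event_type)  # raises ValueError for unknown event types
    row = {
        "run_id": run_id,
        "task_id": task_id,
        "event_type": etype.value,
        "payload": payload or {},
        "actor_type": actor_type,
        "actor_id": actor_id,
        "created_at": datetime.now(timezone.utc),
    }
    if session is not None:
        # Real code: session.add(OrchestrationEvent(**row)) — the caller commits,
        # so the event and the state change land atomically.
        session.add(row)
    return row
```

Because validation happens at the application layer, adding a new event type is a code change plus deploy, with no DDL migration (the VARCHAR(50) rationale from Section 6.3).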
Events that don't correspond to state transitions:
Not every event is a state change. Some events are emitted mid-state:

| Event | State (unchanged) | Purpose |
|---|---|---|
| `run_budget_warning` | running | Soft limit hit — informational, no state change |
| `task_verification_started` | verifying | Tracks when the verifier begins, not a state change |
| `stall_detected` | various | Reconciler observation before it acts |
| `model_fallback` | running | Model substitution during execution |
| `cost_snapshot` | running | Periodic aggregation |
6.6 Query Examples
The event table is designed for a handful of recurring query patterns: timeline reconstruction, failure analysis, performance metrics, retry analysis, and the PRD-106 telemetry feed.
Timeline Reconstruction
"Show me everything that happened in mission X, in order."
Failure Analysis
"Find all failures across my workspace in the last 24 hours."
Performance Metrics
"Calculate average time from task assignment to completion for each task type."
Retry Analysis
"Which tasks are retried most often, and what's the success rate after retry?"
PRD-106 Telemetry Feed
"Stream events since cursor for telemetry pipeline processing."
This cursor pattern (sequential BIGSERIAL ID as cursor) is why events use BIGSERIAL instead of UUID. The telemetry pipeline stores its last processed ID and polls for new events — no complex change-data-capture infrastructure needed.
6.7 Retention Policy
Events accumulate indefinitely if not managed. At our expected scale (10-50 events per mission, ~100 missions/day = ~2,500 events/day = ~1M events/year), storage is manageable but querying old events degrades performance without maintenance.
Three-tier retention strategy:
| Tier | Age | Storage | Use |
|---|---|---|---|
| Hot | 0–30 days | orchestration_events table (PostgreSQL) | Real-time queries, dashboard, debugging |
| Warm | 30–180 days | Same table, but excluded from partial indexes | Historical analysis, PRD-106 pattern mining |
| Cold | 180+ days | Archived (export to S3/object storage as JSONL, then DELETE) | Compliance/audit only, rare access |
Implementation approach: pg_cron + batched DELETE
At our projected volume (~1M events/year), table partitioning (pg_partman) is overkill. A simple scheduled cleanup job is sufficient:
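The cleanup job might look like the following sketch — the archive table name and batch size are illustrative, and the batching bounds lock time and WAL volume per run:

```sql
-- Nightly pg_cron job: move events older than 180 days to the archive table
WITH expired AS (
    DELETE FROM orchestration_events
    WHERE id IN (
        SELECT id FROM orchestration_events
        WHERE created_at < NOW() - INTERVAL '180 days'
        ORDER BY id
        LIMIT 10000
    )
    RETURNING *
)
INSERT INTO orchestration_events_archive
SELECT * FROM expired;
```

The DELETE ... RETURNING → INSERT pair runs in one transaction, so a row is never lost between the two tables.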
When to upgrade to partitioning: If event volume exceeds ~10M events/month (e.g., 100+ missions/day with 50+ events each), switch to PARTITION BY RANGE (created_at) with monthly partitions and pg_partman for automated partition management. The table schema supports this transition — created_at is already NOT NULL and indexed.
Archive table schema: Identical to orchestration_events but without foreign key constraints (the referenced runs/tasks may be deleted independently). Used only for compliance queries.
6.8 Indexes
Why BRIN for created_at? Events are insert-ordered and created_at correlates perfectly with physical row order. A BRIN index is ~1000x smaller than a B-tree for the same column on append-only tables. It supports time-range scans (retention cleanup, "last 24 hours" queries) efficiently. The tradeoff is slightly less precise than B-tree — acceptable for time-range filtering where exact row targeting isn't needed.
Why (run_id, id) instead of (run_id, created_at)? The id column (BIGSERIAL) provides insertion ordering identical to created_at but without timezone comparison overhead. For ORDER BY within a run, id is strictly monotonic — faster to sort and more compact in the index.
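Pulling the two rationale paragraphs together, the index set sketches as (names match the Section 7.6 summary; the composite `(run_id, id)` form follows the argument above):

```sql
CREATE INDEX IF NOT EXISTS ix_orch_events_run     ON orchestration_events (run_id, id);
CREATE INDEX IF NOT EXISTS ix_orch_events_task    ON orchestration_events (task_id)
    WHERE task_id IS NOT NULL;
CREATE INDEX IF NOT EXISTS ix_orch_events_type    ON orchestration_events (event_type);
CREATE INDEX IF NOT EXISTS ix_orch_events_created ON orchestration_events USING BRIN (created_at);
```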
6.9 SQLAlchemy Model
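A sketch of the model matching the Section 6.3 column definitions, assuming SQLAlchemy 1.4+ declarative style (attribute layout illustrative, not the shipped file):

```python
from sqlalchemy import BigInteger, Column, DateTime, ForeignKey, String, func, text
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class OrchestrationEvent(Base):
    __tablename__ = "orchestration_events"

    id = Column(BigInteger, primary_key=True, autoincrement=True)  # BIGSERIAL
    run_id = Column(
        UUID(as_uuid=True),
        ForeignKey("orchestration_runs.id", ondelete="CASCADE"),
        nullable=False,
    )
    task_id = Column(  # NULL for run-level events like run_created
        UUID(as_uuid=True),
        ForeignKey("orchestration_tasks.id", ondelete="CASCADE"),
        nullable=True,
    )
    event_type = Column(String(50), nullable=False)
    payload = Column(JSONB, nullable=False, server_default=text("'{}'::jsonb"))
    actor_type = Column(String(20), nullable=False, server_default=text("'system'"))
    actor_id = Column(String(255), nullable=True)
    created_at = Column(DateTime(timezone=True), nullable=False, server_default=func.now())
```

Note the deliberate absence of an updated_at column and of any ORM relationship back-population — events are written once and read in bulk.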
6.10 Alembic Migration
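Following the raw-SQL migration style described in Section 7.7.4, the upgrade body reduces to something like this sketch (not the shipped migration):

```sql
CREATE TABLE IF NOT EXISTS orchestration_events (
    id          BIGSERIAL PRIMARY KEY,
    run_id      UUID        NOT NULL REFERENCES orchestration_runs(id) ON DELETE CASCADE,
    task_id     UUID                 REFERENCES orchestration_tasks(id) ON DELETE CASCADE,
    event_type  VARCHAR(50) NOT NULL,
    payload     JSONB       NOT NULL DEFAULT '{}',
    actor_type  VARCHAR(20) NOT NULL DEFAULT 'system',
    actor_id    VARCHAR(255),
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

COMMENT ON TABLE orchestration_events IS
    'Append-only audit trail. Application code never UPDATEs or DELETEs rows (Section 6.4).';
```

The table comment is the documented enforcement mechanism for the immutability contract (Section 6.12).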
6.11 Connection to PRD-106 (Outcome Telemetry)
The orchestration_events table is the primary data source for PRD-106 (Outcome Telemetry & Learning Foundation). Every event is a structured data point that the telemetry pipeline can process for pattern analysis.
What PRD-106 will extract from events:
| Metric | Source events | Question answered |
|---|---|---|
| Agent effectiveness | task_assigned + task_verification_passed/task_failed | Which agents succeed at which task types? |
| Model cost-effectiveness | task_assigned (model) + cost_snapshot | Does claude-sonnet at 10x the cost produce measurably better results than haiku? |
| Retry patterns | task_retrying + task_crashed | Which failure types are transient (worth retrying) vs persistent? |
| Task duration distribution | task_started + task_verification_passed | How long do different task types actually take? |
| Verification accuracy | task_verification_passed + task_human_approved/task_human_rejected | Does the verifier's judgment match human judgment? |
| Mission bottlenecks | Timeline reconstruction per run | Which tasks consistently delay mission completion? |
| Budget accuracy | run_plan_ready (estimated_cost) + run_completed (total_cost) | How accurate are the coordinator's cost estimates? |
Design constraint for PRD-106: The event schema must remain stable — adding new event types is fine, but changing existing payload schemas or event type names breaks downstream telemetry queries. New payload fields should be additive (never remove or rename existing fields).
6.12 Design Decisions
| Decision | Chosen | Rejected | Rationale |
|---|---|---|---|
| Primary key | BIGSERIAL | UUID | Events are high-volume, append-only, never externally referenced. Sequential integers are 8 bytes vs 16, more compact in indexes, and provide natural ordering. Matches Dagster (event_logs) and Airflow (log) patterns. |
| Event type storage | VARCHAR(50) with Python enum | PostgreSQL ENUM type | ALTER TYPE ADD VALUE can't be rolled back in a transaction. VARCHAR with application-layer validation is simpler to evolve. New event types don't need a migration. |
| Payload structure | Flat JSONB column | Typed columns per event | 30+ event types × 3-5 fields each = 100+ columns. JSONB keeps the table lean. Payload schema is validated by Pydantic on write (same pattern as orchestration_runs.config). |
| Actor tracking | actor_type + actor_id | Single actor string | Separating type from ID enables queries like "find all human actions" (WHERE actor_type = 'human') without parsing a composite string. |
| Workspace filtering | JOIN to orchestration_runs | Denormalize workspace_id | Unlike tasks (which need workspace filtering on every list query), event queries are typically scoped to a run_id. Denormalizing workspace_id onto every event row wastes 16 bytes/row with minimal query benefit. |
| Retention | pg_cron batched DELETE + archive table | pg_partman, application-layer TTL | At ~1M events/year, partitioning overhead isn't justified. Batched DELETE with RETURNING → archive is simple, transactional, and handles our scale. Upgrade path to pg_partman is documented if volume grows 10x. |
| Archive strategy | Separate table (same schema, no FKs) | S3 export only | Keeping archived events in PostgreSQL enables historical queries without object storage tooling. The archive table has minimal indexes (run_id + created_at only) to reduce write overhead. |
| created_at index | BRIN | B-tree | Append-only data with correlated physical order is the ideal BRIN use case — ~1000x smaller than B-tree with comparable query performance for range scans. |
| No workspace_id column | JOIN for workspace queries | Denormalize | Events are queried by run_id (detail view) or event_type (alerts). Workspace-level event queries are rare and tolerate a JOIN. Saves 16 bytes × millions of rows. |
| Immutability enforcement | Application convention + table comment | DB trigger blocking UPDATE/DELETE | A trigger adds evaluation overhead on every INSERT. Convention is sufficient when all writes go through emit_event(). The table comment documents the contract. |
7. Integration with Existing Schema
The orchestration tables don't exist in isolation — they must integrate cleanly with four existing systems: the Kanban board (board_tasks), agent reports (agent_reports), recipes (workflow_recipes), and workspace isolation. This section defines every integration point, the data flow between tables, and the migration strategy that adds three new tables without breaking anything.
7.1 Entity Relationship Diagram
7.2 Board Task Integration
Every orchestration run creates one parent board task (the mission), and every orchestration task creates one child board task (linked via parent_task_id). This gives the existing Kanban UI mission visibility without new frontend components.
7.2.1 Source Type Values
The board_tasks.source_type column currently holds two values:
| Value | Created by | Usage |
|---|---|---|
| user (default) | Manual creation via API | Most tasks |
| recipe | board_task_bridge.py during recipe execution | Recipe-triggered tasks |
Orchestration adds two new values:
| Value | Created by | Usage |
|---|---|---|
| orchestration | Coordinator on mission start | One per orchestration_runs row — the parent task |
| orchestration_task | Coordinator during task planning | One per orchestration_tasks row — child tasks |
No migration needed — source_type is VARCHAR(30), not a database enum.
7.2.2 Field Mapping: orchestration_runs → board_tasks (Parent)
| board_tasks field | Source | Example |
|---|---|---|
| workspace_id | orchestration_runs.workspace_id | 550e8400-... |
| title | "Mission: " + orchestration_runs.title | "Mission: EU AI Act Compliance" |
| description | orchestration_runs.goal (verbatim user input) | "Research EU AI Act compliance for our product" |
| status | Mapped from orchestration_runs.state (see Section 3.8 pattern) | in_progress |
| priority | orchestration_runs.config.priority or 'high' | high |
| review_mode | 'manual' (missions always need human review) | manual |
| assigned_agent_id | orchestration_runs.coordinator_agent_id | 42 or NULL |
| created_by_type | 'orchestration' | orchestration |
| created_by_id | orchestration_runs.created_by | user_clerk_abc |
| parent_task_id | NULL (this IS the parent) | NULL |
| source_type | 'orchestration' | orchestration |
| source_id | str(orchestration_runs.id) | "a1b2c3d4-..." |
| tags | ['mission', orchestration_runs.state] | ['mission', 'running'] |
| planning_data | See below | JSONB |
| started_at | orchestration_runs.started_at | Timestamp |
planning_data JSONB for mission parent:
7.2.3 Field Mapping: orchestration_tasks → board_tasks (Child)
| board_tasks field | Source | Example |
|---|---|---|
| workspace_id | orchestration_tasks.workspace_id | 550e8400-... |
| title | orchestration_tasks.title | "Research EU AI Act requirements" |
| description | orchestration_tasks.description | Detailed task instructions |
| status | Mapped from orchestration_tasks.state (Section 3.8) | assigned |
| priority | Inherited from parent or 'medium' | medium |
| review_mode | 'auto' unless task has success_criteria → 'manual' | auto |
| assigned_agent_id | orchestration_tasks.agent_id | 42 |
| created_by_type | 'orchestration' | orchestration |
| parent_task_id | Parent mission's board_tasks.id | 1234 |
| source_type | 'orchestration_task' | orchestration_task |
| source_id | str(orchestration_tasks.id) | "e5f6g7h8-..." |
| tags | ['mission_task', orchestration_tasks.task_type] | ['mission_task', 'research'] |
| planning_data | See below | JSONB |
| result | orchestration_tasks.output_summary (≤2000 chars, truncated to 4000 by board) | Summary text |
| error_message | orchestration_tasks.error_message | Failure reason |
planning_data JSONB for task child:
7.2.4 State Synchronization
Board task status is updated as a side effect of transition_task() and transition_run() — in the same database transaction as the state change and event emission (the dual-write from Section 3.3). This is not a separate sync job; it's atomic.
7.2.5 Board Task Bridge Functions
Following the pattern in orchestrator/services/board_task_bridge.py, add these functions to the same file (or a new orchestration_board_bridge.py):
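One of those bridge functions, sketched as a pure field-mapper per the Section 7.2.2 table. The function name and dict shape are illustrative; status is deliberately omitted because it flows through the Section 3.8 state mapping:

```python
def build_mission_board_task(run: dict) -> dict:
    """Parent board task fields for an orchestration run (Section 7.2.2)."""
    return {
        "workspace_id": run["workspace_id"],
        "title": "Mission: " + run["title"],
        "description": run["goal"],  # verbatim user input
        "priority": run.get("config", {}).get("priority", "high"),
        "review_mode": "manual",     # missions always need human review
        "assigned_agent_id": run.get("coordinator_agent_id"),
        "created_by_type": "orchestration",
        "created_by_id": run["created_by"],
        "parent_task_id": None,      # this IS the parent
        "source_type": "orchestration",
        "source_id": str(run["id"]),
        "tags": ["mission", run["state"]],
        "started_at": run.get("started_at"),
    }
```

Keeping the mapping a pure function makes it trivially unit-testable, while the actual INSERT/UPSERT happens inside the same transaction as the state transition (Section 7.2.4).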
7.2.6 New Indexes on board_tasks
Two partial unique indexes prevent duplicate board tasks for the same orchestration entity:
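Sketched in DDL (index names from Section 7.6; CONCURRENTLY because board_tasks is a live table, per Section 7.7.2):

```sql
CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS uq_board_tasks_orchestration_run
    ON board_tasks (source_id) WHERE source_type = 'orchestration';

CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS uq_board_tasks_orchestration_task
    ON board_tasks (source_id) WHERE source_type = 'orchestration_task';
```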
These follow the existing partial unique index pattern established in PRD-72.
7.2.7 parent_task_id Usage
The board_tasks.parent_task_id column exists today (INTEGER FK → board_tasks.id ON DELETE SET NULL) but is unused by recipes or manual tasks. The API accepts it (POST /api/v1/tasks body, GET /api/v1/tasks?parent_task_id=N filter) but no feature populates it.
Orchestration is the first consumer: every task-level board task has parent_task_id pointing to the mission-level board task. This enables:
- Dashboard query: "show all tasks in this mission" → `WHERE parent_task_id = :mission_board_id`
- Tree rendering: parent + children hierarchy in the Kanban UI
- Cascade behavior: already defined as `ON DELETE SET NULL` — if the mission board task is manually deleted, child tasks become orphans (acceptable; orchestration tables are the source of truth)
7.3 Agent Reports Integration
PRD-76 established the agent_reports table for structured report metadata with workspace file storage. Orchestration task completion should auto-generate reports using the same system.
7.3.1 New FK on agent_reports
Add one nullable column:
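A sketch of the DDL (index name from Section 7.6):

```sql
ALTER TABLE agent_reports
    ADD COLUMN IF NOT EXISTS orchestration_task_id UUID
        REFERENCES orchestration_tasks(id) ON DELETE SET NULL;

CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_agent_reports_orch_task
    ON agent_reports (orchestration_task_id)
    WHERE orchestration_task_id IS NOT NULL;
```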
This parallels the existing heartbeat_result_id FK — both are optional context references that link a report to its trigger:
| Column | References | Context |
|---|---|---|
| heartbeat_result_id | heartbeat_results.id | Report created during heartbeat tick |
| orchestration_task_id | orchestration_tasks.id | Report created on task completion |
Both can be NULL (standalone report). Both use ON DELETE SET NULL (report survives source deletion).
7.3.2 Report Creation Flow
When an orchestration task reaches a terminal state with output, a report is auto-created via the existing ReportService.create_report().
7.3.3 Report Type Mapping
Existing report_type values: standup, research, incident, summary, delivery, audit.
Orchestration task types map to report types:
| Task Type (orchestration_tasks.task_type) | Report Type | Rationale |
|---|---|---|
| research | research | Direct mapping |
| analysis | research | Analysis is a form of research |
| writing | delivery | Writing produces a deliverable |
| coding | delivery | Code output is a deliverable |
| verification | audit | Verification is quality audit |
| review | audit | Review is quality audit |
| synthesis | summary | Synthesis produces a summary |
| other | delivery | Default to deliverable |
For failed tasks, report_type is always incident (regardless of task type) to surface failures in the incident filter.
No new report_type enum values are needed — the existing six cover all orchestration task types.
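The mapping is mechanical enough to pin down in code; a sketch (function and constant names are illustrative):

```python
# Section 7.3.3 mapping, verbatim
REPORT_TYPE_BY_TASK_TYPE = {
    "research": "research",
    "analysis": "research",
    "writing": "delivery",
    "coding": "delivery",
    "verification": "audit",
    "review": "audit",
    "synthesis": "summary",
    "other": "delivery",
}


def report_type_for(task_type: str, failed: bool = False) -> str:
    """Pick the agent_reports.report_type for an orchestration task."""
    if failed:
        return "incident"  # failures always surface in the incident filter
    return REPORT_TYPE_BY_TASK_TYPE.get(task_type, "delivery")
```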
7.3.4 Mission-Level Summary Report
When a run reaches a terminal state (completed, failed, cancelled), the coordinator creates one summary report for the entire mission.
7.4 Recipe Integration ("Save as Routine")
A successfully completed mission can be converted into a repeatable recipe (workflow_recipes row). This is the "Save as Routine?" button from PRD-100 Section 3.
7.4.1 Conversion Flow
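In outline, the conversion applies the Section 7.4.2/7.4.3 field mappings to a completed run. A sketch, with illustrative dict shapes rather than the real ORM models (template_id generation and dependency flattening omitted for brevity):

```python
def convert_run_to_recipe(run: dict, tasks: list) -> dict:
    """Snapshot a completed mission as a workflow_recipes row (Section 7.4)."""
    steps = []
    ordered = sorted(tasks, key=lambda item: item["sequence_number"])
    for order, task in enumerate(ordered, start=1):
        steps.append({
            "order": order,
            # title + description combine into the agent instructions
            "prompt_template": f"{task['title']}\n\n{task['description']}",
            # contractor (ephemeral) agents don't survive conversion
            "agent_id": None if task.get("is_contractor") else task["agent_id"],
            "task_type": task["task_type"],
        })
    return {
        "name": run["title"],
        "description": run["goal"],
        "workspace_id": run["workspace_id"],
        "created_by": run["created_by"],
        "owner_type": "workspace",
        "tags": ["from_mission", "auto_generated"],
        "steps": steps,
        "execution_config": {
            "mode": run["plan"]["strategy"],
            "budget": run["config"].get("budget"),
        },
    }
```

The resulting dict is a snapshot, not a parameterized template — per Section 7.4.4, the user edits it afterward to add `{{variable}}` placeholders.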
7.4.2 Field Mapping: orchestration_runs → workflow_recipes
| orchestration_runs field | workflow_recipes field | Notes |
|---|---|---|
| title | name | Direct copy |
| goal | description | Direct copy (user's original intent) |
| id | — (stored in template_definition.originated_from) | Lineage tracking |
| plan.strategy | execution_config.mode | "sequential" → "sequential", "parallel" → "parallel" |
| plan.tasks | steps | See task→step mapping below |
| config.budget | execution_config.budget | Copy budget config |
| config.model_preferences | Per-step model config | Distributed to steps |
| config.retry | execution_config.max_retries | Copy retry policy |
| workspace_id | workspace_id | Direct copy |
| created_by | created_by | Direct copy |
| — | template_id | Auto-generated: "mission-{run_id[:8]}-{timestamp}" |
| — | owner_type | 'workspace' |
| — | tags | ['from_mission', 'auto_generated'] |
7.4.3 Field Mapping: orchestration_tasks → recipe steps
Each orchestration task becomes a recipe step. Dependencies are flattened to pass_to references:
| orchestration_tasks field | Recipe step field | Notes |
|---|---|---|
| sequence_number | order | Direct mapping |
| title + description | prompt_template | Combined into agent instructions |
| agent_id | agent_id | Roster agent preserved; contractor agents replaced with null (user must assign) |
| model_override | model_override | Preserved if set |
| task_type | task_type | New field on steps (not in current recipe schema) |
| tools_requested | tools_needed | Hint list preserved |
| success_criteria | Embedded in prompt_template | Injected as template variable |
| Dependencies (join table) | depends_on array | Simplified to step_id references |
| trigger_rule | error_handling.on_failure | all_success → "stop", all_done → "continue", always → "always_run" |
7.4.4 Limitations
- Contractor agents don't survive conversion. Ephemeral agents are mission-scoped — the recipe stores `agent_id: null` with a note that the user must assign a roster agent or let the coordinator pick one.
- DAG → sequence flattening. Complex parallel DAGs are topologically sorted into a linear sequence for the recipe's `order` field. Parallel execution is noted in `execution_config.mode: "parallel"` but individual step parallelism is lost. This is acceptable — recipes are simpler than missions by design.
- No input parameterization. The first conversion is a snapshot — the user's original goal is hardcoded in the description. To make the recipe truly reusable, the user must edit the steps to add `{{variable}}` placeholders in the `prompt_template` fields and define a corresponding `inputs` schema. This is a manual step, not automated.
7.4.5 execution_config JSONB for Mission-Derived Recipes
The originated_from block preserves lineage — when the recipe runs, the system can compare actual performance against the originating mission to detect drift.
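A plausible shape for the block. Only mode, budget, max_retries, and the presence of an originated_from lineage block are grounded in Section 7.4.2; the field names inside originated_from and all values are illustrative:

```json
{
  "mode": "sequential",
  "budget": { "max_cost_usd": 5.0 },
  "max_retries": 2,
  "originated_from": {
    "run_id": "a1b2c3d4-...",
    "completed_at": "2026-03-10T14:02:11Z",
    "actual_cost_usd": 3.72,
    "task_count": 6
  }
}
```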
7.5 Workspace Isolation
All orchestration queries MUST filter by workspace_id. This is enforced at the application layer (no PostgreSQL RLS), matching the pattern used by every other table in the platform.
7.5.1 Auth Pattern
Workspace ID is resolved from the request via hybrid.py, in precedence order:

1. Header: `X-Workspace-ID` (preferred)
2. Header: `X-Workspace` (fallback)
3. Query param: `workspace_id`
4. Environment: `WORKSPACE_ID` / `DEFAULT_WORKSPACE_ID`
The _user_has_workspace_access() function validates the user owns or is a member of the requested workspace, preventing X-Workspace-ID spoofing.
7.5.2 Query Pattern
Every API endpoint follows this pattern (from board_tasks.py):
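Reduced to SQL for illustration (the actual code builds these through SQLAlchemy):

```sql
-- List view: always workspace-scoped
SELECT * FROM orchestration_runs
WHERE workspace_id = :workspace_id
ORDER BY created_at DESC
LIMIT 50;

-- Single-record lookup: the id alone is never enough
SELECT * FROM orchestration_runs
WHERE id = :run_id AND workspace_id = :workspace_id;
```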
Critical: Never filter on id alone — always include workspace_id in the WHERE clause for single-record lookups. This prevents a user in workspace A from accessing workspace B's missions by guessing UUIDs.
7.5.3 Denormalized workspace_id on orchestration_tasks
As noted in Section 5.1, orchestration_tasks denormalizes workspace_id (copied from the parent run) to avoid a JOIN on the most common query: "show all tasks for my workspace." The denormalization is safe because:

- workspace_id never changes on a run
- The coordinator copies it at task creation time
- The FK constraint on `orchestration_runs.workspace_id` ensures the workspace exists
7.5.4 Cascade Delete
All orchestration tables use ON DELETE CASCADE for the workspace FK:
When a workspace is deleted, all its runs, tasks, dependencies, events, board tasks, and reports are cascade-deleted. This matches the pattern on board_tasks, agent_reports, workflow_recipes, and every other workspace-scoped table.
7.5.5 No Admin Override Needed
Unlike agents and skills (which have marketplace items visible across workspaces), orchestration is strictly workspace-scoped. There is no admin_all_workspaces bypass for mission data — missions belong to exactly one workspace, always.
7.6 Index Summary
All indexes across the orchestration tables and their integration points:
New Tables (from Sections 4-6)

| Table | Index | Type | Purpose |
|---|---|---|---|
| orchestration_runs | ix_orch_runs_ws_created | B-tree (workspace_id, created_at DESC) | Dashboard timeline: "my recent missions" |
| orchestration_runs | ix_orch_runs_active | Partial B-tree (workspace_id) WHERE state_type IN ('scheduled','running') | Reconciler: "find active runs" |
| orchestration_runs | ix_orch_runs_created_by | B-tree (created_by) | "My missions" filter |
| orchestration_tasks | ix_orch_tasks_run | B-tree (run_id) | Task list for a mission |
| orchestration_tasks | ix_orch_tasks_ws_state | B-tree (workspace_id, state) | Dashboard: "tasks by status" |
| orchestration_tasks | ix_orch_tasks_agent | B-tree (agent_id) WHERE agent_id IS NOT NULL | "Tasks assigned to this agent" |
| orchestration_tasks | ix_orch_tasks_board | B-tree (board_task_id) WHERE board_task_id IS NOT NULL | Reverse lookup: board → orchestration |
| orchestration_task_dependencies | PK covers (task_id, depends_on_id) | B-tree (composite PK) | Dependency lookup in both directions |
| orchestration_task_dependencies | ix_orch_deps_reverse | B-tree (depends_on_id) | "What depends on this task?" (downstream lookup) |
| orchestration_events | ix_orch_events_run | B-tree (run_id) | Event timeline for a mission |
| orchestration_events | ix_orch_events_task | B-tree (task_id) WHERE task_id IS NOT NULL | Events for a specific task |
| orchestration_events | ix_orch_events_created | BRIN (created_at) | Time-range queries on append-only data |
| orchestration_events | ix_orch_events_type | B-tree (event_type) | Filter by event type |

Existing Tables (new indexes)

| Table | Index | Type | Purpose |
|---|---|---|---|
| board_tasks | uq_board_tasks_orchestration_run | Unique partial (source_id) WHERE source_type = 'orchestration' | Idempotent mission board task creation |
| board_tasks | uq_board_tasks_orchestration_task | Unique partial (source_id) WHERE source_type = 'orchestration_task' | Idempotent task board task creation |
| agent_reports | ix_agent_reports_orch_task | B-tree (orchestration_task_id) WHERE orchestration_task_id IS NOT NULL | "Reports from this task" |
7.7 Migration Safety
The orchestration migration creates only new tables and adds optional columns/indexes to existing tables. No existing columns are modified or removed.
7.7.1 Migration Strategy
One migration file: prd101_orchestration_tables.py (following the PRD-prefixed naming convention from recent migrations like prd76_agent_reports.py, prd77_agent_scheduled_tasks.py).
Standalone migration (down_revision = None) — same pattern as prd76, prd77, prd79. No dependency on other migrations. Safe to run against any database state.
7.7.2 Safety Guarantees
| Scenario | Safeguard |
|---|---|
| Table already exists (re-run) | CREATE TABLE IF NOT EXISTS on all tables |
| Index already exists | CREATE INDEX IF NOT EXISTS on all indexes |
| FK target doesn't exist | Tables created in dependency order: runs → tasks → dependencies → events |
| Existing board_tasks rows | No column changes to board_tasks — only new indexes added |
| Existing agent_reports rows | New orchestration_task_id column is NULL by default — no backfill needed |
| Downgrade safety | DROP TABLE IF EXISTS in reverse dependency order: events → dependencies → tasks → runs |
| Production lock time | CREATE INDEX CONCURRENTLY for indexes on existing tables (board_tasks, agent_reports) to avoid locking |
7.7.3 Changes to Existing Tables
| Table | Change | Risk | Mitigation |
|---|---|---|---|
| board_tasks | 2 new partial unique indexes | Low — additive only | Use CREATE INDEX CONCURRENTLY to avoid table lock |
| agent_reports | 1 new nullable column + 1 partial index | Low — nullable column, no default | ALTER TABLE ADD COLUMN IF NOT EXISTS is fast (metadata-only for nullable columns in PostgreSQL) |
No changes to agents, workspaces, workflow_recipes, or any other existing table. The workflow_recipes table gains no new columns — the "save as routine" flow creates a new row using existing columns, with mission lineage stored in the execution_config JSONB.
7.7.4 Raw SQL Style
Following the codebase convention (prd76, prd79), the migration uses op.execute() with raw SQL rather than op.create_table().
7.8 Design Decisions Summary
| Decision | Chosen | Rejected | Rationale |
|---|---|---|---|
| Board task granularity | 1 parent (mission) + N children (tasks) | 1 flat board task per mission | Children use existing parent_task_id FK for hierarchy. Individual task visibility matches PRD-100's "every task visible on board" requirement. |
| Board task sync | Side effect in transition_task() | Separate sync job / CDC | Same-transaction update is atomic and simple. No eventual consistency lag. Matches recipe bridge pattern. |
| source_type values | orchestration + orchestration_task | Single orchestration value for both | Separate values enable "show only missions" vs "show only tasks" filters. Partial unique indexes need distinct values. |
| Report auto-creation | On terminal state in transition_task() | Agent calls platform_submit_report explicitly | Auto-creation ensures every task gets a report. Agent can still call platform_submit_report for richer content — the auto-report is a fallback. |
| Report FK | New orchestration_task_id on agent_reports | Reuse heartbeat_result_id | Explicit FK enables "reports for this mission" queries. Heartbeat and orchestration are separate execution contexts. |
| Recipe conversion | Snapshot with manual editing | Parameterized template auto-generation | First version is a snapshot — auto-parameterization requires prompt analysis that's out of scope for PRD-101. Users edit the recipe to add variables. |
| Workspace isolation | Application-layer filtering only | PostgreSQL RLS | Matches existing pattern across all tables. RLS would be a platform-wide decision, not per-feature. |
| Migration strategy | Standalone (down_revision = None) | Chained to previous migration | Standalone migrations are the recent convention (prd76+). Avoids merge head conflicts. |
| Existing table changes | Additive only (new columns, indexes) | Modify existing columns | Zero risk to existing functionality. No backfill migrations. No downtime. |
8. Open Questions
These are unresolved design decisions surfaced during research. Each needs input before or during implementation (PRD-82A).
8.1 Schema Design Questions
**Q1 — Should orchestration_events use BIGSERIAL or UUID for its PK?**
Context: BIGSERIAL is more natural for append-only tables (monotonic, BRIN-friendly, smaller indexes). UUID matches every other table in the codebase.
Options: (a) BIGSERIAL — optimal for append-only workload; (b) UUID — consistency with codebase.
Recommendation: BIGSERIAL — event tables are different from entity tables. Append-only semantics favor monotonic keys. BRIN indexes on BIGSERIAL are ~100x smaller than B-tree on UUID.

**Q2 — Should the plan JSONB on orchestration_runs be immutable after the planning phase?**
Context: Currently the design allows plan mutation during execution (e.g., the coordinator adds tasks dynamically). But immutable plans are easier to reason about and debug.
Options: (a) Immutable after RUNNING — changes create a new plan version; (b) Mutable — coordinator updates in place; (c) Append-only plan history (array of versions).
Recommendation: Mutable for v1 — dynamic replanning is a Phase 2 feature (PRD-102) and immutability would block it. Revisit when PRD-102 defines coordinator behavior.

**Q3 — How should large task outputs be stored?**
Context: Section 5.1 says outputs go to output_ref (workspace file path or report ID). But the workspace file system is per-agent, and missions span multiple agents.
Options: (a) Workspace files under a /missions/{run_id}/ directory; (b) Agent reports via platform_submit_report; (c) S3 directly; (d) JSONB column with a size limit.
Recommendation: (a) or (b) — workspace files for raw output, agent reports for structured results. PRD-76 already built the report pipeline. Avoid S3 for v1 complexity.

**Q4 — Should orchestration_task_dependencies support weighted edges?**
Context: The current design has dependency_type (strict/soft). Weighted edges could express "80% confidence this dependency is needed" for AI-planned graphs.
Options: (a) Keep simple — strict/soft only; (b) Add a weight FLOAT column.
Recommendation: (a) — YAGNI. AI-planned dependency confidence is a PRD-102 concern. The schema can add a column later without migration risk.
8.2 Integration Questions
**Q5 — Should board task creation be synchronous (in transition_task) or async (event-driven)?**
Context: Section 7.2.5 defines synchronous board task creation inside transition_task(). This couples orchestration to board logic. An event-driven approach (listening to orchestration_events) decouples them but adds eventual consistency.
Options: (a) Synchronous — same transaction, guaranteed consistency; (b) Async — an event consumer creates board tasks.
Recommendation: (a) for v1 — the board task bridge is 3 functions and matches the existing board_task_bridge.py pattern from recipes. Decoupling adds complexity without benefit at our scale.

**Q6 — Should recipe conversion (Section 7.4) preserve agent assignments or parameterize them?**
Context: A mission assigns specific roster agents. When converted to a recipe, should it lock to those agents or use role-based placeholders?
Options: (a) Snapshot — lock agent IDs; (b) Parameterize — convert to role placeholders like {researcher}; (c) Snapshot with manual editing.
Recommendation: (c) — snapshot first, let users edit. Auto-parameterization requires prompt analysis that's out of scope.

**Q7 — Should workspace_id on orchestration_tasks be denormalized?**
Context: Section 7.5.3 argues for denormalization (avoids a JOIN to orchestration_runs on every task query). But it's redundant data.
Decision: Already decided — yes, denormalize.
Rationale: Every task query needs workspace scoping. The JOIN cost is small but avoidable, and the pattern matches board_tasks, which also denormalizes workspace_id.
8.3 Operational Questions
Q8
What's the event retention period?
Section 6.7 proposes 90 days hot + archive. But we don't have an archive mechanism today.
(a) 90 days, then DELETE (b) 90 days, then archive to S3/cold storage (c) Keep everything (small scale)
(c) for v1 — at <1000 missions/month, event volume is negligible. Revisit when PRD-106 telemetry analysis defines data retention needs.
Q9
Should the stall detection reconciler be a new service or extend task_reconciler.py?
Section 3.7 proposes extending the existing reconciler. But orchestration stall detection has different logic than recipe task reconciliation.
(a) Extend task_reconciler.py — add orchestration-specific tick (b) New orchestration_reconciler.py — separate concerns
(b) — separate file, same APScheduler infrastructure. The reconciliation logic is different enough to warrant its own module. Register it as a new scheduled tick alongside the existing one.
Q10
Should we add a tags JSONB column or a separate orchestration_run_tags table?
Context: Section 2.4 notes Dagster uses a run_tags table for filterable metadata. JSONB is simpler but harder to index for arbitrary key lookups.
Options: (a) JSONB column on orchestration_runs. (b) Separate tags table with (run_id, key, value).
Decision: (a) for v1 — a GIN index on the JSONB column handles our query patterns. A separate table is warranted only if we need cross-run tag aggregation queries, which is a PRD-106 concern.
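A sketch of the v1 approach, with an assumed index name: a GIN index on the JSONB tags column supports containment lookups on arbitrary keys.

```sql
-- Sketch (index name assumed): GIN index for arbitrary-key tag lookups
CREATE INDEX IF NOT EXISTS ix_orchestration_runs_tags
    ON orchestration_runs USING gin (tags);

-- A containment query this index accelerates:
SELECT id FROM orchestration_runs
WHERE tags @> '{"team": "compliance"}';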
9. Risk Register
R1
JSONB schema drift — plan and config JSONB columns evolve without validation, leading to coordinator crashes on old data.
Impact: High
Likelihood: High
Mitigation: Pydantic models with model_validate() on read. A version field in the plan JSONB enables migration logic. Always add new fields as optional with defaults.
Reference: Section 4.3 (plan schema design)
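A minimal sketch of the mitigation, assuming pydantic v2 (for model_validate()) and illustrative field names; the real plan schema lives in Section 4.3:

```python
from typing import Optional
from pydantic import BaseModel

# Hypothetical plan shape -- only the pattern matters: a version field,
# and every later-added field optional with a default.
class PlanModel(BaseModel):
    version: int = 1              # enables migration logic on read
    strategy: str = "sequential"  # illustrative field
    notes: Optional[str] = None   # new fields: always optional with defaults

# An old row written before `version` existed still validates cleanly
# instead of crashing the coordinator:
raw = {"strategy": "parallel"}
plan = PlanModel.model_validate(raw)
print(plan.version, plan.strategy)  # → 1 parallel
```

On a version mismatch, read-side migration code can upgrade the dict before validating, keeping old rows loadable forever.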
R2
Denormalized counter desync — task_count, tasks_completed, and tasks_failed on orchestration_runs drift from the actual orchestration_tasks counts due to bugs or partial transaction failures.
Impact: Medium
Likelihood: Medium
Mitigation: (1) Counters are updated in the same transaction as the task state change. (2) The reconciler periodically re-derives counts from SELECT COUNT(*) ... GROUP BY state and corrects drift. (3) Dashboard queries can fall back to a live COUNT if a discrepancy is detected.
Reference: Section 4.2 (denormalized counters)
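The re-derivation step can be sketched in pure Python; the real reconciler would run SELECT state, COUNT(*) ... GROUP BY state, and the failure-state set below is an assumption:

```python
from collections import Counter

TERMINAL_FAILED = {"failed", "cancelled"}  # assumed failure states

def derive_counters(task_states):
    """Recompute run counters from the ground-truth task states."""
    by_state = Counter(task_states)
    return {
        "task_count": sum(by_state.values()),
        "tasks_completed": by_state["completed"],
        "tasks_failed": sum(by_state[s] for s in TERMINAL_FAILED),
    }

def correct_drift(run_counters, task_states):
    """Return corrected counters when the denormalized copy has drifted."""
    truth = derive_counters(task_states)
    return truth if truth != run_counters else run_counters

states = ["completed", "completed", "failed", "running"]
stale = {"task_count": 4, "tasks_completed": 1, "tasks_failed": 1}  # drifted
print(correct_drift(stale, states))
# → {'task_count': 4, 'tasks_completed': 2, 'tasks_failed': 1}
```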
R3
Optimistic locking contention — high-frequency state transitions cause StaleDataError storms, especially during parallel task execution with fast completions.
Impact: Medium
Likelihood: Low
Mitigation: At our scale (5-50 tasks/mission), state changes are seconds apart, so contention is minimal. transition_task() returns (False, refreshed_task) on conflict, and the caller retries once with fresh state. If contention grows, switch hot-path transitions to SELECT FOR UPDATE SKIP LOCKED.
Reference: Section 3.9 (concurrency safety)
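The retry-once caller pattern can be sketched as follows; the in-memory version counter stands in for SQLAlchemy's version_id_col check, and transition_task()'s real signature may differ:

```python
class Task:
    def __init__(self):
        self.version = 1
        self.state = "running"

def transition_task(task, new_state, seen_version):
    """Returns (True, task) on success, (False, refreshed_task) on conflict."""
    if task.version != seen_version:       # StaleDataError equivalent
        return False, task                 # caller receives the fresh row
    task.state, task.version = new_state, task.version + 1
    return True, task

task = Task()
stale_version = task.version
task.version += 1                          # simulate a concurrent writer

ok, fresh = transition_task(task, "completed", stale_version)
if not ok:                                 # retry once with fresh state
    ok, fresh = transition_task(fresh, "completed", fresh.version)
print(ok, fresh.state)  # → True completed
```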
R4
Board task coupling — synchronous board task updates inside transition_task() add latency and create a failure coupling between the orchestration and board systems.
Impact: Medium
Likelihood: Low
Mitigation: Board task sync is a simple UPDATE on an already-loaded row (same transaction, no extra round-trip). If board_task_id is NULL (task not yet planned), the sync is skipped. Fallback: if board sync fails, the orchestration state still commits — the board is made eventually consistent by the reconciler.
Reference: Section 7.2.4 (state synchronization)
R5
Event table growth — the append-only orchestration_events table grows unbounded if the retention policy isn't implemented, degrading query performance on time-range scans.
Impact: Low
Likelihood: High
Mitigation: A BRIN index on created_at keeps time-range scans efficient regardless of table size. At the projected volume (~1M events/year), PostgreSQL handles this comfortably. The archive strategy (Section 6.7) is documented but deferred to PRD-106, which defines retention needs. Monitor table size via pg_total_relation_size().
Reference: Section 6.7 (retention policy)
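Both mitigation pieces are one-liners in SQL (index name assumed). BRIN works well here because created_at on an append-only table correlates with physical row order:

```sql
-- Sketch: BRIN index keeps time-range scans cheap on append-only data
CREATE INDEX IF NOT EXISTS ix_orchestration_events_created_brin
    ON orchestration_events USING brin (created_at);

-- Size monitoring query named in the mitigation:
SELECT pg_size_pretty(pg_total_relation_size('orchestration_events'));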
R6
Migration lock on board_tasks — adding indexes to the existing board_tasks table during migration could lock the table in production.
Impact: High
Likelihood: Medium
Mitigation: Use CREATE INDEX CONCURRENTLY for all indexes on existing tables. This requires index creation to run outside a transaction (op.execute("COMMIT") before CREATE INDEX CONCURRENTLY). Test the migration on staging with production-sized data before deploying.
Reference: Section 7.7.4 (raw SQL style)
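A hedged sketch of a helper the migration might use (names hypothetical); the point is that CONCURRENTLY cannot run inside a transaction block, hence the COMMIT first:

```python
def concurrent_index_sql(index_name: str, table: str, column: str) -> str:
    """Build a non-blocking index DDL statement for an existing table."""
    return (
        f"CREATE INDEX CONCURRENTLY IF NOT EXISTS {index_name} "
        f"ON {table} ({column})"
    )

# In the Alembic migration (hypothetical usage):
#   op.execute("COMMIT")  # leave the migration's transaction block
#   op.execute(concurrent_index_sql(
#       "ix_board_tasks_source_type", "board_tasks", "source_type"))
print(concurrent_index_sql(
    "ix_board_tasks_source_type", "board_tasks", "source_type"))
```

Note that CONCURRENTLY is slower and can leave an INVALID index behind on failure, which is another reason to rehearse on staging first.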
R7
Trigger rule complexity creep — starting with 4 trigger rules invites requests for more (one_success, one_failed, branching logic), increasing coordinator complexity.
Impact: Low
Likelihood: Medium
Mitigation: The 4 rules are stored as VARCHAR(30), so adding new values requires only a code change in the trigger rule evaluator, not a schema migration. Gate new rules behind feature flags. Document that complex conditional logic belongs in the coordinator's planning prompt (PRD-102), not in trigger rules.
Reference: Section 5.3 (trigger rules)
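To illustrate why new rules are code-only changes, a sketch of a table-driven evaluator. Only all_success is named in this document; the second rule below is an illustrative placeholder, and the real set is defined in Section 5.3:

```python
# Rule name -> predicate over the states of a task's dependencies.
# Because the column is VARCHAR(30), adding a rule is one dict entry,
# not a schema migration.
RULES = {
    "all_success": lambda states: all(s == "completed" for s in states),
    "all_done": lambda states: all(          # placeholder rule, not confirmed
        s in {"completed", "failed", "skipped"} for s in states
    ),
}

def task_ready(trigger_rule, dependency_states):
    """A task is ready when its rule holds over its dependencies' states."""
    return RULES[trigger_rule](dependency_states)

print(task_ready("all_success", ["completed", "completed"]))  # → True
print(task_ready("all_success", ["completed", "failed"]))     # → False
```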
R8
Recipe conversion loses fidelity — converting a mission DAG to a recipe flattens parallel execution into sequential steps, losing the execution structure that made the mission succeed.
Impact: Medium
Likelihood: High
Mitigation: Document the limitation explicitly (Section 7.4.4). The recipe's execution_config.mode preserves the strategy hint. Full DAG-aware recipes are a future enhancement — the recipe engine would need to understand dependencies natively (not in scope for PRD-101 or 82A).
Reference: Section 7.4.4 (limitations)
R9
Circular dependency at runtime — the coordinator dynamically adds tasks during execution that create a cycle not caught at planning time.
Impact: High
Likelihood: Low
Mitigation: graphlib.TopologicalSorter.prepare() is called after every dynamic task addition, not just at initial planning. If a cycle is detected, the new task is rejected with an error event. The CHECK (task_id != depends_on_id) constraint catches self-references at the DB level.
Reference: Section 5.5 (cycle detection)
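The mitigation can be sketched with the standard library directly; rejecting the edge before persisting it is the key move (helper name and dict-based graph are illustrative):

```python
from graphlib import TopologicalSorter, CycleError

def try_add_dependency(edges, task_id, depends_on_id):
    """Validate a dynamically added edge; reject self-references and cycles.

    edges maps task_id -> set of task_ids it depends on.
    """
    if task_id == depends_on_id:  # mirrors CHECK (task_id != depends_on_id)
        return False
    candidate = {**edges, task_id: edges.get(task_id, set()) | {depends_on_id}}
    try:
        TopologicalSorter(candidate).prepare()  # raises CycleError on a cycle
    except CycleError:
        return False  # real code would also emit an error event here
    edges[task_id] = candidate[task_id]
    return True

edges = {"b": {"a"}}                         # b depends on a
print(try_add_dependency(edges, "c", "b"))   # → True  (a → b → c)
print(try_add_dependency(edges, "a", "c"))   # → False (would close a cycle)
```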
R10
output_ref path breakage — task output stored at a workspace file path is moved or deleted, leaving output_ref pointing at nothing.
Impact: Low
Likelihood: Medium
Mitigation: output_ref is a reference, not a guarantee. The report service (PRD-76) also stores content in agent_reports — the report is the durable artifact; the file is a convenience copy. Downstream tasks receive output_summary (inline text), not the file path.
Reference: Section 5.1 (output storage)
10. Implementation Acceptance Criteria
These are testable criteria for PRD-82A (the implementation PRD that follows this research document). The schema is "done" when all criteria pass.
10.1 Database Schema
AC-1: orchestration_runs table exists with all 22 columns from Section 4.2. Verify: \d orchestration_runs matches the column definitions.
AC-2: orchestration_tasks table exists with all 27 columns from Section 5.2. Verify: \d orchestration_tasks matches the column definitions.
AC-3: orchestration_task_dependencies table exists with a composite PK and self-reference CHECK. Verify: \d orchestration_task_dependencies shows the PK and constraint.
AC-4: orchestration_events table exists with a BIGSERIAL PK and all 8 columns from Section 6.3. Verify: \d orchestration_events matches the column definitions.
AC-5: orchestration_events_archive table exists (identical schema, no FKs). Verify: \d orchestration_events_archive.
AC-6: All FK constraints enforced: runs→workspaces, runs→agents, tasks→runs, tasks→workspaces, tasks→agents, tasks→board_tasks, deps→tasks (both directions), events→runs, events→tasks. Verify: SELECT conname FROM pg_constraint WHERE conrelid = ...
AC-7: All CHECK constraints enforced: non-negative tokens/cost/counts, score range 0-1, attempt ≥ 1, no self-referencing deps. Verify: an INSERT violating each CHECK returns an error.
AC-8: All indexes from Section 7.6 exist, including partial indexes with correct WHERE clauses. Verify: \di plus a pg_indexes query for the WHERE clause text.
AC-9: agent_reports.orchestration_task_id column exists (nullable UUID FK → orchestration_tasks). Verify: \d agent_reports shows the new column.
AC-10: Partial unique indexes on board_tasks for source_type = 'orchestration' and source_type = 'orchestration_task'. Verify: a duplicate INSERT returns a unique violation.
10.2 SQLAlchemy Models
AC-11: OrchestrationRun model importable from orchestrator.core.models. Verify: from orchestrator.core.models import OrchestrationRun succeeds.
AC-12: OrchestrationTask model importable with relationships to run, agent, and board_task. Verify: model introspection shows the relationships.
AC-13: OrchestrationTaskDependency model importable with a composite PK. Verify: model metadata shows the composite PK.
AC-14: OrchestrationEvent model importable with a BIGSERIAL PK. Verify: model metadata shows an autoincrementing Integer PK.
AC-15: version_id_col configured on OrchestrationRun and OrchestrationTask for optimistic locking. Verify: __mapper_args__ includes version_id_col.
AC-16: All Python enums (StateType, RunState, TaskState, EventType, ActorType, TaskType, TriggerRule) defined and importable. Verify: import and iterate all enum values.
10.3 State Machine
AC-17: transition_task() enforces the allowed transitions from Section 3.10; invalid transitions raise InvalidTransition. Verify: unit test attempts every invalid transition and asserts the error.
AC-18: transition_run() enforces allowed transitions; invalid transitions raise InvalidTransition. Verify: unit test attempts every invalid transition and asserts the error.
AC-19: Every state transition emits an OrchestrationEvent in the same transaction. Verify: unit test performs a transition, queries events in the same session, and asserts the event exists.
AC-20: Terminal state transitions set completed_at and duration_ms. Verify: unit test transitions to completed/failed/cancelled and asserts the timestamps are set.
AC-21: StaleDataError on concurrent modification returns (False, refreshed_task) instead of raising. Verify: unit test loads a task in two sessions, modifies it in one, and attempts a transition in the other.
AC-22: Board task status synced on every orchestration task state change (per the mapping in Section 3.8). Verify: integration test transitions a task, queries the board_task, and asserts the status matches.
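The transition guard that AC-17 and AC-18 exercise can be sketched as a table-driven check. The state names and allowed map below are illustrative; Section 3.10 holds the authoritative table:

```python
class InvalidTransition(Exception):
    pass

# Illustrative subset of the allowed-transition map (see Section 3.10).
ALLOWED = {
    "pending":   {"assigned", "cancelled"},
    "assigned":  {"running", "cancelled"},
    "running":   {"verifying", "failed", "cancelled"},
    "verifying": {"completed", "failed"},
    # terminal states (completed/failed/cancelled) have no outgoing edges
}

def check_transition(current, target):
    """Raise InvalidTransition unless current -> target is permitted."""
    if target not in ALLOWED.get(current, set()):
        raise InvalidTransition(f"{current} -> {target}")
    return target

print(check_transition("running", "verifying"))  # → verifying
try:
    check_transition("completed", "running")     # terminal: no exits
except InvalidTransition as exc:
    print("rejected:", exc)
```

The real transition_task() would wrap this check, the event emission, and the board sync in one transaction.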
10.4 Integration
AC-23: create_mission_board_task() creates a parent board_task with source_type='orchestration'. Verify: integration test creates a run, calls the function, and queries board_tasks.
AC-24: create_task_board_task() creates a child board_task with parent_task_id linking to the mission board_task. Verify: integration test asserts parent_task_id is set.
AC-25: Duplicate board_task creation is idempotent (returns the existing ID). Verify: call create_mission_board_task() twice and assert the same board_task.id.
AC-26: Auto-report creation on task completion writes to agent_reports with the orchestration_task_id FK. Verify: integration test completes a task and queries agent_reports for the matching FK.
AC-27: Workspace isolation: a query for a run_id belonging to workspace A with workspace B context returns 404. Verify: integration test asserts the cross-workspace access attempt fails.
10.5 Migration
AC-28: Migration runs successfully on an empty database. Verify: alembic upgrade head on a fresh DB.
AC-29: Migration runs successfully on a production-like database (existing board_tasks, agent_reports, agents). Verify: alembic upgrade head on staging with data.
AC-30: Downgrade cleanly removes all orchestration tables and the agent_reports.orchestration_task_id column. Verify: alembic downgrade -1, then confirm the tables are gone.
AC-31: Migration is re-runnable (IF NOT EXISTS on all CREATE statements). Verify: run the migration twice without error.
AC-32: Existing board_tasks and agent_reports rows are unaffected by the migration. Verify: row count before equals row count after.
11. Dependencies & Sequencing
11.1 What Must Be Built First
The schema is the foundation — everything else depends on it:
11.2 Implementation Order for PRD-82A
Within the implementation PRD, build in this order:
1. Migration: Alembic migration file creating all 4 tables, the archive table, and the existing-table alterations. Depends on: nothing.
2. Models: SQLAlchemy model classes in orchestrator/core/models/orchestration.py. Depends on: Phase 1 (tables must exist).
3. Enums: Python enums (StateType, RunState, TaskState, etc.) in orchestrator/core/models/orchestration_enums.py. Depends on: nothing (pure Python). Parallelizable with: Phase 1.
4. State machine: transition_task(), transition_run(), emit_event() in orchestrator/services/orchestration_state.py. Depends on: Phases 2 and 3.
5. Board bridge: create_mission_board_task(), create_task_board_task() in orchestrator/services/orchestration_board_bridge.py. Depends on: Phase 2. Parallelizable with: Phase 4.
6. Dependency resolver: DependencyResolver class and validate_task_graph() in orchestrator/services/orchestration_deps.py. Depends on: Phase 3 (uses the TaskState enum). Parallelizable with: Phases 4 and 5.
7. API endpoints: CRUD endpoints for runs, tasks, and events in orchestrator/api/missions.py. Depends on: Phases 2, 4, 5, and 6.
11.3 What Can Be Deferred
These features are designed in this PRD but can be implemented incrementally:
Trigger rules beyond all_success (Section 5.3): deferred to PRD-82C (parallel execution). Sequential missions only need all_success; the other rules enable parallel patterns.
Recipe conversion ("Save as Routine") (Section 7.4): deferred to PRD-82D or later. Nice-to-have; core mission execution works without it.
Event archive/retention (Section 6.7): deferred to PRD-106 (telemetry defines retention needs). At <1M events/year, there is no urgency.
orchestration_events_archive table (Section 6.10): same as above. The table exists (the migration creates it), but the archival job is deferred.
Report auto-creation (Section 7.3.2): deferred to PRD-82B (the coordinator builds this flow). Depends on the coordinator knowing when tasks complete.
Stall detection reconciler (Section 3.7): deferred to PRD-82B (coordinator). The reconciler is part of the coordinator's tick loop.
11.4 Cross-PRD Interface Contracts
These are the columns/fields that downstream PRDs will write to. The schema must support them even if this PRD doesn't populate them:
orchestration_tasks.verifier_score: written by the Verifier agent (PRD-103)
orchestration_tasks.verified_by: written by the Verifier agent or a human (PRD-103)
orchestration_tasks.agent_type = 'contractor': written by the Coordinator (PRD-104)
orchestration_tasks.model_override: written by the Coordinator for model routing (PRD-104)
orchestration_runs.total_cost / total_tokens: written by the Budget enforcer (PRD-105)
orchestration_runs.config.budget.*: written by the user / Budget enforcer (PRD-105)
orchestration_events (all event types): read by the Telemetry pipeline (PRD-106)
12. Appendix: Full SQL DDL
Complete CREATE TABLE statements ready to convert to Alembic. These follow the codebase convention of raw SQL via op.execute().
13. Appendix: SQLAlchemy Models
Complete Python model classes matching the DDL above. These follow the codebase conventions documented in the model audit: Base from core.database.base, PGUUID(as_uuid=True) for UUIDs, server_default for all defaults, DateTime(timezone=True) for timestamps, no PostgreSQL ENUM types.
13.1 Model Registration
Add to orchestrator/core/models/__init__.py:
13.2 Enum Definitions
Place in orchestrator/core/models/orchestration_enums.py (separate file to avoid circular imports):
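A sketch of what that file might contain. Member lists here are illustrative (the full sets are defined in Sections 3-6); str-valued enums match the no-PostgreSQL-ENUM convention, since values round-trip to VARCHAR columns without casting:

```python
from enum import Enum

# Illustrative subsets -- the authoritative member lists are in the
# state machine and event sections of this PRD.
class RunState(str, Enum):
    PLANNING = "planning"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"

class TriggerRule(str, Enum):
    ALL_SUCCESS = "all_success"  # the only rule confirmed in this document

# str subclassing means values read from VARCHAR columns compare directly:
print(RunState.COMPLETED == "completed")  # → True
```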