PRD-59: Workflow Engine V2 — From 9-Stage Pipeline to Neural Swarm Architecture
Version: 1.0 Status: Draft Date: February 18, 2026 Author: Claude Code (with Gavin Kavanagh) Prerequisites: PRD-10 (Workflow Orchestration), PRD-16 (LLM-Driven Orchestrator), PRD-50 (Universal Router), PRD-51 (Orchestrator Unification), PRD-56 (Infrastructure Scaling), PRD-58 (Prompt Management)
Executive Summary
The Automatos 9-stage workflow engine runs end-to-end but produces inconsistent results because Stages 1-2 (Task Decomposition + Agent Selection) are only as good as their prompts, and those prompts have never been evaluated or optimized. When decomposition is wrong, everything downstream fails — the best execution engine in the world can't fix a badly sliced task.
This PRD does three things:
Stabilize the current engine — Fix the 6 critical issues that make the 9-stage pipeline unreliable today (Stage 7 scores metadata not outputs, learning loop doesn't close, context optimization disabled, etc.)
Evolve to dynamic stages — Replace the fixed 9-stage sequence with a stage selector that skips unnecessary stages for simple tasks and adds inter-agent negotiation stages for complex ones. Not every task needs all 9 stages.
Bridge to distributed execution — Introduce the TaskRunner abstraction (from PRD-56) so each agent subtask can run as an independent worker today (asyncio), a queued job tomorrow (Redis/ARQ), and a K8s pod next quarter — without changing any orchestration logic. This is the path to microagent swarms.
Why Now
The platform has evolved massively since the original 9-stage design (PRD-10, November 2025):
| PRD-10 era (Nov 2025) | Today |
|---|---|
| ~20 LLM models | 350+ models via OpenRouter + 8 providers |
| Basic agent skills | 40+ skill categories, personas, marketplace |
| No tool integration | 400+ MCP tools, Composio, tool catalog |
| Simple memory | 4-tier hierarchical + Mem0 persistent memory |
| No routing | Universal Router with 4-tier classification (PRD-50) |
| Hardcoded prompts | Prompt Registry with FutureAGI evaluation (PRD-58) |
| Single process | TaskRunner abstraction ready for K8s (PRD-56) |
The engine's infrastructure has grown 10x. The engine's orchestration logic hasn't kept up.
Relationship to Existing PRDs
| PRD | Relationship to PRD-59 |
|---|---|
| PRD-10 (Workflow Engine) | Original 8-stage design. PRD-59 is its successor. |
| PRD-16 (LLM-Driven Orchestrator) | Proposed LLM-first stages. PRD-59 implements the Master Orchestrator pattern selectively. |
| PRD-50 (Universal Router) | Router sits above the engine. PRD-59 focuses on what happens after routing decides to orchestrate. |
| PRD-51 (Orchestrator Unification) | Unifies tool loading + execution paths. PRD-59 depends on this being clean. |
| PRD-56 (Infrastructure Scaling) | TaskRunner abstraction. PRD-59 integrates it as the execution layer. |
| PRD-58 (Prompt Management) | Prompt Registry + FutureAGI. PRD-59 depends on optimized Stage 1-2 prompts. |
| PRD-04 (Inter-Agent Communication) | SharedContext + Redis pub/sub. PRD-59 evolves this into field-based coordination. |
Part 1: Current State Audit
What Works
| Stage | Status | Notes |
|---|---|---|
| Stage 1: Task Decomposition | Working | RealTaskDecomposer makes real LLM calls, returns JSON with subtasks, validates dependency graph via GraphTheory |
| Stage 2: Agent Selection | Working | LLMAgentSelector does batch LLM selection for all subtasks in one call |
| Stage 3: Context Engineering | Partial | RAG retrieval works when documents exist; CodeGraph works when indexed; mathematical optimization (knapsack/MMR) is DISABLED |
| Stage 4: Agent Execution | Working | AgentExecutionManager runs parallel groups via asyncio, tools work, SharedContextManager passes results between agents |
| Stage 5: Result Aggregation | Working | 5-dimension heuristic scoring (completeness, accuracy, efficiency, reliability, coherence) |
| Stage 6: Learning Update | Partial | Updates agent.performance_metrics in DB, but LLMAgentSelector (the live path) may not read these metrics back |
| Stage 7: Quality Assessment | Broken | Always uses heuristics (use_llm=False), scores a metadata summary string — not the actual agent outputs |
| Stage 8: Memory Storage | Working | Mem0 storage works; hierarchical consolidation stubs (collective memory returns []) |
| Stage 9: Response Generation | Working | Builds structured output_data, stores analytics |
The 6 Critical Issues
Issue 1: Stage 7 scores metadata, not outputs
Location: api/workflows.py:2234-2248
The heuristic assesses a summary string containing subtask count, token count, and status. It does NOT evaluate the actual LLM responses. Stage 7 scores are meaningless — they'll hover around 0.65-0.75 regardless of output quality.
Impact: Quality gate doesn't work. Bad outputs pass. Good outputs don't score higher.
Issue 2: Learning loop doesn't fully close
Location: modules/learning/engine/core.py → api/workflows.py:1663
Stage 6's LearningSystemUpdater writes updated performance_metrics to the Agent DB record (exponential moving average, learning rate 0.1). But the live execution path uses LLMAgentSelector (a batch LLM call), which constructs a prompt about available agents — it's unclear whether it includes each agent's performance_metrics.success_rate in that prompt. If not, the learning loop writes data that nothing reads.
Impact: The system doesn't get smarter over time. Agent selection doesn't improve from experience.
Issue 3: Context optimization disabled
Location: modules/orchestrator/stages/context_engineering.py:170
The Shannon Entropy filtering, MMR diversity selection, and Knapsack token budget optimization — the mathematical foundations described in the Platform Guide — are all disabled. Context engineering falls back to basic RAG retrieval.
Impact: Context quality is unoptimized. Token budgets are not managed. The mathematical differentiation described in the ebook doesn't actually run.
Issue 4: requires_context field ignored
Location: api/workflows.py:1898
Stage 1's decomposer returns a requires_context field per subtask, but the execution path ignores it and forces all subtasks through context engineering. This wastes tokens on subtasks that don't need context and adds latency.
Impact: Unnecessary RAG calls. Wasted tokens. Slower execution.
Issue 5: Two disconnected orchestrator implementations
Location: api/workflows.py vs modules/orchestrator/service.py
execute_workflow_with_progress() (2700+ lines inline in workflows.py) is the live path. EnhancedOrchestratorService.execute_workflow() (in service.py) is a cleaner implementation with 4-dimensional agent scoring, LLM quality assessment, and WorkflowMemoryIntegrator — but it's disconnected (import commented out in api/orchestrator.py:23,30).
Impact: The better implementation isn't used. Bug fixes happen in the wrong place.
Issue 6: No graph metadata propagation
Location: modules/agents/execution/execution_manager.py:351-357
The execution manager looks for subtask['graph_metadata'] but the decomposer puts graph analysis in result["graph_analysis"] (top-level), not per-subtask. Graph dependency information computed in Stage 1 never reaches Stage 4.
Impact: Parallel execution grouping works (via parallel_groups), but individual subtasks don't know their position in the dependency graph.
Part 2: Should We Keep 9 Stages?
Analysis
The 9 stages emerged organically — PRD-10 had 8, PRD-16 grew it to 9. The number isn't grounded in theory. Looking at the actual execution flow, what matters is five phases: PLAN (stages 1-2: decompose the task, select agents), PREPARE (stage 3: engineer context), EXECUTE (stages 4-5: run agents, aggregate results), EVALUATE (stages 6-7: update learning, assess quality), and LEARN (stages 8-9: store memory, generate the response).
That's 5 phases, not 9 stages. Some tasks need all 5. A simple single-agent chat response needs only EXECUTE and LEARN (do the work, record it). A recipe with pre-assigned agents skips PLAN entirely (already decided).
Recommendation: Dynamic Phase Selection
Replace the fixed 9-stage sequence with 5 phases that expand into the specific stages needed:
What's New
Stage 2b: Inter-Agent Negotiation — Before execution, selected agents review the task plan and can propose adjustments. From PRD-04's collaborative problem-solving algorithm. Only triggers for 3+ agent workflows.
Stage 3b: Prompt Optimization — If PRD-58's Prompt Registry has evaluated the relevant system prompt and FutureAGI has an optimized variant, use it. Check PromptRegistry.get_ab_variant() for active A/B tests.
Stage 4b: Inter-Agent Coordination — During parallel execution, agents write findings to SharedContextManager. This is already partially implemented but formalized here as a distinct coordination step between parallel groups.
The Phase Selector
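The selector's core logic can be sketched as follows, a minimal version keyed to the Progressive Complexity levels (class and method names are illustrative, not the actual implementation; the phase-per-level mapping follows the table below):

```python
from enum import Enum


class Complexity(Enum):
    ATOM = 1
    MOLECULE = 2
    CELL = 3
    ORGAN = 4
    ORGANISM = 5


# Which phases each complexity level runs (PLAN/PREPARE/EXECUTE/EVALUATE/LEARN).
PHASES_BY_LEVEL = {
    Complexity.ATOM: ["EXECUTE", "LEARN"],
    Complexity.MOLECULE: ["PREPARE", "EXECUTE", "LEARN"],
    Complexity.CELL: ["PREPARE", "EXECUTE", "EVALUATE", "LEARN"],
    Complexity.ORGAN: ["PLAN", "PREPARE", "EXECUTE", "EVALUATE", "LEARN"],
    Complexity.ORGANISM: ["PLAN", "PREPARE", "EXECUTE", "EVALUATE", "LEARN"],
}


def needs_negotiation(agent_count: int) -> bool:
    """Stage 2b (inter-agent negotiation) only triggers for 3+ agent workflows."""
    return agent_count >= 3


class PhaseSelector:
    def select(self, complexity: Complexity, preassigned: bool = False) -> list[str]:
        phases = list(PHASES_BY_LEVEL[complexity])
        # Recipes with pre-assigned agents have already decided the plan.
        if preassigned and "PLAN" in phases:
            phases.remove("PLAN")
        return phases


selector = PhaseSelector()
print(selector.select(Complexity.ATOM))  # ['EXECUTE', 'LEARN']
print(selector.select(Complexity.ORGAN, preassigned=True))
```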
This maps directly to the Progressive Complexity Model from the Platform Guide:
| Level | Phases | Stages | Token budget |
|---|---|---|---|
| Atom (simple task) | EXECUTE + LEARN | 3 stages (4, 9, partial 8) | 50-200 |
| Molecule (needs examples) | PREPARE + EXECUTE + LEARN | 5 stages (3, 4, 5, 8, 9) | 500-2,000 |
| Cell (agent memory) | PREPARE + EXECUTE + EVALUATE + LEARN | 7 stages (3, 4, 5, 6, 7, 8, 9) | 2,000-4,000 |
| Organ (multi-agent) | All 5 phases | All stages including 2b, 4b | 4,000-8,000 |
| Organism (enterprise) | All 5 phases + meta-learning | All stages + cross-workflow learning | 8,000-16,000 |
Part 3: The Fixes (Priority Order)
Fix 1: Make Stage 7 evaluate real outputs
Priority: CRITICAL Effort: 2 days
Validation: Run a workflow, verify Stage 7 score changes meaningfully between a good and bad execution.
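The shape of the fix can be sketched like this: build the Stage 7 prompt from the actual agent outputs and score a weighted rubric over the five existing dimensions. The helper names and the JSON response shape are illustrative assumptions; the real call goes through the platform's LLM client.

```python
import json

# The five scoring dimensions already used by Stage 5's heuristic.
DIMENSIONS = ["completeness", "accuracy", "efficiency", "reliability", "coherence"]


def build_assessment_prompt(task: str, agent_outputs: list[dict]) -> str:
    """Feed Stage 7 the real outputs, not a subtask/token-count summary."""
    outputs = "\n\n".join(
        f"--- {o['agent']} ---\n{o['output']}" for o in agent_outputs
    )
    return (
        f"Task: {task}\n\nAgent outputs:\n{outputs}\n\n"
        f"Score each dimension 0.0-1.0 and reply as JSON: {DIMENSIONS}"
    )


def parse_quality(llm_response: str, weights: dict[str, float]) -> float:
    """Turn the LLM's JSON rubric into a single weighted quality score."""
    scores = json.loads(llm_response)
    return sum(weights[d] * scores[d] for d in DIMENSIONS)


# Demo with an equal-weight rubric and a canned LLM response:
weights = {d: 0.2 for d in DIMENSIONS}
canned = json.dumps({d: 0.9 for d in DIMENSIONS})
score = parse_quality(canned, weights)
print(round(score, 2))  # 0.9
```

Because the score now depends on the output text, a good and a bad execution of the same task should finally produce different Stage 7 scores, which is exactly the validation criterion above.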
Fix 2: Close the learning loop (Stage 6 → Stage 2)
Priority: CRITICAL Effort: 1 day
Verify and fix that LLMAgentSelector includes agent performance data in its selection prompt.
Location: core/llm/llm_agent_selector.py
The agent selection prompt must include, for each candidate agent, the metrics Stage 6 maintains — at minimum performance_metrics.success_rate and the execution count behind it.
If the selector prompt doesn't include these, the LLM has no performance data to reason about. The learning loop writes to /dev/null.
Formula (existing, from PRD-10) — the exponential moving average Stage 6 applies:
success_rate_new = α × outcome + (1 − α) × success_rate_old, with α = 0.1
The LLM should receive these signals, alongside its skill profile, for each candidate agent.
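The change reduces to how candidate agents are rendered into the selection prompt. A sketch, where the exact performance_metrics field names (success_rate is attested in this PRD; total_executions is an assumed companion field) are illustrative:

```python
def format_agent_for_selection(agent: dict) -> str:
    """Render one candidate line for the LLMAgentSelector prompt,
    including the Stage 6 metrics so the learning loop actually closes."""
    m = agent.get("performance_metrics") or {}
    return (
        f"- {agent['name']}: skills={', '.join(agent['skills'])} | "
        f"success_rate={m.get('success_rate', 'n/a')} | "
        f"executions={m.get('total_executions', 0)}"
    )


line = format_agent_for_selection({
    "name": "researcher",
    "skills": ["web_search", "summarize"],
    "performance_metrics": {"success_rate": 0.92, "total_executions": 41},
})
print(line)
```

With this in place, an agent that keeps succeeding at a task type carries a visibly higher success_rate into every future selection call, which is the behavior the Phase 1 success metric ("agent with 10+ executions selected 20% more") measures.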
Fix 3: Re-enable context optimization
Priority: HIGH Effort: 3 days
Location: modules/orchestrator/stages/context_engineering.py
Re-enable the mathematical optimization pipeline:
Shannon Entropy Filter:
H(X) = −Σᵢ p(xᵢ) log₂ p(xᵢ)
Remove context items with H(X) < 4.0 (low information content — boilerplate, repetitive text).
MMR Diversity Selection:
MMR(cᵢ) = λ · sim(cᵢ, query) − (1 − λ) · max_{cⱼ ∈ selected} sim(cᵢ, cⱼ)
Where λ=0.7 (70% relevance, 30% diversity). Prevents redundant context items.
Knapsack Token Budget:
maximize Σᵢ value(cᵢ) · xᵢ subject to Σᵢ tokens(cᵢ) · xᵢ ≤ token_budget, xᵢ ∈ {0, 1}
Where value(cᵢ) = cosine_similarity × information_density.
Why it was disabled: Likely the ContextOptimizer class had an initialization issue or missing dependency. Debug, fix, re-enable behind a feature flag (ENABLE_CONTEXT_OPTIMIZATION=true).
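The three optimizations above fit in a few dozen lines each. A self-contained sketch (character-level entropy as a cheap proxy for information density, greedy MMR, and a greedy ratio approximation of the 0/1 knapsack — all assumptions, not the ContextOptimizer's actual code):

```python
import math
from collections import Counter


def entropy_bits(text: str) -> float:
    """Character-level Shannon entropy: a cheap proxy for information density."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mmr_select(items, query_sim, pair_sim, k, lam=0.7):
    """Greedy MMR: relevance to the query minus redundancy with picks so far."""
    selected, candidates = [], list(items)
    while candidates and len(selected) < k:
        best = max(
            candidates,
            key=lambda c: lam * query_sim[c]
            - (1 - lam) * max((pair_sim[(c, s)] for s in selected), default=0.0),
        )
        selected.append(best)
        candidates.remove(best)
    return selected


def greedy_budget(items, value, tokens, budget):
    """Greedy value/tokens-ratio approximation of the 0/1 knapsack."""
    chosen, used = [], 0
    for item in sorted(items, key=lambda i: value[i] / tokens[i], reverse=True):
        if used + tokens[item] <= budget:
            chosen.append(item)
            used += tokens[item]
    return chosen


# "b" is nearly a duplicate of "a", so MMR prefers the diverse "c":
query_sim = {"a": 0.9, "b": 0.75, "c": 0.4}
pair_sim = {("a", "b"): 0.99, ("b", "a"): 0.99,
            ("a", "c"): 0.1, ("c", "a"): 0.1,
            ("b", "c"): 0.1, ("c", "b"): 0.1}
print(mmr_select(["a", "b", "c"], query_sim, pair_sim, k=2))  # ['a', 'c']
print(greedy_budget(["x", "y", "z"], {"x": 9, "y": 6, "z": 5},
                    {"x": 3, "y": 2, "z": 4}, budget=5))      # ['x', 'y']
```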
Fix 4: Respect requires_context from decomposer
Priority: HIGH Effort: 0.5 days
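The fix is a one-line gate before Stage 3. A sketch (defaulting to True so a subtask that omits the field never silently skips context):

```python
def filter_context_subtasks(subtasks: list[dict]) -> list[dict]:
    """Gate Stage 3: only subtasks flagged requires_context go through RAG.
    Missing field defaults to True, so legacy decompositions keep old behavior."""
    return [s for s in subtasks if s.get("requires_context", True)]


subtasks = [
    {"id": "t1", "description": "summarize repo docs", "requires_context": True},
    {"id": "t2", "description": "format output as JSON", "requires_context": False},
    {"id": "t3", "description": "legacy subtask"},  # field absent -> treated as True
]
print([s["id"] for s in filter_context_subtasks(subtasks)])  # ['t1', 't3']
```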
Fix 5: Unify orchestrator implementations
Priority: HIGH Effort: 5 days
Extract the inline 2700-line execute_workflow_with_progress() into a proper service class that uses the stage components from modules/orchestrator/stages/. Either:
Option A: Refactor execute_workflow_with_progress() to delegate to EnhancedOrchestratorService (clean, but risk of breaking the working path)
Option B: Gradually replace inline stage logic with calls to the stage components (safer, incremental)
Recommend Option B — replace one stage at a time, test between each.
Fix 6: Propagate graph metadata to subtasks
Priority: MEDIUM Effort: 0.5 days
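The fix is a small bridge between Stage 1's output and Stage 4's expectation. A sketch, assuming graph_analysis exposes a per-node dict (the nodes/depends_on/depth shape shown here is an assumption about its internals):

```python
def propagate_graph_metadata(result: dict) -> dict:
    """Copy Stage 1's top-level graph_analysis into each subtask so Stage 4's
    execution manager finds subtask['graph_metadata'] where it looks for it."""
    graph = result.get("graph_analysis", {})
    nodes = graph.get("nodes", {})
    for subtask in result.get("subtasks", []):
        node = nodes.get(subtask["id"], {})
        subtask["graph_metadata"] = {
            "depends_on": node.get("depends_on", []),
            "depth": node.get("depth", 0),
        }
    return result


result = {
    "graph_analysis": {"nodes": {"t2": {"depends_on": ["t1"], "depth": 1}}},
    "subtasks": [{"id": "t1"}, {"id": "t2"}],
}
out = propagate_graph_metadata(result)
print(out["subtasks"][1]["graph_metadata"]["depends_on"])  # ['t1']
```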
Part 4: Stage 1-2 Optimization Strategy (PRD-58 Integration)
Why Stages 1-2 Are the Highest Leverage
Decomposition and selection errors compound multiplicatively: every downstream stage operates on their output, so a workflow is only as sound as the product of their accuracies. Improving Stage 1 to 90% and Stage 2 to 90% means roughly 81% (0.9 × 0.9) of workflows start from a correct plan, and any further gain multiplies through every stage that follows.
PRD-58 Integration Plan
Once the Prompt Registry (PRD-58 Phase 1A) is live:
Evaluate current Stage 1-2 prompts — Run FutureAGI evaluation on
task-decomposerandagent-selectorslugs against a test dataset of 30+ real workflow requestsOptimize with FutureAGI — Use Bayesian optimization (10 iterations) to improve both prompts. Target: +15% instruction adherence on decomposer, +20% task completion on selector
A/B test — Route 20% traffic to optimized prompts, measure quality score differences in Stage 7 (now that it evaluates real outputs)
Activate — When optimized prompts show statistically significant improvement, activate them
Test Dataset for Stage 1 (Task Decomposer)
Part 5: The TaskRunner Bridge (PRD-56 Integration)
Architecture: From Monolith to Distributed
AgentTask Model (from PRD-56)
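PRD-56's model isn't reproduced here; the sketch below shows what the payload plausibly needs to carry so that any runner (local, queued, or K8s) can execute a subtask without shared memory. Field names are illustrative assumptions:

```python
from dataclasses import dataclass, field


@dataclass
class AgentTask:
    """Self-contained description of one agent subtask.
    Must serialize cleanly: a queued or K8s runner cannot share process memory."""
    task_id: str
    agent_id: str
    description: str
    model: str                                   # per-subtask model choice
    depends_on: list[str] = field(default_factory=list)
    context: dict = field(default_factory=dict)  # engineered context, pre-resolved
    timeout_seconds: int = 300


task = AgentTask(task_id="t1", agent_id="researcher",
                 description="survey MCP tool catalogs", model="claude-sonnet")
print(task.task_id)  # t1
```

The design constraint worth noting: everything an agent needs (context, model, timeout) travels inside the task, which is what makes the same payload valid for asyncio, Redis/ARQ, and a K8s Job.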
TaskRunner Interface
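A minimal sketch of the interface and the LocalTaskRunner described below, assuming the runner executes one parallel group at a time (the run_group signature and the dict-shaped tasks are illustrative, not PRD-56's exact API):

```python
import asyncio
from abc import ABC, abstractmethod


class TaskRunner(ABC):
    """Execution-backend boundary: orchestration code only ever sees this."""

    @abstractmethod
    async def run_group(self, tasks: list[dict]) -> list[dict]:
        """Execute one parallel group of subtasks; return results in order."""


class LocalTaskRunner(TaskRunner):
    """Pure wrapper around the existing asyncio path: zero behavior change."""

    def __init__(self, execute_one):
        self._execute_one = execute_one  # the existing per-subtask coroutine

    async def run_group(self, tasks: list[dict]) -> list[dict]:
        return await asyncio.gather(*(self._execute_one(t) for t in tasks))


# Demo with a stub executor standing in for real agent execution:
async def stub_execute(task: dict) -> dict:
    return {"task_id": task["id"], "status": "completed"}


results = asyncio.run(
    LocalTaskRunner(stub_execute).run_group([{"id": "t1"}, {"id": "t2"}])
)
print([r["task_id"] for r in results])  # ['t1', 't2']
```

A QueuedTaskRunner or KubernetesTaskRunner implements the same run_group contract against a Redis queue or K8s Jobs, which is why no orchestration logic changes when the backend does.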
Implementation Priority
The LocalTaskRunner is a pure refactor — zero behavior change, just wrapping the existing asyncio.gather() in the TaskRunner interface. This is the keystone that unlocks everything.
Part 6: Path to Neural Swarm Architecture
From SharedContextManager to Neural Field
The SharedContextManager in Stage 4 is the embryo of the neural field from the Context-Engineering research. Currently it's an in-memory dict scoped to an execution. Here's how it evolves:
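The key shift is from keyed lookups to semantic reads. An in-memory sketch of that behavior, standing in for the eventual Redis + pgvector store (class and method names are illustrative; toy 2-d embeddings replace real ones):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


class NeuralFieldStore:
    """In-memory stand-in for the field: agents write (embedding, finding)
    pairs; reads are semantic (nearest contributions), not keyed lookups."""

    def __init__(self):
        self._entries: list[tuple[list[float], str, str]] = []

    def write(self, agent_id: str, embedding: list[float], finding: str):
        self._entries.append((embedding, agent_id, finding))

    def read(self, query_embedding: list[float], top_k: int = 3):
        ranked = sorted(self._entries,
                        key=lambda e: cosine(e[0], query_embedding),
                        reverse=True)
        return [(agent, finding) for _, agent, finding in ranked[:top_k]]


field = NeuralFieldStore()
field.write("agent-a", [1.0, 0.0], "auth uses JWT")
field.write("agent-b", [0.0, 1.0], "DB is Postgres")
print(field.read([0.9, 0.1], top_k=1))  # [('agent-a', 'auth uses JWT')]
```

Nothing here requires agents to know who wrote what: an agent asking about authentication retrieves agent-a's finding purely by semantic proximity, which is the field-coordination property the swarm design relies on.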
From Learning Update to Attractor Dynamics
Stage 6 currently uses exponential moving average:
metric_new = α · observed + (1 − α) · metric_old, with α = 0.1
This can evolve into attractor dynamics, where each (agent, task_type) pair carries an attractor strength that grows when observed quality beats a threshold and decays when it doesn't:
A(t+1) = A(t) + α · (Q − Q_threshold) + η, with η ~ N(0, 0.01)
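Both update rules fit in a few lines, which makes the migration path concrete. A sketch using the constants stated in Part 8 (function names are illustrative):

```python
import random

ALPHA = 0.1          # learning rate
Q_THRESHOLD = 0.7    # minimum acceptable quality
NOISE_STD = 0.01     # eta ~ N(0, 0.01), keeps the system out of local optima


def ema_update(old: float, observed: float, alpha: float = ALPHA) -> float:
    """Stage 6 today: exponential moving average of a performance metric."""
    return alpha * observed + (1 - alpha) * old


def attractor_update(strength: float, quality: float,
                     rng: random.Random) -> float:
    """Proposed evolution: strength grows when quality beats the threshold,
    decays when it doesn't; Gaussian noise prevents premature convergence."""
    noise = rng.gauss(0.0, NOISE_STD)
    return strength + ALPHA * (quality - Q_THRESHOLD) + noise


print(round(ema_update(0.80, 1.0), 2))  # 0.82
rng = random.Random(42)
print(round(attractor_update(0.5, 0.9, rng), 3))
```

The practical difference: EMA always drifts toward the latest observation, while the attractor rule is signed relative to Q_threshold, so a mediocre-but-passing result still weakens a pairing that used to excel.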
From Fixed Pipeline to Swarm Orchestration
The end state — microagents in K8s with neural field shared consciousness:
Key properties of the swarm:
Ephemeral: Pods spin up for a task and die after. KEDA scales from zero.
Heterogeneous: Different models for different subtasks (Claude for research, GPT-4 for code, etc.)
Field-coordinated: Agents don't message each other. They read/write to the shared neural field. Knowledge propagates via field dynamics, not explicit routing.
Self-improving: Attractor dynamics mean the system converges on optimal agent-model-task combinations over time.
Part 7: Implementation Plan
Phase 1: Stabilize (Weeks 1-3)
Goal: Make the current 9-stage pipeline reliable
| # | Task | Effort | Depends on | Files / Deliverables |
|---|---|---|---|---|
| 1.1 | Fix Stage 7: evaluate real outputs (LLM-based) | 2 days | — | api/workflows.py, stages/quality_assessor.py |
| 1.2 | Close learning loop: verify Stage 6 → Stage 2 data flow | 1 day | — | core/llm/llm_agent_selector.py, learning/engine/core.py |
| 1.3 | Re-enable context optimization (entropy + MMR + knapsack) | 3 days | — | stages/context_engineering.py, search/optimization/ |
| 1.4 | Respect requires_context from decomposer | 0.5 days | — | api/workflows.py |
| 1.5 | Propagate graph metadata to subtasks | 0.5 days | — | api/workflows.py |
| 1.6 | Integrate PRD-58 prompts for Stages 1-2 | 2 days | PRD-58 Phase 1A | stages/task_decomposer.py, llm_agent_selector.py |
| 1.7 | Run FutureAGI evaluation on Stage 1-2 prompts | 1 day | 1.6 | Eval datasets |
| 1.8 | Optimize Stage 1-2 prompts via FutureAGI | 2 days | 1.7 | Prompt versions |
Phase 1 total: ~12 days
Phase 2: Dynamic Phases (Weeks 4-6)
Goal: Replace fixed 9-stage sequence with PhaseSelector
| # | Task | Effort | Depends on | Files |
|---|---|---|---|---|
| 2.1 | Build PhaseSelector class | 2 days | Phase 1 | modules/orchestrator/phase_selector.py (NEW) |
| 2.2 | Extract stages into composable pipeline | 3 days | 2.1 | modules/orchestrator/pipeline.py (NEW) |
| 2.3 | Wire execute_workflow_with_progress() to use pipeline | 3 days | 2.2 | api/workflows.py |
| 2.4 | Add Stage 2b: Inter-Agent Negotiation | 2 days | 2.2 | stages/agent_negotiation.py (NEW) |
| 2.5 | Add Stage 3b: Prompt Optimization check | 1 day | 2.2, PRD-58 | stages/prompt_optimization.py (NEW) |
| 2.6 | SSE streaming for dynamic phases | 2 days | 2.3 | consumers/workflows/streaming.py |
Phase 2 total: ~13 days
Phase 3: TaskRunner Bridge (Weeks 7-10)
Goal: Extract execution into TaskRunner interface
| # | Task | Effort | Depends on | Files |
|---|---|---|---|---|
| 3.1 | Define TaskRunner interface + AgentTask model | 1 day | — | core/task_runner/base.py (NEW) |
| 3.2 | Implement LocalTaskRunner (wraps current asyncio) | 2 days | 3.1 | core/task_runner/local.py (NEW) |
| 3.3 | Refactor AgentExecutionManager to use TaskRunner | 3 days | 3.2 | modules/agents/execution/execution_manager.py |
| 3.4 | Move SharedContextManager to Redis | 2 days | 3.3 | modules/orchestrator/shared_context.py |
| 3.5 | Implement QueuedTaskRunner (Redis + ARQ) | 5 days | 3.3, 3.4 | core/task_runner/queued.py (NEW) |
| 3.6 | Deploy worker containers on Railway | 2 days | 3.5 | Dockerfile.worker, Railway config |
Phase 3 total: ~15 days
Phase 4: Neural Field Prototype (Months 3-4)
Goal: Implement field-based agent coordination
| # | Task | Effort | Depends on | Files |
|---|---|---|---|---|
| 4.1 | Extend SharedContext with vector embeddings | 3 days | Phase 3 | core/neural_field/field_store.py (NEW) |
| 4.2 | Implement field read (semantic retrieval) | 2 days | 4.1 | core/neural_field/field_reader.py (NEW) |
| 4.3 | Implement field write (contribution + diffusion) | 2 days | 4.1 | core/neural_field/field_writer.py (NEW) |
| 4.4 | Replace explicit agent messaging with field ops | 3 days | 4.2, 4.3 | execution_manager.py |
| 4.5 | Implement attractor dynamics for learning | 3 days | 4.4 | core/neural_field/attractor.py (NEW) |
| 4.6 | Field coherence metric (do agents agree?) | 2 days | 4.4 | core/neural_field/coherence.py (NEW) |
Phase 4 total: ~15 days
Phase 5: K8s Microagent Swarms (Months 4-6)
Goal: Full distributed execution
| # | Task | Effort | Depends on | Files |
|---|---|---|---|---|
| 5.1 | Implement KubernetesTaskRunner | 5 days | Phase 3 | core/task_runner/kubernetes.py (NEW) |
| 5.2 | KEDA ScaledJob configuration | 2 days | 5.1 | K8s manifests |
| 5.3 | Workspace namespace isolation | 2 days | 5.1 | K8s RBAC |
| 5.4 | Multi-model pod selection (right model per subtask) | 3 days | 5.1 | core/task_runner/model_router.py (NEW) |
| 5.5 | Neural field across pods (Redis + pgvector) | 3 days | Phase 4, 5.1 | Field store adaptation |
| 5.6 | Swarm monitoring dashboard | 5 days | 5.5 | Frontend + API |
Phase 5 total: ~20 days
Part 8: Mathematical Foundations Reference
Context Engineering (used in Stage 3)
Shannon Entropy — Filter low-information content:
H(X) = −Σᵢ p(xᵢ) log₂ p(xᵢ)
Threshold: H(X) > 4.0 bits for inclusion.
Cosine Similarity — Semantic relevance:
cos(θ) = (A · B) / (‖A‖ ‖B‖)
Threshold: cos(θ) > 0.7 for relevant context.
MMR (Maximal Marginal Relevance) — Balance relevance and diversity:
MMR(cᵢ) = λ · sim(cᵢ, query) − (1 − λ) · max_{cⱼ ∈ selected} sim(cᵢ, cⱼ)
λ = 0.7 (70% relevance, 30% diversity).
Knapsack Optimization — Maximize information within token budget:
maximize Σᵢ value(cᵢ) · xᵢ subject to Σᵢ tokens(cᵢ) · xᵢ ≤ token_budget, xᵢ ∈ {0, 1}
Agent Selection (used in Stage 2)
Multi-dimensional scoring — weighted sum across the selection dimensions:
score(agent) = Σₖ wₖ · dₖ(agent), with Σₖ wₖ = 1
Exponential Moving Average for learning:
metric_new = α · observed + (1 − α) · metric_old
Where α = 0.1 (learning rate).
Quality Assessment (used in Stage 7)
Weighted quality score over the five Stage 5 dimensions (completeness, accuracy, efficiency, reliability, coherence):
Q = Σₖ wₖ · dₖ, with Σₖ wₖ = 1
Confidence interval (from ProbabilityTheory):
p̂ ± z · √( p̂(1 − p̂) / n )
Where z = 1.96 for 95% confidence.
Neural Field Dynamics (Phase 4-5)
Field evolution equation:
∂Ψ/∂t = −∇V(Ψ) + D ∇²Ψ + Σᵢ Aᵢ(x, t)
Where:
Ψ(x,t) ∈ ℝⁿ — field state in embedding space
V(Ψ) — task potential (objective function gradient)
D — diffusion coefficient (knowledge sharing rate, tunable)
Aᵢ — agent i's contribution (injected at semantic position xᵢ)
Field coherence metric (one natural formulation: mean pairwise cosine similarity of agent contributions):
C = (2 / (N(N−1))) · Σᵢ<ⱼ cos(Aᵢ, Aⱼ)
C → 1 when all agents converge (agreement). C → 0 when agents diverge (conflict).
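As code, mean pairwise cosine similarity over agent contribution embeddings looks like this (one reasonable estimator; the PRD does not fix the exact formula, and the function names are illustrative):

```python
import math
from itertools import combinations


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def field_coherence(contributions: list[list[float]]) -> float:
    """C -> 1 when agents' contributions align, -> 0 when they diverge."""
    pairs = list(combinations(contributions, 2))
    if not pairs:
        return 1.0  # a single contribution is trivially coherent
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


print(field_coherence([[1, 0], [1, 0], [1, 0]]))        # 1.0 (full agreement)
print(round(field_coherence([[1, 0], [0, 1]]), 2))      # 0.0 (orthogonal)
```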
Attractor dynamics:
Where:
A — attractor strength for (agent, task_type) pair
Q — observed quality score
Q_threshold = 0.7 (minimum acceptable)
α = 0.1 (learning rate)
η ~ N(0, 0.01) — noise term (prevents local optima)
Part 9: New Files
| # | File | Purpose | Phase |
|---|---|---|---|
| 1 | modules/orchestrator/phase_selector.py | Dynamic phase selection based on complexity + mode | 2 |
| 2 | modules/orchestrator/pipeline.py | Composable stage pipeline executor | 2 |
| 3 | modules/orchestrator/stages/agent_negotiation.py | Stage 2b: Inter-agent task review | 2 |
| 4 | modules/orchestrator/stages/prompt_optimization.py | Stage 3b: PRD-58 prompt variant check | 2 |
| 5 | core/task_runner/base.py | TaskRunner interface + AgentTask model | 3 |
| 6 | core/task_runner/local.py | LocalTaskRunner (asyncio wrapper) | 3 |
| 7 | core/task_runner/queued.py | QueuedTaskRunner (Redis + ARQ) | 3 |
| 8 | core/task_runner/factory.py | TaskRunner factory (env-based selection) | 3 |
| 9 | core/neural_field/field_store.py | Redis + pgvector neural field storage | 4 |
| 10 | core/neural_field/field_reader.py | Semantic field retrieval | 4 |
| 11 | core/neural_field/field_writer.py | Field contribution + diffusion | 4 |
| 12 | core/neural_field/attractor.py | Attractor dynamics for learning | 4 |
| 13 | core/neural_field/coherence.py | Field coherence metric | 4 |
| 14 | core/task_runner/kubernetes.py | KubernetesTaskRunner (K8s Jobs) | 5 |
Modified Files
| # | File | Changes | Phase |
|---|---|---|---|
| 1 | api/workflows.py | Stage 7 fix, context fix, graph propagation, pipeline integration | 1, 2 |
| 2 | modules/orchestrator/stages/context_engineering.py | Re-enable optimization | 1 |
| 3 | modules/orchestrator/stages/quality_assessor.py | Enable LLM assessment | 1 |
| 4 | core/llm/llm_agent_selector.py | Include performance metrics in prompt | 1 |
| 5 | modules/learning/engine/core.py | Verify metric persistence | 1 |
| 6 | modules/agents/execution/execution_manager.py | TaskRunner integration | 3 |
| 7 | consumers/workflows/streaming.py | Dynamic phase SSE events | 2 |
Success Metrics
Phase 1 (Stabilize)
| Metric | Current | Target |
|---|---|---|
| Stage 7 quality score variance | ~0 (always 0.65-0.75) | Meaningful range (0.3-0.95) |
| Stage 1-2 prompt eval score | Unknown | > 85% instruction adherence |
| Learning loop closure | Unverified | Agent with 10+ executions selected 20% more for matching tasks |
| Context token waste | Unknown (all subtasks get context) | 30% reduction via requires_context gating |
Phase 2 (Dynamic Phases)
| Metric | Current | Target |
|---|---|---|
| Simple task latency | 3-10s (all 9 stages) | < 2s (3 stages for Atom tasks) |
| Token cost per simple task | ~12,000 tokens | < 3,000 tokens (skip PLAN + EVALUATE) |
| Multi-agent coordination quality | N/A | Measurable coherence score > 0.7 |
Phase 3 (TaskRunner)
| Metric | Current | Target |
|---|---|---|
| Max concurrent subtasks | ~3 (asyncio, single process) | 10+ (worker pool) |
| Execution isolation | None (shared process) | Full (separate workers) |
| Failure blast radius | All subtasks die | Only failed subtask retries |
Phase 5 (K8s Swarms)
| Metric | Target |
|---|---|
| Scale-to-zero time | < 30 seconds |
| Pod spin-up latency | < 10 seconds |
| Max concurrent agents per workspace | 20+ |
| Field coherence on multi-agent tasks | > 0.75 |
| Cost per workflow (10 subtasks) | < $0.15 compute + LLM costs |
Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Re-enabling context optimization introduces regressions | Degraded context quality | Feature flag ENABLE_CONTEXT_OPTIMIZATION. A/B test before full rollout. |
| LLM-based Stage 7 adds cost | ~$0.005/evaluation | Only use LLM quality for Organ+ complexity. Heuristic for Atom/Molecule. |
| TaskRunner refactor breaks execution | Workflows stop working | LocalTaskRunner is a pure wrapper — zero behavior change. Integration tests. |
| Redis-backed SharedContext adds latency | Slower inter-agent coordination | Redis HSET/HGET is < 1 ms. Net effect is negligible. |
| K8s adds infrastructure complexity | Ops burden | Start with managed K8s (GKE Autopilot). KEDA handles scaling automatically. |
| Neural field math is too theoretical | Wasted effort | Phase 4 is optional. Phases 1-3 deliver concrete value independently. |
Open Questions
PRD-58 timeline: Phase 1 fixes depend on PRD-58's Prompt Registry for Stage 1-2 optimization. Is PRD-58 Phase 1A (registry + seeding) on track?
Context optimizer debug: Why was the mathematical optimization disabled? Is it a dependency issue, a performance issue, or a quality issue? Need to investigate before re-enabling.
QueuedTaskRunner infrastructure: Should workers run on the same Railway project (service-level scaling) or a separate compute provider (e.g., Fly.io, Modal)?
Neural field complexity: Is the field dynamics math from Context-Engineering ready for implementation, or does it need more research? The diffusion equation requires discretization choices (grid resolution, time step) that affect both accuracy and performance.
Backward compatibility: When we switch from fixed 9-stage to dynamic phases, do existing workflow execution records need migration? The WorkflowExecution.input_data JSON stores per-stage metadata keyed by stage name.
Glossary
| Term | Definition |
|---|---|
| Neural Field | Continuous semantic vector space shared by multiple agents. Replaces explicit message-passing with implicit field dynamics. |
| Attractor | A stable pattern in the learning landscape that the system converges toward. High-quality agent-task combinations become attractors. |
| Field Coherence | Measure of agreement between agents' contributions to the shared field. High coherence = agents are aligned. |
| Progressive Complexity | Automatos' 5-level hierarchy: Atom → Molecule → Cell → Organ → Organism. Each level adds context sophistication only when needed. |
| TaskRunner | Abstract interface for task execution. Implementations: LocalTaskRunner (asyncio), QueuedTaskRunner (Redis), KubernetesTaskRunner (K8s Jobs). |
| Phase Selector | Component that determines which workflow phases and stages to execute based on task complexity and execution mode. |
| MMR | Maximal Marginal Relevance. Algorithm that balances relevance and diversity when selecting context items. |
| Knapsack | Optimization algorithm that maximizes information value within a token budget constraint. |
| KEDA | Kubernetes Event-Driven Autoscaler. Scales pods from zero based on queue depth. |