PRD-59: Workflow Engine V2 — From 9-Stage Pipeline to Neural Swarm Architecture

Version: 1.0 | Status: Draft | Date: February 18, 2026 | Author: Claude Code (with Gavin Kavanagh)
Prerequisites: PRD-10 (Workflow Orchestration), PRD-16 (LLM-Driven Orchestrator), PRD-50 (Universal Router), PRD-51 (Orchestrator Unification), PRD-56 (Infrastructure Scaling), PRD-58 (Prompt Management)


Executive Summary

The Automatos 9-stage workflow engine runs end-to-end but produces inconsistent results because Stages 1-2 (Task Decomposition + Agent Selection) are only as good as their prompts, and those prompts have never been evaluated or optimized. When decomposition is wrong, everything downstream fails — the best execution engine in the world can't fix a badly sliced task.

This PRD does three things:

  1. Stabilize the current engine — Fix the 6 critical issues that make the 9-stage pipeline unreliable today (Stage 7 scores metadata rather than outputs, the learning loop doesn't close, context optimization is disabled, etc.)

  2. Evolve to dynamic stages — Replace the fixed 9-stage sequence with a stage selector that skips unnecessary stages for simple tasks and adds inter-agent negotiation stages for complex ones. Not every task needs all 9 stages.

  3. Bridge to distributed execution — Introduce the TaskRunner abstraction (from PRD-56) so each agent subtask can run as an independent worker today (asyncio), a queued job tomorrow (Redis/ARQ), and a K8s pod next quarter — without changing any orchestration logic. This is the path to microagent swarms.

Why Now

The platform has evolved substantially since the original pipeline design (PRD-10, November 2025):

| Then (PRD-10) | Now (February 2026) |
|---|---|
| ~20 LLM models | 350+ models via OpenRouter + 8 providers |
| Basic agent skills | 40+ skill categories, personas, marketplace |
| No tool integration | 400+ MCP tools, Composio, tool catalog |
| Simple memory | 4-tier hierarchical + Mem0 persistent memory |
| No routing | Universal Router with 4-tier classification (PRD-50) |
| Hardcoded prompts | Prompt Registry with FutureAGI evaluation (PRD-58) |
| Single process | TaskRunner abstraction ready for K8s (PRD-56) |

The engine's infrastructure has grown 10x. The engine's orchestration logic hasn't kept up.

Relationship to Existing PRDs

| PRD | Relationship |
|---|---|
| PRD-10 (Workflow Engine) | Original 8-stage design. PRD-59 is its successor. |
| PRD-16 (LLM-Driven Orchestrator) | Proposed LLM-first stages. PRD-59 implements the Master Orchestrator pattern selectively. |
| PRD-50 (Universal Router) | Router sits above the engine. PRD-59 focuses on what happens after routing decides to orchestrate. |
| PRD-51 (Orchestrator Unification) | Unifies tool loading + execution paths. PRD-59 depends on this being clean. |
| PRD-56 (Infrastructure Scaling) | TaskRunner abstraction. PRD-59 integrates it as the execution layer. |
| PRD-58 (Prompt Management) | Prompt Registry + FutureAGI. PRD-59 depends on optimized Stage 1-2 prompts. |
| PRD-04 (Inter-Agent Communication) | SharedContext + Redis pub/sub. PRD-59 evolves this into field-based coordination. |


Part 1: Current State Audit

What Works

| Component | Status | Evidence |
|---|---|---|
| Stage 1: Task Decomposition | Working | RealTaskDecomposer makes real LLM calls, returns JSON with subtasks, validates dependency graph via GraphTheory |
| Stage 2: Agent Selection | Working | LLMAgentSelector does batch LLM selection for all subtasks in one call |
| Stage 3: Context Engineering | Partial | RAG retrieval works when documents exist; CodeGraph works when indexed; mathematical optimization (knapsack/MMR) is DISABLED |
| Stage 4: Agent Execution | Working | AgentExecutionManager runs parallel groups via asyncio, tools work, SharedContextManager passes results between agents |
| Stage 5: Result Aggregation | Working | 5-dimension heuristic scoring (completeness, accuracy, efficiency, reliability, coherence) |
| Stage 6: Learning Update | Partial | Updates agent.performance_metrics in DB, but LLMAgentSelector (the live path) may not read these metrics back |
| Stage 7: Quality Assessment | Broken | Always uses heuristics (use_llm=False), scores a metadata summary string — not the actual agent outputs |
| Stage 8: Memory Storage | Working | Mem0 storage works; hierarchical consolidation is stubbed (collective memory returns []) |
| Stage 9: Response Generation | Working | Builds structured output_data, stores analytics |

The 6 Critical Issues

Issue 1: Stage 7 scores metadata, not outputs

Location: api/workflows.py:2234-2248

The heuristic assesses a summary string containing subtask count, token count, and status. It does NOT evaluate the actual LLM responses. Stage 7 scores are meaningless — they'll hover around 0.65-0.75 regardless of output quality.

Impact: Quality gate doesn't work. Bad outputs pass. Good outputs don't score higher.

Issue 2: Learning loop doesn't fully close

Location: modules/learning/engine/core.py, api/workflows.py:1663

Stage 6's LearningSystemUpdater writes updated performance_metrics to the Agent DB record (exponential moving average, learning rate 0.1). But the live execution path uses LLMAgentSelector (a batch LLM call), which constructs a prompt about available agents — it's unclear whether it includes each agent's performance_metrics.success_rate in that prompt. If not, the learning loop writes data that nothing reads.

Impact: The system doesn't get smarter over time. Agent selection doesn't improve from experience.

Issue 3: Context optimization disabled

Location: modules/orchestrator/stages/context_engineering.py:170

The Shannon Entropy filtering, MMR diversity selection, and Knapsack token budget optimization — the mathematical foundations described in the Platform Guide — are all disabled. Context engineering falls back to basic RAG retrieval.

Impact: Context quality is unoptimized. Token budgets are not managed. The mathematical differentiation described in the ebook doesn't actually run.

Issue 4: requires_context field ignored

Location: api/workflows.py:1898

Stage 1's decomposer returns a requires_context field per subtask, but the execution path ignores it and forces all subtasks through context engineering. This wastes tokens on subtasks that don't need context and adds latency.

Impact: Unnecessary RAG calls. Wasted tokens. Slower execution.

Issue 5: Two disconnected orchestrator implementations

Location: api/workflows.py vs modules/orchestrator/service.py

execute_workflow_with_progress() (2700+ lines inline in workflows.py) is the live path. EnhancedOrchestratorService.execute_workflow() (in service.py) is a cleaner implementation with 4-dimensional agent scoring, LLM quality assessment, and WorkflowMemoryIntegrator — but it's disconnected (import commented out in api/orchestrator.py:23,30).

Impact: The better implementation isn't used. Bug fixes happen in the wrong place.

Issue 6: No graph metadata propagation

Location: modules/agents/execution/execution_manager.py:351-357

The execution manager looks for subtask['graph_metadata'] but the decomposer puts graph analysis in result["graph_analysis"] (top-level), not per-subtask. Graph dependency information computed in Stage 1 never reaches Stage 4.

Impact: Parallel execution grouping works (via parallel_groups), but individual subtasks don't know their position in the dependency graph.


Part 2: Should We Keep 9 Stages?

Analysis

The 9 stages emerged organically — PRD-10 had 8, PRD-16 grew it to 9. The number isn't grounded in theory. Looking at the actual execution flow, what matters is:

  1. PLAN (Stages 1-2): decide what to do and who does it

  2. PREPARE (Stage 3): assemble context and prompts

  3. EXECUTE (Stages 4-5): run agents and aggregate results

  4. EVALUATE (Stages 6-7): assess quality and update learning

  5. LEARN (Stages 8-9): persist memory and generate the response

That's 5 phases, not 9 stages. Some tasks need all 5. A simple single-agent chat response needs little more than EXECUTE. A recipe with pre-assigned agents skips PLAN entirely (decomposition and agent selection are already decided).

Recommendation: Dynamic Phase Selection

Replace the fixed 9-stage sequence with 5 phases that expand into the specific stages needed:

What's New

Stage 2b: Inter-Agent Negotiation — Before execution, selected agents review the task plan and can propose adjustments. From PRD-04's collaborative problem-solving algorithm. Only triggers for 3+ agent workflows.

Stage 3b: Prompt Optimization — If PRD-58's Prompt Registry has evaluated the relevant system prompt and FutureAGI has an optimized variant, use it. Check PromptRegistry.get_ab_variant() for active A/B tests.

Stage 4b: Inter-Agent Coordination — During parallel execution, agents write findings to SharedContextManager. This is already partially implemented but formalized here as a distinct coordination step between parallel groups.

The Phase Selector

This maps directly to the Progressive Complexity Model from the Platform Guide:

| Complexity Level | Phases Used | Estimated Stages | Token Budget |
|---|---|---|---|
| Atom (simple task) | EXECUTE + LEARN | 3 stages (4, 9, partial 8) | 50-200 |
| Molecule (needs examples) | PREPARE + EXECUTE + LEARN | 5 stages (3, 4, 5, 8, 9) | 500-2,000 |
| Cell (agent memory) | PREPARE + EXECUTE + EVALUATE + LEARN | 7 stages (3, 4, 5, 6, 7, 8, 9) | 2,000-4,000 |
| Organ (multi-agent) | All 5 phases | All stages including 2b, 4b | 4,000-8,000 |
| Organism (enterprise) | All 5 phases + meta-learning | All stages + cross-workflow learning | 8,000-16,000 |
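As a minimal sketch of how phase selection could work, mirroring the complexity table above. The class shape and level names are illustrative; the real implementation lands in modules/orchestrator/phase_selector.py in Phase 2:

```python
from enum import Enum

class Phase(Enum):
    PLAN = "plan"          # Stages 1-2 (+2b negotiation)
    PREPARE = "prepare"    # Stage 3 (+3b prompt optimization)
    EXECUTE = "execute"    # Stages 4-5 (+4b coordination)
    EVALUATE = "evaluate"  # Stages 6-7
    LEARN = "learn"        # Stages 8-9

# Complexity level -> phases, mirroring the Progressive Complexity table.
PHASE_MAP = {
    "atom":     [Phase.EXECUTE, Phase.LEARN],
    "molecule": [Phase.PREPARE, Phase.EXECUTE, Phase.LEARN],
    "cell":     [Phase.PREPARE, Phase.EXECUTE, Phase.EVALUATE, Phase.LEARN],
    "organ":    list(Phase),
    "organism": list(Phase),  # plus meta-learning hooks
}

def select_phases(complexity: str, preassigned_agents: bool = False) -> list[Phase]:
    """Pick only the phases a task actually needs; recipes skip PLAN."""
    phases = list(PHASE_MAP[complexity])
    if preassigned_agents and Phase.PLAN in phases:
        phases.remove(Phase.PLAN)
    return phases
```

So select_phases("atom") yields only EXECUTE and LEARN, and a recipe at Organ complexity drops PLAN because its agents are pre-assigned.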


Part 3: The Fixes (Priority Order)

Fix 1: Make Stage 7 evaluate real outputs

Priority: CRITICAL Effort: 2 days

Validation: Run a workflow, verify Stage 7 score changes meaningfully between a good and bad execution.
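A hedged sketch of the fix: feed the assessor the actual agent outputs instead of the metadata summary string. The function and field names here are illustrative, not the existing API:

```python
def build_assessment_payload(subtask_results: list[dict], max_chars: int = 8000) -> str:
    """Concatenate real agent outputs for LLM-based quality assessment.

    Each result is assumed to carry 'subtask', 'agent', and 'output' keys;
    the current code instead scores a summary of counts and statuses.
    """
    sections = []
    for r in subtask_results:
        sections.append(
            f"## Subtask: {r.get('subtask', '?')} (agent: {r.get('agent', '?')})\n"
            f"{r.get('output', '')}"
        )
    payload = "\n\n".join(sections)
    # Truncate defensively so the assessment prompt stays within token budget.
    return payload[:max_chars]
```

The payload would then be scored by the LLM against the five quality dimensions, so the Stage 7 score finally varies with content rather than with metadata shape.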

Fix 2: Close the learning loop (Stage 6 → Stage 2)

Priority: CRITICAL Effort: 1 day

Verify that LLMAgentSelector includes agent performance data in its selection prompt, and fix it if it does not.

Location: core/llm/llm_agent_selector.py

The agent selection prompt must include for each candidate agent:

If the selector prompt doesn't include these, the LLM has no performance data to reason about. The learning loop writes to /dev/null.

Formula (existing, from PRD-10):

The LLM should receive these 4 signals for each candidate agent.
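As an illustrative sketch (names hypothetical), the selector could render each candidate's stored performance_metrics into the prompt so the LLM can weigh track record:

```python
def render_candidate(agent: dict) -> str:
    """Render one candidate agent line for the selection prompt,
    including the metrics Stage 6 writes back (success_rate etc.)."""
    m = agent.get("performance_metrics") or {}
    return (
        f"- {agent['name']}: skills={', '.join(agent.get('skills', []))}; "
        f"success_rate={m.get('success_rate', 'n/a')}; "
        f"avg_quality={m.get('avg_quality', 'n/a')}; "
        f"executions={m.get('total_executions', 0)}"
    )
```

With lines like these in the prompt, Stage 6's writes become inputs to Stage 2 instead of write-only data.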

Fix 3: Re-enable context optimization

Priority: HIGH Effort: 3 days

Location: modules/orchestrator/stages/context_engineering.py

Re-enable the mathematical optimization pipeline:

Shannon Entropy Filter:

H(X) = −Σᵢ p(xᵢ) · log₂ p(xᵢ)

Remove context items with H(X) < 4.0 (low information content — boilerplate, repetitive text).

MMR Diversity Selection:

MMR = argmax over cᵢ ∈ R∖S of [ λ · sim(cᵢ, q) − (1 − λ) · max over cⱼ ∈ S of sim(cᵢ, cⱼ) ]

Where λ=0.7 (70% relevance, 30% diversity). Prevents redundant context items.

Knapsack Token Budget:

maximize Σᵢ value(cᵢ) · xᵢ subject to Σᵢ tokens(cᵢ) · xᵢ ≤ B_tokens, xᵢ ∈ {0, 1}

Where value(cᵢ) = cosine_similarity × information_density.

Why it was disabled: Likely the ContextOptimizer class had an initialization issue or missing dependency. Debug, fix, re-enable behind a feature flag (ENABLE_CONTEXT_OPTIMIZATION=true).
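A compact, self-contained sketch of the three optimizers under toy assumptions: character-level entropy, token-overlap (Jaccard) similarity standing in for cosine over embeddings, and a greedy value-density heuristic standing in for exact 0/1 knapsack:

```python
import math

def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; boilerplate scores low."""
    if not text:
        return 0.0
    counts: dict[str, int] = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def sim(a: str, b: str) -> float:
    """Token-overlap similarity, a stand-in for embedding cosine similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def mmr_select(query: str, candidates: list[str], k: int, lam: float = 0.7) -> list[str]:
    """Maximal Marginal Relevance: balance query relevance vs. redundancy."""
    selected: list[str] = []
    pool = list(candidates)
    while pool and len(selected) < k:
        best = max(
            pool,
            key=lambda c: lam * sim(c, query)
            - (1 - lam) * max((sim(c, s) for s in selected), default=0.0),
        )
        selected.append(best)
        pool.remove(best)
    return selected

def knapsack_greedy(items: list[dict], budget: int) -> list[dict]:
    """Greedy value-density approximation of the 0/1 token-budget knapsack."""
    chosen, used = [], 0
    for item in sorted(items, key=lambda i: i["value"] / i["tokens"], reverse=True):
        if used + item["tokens"] <= budget:
            chosen.append(item)
            used += item["tokens"]
    return chosen
```

The real pipeline would run these in order: entropy-filter the retrieved items, MMR-select for diversity, then pack the survivors into the token budget.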

Fix 4: Respect requires_context from decomposer

Priority: HIGH Effort: 0.5 days
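The gate itself is tiny; a sketch assuming each subtask dict carries the decomposer's requires_context flag, defaulting to True for safety:

```python
def partition_by_context_need(subtasks: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split subtasks into those that go through Stage 3 and those that skip it.

    A missing flag defaults to True, so behavior only changes when the
    decomposer explicitly says context is unnecessary.
    """
    needs, skips = [], []
    for st in subtasks:
        (needs if st.get("requires_context", True) else skips).append(st)
    return needs, skips
```

Only the `needs` partition would be routed through RAG/CodeGraph retrieval; the rest go straight to execution.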

Fix 5: Unify orchestrator implementations

Priority: HIGH Effort: 5 days

Extract the inline 2700-line execute_workflow_with_progress() into a proper service class that uses the stage components from modules/orchestrator/stages/. Either:

Option A: Refactor execute_workflow_with_progress() to delegate to EnhancedOrchestratorService (clean, but risk of breaking the working path)

Option B: Gradually replace inline stage logic with calls to the stage components (safer, incremental)

Recommend Option B — replace one stage at a time, test between each.

Fix 6: Propagate graph metadata to subtasks

Priority: MEDIUM Effort: 0.5 days
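A sketch of the propagation step, assuming the decomposer result shape described in Issue 6 (top-level graph_analysis; per-subtask graph_metadata expected downstream). The dependencies key is a hypothetical example of what the graph analysis might contain:

```python
def propagate_graph_metadata(decomposition: dict) -> dict:
    """Copy Stage 1's top-level graph analysis onto each subtask so that
    Stage 4's execution manager finds subtask['graph_metadata']."""
    graph = decomposition.get("graph_analysis", {})
    deps = graph.get("dependencies", {})  # assumed: subtask id -> [prerequisite ids]
    for st in decomposition.get("subtasks", []):
        st["graph_metadata"] = {
            "dependencies": deps.get(st.get("id"), []),
            "analysis": graph,
        }
    return decomposition
```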


Part 4: Stage 1-2 Optimization Strategy (PRD-58 Integration)

Why Stages 1-2 Are the Highest Leverage

Stage 1-2 errors compound multiplicatively: if, for example, each stage is only 70% accurate, just ~49% of workflows start from a correct plan; at 90% each, ~81% do. Improving Stage 1 to 90% and Stage 2 to 90% therefore roughly doubles the share of workflows where the downstream stages even have a chance to succeed.

PRD-58 Integration Plan

Once the Prompt Registry (PRD-58 Phase 1A) is live:

  1. Evaluate current Stage 1-2 prompts — Run FutureAGI evaluation on task-decomposer and agent-selector slugs against a test dataset of 30+ real workflow requests

  2. Optimize with FutureAGI — Use Bayesian optimization (10 iterations) to improve both prompts. Target: +15% instruction adherence on decomposer, +20% task completion on selector

  3. A/B test — Route 20% traffic to optimized prompts, measure quality score differences in Stage 7 (now that it evaluates real outputs)

  4. Activate — When optimized prompts show statistically significant improvement, activate them

Test Dataset for Stage 1 (Task Decomposer)
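The dataset contents are not specified here; as a purely illustrative shape (all field names hypothetical), each case could pair a real workflow request with the decomposition properties to assert:

```python
# Hypothetical schema for the Stage 1 eval dataset (30+ real requests in practice).
DECOMPOSER_TEST_CASES = [
    {
        "request": "Summarize last week's support tickets and draft a status email",
        "expected_subtask_count": (2, 4),   # acceptable range, not an exact number
        "expected_parallelizable": True,
        "must_mention_skills": ["summarization", "writing"],
    },
    {
        "request": "What is 2 + 2?",
        "expected_subtask_count": (1, 1),   # Atom: no decomposition needed
        "expected_parallelizable": False,
        "must_mention_skills": [],
    },
]
```

Ranged expectations matter because decomposition is non-deterministic: the eval should score plan shape and dependency sanity, not exact wording.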


Part 5: The TaskRunner Bridge (PRD-56 Integration)

Architecture: From Monolith to Distributed

AgentTask Model (from PRD-56)

TaskRunner Interface
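PRD-56 defines the real interface; the sketch below is an assumption-laden approximation of its shape: a serializable AgentTask, an abstract run_task, and a LocalTaskRunner that simply wraps the current asyncio path:

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class AgentTask:
    """Serializable unit of work: everything a worker needs, nothing more."""
    task_id: str
    agent_id: str
    prompt: str
    context: dict = field(default_factory=dict)

class TaskRunner(ABC):
    """Execution backend: local asyncio today, queue or K8s Job later."""

    @abstractmethod
    async def run_task(self, task: AgentTask) -> dict:
        ...

    async def run_group(self, tasks: list[AgentTask]) -> list[dict]:
        # Parallel groups from Stage 1 execute concurrently.
        return await asyncio.gather(*(self.run_task(t) for t in tasks))

class LocalTaskRunner(TaskRunner):
    """Pure wrapper around in-process execution; zero behavior change."""

    def __init__(self, execute_fn):
        self._execute = execute_fn  # the existing agent-execution coroutine

    async def run_task(self, task: AgentTask) -> dict:
        result = await self._execute(task)
        return {"task_id": task.task_id, "output": result}
```

Because orchestration code only ever sees TaskRunner, swapping LocalTaskRunner for a queued or K8s implementation touches nothing above this interface.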

Implementation Priority

The LocalTaskRunner is a pure refactor — zero behavior change, just wrapping the existing asyncio.gather() in the TaskRunner interface. This is the keystone that unlocks everything.


Part 6: Path to Neural Swarm Architecture

From SharedContextManager to Neural Field

The SharedContextManager in Stage 4 is the embryo of the neural field from the Context-Engineering research. Currently it's an in-memory dict scoped to an execution. Here's how it evolves:
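A toy, in-memory illustration of the first evolutionary step: an embedding-indexed shared context where agents read by meaning rather than by addressed messages. The bag-of-words "embedding" is a placeholder for a real model, and the class name is hypothetical:

```python
import math

def toy_embed(text: str) -> dict:
    """Bag-of-words vector; stands in for a real embedding model."""
    vec: dict[str, int] = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class FieldStore:
    """Shared context keyed by meaning, not by agent-addressed messages."""

    def __init__(self):
        self._entries = []  # (embedding, payload) pairs

    def write(self, agent_id: str, content: str) -> None:
        self._entries.append((toy_embed(content), {"agent": agent_id, "content": content}))

    def read(self, query: str, k: int = 3) -> list[dict]:
        q = toy_embed(query)
        ranked = sorted(self._entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [payload for _, payload in ranked[:k]]
```

Later steps swap the in-memory list for Redis + pgvector and add diffusion on write, but the read/write API stays the same shape.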

From Learning Update to Attractor Dynamics

Stage 6 currently uses exponential moving average:

metric_new = (1 − α) · metric_old + α · observed,  with α = 0.1

This can evolve into attractor dynamics:

A_{t+1} = A_t + α · (Q − Q_threshold) + η,  with η ~ N(0, 0.01)

From Fixed Pipeline to Swarm Orchestration

The end state — microagents in K8s with neural field shared consciousness:

Key properties of the swarm:

  • Ephemeral: Pods spin up for a task and die after. KEDA scales from zero.

  • Heterogeneous: Different models for different subtasks (Claude for research, GPT-4 for code, etc.)

  • Field-coordinated: Agents don't message each other. They read/write to the shared neural field. Knowledge propagates via field dynamics, not explicit routing.

  • Self-improving: Attractor dynamics mean the system converges on optimal agent-model-task combinations over time.


Part 7: Implementation Plan

Phase 1: Stabilize (Weeks 1-3)

Goal: Make the current 9-stage pipeline reliable

| # | Task | Effort | Depends On | Files |
|---|---|---|---|---|
| 1.1 | Fix Stage 7: evaluate real outputs (LLM-based) | 2 days | | api/workflows.py, stages/quality_assessor.py |
| 1.2 | Close learning loop: verify Stage 6 → Stage 2 data flow | 1 day | | core/llm/llm_agent_selector.py, learning/engine/core.py |
| 1.3 | Re-enable context optimization (entropy + MMR + knapsack) | 3 days | | stages/context_engineering.py, search/optimization/ |
| 1.4 | Respect requires_context from decomposer | 0.5 days | | api/workflows.py |
| 1.5 | Propagate graph metadata to subtasks | 0.5 days | | api/workflows.py |
| 1.6 | Integrate PRD-58 prompts for Stages 1-2 | 2 days | PRD-58 Phase 1A | stages/task_decomposer.py, llm_agent_selector.py |
| 1.7 | Run FutureAGI evaluation on Stage 1-2 prompts | 1 day | 1.6 | Eval datasets |
| 1.8 | Optimize Stage 1-2 prompts via FutureAGI | 2 days | 1.7 | Prompt versions |

Phase 1 total: ~12 days

Phase 2: Dynamic Phases (Weeks 4-6)

Goal: Replace fixed 9-stage sequence with PhaseSelector

| # | Task | Effort | Depends On | Files |
|---|---|---|---|---|
| 2.1 | Build PhaseSelector class | 2 days | Phase 1 | modules/orchestrator/phase_selector.py (NEW) |
| 2.2 | Extract stages into composable pipeline | 3 days | 2.1 | modules/orchestrator/pipeline.py (NEW) |
| 2.3 | Wire execute_workflow_with_progress() to use pipeline | 3 days | 2.2 | api/workflows.py |
| 2.4 | Add Stage 2b: Inter-Agent Negotiation | 2 days | 2.2 | stages/agent_negotiation.py (NEW) |
| 2.5 | Add Stage 3b: Prompt Optimization check | 1 day | 2.2, PRD-58 | stages/prompt_optimization.py (NEW) |
| 2.6 | SSE streaming for dynamic phases | 2 days | 2.3 | consumers/workflows/streaming.py |

Phase 2 total: ~13 days

Phase 3: TaskRunner Bridge (Weeks 7-10)

Goal: Extract execution into TaskRunner interface

| # | Task | Effort | Depends On | Files |
|---|---|---|---|---|
| 3.1 | Define TaskRunner interface + AgentTask model | 1 day | | core/task_runner/base.py (NEW) |
| 3.2 | Implement LocalTaskRunner (wraps current asyncio) | 2 days | 3.1 | core/task_runner/local.py (NEW) |
| 3.3 | Refactor AgentExecutionManager to use TaskRunner | 3 days | 3.2 | modules/agents/execution/execution_manager.py |
| 3.4 | Move SharedContextManager to Redis | 2 days | 3.3 | modules/orchestrator/shared_context.py |
| 3.5 | Implement QueuedTaskRunner (Redis + ARQ) | 5 days | 3.3, 3.4 | core/task_runner/queued.py (NEW) |
| 3.6 | Deploy worker containers on Railway | 2 days | 3.5 | Dockerfile.worker, Railway config |

Phase 3 total: ~15 days

Phase 4: Neural Field Prototype (Months 3-4)

Goal: Implement field-based agent coordination

| # | Task | Effort | Depends On | Files |
|---|---|---|---|---|
| 4.1 | Extend SharedContext with vector embeddings | 3 days | Phase 3 | core/neural_field/field_store.py (NEW) |
| 4.2 | Implement field read (semantic retrieval) | 2 days | 4.1 | core/neural_field/field_reader.py (NEW) |
| 4.3 | Implement field write (contribution + diffusion) | 2 days | 4.1 | core/neural_field/field_writer.py (NEW) |
| 4.4 | Replace explicit agent messaging with field ops | 3 days | 4.2, 4.3 | execution_manager.py |
| 4.5 | Implement attractor dynamics for learning | 3 days | 4.4 | core/neural_field/attractor.py (NEW) |
| 4.6 | Field coherence metric (do agents agree?) | 2 days | 4.4 | core/neural_field/coherence.py (NEW) |

Phase 4 total: ~15 days

Phase 5: K8s Microagent Swarms (Months 4-6)

Goal: Full distributed execution

| # | Task | Effort | Depends On | Files |
|---|---|---|---|---|
| 5.1 | Implement KubernetesTaskRunner | 5 days | Phase 3 | core/task_runner/kubernetes.py (NEW) |
| 5.2 | KEDA ScaledJob configuration | 2 days | 5.1 | K8s manifests |
| 5.3 | Workspace namespace isolation | 2 days | 5.1 | K8s RBAC |
| 5.4 | Multi-model pod selection (right model per subtask) | 3 days | 5.1 | core/task_runner/model_router.py (NEW) |
| 5.5 | Neural field across pods (Redis + pgvector) | 3 days | Phase 4, 5.1 | Field store adaptation |
| 5.6 | Swarm monitoring dashboard | 5 days | 5.5 | Frontend + API |

Phase 5 total: ~20 days


Part 8: Mathematical Foundations Reference

Context Engineering (used in Stage 3)

Shannon Entropy — Filter low-information content:

H(X) = −Σᵢ p(xᵢ) · log₂ p(xᵢ)

Threshold: H(X) > 4.0 bits for inclusion.

Cosine Similarity — Semantic relevance:

cos(θ) = (q · c) / (‖q‖ · ‖c‖)

Threshold: cos(θ) > 0.7 for relevant context.

MMR (Maximal Marginal Relevance) — Balance relevance and diversity:

MMR = argmax over cᵢ ∈ R∖S of [ λ · sim(cᵢ, q) − (1 − λ) · max over cⱼ ∈ S of sim(cᵢ, cⱼ) ]

λ = 0.7 (70% relevance, 30% diversity).

Knapsack Optimization — Maximize information within token budget:

maximize Σᵢ value(cᵢ) · xᵢ subject to Σᵢ tokens(cᵢ) · xᵢ ≤ B_tokens, xᵢ ∈ {0, 1}

Where value(cᵢ) = cosine_similarity × information_density.

Agent Selection (used in Stage 2)

Multi-dimensional scoring:

Exponential Moving Average for learning:

metric_new = (1 − α) · metric_old + α · observed

Where α = 0.1 (learning rate).

Quality Assessment (used in Stage 7)

Weighted quality score:

quality = Σᵢ wᵢ · scoreᵢ over the five dimensions (completeness, accuracy, efficiency, reliability, coherence), with Σᵢ wᵢ = 1

Confidence interval (from ProbabilityTheory):

CI = p̂ ± z · √( p̂ · (1 − p̂) / n )

Where z = 1.96 for 95% confidence.

Neural Field Dynamics (Phase 4-5)

Field evolution equation:

∂Ψ/∂t = −∇V(Ψ) + D · ∇²Ψ + Σᵢ Aᵢ · δ(x − xᵢ)

Where:

  • Ψ(x,t) ∈ ℝⁿ — field state in embedding space

  • V(Ψ) — task potential (objective function gradient)

  • D — diffusion coefficient (knowledge sharing rate, tunable)

  • Aᵢ — agent i's contribution (injected at semantic position xᵢ)

Field coherence metric:

C = (2 / (N · (N − 1))) · Σ over pairs i<j of cos(Aᵢ, Aⱼ)

C → 1 when all agents converge (agreement). C → 0 when agents diverge (conflict).

Attractor dynamics:

A_{t+1} = A_t + α · (Q − Q_threshold) + η

Where:

  • A — attractor strength for (agent, task_type) pair

  • Q — observed quality score

  • Q_threshold = 0.7 (minimum acceptable)

  • α = 0.1 (learning rate)

  • η ~ N(0, 0.01) — noise term (prevents local optima)
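A sketch of the update rule as code, deterministic when the noise term is zeroed, using the symbols defined above:

```python
import random

def attractor_update(strength: float, quality: float,
                     q_threshold: float = 0.7, alpha: float = 0.1,
                     noise_sigma: float = 0.01) -> float:
    """One attractor-dynamics step: above-threshold quality deepens the
    (agent, task_type) attractor, below-threshold quality erodes it.
    Gaussian noise keeps the system from freezing into local optima."""
    eta = random.gauss(0.0, noise_sigma) if noise_sigma else 0.0
    return strength + alpha * (quality - q_threshold) + eta
```

With noise_sigma=0, a quality of 0.9 against the 0.7 threshold nudges strength up by 0.02 per observation; quality below threshold pushes it down symmetrically.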


Part 9: New Files

| # | File | Purpose | Phase |
|---|---|---|---|
| 1 | modules/orchestrator/phase_selector.py | Dynamic phase selection based on complexity + mode | 2 |
| 2 | modules/orchestrator/pipeline.py | Composable stage pipeline executor | 2 |
| 3 | modules/orchestrator/stages/agent_negotiation.py | Stage 2b: Inter-agent task review | 2 |
| 4 | modules/orchestrator/stages/prompt_optimization.py | Stage 3b: PRD-58 prompt variant check | 2 |
| 5 | core/task_runner/base.py | TaskRunner interface + AgentTask model | 3 |
| 6 | core/task_runner/local.py | LocalTaskRunner (asyncio wrapper) | 3 |
| 7 | core/task_runner/queued.py | QueuedTaskRunner (Redis + ARQ) | 3 |
| 8 | core/task_runner/factory.py | TaskRunner factory (env-based selection) | 3 |
| 9 | core/neural_field/field_store.py | Redis + pgvector neural field storage | 4 |
| 10 | core/neural_field/field_reader.py | Semantic field retrieval | 4 |
| 11 | core/neural_field/field_writer.py | Field contribution + diffusion | 4 |
| 12 | core/neural_field/attractor.py | Attractor dynamics for learning | 4 |
| 13 | core/neural_field/coherence.py | Field coherence metric | 4 |
| 14 | core/task_runner/kubernetes.py | KubernetesTaskRunner (K8s Jobs) | 5 |

Modified Files

| # | File | Changes | Phase |
|---|---|---|---|
| 1 | api/workflows.py | Stage 7 fix, context fix, graph propagation, pipeline integration | 1, 2 |
| 2 | modules/orchestrator/stages/context_engineering.py | Re-enable optimization | 1 |
| 3 | modules/orchestrator/stages/quality_assessor.py | Enable LLM assessment | 1 |
| 4 | core/llm/llm_agent_selector.py | Include performance metrics in prompt | 1 |
| 5 | modules/learning/engine/core.py | Verify metric persistence | 1 |
| 6 | modules/agents/execution/execution_manager.py | TaskRunner integration | 3 |
| 7 | consumers/workflows/streaming.py | Dynamic phase SSE events | 2 |


Success Metrics

Phase 1 (Stabilize)

| Metric | Current | Target |
|---|---|---|
| Stage 7 quality score variance | ~0 (always 0.65-0.75) | Meaningful range (0.3-0.95) |
| Stage 1-2 prompt eval score | Unknown | > 85% instruction adherence |
| Learning loop closure | Unverified | Agent with 10+ executions selected 20% more for matching tasks |
| Context token waste | Unknown (all subtasks get context) | 30% reduction via requires_context gating |

Phase 2 (Dynamic Phases)

| Metric | Current | Target |
|---|---|---|
| Simple task latency | 3-10s (all 9 stages) | < 2s (3 stages for Atom tasks) |
| Token cost per simple task | ~12,000 tokens | < 3,000 tokens (skip PLAN + EVALUATE) |
| Multi-agent coordination quality | N/A | Measurable coherence score > 0.7 |

Phase 3 (TaskRunner)

| Metric | Current | Target |
|---|---|---|
| Max concurrent subtasks | ~3 (asyncio, single process) | 10+ (worker pool) |
| Execution isolation | None (shared process) | Full (separate workers) |
| Failure blast radius | All subtasks die | Only failed subtask retries |

Phase 5 (K8s Swarms)

| Metric | Target |
|---|---|
| Scale-to-zero time | < 30 seconds |
| Pod spin-up latency | < 10 seconds |
| Max concurrent agents per workspace | 20+ |
| Field coherence on multi-agent tasks | > 0.75 |
| Cost per workflow (10 subtasks) | < $0.15 compute + LLM costs |


Risks & Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Re-enabling context optimization introduces regressions | Degraded context quality | Feature flag ENABLE_CONTEXT_OPTIMIZATION. A/B test before full rollout. |
| LLM-based Stage 7 adds cost | ~$0.005/evaluation | Only use LLM quality for Organ+ complexity. Heuristic for Atom/Molecule. |
| TaskRunner refactor breaks execution | Workflows stop working | LocalTaskRunner is a pure wrapper — zero behavior change. Integration tests. |
| Redis-backed SharedContext adds latency | Slower inter-agent coordination | Redis HSET/HGET is < 1 ms. Net effect is negligible. |
| K8s adds infrastructure complexity | Ops burden | Start with managed K8s (GKE Autopilot). KEDA handles scaling automatically. |
| Neural field math is too theoretical | Wasted effort | Phase 4 is optional. Phases 1-3 deliver concrete value independently. |


Open Questions

  1. PRD-58 timeline: Phase 1 fixes depend on PRD-58's Prompt Registry for Stage 1-2 optimization. Is PRD-58 Phase 1A (registry + seeding) on track?

  2. Context optimizer debug: Why was the mathematical optimization disabled? Is it a dependency issue, a performance issue, or a quality issue? Need to investigate before re-enabling.

  3. QueuedTaskRunner infrastructure: Should workers run on the same Railway project (service-level scaling) or a separate compute provider (e.g., Fly.io, Modal)?

  4. Neural field complexity: Is the field dynamics math from Context-Engineering ready for implementation, or does it need more research? The diffusion equation requires discretization choices (grid resolution, time step) that affect both accuracy and performance.

  5. Backward compatibility: When we switch from fixed 9-stage to dynamic phases, do existing workflow execution records need migration? The WorkflowExecution.input_data JSON stores per-stage metadata keyed by stage name.


Glossary

| Term | Definition |
|---|---|
| Neural Field | Continuous semantic vector space shared by multiple agents. Replaces explicit message-passing with implicit field dynamics. |
| Attractor | A stable pattern in the learning landscape that the system converges toward. High-quality agent-task combinations become attractors. |
| Field Coherence | Measure of agreement between agents' contributions to the shared field. High coherence = agents are aligned. |
| Progressive Complexity | Automatos' 5-level hierarchy: Atom → Molecule → Cell → Organ → Organism. Each level adds context sophistication only when needed. |
| TaskRunner | Abstract interface for task execution. Implementations: LocalTaskRunner (asyncio), QueuedTaskRunner (Redis), KubernetesTaskRunner (K8s Jobs). |
| Phase Selector | Component that determines which workflow phases and stages to execute based on task complexity and execution mode. |
| MMR | Maximal Marginal Relevance. Algorithm that balances relevance and diversity when selecting context items. |
| Knapsack | Optimization algorithm that maximizes information value within a token budget constraint. |
| KEDA | Kubernetes Event-Driven Autoscaler. Scales pods from zero based on queue depth. |
