PRD-104 — Ephemeral Agents & Model Selection
Version: 1.0
Type: Research + Design
Status: Complete, Ready for Peer Review
Priority: P0
Dependencies: PRD-100 (Research Master), PRD-101 (Mission Schema: contractor_config JSONB), PRD-102 (Coordinator Architecture: agent assignment)
Blocks: PRD-82C (Parallel Execution + Budget + Contractors)
Author: Gerard Kavanagh + Claude
Date: 2026-03-15
1. Problem Statement
1.1 The Gap
Automatos agents are permanent residents. Every agent occupies a row in the agents table (45+ columns), has a heartbeat config, skills, tool assignments, persona, voice profile, and semantic embeddings. Creating one means a DB write, tool resolution, optional LLM verification, and caching. Deleting one cascades through 11 dependent tables.
This is correct for roster agents — permanent team members with personality, memory, and ongoing responsibilities.
It is completely wrong for mission work. When the coordinator decomposes a goal into 4 subtasks, it needs focused agents in <100ms each, executing in parallel, reporting results, and disappearing. No persona. No heartbeat. No marketplace category. No voice profile.
1.2 What This PRD Delivers
Contractor Agent Lifecycle: spawn, configure, execute, report, destroy, with <100ms in-memory creation
Model-Per-Role Strategy: which models for which agent roles, with cost/quality tradeoffs
Dynamic Tool Scoping: coordinator specifies tools per contractor, no DB assignment needed
Mission-Scoped Memory: contractors share mission context but nothing persists after
Auto-Cleanup: TTL-based and mission-completion-based destruction
Integration Design: how contractors flow through the existing AgentFactory.execute_with_prompt()
2. Prior Art: Ephemeral Agent Patterns
2.1 System-by-System Analysis
Agent Zero (frdel/agent-zero)
Agent Zero spawns subordinates via Agent(number+1, fresh_config, SHARED_context); the only customization is the profile (a prompt directory). Key characteristics:
Single subordinate at a time (linked list, not fan-out)
Memory is SHARED (same FAISS index) — no isolation
Conversation sealing via history.new_topic(): progressive compression (50% current / 30% topics / 20% bulks)
Utility model for compression: cheap model handles internal coordination
No timeouts, no budgets, no explicit destruction
What we adopt: Conversation sealing pattern (progressive context compression). Utility model for coordination overhead.
What we reject: Single-subordinate limitation (we need parallel fan-out). Shared memory (we need mission-scoped isolation). No lifecycle management.
AutoGen (microsoft/autogen)
AutoGen's agent is fully described by a config dict: (name, system_message, llm_config, tools, description). The GroupChatManager selects speakers via LLM or deterministic rules. Swarm handoff priority: tool-returned agent → OnCondition → AFTER_WORK fallback. context_variables dict is shared mutable state.
What we adopt: Agent-as-config-dict pattern — this IS our contractor config model. Swarm handoff priority ordering for coordinator task transitions. context_variables as mission-scoped shared state.
What we reject: No explicit cleanup (Python GC). LLM-based speaker selection (expensive, non-deterministic).
Kubernetes Jobs
K8s Jobs define the infrastructure pattern for ephemeral workloads:
ttlSecondsAfterFinished = auto-cleanup after completion
activeDeadlineSeconds = hard timeout (overrides retries)
backoffLimit = retry cap with exponential backoff
podFailurePolicy rules: FailJob (fatal error codes), Ignore (infra disruption), Count (normal retry)
Artifact preservation: write results to external storage BEFORE container exit
What we adopt: TTL-based cleanup. Hard timeout. Backoff limit. Failure classification (fatal vs retryable). Result persistence before destruction.
What we reject: Full container lifecycle management (premature — Phase 3/K8s scope).
2.2 Model Routing Research
RouteLLM (ICLR 2025, UC Berkeley/Anyscale/Canva)
Matrix factorization router achieves 75% cost reduction at 95% quality on MT-Bench. Key finding: math/reasoning tasks need expensive models; conversational/summarization routes cheap. Static role-based mapping captures 80% of the routing value without any ML infrastructure.
BudgetMLAgent (AIMLSystems 2024)
Cascade pattern: free model → cheap model → expensive model, escalating only when output quality is insufficient. Achieved 96% cost reduction vs single GPT-4 agent. Proves cascade/escalation is viable for multi-agent budgets.
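The cascade idea can be sketched in a few lines. This is a minimal illustration, not BudgetMLAgent's code: the stub model functions and the `is_good_enough` check are hypothetical stand-ins for real LLM calls and a real quality gate.

```python
from typing import Callable, Optional, Sequence, Tuple

ModelCall = Callable[[str], str]


def run_cascade(
    prompt: str,
    tiers: Sequence[ModelCall],
    is_good_enough: Callable[[str], bool],
) -> Tuple[int, Optional[str]]:
    """Call tiers cheapest-first and stop at the first acceptable answer.

    Returns (index of the tier that answered, its answer). If no tier
    passes the check, the last tier's answer is returned as-is.
    """
    answer: Optional[str] = None
    for index, call_model in enumerate(tiers):
        answer = call_model(prompt)
        if is_good_enough(answer):
            return index, answer
    return len(tiers) - 1, answer


# Stub models standing in for real LLM calls (hypothetical).
def free_model(prompt: str) -> str:
    return ""


def cheap_model(prompt: str) -> str:
    return "short answer"


def expensive_model(prompt: str) -> str:
    return "a long, carefully reasoned answer"


tier_used, result = run_cascade(
    "Summarize the findings",
    [free_model, cheap_model, expensive_model],
    is_good_enough=lambda text: len(text) >= 5,
)
```

Here the free tier returns nothing usable, so the call escalates once and stops at the cheap tier; the expensive model is never invoked.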
OpenRouter (Existing Infrastructure)
Already integrated with 340 models. Provider routing params available:
sort: 'price' | 'throughput' | 'latency'
max_price: Cost ceiling per call
preferred_min_throughput: Min tokens/sec
Decision: Static role→model mapping for v1. No ML routing. RouteLLM proves static mapping captures 80% of value. OpenRouter's sort/max_price params are the v1 selection interface. Telemetry (PRD-106) provides data for future dynamic routing.
3. Contractor Agent Lifecycle
3.1 State Machine
State | Duration | Description
SPAWNING | <100ms | In-memory AgentRuntime created from contractor_config
READY | <10ms | Tools resolved, context prepared
EXECUTING | 3-300s | Agent running LLM calls via execute_with_prompt()
REPORTING | <100ms | Result written to orchestration_tasks.result_reference
CLEANUP | <50ms | Evict from active_agents, delete Redis keys, soft-delete DB row
DESTROYED | Terminal | Agent no longer exists
SPAWN_FAILED | Terminal | Config validation failed, tools unavailable, etc.
FAILED | Terminal | Execution crashed, timeout, or max retries exhausted
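One way to keep these states honest in code is an explicit transition table. The sketch below uses the state names listed above; the allowed transitions are an assumption inferred from the listed order, since the PRD names the states but does not enumerate the edges.

```python
from enum import Enum


class ContractorState(Enum):
    SPAWNING = "spawning"
    READY = "ready"
    EXECUTING = "executing"
    REPORTING = "reporting"
    CLEANUP = "cleanup"
    DESTROYED = "destroyed"
    SPAWN_FAILED = "spawn_failed"
    FAILED = "failed"


S = ContractorState

# Assumed edges: the happy path runs top to bottom; terminal states have none.
ALLOWED_TRANSITIONS = {
    S.SPAWNING: {S.READY, S.SPAWN_FAILED},
    S.READY: {S.EXECUTING},
    S.EXECUTING: {S.REPORTING, S.FAILED},
    S.REPORTING: {S.CLEANUP, S.FAILED},
    S.CLEANUP: {S.DESTROYED},
    S.DESTROYED: set(),
    S.SPAWN_FAILED: set(),
    S.FAILED: set(),
}


def advance(current: ContractorState, target: ContractorState) -> ContractorState:
    """Return the target state, or raise if the transition is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target


def is_terminal(state: ContractorState) -> bool:
    """A state is terminal when nothing can follow it."""
    return not ALLOWED_TRANSITIONS[state]
```

Rejecting illegal transitions at one choke point makes lifecycle bugs (for example, executing a contractor that was never marked READY) fail loudly rather than silently.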
3.2 Contractor Config Schema
The coordinator specifies contractor configuration in orchestration_tasks.contractor_config JSONB (defined in PRD-101):
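The authoritative schema lives in PRD-101. As a rough illustration only, the sketch below shows the kind of fields such a config might carry and how it could be validated at spawn time. Every field name here (`role`, `model`, `tools`, `timeout_seconds`, `ttl_seconds`) and the `AVAILABLE_TOOLS` set are assumptions made for the example, not the PRD-101 definition.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Roles from the role taxonomy in section 5.1.
ROLES = {"researcher", "analyst", "writer", "coder", "reviewer", "simple"}

# Hypothetical tool registry; the two names appear in section 6.1.
AVAILABLE_TOOLS = {"workspace_read_file", "platform_search_documents"}


@dataclass(frozen=True)
class ContractorConfig:
    role: str
    model: str
    tools: Tuple[str, ...] = ()
    timeout_seconds: int = 300  # upper end of the EXECUTING range
    ttl_seconds: int = 3600     # assumed default, not from the PRD

    def validation_errors(self) -> List[str]:
        """Collect every problem so the coordinator sees them all at once."""
        errors = []
        if self.role not in ROLES:
            errors.append(f"unknown role: {self.role}")
        if not self.model:
            errors.append("model is required")
        for tool in self.tools:
            if tool not in AVAILABLE_TOOLS:
                errors.append(f"unknown tool: {tool}")
        if self.timeout_seconds <= 0:
            errors.append("timeout_seconds must be positive")
        return errors


good = ContractorConfig("researcher", "vendor-a/mid-tier", ("workspace_read_file",))
bad = ContractorConfig("pilot", "", ("rm_rf",))
```

A non-empty error list would map to the SPAWN_FAILED state, so a bad config never reaches execution.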
3.3 DB Record Strategy: Hybrid
Decision: In-memory execution + async DB audit row.
Phase | What happens | Latency
Creation | In-memory AgentRuntime only | <50ms
During execution | In-memory; async DB write of minimal audit row | DB write: ~200ms (non-blocking)
After completion | orchestration_tasks has the result; audit row has agent metadata | Already written
Cleanup | Evict from active_agents; soft-delete audit row (is_active = False) | <50ms
The audit row is minimal:
Why not skip the DB row entirely? Board tasks need assigned_agent_id for display. The admin UI should show active contractors. The telemetry system (PRD-106) needs agent_id for attribution. The async write doesn't block execution.
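The "register in memory first, write the audit row in the background" idea can be sketched with asyncio. This is a toy model under stated assumptions: `active_agents` and `audit_rows` are plain in-process containers, and `write_audit_row` uses a short sleep in place of the real database write.

```python
import asyncio
from typing import Dict, List

active_agents: Dict[str, dict] = {}  # stands in for AgentFactory.active_agents
audit_rows: List[dict] = []          # stands in for the agents table


async def write_audit_row(row: dict) -> None:
    """Simulated slow DB write (the PRD estimates ~200ms for the real one)."""
    await asyncio.sleep(0.01)
    audit_rows.append(row)


async def spawn_contractor(agent_id: str, mission_id: str) -> "asyncio.Task[None]":
    """Register the runtime immediately; schedule the audit write without awaiting it."""
    active_agents[agent_id] = {"mission_id": mission_id}
    row = {"agent_id": agent_id, "mission_id": mission_id, "is_active": True}
    # Return the task so the caller holds a reference and can await it later.
    return asyncio.create_task(write_audit_row(row))


async def demo() -> tuple:
    audit_task = await spawn_contractor("contractor-1", "mission-1")
    # The contractor is usable now, before the audit row has landed.
    usable_before_write = "contractor-1" in active_agents and not audit_rows
    await audit_task
    return usable_before_write, len(audit_rows)
```

Running `asyncio.run(demo())` shows the contractor registered before the audit write completes. Keeping a reference to the returned task matters in practice, because a task nobody references can be garbage-collected before it finishes.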
4. AgentFactory Integration
4.1 New Method: create_ephemeral_agent()
4.2 What Does NOT Change
execute_with_prompt() tool loop (lines ~838-862): same 10-iteration tool loop
_execute_tool_calls() (lines ~958-1028): same tool dispatch
unified_executor.execute_tool(): same prefix-based routing
Heartbeat tick pattern: roster agents continue unchanged
Agent API endpoints: contractors are created by the coordinator, not the user API
4.3 Hard Constraint: No Sub-Contractors
Contractors cannot spawn sub-contractors. This is architectural, not a simplification:
Bounded cost: Sub-contractors create unbounded agent trees. Budget enforcement becomes impossible — the coordinator can't pre-estimate cost for a tree of unknown depth.
Observability: The coordinator must see every executing agent. Sub-contractors would be invisible to the reconciliation tick.
Debugging: Flat coordinator→contractor traces (2 levels) are tractable. N-level traces are exponentially harder.
Alternative: If a task is too complex, the coordinator should decompose it into smaller tasks (replanning per PRD-102 Section 9), not delegate decomposition to the contractor.
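A guard of this kind is cheap to enforce at the spawn entry point. The following is a minimal sketch; `Caller`, `SubContractorError`, and `spawn_contractor` are hypothetical names, and the real check would live wherever the coordinator requests contractors.

```python
from dataclasses import dataclass
from typing import Dict


class SubContractorError(RuntimeError):
    """Raised when an ephemeral agent tries to spawn another agent."""


@dataclass(frozen=True)
class Caller:
    agent_id: str
    is_ephemeral: bool  # True for contractors, False for the coordinator


def spawn_contractor(caller: Caller, task: str, registry: Dict[str, dict]) -> str:
    """Spawn a contractor for `task`, refusing requests made by contractors."""
    if caller.is_ephemeral:
        raise SubContractorError(
            f"{caller.agent_id} is a contractor and cannot spawn sub-contractors"
        )
    contractor_id = f"contractor-{len(registry) + 1}"
    registry[contractor_id] = {"task": task, "parent": caller.agent_id}
    return contractor_id


registry: Dict[str, dict] = {}
coordinator = Caller("coordinator", is_ephemeral=False)
first_id = spawn_contractor(coordinator, "gather sources", registry)
```

Because every entry in the registry has the coordinator as its parent, the agent tree stays exactly two levels deep.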
5. Model-Per-Role Strategy
5.1 Role Taxonomy
Role | Task types | Model tier | Rationale
researcher | Web search, document analysis, data gathering | Mid-tier + large context | Process lots of text, synthesize findings
analyst | Data analysis, comparison, structured output | Mid-tier | Good reasoning, structured generation
writer | Reports, documentation, content creation | Mid-tier | Good prose at high volume
coder | Code generation, debugging, refactoring | Top-tier or specialized | Code quality is critical
reviewer | Quality review, fact-checking, verification | Mid-tier, different family from coder | Cognitive diversity catches different bugs
simple | Classification, formatting, routing, extraction | Cheap | Minimal reasoning needed
5.2 Default Model Mapping
5.3 Cognitive Diversity Enforcement
Hard rule: reviewer model MUST be from a different model family than the task executor.
This is a quality requirement, not a preference. Different model families have different failure modes. A Claude-generated analysis reviewed by Claude tends to share the same blind spots, whereas a GPT review catches different issues.
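The rule can be enforced mechanically when the reviewer is selected. The sketch below assumes model IDs follow the `provider/model` shape used by OpenRouter and treats the provider prefix as the family; the model IDs shown are placeholders, not real catalog entries.

```python
from typing import Sequence


def model_family(model_id: str) -> str:
    """Treat the provider prefix of a 'provider/model' ID as the family (assumption)."""
    return model_id.split("/", 1)[0].lower()


def pick_reviewer_model(executor_model: str, candidates: Sequence[str]) -> str:
    """Return the first candidate whose family differs from the executor's."""
    executor_family = model_family(executor_model)
    for candidate in candidates:
        if model_family(candidate) != executor_family:
            return candidate
    raise ValueError(
        f"no reviewer candidate outside the '{executor_family}' family"
    )


reviewer = pick_reviewer_model(
    "vendor-a/coder-large",
    ["vendor-a/mid-tier", "vendor-b/mid-tier"],
)
```

Raising when no cross-family candidate exists keeps the rule hard: the coordinator must extend the candidate list rather than quietly falling back to a same-family reviewer.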
5.4 User Override Surface
Users can override model selection at mission creation:
Override priority: user preference > mission config > workspace defaults > role defaults.
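That priority order amounts to a first-match lookup across four layers. A minimal sketch, assuming each layer is a plain role-to-model dict (the layer shapes and model IDs are illustrative):

```python
from typing import Dict, Optional

Layer = Optional[Dict[str, str]]


def resolve_model(
    role: str,
    user: Layer = None,
    mission: Layer = None,
    workspace: Layer = None,
    role_defaults: Layer = None,
) -> str:
    """Return the model for `role` from the highest-priority layer that sets it."""
    # Order encodes: user preference > mission config > workspace defaults > role defaults.
    for layer in (user, mission, workspace, role_defaults):
        if layer and role in layer:
            return layer[role]
    raise KeyError(f"no model configured for role '{role}'")


defaults = {"coder": "vendor-a/top-tier", "writer": "vendor-a/mid-tier"}
workspace_settings = {"writer": "vendor-b/mid-tier"}
user_choice = {"coder": "vendor-c/specialized"}

coder_model = resolve_model("coder", user=user_choice, role_defaults=defaults)
writer_model = resolve_model("writer", workspace=workspace_settings, role_defaults=defaults)
```

Each role resolves independently, so a user can override the coder model while the writer still falls through to the workspace setting.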
5.5 Cost Estimation
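As a rough sketch of how a pre-flight estimate could be computed, assuming per-million-token pricing with separate input and output rates: the prices, model IDs, and token counts below are invented for illustration and are not real catalog values.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical prices in USD per million tokens: (input, output).
PRICES: Dict[str, Tuple[float, float]] = {
    "vendor-a/mid-tier": (3.0, 15.0),
    "vendor-b/cheap": (0.25, 1.25),
}


def estimate_call_cost(
    model: str, input_tokens: int, output_tokens: int,
    prices: Dict[str, Tuple[float, float]] = PRICES,
) -> float:
    """Cost of one call: tokens times the per-million-token rate, per direction."""
    input_price, output_price = prices[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def estimate_mission_cost(
    planned_calls: Iterable[Tuple[str, int, int]],
    prices: Dict[str, Tuple[float, float]] = PRICES,
) -> float:
    """Sum the estimates for every planned (model, input_tokens, output_tokens) call."""
    return sum(estimate_call_cost(m, i, o, prices) for m, i, o in planned_calls)


plan = [
    ("vendor-a/mid-tier", 2000, 1000),  # (2000*3 + 1000*15) / 1e6 = 0.021
    ("vendor-b/cheap", 4000, 400),      # (4000*0.25 + 400*1.25) / 1e6 = 0.0015
]
total = estimate_mission_cost(plan)
```

An estimate like this is only as good as the token guesses behind it, so it suits a go/no-go budget check better than exact billing.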
6. Memory Isolation
6.1 What Contractors Can Access
Memory layer | Access | Notes
Mission context (prior task results) | READ | Injected by coordinator via task prompt. Contractor sees outputs from earlier tasks.
Shared mission context (SharedContextPort) | READ/WRITE | Via PRD-107 interface. Contractors inject findings; later agents query them.
Redis session memory | NONE | Mission-scoped, not session-scoped. Contractors have no chat history.
Postgres short-term memory | NONE | No L2 memory for ephemeral agents.
Mem0 long-term memory | NONE | Contractors do not read or write to Mem0. Mission-scoped only.
RAG / document search | READ | Via workspace tools (workspace_read_file, platform_search_documents).
NL2SQL | NONE | No workspace data queries for contractors.
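The access rules above translate directly into a deny-by-default policy table. In this sketch the layer keys are hypothetical identifiers chosen for the example; only the READ, READ/WRITE, and NONE grants come from the list above.

```python
from typing import Dict, Set

# Grants per memory layer for contractors; an empty set means no access.
CONTRACTOR_ACCESS: Dict[str, Set[str]] = {
    "mission_context": {"read"},
    "shared_mission_context": {"read", "write"},
    "redis_session_memory": set(),
    "postgres_short_term_memory": set(),
    "mem0_long_term_memory": set(),
    "rag_document_search": {"read"},
    "nl2sql": set(),
}


def is_allowed(layer: str, operation: str) -> bool:
    """Deny by default: unknown layers and unlisted operations are refused."""
    return operation in CONTRACTOR_ACCESS.get(layer, set())
```

Defaulting unknown layers to no access means a newly added memory system stays closed to contractors until someone grants it explicitly.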
6.2 How Context Flows to Contractors
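Section 6.1 states that the coordinator injects prior task results through the task prompt. One minimal way to picture that step, assuming results are available as a name-to-text mapping, is shown below; `build_task_prompt`, the heading layout, and the truncation limit are illustrative assumptions, not the ContextService implementation.

```python
from typing import Dict, Optional


def build_task_prompt(
    instructions: str,
    prior_results: Optional[Dict[str, str]] = None,
    max_chars_per_result: int = 2000,
) -> str:
    """Assemble a contractor prompt: task first, then truncated earlier outputs."""
    sections = ["## Task", instructions.strip()]
    if prior_results:
        sections.append("## Results from earlier tasks")
        for task_name, output in prior_results.items():
            # Truncate so one verbose task cannot crowd out the others.
            clipped = output.strip()[:max_chars_per_result]
            sections.append(f"### {task_name}\n{clipped}")
    return "\n\n".join(sections)


prompt = build_task_prompt(
    "Compare the three vendors on price.",
    {"research": "Vendor A: $10. Vendor B: $12. Vendor C: $9."},
)
```

Because everything arrives in the prompt, the contractor needs no memory access of its own to see what earlier tasks produced.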
7. Cleanup Automation
7.1 Three Cleanup Triggers
Trigger | Condition | Action
Mission completion | All tasks terminal (completed/failed/cancelled) | cleanup_ephemeral_agents(mission_id)
TTL expiry | expires_at timestamp passed | Periodic GC sweep (every 5 min)
Explicit cancel | Human cancels mission | Same as mission completion
7.2 GC Sweep
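A sweep of this kind reduces to "find every runtime whose expires_at has passed and evict it". The sketch below models only the in-memory eviction, assuming each runtime is a dict carrying a timezone-aware `expires_at`; Redis key deletion and the audit-row soft-delete are left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def sweep_expired(active_agents: Dict[str, dict], now: datetime) -> List[str]:
    """Evict every contractor whose expires_at is at or before `now`."""
    # Collect first, then delete, to avoid mutating the dict while iterating.
    expired = [
        agent_id
        for agent_id, runtime in active_agents.items()
        if runtime["expires_at"] <= now
    ]
    for agent_id in expired:
        del active_agents[agent_id]
    return expired


now = datetime(2026, 3, 15, 12, 0, tzinfo=timezone.utc)
agents = {
    "contractor-1": {"expires_at": now - timedelta(minutes=1)},
    "contractor-2": {"expires_at": now + timedelta(minutes=30)},
}
evicted = sweep_expired(agents, now)
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the sweep deterministic and easy to test.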
7.3 What Persists After Cleanup
Artifact | Location | Persists?
Task output | orchestration_tasks.result_reference → workspace file | Yes
Execution trace | orchestration_events | Yes
Cost/token metrics | llm_usage rows with mission_task_id | Yes
Verifier score | orchestration_tasks.verifier_score | Yes
Agent DB row | agents table (soft-deleted) | Yes (queryable for audit)
In-memory runtime | AgentFactory.active_agents | No (evicted)
Redis keys | Contractor-specific Redis entries | No (expired or deleted)
8. Concurrency Control
8.1 Limits
Limit | Default | Configurable | Enforcement
Max concurrent contractors per mission | 3 | Yes (mission config) | Dispatcher checks before spawn
Max concurrent contractors per workspace | 5 | Yes (workspace settings) | Matches heartbeat_service.max_concurrent_per_workspace
Max total contractors per mission | 20 | No (hard limit) | Validation in plan decomposition
8.2 Backpressure
When all contractor slots are full, the coordinator queues tasks in queued state. The next tick's dispatch phase picks them up when a slot opens.
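The dispatch decision on each tick can be sketched as a pure function over the current counts. The defaults of 3 and 5 are the limits from section 8.1; the function name and the list-based queue are assumptions for illustration.

```python
from typing import List, Tuple


def dispatch(
    queued: List[str],
    running_in_mission: int,
    running_in_workspace: int,
    per_mission_limit: int = 3,
    per_workspace_limit: int = 5,
) -> Tuple[List[str], List[str]]:
    """Split queued tasks into (start now, keep queued) given free slots."""
    # The tighter of the two limits decides how many tasks may start.
    free_slots = min(
        per_mission_limit - running_in_mission,
        per_workspace_limit - running_in_workspace,
    )
    free_slots = max(free_slots, 0)
    return queued[:free_slots], queued[free_slots:]


to_start, still_queued = dispatch(["t1", "t2", "t3"], running_in_mission=2, running_in_workspace=3)
```

With two contractors already running in the mission, only one slot is free, so `t1` starts and the rest wait for the next tick in their original order.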
9. Failure Classification
Following K8s podFailurePolicy and Prefect's CRASHED/FAILED distinction:
Failure class | Examples | Retry? | Strategy
Infrastructure (CRASHED) | LLM timeout, rate limit 429, network error, OOM | Yes (auto) | Same config, exponential backoff
Config (FATAL) | Invalid model, tool not found, auth failure | No | Fail immediately, report to coordinator
Quality (FAILED) | Verifier rejects output | Yes (auto) | Same or different model, with verifier feedback
Budget (FATAL) | Budget exhausted pre-call | No | Coordinator decides (downgrade, pause, abort)
Timeout (CRASHED) | activeDeadlineSeconds exceeded | Yes (auto) | Retry with longer timeout or simpler instructions
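The classification can be expressed as a lookup plus a retry decision. In this sketch the error-kind strings are hypothetical labels for the examples listed above, and the backoff base and cap are assumed values rather than settings from the PRD.

```python
from typing import Dict

# Hypothetical error labels mapped to the failure classes above.
FAILURE_CLASS: Dict[str, str] = {
    "llm_timeout": "infrastructure",
    "rate_limit_429": "infrastructure",
    "network_error": "infrastructure",
    "oom": "infrastructure",
    "invalid_model": "config",
    "tool_not_found": "config",
    "auth_failure": "config",
    "verifier_rejected": "quality",
    "budget_exhausted": "budget",
    "deadline_exceeded": "timeout",
}

RETRYABLE_CLASSES = {"infrastructure", "quality", "timeout"}


def should_retry(error_kind: str, attempt: int, max_retries: int = 3) -> bool:
    """Retry only retryable classes, and only while attempts remain."""
    # Unknown error kinds are treated as fatal rather than retried blindly.
    failure_class = FAILURE_CLASS.get(error_kind, "config")
    return failure_class in RETRYABLE_CLASSES and attempt < max_retries


def backoff_delay(attempt: int, base_seconds: float = 1.0, cap_seconds: float = 60.0) -> float:
    """Exponential backoff: base * 2^attempt, capped."""
    return min(cap_seconds, base_seconds * (2 ** attempt))
```

For example, a 429 on the first attempt is retried after `backoff_delay(0)` seconds, while an auth failure is never retried regardless of attempts left.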
10. Acceptance Criteria
Must Have
Should Have
Nice to Have
11. Risk Register
1 | Spawn overhead too high | High | Medium | Hybrid: in-memory first, async DB. Pre-warm LLM connections.
2 | Model routing accuracy: wrong model degrades quality | Medium | Medium | Static mapping (conservative). PRD-106 telemetry detects model-quality correlation.
3 | Contractor quality: no personality or memory | High | Medium | Rich system prompts from ContextService (mission context, role instructions, success criteria). Quality comes from prompt, not persistence.
4 | Cleanup failures: resource leaks | Medium | Medium | Defense in depth: mission completion + TTL + GC sweep.
5 | Tool scoping edge cases | Medium | Low | Validate tool names at spawn time. Fail fast.
6 | Unbounded parallelism: overwhelm LLM rate limits | High | High | Hard caps: per-mission (3), per-workspace (5). Queue excess.
7 | Cost blowout: parallel expensive models | High | Medium | PRD-105 budget gate runs pre-check before each spawn.
8 | Model deprecation mid-mission | Low | Low | 3-model fallback chain per role. OpenRouter handles within-model fallback.
12. Dependencies
Dependency | Relationship | Detail
PRD-101 (Mission Schema) | Uses | contractor_config JSONB, mission_id FK on agents table
PRD-102 (Coordinator) | Blocked by | Coordinator decides when/what contractors to spawn
PRD-103 (Verification) | Informs | Verification cost affects model selection for reviewer role
PRD-105 (Budget) | Uses | Budget gate wraps contractor creation
PRD-106 (Telemetry) | Feeds | Per-contractor metrics: model, tokens, cost, duration, score
PRD-107 (Context Interface) | Informs | Context interface determines how contractors receive mission context
AgentFactory | Extension | New methods: create_ephemeral_agent(), cleanup_ephemeral_agents()
tool_router.py | Extension | New _resolve_explicit_tools() path for contractor tool resolution
Appendix: Research Sources
Source | What it informed
Agent Zero (frdel/agent-zero) | Conversation sealing, utility model, shared memory limitations
AutoGen (microsoft/autogen) | Agent-as-config-dict, Swarm handoff priority, context_variables
Kubernetes Jobs (kubernetes.io) | TTL cleanup, hard timeout, backoff limit, pod failure policy
RouteLLM (ICLR 2025, arxiv:2406.18665) | 75% cost reduction with static routing, role→tier mapping
BudgetMLAgent (AIMLSystems 2024) | Cascade pattern, 96% cost reduction
OpenRouter (openrouter.ai) | Provider routing params, 340 model catalog, Auto Router
Automatos AgentFactory (agent_factory.py) | execute_with_prompt() accepts AgentRuntime, tool loop pattern
Automatos heartbeat_service.py | _agent_tick() pattern, max_concurrent_per_workspace
Automatos config.py | PREMIUM_MODELS, BUDGET_MODELS, OpenRouter config