PRD-104 Outline: Ephemeral Agents & Model Selection
Type: Research + Design
Status: Outline (Loop 0)
Depends On: PRD-100 (Research Master), PRD-101 (Mission Schema: contractor_config JSONB), PRD-102 (Coordinator Architecture: agent assignment in lifecycle)
Blocks: PRD-82C (Parallel Execution + Budget + Contractors)
Section 1: Problem Statement
Why This PRD Exists
Automatos agents are permanent residents. Every agent lives as a row in the agents table (45+ columns), has a heartbeat config, skills, tool assignments, persona, voice profile, and semantic embeddings. Creating an agent means a DB write, tool resolution queries, optional LLM verification, and caching in AgentFactory.active_agents. Deleting one requires cascading through 11 dependent tables.
This is correct for roster agents — permanent team members with personality, memory, and ongoing responsibilities.
It is completely wrong for mission work. When a coordinator decomposes "Research EU AI Act compliance" into 4 subtasks, it needs to spawn 4 focused agents in <100ms each, execute them (possibly in parallel), collect results, and destroy them. No persona. No heartbeat. No marketplace category. No voice profile.
The Gap
Current state | Needed for mission work
agents table: 45+ columns, all loaded on every instantiation | Lightweight agent config: just (role, model, tools, prompt)
AgentFactory.create_agent(): DB write + LLM verification (~500ms) | Fast in-memory agent creation (<100ms)
agent_type column: supports custom strings but no lifecycle semantics | Mission-scoped lifecycle: spawn → execute → report → destroy
model_config JSON: per-agent, manually set | Per-role model selection: coordinator picks model based on task type
Tool resolution: DB-backed via get_tools_for_agent() | Dynamic tool scoping: coordinator assigns tools per mission task
No cleanup automation: agents are hard-deleted manually | Auto-cleanup on mission completion or TTL expiry
No concurrent agent limit per mission | Bounded parallelism: max N agents executing simultaneously
What This PRD Delivers
Contractor Agent Lifecycle — spawn, configure, execute, report, destroy — with <100ms creation
Model-Per-Role Strategy — which models for which agent roles, with cost/quality tradeoffs
Dynamic Tool Scoping — coordinator specifies tools per contractor, no DB assignment needed
Mission-Scoped Memory — contractors share mission context but nothing persists after
Auto-Cleanup — TTL-based and mission-completion-based destruction
Integration Design: how contractors flow through the existing AgentFactory.execute_with_prompt()
Section 2: Prior Art Research Targets
Systems to Study (each gets dedicated research in the full PRD)
System | What to study | Why it is relevant
Agent Zero (frdel/agent-zero) | call_subordinate delegation, conversation sealing, utility model, memory sharing | Only hierarchical agent system with progressive context compression. Sealing pattern directly applicable.
AutoGen (microsoft/autogen) | ConversableAgent constructor, GroupChatManager speaker selection, Swarm handoff priority, context_variables shared state | Most mature spawning API. Agent-as-config-dict pattern is exactly what we need. Swarm handoffs inform how coordinator transfers control.
Kubernetes Jobs | Job lifecycle (create → active → complete → cleanup), ttlSecondsAfterFinished, activeDeadlineSeconds, backoffLimit, podFailurePolicy | Infrastructure pattern for ephemeral workloads. TTL cleanup, timeout enforcement, and retry policies all map to contractor lifecycle. Future Phase 3 will containerize agents.
Martian Model Router | Model Mapping interpretability technique, expert orchestration architecture, automatic model indexing | Most advanced model routing. Uses interpretability (weight matrix analysis) to predict model performance without running it. Validates per-prompt routing is feasible.
Unify.ai | Neural network router, per-prompt model selection, custom router training on user data, 10-minute benchmark refresh | Proves custom-trained routers outperform default. Their "train on your own eval data" approach maps to our telemetry → model selection loop (PRD-106).
OpenRouter | Provider routing algorithm (inverse-square price weighting), Auto Router (Not Diamond), sort/max_price/preferred_min_throughput params, capability-based routing, Exacto endpoints | We already use OpenRouter with 340 models. Their provider config params are our v1 model selection interface. Auto Router is free fallback for uncertain cases.
RouteLLM (ICLR 2025) | Matrix factorization router, 75% cost reduction at 95% quality, MT-Bench/MMLU/GSM8K benchmarks | Academic validation that model routing works. Identifies which task types need expensive models (math/reasoning) vs cheap ones (conversational).
BudgetMLAgent (AIMLSystems 2024) | Cascade pattern (free → cheap → expensive), 96% cost reduction vs GPT-4 single-agent | Proves cascade/escalation pattern for multi-agent budgets. Relevant to PRD-105 budget enforcement.
Key Research from Loop 0
From the research agents' findings:
Agent Zero:
Spawns subordinates via Agent(number+1, fresh_config, SHARED_context); the only customization is profile (prompt directory)
Single subordinate at a time (linked list, not fan-out); we need parallel fan-out
Memory is SHARED (same FAISS index) — we need mission-scoped isolation
Conversation sealing via history.new_topic(): progressive compression (50% current / 30% topics / 20% bulks). Adopt this.
Utility model for compression: a cheap model handles internal coordination. Adopt this.
No timeouts, no budgets, no explicit destruction — we must have all three
AutoGen:
Agent = (name, system_message, llm_config, tools, description): fully described by a config dict. This is our contractor config model.
GroupChatManager selects speaker via LLM or deterministic rules; our coordinator uses deterministic assignment.
Swarm handoff priority: tool-returned agent → OnCondition → AFTER_WORK fallback. Adopt priority ordering for coordinator task transitions.
context_variables dict is shared mutable state across agents; it maps to our mission-scoped context.
No explicit cleanup: agents are Python objects, GC'd. We need TTL enforcement.
Kubernetes Jobs:
ttlSecondsAfterFinished = auto-cleanup. Adopt for contractor TTL.
activeDeadlineSeconds = hard timeout, takes precedence over retries. Adopt for contractor timeout.
backoffLimit = retry cap with exponential backoff. Adopt for contractor retry policy.
podFailurePolicy rules: FailJob (fatal error codes), Ignore (infra disruption), Count (normal retry). Adopt for contractor failure classification.
Artifact preservation: write results to external storage BEFORE exit. Our pattern: contractor writes to mission_tasks.result before cleanup.
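The podFailurePolicy-style classification and backoffLimit retry cap described above could look roughly like the following. This is a minimal sketch, not a design decision: FailureAction, classify_failure, should_retry, and every error-kind string are hypothetical names chosen for illustration, and the infra-disruption kinds in particular are assumptions that do not come from the research.

```python
from enum import Enum


class FailureAction(Enum):
    FAIL = "fail"      # like K8s FailJob: fatal, never retried
    IGNORE = "ignore"  # like K8s Ignore: infra disruption, retried without counting
    COUNT = "count"    # like K8s Count: normal failure, counted against the retry cap


# Hypothetical error kinds. The fatal set follows the examples in this outline;
# the infra set is an assumption made only for this sketch.
FATAL_KINDS = {"invalid_config", "auth_failure", "budget_exhausted"}
INFRA_KINDS = {"worker_restart", "connection_reset"}


def classify_failure(error_kind: str) -> FailureAction:
    """Map an error kind to a failure action; unknown kinds count as normal failures."""
    if error_kind in FATAL_KINDS:
        return FailureAction.FAIL
    if error_kind in INFRA_KINDS:
        return FailureAction.IGNORE
    return FailureAction.COUNT


def should_retry(error_kind: str, retries_used: int, backoff_limit: int) -> bool:
    """Decide whether a failed contractor run should be attempted again."""
    action = classify_failure(error_kind)
    if action is FailureAction.FAIL:
        return False
    if action is FailureAction.IGNORE:
        return True
    return retries_used < backoff_limit
```

With a backoff limit of 3, an LLM timeout is retried until three retries have been used, while an auth failure stops immediately. A hard timeout in the style of activeDeadlineSeconds would sit above this check and override it.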
Model Routing:
Static role-based mapping captures 80% of value — no ML needed for v1
OpenRouter's sort, max_price, preferred_min_throughput params are the v1 interface
Cognitive diversity: reviewer MUST use different model family than coder (different failure modes)
RouteLLM: 75% cost reduction at 95% quality on MT-Bench. Math/reasoning needs expensive models; conversational routes cheap.
BudgetMLAgent: cascade pattern (free → cheap → expensive) achieves 96% cost reduction
Section 3: Contractor Agent Lifecycle
Lifecycle States
Research Questions for Full PRD
Spawn latency budget: What's the acceptable time from coordinator decision to contractor executing? Target: <100ms for in-memory, <500ms with DB audit row.
What persists after destruction? Mission context says: mission_tasks.result (output), mission_events (execution trace), mission_tasks.cost/tokens (telemetry). Agent row itself is deleted or marked expired.
Parallel fan-out limit: How many contractors can execute simultaneously per mission? Per workspace? Must integrate with heartbeat_service's existing max_concurrent_per_workspace = 5.
Failure classification: Which errors are retryable (LLM timeout, rate limit) vs fatal (invalid config, auth failure, budget exhausted)? Adopt K8s podFailurePolicy pattern.
Memory during execution: Contractor gets mission context (prior task results, shared findings) but does NOT write to Mem0 long-term memory. Mission-scoped Redis or in-memory only.
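For the parallel fan-out question, one plausible shape is a semaphore-bounded gather. The sketch below assumes contractors run as asyncio coroutines; run_bounded and its parameters are hypothetical names, and the default of 5 simply mirrors the existing max_concurrent_per_workspace value rather than a decided per-mission limit.

```python
import asyncio


async def run_bounded(tasks, run_one, max_parallel=5):
    """Run run_one(task) for every task with at most max_parallel in flight."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def guarded(task):
        # Tasks beyond the limit wait here until a slot frees up.
        async with semaphore:
            return await run_one(task)

    # gather keeps results in input order; return_exceptions=True stops one
    # failed contractor from cancelling its siblings.
    return await asyncio.gather(
        *(guarded(task) for task in tasks), return_exceptions=True
    )
```

Failed contractors come back as exception objects in the result list, so the coordinator can classify each one instead of losing the whole batch.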
Key Design Decision: DB Row or In-Memory Only?
Option | Pros | Cons
In-memory only | <50ms spawn, no migration, no cleanup query | No audit trail, lost on crash, invisible to admin UI
DB row with is_ephemeral=True | Audit trail, visible in admin, survives crash, queryable | ~200-500ms spawn overhead, migration needed, cleanup job needed
Hybrid (in-memory + async DB write) | Fast spawn + eventual audit trail | Complexity, possible inconsistency if crash before write
Recommendation from research: Hybrid. Create AgentRuntime in-memory for immediate execution. Async-write a minimal DB row (just id, name, agent_type='contractor', workspace_id, mission_id, model_config, is_ephemeral=True, expires_at). If crash occurs before async write, mission_events table captures the execution trace anyway.
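The hybrid recommendation can be sketched as "register in memory first, write the audit row in the background". Everything below is illustrative: ContractorRuntime stands in for the real AgentRuntime, the module-level active_agents dict and audit_rows list stand in for AgentFactory.active_agents and the agents table, write_audit_row is a placeholder for a real async insert, and the one-hour TTL is an assumed default.

```python
import asyncio
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class ContractorRuntime:
    """Hypothetical stand-in for AgentRuntime, limited to the minimal audit fields."""
    workspace_id: str
    mission_id: str
    model_config: dict
    name: str = "contractor"
    agent_type: str = "contractor"
    is_ephemeral: bool = True
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    # One-hour TTL is an assumption for this sketch, not a value from the PRD.
    expires_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc) + timedelta(hours=1)
    )


active_agents: dict = {}  # stand-in for AgentFactory.active_agents
audit_rows: list = []     # stand-in for the agents table


async def write_audit_row(runtime: ContractorRuntime) -> None:
    """Placeholder for the real async DB insert."""
    await asyncio.sleep(0)  # yield control, as a real DB call would
    audit_rows.append(asdict(runtime))


async def spawn_contractor(workspace_id: str, mission_id: str, model_config: dict):
    """Create the runtime in memory and schedule the audit write without awaiting it."""
    runtime = ContractorRuntime(workspace_id, mission_id, model_config)
    active_agents[runtime.id] = runtime  # usable for execution immediately
    audit_task = asyncio.create_task(write_audit_row(runtime))
    # The task is returned so the caller keeps a reference and can await or log it.
    return runtime, audit_task
```

The contractor is executable as soon as spawn_contractor returns; the audit row lands later, which is exactly the window in which a crash would leave only the mission_events trace.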
Integration Points
Component | Role for contractors
AgentFactory.execute_with_prompt() | Primary execution path: contractors use same tool loop, retry logic, response synthesis
get_tools_for_agent() | Needs new explicit_tools param: contractor tools specified by coordinator, not DB lookup
ContextService.build_context() | New ContextMode.MISSION_EXECUTION mode: mission-specific system prompt + task context
unified_executor.execute_tool() | No change: same dispatch by prefix (platform_*, workspace_*, composio_execute)
inter_agent.py | Optional: contractors could use Redis pub/sub for real-time coordination within a mission
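The explicit_tools idea for get_tools_for_agent() might behave like the sketch below. It does not reproduce the real function's signature, which this outline does not show: resolve_contractor_tools, lookup_tools_in_db, and the KNOWN_TOOLS set are hypothetical stand-ins, with KNOWN_TOOLS playing the role of the tool registry.

```python
from typing import List, Optional

# Hypothetical registry contents; real names would come from the tool registry.
KNOWN_TOOLS = {"platform_search_web", "workspace_read_file", "composio_execute"}


def lookup_tools_in_db(agent_id: int) -> List[str]:
    """Placeholder for the existing DB-backed resolution path."""
    return ["platform_search_web"]


def resolve_contractor_tools(
    agent_id: Optional[int] = None,
    explicit_tools: Optional[List[str]] = None,
) -> List[str]:
    """Use the coordinator's explicit list when given, else fall back to the DB path."""
    if explicit_tools is None:
        if agent_id is None:
            raise ValueError("either agent_id or explicit_tools is required")
        return lookup_tools_in_db(agent_id)

    # Fail fast at spawn time on names the registry does not know.
    unknown = sorted(set(explicit_tools) - KNOWN_TOOLS)
    if unknown:
        raise ValueError(f"unknown tools requested: {unknown}")

    # Drop duplicates while keeping the coordinator's ordering.
    return list(dict.fromkeys(explicit_tools))
```

Leaving explicit_tools unset keeps the existing behaviour, so roster agents would be unaffected by the new code path.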
Section 4: Model-Per-Role Strategy
Research Targets for Full PRD
Role taxonomy: Define the standard roles a coordinator can assign: planner, researcher, coder, reviewer, writer, analyst, simple (formatting/routing)
Model tier mapping: For each role, what model tier is appropriate? Based on RouteLLM and BudgetMLAgent findings:
Role | Model tier | Rationale | Example models
planner | Mid-tier | Good reasoning, runs once per mission | Sonnet 4.6, GPT-4o
researcher | Mid-tier + large context | Process lots of text, synthesize | Gemini 3 Pro, Sonnet 4.6
coder | Top-tier or specialized | Code quality is critical | Opus 4.6, GPT-4
reviewer | Different family from coder | Cognitive diversity catches different bugs | GPT-5.1 (if coder=Claude), DeepSeek
writer | Mid-tier | Good prose, high volume | Sonnet 4.6
analyst | Mid-tier + structured output | Data analysis, table generation | GPT-4o, Gemini 3 Pro
simple | Cheap | Classification, formatting, routing | Haiku 4.5, GPT-4o-mini
OpenRouter integration: Use provider object params per role:
  sort: 'price' for simple, 'throughput' for coder, 'latency' for reviewer
  max_price: Cost ceiling per role tier
  preferred_min_throughput: Min tokens/sec for interactive roles
Cascade/escalation: Start cheap, escalate on low confidence. BudgetMLAgent pattern: free → $0.50/M → $15/M → $60/M
Cognitive diversity enforcement: Reviewer model family MUST differ from coder model family. This is a hard constraint, not a preference.
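A static role mapping with the cognitive diversity constraint enforced at validation time could be sketched as follows. The model identifiers are deliberately fake placeholders in the "vendor/model" slug form, and treating the vendor prefix as the model family is an assumption of this sketch; ROLE_MODELS, ROLE_SORT, model_family, validate_role_models, and provider_params are all hypothetical names.

```python
# Placeholder slugs in "vendor/model" form; these are not real model names.
ROLE_MODELS = {
    "coder": "vendor-a/large-model",
    "reviewer": "vendor-b/large-model",
    "simple": "vendor-a/small-model",
}

# Per-role sort preference, following the mapping listed in this section.
ROLE_SORT = {"simple": "price", "coder": "throughput", "reviewer": "latency"}


def model_family(model_id: str) -> str:
    """Assumption: the vendor prefix of the slug identifies the model family."""
    return model_id.split("/", 1)[0]


def validate_role_models(role_models: dict) -> None:
    """Reject mappings where reviewer and coder share a model family."""
    coder = role_models.get("coder")
    reviewer = role_models.get("reviewer")
    if coder and reviewer and model_family(coder) == model_family(reviewer):
        raise ValueError(
            f"reviewer ({reviewer}) must use a different model family than coder ({coder})"
        )


def provider_params(role: str) -> dict:
    """Build the provider object for a role; only the sort param is shown."""
    sort = ROLE_SORT.get(role)
    return {"sort": sort} if sort else {}
```

Running validate_role_models when a workspace config or mission override is saved would turn the hard constraint into a rejected write rather than a runtime surprise.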
Key Design Questions
Where is the role→model mapping stored? Options: (a) hardcoded in coordinator prompt, (b) workspace-level config table, (c) mission-level override. Recommendation: workspace config with mission override.
Who decides the model? Coordinator LLM suggests, but mapping table constrains. Coordinator can't assign Opus to a simple role.
How does cost estimation work pre-execution? Use OpenRouter's pricing API + average token counts per role to estimate mission cost before human approval.
What about user model preferences? PRD-100 says users can set "model preferences per role." How does this compose with workspace defaults and coordinator selection?
Fallback when preferred model is unavailable? OpenRouter handles provider fallback, but what if the model itself is deprecated? Need a fallback chain per role.
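The pre-execution cost estimate reduces to per-token prices multiplied by average token counts per role. The sketch below assumes prices are expressed per million tokens and that average prompt and completion counts per role are already known; Price, estimate_task_cost, and estimate_mission_cost are hypothetical names, and no real pricing data is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Price in dollars per one million tokens (assumed unit for this sketch)."""
    prompt_per_million: float
    completion_per_million: float


def estimate_task_cost(price: Price, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated dollar cost of one contractor task."""
    return (
        prompt_tokens * price.prompt_per_million
        + completion_tokens * price.completion_per_million
    ) / 1_000_000


def estimate_mission_cost(tasks, prices: dict, avg_tokens: dict) -> float:
    """Sum task estimates.

    tasks: iterable of (role, model_id) pairs planned by the coordinator.
    prices: model_id -> Price.
    avg_tokens: role -> (avg_prompt_tokens, avg_completion_tokens).
    """
    total = 0.0
    for role, model_id in tasks:
        prompt_tokens, completion_tokens = avg_tokens[role]
        total += estimate_task_cost(prices[model_id], prompt_tokens, completion_tokens)
    return total
```

The estimate is only as good as the per-role averages, which is one reason the telemetry loop matters for keeping it honest.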
Existing Codebase Touchpoints
agents.model_config JSON: per-agent model settings. Contractors get this from role mapping, not DB.
config.py:LLM_MODEL is the system default. Contractors should NOT use this; they use role-specific models.
config.py:OPENROUTER_BASE_URL: all contractor LLM calls route through OpenRouter.
AgentFactory._create_llm_manager() creates the LLM client from config. Must support contractor model configs without DB lookup.
Section 5: Key Design Questions
Contractor Lifecycle
DB record for contractors or ephemeral only? Hybrid: in-memory execution + async DB audit row. See Section 3 analysis.
Memory scope: mission-only or none? Mission-only. Contractor sees prior task results from same mission (injected by coordinator). Does NOT read/write Mem0. Does NOT accumulate short-term memory across tasks (single-shot).
Tool assignment: inherit from mission or custom per task? Custom per task. Coordinator specifies tools: ["platform_search_web", "workspace_read_file"] in contractor_config JSONB (defined in PRD-101's mission_tasks).
Spawn latency budget: <100ms for in-memory creation. Actual LLM call (first response) adds 1-5s. Total spawn-to-first-output: <6s.
What triggers cleanup? Three triggers: (a) mission marked complete/failed, (b) TTL expiry (expires_at), (c) explicit coordinator CLEANUP command. Cleanup = evict from active_agents + soft-delete DB row if exists.
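The three cleanup triggers can be folded into a single eviction pass. This is a sketch under stated assumptions: contractors are represented as plain dicts with mission_id and expires_at keys, cleanup_ephemeral is a hypothetical name, and the soft-delete of the DB row is left out because the schema is not yet defined.

```python
from datetime import datetime, timedelta, timezone


def cleanup_ephemeral(active_agents, now, finished_missions=frozenset(), explicit_ids=frozenset()):
    """Evict contractors that match any of the three triggers; return evicted ids.

    active_agents: dict of agent_id -> {"mission_id": ..., "expires_at": datetime}.
    finished_missions: mission ids marked complete or failed (trigger a).
    now: current time, compared against expires_at (trigger b).
    explicit_ids: agent ids named in an explicit coordinator cleanup (trigger c).
    """
    evicted = []
    # Iterate over a snapshot so entries can be deleted during the loop.
    for agent_id, agent in list(active_agents.items()):
        if (
            agent["mission_id"] in finished_missions
            or agent["expires_at"] <= now
            or agent_id in explicit_ids
        ):
            del active_agents[agent_id]
            evicted.append(agent_id)
    return evicted
```

Because the function is idempotent, the same pass could serve mission completion, a periodic sweep, and explicit coordinator cleanup without special cases.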
Model Selection
Static mapping vs dynamic routing for v1? Static role-based mapping. No per-prompt ML routing in v1. OpenRouter's Auto Router as optional fallback.
Cost tracking granularity: Per-contractor-per-task. Each execute_with_prompt() already returns tokens_used and model info. Aggregate at mission level.
User override surface: Mission creation UI shows recommended models per role. User can override any role's model. Override stored in mission_runs.config JSONB.
Integration
How do contractors appear on the board? Each contractor task creates a board_task with source_type='mission' and the mission's project label. Contractor agent_id is set on board_tasks.assigned_agent_id (requires minimal DB row).
How does the coordinator communicate with contractors? NOT via inter_agent Redis pub/sub. Coordinator calls execute_with_prompt() directly and awaits result. Simpler, debuggable, matches heartbeat tick pattern.
Can a contractor spawn sub-contractors? No. This is a hard architectural constraint, not a simplification. Reasons:
Bounded recursion: If contractors could spawn sub-contractors, a single mission could produce unbounded agent trees. Budget enforcement becomes impossible — the coordinator can't pre-estimate cost for a tree of unknown depth.
Observability: The coordinator must see every executing agent. Sub-contractors would be invisible to the coordinator's reconciliation tick — stalls, failures, and budget overruns go undetected.
Debugging: A flat coordinator→contractor structure means every task trace is 2 levels deep. Sub-contractors create N-level traces that are exponentially harder to debug.
Alternative: If a task is too complex for one contractor, the coordinator should decompose it into smaller tasks (replanning), not delegate decomposition to the contractor. The coordinator IS the decomposition engine.
Section 6: Integration with AgentFactory
Current Flow (Roster Agents)
Proposed Flow (Contractor Agents)
Required Changes to AgentFactory
Change | Scope | Risk
New create_ephemeral_agent() method | ~50 lines, new method | Low: doesn't touch existing paths
get_tools_for_agent() accepts explicit_tools param | ~10 lines, new code path | Low: existing path unchanged when param absent
execute_with_prompt() accepts AgentRuntime directly (already does!) | 0 lines: line 711 already checks isinstance(agent, AgentRuntime) | None
New cleanup_ephemeral_agents(mission_id) method | ~30 lines, new method | Low: only affects ephemeral agents
New ContextMode.MISSION_EXECUTION | ~20 lines in ContextService | Low: additive
What Does NOT Change
execute_with_prompt() tool loop (lines 838-862): same 10-iteration tool loop
_execute_tool_calls() (lines 958-1028): same tool dispatch
unified_executor.execute_tool(): same prefix-based routing
Heartbeat tick pattern: roster agents continue unchanged
Agent API endpoints — contractors created by coordinator, not user API
Section 7: Acceptance Criteria for Full PRD
PRD-104 is done when:
Contractor lifecycle is fully specified: state machine with transitions, triggers, error handling, and cleanup for all terminal states
Creation API defined: create_ephemeral_agent() method signature with all params, defaults, and validation rules
Model-per-role mapping defined: role taxonomy, model tier assignments, OpenRouter config per role, cost estimates
Tool scoping designed: how coordinator specifies contractor tools, how get_tools_for_agent() handles explicit tool lists
Memory isolation designed: what mission context a contractor receives, what it can read/write, what persists after destruction
Cleanup automation designed: TTL-based, mission-completion-based, and explicit cleanup triggers with failure modes
AgentFactory integration specified: exact methods to add/modify, with pseudocode and interface signatures
Board integration specified: how contractor tasks appear on kanban, what board_tasks columns are set
Cost estimation model: how to predict mission cost before execution based on role→model mapping and historical averages
Failure handling specified: retryable vs fatal errors, escalation to coordinator, budget-exhaustion behavior
Concurrency control specified: max parallel contractors per mission, per workspace, backpressure strategy
Data model defined: contractor_config JSONB schema (referenced in PRD-101's mission_tasks), any new columns on agents table
Prior art cited: every design decision references specific systems/papers studied, with tradeoff analysis
Section 8: Risks & Dependencies
Risks
1 | Spawn overhead too high: DB write + LLM init adds unacceptable latency for parallel fan-out | High | Medium | Hybrid approach: in-memory first, async DB. Pre-warm LLM connections. Pool LLM managers per model.
2 | Model routing accuracy: wrong model for role degrades output quality without clear signal | Medium | Medium | Start with static mapping (conservative). Add telemetry to PRD-106 to detect model-quality correlation. Iterate.
3 | Contractor agent quality: without personality, memory, or training, contractors may produce worse output than roster agents | High | Medium | Contractors get rich system prompts from ContextService (mission context, role instructions, success criteria). Quality comes from prompt, not persistence.
4 | Cleanup failures: missed cleanup leads to resource leaks (memory, DB rows, Redis keys) | Medium | Medium | Defense in depth: TTL expiry (hard cap), mission-completion cleanup, periodic GC sweep. Belt and suspenders.
5 | Tool scoping complexity: dynamic tool lists create edge cases in tool execution (missing tools, wrong params) | Medium | Low | Validate tool names against ToolRegistry at spawn time. Fail fast with clear error.
6 | Unbounded parallelism: too many contractors overwhelm LLM rate limits or DB connections | High | High | Hard caps: per-mission (configurable, default 5), per-workspace (matches heartbeat limit). Queue excess tasks.
7 | Cost blowout: parallel contractors with expensive models exceed budget before coordinator can react | High | Medium | PRD-105 budget enforcement runs pre-check before each contractor spawn. Coordinator holds budget lock.
8 | Model deprecation: OpenRouter model removed mid-mission | Low | Low | Fallback chain per role (3 models deep). OpenRouter's provider routing handles within-model fallback.
Dependencies
PRD | Relationship | Notes
PRD-101 (Mission Schema) | Blocks 104 | mission_tasks.contractor_config JSONB schema must be defined. Contractor needs mission_id FK.
PRD-102 (Coordinator) | Blocks 104 | Coordinator decides WHEN and WHAT contractors to spawn. Contractor lifecycle is called BY coordinator.
PRD-103 (Verification) | Informs 104 | Verification cost budget (10-30% of task cost) affects model selection for reviewer role.
PRD-105 (Budget) | Uses 104 | Budget enforcement wraps contractor creation and checks remaining budget before spawn.
PRD-106 (Telemetry) | Uses 104 | Telemetry captures per-contractor metrics: model, tokens, cost, duration, verifier_score.
PRD-107 (Context Interface) | Informs 104 | Context interface determines how contractors receive/share mission context. Phase 2 = direct injection. Phase 3 = field read/write.
Cross-PRD Discoveries
AgentFactory.execute_with_prompt() already accepts AgentRuntime directly (line 711), so no modification is needed for the execution path
heartbeat_service._agent_tick() pattern is the structural template for contractor execution, with the same flow: load context → build prompt → call factory → collect result
inter_agent.py's AgentCommunicationProtocol and CollaborativeReasoner exist but are NOT wired into live code; future consideration for contractor-to-contractor coordination within a mission
SharedContextManager has 2h Redis TTL, which is insufficient for multi-day missions. Contractors on short missions are fine, but PRD-107 must address this for the context interface.
Existing model_usage_stats JSON on agents table tracks lifetime metrics; contractors need per-execution metrics instead (already in execute_with_prompt() return value)
Appendix: Research Sources
Systems Studied
Agent Zero — github.com/frdel/agent-zero — delegation model, conversation sealing, utility model
AutoGen 0.2 & 0.4+ — github.com/microsoft/autogen — ConversableAgent, GroupChatManager, Swarm handoffs
Kubernetes Jobs — kubernetes.io/docs/concepts/workloads/controllers/job/ — lifecycle, TTL, backoff, pod failure policy
Martian — withmartian.com — Model Mapping, expert orchestration AI architecture
Unify.ai — unify.ai — neural network router, custom router training, 10-min benchmark refresh
OpenRouter — openrouter.ai — provider routing, Auto Router (Not Diamond), capability-based routing, Exacto endpoints
Papers & Benchmarks
RouteLLM (ICLR 2025, UC Berkeley/Anyscale/Canva) — matrix factorization router, 75% cost reduction, arxiv.org/abs/2406.18665
RouterArena (2025) — benchmark of 12 routers, arxiv.org/html/2510.00202v1
BudgetMLAgent (AIMLSystems 2024) — cascade pattern, 96% cost reduction, dl.acm.org/doi/10.1145/3703412.3703416
Claude Haiku 4.5 announcement — Anthropic — "Sonnet orchestrates Haiku team" pattern
Automatos Codebase Files
orchestrator/modules/agents/factory/agent_factory.py - AgentFactory (1,425 lines), execute_with_prompt(), tool loop
orchestrator/services/heartbeat_service.py - _agent_tick() pattern, rate limiting, concurrent execution
orchestrator/core/models/core.py:172-280 - Agent table schema (45+ columns)
orchestrator/modules/tools/tool_router.py:129-250 - get_tools_for_agent(), tool resolution
orchestrator/modules/agents/communication/inter_agent.py - Redis pub/sub, shared context, consensus
orchestrator/config.py:144-226 - OpenRouter config, LLM model defaults
orchestrator/api/agents.py:404-483 - Agent creation API endpoint

