PRD-69: Agent Intelligence Layer — Instincts, Evaluation & Strategic Context
Version: 1.0
Status: Draft
Priority: P1
Author: Gar Kavanagh + Auto CTO
Created: 2026-03-03
Updated: 2026-03-03
Dependencies: PRD-68 (Progressive Complexity Routing — IN PROGRESS), PRD-50 (Universal Router — COMPLETE), PRD-10 (Workflow Engine — COMPLETE), PRD-05 (Memory & Knowledge — COMPLETE), PRD-06 (Monitoring & Analytics — COMPLETE)
Branch: feat/agent-intelligence-layer
Executive Summary
Automatos has sophisticated infrastructure: a 7-tier routing engine, a 9-stage orchestration pipeline, a 5-level memory hierarchy, 850+ tools via Composio, and a Progressive Complexity model (PRD-68) that routes atoms through organisms. But the intelligence layer — the system that makes all of this learn, evaluate, and improve — is largely stubbed.
Concrete evidence:
| Component | Status | Evidence |
|---|---|---|
| PlaybookMiner | Returns hardcoded demo sequences | modules/learning/playbooks/miner.py:29-34 — _fetch_sequences() returns 4 static lists |
| feedback/ directory | Empty | modules/learning/feedback/__init__.py — 0 lines |
| patterns/ directory | Empty | modules/learning/patterns/__init__.py — 0 lines |
| Agent evaluation fields | 8 of 10 never populated | core/models/core.py — specialization_score, reliability_score, adaptation_rate, collaboration_score etc. are declared but never written to outside seed data |
| Memory hierarchy | Exists but not integrated into routing | Mem0 stores memories, but LearningSystemUpdater doesn't read them back to influence routing or tool selection |
| Cost tracking | Comprehensive but passive | analytics_engine.py, llm_analytics.py track spend — but cost data never feeds back into model routing or budget enforcement |
| Context compaction | Token-count-based | context_guard.py compacts at 80% token usage regardless of workflow phase — mid-reasoning compaction loses critical chain-of-thought |
The affaan-m/everything-claude-code repository (58k+ stars, MIT license) has solved many of these problems for the Claude Code CLI: instinct-based learning, structured agent handoffs, eval harnesses, iterative retrieval, and strategic compaction. This PRD adapts 10 of those patterns into Automatos's multi-tenant SaaS architecture.
What We're Building
An intelligence layer that makes Automatos learn from every execution and get smarter over time:
Instinct-Based Learning — Observe execution patterns, build confidence, auto-promote proven instincts into routing rules and skills
Agent Pipeline Orchestration — Structured handoff documents and verdict systems so multi-agent workflows pass real context instead of dumping raw output
Cost-Aware Evaluation — Route complexity tiers to appropriate models, enforce budgets, and measure agent quality with pass@k scoring
Strategic Context Management — Iterative retrieval that refines queries and phase-aware compaction that respects workflow boundaries
What We're NOT Building
A new routing engine (PRD-68 handles that)
A new memory system (Mem0 integration is complete)
A new pipeline architecture (PRD-59 pipeline.py is solid)
Human-in-the-loop eval UI (Phase 3 future work)
1. Current State
1.1 Learning System — Wired but Empty
The LearningSystemUpdater.update_from_execution() is called from the pipeline but its downstream effects are minimal — it updates total_executions, avg_response_time, and success_rate on the Agent model. The remaining evaluation fields (specialization_score, reliability_score, adaptation_rate, collaboration_score, avg_quality_score, performance_score, readiness_score, discriminatory_power) are never computed.
1.2 Pipeline — Composable but No Inter-Agent Context
modules/orchestrator/pipeline.py defines a composable stage executor with StageStatus, ErrorStrategy, and SSE event emission. Stages receive a WorkflowContext and return a StageResult. But when multiple agents participate in a workflow:
Agent B receives Agent A's raw text output, not structured context
No verdict system exists — there's no way for an agent to say "this needs more work" vs. "ship it"
Pipeline progress is emitted as SSE events but with stage-level granularity, not reasoning-step granularity
1.3 Cost Tracking — Tracks but Doesn't Act
The platform tracks LLM spend per workspace/agent/model through llm_analytics.py, analytics_engine.py, and statistics.py. But:
No budget limits exist — a runaway workflow can burn unlimited tokens
Cost data doesn't influence model selection — PRD-68's complexity tiers map to pipeline depth but not model cost
No quality evaluation — we track whether executions complete, not whether they're good
1.4 Context Engineering — One-Shot Retrieval, Token-Based Compaction
modules/orchestrator/stages/context_engineering.py does single-pass RAG retrieval: one query → one set of results → inject into prompt. No refinement loop if results are poor.
core/context_guard.py compacts at 80% token usage using a flat strategy: summarize older turns, keep recent ones. It doesn't know about workflow phases — compaction mid-reasoning is as likely as compaction between phases.
2. Feature Area A: Instinct-Based Learning
Adapted from: everything-claude-code/skills/continuous-learning-v2/SKILL.md, commands/evolve.md, commands/skill-create.md
Concept
An instinct is an observed execution pattern that the system learns through repetition and reinforcement. Unlike hardcoded rules, instincts:
Start with low confidence (0.3) and strengthen through repeated observation
Decay over time if not reinforced (prevents stale patterns)
Auto-promote into concrete routing rules or skills at high confidence (≥ 0.8)
Are workspace-scoped (tenant A's patterns don't bleed into tenant B)
A.1 Instinct Data Model
New file: orchestrator/core/models/instincts.py
Migration: Add instincts table. Alembic migration in orchestrator/alembic/versions/.
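A minimal sketch of the intended fields, using a plain dataclass for illustration; the real implementation would be a SQLAlchemy declarative model, and any field name beyond the A.2 deduplication key is an assumption:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import uuid4


@dataclass
class Instinct:
    """Sketch of the instincts table. The first four fields form the
    compound-uniqueness key described in A.2; the rest are illustrative."""
    workspace_id: str
    trigger_type: str          # "intent" | "tool_sequence" | "error_recovery" | "routing"
    trigger_pattern: str       # e.g. a normalized intent description
    successful_action: str     # what worked, e.g. a routing decision or tool chain
    confidence: float = 0.3    # new observations start unverified (A.3)
    observation_count: int = 1
    success_count: int = 1
    status: str = "active"     # "active" | "promoted" | "decayed"
    last_reinforced_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    id: str = field(default_factory=lambda: str(uuid4()))
```

Workspace scoping lives directly on the row, so tenant isolation is a query filter rather than a separate mechanism.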
A.2 Observation Pipeline
New file: orchestrator/modules/learning/patterns/instinct_observer.py
The observer hooks into LearningSystemUpdater.update_from_execution() and extracts instincts from completed workflow executions:
Key design decisions:
Observer is read-only on the execution path — it processes data after execution completes, never blocks the hot path
Pattern extraction uses the existing SubtaskExecution metadata (tool calls, durations, quality scores) — no new data collection needed
Deduplication by the (workspace_id, trigger_type, trigger_pattern, successful_action) compound uniqueness constraint
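The dedup-and-upsert behavior can be sketched as follows, with a plain dict standing in for the instincts table:

```python
def observe_execution(store, workspace_id, trigger_type, trigger_pattern,
                      action, succeeded):
    """Upsert an observation keyed by the compound-uniqueness tuple.
    `store` is a dict here for illustration; the real observer would
    upsert against the instincts table."""
    key = (workspace_id, trigger_type, trigger_pattern, action)
    inst = store.setdefault(key, {"observations": 0, "successes": 0,
                                  "confidence": 0.3})
    inst["observations"] += 1
    inst["successes"] += int(succeeded)
    return inst
```

Repeated observations of the same pattern reinforce one row rather than creating duplicates, which is what lets confidence accumulate.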
A.3 Confidence Scoring Engine
New file: orchestrator/modules/learning/feedback/confidence.py
Confidence lifecycle:
0.3 — New observation, unverified
0.5 — Seen 3+ times with >70% success rate
0.7 — Reliable pattern, starts influencing suggestions
0.8 — Promotion threshold — auto-promoted to routing rule or skill
0.9 — Maximum practical confidence (never reaches 1.0 — always room for learning)
< 0.2 — Decayed, marked status="decayed" and excluded from queries
Decay schedule: Confidence decreases by 0.005 per day without reinforcement. An instinct at 0.8 that isn't reinforced for 60 days decays to 0.5; after 120 days it reaches 0.2 and is archived.
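The decay schedule reduces to a small pure function; `instinct_status` and its band labels are illustrative names for the lifecycle thresholds above:

```python
def decayed_confidence(confidence, days_since_reinforcement, decay_per_day=0.005):
    """Apply the linear A.3 decay schedule: 0.005 per day without reinforcement."""
    return max(0.0, confidence - decay_per_day * days_since_reinforcement)


def instinct_status(confidence):
    """Map a confidence value to a lifecycle band (labels are illustrative)."""
    if confidence < 0.2:
        return "decayed"
    if confidence >= 0.8:
        return "promotable"
    return "active"
```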
A.4 Auto-Promotion (Instincts → Skills/Routing Rules)
New file: orchestrator/modules/learning/patterns/instinct_promoter.py
When an instinct reaches confidence ≥ 0.8 with ≥ 10 observations:
| Trigger type | Promoted artifact | Effect |
|---|---|---|
| intent | Routing rule in routing table | AutoBrain uses this to bypass regex matching |
| tool_sequence | Playbook template | Suggested as a workflow recipe |
| error_recovery | Error handler in pipeline | Auto-retry with the learned fallback |
| routing | Complexity tier override | Overrides default tier for known patterns |
Promotion is reversible — if the promoted rule starts failing (tracked via the same observation pipeline), the instinct is demoted back to active status and confidence is reduced.
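The eligibility check and the trigger-type mapping can be sketched as follows (the artifact identifiers are illustrative):

```python
PROMOTION_CONFIDENCE = 0.8
PROMOTION_OBSERVATIONS = 10

# Maps A.4 trigger types to the artifact they promote into.
TARGETS = {
    "intent": "routing_rule",
    "tool_sequence": "playbook_template",
    "error_recovery": "error_handler",
    "routing": "tier_override",
}


def promotion_target(trigger_type, confidence, observations):
    """Return the artifact this instinct promotes into, or None if not eligible."""
    if confidence < PROMOTION_CONFIDENCE or observations < PROMOTION_OBSERVATIONS:
        return None
    return TARGETS[trigger_type]
```

Demotion would reverse this: when a promoted rule's observed failure rate rises, the instinct returns to active status with reduced confidence.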
A.5 Evolve Endpoint
Adapted from: everything-claude-code/commands/evolve.md
New API endpoint: POST /api/v1/learning/evolve
Clusters related instincts and proposes higher-order patterns:
Clustering uses LLM-based semantic grouping (not k-means — patterns are natural language, not vectors). The LLM receives instinct descriptions and groups by functional theme.
A.6 Populate Agent Evaluation Fields
Modify: orchestrator/modules/learning/engine/core.py
Wire the 8 unused evaluation fields on the Agent model:
| Field | Computation | Data source |
|---|---|---|
| avg_quality_score | Weighted moving average of QualityAssessor scores | quality_assessor.py stage output |
| specialization_score | Entropy of task-type distribution (low entropy = specialist) | subtask_executions — count tasks by type per agent |
| reliability_score | Success rate weighted by recency | success_count / total with exponential recency weighting |
| adaptation_rate | Improvement slope over last 20 executions | Linear regression on quality scores over time |
| collaboration_score | Success rate in multi-agent workflows vs. solo | Compare quality when agent works alone vs. with others |
| performance_score | Composite: 40% quality + 30% speed + 30% token efficiency | Blend of existing metrics |
| readiness_score | Current availability × reliability × recency | Is this agent warmed up and reliable right now? |
| discriminatory_power | Variance in quality across task types | High = good at some things, bad at others. Low = consistent. |
These fields feed into agent_selector.py and llm_agent_selector.py for smarter agent assignment.
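As one example of these computations, specialization_score could be derived from normalized Shannon entropy; the exact normalization is an assumption, since the table only specifies "low entropy = specialist":

```python
import math
from collections import Counter


def specialization_score(task_types):
    """1.0 = pure specialist (all tasks one type), 0.0 = uniform generalist.
    Inverse Shannon entropy of the task-type distribution, normalized by
    the maximum entropy for the observed number of types (an assumption)."""
    counts = Counter(task_types)
    n = len(task_types)
    if n == 0 or len(counts) == 1:
        return 1.0
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 1.0 - entropy / math.log2(len(counts))
```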
3. Feature Area B: Agent Pipeline Orchestration
Adapted from: everything-claude-code/agents/planner.md, agents/code-reviewer.md
B.1 Handoff Document Schema
Modify: orchestrator/modules/orchestrator/pipeline.py
When Agent A completes a stage and Agent B picks up, the pipeline currently passes raw text output. Replace with a structured HandoffDocument:
The WorkflowContext (already the pipeline's shared state) gains a handoff_chain: List[HandoffDocument] field. Each stage appends its handoff document. Downstream stages can read the full chain or just the most recent handoff.
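A sketch of the schema; only summary, detailed_output, and open_questions are named elsewhere in this PRD (B.2, D.2), so the remaining field names are placeholders:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HandoffDocument:
    """Structured inter-agent handoff replacing raw text output."""
    stage_name: str               # placeholder name for the producing stage
    summary: str                  # replaces full output after phase-aware compaction (D.2)
    detailed_output: str          # retrievable from handoff_chain by downstream stages
    open_questions: List[str] = field(default_factory=list)  # drives NEEDS_WORK retries (B.2)


def latest_handoff(handoff_chain: List[HandoffDocument]) -> HandoffDocument:
    """Downstream stages can read the full chain or just the most recent handoff."""
    return handoff_chain[-1]
```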
B.2 Pipeline Verdict System
Modify: orchestrator/modules/orchestrator/pipeline.py
Add a verdict protocol so stages can express outcomes beyond success/failure:
Pipeline behavior per verdict:
| Verdict | Pipeline behavior |
|---|---|
| CONTINUE | Advance to next stage (current behavior) |
| NEEDS_WORK | Re-execute current stage with feedback from HandoffDocument.open_questions. Max 2 retries before ESCALATE. |
| BLOCKED | Log blocker, emit SSE event, skip to next stage with ErrorStrategy.SKIP |
| SHIP | Short-circuit remaining stages, deliver result immediately |
| ESCALATE | Pause pipeline, emit SSE event for human review. Resume on API call. |
This replaces the current binary StageStatus.COMPLETED / StageStatus.FAILED for inter-stage communication. StageStatus remains for lifecycle tracking (pending/running/completed/failed).
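The verdict protocol and its dispatch can be sketched as follows (the action names are illustrative):

```python
from enum import Enum


class StageVerdict(Enum):
    CONTINUE = "continue"
    NEEDS_WORK = "needs_work"
    BLOCKED = "blocked"
    SHIP = "ship"
    ESCALATE = "escalate"


MAX_NEEDS_WORK_RETRIES = 2


def next_action(verdict: StageVerdict, needs_work_retries: int = 0) -> str:
    """Map a stage verdict to pipeline behavior. NEEDS_WORK retries the
    current stage up to MAX_NEEDS_WORK_RETRIES times before escalating."""
    if verdict is StageVerdict.NEEDS_WORK:
        return "retry" if needs_work_retries < MAX_NEEDS_WORK_RETRIES else "escalate"
    return {
        StageVerdict.CONTINUE: "advance",
        StageVerdict.BLOCKED: "skip_with_error_strategy",
        StageVerdict.SHIP: "short_circuit",
        StageVerdict.ESCALATE: "pause_for_review",
    }[verdict]
```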
B.3 Pipeline Visualization via SSE
Modify: orchestrator/modules/orchestrator/pipeline.py
Extend existing SSE event emission with richer data:
Current SSE events emit stage_started and stage_completed. Add stage_verdict, stage_retry, pipeline_short_circuit (for SHIP), and pipeline_escalation event types.
B.4 Context Mode Switching
Adapted from: everything-claude-code/contexts/dev.md, contexts/review.md, contexts/research.md
New file: orchestrator/consumers/chatbot/context_modes.py
Context modes adjust the system prompt, tool selection, and evaluation criteria based on the user's current intent:
| Mode | Purpose | Tool priority | Evaluation criteria |
|---|---|---|---|
| dev | Build features, write code, execute tasks | Code tools, Composio actions, platform tools | Correctness, completion |
| review | Analyze existing work, find issues, suggest improvements | CodeGraph search, RAG, analytics | Thoroughness, accuracy |
| research | Gather information, compare options, summarize findings | Web search, RAG, document retrieval | Breadth, source quality |
| plan | Create plans, estimate effort, identify risks | Platform awareness, memory, analytics | Completeness, feasibility |
Activation: Auto-detected by AutoBrain based on intent (e.g., "review this PR" → review mode, "build a webhook handler" → dev mode). Can also be set explicitly via chat command /mode review or API parameter.
Integration point: Context modes inject a mode-specific system prompt prefix before the personality prompt in service.py. They also adjust tool_hints (PRD-68) to prioritize mode-relevant tools.
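A sketch of the mode table and the detection hook; the prompt prefixes and the keyword heuristic are placeholders for the real AutoBrain intent detection:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ContextMode:
    name: str
    prompt_prefix: str                 # injected before the personality prompt
    tool_priorities: Tuple[str, ...]   # adjusts PRD-68 tool_hints


MODES = {
    "dev": ContextMode("dev", "Focus on building and executing.",
                       ("code", "composio", "platform")),
    "review": ContextMode("review", "Focus on analysis and issues.",
                          ("codegraph", "rag", "analytics")),
    "research": ContextMode("research", "Focus on gathering and comparing.",
                            ("web_search", "rag", "documents")),
    "plan": ContextMode("plan", "Focus on planning and risks.",
                        ("platform", "memory", "analytics")),
}


def detect_mode(message: str) -> str:
    """Toy keyword heuristic standing in for AutoBrain intent classification."""
    text = message.lower()
    for keyword, mode in (("review", "review"), ("research", "research"),
                          ("compare", "research"), ("plan", "plan"),
                          ("estimate", "plan")):
        if keyword in text:
            return mode
    return "dev"
```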
4. Feature Area C: Cost-Aware Evaluation
Adapted from: everything-claude-code/rules/common/performance.md, skills/eval-harness/SKILL.md
C.1 Complexity-Based Model Routing
Modify: orchestrator/core/llm/manager.py, orchestrator/consumers/chatbot/service.py
PRD-68 defines 5 complexity tiers (ATOM → ORGANISM) and maps them to pipeline depth. This feature adds model cost tiers to that mapping:
| Tier | Execution | Model category | Example models | Est. cost per request |
|---|---|---|---|---|
| ATOM | Direct response | economy | Haiku, GPT-4o-mini, Gemini Flash | $0.001-0.005 |
| MOLECULE | Single agent | standard | Sonnet, GPT-4o, Gemini Pro | $0.01-0.03 |
| CELL | Agent + tools | standard | Sonnet, GPT-4o, Gemini Pro | $0.01-0.03 |
| ORGAN | Multi-agent | premium | Opus, GPT-4.5, Gemini Ultra | $0.05-0.15 |
| ORGANISM | Full swarm | premium | Opus, GPT-4.5, Gemini Ultra | $0.05-0.15 |
Implementation: Add model_category field to ComplexityAssessment (PRD-68's dataclass). LLMManager.get_model() accepts a category parameter and selects from the workspace's configured models for that tier. Falls back to config.LLM_MODEL if no category-specific model is configured.
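The tier-to-category lookup with workspace fallback might look like this; the model names and the default string are placeholders (the real fallback is config.LLM_MODEL):

```python
from typing import Dict

# Tier-to-category mapping from the table above.
TIER_TO_CATEGORY = {
    "ATOM": "economy",
    "MOLECULE": "standard",
    "CELL": "standard",
    "ORGAN": "premium",
    "ORGANISM": "premium",
}


def select_model(tier: str, workspace_prefs: Dict[str, str],
                 default_model: str = "default-model") -> str:
    """Pick the workspace's configured model for the tier's cost category,
    falling back to the global default when no category model is set."""
    category = TIER_TO_CATEGORY[tier]
    return workspace_prefs.get(category, default_model)
```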
Workspace model configuration: New workspace_model_preferences table:
C.2 Budget Tracking with Alerts
New file: orchestrator/core/services/cost_governor.py
Budget data sources: existing llm_usage table tracked by analytics_engine.py. No new tracking needed — just aggregation + enforcement.
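The enforcement decision reduces to a small pure function over aggregated spend. A null limit means unlimited, matching the nullable budget columns in section 9; the action names other than "alert" are assumptions:

```python
def budget_decision(spend_usd, limit_usd, alert_threshold_pct=0.8,
                    overage_action="alert"):
    """Decide what to do given current workspace spend against its limit.
    overage_action mirrors the workspace's budget_overage_action column;
    "downgrade" and "block" are assumed alternatives to the "alert" default."""
    if limit_usd is None:
        return "ok"                    # null limit = unlimited
    if spend_usd >= limit_usd:
        return overage_action
    if spend_usd >= alert_threshold_pct * limit_usd:
        return "warn"                  # crossed the alert threshold
    return "ok"
```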
C.3 Eval Harness with pass@k Scoring
Adapted from: everything-claude-code/skills/eval-harness/SKILL.md
New file: orchestrator/modules/orchestrator/stages/eval_harness.py
For high-complexity tasks (ORGAN/ORGANISM), optionally run the agent pipeline k times and select the best result:
When to use: Controlled by workspace setting. Default is k=1 (no eval harness). Premium workspaces can set k=3 for critical workflows. Budget enforcement (C.2) applies — k runs cost k× tokens.
Quality threshold: Configurable per workspace, default 0.7 (on the 0-1 scale from quality_assessor.py).
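A sketch of the best-of-k selection; in the real stage, `score` would come from quality_assessor.py's 0-1 quality score:

```python
def best_of_k(run_pipeline, score, k=3, quality_threshold=0.7):
    """Run the pipeline k times and return (best_result, passed_threshold).
    `run_pipeline` and `score` are injected callables; budget enforcement
    (C.2) still applies since k runs cost k-times the tokens."""
    candidates = [run_pipeline() for _ in range(k)]
    best = max(candidates, key=score)
    return best, score(best) >= quality_threshold
```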
5. Feature Area D: Strategic Context Management
Adapted from: everything-claude-code/skills/iterative-retrieval/SKILL.md, skills/strategic-compact/SKILL.md
D.1 Iterative Retrieval
Modify: orchestrator/modules/orchestrator/stages/context_engineering.py
Replace single-pass RAG with a DISPATCH → EVALUATE → REFINE → LOOP cycle:
Scoring functions:
Relevance: Cosine similarity between query embedding and chunk embeddings (already computed by RAG pipeline)
Coverage: LLM-assessed — "Does this context contain enough information to answer the query?" Binary yes/no with a brief explanation
Performance guard: Each cycle adds ~500ms (one RAG query + one LLM coverage check). Max 3 cycles = ~1.5s additional latency. Only activated for CELL+ complexity (ATOM and MOLECULE use single-pass retrieval).
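The DISPATCH → EVALUATE → REFINE → LOOP cycle can be sketched as follows; the `retrieve` and `assess` callables stand in for the RAG query and the LLM coverage check:

```python
def iterative_retrieve(retrieve, assess, query, max_cycles=3,
                       relevance_floor=0.7):
    """Retrieve, keep sufficiently relevant chunks, and refine the query
    until coverage is judged sufficient or max_cycles is reached.
    `assess(query, kept)` returns (covered, refined_query); in the real
    stage this is the LLM yes/no coverage check with a refined query."""
    kept = []
    for _ in range(max_cycles):
        chunks = retrieve(query)                                   # DISPATCH
        kept.extend(c for c in chunks if c["relevance"] >= relevance_floor)
        covered, query = assess(query, kept)                       # EVALUATE / REFINE
        if covered:                                                # LOOP exit
            break
    return kept
```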
D.2 Phase-Aware Compaction
Modify: orchestrator/core/context_guard.py
Current behavior: Compact at 80% token usage regardless of what's happening in the pipeline.
New behavior: Compact at workflow phase boundaries, preserving the full context within each phase.
Integration with HandoffDocument (B.1): Handoff documents provide natural compaction units. After a stage completes, its handoff document's summary replaces the full stage output in the conversation context. The detailed_output is available in the handoff_chain if a downstream stage needs to look back.
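The summary-for-detail swap at a phase boundary might look like this; the message and handoff shapes are illustrative:

```python
def compact_at_boundary(messages, handoff_chain):
    """Replace each completed stage's full output with its handoff summary.
    Detailed output stays retrievable from handoff_chain, so nothing is lost;
    the conversation context just carries the compact form."""
    summaries = {h["stage"]: h["summary"] for h in handoff_chain}
    return [
        {**m, "content": summaries[m["stage"]]} if m.get("stage") in summaries else m
        for m in messages
    ]
```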
Token budget allocation:
6. Phased Implementation
Phase 1 — MVP (Weeks 1-3)
Core learning loop + basic pipeline improvements + cost routing.
| Item | Scope | Files |
|---|---|---|
| A.1 | Instinct data model + migration | core/models/instincts.py (NEW), alembic migration |
| A.2 | Observation pipeline (intent→tool + routing patterns only) | modules/learning/patterns/instinct_observer.py (NEW), modules/learning/engine/core.py (MODIFY) |
| A.3 | Confidence scoring (basic: success rate + time decay) | modules/learning/feedback/confidence.py (NEW) |
| B.1 | HandoffDocument schema + wire into pipeline | modules/orchestrator/pipeline.py (MODIFY) |
| B.2 | CONTINUE/SHIP verdicts only (no NEEDS_WORK loops yet) | modules/orchestrator/pipeline.py (MODIFY) |
| C.1 | Model category on ComplexityAssessment + LLMManager routing | core/llm/manager.py (MODIFY), consumers/chatbot/service.py (MODIFY) |
| D.1 | Simplified iterative retrieval (2 cycles max, relevance-only scoring) | modules/orchestrator/stages/context_engineering.py (MODIFY) |
Phase 1 success criteria:
Instincts are being created from real executions
HandoffDocuments are passing between pipeline stages
ATOM tasks use economy models, ORGANISM tasks use premium models
RAG retrieval does at least 1 refinement cycle when initial results score below 0.7
Phase 2 — Enhancement (Weeks 4-6)
Auto-promotion, full verdict system, budget enforcement, phase-aware compaction.
| Item | Scope | Files |
|---|---|---|
| A.4 | Auto-promotion (instincts → routing rules) | modules/learning/patterns/instinct_promoter.py (NEW) |
| A.5 | Evolve endpoint (cluster + propose skills) | api/learning.py (MODIFY) |
| A.6 | Populate all 8 agent evaluation fields | modules/learning/engine/core.py (MODIFY) |
| B.2+ | NEEDS_WORK + BLOCKED + ESCALATE verdicts with retry loops | modules/orchestrator/pipeline.py (MODIFY) |
| B.3 | Enriched SSE events for pipeline visualization | modules/orchestrator/pipeline.py (MODIFY) |
| B.4 | Context modes (dev/review/research/plan) | consumers/chatbot/context_modes.py (NEW), consumers/chatbot/service.py (MODIFY) |
| C.2 | Budget tracking + alerts + auto-downgrade | core/services/cost_governor.py (NEW), core/models/workspaces.py (MODIFY) |
| C.3 | Eval harness with pass@k scoring | modules/orchestrator/stages/eval_harness.py (NEW) |
| D.1+ | Full iterative retrieval (3 cycles, coverage scoring) | modules/orchestrator/stages/context_engineering.py (MODIFY) |
| D.2 | Phase-aware compaction | core/context_guard.py (MODIFY) |
Phase 2 success criteria:
High-confidence instincts auto-promote to routing rules and are used by AutoBrain
Agent eval fields are populated and visible in agent analytics
Budget limits enforce model downgrades when workspace spend exceeds threshold
Context compaction never occurs mid-reasoning
Phase 3 — Future Work (Not Scoped)
Multi-model trust rules — Per-model confidence tracking: "trust Opus for code review, use Haiku for summarization"
Instinct marketplace — Share high-confidence instincts across workspaces (opt-in)
Human-in-loop eval — UI for reviewing ESCALATE verdicts and providing feedback
Context mode UI — Frontend mode selector with visual indicator
Eval dashboard — Visualize pass@k metrics, agent performance trends, instinct lifecycle
Cross-workspace instinct federation — Platform-level instincts learned from aggregate patterns (privacy-preserving)
7. File Impact Table
New Files
| File | Feature | Purpose |
|---|---|---|
| orchestrator/core/models/instincts.py | A.1 | Instinct SQLAlchemy model |
| orchestrator/modules/learning/patterns/instinct_observer.py | A.2 | Observation pipeline — extracts instincts from executions |
| orchestrator/modules/learning/feedback/confidence.py | A.3 | Confidence scoring + decay engine |
| orchestrator/modules/learning/patterns/instinct_promoter.py | A.4 | Auto-promotion logic |
| orchestrator/consumers/chatbot/context_modes.py | B.4 | Context mode definitions + switching |
| orchestrator/core/services/cost_governor.py | C.2 | Budget tracking + enforcement |
| orchestrator/modules/orchestrator/stages/eval_harness.py | C.3 | pass@k evaluation harness |
| orchestrator/alembic/versions/xxx_add_instincts_table.py | A.1 | Database migration |
| orchestrator/alembic/versions/xxx_add_workspace_model_prefs.py | C.1 | Database migration |
Modified Files
| File | Features | Change |
|---|---|---|
| orchestrator/modules/learning/engine/core.py | A.2, A.6 | Wire InstinctObserver into update loop; compute all 8 eval fields |
| orchestrator/modules/orchestrator/pipeline.py | B.1, B.2, B.3 | Add HandoffDocument, StageVerdict, enriched SSE events |
| orchestrator/core/llm/manager.py | C.1 | Add category parameter to model selection |
| orchestrator/consumers/chatbot/service.py | C.1, B.4 | Route complexity tier to model category; integrate context modes |
| orchestrator/modules/orchestrator/stages/context_engineering.py | D.1 | Iterative retrieval loop with refinement |
| orchestrator/core/context_guard.py | D.2 | Phase-aware compaction strategy |
| orchestrator/api/learning.py | A.5 | Add /evolve endpoint |
| orchestrator/core/models/workspaces.py | C.2 | Add budget fields to workspace model |
everything-claude-code Cross-References
| Feature area | Source | Adaptation |
|---|---|---|
| A.1-A.4 (Instincts) | skills/continuous-learning-v2/SKILL.md | File-based → PostgreSQL; single-user → multi-tenant with workspace scoping |
| A.5 (Evolve) | commands/evolve.md | CLI command → REST API endpoint; local clustering → LLM-based semantic grouping |
| A.6 (Eval fields) | commands/skill-create.md | Skill quality metrics → agent performance metrics mapped to existing DB columns |
| B.1-B.2 (Handoffs/Verdicts) | agents/planner.md, agents/code-reviewer.md | Agent-specific handoff → generic pipeline protocol; text verdicts → enum-based system |
| B.4 (Context modes) | contexts/dev.md, contexts/review.md, contexts/research.md | Static context files → dynamic mode switching with tool priority adjustment |
| C.1-C.2 (Cost routing) | rules/common/performance.md | Manual model selection guidance → automated complexity-based routing with budget enforcement |
| C.3 (Eval harness) | skills/eval-harness/SKILL.md | Local eval script → pipeline-integrated stage with quality_assessor scoring |
| D.1 (Iterative retrieval) | skills/iterative-retrieval/SKILL.md | File-based retrieval → RAG pipeline integration with embedding-based relevance scoring |
| D.2 (Strategic compaction) | skills/strategic-compact/SKILL.md | Token-based → phase-boundary-based; conversation context → pipeline HandoffDocument chain |
8. API Surface
New Endpoints
Modified Endpoints
9. Database Changes
New Tables
instincts — See A.1 for full schema. Indexed on (workspace_id, status) and (workspace_id, trigger_type, confidence DESC).
workspace_model_preferences — See C.1 for schema. One row per workspace per category.
Modified Tables
workspaces — Add columns:
budget_daily_limit_usd (Float, nullable, default null = unlimited)
budget_monthly_limit_usd (Float, nullable, default null = unlimited)
budget_alert_threshold_pct (Float, default 0.8)
budget_overage_action (String, default "alert")
10. Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Instinct observation adds latency to execution path | Medium | Medium | Observer runs async after execution completes — zero impact on user-facing latency |
| Bad instincts get promoted and degrade routing | Low | High | Promotion requires ≥10 observations + ≥0.8 confidence. Promoted rules are monitored and auto-demoted on failure. |
| Iterative retrieval adds 1-1.5s to context engineering | High | Low | Only activated for CELL+ complexity. ATOM/MOLECULE use single-pass (no regression for simple queries). |
| Budget enforcement blocks critical workflows | Low | High | Default overage action is "alert" not "block". Admin can always override. Budget applies per-workspace, not per-request. |
| Phase-aware compaction is more complex than token-based | Medium | Medium | Implement incrementally: Phase 1 keeps current 80% threshold, Phase 2 adds phase boundary detection. Fallback to token-based if boundary detection fails. |
| Eval harness (pass@k) multiplies cost by k | Medium | Medium | Default k=1 (no overhead). Only enabled explicitly per workspace. Budget enforcement (C.2) caps total spend regardless. |
11. Success Metrics
Learning (Feature Area A)
Instinct creation rate: ≥ 5 new instincts per 100 executions within first week
Promotion rate: ≥ 10% of instincts reach promotion threshold within 30 days
Routing accuracy improvement: Promoted instincts should show ≥ 90% success rate post-promotion
Agent eval field population: All 8 fields populated for every agent with ≥ 5 executions
Pipeline (Feature Area B)
Handoff utilization: ≥ 80% of multi-stage workflows use structured HandoffDocuments
SHIP verdict frequency: ≥ 15% of pipelines short-circuit via SHIP (indicating efficiency)
NEEDS_WORK resolution rate: ≥ 70% of NEEDS_WORK verdicts resolve within max retries
Cost (Feature Area C)
Model cost reduction: ≥ 30% reduction in average cost per ATOM query (economy models vs. current)
Budget adherence: 0 uncontrolled spend events after budget enforcement is active
Eval quality: pass@3 shows ≥ 20% improvement over pass@1 for ORGAN+ tasks
Context (Feature Area D)
Retrieval quality: Iterative retrieval achieves ≥ 0.7 relevance on ≥ 85% of CELL+ queries (vs. current ~60% estimated)
Compaction safety: 0 instances of mid-reasoning compaction after D.2 is deployed
Token efficiency: ≥ 15% reduction in wasted context tokens through phase-aware allocation