PRD-58: System Prompt Management & FutureAGI Evaluation Integration
Version: 3.0
Status: In Progress
Date: February 19, 2026 (updated from v2.0 Feb 18)
Author: Claude Code + Gerard
Prerequisites: PRD-29 (FutureAGI Observability), PRD-55 (Autonomous Assistant / Soul Designer)
Branch: futureAGI (worktree: automatos-ai-futureAGI)
FutureAGI Access: 6-month free trial (expires ~August 2026)
Changelog
| Version | Date | Changes |
|---|---|---|
| 1.0 | Feb 17 | Initial draft — SDK-based approach |
| 2.0 | Feb 18 | Complete rewrite of Phase 1B. Dropped SDK (crashes Docker). Direct HTTP API. Live traffic eval with toggle. Intelligent self-healing vision. |
| 3.0 | Feb 19 | Phase 1B marked complete. Full detailed specs for Phase 1C (Self-Healing) and Phase 2 (User-Facing Features). Architecture evolved to worker service pattern. |
Executive Summary
Automatos has 17 system prompts hardcoded across the orchestrator that control every decision the platform makes — routing, task decomposition, agent selection, quality assessment, SQL generation, memory injection, personality, and more. Today these prompts:
Live in Python source code, scattered across 17 files
Can only be changed by a developer deploying code
Have zero quality data — they were tuned by feel, never evaluated
Cannot be A/B tested, versioned, or rolled back without git
This PRD introduces three capabilities:
Admin Prompt Management System (Phase 1A) — Move all 17 system prompts into the database with a full management UI, version history, and instant rollback. No code deploys to change how Automatos thinks. STATUS: SHIPPED
FutureAGI Live Traffic Evaluation (Phase 1B) — Per-prompt toggle that sends real chat input/output pairs to FutureAGI for quality scoring in the background. Scores accumulate over time as a live quality dashboard. Safety scanning on prompt text. STATUS: COMPLETE
Intelligent Self-Healing Prompt System (Phase 1C) — The orchestrator monitors rolling quality scores per prompt. When it detects degradation (score drops, error rate spikes), it automatically enables eval, runs optimization, and can swap in improved prompt versions. A self-healing AI. STATUS: PLANNED
Architecture Decision: Worker Service Pattern
The FutureAGI Python SDK (futureagi==0.6.0) was evaluated and rejected (v2.0). The architecture then evolved further:
v1.0: SDK in orchestrator → crashed Docker, broken APIs
v2.0: Direct HTTP in orchestrator → worked but polluted orchestrator deps
v3.0 (current): Isolated agent-opt-worker service handles all FutureAGI calls. Orchestrator calls worker via internal HTTP. Zero FutureAGI deps in orchestrator.
Decision: Orchestrator → Worker service (AGENT_OPT_WORKER_URL) via httpx. Worker owns FutureAGI API keys and SDK concerns. Orchestrator owns DB operations and dispatch logic.
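The orchestrator's side of this dispatch can be sketched as follows. This is illustrative only — the real implementation lives in `core/services/futureagi_service.py` and uses httpx; the payload field names here are assumptions, not the worker's actual contract, and stdlib `urllib` stands in for httpx to keep the sketch dependency-free:

```python
import json
import os
import urllib.request

# AGENT_OPT_WORKER_URL is the Railway-internal worker address; the
# default host below is a placeholder, not the real service hostname.
WORKER_URL = os.environ.get("AGENT_OPT_WORKER_URL", "http://agent-opt-worker.internal")

def build_score_payload(prompt_id: str, version_id: str,
                        user_input: str, llm_output: str) -> dict:
    """Shape of a live-traffic scoring request the orchestrator might send.
    Metric names match the Phase 1B assessment set."""
    return {
        "prompt_id": prompt_id,
        "version_id": version_id,
        "input": user_input,
        "output": llm_output,
        "templates": ["completeness", "is_helpful", "is_concise"],
    }

def post_to_worker(path: str, payload: dict, timeout: float = 90.0) -> dict:
    """POST JSON to the worker; the worker owns all FutureAGI credentials,
    so no API keys ever touch the orchestrator."""
    req = urllib.request.Request(
        f"{WORKER_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

The key design point: the orchestrator only knows worker paths (`/assess`, `/safety`, `/optimize`, `/score`), never FutureAGI's own API surface.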
Current Implementation Status
Phase 1A: Admin Prompt Management — SHIPPED ✅
Everything in the original Phase 1A is deployed and working:
| Component | Status | Commit |
|---|---|---|
| DB tables (system_prompts, versions, eval_runs) | ✅ Deployed | a02a002 |
| Seed script (15 prompts across 4 categories) | ✅ Deployed | a02a002 |
| PromptRegistry service with caching + fallback | ✅ Deployed | Pre-existing |
| Admin API endpoints (CRUD + versions + rollback) | ✅ Deployed | 5157daa |
| Admin UI: SystemPromptsTab in Settings | ✅ Deployed | b80b1d7, d4eac9a |
| Auth: Clerk/API-key users can manage prompts | ✅ Deployed | 5157daa |
| Startup: create_tables + seed on Railway boot | ✅ Deployed | a02a002 |
Phase 1B: FutureAGI Integration — COMPLETE ✅
| Component | Status | Notes / Commit |
|---|---|---|
| Direct HTTP service (no SDK) | ✅ Deployed | Routes through agent-opt-worker service |
| Per-template config (keys + models) | ✅ Deployed | 6b8936d |
| Concurrent execution (asyncio.gather) | ✅ Deployed | 6b8936d |
| Safety scanning (4 checks) | ✅ Working | toxicity, bias, injection, moderation |
| Assessment (3 metrics) | ✅ Working | completeness, is_helpful, is_concise via worker |
| Optimize | ✅ Working | Uses worker /optimize with dataset from live traffic |
| Frontend: per-run-type rendering | ✅ Deployed | d4eac9a |
| Frontend: polling + loading states | ✅ Deployed | 60a46e2 |
| Live traffic eval toggle | ✅ Deployed | Per-prompt toggle in assessments tab |
| Hook into chat pipeline | ✅ Deployed | Fire-and-forget at 3 hook points in service.py |
| Idempotent column migration | ✅ Deployed | ALTER TABLE ADD COLUMN IF NOT EXISTS in main.py |
The 17 System Prompts
(Unchanged from v1.0 — see original section)
Every prompt that drives Automatos decision-making:
Core Orchestration (the "brain")

| # | Prompt | File | Controls | Frequency |
|---|---|---|---|---|
| 1 | Personality/Soul | consumers/chatbot/personality.py | Tone, style, warmth of every response | Every message |
| 2 | Default System Prompt | consumers/chatbot/prompt_analyzer.py | Fallback identity when no personality set | Every message |
| 3 | Routing Classifier | core/routing/engine.py | Which agent handles each user message | Every message |
| 4 | Intent Classifier Patterns | consumers/chatbot/intent_classifier.py | Simple vs complex, tool needs, memory needs | Every message |
| 5 | Tool Ranking Logic | consumers/chatbot/prompt_analyzer.py | Which tools get suggested to the LLM | Every tool call |

Multi-Agent Orchestration (complex tasks)

| # | Prompt | File | Controls | Frequency |
|---|---|---|---|---|
| 6 | Task Decomposer | modules/orchestrator/stages/task_decomposer.py | How complex tasks break into subtasks | Every multi-step task |
| 7 | Complexity Analyzer | modules/orchestrator/stages/complexity_analyzer.py | Simple vs moderate vs complex classification | Every task |
| 8 | Agent Selector | modules/orchestrator/llm/llm_agent_selector.py | Which agent runs each subtask | Every subtask |
| 9 | Strategy Planner | modules/orchestrator/llm/master_orchestrator.py | Speed/quality/cost tradeoff strategy | Every orchestrated workflow |
| 10 | Quality Assessor | modules/orchestrator/stages/quality_assessor.py | Whether outputs meet quality threshold | Every output |
| 11 | Context Optimizer | modules/search/optimization/context_optimizer.py | What context gets injected, how much, what format | Every context-enriched request |

Domain-Specific

| # | Prompt | File | Controls | Frequency |
|---|---|---|---|---|
| 12 | NL2SQL Generator | modules/nl2sql/query/nl2sql_service.py | Natural language → SQL translation | Every data query |
| 13 | Memory Injection Template | modules/memory/operations/prompt_injection.py | How memories get formatted into context | Every memory-aware request |
| 14 | Agent Factory Builder | modules/agents/factory/agent_factory.py | Dynamic agent system prompt assembly | Every agent execution |
| 15 | Execution Manager | modules/agents/execution/execution_manager.py | Default professional execution prompt | Every agent run |

Personas & Templates

| # | Prompt | File | Controls | Frequency |
|---|---|---|---|---|
| 16 | Persona Presets (x4) | core/seeds/seed_personas.py | Engineer, Sales, Marketing, Support personas | Agent creation |
| 17 | Recipe Learning | core/services/recipe_learning_service.py | How improvement suggestions are generated | Recipe optimization |
Phase 1A: Admin Prompt Management System — SHIPPED ✅
(Unchanged from v1.0 — fully deployed. See original PRD for data model, API endpoints, and UI specs.)
Key endpoints deployed:
Phase 1B: FutureAGI Live Traffic Evaluation — COMPLETE ✅
1B.1 The Problem with Synthetic Testing
The original PRD assumed we'd create test datasets per prompt and evaluate against those. In practice:
FutureAGI evaluates input/output pairs, not prompts in isolation
Sending fake output ("System prompt assessed successfully.") produces garbage scores
Creating realistic test datasets for 17 prompts is weeks of manual work
Synthetic tests don't reflect real-world usage patterns
1B.2 The Solution: Live Traffic Evaluation
Instead of synthetic tests, evaluate real conversations.
When a system prompt has FutureAGI evaluation enabled:
A user sends a chat message
The orchestrator selects the system prompt, sends it + user message to LLM
LLM responds
Fire-and-forget: send the real (user input, LLM output) to FutureAGI for scoring
Score is stored in system_prompt_eval_runs, linked to the prompt version
Scores accumulate over time → live quality dashboard
Zero impact on chat latency — the eval call is fully async in the background.
1B.3 Per-Prompt Eval Toggle
New field on SystemPrompt model:
Admin UI: Toggle switch per prompt in the detail view. When ON, every real chat interaction using that prompt gets scored.
API:
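The toggle's core logic is small. The real column lives on the SystemPrompt ORM model and the PATCH endpoint in `admin_prompts.py`; this dict-based version is an illustrative sketch, not the actual handler:

```python
def set_futureagi_eval(prompt: dict, enabled: bool) -> dict:
    """Core of the eval-toggle PATCH handler: flip the flag and return the
    updated record. The column defaults to False, so eval is opt-in per prompt."""
    updated = dict(prompt)
    updated["futureagi_eval_enabled"] = bool(enabled)
    return updated
```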
1B.4 Hook Point: Chat Pipeline
The hook goes into consumers/chatbot/service.py where SmartChatIntegration prepares the orchestrated request and the LLM response comes back.
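The fire-and-forget pattern at those hook points can be sketched like this (the worker call body is stubbed out — only the scheduling shape matters; function names are illustrative):

```python
import asyncio

async def _eval_live_traffic(prompt_id: str, user_input: str, llm_output: str) -> None:
    """Background eval: any failure is swallowed so chat is never affected."""
    try:
        await asyncio.sleep(0)  # the worker /score HTTP call would go here
    except Exception:
        pass  # best-effort: log in production, never raise into the chat path

def schedule_live_eval(prompt: dict, user_input: str, llm_output: str):
    """Called right after the LLM response; adds no latency to the reply.
    Returns the task (or None when eval is toggled off for this prompt)."""
    if not prompt.get("futureagi_eval_enabled"):
        return None
    return asyncio.create_task(_eval_live_traffic(prompt["id"], user_input, llm_output))
```

`asyncio.create_task()` detaches the eval from the request path, which is exactly why the toggle has zero impact on chat latency.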
1B.5 FutureAGI Service: Direct HTTP (No SDK)
1B.6 Live Traffic Eval Method
New method on FutureAGIService:
1B.7 Safety Scanning — Working ✅
Safety scanning runs on prompt text directly (no real I/O needed):
toxicity — protect model, checks output text
bias_detection — protect_flash model, checks output text
prompt_injection — protect model, checks input text
content_moderation — protect model, checks output text
All 4 checks run concurrently via asyncio.gather().
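The concurrent fan-out looks roughly like this (the per-check worker call is stubbed; the model/target mapping mirrors the list above):

```python
import asyncio

# Check name → (FutureAGI model, which text it inspects), per 1B.7.
SAFETY_CHECKS = {
    "toxicity": ("protect", "output"),
    "bias_detection": ("protect_flash", "output"),
    "prompt_injection": ("protect", "input"),
    "content_moderation": ("protect", "output"),
}

async def _run_check(name: str, model: str, text: str) -> dict:
    """Stub for the worker /safety call; returns a pass/fail shape."""
    await asyncio.sleep(0)  # real HTTP call goes here, with a 90s timeout
    return {"check": name, "model": model, "passed": True}

async def scan_prompt(prompt_text: str) -> list:
    """Run all four checks concurrently — one slow check (e.g. a
    bias_detection timeout) doesn't serialize the other three."""
    tasks = [_run_check(name, model, prompt_text)
             for name, (model, _target) in SAFETY_CHECKS.items()]
    return await asyncio.gather(*tasks)
```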
1B.8 Optimize — Deferred in v2.0, Resolved in v3.0
FutureAGI's improve-prompt API has gone async (returns job IDs, no inline results). Two options were considered at the time:
Poll FutureAGI for results (no polling endpoint found yet)
Use own LLMs to optimize based on accumulated assessment feedback
Decision (v2.0): Park optimize and focus on live traffic eval — the foundation everything else builds on. Since resolved: the worker /optimize endpoint collects a dataset from live traffic and runs FutureAGI optimize, as reflected in the Phase 1B status table above.
1B.9 Frontend: Assessment Dashboard
The Assessments tab per prompt shows:
Toggle: FutureAGI Eval ON/OFF switch
Live scores: Rolling quality metrics from real traffic
Assessment runs: Individual eval results with reasons
Safety results: Per-check pass/fail with detailed reasons
Auto-polling every 3s while runs are pending/running
1B.10 Implementation Steps (Phase 1B — now all shipped)

| # | Step | Status |
|---|---|---|
| 1 | Add futureagi_eval_enabled column to SystemPrompt | ✅ Done |
| 2 | Add PATCH /eval-toggle API endpoint | ✅ Done |
| 3 | Add toggle switch to frontend prompt detail view | ✅ Done |
| 4 | Add eval_live_traffic() method to FutureAGIService | ✅ Done |
| 5 | Hook into chat pipeline (service.py after LLM response) | ✅ Done |
| 6 | Store per-message eval results in eval_runs table | ✅ Done |
| 7 | Display accumulated scores in assessments dashboard | ✅ Done |
| 8 | Test: send chat messages with eval ON, verify scores appear | ✅ Done |
1B.11 Known Issues
| Issue | Impact | Mitigation |
|---|---|---|
| prompt_adherence template returns server-side error | Can't use this metric | Replaced with is_helpful in defaults |
| bias_detection can timeout (>90s) | Occasional missing safety check | 90s timeout, graceful degradation |
| FutureAGI optimize API returns async job IDs | Can't get optimized prompts inline | Deferred to Phase 1C |
| FutureAGI API has no rate limiting docs | Unknown throughput limits | Start with eval on 1-2 prompts, monitor |
Phase 1C: Intelligent Self-Healing Prompt System — PLANNED
Vision
The orchestrator becomes self-aware about prompt quality. Instead of an admin manually checking scores, the system:
Monitors rolling average quality scores per prompt version
Detects degradation — score drops below threshold, error rate spikes
Responds automatically:
Enables FutureAGI eval if not already on
Triggers optimization (using own LLMs to rewrite based on failure patterns)
Creates a new prompt version candidate
A/B tests the candidate against the current version
If candidate scores better → activates it
If not → discards and alerts admin
Reports what it did and why via audit trail
Architecture
1C.1 Data Model Changes
New table for health monitoring + audit trail:
New columns on existing models:
1C.2 Prompt Health Monitor Service
Uses APScheduler (already in codebase via HeartbeatService) to run every 5 minutes.
1C.3 LLM-as-Judge Optimization
When FutureAGI optimize is unavailable or as a primary strategy, use own LLMs:
1C.4 A/B Testing Infrastructure
The chat pipeline already selects the active prompt version. For A/B testing, we intercept that selection:
The eval_live_traffic hook already stores version_id on each SystemPromptEvalRun, so scores naturally accumulate per version. The Health Monitor compares scores between control and candidate versions to decide promotion.
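Version selection during a test reduces to a weighted coin flip. A sketch, with the split injectable for testing (the real selection happens where the active version is loaded):

```python
import random
from typing import Optional

def select_version(control_id: str, candidate_id: Optional[str],
                   traffic_split: float = 0.5,
                   rng: Optional[random.Random] = None) -> str:
    """Pick which prompt version serves this request during an A/B test.
    traffic_split is the fraction of traffic routed to the candidate;
    with no candidate running, the active (control) version always wins."""
    if candidate_id is None:
        return control_id
    roll = (rng or random).random()
    return candidate_id if roll < traffic_split else control_id
```

Because each request carries the chosen `version_id` into its eval run, no extra bookkeeping is needed to attribute scores.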
1C.5 A/B Test Resolution
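The resolution rule follows the 1C.6 triggers (candidate >10% better after a minimum of 50 evals per version, 7-day timeout). A sketch of the decision function, with thresholds as parameters since the exact values may be tuned:

```python
def resolve_ab_test(control_avg: float, candidate_avg: float,
                    control_n: int, candidate_n: int, days_running: float,
                    min_evals: int = 50, win_margin: float = 0.10,
                    max_days: float = 7.0) -> str:
    """Return 'promote', 'discard', or 'continue' for a running A/B test."""
    if min(control_n, candidate_n) >= min_evals:
        if candidate_avg >= control_avg * (1 + win_margin):
            return "promote"   # candidate clearly better → activate it
        if candidate_avg < control_avg:
            return "discard"   # candidate worse → keep control, alert admin
    if days_running > max_days:
        return "discard"       # no winner inside the window → time out
    return "continue"          # keep collecting evals
```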
1C.6 Health Monitor Triggers
| Trigger | Window | Action |
|---|---|---|
| Rolling quality score drops >15% from baseline | Over 20-eval window | Enable eval if off, trigger optimization |
| Rolling quality score drops >30% from baseline | Over 20-eval window | Critical alert to admin + fast-track optimization |
| Failure rate >10% | Over last 50 evals | Enable eval, trigger optimization |
| No eval data for >24h (stale) | Time-based | Auto-enable eval for heartbeat data |
| A/B candidate scores >10% better | After min 50 evals per version | Auto-promote candidate version |
| A/B test running >7 days with no winner | Time-based | Discard candidate, alert admin |
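The score-drop triggers can be sketched as a pure detection function over the rolling window (thresholds mirror the table above; snapshot persistence is omitted):

```python
def detect_degradation(scores: list, baseline: float, window: int = 20) -> str:
    """Classify prompt health from the last `window` eval scores.
    Returns 'healthy', 'degraded' (>15% drop), or 'critical' (>30% drop)."""
    recent = scores[-window:]
    if not recent or baseline <= 0:
        return "healthy"  # not enough data to judge against a baseline
    rolling = sum(recent) / len(recent)
    drop = (baseline - rolling) / baseline
    if drop > 0.30:
        return "critical"
    if drop > 0.15:
        return "degraded"
    return "healthy"
```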
1C.7 Admin API Endpoints
1C.8 Frontend: Prompt Health Dashboard
New "Health" sub-tab in the prompt detail view (alongside existing Editor, Versions, Assessments):
Health Overview Card:
Health status badge (healthy / degraded / critical / recovering)
Composite score gauge (0-100) with baseline indicator
Trend sparkline (last 24h of health snapshots)
Rolling Metrics Chart (Recharts — already in frontend deps):
Line chart showing completeness, helpfulness, conciseness over time
Baseline reference line
Degradation threshold markers
A/B Test Panel (when active):
Side-by-side score comparison: Control vs Candidate
Eval count per version
Projected winner based on current trend
Manual override buttons: "Promote Now" / "Discard"
Diff view of control vs candidate prompt text
Audit Trail:
Chronological list of auto-healing actions
Each entry: timestamp, action taken, reason, outcome
Link to relevant prompt version
Global Health Dashboard (new section in Settings → System Prompts):
Table of all prompts with health status, composite score, trend arrow
Filter by status (healthy/degraded/critical)
Sort by composite score or degradation delta
1C.9 Notifications
When the system takes auto-healing actions, notify admins:
| Event | Channel | Example |
|---|---|---|
| Prompt degraded | In-app toast + audit log | "Personality prompt quality dropped 18%. Optimization triggered." |
| Optimization complete | In-app toast + audit log | "New candidate version v4 generated for Routing Classifier." |
| A/B test started | Audit log | "A/B test started: v3 (control) vs v4 (candidate), 50/50 split." |
| A/B test resolved | In-app toast + audit log | "Candidate v4 promoted — scored 14% better than v3." |
| Critical degradation | In-app toast + audit log | "CRITICAL: Task Decomposer quality dropped 35%. Manual review recommended." |
1C.10 Startup Integration
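Startup wiring would likely mirror HeartbeatService's AsyncIOScheduler usage. A hedged sketch — the scheduler object, job id, and `monitor_cycle` callable are assumptions for illustration:

```python
def register_health_monitor(scheduler, monitor_cycle, interval_minutes: int = 5):
    """Register the Phase 1C monitor as a recurring job on an APScheduler-style
    scheduler (e.g. AsyncIOScheduler), every 5 minutes as the spec requires."""
    return scheduler.add_job(
        monitor_cycle, "interval",
        minutes=interval_minutes,
        id="prompt_health_monitor",
        replace_existing=True,  # idempotent across app restarts
    )
```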
1C.11 Implementation Steps
| # | Step | File | Notes |
|---|---|---|---|
| 1 | Add PromptHealthSnapshot and PromptABTest models | core/models/system_prompts.py | New tables + Pydantic schemas |
| 2 | Add health_status, baseline_score, ab_test_id to SystemPrompt | core/models/system_prompts.py | Idempotent migration in main.py |
| 3 | Add is_candidate, generated_by, generation_context to SystemPromptVersion | core/models/system_prompts.py | Idempotent migration in main.py |
| 4 | Create PromptHealthMonitor service | core/services/prompt_health_monitor.py | APScheduler, rolling averages, degradation detection |
| 5 | Add LLM-as-judge optimization logic | core/services/prompt_health_monitor.py | Uses existing LLM client + Claude Sonnet |
| 6 | Add A/B test traffic splitting to PromptRegistry | core/services/prompt_registry.py or smart_orchestrator.py | Random split based on traffic_split |
| 7 | Add A/B test resolution logic | core/services/prompt_health_monitor.py | Check running tests on each monitor cycle |
| 8 | Add health API endpoints | api/admin_prompts.py | Health overview, history, A/B test, audit |
| 9 | Update eval_live_traffic() to tag version_id from A/B test | core/services/futureagi_service.py | Version already tracked, just verify A/B path |
| 10 | Frontend: Health sub-tab with metrics chart | SystemPromptsTab.tsx | Recharts line chart + health status badge |
| 11 | Frontend: A/B test panel | SystemPromptsTab.tsx | Side-by-side scores, promote/discard buttons |
| 12 | Frontend: Global health dashboard | SystemPromptsTab.tsx | Summary table of all prompt health statuses |
| 13 | Frontend: Audit trail list | SystemPromptsTab.tsx | Chronological list of auto-actions |
| 14 | Register health monitor on app startup | main.py | APScheduler job, 5-min interval |
| 15 | Test: degrade a prompt, verify auto-optimization fires | Manual | Temporarily worsen a prompt, observe healing |
| 16 | Test: A/B test lifecycle end-to-end | Manual | Candidate generated → traffic split → promoted/discarded |
1C.12 LLM-as-Judge Fallback Strategy
FutureAGI trial expires ~August 2026. The self-healing system must survive without it:
| Capability | Current (worker) | Fallback (LLM-as-judge) |
|---|---|---|
| Live traffic scoring | Worker /score endpoint | Claude Haiku judges (input, output) against rubric |
| Safety scanning | Worker /safety endpoint | Claude scans for toxicity/injection using system prompt |
| Optimization | Worker /optimize endpoint | Claude Sonnet rewrites based on failure patterns (already built in 1C) |
| Quality metrics | FutureAGI templates (completeness, etc.) | Custom rubrics scored 0-1 by Claude |
Fallback activation: When FutureAGI worker returns errors for >1 hour, auto-switch to LLM-as-judge mode. Log the switch. Admin can manually toggle in settings.
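The >1 hour error window can be tracked with a small state object. A sketch with timestamps injected for testability (the real implementation would use wall-clock time and emit the required log on switch):

```python
from typing import Optional

class FallbackSwitch:
    """Flips to LLM-as-judge mode once the worker has been failing
    continuously for longer than `window_s` (default one hour)."""

    def __init__(self, window_s: float = 3600.0):
        self.window_s = window_s
        self.failing_since: Optional[float] = None  # start of current error streak

    def record(self, ok: bool, now: float) -> str:
        """Feed each worker call result; returns the mode to use."""
        if ok:
            self.failing_since = None  # streak broken: stay on FutureAGI
        elif self.failing_since is None:
            self.failing_since = now   # first error of a new streak
        if self.failing_since is not None and now - self.failing_since > self.window_s:
            return "llm_as_judge"
        return "futureagi"
```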
Phase 2: User-Facing Features — PLANNED
Prerequisite: Phase 1B complete (eval data flowing), Phase 1C desirable but not blocking.
Phase 2 brings prompt quality data out of the admin settings and into user-facing surfaces where it drives real product value.
2A. Model Comparison Modal
Goal: Wire the existing marketplace Compare button to a real comparison view backed by FutureAGI quality scores.
Current State: frontend/components/marketplace/marketplace-llms-tab.tsx and llm-model-detail-modal.tsx exist. Compare button present but wired to static/heuristic data.
2A.1 How It Works
Admin selects 2-3 models to compare in the LLM Marketplace
System runs the same set of test prompts through each model
FutureAGI scores each model's output on completeness, helpfulness, conciseness
Results displayed side-by-side in a comparison modal
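The aggregation behind the side-by-side view reduces to averaging per-metric scores into a composite per model. A sketch (metric names follow Phase 1B; the result shape is an assumption):

```python
def composite_scores(results: dict) -> dict:
    """results: model_id → list of {metric: score} dicts, one per test prompt.
    Returns model_id → mean of all metric scores (a 0-1 composite)."""
    out = {}
    for model_id, runs in results.items():
        values = [v for run in runs for v in run.values()]
        out[model_id] = sum(values) / len(values) if values else 0.0
    return out

def pick_winner(results: dict) -> str:
    """Recommended model = highest composite score across the test set."""
    scores = composite_scores(results)
    return max(scores, key=scores.get)
```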
2A.2 Backend
Test Prompt Defaults (used when admin doesn't provide custom):
2A.3 Frontend
Comparison Modal (extends llm-model-detail-modal.tsx):
Side-by-side columns, one per model
Radar chart (Recharts) showing metric scores overlaid
Bar chart comparing composite scores
Expandable rows showing individual prompt results
Latency and cost comparison row
"Select Winner" button that highlights recommended model
2A.4 Implementation Steps
| # | Step | File |
|---|---|---|
| 1 | Create ModelComparisonRequest/Result schemas | core/models/marketplace.py (or new file) |
| 2 | Create /api/marketplace/models/compare endpoint | api/model_comparison.py |
| 3 | Background worker: run prompts through each model, score with FutureAGI | core/services/model_comparison_service.py |
| 4 | Store comparison results (new table or use eval_runs with run_type="comparison") | core/models/ |
| 5 | Frontend: Comparison modal with radar + bar charts | marketplace/model-comparison-modal.tsx |
| 6 | Wire Compare button in marketplace-llms-tab.tsx | marketplace/marketplace-llms-tab.tsx |
2B. Enhanced Recipe Suggestions
Goal: Replace heuristic quality_score in recipe learning with real FutureAGI eval scores after recipe execution.
Current State: recipe_learning_service.py, recipe_quality_service.py, and recipe_memory_service.py exist. Quality scores are computed via heuristics in orchestrator/modules/orchestrator/tracker.py (~line 238: quality_score: float). Recipes tab in frontend at frontend/components/workflows/recipes-tab.tsx.
2B.1 How It Works
User executes a recipe (workflow)
Orchestrator runs the recipe, produces output
New: Fire-and-forget FutureAGI eval on recipe input/output (same as live chat eval)
Score stored alongside recipe execution result
Recipe suggestions panel shows actual quality scores instead of heuristic
Recipe learning uses real scores to rank improvement suggestions
2B.2 Backend Changes
2B.3 Frontend Changes
recipes-tab.tsx: Show actual quality scores with color coding (green >0.8, yellow 0.6-0.8, red <0.6)
recipe-suggestions-panel.tsx: Rank suggestions by real quality delta, not heuristic
recipe-preview-panel.tsx: Show per-execution quality trend chart
view-recipe-modal.tsx: Quality score breakdown in execution history
2B.4 Implementation Steps
| # | Step | File |
|---|---|---|
| 1 | Add FutureAGI scoring hook after recipe execution | core/services/recipe_quality_service.py |
| 2 | Store real scores in recipe_executions table | core/models/ (add eval_scores JSONB column) |
| 3 | Update recipe_learning_service to use real scores | core/services/recipe_learning_service.py |
| 4 | Frontend: Quality score badges on recipe cards | workflows/recipes-tab.tsx |
| 5 | Frontend: Quality trend chart in recipe detail | workflows/recipe-preview-panel.tsx |
| 6 | Frontend: Score-ranked suggestion panel | workflows/recipe-suggestions-panel.tsx |
2C. Agent "Test My Prompt" Button
Goal: In the agent creation wizard, let users test their system prompt with generated scenarios and get quality scores before deploying.
Current State: frontend/components/agents/create-agent-modal.tsx has the agent creation flow. Backend agent factory at modules/agents/factory/agent_factory.py.
2C.1 How It Works
User writes a system prompt in agent creation modal
Clicks "Test My Prompt" button
System generates 3-5 test scenarios relevant to the prompt's purpose
Runs each scenario through the prompt + selected LLM
Scores each response with FutureAGI
Displays results inline: pass/fail per scenario, overall quality score
User can iterate on prompt, re-test, then save
2C.2 Backend
Scenario Generation (uses LLM):
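The generation prompt itself is not specified here; a hypothetical instruction builder showing the intended shape (wording and count are illustrative assumptions):

```python
def build_scenario_prompt(system_prompt: str, n: int = 5) -> str:
    """Hypothetical instruction for the scenario-generator LLM: produce n
    realistic user messages that exercise the given agent system prompt."""
    return (
        "You are generating test scenarios for an AI agent.\n"
        f"Agent system prompt:\n---\n{system_prompt}\n---\n"
        f"Write {n} realistic, diverse user messages that test this agent's "
        "stated purpose, including at least one edge case. "
        "Return one message per line, no numbering."
    )
```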
2C.3 Frontend Changes
Add to create-agent-modal.tsx:
"Test My Prompt" button below the system prompt textarea
Loading state with progress (testing scenario 2/5...)
Results panel showing:
Per-scenario: input, response, score badges, pass/fail
Overall score with gauge visualization
Specific recommendations for improvement
"Re-test" button after edits
Quality gate: warning if overall score < 0.7 when saving
2C.4 Implementation Steps
| # | Step | File |
|---|---|---|
| 1 | Create test-prompt endpoint with scenario generation | api/prompt_testing.py |
| 2 | LLM scenario generator | core/services/prompt_testing_service.py |
| 3 | Run scenarios through model + FutureAGI scoring | core/services/prompt_testing_service.py |
| 4 | Frontend: Test button + results panel in create-agent-modal | agents/create-agent-modal.tsx |
| 5 | Frontend: Quality gate warning on save | agents/create-agent-modal.tsx |
| 6 | Cache test results so re-opening modal shows last test | api/prompt_testing.py |
2D. Quality-Aware Model Recommendations
Goal: Replace cost-only model recommendations in analytics with recommendations backed by real eval data.
Current State: frontend/components/analytics/analytics-page.tsx and related analytics components exist. api/benchmarking.py has quality_score fields (heuristic). The orchestrator tracker records quality metrics per execution.
2D.1 How It Works
Live traffic eval (Phase 1B) accumulates quality scores per model used
Analytics dashboard aggregates: model × quality × cost × latency
Recommendations engine: "Switch Routing Classifier from GPT-4o to Claude Sonnet — 12% better quality at 40% lower cost"
Confidence intervals based on eval volume
2D.2 Backend
Recommendation Logic:
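The comparison at the heart of the engine can be sketched as follows. The 5% minimum quality gain is an illustrative threshold, and the input shape is an assumption; the message format mirrors the 2D.1 example:

```python
def recommend_switch(current: dict, alternative: dict,
                     min_quality_gain: float = 0.05) -> str:
    """current/alternative: {'model': str, 'quality': 0-1 avg score, 'cost': $}.
    Returns a human-readable recommendation, or '' when there's no clear win."""
    q_delta = (alternative["quality"] - current["quality"]) / current["quality"]
    c_delta = (current["cost"] - alternative["cost"]) / current["cost"]
    if q_delta >= min_quality_gain and c_delta > 0:
        return (f"Switch from {current['model']} to {alternative['model']} — "
                f"{q_delta:.0%} better quality at {c_delta:.0%} lower cost")
    return ""
```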
2D.3 Frontend Changes
New section in Analytics page (analytics-page.tsx):
"Model Optimization Recommendations" card
Table: prompt name, current model, recommended model, quality delta, cost delta
Confidence badge per recommendation
"Apply" button that updates the prompt's model config
Historical view: quality scores per model over time (line chart)
Enhance existing analytics:
analytics-overview.tsx: Add "Prompt Quality" section with aggregate scores
analytics-costs.tsx: Overlay quality scores on cost charts to show cost/quality tradeoff
analytics-llm-usage.tsx: Quality column in model usage breakdown table
2D.4 Prerequisites
This feature requires tracking which model produced each eval run. Needs a small change to eval_live_traffic():
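The change itself is small: attach the serving model and latency to each run's stored metadata. A sketch — the field names are assumptions, not the actual eval_runs schema:

```python
def tag_run_metadata(metadata: dict, model_id: str, latency_ms: float) -> dict:
    """Annotate a live eval run so Phase 2D can slice quality by model."""
    return {**metadata, "model_id": model_id, "latency_ms": latency_ms}
```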
2D.5 Implementation Steps
| # | Step | File |
|---|---|---|
| 1 | Tag model + latency in live eval run metadata | core/services/futureagi_service.py |
| 2 | Create model recommendations endpoint | api/model_recommendations.py |
| 3 | Recommendation engine: aggregate scores by model, compare | core/services/model_recommendation_service.py |
| 4 | Frontend: Recommendations card in analytics | analytics/analytics-page.tsx |
| 5 | Frontend: Quality column in LLM usage table | analytics/analytics-llm-usage.tsx |
| 6 | Frontend: Cost/quality tradeoff chart | analytics/analytics-costs.tsx |
Phase 2 Priority Order
| Feature | Value | Effort | Dependencies | Order |
|---|---|---|---|---|
| 2C — Test My Prompt | High (user-facing quality) | Medium | 1B only | 1st — immediate user value |
| 2B — Recipe Quality | High (real scores > heuristics) | Medium | 1B only | 2nd — replaces hack with real data |
| 2D — Model Recommendations | High (cost savings) | Medium-High | 1B + model tagging | 3rd — needs eval data volume |
| 2A — Model Comparison | Medium (nice-to-have) | High | 1B + multi-model infra | 4th — most infrastructure work |
Environment Variables
FutureAGI API Reference (Direct HTTP)
Reverse-engineered from SDK source + API testing on Feb 18, 2026.
Evaluation Endpoint
Available Templates
Improve Prompt (ASYNC — returns job ID)
Models
Quality assessment: turing_large, turing_small, turing_flash
Safety scanning: protect, protect_flash
Security & Privacy
All /api/admin/prompts/* endpoints require authenticated Clerk/API-key users
FutureAGI API keys live on the worker service only, not in orchestrator
Live traffic eval sends user messages to worker → FutureAGI API — confirm acceptable under data policy
Eval toggle is per-prompt, admin-controlled — not automatically enabled
Phase 1C auto-optimization creates full audit trail of every decision (PromptHealthSnapshot + PromptABTest)
Phase 1C auto-generated prompts always start as "draft" candidates, never go live without A/B validation
Phase 2C "Test My Prompt" generates synthetic scenarios, does NOT use real user data
Safety scanning runs on prompt text only, not user content
Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| FutureAGI trial expires (Aug 2026) | Lose eval capability | LLM-as-judge fallback designed in Phase 1C.12. Auto-switches on worker failure. |
| FutureAGI API rate limits unknown | Eval calls may be throttled | Start with 1-2 prompts enabled, monitor. Sampling strategy in Phase 1B. |
| Live eval sends user data to FutureAGI | Privacy concern | Admin-controlled toggle. Can be disabled entirely. Worker isolates data flow. |
| FutureAGI API instability | 400 errors, timeouts | Worker handles retries. Graceful degradation — never blocks chat. |
| Worker service unavailable | All eval stops | Orchestrator checks is_available. Chat unaffected. Health monitor logs gaps. |
| Auto-optimizer makes things worse (1C) | Quality degrades further | A/B testing with min 50 evals before promotion. Auto-rollback if candidate worse. Admin override available. |
| A/B test introduces inconsistency | Users get different quality | 50/50 split means short exposure. Max 7-day test window with auto-timeout. |
| Health monitor false positives | Unnecessary optimizations | Conservative thresholds (15% degradation). Min 10 evals for baseline. Audit trail for review. |
| Phase 2C token costs | User-facing cost for prompt testing | 5 scenarios × model cost per test. Consider warning or limiting daily tests. |
Open Questions (Updated)
Dataset creation → Resolved: Live traffic eval replaces synthetic datasets
Eval frequency → Resolved: Every message when toggle is ON
Rate limiting: What's FutureAGI's API rate limit? Need to test before enabling on high-traffic prompts
Sampling: For high-traffic prompts, should we eval every message or sample (e.g., 1 in 10)?
Privacy: User messages sent to FutureAGI — need to confirm data handling policy
Phase 1C triggers → Resolved: Detailed thresholds defined in 1C.6 — start conservative
FutureAGI fallback → Resolved: LLM-as-judge fallback designed in 1C.12
A/B test traffic split: Should split be configurable per test or fixed at 50/50? (Spec says configurable via traffic_split field)
Multi-model eval tagging: Phase 2D needs model ID tagged on each eval run — requires chat pipeline to pass model info through to eval hook
Phase 2C cost: Test My Prompt runs real LLM calls (5 scenarios × model cost) — should we warn user about token usage?
Fresh Context Quick Start
If starting a new Claude Code session, read this section first.
Branch & Worktree
Railway deploys automatos-ai-api and automotas-ai-frontend from this branch.
Key File Map
| File | Purpose | Notes |
|---|---|---|
| orchestrator/core/models/system_prompts.py | ORM models: SystemPrompt, SystemPromptVersion, SystemPromptEvalRun + Pydantic schemas | Phase 1C adds PromptHealthSnapshot, PromptABTest here |
| orchestrator/api/admin_prompts.py | FastAPI endpoints for prompt CRUD + assessment + eval-toggle | Phase 1C adds health/ab-test endpoints |
| orchestrator/core/services/futureagi_service.py | Routes all FutureAGI ops to worker service. Has eval_live_traffic(), assess_prompt(), optimize_prompt() | Singleton at futureagi_service |
| orchestrator/services/heartbeat_service.py | APScheduler-based periodic task runner | Pattern to follow for Phase 1C health monitor |
| orchestrator/core/metadata_cache/scheduler.py | schedule lib-based daily sync | Alternative scheduling pattern |
| orchestrator/core/seeds/seed_system_prompts.py | Seeds 15 prompts on startup | |
| orchestrator/core/database/database.py | SessionLocal, get_db, create_tables | Use SessionLocal() in background tasks |
| orchestrator/consumers/chatbot/service.py | Chat pipeline with 3 eval hook points (lines 746, 1226, 1499) | Fire-and-forget via asyncio.create_task() |
| orchestrator/consumers/chatbot/integration.py | SmartChatIntegration + OrchestratedRequest dataclass | Carries system_prompt field |
| orchestrator/consumers/chatbot/smart_orchestrator.py | Builds system prompt via get_happy_system_prompt() | Phase 1C A/B test version selection goes here |
| orchestrator/main.py | FastAPI app — imports routers, runs startup | Phase 1C adds health monitor startup |
| frontend/components/settings/SystemPromptsTab.tsx | Full prompt management UI — list, editor, versions, assessments, eval toggle | Phase 1C adds Health sub-tab |
| frontend/lib/api-client.ts | API client class | GOTCHA: Use apiClient.request() NOT apiClient() |
| frontend/components/agents/create-agent-modal.tsx | Agent creation wizard | Phase 2C adds "Test My Prompt" button |
| frontend/components/marketplace/marketplace-llms-tab.tsx | LLM model marketplace | Phase 2A adds comparison modal |
| frontend/components/workflows/recipes-tab.tsx | Recipe management | Phase 2B adds real quality scores |
| frontend/components/analytics/analytics-page.tsx | Analytics dashboard | Phase 2D adds model recommendations |
Critical Gotchas (learned the hard way)
Wrong directory: Work in automatos-ai-futureAGI, NOT automatos-ai. Shell cwd resets between commands.
apiClient pattern: apiClient is new ApiClient() — call .request(endpoint, options), not as a function.
BackgroundTasks: The admin_prompts.py trigger_assessment uses FastAPI BackgroundTasks with asyncio.run() inside a sync wrapper. This is the correct pattern for dispatching async work from sync endpoints.
SessionLocal in background: Background tasks can't use request-scoped get_db. Use SessionLocal() directly and close in finally.
Worker service: FutureAGI calls go to agent-opt-worker at AGENT_OPT_WORKER_URL (Railway internal). Worker owns API keys. Orchestrator just posts to /assess, /safety, /optimize, /score.
APScheduler in codebase: HeartbeatService uses AsyncIOScheduler from APScheduler. Follow this pattern for Phase 1C health monitor.
Recharts in frontend: Already a dependency — use for Phase 1C health charts and Phase 2 quality visualizations.
Optimize works via worker: Worker /optimize collects dataset from live traffic and runs FutureAGI optimize. No longer broken (was broken with direct SDK).
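The SessionLocal gotcha, as a sketch with the session factory injected (mirrors the "close in finally" rule; the row type is whatever ORM object the caller passes):

```python
def store_eval_run(session_factory, run_row) -> None:
    """Persist an eval run from a background task: open a session directly
    (request-scoped get_db is unavailable here) and always close it."""
    db = session_factory()
    try:
        db.add(run_row)
        db.commit()
    except Exception:
        db.rollback()
        raise
    finally:
        db.close()
```

In production the factory would be `SessionLocal` from `core/database/database.py`.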
Database Access (Railway)
Railway CLI
Worker Service
Git History (futureAGI branch, most recent first)
Build Progress Tracker
Phase 1A — COMPLETE ✅
All components shipped. See Phase 1A section above.
Phase 1B — COMPLETE ✅
| # | Step | Status | File | Notes |
|---|---|---|---|---|
| 1 | Add futureagi_eval_enabled column to SystemPrompt model | ✅ DONE | core/models/system_prompts.py | Boolean, default False. Added to PromptResponse schema. |
| 2 | Add PATCH /futureagi-toggle API endpoint | ✅ DONE | api/admin_prompts.py | Toggle on/off, returns updated prompt |
| 3 | Add toggle switch to frontend prompt detail view | ✅ DONE | SystemPromptsTab.tsx | Toggle in assessments tab with green/grey styling |
| 4 | Add eval_live_traffic() method to FutureAGIService | ✅ DONE | core/services/futureagi_service.py | Routes to worker /score, stores as "live" run |
| 5 | Hook into chat pipeline after LLM response | ✅ DONE | consumers/chatbot/service.py | Fire-and-forget at 3 hook points (lines 746, 1226, 1499) |
| 6 | Store per-message eval results in eval_runs table | ✅ DONE | Covered by step 4 | Links to prompt_id + version_id, run_type="live" |
| 7 | Display accumulated scores in assessments dashboard | ✅ DONE | SystemPromptsTab.tsx | "live" runs render same as "assess" runs |
| 8 | Test end-to-end: toggle ON → send chat → verify scores | ✅ DONE | Manual | Deployed and working via worker service |
| 9 | Idempotent column migration on startup | ✅ DONE | main.py | ALTER TABLE ADD COLUMN IF NOT EXISTS |
Phase 1C — NOT STARTED
See Phase 1C section above for 16 implementation steps.
Next up: Step 1 — Add PromptHealthSnapshot and PromptABTest models
Phase 2 — NOT STARTED
See Phase 2 section above. Recommended order: 2C → 2B → 2D → 2A.
Last updated: 2026-02-19
Architecture: Orchestrator → agent-opt-worker service (Railway internal HTTP)
Note: Architecture evolved from direct FutureAGI HTTP to isolated worker service. All FutureAGI API concerns live in the worker.