PRD-58: System Prompt Management & FutureAGI Evaluation Integration

Version: 3.0
Status: In Progress
Date: February 19, 2026 (updated from v2.0, Feb 18)
Author: Claude Code + Gerard
Prerequisites: PRD-29 (FutureAGI Observability), PRD-55 (Autonomous Assistant / Soul Designer)
Branch: futureAGI (worktree: automatos-ai-futureAGI)
FutureAGI Access: 6-month free trial (expires ~August 2026)


Changelog

| Version | Date | Changes |
| --- | --- | --- |
| 1.0 | Feb 17 | Initial draft — SDK-based approach |
| 2.0 | Feb 18 | Complete rewrite of Phase 1B. Dropped SDK (crashes Docker). Direct HTTP API. Live traffic eval with toggle. Intelligent self-healing vision. |
| 3.0 | Feb 19 | Phase 1B marked complete. Full detailed specs for Phase 1C (Self-Healing) and Phase 2 (User-Facing Features). Architecture evolved to worker service pattern. |


Executive Summary

Automatos has 17 system prompts hardcoded across the orchestrator that control every decision the platform makes — routing, task decomposition, agent selection, quality assessment, SQL generation, memory injection, personality, and more. Today these prompts:

  • Live in Python source code, scattered across 17 files

  • Can only be changed by a developer deploying code

  • Have zero quality data — they were tuned by feel, never evaluated

  • Cannot be A/B tested, versioned, or rolled back without git

This PRD introduces three capabilities:

  1. Admin Prompt Management System (Phase 1A) — Move all 17 system prompts into the database with a full management UI, version history, and instant rollback. No code deploys to change how Automatos thinks. STATUS: SHIPPED

  2. FutureAGI Live Traffic Evaluation (Phase 1B) — Per-prompt toggle that sends real chat input/output pairs to FutureAGI for quality scoring in the background. Scores accumulate over time as a live quality dashboard. Safety scanning on prompt text. STATUS: COMPLETE

  3. Intelligent Self-Healing Prompt System (Phase 1C) — The orchestrator monitors rolling quality scores per prompt. When it detects degradation (score drops, error rate spikes), it automatically enables eval, runs optimization, and can swap in improved prompt versions. A self-healing AI. STATUS: PLANNED

Architecture Decision: Worker Service Pattern

The FutureAGI Python SDK (futureagi==0.6.0) was evaluated and rejected (v2.0). The architecture then evolved further:

  1. v1.0: SDK in orchestrator → crashed Docker, broken APIs

  2. v2.0: Direct HTTP in orchestrator → worked but polluted orchestrator deps

  3. v3.0 (current): Isolated agent-opt-worker service handles all FutureAGI calls. Orchestrator calls worker via internal HTTP. Zero FutureAGI deps in orchestrator.

Decision: Orchestrator → Worker service (AGENT_OPT_WORKER_URL) via httpx. Worker owns FutureAGI API keys and SDK concerns. Orchestrator owns DB operations and dispatch logic.
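A minimal sketch of the orchestrator-side dispatch under this decision. The worker endpoint names (`/score`, `/safety`, `/assess`, `/optimize`) come from the gotchas section of this PRD; the payload field names and default URL are illustrative assumptions, not the shipped schema.

```python
# Sketch only: the real dispatch lives in core/services/futureagi_service.py.
import os

# AGENT_OPT_WORKER_URL is the real env var; the default here is an assumption.
WORKER_URL = os.environ.get(
    "AGENT_OPT_WORKER_URL", "http://agent-opt-worker.railway.internal"
)

def build_score_payload(user_input: str, llm_output: str) -> dict:
    """Shape of a live-traffic scoring request sent to the worker (assumed)."""
    return {"input": user_input, "output": llm_output}

async def score_via_worker(user_input: str, llm_output: str) -> dict:
    """POST a real (input, output) pair to the worker for FutureAGI scoring."""
    import httpx  # imported lazily: the orchestrator only needs it for worker calls
    async with httpx.AsyncClient(timeout=90.0) as client:
        resp = await client.post(
            f"{WORKER_URL}/score",
            json=build_score_payload(user_input, llm_output),
        )
        resp.raise_for_status()
        return resp.json()
```

The point of the split: the orchestrator knows only one URL and a few routes, while every FutureAGI credential and SDK concern stays inside the worker.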


Current Implementation Status

Phase 1A: Admin Prompt Management — SHIPPED ✅

Everything in the original Phase 1A is deployed and working:

| Component | Status | Commits |
| --- | --- | --- |
| DB tables (system_prompts, versions, eval_runs) | ✅ Deployed | a02a002 |
| Seed script (15 prompts across 4 categories) | ✅ Deployed | a02a002 |
| PromptRegistry service with caching + fallback | ✅ Deployed | Pre-existing |
| Admin API endpoints (CRUD + versions + rollback) | ✅ Deployed | 5157daa |
| Admin UI: SystemPromptsTab in Settings | ✅ Deployed | b80b1d7, d4eac9a |
| Auth: Clerk/API-key users can manage prompts | ✅ Deployed | 5157daa |
| Startup: create_tables + seed on Railway boot | ✅ Deployed | a02a002 |

Phase 1B: FutureAGI Integration — COMPLETE ✅

| Component | Status | Notes |
| --- | --- | --- |
| Direct HTTP service (no SDK) | ✅ Deployed | Routes through agent-opt-worker service |
| Per-template config (keys + models) | ✅ Deployed | 6b8936d |
| Concurrent execution (asyncio.gather) | ✅ Deployed | 6b8936d |
| Safety scanning (4 checks) | ✅ Working | toxicity, bias, injection, moderation |
| Assessment (3 metrics) | ✅ Working | completeness, is_helpful, is_concise via worker |
| Optimize | ✅ Working | Uses worker /optimize with dataset from live traffic |
| Frontend: per-run-type rendering | ✅ Deployed | d4eac9a |
| Frontend: polling + loading states | ✅ Deployed | 60a46e2 |
| Live traffic eval toggle | ✅ Deployed | Per-prompt toggle in assessments tab |
| Hook into chat pipeline | ✅ Deployed | Fire-and-forget at 3 hook points in service.py |
| Idempotent column migration | ✅ Deployed | ALTER TABLE ADD COLUMN IF NOT EXISTS in main.py |
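The idempotent migration in the last row, sketched as what main.py runs on boot. The column name is confirmed elsewhere in this PRD; the exact DDL and execution plumbing are assumptions.

```python
# Idempotent startup migration sketch: safe to run on every Railway boot
# because IF NOT EXISTS makes re-application a no-op.
MIGRATION_SQL = (
    "ALTER TABLE system_prompts "
    "ADD COLUMN IF NOT EXISTS futureagi_eval_enabled BOOLEAN NOT NULL DEFAULT FALSE"
)
```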


The 17 System Prompts

(Unchanged from v1.0 — see original section)

Every prompt that drives Automatos decision-making:

Core Orchestration (the "brain")

| # | Prompt | File | What It Decides | Impact |
| --- | --- | --- | --- | --- |
| 1 | Personality/Soul | consumers/chatbot/personality.py | Tone, style, warmth of every response | Every message |
| 2 | Default System Prompt | consumers/chatbot/prompt_analyzer.py | Fallback identity when no personality set | Every message |
| 3 | Routing Classifier | core/routing/engine.py | Which agent handles each user message | Every message |
| 4 | Intent Classifier Patterns | consumers/chatbot/intent_classifier.py | Simple vs complex, tool needs, memory needs | Every message |
| 5 | Tool Ranking Logic | consumers/chatbot/prompt_analyzer.py | Which tools get suggested to the LLM | Every tool call |

Multi-Agent Orchestration (complex tasks)

| # | Prompt | File | What It Decides | Impact |
| --- | --- | --- | --- | --- |
| 6 | Task Decomposer | modules/orchestrator/stages/task_decomposer.py | How complex tasks break into subtasks | Every multi-step task |
| 7 | Complexity Analyzer | modules/orchestrator/stages/complexity_analyzer.py | Simple vs moderate vs complex classification | Every task |
| 8 | Agent Selector | modules/orchestrator/llm/llm_agent_selector.py | Which agent runs each subtask | Every subtask |
| 9 | Strategy Planner | modules/orchestrator/llm/master_orchestrator.py | Speed/quality/cost tradeoff strategy | Every orchestrated workflow |
| 10 | Quality Assessor | modules/orchestrator/stages/quality_assessor.py | Whether outputs meet quality threshold | Every output |
| 11 | Context Optimizer | modules/search/optimization/context_optimizer.py | What context gets injected, how much, what format | Every context-enriched request |

Domain-Specific

| # | Prompt | File | What It Decides | Impact |
| --- | --- | --- | --- | --- |
| 12 | NL2SQL Generator | modules/nl2sql/query/nl2sql_service.py | Natural language → SQL translation | Every data query |
| 13 | Memory Injection Template | modules/memory/operations/prompt_injection.py | How memories get formatted into context | Every memory-aware request |
| 14 | Agent Factory Builder | modules/agents/factory/agent_factory.py | Dynamic agent system prompt assembly | Every agent execution |
| 15 | Execution Manager | modules/agents/execution/execution_manager.py | Default professional execution prompt | Every agent run |

Personas & Templates

| # | Prompt | File | What It Decides | Impact |
| --- | --- | --- | --- | --- |
| 16 | Persona Presets (x4) | core/seeds/seed_personas.py | Engineer, Sales, Marketing, Support personas | Agent creation |
| 17 | Recipe Learning | core/services/recipe_learning_service.py | How improvement suggestions are generated | Recipe optimization |


Phase 1A: Admin Prompt Management System — SHIPPED ✅

(Unchanged from v1.0 — fully deployed. See original PRD for data model, API endpoints, and UI specs.)

Key endpoints deployed:
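The endpoint list itself is missing from this copy; the surface below is reconstructed from references elsewhere in this PRD (CRUD + versions + rollback, the assessment trigger, and the eval toggle), so the exact paths may differ from what shipped:

```
GET/POST/PATCH/DELETE  /api/admin/prompts                # prompt CRUD
GET    /api/admin/prompts/{id}/versions                  # version history
POST   /api/admin/prompts/{id}/rollback                  # instant rollback
POST   /api/admin/prompts/{id}/assess                    # trigger assessment
PATCH  /api/admin/prompts/{id}/futureagi-toggle          # live-traffic eval toggle
```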


Phase 1B: FutureAGI Live Traffic Evaluation — COMPLETE ✅

1B.1 The Problem with Synthetic Testing

The original PRD assumed we'd create test datasets per prompt and evaluate against those. In practice:

  • FutureAGI evaluates input/output pairs, not prompts in isolation

  • Sending fake output ("System prompt assessed successfully.") produces garbage scores

  • Creating realistic test datasets for 17 prompts is weeks of manual work

  • Synthetic tests don't reflect real-world usage patterns

1B.2 The Solution: Live Traffic Evaluation

Instead of synthetic tests, evaluate real conversations.

When a system prompt has FutureAGI evaluation enabled:

  1. A user sends a chat message

  2. The orchestrator selects the system prompt, sends it + user message to LLM

  3. LLM responds

  4. Fire-and-forget: send the real (user input, LLM output) to FutureAGI for scoring

  5. Score is stored in system_prompt_eval_runs linked to the prompt version

  6. Scores accumulate over time → live quality dashboard

Zero impact on chat latency — the eval call is fully async in the background.
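The fire-and-forget step above, sketched with `asyncio.create_task()` as the shipped hooks in consumers/chatbot/service.py use it. Function names here are illustrative; the sleep stands in for the worker call and DB write.

```python
# Fire-and-forget eval hook sketch: the chat handler never awaits scoring.
import asyncio

async def _eval_live_traffic(prompt_id: int, user_input: str, llm_output: str) -> None:
    """Background scoring: post to the worker, persist the run."""
    try:
        await asyncio.sleep(0)  # stands in for worker /score call + eval_runs insert
    except Exception:
        pass  # graceful degradation: eval failures must never reach the user

def schedule_eval(prompt_id: int, user_input: str, llm_output: str) -> None:
    """Called right after the LLM response; returns immediately."""
    asyncio.create_task(_eval_live_traffic(prompt_id, user_input, llm_output))
```

Because the task is created and not awaited, chat latency is unchanged whether the eval succeeds, fails, or times out.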

1B.3 Per-Prompt Eval Toggle

New field on SystemPrompt model:

Admin UI: Toggle switch per prompt in the detail view. When ON, every real chat interaction using that prompt gets scored.

API:
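The field and endpoint bodies were elided here; this hedged sketch shows the shape with the SQLAlchemy/FastAPI plumbing stripped out. Only the column name `futureagi_eval_enabled` and the existence of a PATCH toggle endpoint are confirmed by this PRD; everything else is illustrative.

```python
# Illustrative stand-in for the ORM model and toggle handler logic
# (real code: core/models/system_prompts.py and api/admin_prompts.py).
from dataclasses import dataclass

@dataclass
class SystemPromptRow:
    id: int
    name: str
    futureagi_eval_enabled: bool = False  # new Phase 1B field, default off

def toggle_futureagi_eval(prompt: SystemPromptRow, enabled: bool) -> SystemPromptRow:
    """Logic behind the PATCH toggle: flip the flag, return the updated row."""
    prompt.futureagi_eval_enabled = enabled
    return prompt
```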

1B.4 Hook Point: Chat Pipeline

The hook goes into consumers/chatbot/service.py where SmartChatIntegration prepares the orchestrated request and the LLM response comes back.

1B.5 FutureAGI Service: Direct HTTP (No SDK)

1B.6 Live Traffic Eval Method

New method on FutureAGIService:

1B.7 Safety Scanning — Working ✅

Safety scanning runs on prompt text directly (no real I/O needed):

  • toxicity — protect model, checks output text

  • bias_detection — protect_flash model, checks output text

  • prompt_injection — protect model, checks input text

  • content_moderation — protect model, checks output text

All 4 checks run concurrently via asyncio.gather().
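The concurrent fan-out can be sketched as below. The four check names come from this section; `run_check` stands in for the real HTTP call through the worker `/safety` endpoint.

```python
# Run all four safety checks concurrently with asyncio.gather().
import asyncio

SAFETY_CHECKS = ("toxicity", "bias_detection", "prompt_injection", "content_moderation")

async def run_check(name: str, text: str):
    """Placeholder for one worker safety call; returns (check, passed)."""
    await asyncio.sleep(0)  # stands in for the HTTP round-trip
    return name, True

async def run_safety_scan(text: str) -> dict:
    """Fan out all checks at once; total latency is the slowest check, not the sum."""
    results = await asyncio.gather(*(run_check(c, text) for c in SAFETY_CHECKS))
    return dict(results)
```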

1B.8 Optimize — Now Routed Through the Worker

FutureAGI's improve-prompt API went async (returns job IDs, no inline results), which originally forced a deferral. Two options were considered:

  1. Poll FutureAGI for results (no polling endpoint found at the time)

  2. Use our own LLMs to optimize based on accumulated assessment feedback

Resolution: optimize now works through the worker's /optimize endpoint, which builds its dataset from live traffic (see the Phase 1B status table). Live traffic eval remains the foundation everything else builds on.

1B.9 Frontend: Assessment Dashboard

The Assessments tab per prompt shows:

  • Toggle: FutureAGI Eval ON/OFF switch

  • Live scores: Rolling quality metrics from real traffic

  • Assessment runs: Individual eval results with reasons

  • Safety results: Per-check pass/fail with detailed reasons

  • Auto-polling every 3s while runs are pending/running

1B.10 Implementation Steps (Phase 1B — all shipped)

| Step | What | Status |
| --- | --- | --- |
| 1 | Add futureagi_eval_enabled column to SystemPrompt | ✅ Done |
| 2 | Add PATCH /eval-toggle API endpoint | ✅ Done |
| 3 | Add toggle switch to frontend prompt detail view | ✅ Done |
| 4 | Add eval_live_traffic() method to FutureAGIService | ✅ Done |
| 5 | Hook into chat pipeline (service.py after LLM response) | ✅ Done |
| 6 | Store per-message eval results in eval_runs table | ✅ Done |
| 7 | Display accumulated scores in assessments dashboard | ✅ Done |
| 8 | Test: send chat messages with eval ON, verify scores appear | ✅ Done |

1B.11 Known Issues

| Issue | Impact | Workaround |
| --- | --- | --- |
| prompt_adherence template returns server-side error | Can't use this metric | Replaced with is_helpful in defaults |
| bias_detection can timeout (>90s) | Occasional missing safety check | 90s timeout, graceful degradation |
| FutureAGI optimize API returns async job IDs | Can't get optimized prompts inline | Resolved: now routed through worker /optimize |
| FutureAGI API has no rate limiting docs | Unknown throughput limits | Start with eval on 1-2 prompts, monitor |


Phase 1C: Intelligent Self-Healing Prompt System — PLANNED

Vision

The orchestrator becomes self-aware about prompt quality. Instead of an admin manually checking scores, the system:

  1. Monitors rolling average quality scores per prompt version

  2. Detects degradation — score drops below threshold, error rate spikes

  3. Responds automatically:

    • Enables FutureAGI eval if not already on

    • Triggers optimization (using own LLMs to rewrite based on failure patterns)

    • Creates a new prompt version candidate

    • A/B tests the candidate against the current version

    • If candidate scores better → activates it

    • If not → discards and alerts admin

  4. Reports what it did and why via audit trail

Architecture

1C.1 Data Model Changes

New table for health monitoring + audit trail:

New columns on existing models:

1C.2 Prompt Health Monitor Service

Uses APScheduler (already in codebase via HeartbeatService) to run every 5 minutes.
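The monitor's core arithmetic can be sketched independently of the scheduler. The 20-eval window and the 15%/30% drop thresholds come from 1C.6; the function names are illustrative for the planned core/services/prompt_health_monitor.py.

```python
# Degradation detection sketch: rolling average vs stored baseline.
def rolling_average(scores, window=20):
    """Mean of the most recent `window` scores, or None with no data."""
    recent = list(scores)[-window:]
    return sum(recent) / len(recent) if recent else None

def classify_health(baseline, scores):
    """Map the relative drop from baseline onto the health states in 1C.6."""
    avg = rolling_average(scores)
    if avg is None or baseline <= 0:
        return "unknown"
    drop = (baseline - avg) / baseline
    if drop > 0.30:
        return "critical"   # critical alert + fast-track optimization
    if drop > 0.15:
        return "degraded"   # enable eval if off, trigger optimization
    return "healthy"
```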

1C.3 LLM-as-Judge Optimization

When FutureAGI optimize is unavailable or as a primary strategy, use own LLMs:

1C.4 A/B Testing Infrastructure

The chat pipeline already selects the active prompt version. For A/B testing, we intercept that selection:

The eval_live_traffic hook already stores version_id on each SystemPromptEvalRun, so scores naturally accumulate per version. The Health Monitor compares scores between control and candidate versions to decide promotion.
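The interception itself reduces to a weighted coin flip per request. The `traffic_split` field name is the one assumed in open question 8; a seedable RNG keeps the split testable.

```python
# A/B version selection sketch for the prompt-selection path.
import random

def pick_version(control_id, candidate_id, traffic_split=0.5, rng=None):
    """Return the candidate version with probability traffic_split, else control."""
    rng = rng or random
    return candidate_id if rng.random() < traffic_split else control_id
```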

1C.5 A/B Test Resolution
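The resolution rule can be sketched directly from the 1C.6 thresholds: promote after a minimum of 50 evals per version when the candidate scores more than 10% better, and discard stale tests after 7 days. Parameter names are illustrative.

```python
# A/B resolution sketch run on each monitor cycle.
def resolve_ab_test(control_scores, candidate_scores, days_running,
                    min_evals=50, max_days=7):
    """Return 'promote', 'discard', or 'continue' for a running test."""
    if days_running > max_days:
        return "discard"  # no winner within the test window: alert admin
    if len(control_scores) < min_evals or len(candidate_scores) < min_evals:
        return "continue"  # not enough data yet
    control = sum(control_scores) / len(control_scores)
    candidate = sum(candidate_scores) / len(candidate_scores)
    if control > 0 and (candidate - control) / control > 0.10:
        return "promote"
    return "continue"
```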

1C.6 Health Monitor Triggers

| Signal | Threshold | Action |
| --- | --- | --- |
| Rolling quality score drops >15% from baseline | Over 20-eval window | Enable eval if off, trigger optimization |
| Rolling quality score drops >30% from baseline | Over 20-eval window | Critical alert to admin + fast-track optimization |
| Failure rate >10% | Over last 50 evals | Enable eval, trigger optimization |
| No eval data for >24h (stale) | Time-based | Auto-enable eval for heartbeat data |
| A/B candidate scores >10% better | After min 50 evals per version | Auto-promote candidate version |
| A/B test running >7 days with no winner | Time-based | Discard candidate, alert admin |

1C.7 Admin API Endpoints

1C.8 Frontend: Prompt Health Dashboard

New "Health" sub-tab in the prompt detail view (alongside existing Editor, Versions, Assessments):

Health Overview Card:

  • Health status badge (healthy / degraded / critical / recovering)

  • Composite score gauge (0-100) with baseline indicator

  • Trend sparkline (last 24h of health snapshots)

Rolling Metrics Chart (Recharts — already in frontend deps):

  • Line chart showing completeness, helpfulness, conciseness over time

  • Baseline reference line

  • Degradation threshold markers

A/B Test Panel (when active):

  • Side-by-side score comparison: Control vs Candidate

  • Eval count per version

  • Projected winner based on current trend

  • Manual override buttons: "Promote Now" / "Discard"

  • Diff view of control vs candidate prompt text

Audit Trail:

  • Chronological list of auto-healing actions

  • Each entry: timestamp, action taken, reason, outcome

  • Link to relevant prompt version

Global Health Dashboard (new section in Settings → System Prompts):

  • Table of all prompts with health status, composite score, trend arrow

  • Filter by status (healthy/degraded/critical)

  • Sort by composite score or degradation delta

1C.9 Notifications

When the system takes auto-healing actions, notify admins:

| Event | Channel | Content |
| --- | --- | --- |
| Prompt degraded | In-app toast + audit log | "Personality prompt quality dropped 18%. Optimization triggered." |
| Optimization complete | In-app toast + audit log | "New candidate version v4 generated for Routing Classifier." |
| A/B test started | Audit log | "A/B test started: v3 (control) vs v4 (candidate), 50/50 split." |
| A/B test resolved | In-app toast + audit log | "Candidate v4 promoted — scored 14% better than v3." |
| Critical degradation | In-app toast + audit log | "CRITICAL: Task Decomposer quality dropped 35%. Manual review recommended." |

1C.10 Startup Integration

1C.11 Implementation Steps

| # | Step | Files | Notes |
| --- | --- | --- | --- |
| 1 | Add PromptHealthSnapshot and PromptABTest models | core/models/system_prompts.py | New tables + Pydantic schemas |
| 2 | Add health_status, baseline_score, ab_test_id to SystemPrompt | core/models/system_prompts.py | Idempotent migration in main.py |
| 3 | Add is_candidate, generated_by, generation_context to SystemPromptVersion | core/models/system_prompts.py | Idempotent migration in main.py |
| 4 | Create PromptHealthMonitor service | core/services/prompt_health_monitor.py | APScheduler, rolling averages, degradation detection |
| 5 | Add LLM-as-judge optimization logic | core/services/prompt_health_monitor.py | Uses existing LLM client + Claude Sonnet |
| 6 | Add A/B test traffic splitting to PromptRegistry | core/services/prompt_registry.py or smart_orchestrator.py | Random split based on traffic_split |
| 7 | Add A/B test resolution logic | core/services/prompt_health_monitor.py | Check running tests on each monitor cycle |
| 8 | Add health API endpoints | api/admin_prompts.py | Health overview, history, A/B test, audit |
| 9 | Update eval_live_traffic() to tag version_id from A/B test | core/services/futureagi_service.py | Version already tracked, just verify A/B path |
| 10 | Frontend: Health sub-tab with metrics chart | SystemPromptsTab.tsx | Recharts line chart + health status badge |
| 11 | Frontend: A/B test panel | SystemPromptsTab.tsx | Side-by-side scores, promote/discard buttons |
| 12 | Frontend: Global health dashboard | SystemPromptsTab.tsx | Summary table of all prompt health statuses |
| 13 | Frontend: Audit trail list | SystemPromptsTab.tsx | Chronological list of auto-actions |
| 14 | Register health monitor on app startup | main.py | APScheduler job, 5-min interval |
| 15 | Test: degrade a prompt, verify auto-optimization fires | Manual | Temporarily worsen a prompt, observe healing |
| 16 | Test: A/B test lifecycle end-to-end | Manual | Candidate generated → traffic split → promoted/discarded |

1C.12 LLM-as-Judge Fallback Strategy

FutureAGI trial expires ~August 2026. The self-healing system must survive without it:

| Capability | FutureAGI Path | LLM-as-Judge Fallback |
| --- | --- | --- |
| Live traffic scoring | Worker /score endpoint | Claude Haiku judges (input, output) against rubric |
| Safety scanning | Worker /safety endpoint | Claude scans for toxicity/injection using system prompt |
| Optimization | Worker /optimize endpoint | Claude Sonnet rewrites based on failure patterns (already built in 1C) |
| Quality metrics | FutureAGI templates (completeness, etc.) | Custom rubrics scored 0-1 by Claude |

Fallback activation: When FutureAGI worker returns errors for >1 hour, auto-switch to LLM-as-judge mode. Log the switch. Admin can manually toggle in settings.
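The activation rule above can be sketched as a small state machine: track the first failure in the current streak, switch modes once the streak exceeds one hour, and let any success reset it. The class and method names are illustrative.

```python
# Fallback-mode selector sketch for the >1 hour worker-error rule.
import time

class EvalModeSelector:
    def __init__(self, failure_window_s=3600.0):
        self.failure_window_s = failure_window_s
        self.first_failure_at = None  # timestamp of first error in current streak

    def record_result(self, ok, now=None):
        """Feed each worker call result; returns the eval mode to use next."""
        now = time.time() if now is None else now
        if ok:
            self.first_failure_at = None  # success resets the failure streak
        elif self.first_failure_at is None:
            self.first_failure_at = now
        if (self.first_failure_at is not None
                and now - self.first_failure_at > self.failure_window_s):
            return "llm_as_judge"
        return "futureagi"
```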


Phase 2: User-Facing Features — PLANNED

Prerequisite: Phase 1B complete (eval data flowing), Phase 1C desirable but not blocking.

Phase 2 brings prompt quality data out of the admin settings and into user-facing surfaces where it drives real product value.

2A. Model Comparison Modal

Goal: Wire the existing marketplace Compare button to a real comparison view backed by FutureAGI quality scores.

Current State: frontend/components/marketplace/marketplace-llms-tab.tsx and llm-model-detail-modal.tsx exist. Compare button present but wired to static/heuristic data.

2A.1 How It Works

  1. Admin selects 2-3 models to compare in the LLM Marketplace

  2. System runs the same set of test prompts through each model

  3. FutureAGI scores each model's output on completeness, helpfulness, conciseness

  4. Results displayed side-by-side in a comparison modal

2A.2 Backend

Test Prompt Defaults (used when admin doesn't provide custom):

2A.3 Frontend

Comparison Modal (extends llm-model-detail-modal.tsx):

  • Side-by-side columns, one per model

  • Radar chart (Recharts) showing metric scores overlaid

  • Bar chart comparing composite scores

  • Expandable rows showing individual prompt results

  • Latency and cost comparison row

  • "Select Winner" button that highlights recommended model

2A.4 Implementation Steps

| # | Step | Files |
| --- | --- | --- |
| 1 | Create ModelComparisonRequest/Result schemas | core/models/marketplace.py (or new file) |
| 2 | Create /api/marketplace/models/compare endpoint | api/model_comparison.py |
| 3 | Background worker: run prompts through each model, score with FutureAGI | core/services/model_comparison_service.py |
| 4 | Store comparison results (new table or use eval_runs with run_type="comparison") | core/models/ |
| 5 | Frontend: Comparison modal with radar + bar charts | marketplace/model-comparison-modal.tsx |
| 6 | Wire Compare button in marketplace-llms-tab.tsx | marketplace/marketplace-llms-tab.tsx |


2B. Enhanced Recipe Suggestions

Goal: Replace heuristic quality_score in recipe learning with real FutureAGI eval scores after recipe execution.

Current State: recipe_learning_service.py, recipe_quality_service.py, and recipe_memory_service.py exist. Quality scores are computed via heuristics in orchestrator/modules/orchestrator/tracker.py (~line 238: quality_score: float). Recipes tab in frontend at frontend/components/workflows/recipes-tab.tsx.

2B.1 How It Works

  1. User executes a recipe (workflow)

  2. Orchestrator runs the recipe, produces output

  3. New: Fire-and-forget FutureAGI eval on recipe input/output (same as live chat eval)

  4. Score stored alongside recipe execution result

  5. Recipe suggestions panel shows actual quality scores instead of heuristic

  6. Recipe learning uses real scores to rank improvement suggestions

2B.2 Backend Changes

2B.3 Frontend Changes

  • recipes-tab.tsx: Show actual quality scores with color coding (green >0.8, yellow 0.6-0.8, red <0.6)

  • recipe-suggestions-panel.tsx: Rank suggestions by real quality delta, not heuristic

  • recipe-preview-panel.tsx: Show per-execution quality trend chart

  • view-recipe-modal.tsx: Quality score breakdown in execution history

2B.4 Implementation Steps

| # | Step | Files |
| --- | --- | --- |
| 1 | Add FutureAGI scoring hook after recipe execution | core/services/recipe_quality_service.py |
| 2 | Store real scores in recipe_executions table | core/models/ (add eval_scores JSONB column) |
| 3 | Update recipe_learning_service to use real scores | core/services/recipe_learning_service.py |
| 4 | Frontend: Quality score badges on recipe cards | workflows/recipes-tab.tsx |
| 5 | Frontend: Quality trend chart in recipe detail | workflows/recipe-preview-panel.tsx |
| 6 | Frontend: Score-ranked suggestion panel | workflows/recipe-suggestions-panel.tsx |


2C. Agent "Test My Prompt" Button

Goal: In the agent creation wizard, let users test their system prompt with generated scenarios and get quality scores before deploying.

Current State: frontend/components/agents/create-agent-modal.tsx has the agent creation flow. Backend agent factory at modules/agents/factory/agent_factory.py.

2C.1 How It Works

  1. User writes a system prompt in agent creation modal

  2. Clicks "Test My Prompt" button

  3. System generates 3-5 test scenarios relevant to the prompt's purpose

  4. Runs each scenario through the prompt + selected LLM

  5. Scores each response with FutureAGI

  6. Displays results inline: pass/fail per scenario, overall quality score

  7. User can iterate on prompt, re-test, then save

2C.2 Backend

Scenario Generation (uses LLM):
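The generator body was elided here; this is a hedged sketch of the request it might build for the planned core/services/prompt_testing_service.py. The prompt wording and message shape are assumptions, not the shipped implementation.

```python
# Scenario-generation request sketch: ask an LLM for realistic test inputs.
def build_scenario_request(system_prompt, n=5):
    """Build a chat-completion style request that yields n test scenarios."""
    return {
        "messages": [
            {"role": "system",
             "content": "You design realistic test scenarios for agent prompts."},
            {"role": "user",
             "content": (f"Given this agent system prompt:\n\n{system_prompt}\n\n"
                         f"Generate {n} realistic user messages that exercise its "
                         "purpose, including at least one edge case. "
                         "Return one message per line.")},
        ]
    }
```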

2C.3 Frontend Changes

Add to create-agent-modal.tsx:

  • "Test My Prompt" button below the system prompt textarea

  • Loading state with progress (testing scenario 2/5...)

  • Results panel showing:

    • Per-scenario: input, response, score badges, pass/fail

    • Overall score with gauge visualization

    • Specific recommendations for improvement

    • "Re-test" button after edits

  • Quality gate: warning if overall score < 0.7 when saving

2C.4 Implementation Steps

| # | Step | Files |
| --- | --- | --- |
| 1 | Create test-prompt endpoint with scenario generation | api/prompt_testing.py |
| 2 | LLM scenario generator | core/services/prompt_testing_service.py |
| 3 | Run scenarios through model + FutureAGI scoring | core/services/prompt_testing_service.py |
| 4 | Frontend: Test button + results panel in create-agent-modal | agents/create-agent-modal.tsx |
| 5 | Frontend: Quality gate warning on save | agents/create-agent-modal.tsx |
| 6 | Cache test results so re-opening modal shows last test | api/prompt_testing.py |


2D. Quality-Aware Model Recommendations

Goal: Replace cost-only model recommendations in analytics with recommendations backed by real eval data.

Current State: frontend/components/analytics/analytics-page.tsx and related analytics components exist. api/benchmarking.py has quality_score fields (heuristic). The orchestrator tracker records quality metrics per execution.

2D.1 How It Works

  1. Live traffic eval (Phase 1B) accumulates quality scores per model used

  2. Analytics dashboard aggregates: model × quality × cost × latency

  3. Recommendations engine: "Switch Routing Classifier from GPT-4o to Claude Sonnet — 12% better quality at 40% lower cost"

  4. Confidence intervals based on eval volume

2D.2 Backend

Recommendation Logic:
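The logic body was elided; below is a hedged quality-only sketch of the aggregation step planned for core/services/model_recommendation_service.py (cost and confidence-interval handling omitted; the 50-eval minimum mirrors the A/B threshold elsewhere in this PRD and is an assumption here).

```python
# Recommendation sketch: aggregate eval scores per model, flag a better one.
from collections import defaultdict

def recommend_model(runs, current_model, min_evals=50):
    """runs: iterable of (model, score). Return a better-scoring model or None."""
    by_model = defaultdict(list)
    for model, score in runs:
        by_model[model].append(score)

    def avg(m):
        return sum(by_model[m]) / len(by_model[m])

    if current_model not in by_model:
        return None  # no data for the model currently in use
    best = max(by_model, key=avg)
    if (best != current_model
            and len(by_model[best]) >= min_evals
            and avg(best) > avg(current_model)):
        return best
    return None
```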

2D.3 Frontend Changes

New section in Analytics page (analytics-page.tsx):

  • "Model Optimization Recommendations" card

  • Table: prompt name, current model, recommended model, quality delta, cost delta

  • Confidence badge per recommendation

  • "Apply" button that updates the prompt's model config

  • Historical view: quality scores per model over time (line chart)

Enhance existing analytics:

  • analytics-overview.tsx: Add "Prompt Quality" section with aggregate scores

  • analytics-costs.tsx: Overlay quality scores on cost charts to show cost/quality tradeoff

  • analytics-llm-usage.tsx: Quality column in model usage breakdown table

2D.4 Prerequisites

This feature requires tracking which model produced each eval run. Needs a small change to eval_live_traffic():
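The change reduces to attaching model identity and latency to each run's metadata. Field names here are assumptions for the eventual eval_live_traffic() update.

```python
# Metadata-tagging sketch: annotate each eval run with the producing model.
def build_eval_metadata(model_id, latency_ms, existing=None):
    """Return run metadata extended with model and latency tags (non-mutating)."""
    meta = dict(existing or {})
    meta.update({"model_id": model_id, "latency_ms": latency_ms})
    return meta
```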

2D.5 Implementation Steps

| # | Step | Files |
| --- | --- | --- |
| 1 | Tag model + latency in live eval run metadata | core/services/futureagi_service.py |
| 2 | Create model recommendations endpoint | api/model_recommendations.py |
| 3 | Recommendation engine: aggregate scores by model, compare | core/services/model_recommendation_service.py |
| 4 | Frontend: Recommendations card in analytics | analytics/analytics-page.tsx |
| 5 | Frontend: Quality column in LLM usage table | analytics/analytics-llm-usage.tsx |
| 6 | Frontend: Cost/quality tradeoff chart | analytics/analytics-costs.tsx |


Phase 2 Priority Order

| Phase | Value | Effort | Depends On | Recommended Order |
| --- | --- | --- | --- | --- |
| 2C — Test My Prompt | High (user-facing quality) | Medium | 1B only | 1st — immediate user value |
| 2B — Recipe Quality | High (real scores > heuristics) | Medium | 1B only | 2nd — replaces hack with real data |
| 2D — Model Recommendations | High (cost savings) | Medium-High | 1B + model tagging | 3rd — needs eval data volume |
| 2A — Model Comparison | Medium (nice-to-have) | High | 1B + multi-model infra | 4th — most infrastructure work |


Environment Variables


FutureAGI API Reference (Direct HTTP)

Reverse-engineered from SDK source + API testing on Feb 18, 2026.

Evaluation Endpoint

Available Templates

Improve Prompt (ASYNC — returns job ID)

Models

  • Quality assessment: turing_large, turing_small, turing_flash

  • Safety scanning: protect, protect_flash


Security & Privacy

  • All /api/admin/prompts/* endpoints require authenticated Clerk/API-key users

  • FutureAGI API keys live on the worker service only, not in orchestrator

  • Live traffic eval sends user messages to worker → FutureAGI API — confirm acceptable under data policy

  • Eval toggle is per-prompt, admin-controlled — not automatically enabled

  • Phase 1C auto-optimization creates full audit trail of every decision (PromptHealthSnapshot + PromptABTest)

  • Phase 1C auto-generated prompts always start as "draft" candidates, never go live without A/B validation

  • Phase 2C "Test My Prompt" generates synthetic scenarios, does NOT use real user data

  • Safety scanning runs on prompt text only, not user content


Risks & Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| FutureAGI trial expires (Aug 2026) | Lose eval capability | LLM-as-judge fallback designed in Phase 1C.12. Auto-switches on worker failure. |
| FutureAGI API rate limits unknown | Eval calls may be throttled | Start with 1-2 prompts enabled, monitor. Sampling strategy in Phase 1B. |
| Live eval sends user data to FutureAGI | Privacy concern | Admin-controlled toggle. Can be disabled entirely. Worker isolates data flow. |
| FutureAGI API instability | 400 errors, timeouts | Worker handles retries. Graceful degradation — never blocks chat. |
| Worker service unavailable | All eval stops | Orchestrator checks is_available. Chat unaffected. Health monitor logs gaps. |
| Auto-optimizer makes things worse (1C) | Quality degrades further | A/B testing with min 50 evals before promotion. Auto-rollback if candidate worse. Admin override available. |
| A/B test introduces inconsistency | Users get different quality | 50/50 split means short exposure. Max 7-day test window with auto-timeout. |
| Health monitor false positives | Unnecessary optimizations | Conservative thresholds (15% degradation). Min 10 evals for baseline. Audit trail for review. |
| Phase 2C token costs | User-facing cost for prompt testing | 5 scenarios × model cost per test. Consider warning or limiting daily tests. |


Open Questions (Updated)

  1. Dataset creation — Resolved: Live traffic eval replaces synthetic datasets

  2. Eval frequency — Resolved: Every message when toggle is ON

  3. Rate limiting: What's FutureAGI's API rate limit? Need to test before enabling on high-traffic prompts

  4. Sampling: For high-traffic prompts, should we eval every message or sample (e.g., 1 in 10)?

  5. Privacy: User messages sent to FutureAGI — need to confirm data handling policy

  6. Phase 1C triggers — Resolved: Detailed thresholds defined in 1C.6 — start conservative

  7. FutureAGI fallback — Resolved: LLM-as-judge fallback designed in 1C.12

  8. A/B test traffic split: Should split be configurable per test or fixed at 50/50? (Spec says configurable via traffic_split field)

  9. Multi-model eval tagging: Phase 2D needs model ID tagged on each eval run — requires chat pipeline to pass model info through to eval hook

  10. Phase 2C cost: Test My Prompt runs real LLM calls (5 scenarios × model cost) — should we warn user about token usage?


Fresh Context Quick Start

If starting a new Claude Code session, read this section first.

Branch & Worktree

Railway deploys automatos-ai-api and automotas-ai-frontend from this branch.

Key File Map

| File | What | Notes |
| --- | --- | --- |
| orchestrator/core/models/system_prompts.py | ORM models: SystemPrompt, SystemPromptVersion, SystemPromptEvalRun + Pydantic schemas | Phase 1C adds PromptHealthSnapshot, PromptABTest here |
| orchestrator/api/admin_prompts.py | FastAPI endpoints for prompt CRUD + assessment + eval-toggle | Phase 1C adds health/ab-test endpoints |
| orchestrator/core/services/futureagi_service.py | Routes all FutureAGI ops to worker service. Has eval_live_traffic(), assess_prompt(), optimize_prompt() | Singleton at futureagi_service |
| orchestrator/services/heartbeat_service.py | APScheduler-based periodic task runner | Pattern to follow for Phase 1C health monitor |
| orchestrator/core/metadata_cache/scheduler.py | schedule lib-based daily sync | Alternative scheduling pattern |
| orchestrator/core/seeds/seed_system_prompts.py | Seeds 15 prompts on startup | |
| orchestrator/core/database/database.py | SessionLocal, get_db, create_tables | Use SessionLocal() in background tasks |
| orchestrator/consumers/chatbot/service.py | Chat pipeline with 3 eval hook points (lines 746, 1226, 1499) | Fire-and-forget via asyncio.create_task() |
| orchestrator/consumers/chatbot/integration.py | SmartChatIntegration + OrchestratedRequest dataclass | Carries system_prompt field |
| orchestrator/consumers/chatbot/smart_orchestrator.py | Builds system prompt via get_happy_system_prompt() | Phase 1C A/B test version selection goes here |
| orchestrator/main.py | FastAPI app — imports routers, runs startup | Phase 1C adds health monitor startup |
| frontend/components/settings/SystemPromptsTab.tsx | Full prompt management UI — list, editor, versions, assessments, eval toggle | Phase 1C adds Health sub-tab |
| frontend/lib/api-client.ts | API client class | GOTCHA: Use apiClient.request() NOT apiClient() |
| frontend/components/agents/create-agent-modal.tsx | Agent creation wizard | Phase 2C adds "Test My Prompt" button |
| frontend/components/marketplace/marketplace-llms-tab.tsx | LLM model marketplace | Phase 2A adds comparison modal |
| frontend/components/workflows/recipes-tab.tsx | Recipe management | Phase 2B adds real quality scores |
| frontend/components/analytics/analytics-page.tsx | Analytics dashboard | Phase 2D adds model recommendations |

Critical Gotchas (learned the hard way)

  1. Wrong directory: Work in automatos-ai-futureAGI, NOT automatos-ai. Shell cwd resets between commands.

  2. apiClient pattern: apiClient is new ApiClient() — call .request(endpoint, options), not as a function.

  3. BackgroundTasks: The admin_prompts.py trigger_assessment uses FastAPI BackgroundTasks with asyncio.run() inside a sync wrapper. This is the correct pattern for dispatching async work from sync endpoints.

  4. SessionLocal in background: Background tasks can't use request-scoped get_db. Use SessionLocal() directly and close in finally.

  5. Worker service: FutureAGI calls go to agent-opt-worker at AGENT_OPT_WORKER_URL (Railway internal). Worker owns API keys. Orchestrator just posts to /assess, /safety, /optimize, /score.

  6. APScheduler in codebase: HeartbeatService uses AsyncIOScheduler from APScheduler. Follow this pattern for Phase 1C health monitor.

  7. Recharts in frontend: Already a dependency — use for Phase 1C health charts and Phase 2 quality visualizations.

  8. Optimize works via worker: Worker /optimize collects dataset from live traffic and runs FutureAGI optimize. No longer broken (was broken with direct SDK).

Database Access (Railway)

Railway CLI

Worker Service

Git History (futureAGI branch, most recent first)


Build Progress Tracker

Phase 1A — COMPLETE ✅

All components shipped. See Phase 1A section above.

Phase 1B — COMPLETE ✅

| # | Step | Status | Files | Notes |
| --- | --- | --- | --- | --- |
| 1 | Add futureagi_eval_enabled column to SystemPrompt model | ✅ DONE | core/models/system_prompts.py | Boolean, default False. Added to PromptResponse schema. |
| 2 | Add PATCH /futureagi-toggle API endpoint | ✅ DONE | api/admin_prompts.py | Toggle on/off, returns updated prompt |
| 3 | Add toggle switch to frontend prompt detail view | ✅ DONE | SystemPromptsTab.tsx | Toggle in assessments tab with green/grey styling |
| 4 | Add eval_live_traffic() method to FutureAGIService | ✅ DONE | core/services/futureagi_service.py | Routes to worker /score, stores as "live" run |
| 5 | Hook into chat pipeline after LLM response | ✅ DONE | consumers/chatbot/service.py | Fire-and-forget at 3 hook points (lines 746, 1226, 1499) |
| 6 | Store per-message eval results in eval_runs table | ✅ DONE | Covered by step 4 | Links to prompt_id + version_id, run_type="live" |
| 7 | Display accumulated scores in assessments dashboard | ✅ DONE | SystemPromptsTab.tsx | "live" runs render same as "assess" runs |
| 8 | Test end-to-end: toggle ON → send chat → verify scores | ✅ DONE | Manual | Deployed and working via worker service |
| 9 | Idempotent column migration on startup | ✅ DONE | main.py | ALTER TABLE ADD COLUMN IF NOT EXISTS |

Phase 1C — NOT STARTED

See Phase 1C section above for 16 implementation steps.

Next up: Step 1 — Add PromptHealthSnapshot and PromptABTest models

Phase 2 — NOT STARTED

See Phase 2 section above. Recommended order: 2C → 2B → 2D → 2A.

Last updated: 2026-02-19
Architecture: Orchestrator → agent-opt-worker service (Railway internal HTTP)
Note: Architecture evolved from direct FutureAGI HTTP to isolated worker service. All FutureAGI API concerns live in the worker.
