Prompt Evaluation
This page documents the prompt evaluation system, which scores system prompts using quality metrics and safety checks powered by the FutureAGI SDK. Evaluation runs are triggered on-demand via the admin UI or automatically on live chat traffic, with results stored in the SystemPromptEvalRun table.
For managing prompt versions and the registry, see System Prompt Management. For automatic live traffic evaluation, see Live Traffic Scoring. For optimization algorithms, see Prompt Optimization. For worker service architecture details, see Worker Service Architecture.
Overview
The evaluation system provides three types of assessments:
| Type | Purpose | Metrics | Worker Endpoint |
| --- | --- | --- | --- |
| Assessment | Quality scoring for prompt effectiveness | completeness, is_helpful, is_concise, prompt_adherence, factual_accuracy | `/assess` |
| Safety | Security and content moderation checks | toxicity, prompt_injection, content_moderation, bias_detection | `/safety` |
| Live Scoring | Real-time evaluation of chat interactions | Configurable subset (default: completeness, is_helpful, is_concise) | `/score` |
All evaluations are dispatched from the orchestrator (FutureAGIService) to the isolated worker service (agent-opt-worker), which handles SDK calls and returns structured results.
Sources: orchestrator/core/services/futureagi_service.py:117-145, services/agent-opt-worker/main.py:218-241, services/agent-opt-worker/main.py:253-291
Evaluation Architecture
Sources: orchestrator/core/services/futureagi_service.py:44-73, orchestrator/api/admin_prompts.py:395-465, services/agent-opt-worker/main.py:16-35
Assessment Endpoint
The /assess endpoint scores prompt quality using configurable metrics. Each metric runs as a separate evaluation template via the FutureAGI SDK.
Request Flow
Sources: orchestrator/api/admin_prompts.py:395-465, orchestrator/core/services/futureagi_service.py:117-144, services/agent-opt-worker/main.py:218-241
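The dispatch step can be sketched as follows. This is a stdlib-only illustration, not the real orchestrator code: the `WORKER_URL` default and all payload field names are assumptions, and the actual service referenced above presumably uses its own HTTP client and schema.

```python
import json
import urllib.request

# Hypothetical default; the real URL comes from orchestrator configuration.
WORKER_URL = "http://agent-opt-worker:8000"

def build_assess_payload(prompt_text, sample_input, sample_output, metrics):
    """Shape of the payload sent to the worker's /assess endpoint
    (field names are illustrative)."""
    return {
        "system_prompt": prompt_text,
        "input": sample_input,
        "output": sample_output,
        "metrics": metrics,
    }

def dispatch_assessment(payload, timeout=120):
    """POST the payload to the worker and return its parsed JSON response."""
    req = urllib.request.Request(
        f"{WORKER_URL}/assess",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```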
Metrics Configuration
The worker maintains a TEMPLATE_CONFIG dictionary mapping metric names to their required inputs and optimal models:
| Metric | Required Inputs | Model | Description |
| --- | --- | --- | --- |
| completeness | input, output | turing_large | Checks if response fully addresses query |
| is_helpful | input, output | turing_large | Evaluates utility and actionability |
| is_concise | output | turing_large | Measures response brevity |
| prompt_adherence | input, output | turing_large | Checks instruction following |
| factual_accuracy | input, output | turing_large | Verifies correctness |
| groundedness | input, output, context | turing_large | Ensures context-based responses |
| summary_quality | input, output | turing_large | Evaluates summarization |
Sources: services/agent-opt-worker/main.py:124-136
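A sketch of what this mapping might look like, with keys taken from the metrics listed above. The exact dictionary structure and helper are assumptions; the real definition lives in the worker source referenced above.

```python
# Sketch of the worker's TEMPLATE_CONFIG mapping (structure assumed; the
# real dictionary lives in services/agent-opt-worker/main.py).
TEMPLATE_CONFIG = {
    "completeness":     {"required_inputs": ["input", "output"],            "model": "turing_large"},
    "is_helpful":       {"required_inputs": ["input", "output"],            "model": "turing_large"},
    "is_concise":       {"required_inputs": ["output"],                     "model": "turing_large"},
    "prompt_adherence": {"required_inputs": ["input", "output"],            "model": "turing_large"},
    "factual_accuracy": {"required_inputs": ["input", "output"],            "model": "turing_large"},
    "groundedness":     {"required_inputs": ["input", "output", "context"], "model": "turing_large"},
    "summary_quality":  {"required_inputs": ["input", "output"],            "model": "turing_large"},
}

def inputs_for(metric: str) -> list[str]:
    """Look up which fields a metric's evaluation template needs."""
    return TEMPLATE_CONFIG[metric]["required_inputs"]
```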
Parallel Execution
All metrics are evaluated concurrently using ThreadPoolExecutor to minimize latency:
Sources: services/agent-opt-worker/main.py:226-238
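The fan-out pattern can be illustrated with a stub evaluator; the real worker invokes a FutureAGI template where `evaluate_metric` returns a fixed score here.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_metric(metric: str, payload: dict) -> dict:
    """Stand-in for the per-metric SDK call; the real worker runs a
    FutureAGI evaluation template at this point."""
    return {"metric": metric, "score": 1.0}

def run_assessment(metrics: list[str], payload: dict) -> dict:
    """Submit every metric to a thread pool at once and collect results
    by name, so total latency is roughly the slowest single metric."""
    with ThreadPoolExecutor(max_workers=len(metrics)) as pool:
        futures = {m: pool.submit(evaluate_metric, m, payload) for m in metrics}
        return {m: f.result() for m, f in futures.items()}
```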
Result Parsing
The SDK returns varied output formats depending on template type. The worker normalizes all results to a consistent structure:
Sources: services/agent-opt-worker/main.py:74-120
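A minimal sketch of such a normalizer. The input shapes handled here (dict, number, pass/fail string) are assumptions about what the SDK may return, not an exhaustive list of the formats the real worker handles.

```python
def normalize_result(metric: str, raw) -> dict:
    """Coerce varied SDK output shapes into one consistent structure
    (the shapes handled here are illustrative assumptions)."""
    if isinstance(raw, dict):
        score = raw.get("score", raw.get("value"))
        reason = raw.get("reason", raw.get("explanation", ""))
    elif isinstance(raw, (int, float)):
        score, reason = float(raw), ""
    elif isinstance(raw, str):  # e.g. a bare pass/fail verdict
        score = 1.0 if raw.lower() in ("pass", "passed", "yes") else 0.0
        reason = raw
    else:
        score, reason = None, f"unrecognized result type: {type(raw).__name__}"
    return {"metric": metric, "score": score, "reason": reason}
```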
Safety Checks
The /safety endpoint runs security and content moderation templates to detect harmful content in system prompts.
Safety Templates
| Template | Input | Model | Description |
| --- | --- | --- | --- |
| toxicity | output | protect | Detects toxic, offensive, or harmful language |
| prompt_injection | input | protect | Identifies prompt injection attacks |
| content_moderation | output | protect | Flags inappropriate content |
| bias_detection | output | protect_flash | Detects biased language |
Sources: services/agent-opt-worker/main.py:256-257
Safety Preamble
To reduce false positives when evaluating system prompts (which contain instructional language), the worker prefixes the content with a preamble:
This context prevents the safety model from flagging legitimate instruction text (e.g., "Handle toxic user input gracefully") as unsafe.
Sources: services/agent-opt-worker/main.py:247-260
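The prefixing step itself is trivial; the sketch below shows the pattern. The preamble wording here is hypothetical, not the text used by the worker.

```python
# Hypothetical preamble wording; the actual text lives in the worker source.
SAFETY_PREAMBLE = (
    "The following text is a system prompt that instructs an AI assistant. "
    "Evaluate whether the prompt itself is harmful, not the behaviors it "
    "tells the assistant to handle.\n\n"
)

def with_preamble(system_prompt: str) -> str:
    """Prefix the prompt before passing it to the safety templates."""
    return SAFETY_PREAMBLE + system_prompt
```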
Safety Result Structure
Safety checks return a combined verdict:
Sources: services/agent-opt-worker/main.py:276-291, frontend/components/settings/SystemPromptsTab.tsx:638-661
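The combination logic presumably looks something like the following: the prompt is considered safe only if every individual template passed. Field names (`passed`, `safe`, `failed_checks`) are assumptions for illustration.

```python
def combine_safety_results(results: dict) -> dict:
    """Fold per-template verdicts into one overall verdict; safe only if
    every individual check passed (field names are illustrative)."""
    failed = [name for name, r in results.items() if not r.get("passed", False)]
    return {"safe": not failed, "failed_checks": failed, "checks": results}
```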
Evaluation Run Lifecycle
Sources: orchestrator/core/services/futureagi_service.py:349-427, orchestrator/core/models/system_prompts.py:108-139
Database Models
SystemPromptEvalRun:
Sources: orchestrator/core/models/system_prompts.py:108-139
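To illustrate the record's shape, here is a dataclass sketch. The real model is a SQLAlchemy table, and every field name below is an assumption based on the lifecycle described in this section.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class SystemPromptEvalRun:
    """Illustrative shape of an eval-run record; the actual model is a
    SQLAlchemy table and its field names may differ."""
    prompt_id: str
    run_type: str                  # e.g. "assessment" | "safety" | "live"
    status: str = "pending"        # pending -> running -> completed | failed
    results: Optional[dict] = None
    error: Optional[str] = None
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```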
Background Task Processing
The orchestrator uses FastAPI's BackgroundTasks to run evaluations asynchronously:
This prevents blocking the HTTP response while the worker processes the evaluation (which can take 10-60 seconds for multiple metrics).
Sources: orchestrator/api/admin_prompts.py:442-448
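The pattern can be mimicked with a plain thread: return a pending run record immediately while the evaluation completes in the background. This stdlib stand-in mirrors the FastAPI `BackgroundTasks` flow without reproducing the actual endpoint code.

```python
import threading

def run_eval(run: dict) -> None:
    """Stand-in for the worker round-trip; marks the run completed.
    The result payload here is fabricated for illustration."""
    run["status"] = "completed"
    run["results"] = {"completeness": 0.9}

def trigger_assessment(prompt_id: str):
    """Return a pending run record right away and finish the evaluation
    on a background thread, mirroring FastAPI's BackgroundTasks pattern."""
    run = {"prompt_id": prompt_id, "status": "pending", "results": None}
    worker = threading.Thread(target=run_eval, args=(run,))
    worker.start()
    return run, worker
```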
Results Display
Frontend Polling
The UI polls assessment runs every 3 seconds when any are pending or running:
Sources: frontend/components/settings/SystemPromptsTab.tsx:166-173
Assessment Results Rendering
Sources: frontend/components/settings/SystemPromptsTab.tsx:592-722
Quality Score Visualization
For completed assessment runs, the UI displays each metric with a visual progress bar:
Sources: frontend/components/settings/SystemPromptsTab.tsx:619-634
API Integration
Triggering Assessments
Endpoint: POST /api/admin/prompts/{prompt_id}/assess
Request:
Response:
Sources: orchestrator/api/admin_prompts.py:395-465
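Plausible request and response shapes, written as Python dicts. All field names and values here are assumptions inferred from the surrounding description (the endpoint returns immediately with a pending run; results arrive later via the assessment-runs listing).

```python
# Illustrative request body for POST /api/admin/prompts/{prompt_id}/assess;
# field names are assumptions, not the actual schema.
assess_request = {
    "metrics": ["completeness", "is_helpful", "is_concise"],
    "sample_input": "How do I reset my password?",
    "sample_output": "Go to Settings > Security and click 'Reset password'.",
}

# Illustrative immediate response: a pending run that the UI then polls.
assess_response = {
    "run_id": "run_123",
    "status": "pending",
}
```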
Listing Assessment Runs
Endpoint: GET /api/admin/prompts/{prompt_id}/assessment-runs
Returns all runs for a prompt, sorted by created_at DESC:
Sources: orchestrator/api/admin_prompts.py:364-392
Worker HTTP Endpoints
/assess - Quality Metrics
Request:
Response:
Sources: services/agent-opt-worker/main.py:218-241
/safety - Security Checks
Request:
Response:
Sources: services/agent-opt-worker/main.py:253-291
/score - Live Traffic
Request:
Response:
Sources: services/agent-opt-worker/main.py:298-327
Error Handling
Worker Errors
The worker wraps each evaluation in a try/except block and returns errors in the result structure:
Individual metric failures don't crash the entire assessment; the other metrics continue executing.
Sources: services/agent-opt-worker/main.py:70-72
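The isolation pattern can be sketched as a small wrapper: a failing evaluator yields an error entry instead of an exception, so sibling metrics are unaffected. Names here are illustrative.

```python
def safe_evaluate(metric: str, evaluator) -> dict:
    """Run one metric and convert any exception into an error entry
    instead of letting it propagate and abort the other metrics."""
    try:
        return {"metric": metric, "score": evaluator(), "error": None}
    except Exception as exc:  # report, don't crash
        return {"metric": metric, "score": None, "error": str(exc)}

def boom():
    """Simulated SDK failure for demonstration."""
    raise RuntimeError("SDK timeout")
```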
Orchestrator Error Handling
The FutureAGIService catches worker errors and stores them in the eval run:
This ensures the run record always has a final state, even if the worker times out or crashes.
Sources: orchestrator/core/services/futureagi_service.py:401-415
Timeout Handling
The orchestrator sets timeouts on worker HTTP calls:
- Assessment/Safety: 120 seconds (WORKER_TIMEOUT)
- Optimization: 300 seconds (OPTIMIZE_TIMEOUT)
Sources: orchestrator/core/services/futureagi_service.py:24-26
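A sketch of how these constants might be read, with env-var overrides falling back to the documented defaults. The env var names come from the bullets above; the exact parsing is an assumption.

```python
import os

# Defaults mirror the documented values; env vars override them.
WORKER_TIMEOUT = int(os.getenv("WORKER_TIMEOUT", "120"))      # /assess, /safety calls
OPTIMIZE_TIMEOUT = int(os.getenv("OPTIMIZE_TIMEOUT", "300"))  # optimization calls
```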
Configuration
Environment Variables
Orchestrator:
Worker:
The orchestrator only needs the worker URL; all API keys live on the worker side to maintain isolation.
Sources: orchestrator/core/services/futureagi_service.py:24, services/agent-opt-worker/main.py:42-51
Availability Check
The FutureAGIService checks availability on initialization:
When unavailable, all evaluation methods return error responses immediately without attempting HTTP calls.
Sources: orchestrator/core/services/futureagi_service.py:62-72
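The guard pattern can be sketched as follows. The env var name `AGENT_OPT_WORKER_URL` is an assumption for illustration, not necessarily what the real service reads.

```python
import os

class FutureAGIServiceSketch:
    """Minimal sketch of the availability guard; the env var name is a
    hypothetical stand-in for the real configuration key."""

    def __init__(self):
        # Available only when a worker URL is configured.
        self.available = bool(os.getenv("AGENT_OPT_WORKER_URL"))

    def assess(self, prompt: str) -> dict:
        if not self.available:
            # Short-circuit: no HTTP call is attempted when unconfigured.
            return {"error": "evaluation worker not configured"}
        return {"status": "dispatched"}  # real path would POST to the worker
```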