Prompt Evaluation


This page documents the prompt evaluation system, which scores system prompts using quality metrics and safety checks powered by the FutureAGI SDK. Evaluation runs are triggered on-demand via the admin UI or automatically on live chat traffic, with results stored in the SystemPromptEvalRun table.

For managing prompt versions and the registry, see System Prompt Management. For automatic live traffic evaluation, see Live Traffic Scoring. For optimization algorithms, see Prompt Optimization. For worker service architecture details, see Worker Service Architecture.


Overview

The evaluation system provides three types of assessments:

| Type | Purpose | Metrics Used | Endpoint |
|------|---------|--------------|----------|
| Assessment | Quality scoring for prompt effectiveness | completeness, is_helpful, is_concise, prompt_adherence, factual_accuracy | /assess |
| Safety | Security and content moderation checks | toxicity, prompt_injection, content_moderation, bias_detection | /safety |
| Live Scoring | Real-time evaluation of chat interactions | Configurable subset (default: completeness, is_helpful, is_concise) | /score |

All evaluations are dispatched from the orchestrator (FutureAGIService) to the isolated worker service (agent-opt-worker), which handles SDK calls and returns structured results.

Sources: orchestrator/core/services/futureagi_service.py:117-145, services/agent-opt-worker/main.py:218-241, services/agent-opt-worker/main.py:253-291
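The dispatch step can be pictured as a simple routing map from evaluation type to worker endpoint. This is an illustrative sketch only; `WORKER_ENDPOINTS` and `worker_url` are hypothetical names, and the endpoint paths come from the table above:

```python
# Hypothetical sketch of how the orchestrator could route an evaluation
# to the worker; only the endpoint paths are taken from the docs.
WORKER_ENDPOINTS = {
    "assessment": "/assess",
    "safety": "/safety",
    "live": "/score",
}

def worker_url(base_url: str, eval_type: str) -> str:
    # Join the configured worker base URL with the endpoint for this run type.
    return base_url.rstrip("/") + WORKER_ENDPOINTS[eval_type]
```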


Evaluation Architecture


Sources: orchestrator/core/services/futureagi_service.py:44-73, orchestrator/api/admin_prompts.py:395-465, services/agent-opt-worker/main.py:16-35


Assessment Endpoint

The /assess endpoint scores prompt quality using configurable metrics. Each metric runs as a separate evaluation template via the FutureAGI SDK.

Request Flow


Sources: orchestrator/api/admin_prompts.py:395-465, orchestrator/core/services/futureagi_service.py:117-144, services/agent-opt-worker/main.py:218-241

Metrics Configuration

The worker maintains a TEMPLATE_CONFIG dictionary mapping metric names to their required inputs and optimal models:

| Metric | Required Inputs | Model | Purpose |
|--------|-----------------|-------|---------|
| completeness | input, output | turing_large | Checks if response fully addresses query |
| is_helpful | input, output | turing_large | Evaluates utility and actionability |
| is_concise | output | turing_large | Measures response brevity |
| prompt_adherence | input, output | turing_large | Checks instruction following |
| factual_accuracy | input, output | turing_large | Verifies correctness |
| groundedness | input, output, context | turing_large | Ensures context-based responses |
| summary_quality | input, output | turing_large | Evaluates summarization |

Sources: services/agent-opt-worker/main.py:124-136
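A minimal sketch of what this mapping might look like. The metric names, inputs, and models come from the table above; the exact dict shape and the `required_inputs` helper are assumptions, not the worker's actual code:

```python
# Illustrative shape of the worker's TEMPLATE_CONFIG (three representative
# entries shown); the real dictionary lives in services/agent-opt-worker/main.py.
TEMPLATE_CONFIG = {
    "completeness": {"required_inputs": ["input", "output"], "model": "turing_large"},
    "is_concise": {"required_inputs": ["output"], "model": "turing_large"},
    "groundedness": {"required_inputs": ["input", "output", "context"], "model": "turing_large"},
}

def required_inputs(metric: str) -> list[str]:
    # Look up which fields a metric's template needs before dispatching it.
    return TEMPLATE_CONFIG[metric]["required_inputs"]
```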

Parallel Execution

All metrics are evaluated concurrently using ThreadPoolExecutor to minimize latency.

Sources: services/agent-opt-worker/main.py:226-238
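The fan-out pattern can be sketched as follows, with a stub standing in for the FutureAGI SDK call (the stub and function names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_metric(metric: str, payload: dict) -> tuple[str, float]:
    # Stand-in for one FutureAGI SDK template call; returns (metric, score).
    return metric, 0.9

def run_all(metrics: list[str], payload: dict) -> dict:
    # Submit one evaluation per metric so total latency is roughly the
    # slowest single call rather than the sum of all calls.
    with ThreadPoolExecutor(max_workers=len(metrics)) as pool:
        futures = [pool.submit(evaluate_metric, m, payload) for m in metrics]
        return dict(f.result() for f in futures)
```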

Result Parsing

The SDK returns varied output formats depending on template type. The worker normalizes all results to a consistent structure.


Sources: services/agent-opt-worker/main.py:74-120
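The normalization step might look like this sketch, which collapses two common result shapes (a dict or a bare number) into one structure; the real parser handles more template formats, and the 0.5 pass threshold here is an assumption:

```python
def normalize_result(raw) -> dict:
    # Coerce varied SDK result shapes into {score, passed, reason}.
    if isinstance(raw, dict):
        score = raw.get("score", raw.get("value", 0.0))
        reason = raw.get("reason", "")
    else:
        # Bare numeric result: treat the value as the score directly.
        score, reason = float(raw), ""
    return {"score": float(score), "passed": float(score) >= 0.5, "reason": reason}
```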


Safety Checks

The /safety endpoint runs security and content moderation templates to detect harmful content in system prompts.

Safety Templates

| Template | Input | Model | Purpose |
|----------|-------|-------|---------|
| toxicity | output | protect | Detects toxic, offensive, or harmful language |
| prompt_injection | input | protect | Identifies prompt injection attacks |
| content_moderation | output | protect | Flags inappropriate content |
| bias_detection | output | protect_flash | Detects biased language |

Sources: services/agent-opt-worker/main.py:256-257

Safety Preamble

To reduce false positives when evaluating system prompts (which contain instructional language), the worker prefixes the content with a preamble.

This context prevents the safety model from flagging legitimate instruction text (e.g., "Handle toxic user input gracefully") as unsafe.

Sources: services/agent-opt-worker/main.py:247-260
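A sketch of the prefixing step; the preamble wording below is illustrative, not the worker's exact text:

```python
SAFETY_PREAMBLE = (
    "The text below is a system prompt under review. It may legitimately "
    "instruct the assistant on how to handle unsafe topics."
)  # illustrative wording only

def wrap_for_safety(prompt_text: str) -> str:
    # Prefix the prompt so the safety model judges it as instructions,
    # not as user-generated content.
    return f"{SAFETY_PREAMBLE}\n\n{prompt_text}"
```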

Safety Result Structure

Safety checks return a combined verdict.

Sources: services/agent-opt-worker/main.py:276-291, frontend/components/settings/SystemPromptsTab.tsx:638-661
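The combination logic can be sketched as: the overall verdict is safe only when every individual check passes. The field names here are assumptions about the result shape:

```python
def combine_safety(results: dict) -> dict:
    # Aggregate per-template results into one verdict; a single failing
    # check makes the whole prompt unsafe.
    return {
        "safe": all(r["passed"] for r in results.values()),
        "checks": results,
    }
```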


Evaluation Run Lifecycle


Sources: orchestrator/core/services/futureagi_service.py:349-427, orchestrator/core/models/system_prompts.py:108-139

Database Models

SystemPromptEvalRun:

Sources: orchestrator/core/models/system_prompts.py:108-139
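As a rough sketch of the record's shape (a plain dataclass standing in for the actual SQLAlchemy model; all field names and the status vocabulary are illustrative):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class SystemPromptEvalRun:
    # Illustrative fields only; the real model lives in
    # orchestrator/core/models/system_prompts.py.
    id: str
    prompt_id: str
    run_type: str                  # e.g. "assessment" | "safety" | "live"
    status: str = "pending"        # pending -> running -> completed | failed
    scores: Optional[dict] = None  # per-metric results once completed
    error: Optional[str] = None    # populated when the run fails
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```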

Background Task Processing

The orchestrator uses FastAPI's BackgroundTasks to run evaluations asynchronously.

This prevents blocking the HTTP response while the worker processes the evaluation (which can take 10-60 seconds for multiple metrics).

Sources: orchestrator/api/admin_prompts.py:442-448
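The control flow can be sketched with a minimal stand-in for the handler: respond immediately with a pending run, and queue the evaluation to execute after the response is sent (in the real code the queue is `fastapi.BackgroundTasks`; the id scheme and names here are hypothetical):

```python
from typing import Callable, List

def trigger_assessment(prompt_id: str,
                       queue: List[Callable[[], None]],
                       run_eval: Callable[[str], None]) -> dict:
    # Return a pending run right away; the evaluation itself is queued
    # and runs only after the HTTP response has gone out.
    run_id = f"eval-{prompt_id}"  # illustrative id scheme
    queue.append(lambda: run_eval(run_id))
    return {"run_id": run_id, "status": "pending"}
```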


Results Display

Frontend Polling

The UI polls assessment runs every 3 seconds when any are pending or running.

Sources: frontend/components/settings/SystemPromptsTab.tsx:166-173

Assessment Results Rendering


Sources: frontend/components/settings/SystemPromptsTab.tsx:592-722

Quality Score Visualization

For completed assessment runs, the UI displays each metric with a visual progress bar.

Sources: frontend/components/settings/SystemPromptsTab.tsx:619-634


API Integration

Triggering Assessments

Endpoint: POST /api/admin/prompts/{prompt_id}/assess

Request:

Response:

Sources: orchestrator/api/admin_prompts.py:395-465
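As a hedged illustration of the request/response shapes (the field names are assumptions; the real schema is defined in orchestrator/api/admin_prompts.py):

```python
def assess_request(metrics=None) -> dict:
    # Hypothetical body for POST /api/admin/prompts/{prompt_id}/assess;
    # defaulting to the documented live-scoring metric subset.
    return {"metrics": metrics or ["completeness", "is_helpful", "is_concise"]}

def assess_response(run_id: str) -> dict:
    # The endpoint responds immediately; the run completes in the background.
    return {"run_id": run_id, "status": "pending"}
```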

Listing Assessment Runs

Endpoint: GET /api/admin/prompts/{prompt_id}/assessment-runs

Returns all runs for a prompt, sorted by created_at DESC.

Sources: orchestrator/api/admin_prompts.py:364-392


Worker HTTP Endpoints

/assess - Quality Metrics

Request:

Response:

Sources: services/agent-opt-worker/main.py:218-241

/safety - Security Checks

Request:

Response:

Sources: services/agent-opt-worker/main.py:253-291

/score - Live Traffic

Request:

Response:

Sources: services/agent-opt-worker/main.py:298-327
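A sketch of what a /score request body might contain, assuming the documented default metric subset; the field names are illustrative:

```python
def build_score_payload(user_message: str, assistant_reply: str,
                        metrics=None) -> dict:
    # Hypothetical shape of a live-traffic scoring request sent to /score.
    return {
        "input": user_message,
        "output": assistant_reply,
        "metrics": metrics or ["completeness", "is_helpful", "is_concise"],
    }
```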


Error Handling

Worker Errors

The worker wraps each evaluation in a try/except block and returns errors inside the result structure rather than raising.

Individual metric failures don't crash the entire assessment; the remaining metrics continue executing.

Sources: services/agent-opt-worker/main.py:70-72
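The isolation pattern can be sketched as a per-metric wrapper (names and result shape are illustrative):

```python
def safe_eval(metric: str, fn) -> dict:
    # Run one metric's evaluation; convert any exception into an error
    # entry so sibling metrics keep running.
    try:
        return {"metric": metric, "score": fn(), "error": None}
    except Exception as exc:
        return {"metric": metric, "score": None, "error": str(exc)}
```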

Orchestrator Error Handling

The FutureAGIService catches worker errors and stores them in the eval run record.

This ensures the run record always has a final state, even if the worker times out or crashes.

Sources: orchestrator/core/services/futureagi_service.py:401-415
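The terminal-state guarantee might look like this sketch, with a plain dict standing in for the run record (the status values and field names are assumptions):

```python
def finalize_run(run: dict, worker_result: dict) -> dict:
    # Always leave the run in a terminal state: failed with the worker's
    # error, or completed with its scores.
    if worker_result.get("error"):
        run["status"], run["error"] = "failed", worker_result["error"]
    else:
        run["status"], run["scores"] = "completed", worker_result["scores"]
    return run
```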

Timeout Handling

The orchestrator sets timeouts on worker HTTP calls:

  • Assessment/Safety: 120 seconds (WORKER_TIMEOUT)

  • Optimization: 300 seconds (OPTIMIZE_TIMEOUT)

Sources: orchestrator/core/services/futureagi_service.py:24-26


Configuration

Environment Variables

Orchestrator:

Worker:

The orchestrator only needs the worker URL; all API keys live on the worker side to maintain isolation.

Sources: orchestrator/core/services/futureagi_service.py:24, services/agent-opt-worker/main.py:42-51

Availability Check

The FutureAGIService checks availability on initialization.

When unavailable, all evaluation methods return error responses immediately without attempting HTTP calls.

Sources: orchestrator/core/services/futureagi_service.py:62-72
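The fail-fast behavior can be sketched as a guard at the top of each evaluation method; the function and error message below are illustrative, not the service's actual API:

```python
def assess(worker_url, payload: dict) -> dict:
    # With no worker URL configured, return an error immediately
    # instead of attempting an HTTP call.
    if not worker_url:
        return {"error": "evaluation worker not configured", "scores": None}
    # Real implementation would POST payload to worker_url + "/assess".
    raise NotImplementedError
```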

