Live Traffic Scoring
Purpose: This page documents the FutureAGI live traffic scoring system, which automatically evaluates every chat interaction against quality metrics in real-time. This feature builds a dataset of scored input/output pairs that powers prompt optimization and provides continuous quality monitoring.
For information about the broader prompt optimization system, see Prompt Optimization. For details on the worker service that executes scoring, see Worker Service Architecture. For prompt management and versioning, see System Prompt Management.
Overview
Live traffic scoring is a fire-and-forget evaluation system that runs after each chat response is generated. When enabled for a system prompt, it:
Captures the user input and assistant output from real conversations
Sends them to the agent-opt-worker service for scoring
Evaluates the exchange against quality metrics (completeness, helpfulness, conciseness)
Stores scores in the database as evaluation runs
Builds a dataset for future prompt optimization
The system is designed to have zero impact on user-facing latency—scoring happens asynchronously after the response is already delivered.
Sources: orchestrator/core/services/futureagi_service.py:229-301
System Architecture
Sources: orchestrator/core/services/futureagi_service.py:44-49, orchestrator/core/services/futureagi_service.py:229-301, services/agent-opt-worker/main.py:298-326
Enabling Live Scoring
Live traffic scoring is controlled by the futureagi_eval_enabled flag on the SystemPrompt model. Admins can toggle this flag via the UI or API.
Database Model
| Field | Type | Description |
|---|---|---|
| futureagi_eval_enabled | Boolean | When true, every chat interaction scores this prompt |
| slug | String | Unique identifier (e.g., "chatbot-friendly") |
| category | String | Grouping (personality, orchestrator, specialized) |
Sources: orchestrator/core/models/system_prompts.py:32-69
Frontend Toggle
The System Prompts admin UI includes a toggle switch for each prompt:
Sources: frontend/components/settings/SystemPromptsTab.tsx:249-259, frontend/components/settings/SystemPromptsTab.tsx:554-572, orchestrator/api/admin_prompts.py:341-357
API Endpoint
Toggles the flag and returns the updated PromptResponse. Requires admin authentication.
Sources: orchestrator/api/admin_prompts.py:341-357
Scoring Flow
Invocation
The eval_live_traffic() method is called from the chat streaming pipeline after a response completes. It runs asynchronously and never blocks the user response.
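A minimal sketch of this fire-and-forget invocation, assuming an asyncio-based pipeline; the function bodies and the sleep-based demo are illustrative stand-ins, not the exact code in futureagi_service.py:

```python
import asyncio

async def eval_live_traffic(user_input: str, assistant_output: str) -> None:
    # Stand-in for the real method, which POSTs to the worker and stores scores.
    await asyncio.sleep(0)
    print(f"scored: {user_input!r} -> {assistant_output!r}")

async def stream_chat_response(user_input: str) -> str:
    answer = "hello!"  # stand-in for the fully streamed model response
    # Schedule scoring without awaiting it: the user-facing response path
    # never waits on evaluation.
    asyncio.create_task(eval_live_traffic(user_input, answer))
    return answer

async def main() -> None:
    answer = await stream_chat_response("hi")
    await asyncio.sleep(0.05)  # demo only: give the background task time to run
    print(answer)

asyncio.run(main())
```

Because the task is scheduled rather than awaited, the response returns immediately even if scoring is slow or fails.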
Sources: orchestrator/core/services/futureagi_service.py:234-301
Text Extraction
Messages are stored as JSON arrays of parts (text, tool calls, etc.). The _extract_text() helper extracts plain text:
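A hedged sketch of such a helper; the part shape ({"type": "text", "text": ...}) is an assumption for illustration, not the exact schema used by the orchestrator:

```python
import json

def extract_text(parts_json: str) -> str:
    """Join the plain-text parts of a message, skipping tool calls and
    other non-text parts. The part schema here is illustrative."""
    parts = json.loads(parts_json)
    return " ".join(p.get("text", "") for p in parts if p.get("type") == "text")

message = json.dumps([
    {"type": "text", "text": "Hello"},
    {"type": "tool_call", "name": "search"},
    {"type": "text", "text": "world"},
])
print(extract_text(message))  # -> Hello world
```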
Sources: orchestrator/core/services/futureagi_service.py:29-41
Worker Communication
The orchestrator sends a POST request to the worker's /score endpoint:
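A hedged sketch of the orchestrator-to-worker call using only the standard library; the /score payload field names are inferred from this page, not copied from the service code:

```python
import json
import urllib.request

WORKER_TIMEOUT = 120  # seconds, matching the documented timeout

def build_score_payload(user_input: str, assistant_output: str) -> dict:
    # Metric names come from FutureAGIService.LIVE_METRICS; the payload
    # shape itself is an assumption.
    return {
        "input": user_input,
        "output": assistant_output,
        "metrics": ["completeness", "is_helpful", "is_concise"],
    }

def score_exchange(worker_url: str, user_input: str, assistant_output: str) -> dict:
    body = json.dumps(build_score_payload(user_input, assistant_output)).encode()
    req = urllib.request.Request(
        f"{worker_url}/score",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=WORKER_TIMEOUT) as resp:
        return json.loads(resp.read())

payload = build_score_payload("What is 2+2?", "4")
print(sorted(payload["metrics"]))
```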
The worker responds with scores:
Sources: orchestrator/core/services/futureagi_service.py:258-272, services/agent-opt-worker/main.py:298-326
Metrics and Templates
Default Metrics
Live scoring uses three core quality metrics defined in FutureAGIService.LIVE_METRICS:
| Metric | Type | Model | Evaluates |
|---|---|---|---|
| completeness | Quality metric | turing_large | Does the output fully address the input? |
| is_helpful | Quality metric | turing_large | Is the response genuinely useful? |
| is_concise | Quality metric | turing_large | Is it brief without losing clarity? |
Sources: orchestrator/core/services/futureagi_service.py:232, services/agent-opt-worker/main.py:124-136
Scoring Engine
The worker uses the FutureAGI SDK to run scoring templates:
Sources: services/agent-opt-worker/main.py:54-121, services/agent-opt-worker/main.py:298-326
Input Building
Each template requires specific input keys. The _build_inputs() function maps the provided text to template requirements:
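The idea can be sketched as a lookup from metric name to required keys; the key names per template here are assumptions for illustration, not taken from main.py:

```python
# Hypothetical mapping of each scoring template to the input keys it expects.
TEMPLATE_INPUT_KEYS = {
    "completeness": ("input", "output"),
    "is_helpful": ("input", "output"),
    "is_concise": ("output",),
}

def build_inputs(metric: str, user_input: str, assistant_output: str) -> dict:
    """Select only the fields a given template requires."""
    available = {"input": user_input, "output": assistant_output}
    return {key: available[key] for key in TEMPLATE_INPUT_KEYS[metric]}

print(build_inputs("is_concise", "What is 2+2?", "4"))
```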
Sources: services/agent-opt-worker/main.py:124-150
Concurrent Execution
All metrics run concurrently using ThreadPoolExecutor to minimize total latency:
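A minimal sketch of this pattern; score_one is a stand-in for the real per-template evaluation call into the FutureAGI SDK:

```python
from concurrent.futures import ThreadPoolExecutor

METRICS = ["completeness", "is_helpful", "is_concise"]

def score_one(metric: str) -> tuple[str, float]:
    # Placeholder: the real worker runs the scoring template here.
    return metric, 0.9

def score_all(metrics: list[str]) -> dict[str, float]:
    # One thread per metric keeps total latency close to the slowest
    # single evaluation instead of the sum of all of them.
    with ThreadPoolExecutor(max_workers=len(metrics)) as pool:
        return dict(pool.map(score_one, metrics))

print(score_all(METRICS))
```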
Sources: services/agent-opt-worker/main.py:303-324
Data Storage
SystemPromptEvalRun Model
Each live scoring event creates one database record per enabled prompt:
| Column | Type | Description |
|---|---|---|
| id | UUID | Primary key |
| prompt_id | UUID | FK to SystemPrompt |
| version_id | UUID | FK to SystemPromptVersion (active version at time of scoring) |
| run_type | String | "live" for live traffic scoring |
| status | String | "completed" (set immediately) |
| scores | JSONB | Full scoring results from worker |
| started_at | DateTime | Scoring start time |
| completed_at | DateTime | Scoring end time |
| created_at | DateTime | Record creation time |
Sources: orchestrator/core/models/system_prompts.py:108-139
Scores JSONB Structure
The scores field stores the complete scoring response:
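A hypothetical example of the stored structure; the per-metric fields (score, passed, reason) mirror what the frontend renders, but the exact key names are assumptions:

```json
{
  "completeness": {"score": 0.92, "passed": true, "reason": "..."},
  "is_helpful": {"score": 0.88, "passed": true, "reason": "..."},
  "is_concise": {"score": 0.61, "passed": false, "reason": "..."}
}
```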
Sources: orchestrator/core/services/futureagi_service.py:284-296
Record Creation
A separate SystemPromptEvalRun is created for each enabled prompt, allowing comparison across different prompt configurations:
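A hedged sketch of the per-prompt fan-out, using a dataclass as a stand-in for the SQLAlchemy SystemPromptEvalRun model:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import UUID, uuid4

@dataclass
class EvalRun:
    # Simplified stand-in for the ORM model documented above.
    prompt_id: UUID
    run_type: str = "live"
    status: str = "completed"
    scores: dict = field(default_factory=dict)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    id: UUID = field(default_factory=uuid4)

def record_live_scores(enabled_prompt_ids: list[UUID], scores: dict) -> list[EvalRun]:
    # One row per enabled prompt, so different configurations can be compared.
    return [EvalRun(prompt_id=pid, scores=scores) for pid in enabled_prompt_ids]

runs = record_live_scores([uuid4(), uuid4()], {"is_helpful": {"score": 0.9}})
print(len(runs), runs[0].run_type, runs[0].status)
```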
Sources: orchestrator/core/services/futureagi_service.py:276-294
Integration with Prompt Optimization
Live traffic scoring builds the dataset used for prompt optimization. When an admin triggers optimization, the system:
Queries SystemPromptEvalRun records with run_type="live"
Extracts input/output pairs from recent chat messages
Uses these real examples to optimize the prompt
Dataset Collection
The _collect_optimization_dataset() method queries the messages table to build training data:
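The query itself runs against the messages table, but the pairing rule can be sketched in plain Python, with messages simplified to (role, text) tuples:

```python
def collect_pairs(messages: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Pair each user message with the assistant message immediately after it."""
    pairs = []
    for current, nxt in zip(messages, messages[1:]):
        if current[0] == "user" and nxt[0] == "assistant":
            pairs.append((current[1], nxt[1]))
    return pairs

history = [
    ("user", "Hi"),
    ("assistant", "Hello!"),
    ("user", "What is 2+2?"),
    ("assistant", "4"),
]
print(collect_pairs(history))
```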
This finds user/assistant pairs (the assistant message immediately following each user message) and extracts the text.
Sources: orchestrator/core/services/futureagi_service.py:307-343
Usage in Optimization
When optimize_prompt() is called, it:
Sources: orchestrator/core/services/futureagi_service.py:160-226
The optimization worker uses this dataset to evaluate prompt variants, finding versions that score better on the target metrics.
Performance Considerations
Fire-and-Forget Pattern
Live scoring is implemented as a fire-and-forget operation to ensure zero user-facing latency:
The method returns None immediately and never raises exceptions that would affect the chat pipeline.
Sources: orchestrator/core/services/futureagi_service.py:234-245
Error Handling
All exceptions are caught and logged as warnings, never bubbling up:
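A minimal sketch of this swallow-and-log pattern; the names are illustrative, but the shape matches the behavior described: failures become warnings and the caller always gets None:

```python
import logging

logger = logging.getLogger("futureagi_service")

def eval_live_traffic_safe(score_fn, *args) -> None:
    try:
        score_fn(*args)
    except Exception as exc:  # deliberate catch-all: scoring must never raise
        logger.warning("live scoring failed: %s", exc)
    return None

def broken_scorer(*args):
    raise RuntimeError("worker unreachable")

print(eval_live_traffic_safe(broken_scorer))  # -> None
```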
Sources: orchestrator/core/services/futureagi_service.py:298-301
Worker Timeout
The HTTP call to the worker uses a 120-second timeout (defined in WORKER_TIMEOUT):
If the worker is slow or unavailable, the call fails gracefully without blocking the orchestrator.
Sources: orchestrator/core/services/futureagi_service.py:25, orchestrator/core/services/futureagi_service.py:78-97
Database Session Isolation
Live scoring uses a fresh database session (SessionLocal()) to avoid interfering with the main chat transaction:
Sources: orchestrator/core/services/futureagi_service.py:247-301
Frontend Visualization
Assessment Runs Display
The System Prompts UI displays live scoring results in the "Assessments" tab:
Sources: frontend/components/settings/SystemPromptsTab.tsx:551-723
Score Rendering
Each metric displays:
Color indicator: Green dot (passed), amber dot (failed), gray dot (unknown)
Metric name: Converted from snake_case (e.g., is_helpful → "is helpful")
Score percentage: Rounded to nearest integer
Reason: Collapsible text explanation from the scoring engine
Sources: frontend/components/settings/SystemPromptsTab.tsx:617-635
Configuration
Environment Variables
Live scoring requires the following environment variables:
| Variable | Required | Description |
|---|---|---|
| AGENT_OPT_WORKER_URL | Yes | Worker service URL (default: http://agent-opt-worker.railway.internal:8080) |
| FUTUREAGI_API_KEY | Yes (on worker) | FutureAGI platform API key |
| FUTUREAGI_SECRET_KEY | Yes (on worker) | FutureAGI platform secret key |
The orchestrator checks AGENT_OPT_WORKER_URL to determine if the service is available:
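A minimal sketch of such an availability check; only the variable name comes from this page, the helper itself is hypothetical:

```python
import os

def worker_available() -> bool:
    # Live scoring is silently skipped when the worker URL is unset.
    return bool(os.environ.get("AGENT_OPT_WORKER_URL"))

os.environ["AGENT_OPT_WORKER_URL"] = "http://agent-opt-worker.railway.internal:8080"
print(worker_available())  # -> True
```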
Sources: orchestrator/core/services/futureagi_service.py:24, orchestrator/core/services/futureagi_service.py:62-68
Worker Dependencies
The worker service requires specific SDK packages:
Sources: services/agent-opt-worker/requirements.txt:3-5
Summary
Live Traffic Scoring is a passive quality monitoring system that:
Captures real user interactions automatically
Scores them against quality metrics (completeness, helpfulness, conciseness)
Stores results for analysis and optimization
Enables continuous improvement of system prompts
Operates without any user-facing latency impact
The system is fully decoupled from the chat pipeline—it's enabled per-prompt via a simple flag, runs asynchronously after responses complete, and gracefully handles all failure modes. The accumulated scoring data feeds directly into the prompt optimization workflow, creating a closed loop of continuous quality improvement.
Sources: orchestrator/core/services/futureagi_service.py:1-432, services/agent-opt-worker/main.py:1-545, orchestrator/core/models/system_prompts.py:1-205