Live Traffic Scoring


Purpose: This page documents the FutureAGI live traffic scoring system, which automatically evaluates every chat interaction against quality metrics in real-time. This feature builds a dataset of scored input/output pairs that powers prompt optimization and provides continuous quality monitoring.

For information about the broader prompt optimization system, see Prompt Optimization. For details on the worker service that executes scoring, see Worker Service Architecture. For prompt management and versioning, see System Prompt Management.


Overview

Live traffic scoring is a fire-and-forget evaluation system that runs after each chat response is generated. When enabled for a system prompt, it:

  1. Captures the user input and assistant output from real conversations

  2. Sends them to the agent-opt-worker service for scoring

  3. Evaluates the exchange against quality metrics (completeness, helpfulness, conciseness)

  4. Stores scores in the database as evaluation runs

  5. Builds a dataset for future prompt optimization

The system is designed to have zero impact on user-facing latency—scoring happens asynchronously after the response is already delivered.

Sources: orchestrator/core/services/futureagi_service.py:229-301


System Architecture

(Diagram: live scoring architecture spanning the chat pipeline, FutureAGIService in the orchestrator, the agent-opt-worker service, and the database.)

Sources: orchestrator/core/services/futureagi_service.py:44-49, orchestrator/core/services/futureagi_service.py:229-301, services/agent-opt-worker/main.py:298-326


Enabling Live Scoring

Live traffic scoring is controlled by the futureagi_eval_enabled flag on the SystemPrompt model. Admins can toggle this flag via the UI or API.

Database Model

| Field | Type | Description |
|---|---|---|
| futureagi_eval_enabled | Boolean | When true, every chat interaction is scored against this prompt |
| slug | String | Unique identifier (e.g., "chatbot-friendly") |
| category | String | Grouping (personality, orchestrator, specialized) |

Sources: orchestrator/core/models/system_prompts.py:32-69

Frontend Toggle

The System Prompts admin UI includes a toggle switch for each prompt:

(Diagram: System Prompts admin UI with a per-prompt live-scoring toggle, wired to the admin API.)

Sources: frontend/components/settings/SystemPromptsTab.tsx:249-259, frontend/components/settings/SystemPromptsTab.tsx:554-572, orchestrator/api/admin_prompts.py:341-357

API Endpoint

Toggles the flag and returns the updated PromptResponse. Requires admin authentication.
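A minimal sketch of the toggle logic follows; the SystemPrompt shape here is a simplified stand-in, and the real route in admin_prompts.py also enforces admin authentication:

```python
from dataclasses import dataclass

# Simplified stand-in for the ORM model; only the fields relevant here.
@dataclass
class SystemPrompt:
    slug: str
    futureagi_eval_enabled: bool = False

def toggle_live_scoring(prompt: SystemPrompt) -> SystemPrompt:
    """Flip the live-scoring flag and return the updated prompt."""
    prompt.futureagi_eval_enabled = not prompt.futureagi_eval_enabled
    return prompt

prompt = SystemPrompt(slug="chatbot-friendly")
toggle_live_scoring(prompt)
print(prompt.futureagi_eval_enabled)  # True
```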

Sources: orchestrator/api/admin_prompts.py:341-357


Scoring Flow

Invocation

The eval_live_traffic() method is called from the chat streaming pipeline after a response completes. It runs asynchronously and never blocks the user response.
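The fire-and-forget pattern can be sketched as follows; the function names match the text above, but the bodies are illustrative assumptions:

```python
import asyncio

async def eval_live_traffic(user_text: str, assistant_text: str) -> None:
    # Stands in for the HTTP call to the worker and the database write.
    await asyncio.sleep(0)

async def handle_chat(user_text: str) -> str:
    response = f"echo: {user_text}"    # 1. response is fully generated first
    # 2. scoring is scheduled as a background task, not awaited
    asyncio.create_task(eval_live_traffic(user_text, response))
    return response                    # 3. returned without waiting on scoring

print(asyncio.run(handle_chat("hello")))  # echo: hello
```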

(Diagram: eval_live_traffic() invocation sequence, triggered after a chat response completes.)

Sources: orchestrator/core/services/futureagi_service.py:234-301

Text Extraction

Messages are stored as JSON arrays of parts (text, tool calls, etc.). The _extract_text() helper extracts plain text:
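A sketch of what such a helper can look like; the part shape ("type"/"text" keys) is an assumption about the stored format, and the real implementation is _extract_text() in futureagi_service.py:

```python
import json

def extract_text(raw: str) -> str:
    """Return the plain text from a JSON array of message parts."""
    try:
        parts = json.loads(raw)
    except (TypeError, ValueError):
        return raw or ""  # fall back to the raw string for non-JSON content
    if isinstance(parts, list):
        return " ".join(
            p.get("text", "") for p in parts
            if isinstance(p, dict) and p.get("type") == "text"  # skip tool calls etc.
        ).strip()
    return str(parts)

raw = '[{"type": "text", "text": "Hello"}, {"type": "tool-call", "toolName": "search"}]'
print(extract_text(raw))  # Hello
```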

Sources: orchestrator/core/services/futureagi_service.py:29-41

Worker Communication

The orchestrator sends a POST request to the worker's /score endpoint:

The worker responds with scores:
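The exchange can be sketched as below. The /score path comes from the text above, but the payload and response field names are assumptions, not the exact wire format:

```python
# Assumed request payload for POST {AGENT_OPT_WORKER_URL}/score
def build_score_request(user_text: str, assistant_text: str) -> dict:
    return {
        "input": user_text,
        "output": assistant_text,
        "metrics": ["completeness", "is_helpful", "is_concise"],
    }

# Illustrative response shape: one entry per metric, each with a score and reason.
example_response = {
    "scores": {
        "completeness": {"score": 0.92, "reason": "Fully addresses the question."},
        "is_helpful": {"score": 0.88, "reason": "Gives an actionable answer."},
        "is_concise": {"score": 0.71, "reason": "Slightly verbose."},
    }
}

payload = build_score_request("How do I reset my password?", "Use the 'Forgot password' link.")
print(sorted(payload))  # ['input', 'metrics', 'output']
```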

Sources: orchestrator/core/services/futureagi_service.py:258-272, services/agent-opt-worker/main.py:298-326


Metrics and Templates

Default Metrics

Live scoring uses three core quality metrics defined in FutureAGIService.LIVE_METRICS:

| Metric | Template | Model | Description |
|---|---|---|---|
| completeness | Quality metric | turing_large | Does the output fully address the input? |
| is_helpful | Quality metric | turing_large | Is the response genuinely useful? |
| is_concise | Quality metric | turing_large | Is it brief without losing clarity? |
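A LIVE_METRICS constant consistent with this metric list might look like the sketch below; the actual structure inside FutureAGIService may differ:

```python
# Illustrative only: metric names and model taken from the table above.
LIVE_METRICS = [
    {"name": "completeness", "model": "turing_large"},
    {"name": "is_helpful", "model": "turing_large"},
    {"name": "is_concise", "model": "turing_large"},
]

print([m["name"] for m in LIVE_METRICS])
```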

Sources: orchestrator/core/services/futureagi_service.py:232, services/agent-opt-worker/main.py:124-136

Scoring Engine

The worker uses the FutureAGI SDK to run scoring templates:

(Diagram: worker scoring flow through the FutureAGI SDK templates.)

Sources: services/agent-opt-worker/main.py:54-121, services/agent-opt-worker/main.py:298-326

Input Building

Each template requires specific input keys. The _build_inputs() function maps the provided text to template requirements:
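An illustrative sketch of this mapping; the per-metric key requirements here are assumptions chosen to show the shape, not the worker's actual template definitions:

```python
# Hypothetical per-template input keys.
TEMPLATE_KEYS = {
    "completeness": ("input", "output"),
    "is_helpful": ("input", "output"),
    "is_concise": ("output",),  # e.g. a conciseness check may only need the output
}

def build_inputs(metric: str, user_text: str, assistant_text: str) -> dict:
    """Map the captured text onto the keys a given template requires."""
    available = {"input": user_text, "output": assistant_text}
    return {key: available[key] for key in TEMPLATE_KEYS[metric]}

print(build_inputs("is_concise", "hi", "hello there"))  # {'output': 'hello there'}
```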

Sources: services/agent-opt-worker/main.py:124-150

Concurrent Execution

All metrics run concurrently using ThreadPoolExecutor to minimize total latency:
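The pattern can be sketched as below, with score_one as a stand-in for a real SDK call; total latency is then roughly that of the slowest single metric rather than the sum:

```python
from concurrent.futures import ThreadPoolExecutor

def score_one(metric: str) -> tuple[str, float]:
    # A real implementation would invoke the scoring SDK here.
    return metric, 0.9

metrics = ["completeness", "is_helpful", "is_concise"]
# One thread per metric; pool.map preserves input order.
with ThreadPoolExecutor(max_workers=len(metrics)) as pool:
    results = dict(pool.map(score_one, metrics))

print(results["is_helpful"])  # 0.9
```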

Sources: services/agent-opt-worker/main.py:303-324


Data Storage

SystemPromptEvalRun Model

Each live scoring event creates one database record per enabled prompt:

| Field | Type | Description |
|---|---|---|
| id | UUID | Primary key |
| prompt_id | UUID | FK to SystemPrompt |
| version_id | UUID | FK to SystemPromptVersion (active version at time of scoring) |
| run_type | String | "live" for live traffic scoring |
| status | String | "completed" (set immediately) |
| scores | JSONB | Full scoring results from worker |
| started_at | DateTime | Scoring start time |
| completed_at | DateTime | Scoring end time |
| created_at | DateTime | Record creation time |

Sources: orchestrator/core/models/system_prompts.py:108-139

Scores JSONB Structure

The scores field stores the complete scoring response:
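An illustrative shape for this payload, keyed by metric name; the field names inside each entry are assumptions rather than the exact stored schema:

```python
# Hypothetical scores JSONB content for one live run.
scores = {
    "completeness": {"passed": True, "score": 0.92, "reason": "Addresses the full question."},
    "is_helpful": {"passed": True, "score": 0.88, "reason": "Gives a usable answer."},
    "is_concise": {"passed": False, "score": 0.45, "reason": "Repeats itself."},
}

print(sum(1 for m in scores.values() if m["passed"]))  # 2
```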

Sources: orchestrator/core/services/futureagi_service.py:284-296

Record Creation

A separate SystemPromptEvalRun is created for each enabled prompt, allowing comparison across different prompt configurations:
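A sketch of the fan-out, with plain dicts standing in for ORM rows; field names mirror the SystemPromptEvalRun model described above:

```python
import uuid
from datetime import datetime, timezone

def build_eval_runs(enabled_prompt_ids: list, scores: dict) -> list:
    """One eval-run record per enabled prompt, all sharing the same scores."""
    now = datetime.now(timezone.utc)
    return [
        {
            "id": str(uuid.uuid4()),
            "prompt_id": pid,
            "run_type": "live",
            "status": "completed",
            "scores": scores,
            "created_at": now,
        }
        for pid in enabled_prompt_ids
    ]

runs = build_eval_runs(["prompt-a", "prompt-b"], {"is_helpful": {"score": 0.9}})
print(len(runs))  # 2
```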

Sources: orchestrator/core/services/futureagi_service.py:276-294


Integration with Prompt Optimization

Live traffic scoring builds the dataset used for prompt optimization. When an admin triggers optimization, the system:

  1. Queries SystemPromptEvalRun records with run_type="live"

  2. Extracts input/output pairs from recent chat messages

  3. Uses these real examples to optimize the prompt

Dataset Collection

The _collect_optimization_dataset() method queries the messages table to build training data:

This finds user/assistant pairs (the assistant message immediately following each user message) and extracts the text.
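The pairing logic can be sketched as below (message shape simplified to role/text dicts for illustration):

```python
def pair_messages(messages: list) -> list:
    """For each user message, take the immediately following assistant message."""
    pairs = []
    for prev, nxt in zip(messages, messages[1:]):
        if prev["role"] == "user" and nxt["role"] == "assistant":
            pairs.append((prev["text"], nxt["text"]))
    return pairs

history = [
    {"role": "user", "text": "What is 2+2?"},
    {"role": "assistant", "text": "4"},
    {"role": "user", "text": "Thanks"},  # no assistant reply yet, so no pair
]
print(pair_messages(history))  # [('What is 2+2?', '4')]
```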

Sources: orchestrator/core/services/futureagi_service.py:307-343

Usage in Optimization

When optimize_prompt() is called, it:

(Diagram: optimize_prompt() flow from dataset collection through worker-side optimization.)

Sources: orchestrator/core/services/futureagi_service.py:160-226

The optimization worker uses this dataset to evaluate prompt variants, finding versions that score better on the target metrics.


Performance Considerations

Fire-and-Forget Pattern

Live scoring is implemented as a fire-and-forget operation to ensure zero user-facing latency:

The method returns None immediately and never raises exceptions that would affect the chat pipeline.

Sources: orchestrator/core/services/futureagi_service.py:234-245

Error Handling

All exceptions are caught and logged as warnings, never bubbling up:
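The swallow-and-log pattern looks roughly like this sketch (safe_score is a hypothetical wrapper name):

```python
import logging

logger = logging.getLogger("futureagi")

def safe_score(fn, *args):
    """Run a scoring step; failures become warnings, never exceptions."""
    try:
        return fn(*args)
    except Exception as exc:  # deliberately broad: nothing may reach the chat pipeline
        logger.warning("live scoring failed: %s", exc)
        return None

print(safe_score(lambda: 1 / 0))  # None (the ZeroDivisionError is logged, not raised)
```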

Sources: orchestrator/core/services/futureagi_service.py:298-301

Worker Timeout

The HTTP call to the worker uses a 120-second timeout (defined in WORKER_TIMEOUT):
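A sketch of the timeout guard; the 120-second value matches the text above, while the client call shape (post is a stand-in for an HTTP client) is an assumption:

```python
WORKER_TIMEOUT = 120  # seconds

def call_worker(post, url: str, payload: dict):
    """post stands in for an HTTP client's post(url, json=..., timeout=...)."""
    try:
        return post(f"{url}/score", json=payload, timeout=WORKER_TIMEOUT)
    except Exception:
        return None  # a slow or unavailable worker degrades to "no scores"

# Fake client for illustration:
fake_post = lambda url, json, timeout: {"ok": True, "timeout": timeout}
print(call_worker(fake_post, "http://worker:8080", {}))  # {'ok': True, 'timeout': 120}
```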

If the worker is slow or unavailable, the call fails gracefully without blocking the orchestrator.

Sources: orchestrator/core/services/futureagi_service.py:25, orchestrator/core/services/futureagi_service.py:78-97

Database Session Isolation

Live scoring uses a fresh database session (SessionLocal()) to avoid interfering with the main chat transaction:
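The isolation pattern can be sketched as below; FakeSession is a test stand-in for a SQLAlchemy session, and the real code uses a SessionLocal sessionmaker:

```python
class FakeSession:
    """Minimal stand-in exposing the session methods used here."""
    def __init__(self):
        self.rows, self.committed, self.closed = [], False, False
    def add(self, row): self.rows.append(row)
    def commit(self): self.committed = True
    def rollback(self): pass
    def close(self): self.closed = True

def record_scores(session_factory, run_row) -> None:
    session = session_factory()  # fresh session, independent of the chat request
    try:
        session.add(run_row)     # stand-in for inserting the eval-run record
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()          # always released, even on failure

s = FakeSession()
record_scores(lambda: s, {"run_type": "live"})
print(s.committed, s.closed)  # True True
```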

Sources: orchestrator/core/services/futureagi_service.py:247-301


Frontend Visualization

Assessment Runs Display

The System Prompts UI displays live scoring results in the "Assessments" tab:

(Diagram: Assessments tab in the System Prompts UI listing live scoring runs.)

Sources: frontend/components/settings/SystemPromptsTab.tsx:551-723

Score Rendering

Each metric displays:

  • Color indicator: Green dot (passed), amber dot (failed), gray dot (unknown)

  • Metric name: Converted from snake_case (e.g., is_helpful → "is helpful")

  • Score percentage: Rounded to nearest integer

  • Reason: Collapsible text explanation from the scoring engine

Sources: frontend/components/settings/SystemPromptsTab.tsx:617-635


Configuration

Environment Variables

Live scoring requires the following environment variables:

| Variable | Required | Description |
|---|---|---|
| AGENT_OPT_WORKER_URL | Yes | Worker service URL (default: http://agent-opt-worker.railway.internal:8080) |
| FUTUREAGI_API_KEY | Yes (on worker) | FutureAGI platform API key |
| FUTUREAGI_SECRET_KEY | Yes (on worker) | FutureAGI platform secret key |

The orchestrator checks AGENT_OPT_WORKER_URL to determine if the service is available:
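A minimal sketch of that availability check (the helper name is an assumption):

```python
import os

def worker_available() -> bool:
    """Live scoring is silently disabled when the worker URL is unset."""
    return bool(os.getenv("AGENT_OPT_WORKER_URL"))

os.environ["AGENT_OPT_WORKER_URL"] = "http://agent-opt-worker.railway.internal:8080"
print(worker_available())  # True
```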

Sources: orchestrator/core/services/futureagi_service.py:24, orchestrator/core/services/futureagi_service.py:62-68

Worker Dependencies

The worker service requires specific SDK packages:

Sources: services/agent-opt-worker/requirements.txt:3-5


Summary

Live Traffic Scoring is a passive quality monitoring system that:

  • Captures real user interactions automatically

  • Scores them against quality metrics (completeness, helpfulness, conciseness)

  • Stores results for analysis and optimization

  • Enables continuous improvement of system prompts

  • Operates without any user-facing latency impact

The system is fully decoupled from the chat pipeline—it's enabled per-prompt via a simple flag, runs asynchronously after responses complete, and gracefully handles all failure modes. The accumulated scoring data feeds directly into the prompt optimization workflow, creating a closed loop of continuous quality improvement.

Sources: orchestrator/core/services/futureagi_service.py:1-432, services/agent-opt-worker/main.py:1-545, orchestrator/core/models/system_prompts.py:1-205

