PRD-106 — Outcome Telemetry & Learning Foundation

Version: 1.0
Type: Research + Design
Status: Complete — Ready for Peer Review
Priority: P1
Dependencies: PRD-101 (Mission Schema), PRD-103 (Verification & Quality), PRD-105 (Budget & Governance)
Author: Gerard Kavanagh + Claude
Date: 2026-03-15


1. Problem Statement

1.1 The Gap

Automatos captures per-call LLM usage (llm_usage table — core/models/core.py:138) and stores JSONB findings in heartbeat results, but nothing correlates a multi-step mission to its aggregate cost, duration, quality, or human acceptance. The platform records individual API calls — it never answers "which model performs best for research tasks?" or "what's the average cost of a compliance mission?"

| Gap | Current State | Impact |
|---|---|---|
| No mission-level outcome record | llm_usage has agent_id but no mission_task_id | Cannot compute total cost/tokens/duration for a mission |
| No per-task structured outcome | board_tasks.result is free text | No machine-readable quality score, token spend, or retry count |
| No agent performance attribution | llm_usage.agent_id exists but unlinked to task outcomes | Cannot rank agents by task-type performance |
| No model comparison data | llm_usage.model_id exists but unlinked to quality scores | Cannot answer "Claude Opus or GPT-4 for research?" |
| No human feedback loop closure | votes.is_upvoted (core/models/core.py:1157) captures chat-level thumbs up/down | No mission/task-level acceptance signal |
| Prometheus metrics defined but unwired | 6 counters/histograms in automatos_metrics.py:63-132 | Zero application metrics in Grafana |
| heartbeat_results.cost always 0.0 | Column exists but _store_heartbeat_result() never populates it | Heartbeat costs invisible |
| Context window telemetry ephemeral | ContextResult.token_estimate computed per-request | Never persisted — context optimization is blind |

1.2 Why This Matters Now

PRD-100 Section 3: "No fancy learning engine — just data. Query it for patterns later."

This PRD defines what "just data" means concretely — the schema, capture points, and query patterns that make future optimization possible without building the optimization engine now. Without structured telemetry:

  • Mission Mode ships blind — no way to measure if it works

  • Model routing remains manual forever — no data to automate it

  • Cost optimization is guesswork — no per-task-type cost benchmarks

  • Phase 3 learning foundation (bandit-style selection) has no training signal


2. Prior Art Analysis

2.1 Systems Studied

| System | Key Insight | What We Adopt | What We Reject |
|---|---|---|---|
| MLflow | Three-tier storage: metrics (append-only time-series), params (immutable config), tags (mutable state) | Immutable config vs. mutable outcome separation on mission_tasks | Metric history as separate entity — too much schema for v1; use event log instead |
| Weights & Biases | Every run has summary (final/aggregate for cross-run comparison) and history (per-step time-series). Summary uses configurable aggregation (min, max, mean, last) | Summary columns on mission_tasks + event log for deep analysis = hybrid storage | W&B's custom aggregation DSL — our summary columns are explicit, not computed |
| OpenTelemetry | Trace/span model with gen_ai.* semantic conventions: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens | Attribute naming conventions for event metadata; mission=trace / task=span mental model | OTel transport layer — Postgres event log is simpler and matches our infra |
| Honeycomb | Schema-free columnar storage enables high-cardinality GROUP BY without pre-aggregation. No cardinality explosion because there's no pre-defined rollup | JSONB metadata escape hatch on mission_events for flexible querying via jsonb_path_query | Full columnar engine — Postgres JSONB with GIN indexes is sufficient at our scale |
| Eppo | Outcomes joined to treatments at analysis time via SQL, not stored as FK relations. Attribution window on metric definition, not on event | Store raw events with timestamps; join mission→outcome at query time for flexibility | Complex attribution window DSL — our missions have natural start/end boundaries |
| Deng (Microsoft ExP) | Minimal sufficient statistics: store metric_sum + metric_sum_squares alongside mean for variance computation without re-reading raw data | Aggregation views include sum and sum-of-squares for numeric metrics | Online variance algorithms — batch recomputation from llm_usage is sufficient |

2.2 Key Design Decision

Hybrid storage (Option C from outline): Summary columns on mission_tasks for dashboard queries + separate mission_events append-only table for detailed per-step telemetry.

Rationale:

  • Summary columns enable fast GROUP BY model_id, task_type without JSONB path extraction

  • Event log enables deep debugging, retry analysis, and future ML training

  • Matches the W&B summary/history split — the most proven pattern in ML experiment tracking

  • llm_usage remains the source of truth for per-call data — mission telemetry aggregates it, never duplicates it


3. Telemetry Schema

3.1 Summary Columns on mission_tasks (PRD-101 Extension)

These columns live on the mission_tasks table defined in PRD-101. They are the "W&B summary" — final/aggregate values for fast dashboard queries.

Field sourcing:

| Field | Source | When Set |
|---|---|---|
| model_id | LLMManager.generate_response() → first call determines primary model | Task assignment |
| task_type | Coordinator decomposition (PRD-102) | Task creation |
| tokens_in/out | Running total from UsageTracker.track() | Incremented per LLM call |
| cost_usd | Running total from UsageTracker.track() | Incremented per LLM call |
| duration_ms | completed_at - started_at | Task completion |
| verifier_score | VerificationService.verify() (PRD-103) | Post-execution verification |
| human_accepted | Human review API endpoint | Post-review |
| error_type | Exception handler in AgentFactory.execute_with_prompt() | On failure |
| retry_count | AgentFactory tool loop (core/agents/factory.py) | Incremented per retry |
| context_tokens | ContextService.get_context() return value | Task execution start |
| tools_used | _execute_tool_calls() in AgentFactory | Appended per tool call |
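The summary columns above could take roughly the following shape. This is an illustrative DDL sketch, not the migration shipped with PRD-101 — exact types, defaults, and the array representation for tools_used are assumptions:

```sql
ALTER TABLE mission_tasks
    ADD COLUMN model_id       VARCHAR(100),
    ADD COLUMN task_type      VARCHAR(50),
    ADD COLUMN tokens_in      BIGINT         NOT NULL DEFAULT 0,
    ADD COLUMN tokens_out     BIGINT         NOT NULL DEFAULT 0,
    ADD COLUMN cost_usd       NUMERIC(12, 6) NOT NULL DEFAULT 0,
    ADD COLUMN duration_ms    BIGINT,
    ADD COLUMN verifier_score REAL,                      -- 0.0-1.0 (PRD-103)
    ADD COLUMN human_accepted BOOLEAN,
    ADD COLUMN error_type     VARCHAR(50),
    ADD COLUMN retry_count    INTEGER        NOT NULL DEFAULT 0,
    ADD COLUMN context_tokens INTEGER,
    ADD COLUMN tools_used     TEXT[]         NOT NULL DEFAULT '{}';
```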

3.2 Mission Run Summary Columns (PRD-101 Extension)

These columns live on the mission_runs table. They are aggregate-of-aggregates — computed from mission_tasks rows.

These are maintained via two mechanisms:

  1. Real-time increment: When UsageTracker.track() fires with a mission_task_id, it also increments the parent mission_runs totals

  2. Batch reconciliation: A daily job recomputes all mission_runs summaries from llm_usage WHERE mission_task_id IN (SELECT id FROM mission_tasks WHERE mission_run_id = ?) — catches any drift from failed increments

3.3 Event Log Table: mission_events

The "W&B history" — append-only, per-step telemetry for debugging and future ML training.
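A plausible shape for the table, consistent with Sections 3.4 (event types), 7.1 (monthly partitioning), and 11.2 (agent_id as a plain integer). Column names and index names are assumptions:

```sql
CREATE TABLE mission_events (
    id              BIGSERIAL,
    mission_run_id  BIGINT NOT NULL REFERENCES mission_runs(id) ON DELETE CASCADE,
    mission_task_id BIGINT REFERENCES mission_tasks(id) ON DELETE CASCADE,
    agent_id        INTEGER,                       -- plain integer, not a FK (Section 11.2)
    event_type      VARCHAR(50) NOT NULL,
    attributes      JSONB NOT NULL DEFAULT '{}',
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (id, created_at)                   -- partition key must be part of the PK
) PARTITION BY RANGE (created_at);                 -- monthly partitions (Section 7.1)

CREATE INDEX idx_mission_events_run ON mission_events (mission_run_id, created_at);
CREATE INDEX idx_mission_events_gin ON mission_events USING GIN (attributes);
```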

3.4 Event Types

| event_type | Level | Emitted By | Key Attributes |
|---|---|---|---|
| mission.created | Run | Coordinator | {goal, plan_hash, config} |
| mission.plan_generated | Run | Coordinator | {task_count, estimated_cost, plan_version} |
| mission.replanned | Run | Coordinator | {reason, tasks_added, tasks_removed, plan_version} |
| mission.completed | Run | Coordinator | {status, total_cost, total_duration_ms} |
| task.assigned | Task | Coordinator | {agent_id, model_id, task_type, priority} |
| task.started | Task | AgentFactory | {context_tokens, tools_available} |
| task.llm_call | Task | UsageTracker | {model_id, tokens_in, tokens_out, cost_usd, latency_ms} |
| task.tool_called | Task | UnifiedExecutor | {tool_name, duration_ms, status, cache_hit} |
| task.retry | Task | AgentFactory | {attempt, reason, error_type} |
| task.verified | Task | VerificationService | {score, verdict, verifier_model, verifier_cost} |
| task.completed | Task | AgentFactory | {status, tokens_total, cost_total, duration_ms} |
| task.failed | Task | AgentFactory | {error_type, error_message_truncated, attempt} |
| budget.warning | Run | BudgetManager | {threshold_pct, spent, limit} |
| budget.throttled | Run | BudgetManager | {spent, limit, action} |
| human.reviewed | Task | API endpoint | {accepted, feedback_text_length} |
| human.verdict | Run | API endpoint | {verdict, tasks_accepted, tasks_rejected} |

3.5 llm_usage Extension

Add a nullable FK to link per-call data to mission tasks:
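A sketch of the migration implied by the constraints below (the index name is an assumption):

```sql
ALTER TABLE llm_usage
    ADD COLUMN mission_task_id BIGINT
        REFERENCES mission_tasks(id) ON DELETE SET NULL;

-- Partial index: the thousands of existing non-mission rows stay out of the index entirely.
CREATE INDEX idx_llm_usage_mission_task
    ON llm_usage (mission_task_id)
    WHERE mission_task_id IS NOT NULL;
```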

Critical constraints:

  • Nullable: Thousands of existing rows (chatbot, heartbeat, routing, embedding calls) have no mission context. Non-nullable would require impossible backfill.

  • No backfill: Historical llm_usage rows predate missions. Backfilling would create false attributions. Only new LLM calls made within mission execution get the FK set.

  • ON DELETE SET NULL: If a mission_task is deleted (cascaded from mission_run deletion), the llm_usage rows survive for historical billing queries.

SQLAlchemy model change (core/models/core.py:138):
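A sketch of the corresponding model change. The surrounding model is condensed here for illustration — the real LLMUsage class in core/models/core.py has many more columns:

```python
from sqlalchemy import BigInteger, Column, ForeignKey, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class LLMUsage(Base):
    __tablename__ = "llm_usage"

    id = Column(Integer, primary_key=True)
    # ... existing columns (agent_id, model_id, token counts, total_cost, request_type) ...

    # New: nullable FK to mission_tasks; SET NULL preserves rows for billing history
    # when a mission (and its tasks) is deleted.
    mission_task_id = Column(
        BigInteger,
        ForeignKey("mission_tasks.id", ondelete="SET NULL"),
        nullable=True,
        index=True,
    )
```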

3.6 request_type Extension

The existing request_type column on llm_usage (core/models/core.py:151) is a VARCHAR(50) — not a database enum, so no migration needed to add values. The application code must use two new values:

| New Value | Used By | Purpose |
|---|---|---|
| coordinator | CoordinatorService (PRD-102) planning and monitoring calls | Separate coordination overhead from task execution cost |
| verifier | VerificationService (PRD-103) scoring calls | Separate verification cost from task execution cost |

This enables computing coordination_cost_usd and verification_cost_usd as separate line items on mission_runs via:
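An illustrative query, assuming llm_usage.total_cost (per Section 6.2) and that coordinator and verifier calls carry mission attribution via mission_task_id:

```sql
SELECT
    SUM(total_cost) FILTER (WHERE request_type = 'coordinator') AS coordination_cost_usd,
    SUM(total_cost) FILTER (WHERE request_type = 'verifier')    AS verification_cost_usd
FROM llm_usage u
JOIN mission_tasks mt ON mt.id = u.mission_task_id
WHERE mt.mission_run_id = :mission_run_id;
```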


4. Data Privacy Policy

Rule: Never store prompt content or model outputs in telemetry.

Telemetry captures metadata only. Content belongs in agent reports (PRD-76) and workspace files.

| Stored | NOT Stored |
|---|---|
| Model ID, provider | Prompt text |
| Token counts (in/out) | Model output text |
| Cost in USD | User messages |
| Duration in ms | File contents |
| Tool names | Tool parameter values |
| Error type enum | Full error stack traces |
| Error message (truncated to 500 chars) | PII or IP |
| Verifier score (0.0-1.0) | Verifier reasoning text |
| Success criteria text (from mission definition) | Task output content |

JSONB attributes on mission_events: Must not contain prompt content, model output, or PII. Allowed keys are defined per event_type in Section 3.4. The TelemetryService validates keys against an allowlist before insertion.
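The allowlist check might look like the following minimal sketch. The constant and function names are assumptions, not the TelemetryService's actual API; the allowlist entries mirror Section 3.4:

```python
# Per-event-type allowlists (subset shown); keys mirror Section 3.4.
EVENT_ATTRIBUTE_ALLOWLIST = {
    "task.llm_call": {"model_id", "tokens_in", "tokens_out", "cost_usd", "latency_ms"},
    "task.retry": {"attempt", "reason", "error_type"},
    "human.reviewed": {"accepted", "feedback_text_length"},
}


def validate_attributes(event_type: str, attributes: dict) -> dict:
    """Reject any attribute key not on the allowlist for this event type.

    This is the guard that keeps prompt content, model output, and PII
    out of mission_events.attributes.
    """
    allowed = EVENT_ATTRIBUTE_ALLOWLIST.get(event_type, set())
    unknown = set(attributes) - allowed
    if unknown:
        raise ValueError(f"Disallowed telemetry keys for {event_type}: {sorted(unknown)}")
    return attributes
```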


5. Query Patterns

The schema is designed to answer specific questions. Every query below has been validated against the DDL in Section 3.

5.1 Operational (Day 1)
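The validated queries were not included in this chunk; an illustrative substitute for the operational tier, using only the summary columns from Section 3.1:

```sql
-- (illustrative) live cost and progress for a running mission
SELECT status,
       COUNT(*)         AS tasks,
       SUM(cost_usd)    AS cost_usd,
       SUM(duration_ms) AS duration_ms
FROM mission_tasks
WHERE mission_run_id = :mission_run_id
GROUP BY status;
```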

5.2 Analytical (Week 1+)
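An illustrative analytical query (not one of the originals) showing how quality and retry data roll up by task type:

```sql
-- (illustrative) verifier score, retry rate, and spend by task type, last 7 days
SELECT task_type,
       AVG(verifier_score) AS avg_score,
       AVG(retry_count)    AS avg_retries,
       SUM(cost_usd)       AS total_cost
FROM mission_tasks
WHERE completed_at > now() - interval '7 days'
GROUP BY task_type
ORDER BY total_cost DESC;
```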

5.3 Strategic (Month 1+)
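An illustrative strategic query — the model-comparison question from Section 1.1, answerable once model_id and verifier_score live on the same row:

```sql
-- (illustrative) model comparison per task type: quality vs. cost vs. acceptance
SELECT task_type, model_id,
       COUNT(*)                                            AS n,
       AVG(verifier_score)                                 AS avg_score,
       AVG(cost_usd)                                       AS avg_cost,
       AVG(CASE WHEN human_accepted THEN 1.0 ELSE 0.0 END) AS acceptance_rate
FROM mission_tasks
WHERE status = 'completed'
GROUP BY task_type, model_id
HAVING COUNT(*) >= 20;  -- minimum sample size before trusting the comparison
```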

5.4 Materialized Views for Dashboards

For queries that scan large tables (Q10-Q13), create materialized views refreshed by a daily batch job:
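A sketch of one such view, including the sum-of-squares sufficient statistics adopted from Microsoft ExP (Section 2.1). The view name is an assumption:

```sql
CREATE MATERIALIZED VIEW mv_model_task_outcomes AS
SELECT task_type, model_id,
       COUNT(*)                  AS n,
       SUM(cost_usd)             AS cost_sum,
       SUM(cost_usd * cost_usd)  AS cost_sum_sq,   -- variance without re-reading raw rows
       AVG(verifier_score)       AS avg_score
FROM mission_tasks
WHERE status = 'completed'
GROUP BY task_type, model_id;

-- refreshed by the daily batch job:
REFRESH MATERIALIZED VIEW mv_model_task_outcomes;
```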


6. Capture Points & Data Flow

6.1 Sequence Diagram: Telemetry Capture During Task Execution

6.2 Code Changes Required

| File | Current | Change | Lines |
|---|---|---|---|
| core/llm/usage_tracker.py:21 | track() has no mission context | Add mission_task_id: Optional[int] = None parameter. When set: (1) pass to LLMUsage row, (2) increment mission_tasks.tokens_in/out/cost_usd via UPDATE ... SET tokens_in = tokens_in + :delta | 21-80 |
| core/llm/manager.py | Calls _track_usage() | Pass mission_task_id through the generate_response() → _track_usage() call chain. Source: execution_context dict already threaded through | 643-671 |
| core/agents/factory.py | Tool loop tracks retries | (1) Set mission_task_id on execution context, (2) append to tools_used array, (3) increment retry_count, (4) emit task.started/completed/failed events | Tool loop |
| core/monitoring/automatos_metrics.py:75-132 | Prometheus counters defined, never incremented | Wire into UsageTracker.track(): AGENT_TOKEN_USAGE.labels(agent_id, model, 'input').inc(tokens_in), LLM_REQUEST_DURATION.labels(model, provider).observe(latency_ms/1000) | 75-132 |
| services/heartbeat_service.py:913-942 | _store_heartbeat_result() inserts with cost=0.0 | After insert, query SUM(total_cost) FROM llm_usage WHERE execution_id = :eid and update heartbeat_results.cost | 913-942 |
| api/llm_analytics.py | Usage/cost endpoints only | Add /api/missions/{id}/telemetry and /api/analytics/mission-outcomes endpoints | New |

6.3 UsageTracker Extension
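A condensed sketch of the extension described in Section 6.2 — not the real track() signature, which carries more fields. It shows the two writes: the existing per-call INSERT (now with the FK) plus the new atomic increment of the parent task's running totals:

```python
from typing import Optional


class UsageTracker:
    """Sketch only: real persistence details live in core/llm/usage_tracker.py."""

    def __init__(self, session):
        self.session = session  # SQLAlchemy session, or anything with execute()

    def track(self, *, agent_id: int, model_id: str, tokens_in: int,
              tokens_out: int, cost_usd: float,
              mission_task_id: Optional[int] = None) -> None:
        # 1. Existing behaviour: insert the per-call llm_usage row (now with the FK).
        self.session.execute(
            "INSERT INTO llm_usage (agent_id, model_id, tokens_in, tokens_out,"
            " total_cost, mission_task_id) VALUES (:a, :m, :ti, :to, :c, :mt)",
            {"a": agent_id, "m": model_id, "ti": tokens_in, "to": tokens_out,
             "c": cost_usd, "mt": mission_task_id},
        )
        # 2. New behaviour: relative UPDATE so concurrent calls don't lose increments.
        if mission_task_id is not None:
            self.session.execute(
                "UPDATE mission_tasks SET tokens_in = tokens_in + :ti,"
                " tokens_out = tokens_out + :to, cost_usd = cost_usd + :c"
                " WHERE id = :mt",
                {"ti": tokens_in, "to": tokens_out, "c": cost_usd,
                 "mt": mission_task_id},
            )
```

Non-mission callers (chatbot, heartbeat, embeddings) simply omit mission_task_id and behave exactly as before.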

6.4 TelemetryService

New service — single write path for all mission_events rows:
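A minimal sketch of what that write path could look like; the class and method names are assumptions. Attribute validation against the Section 4 allowlist would sit inside emit() before the INSERT:

```python
import json
from datetime import datetime, timezone


class TelemetryService:
    """Single write path for mission_events (sketch; column names follow Section 3.3)."""

    def __init__(self, session):
        self.session = session

    def emit(self, *, event_type: str, mission_run_id: int,
             mission_task_id=None, agent_id=None, attributes=None) -> None:
        # Allowlist validation (Section 4) would run here before the insert.
        self.session.execute(
            "INSERT INTO mission_events"
            " (event_type, mission_run_id, mission_task_id, agent_id, attributes, created_at)"
            " VALUES (:et, :run, :task, :agent, :attrs, :ts)",
            {
                "et": event_type,
                "run": mission_run_id,
                "task": mission_task_id,
                "agent": agent_id,  # plain integer, not a FK (Section 11.2)
                "attrs": json.dumps(attributes or {}),
                "ts": datetime.now(timezone.utc).isoformat(),
            },
        )
```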


7. Retention Policy

7.1 Three-Tier Retention

| Tier | Data | Retention | Rationale |
|---|---|---|---|
| Permanent | mission_runs summary columns, mission_tasks summary columns | Forever | Tiny rows (~500 bytes each). Essential for long-term analytics. Storage cost negligible |
| Hot | mission_events rows | 90 days in Postgres | Event log rows are larger (~200 bytes each). 1000 events/day × 90 days = ~18MB — manageable but grows linearly with mission volume |
| Cold archive | mission_events rows older than 90 days | S3 archive (s3://automatos-ai/telemetry/archive/) | JSONL export before partition drop. Available for future ML training or forensic analysis |

7.2 Retention Automation
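The automation details were not included in this chunk. Assuming monthly range partitions on created_at (Section 7.1), the expiry step for one partition could look like this — the partition naming scheme is an assumption:

```sql
-- 1. Export the expiring partition to S3 as JSONL (via \copy or an ETL task), then:
-- 2. Detach and drop it. DETACH keeps the operation cheap and reversible until DROP.
ALTER TABLE mission_events DETACH PARTITION mission_events_2026_01;
DROP TABLE mission_events_2026_01;
```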


8. Prometheus Metrics Wiring

8.1 Current State

Six counters/histograms are defined in core/monitoring/automatos_metrics.py:63-132 but never incremented anywhere in the codebase:

| Metric | Defined At | Status |
|---|---|---|
| automatos_agent_heartbeat_total | Line 63 | Unwired |
| automatos_agent_heartbeat_duration_seconds | Line 69 | Unwired |
| automatos_agent_token_usage_total | Line 75 | Unwired |
| automatos_active_agents | Line 81 | Unwired |
| automatos_llm_request_duration_seconds | Line 121 | Unwired |
| automatos_llm_tokens_total | Line 128 | Unwired |

8.2 Wiring Plan

| Metric | Wire Into | How |
|---|---|---|
| agent_heartbeat_total | heartbeat_service.py after each tick | .labels(agent_id=str(id), status=status).inc() |
| agent_heartbeat_duration_seconds | heartbeat_service.py around tick execution | .labels(agent_id=str(id)).observe(duration) |
| agent_token_usage_total | usage_tracker.py:track() (see Section 6.3) | .labels(agent_id, model, direction).inc(tokens) |
| active_agents | heartbeat_service.py at service start | .set(count_of_active_agents) |
| llm_request_duration_seconds | usage_tracker.py:track() (see Section 6.3) | .labels(model, provider).observe(latency/1000) |
| llm_tokens_total | usage_tracker.py:track() (see Section 6.3) | .labels(model, provider, direction).inc(tokens) |

8.3 New Mission-Specific Metrics
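The definitions were not included in this chunk. A hypothetical pair, in the style of the existing automatos_metrics.py definitions — the metric names, label sets, and buckets below are assumptions, not shipped code:

```python
from prometheus_client import Counter, Histogram

# Hypothetical: terminal task outcomes by type and status.
MISSION_TASKS_TOTAL = Counter(
    "automatos_mission_tasks_total",
    "Mission tasks by terminal status",
    ["task_type", "status"],
)

# Hypothetical: per-task spend distribution, for cost-regression alerting.
MISSION_TASK_COST_USD = Histogram(
    "automatos_mission_task_cost_usd",
    "Per-task cost in USD",
    ["task_type", "model_id"],
    buckets=(0.01, 0.05, 0.1, 0.5, 1.0, 5.0),
)

# Example increments at task completion:
MISSION_TASKS_TOTAL.labels(task_type="research", status="completed").inc()
MISSION_TASK_COST_USD.labels(task_type="research", model_id="claude-opus").observe(0.42)
```

Cardinality stays bounded for the same reason as in Section 13 risk #8: task_type and model_id are both small, closed sets.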


9. Heartbeat Cost Fix

9.1 Current Bug

heartbeat_service.py:913-942 inserts heartbeat_results rows but never populates the cost column — it's always 0.0.

The cost column exists on heartbeat_results, but it never appears in the raw SQL INSERT's column list — every row falls back to the column default (likely 0.0 or NULL).

9.2 Fix

After the heartbeat LLM call completes, query the llm_usage table for the cost of the call just made (using execution_id correlation), then include cost in the INSERT:
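A sketch of the correlation query (assuming llm_usage.total_cost, as in Section 6.2). The returned sum is then bound as :cost in the heartbeat_results INSERT instead of letting the column default apply:

```sql
SELECT COALESCE(SUM(total_cost), 0.0) AS cost
FROM llm_usage
WHERE execution_id = :eid;
```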


10. API Endpoints

10.1 Mission Telemetry

Response:
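The response body was not included in this chunk. An illustrative shape, assembled from the Section 3.1/3.2 summary columns — field names and values are assumptions:

```json
{
  "mission_run_id": 42,
  "status": "completed",
  "total_cost_usd": 1.87,
  "total_tokens_in": 153200,
  "total_tokens_out": 48100,
  "duration_ms": 412000,
  "tasks": [
    {
      "task_id": 7,
      "task_type": "research",
      "model_id": "claude-opus",
      "cost_usd": 0.42,
      "verifier_score": 0.91,
      "retry_count": 1,
      "human_accepted": true
    }
  ]
}
```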

10.2 Mission Outcome Analytics

Response:
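Again the body was not included; an illustrative shape mirroring the Section 5.3 model-comparison rollup — field names and values are assumptions:

```json
{
  "window_days": 30,
  "outcomes": [
    {
      "task_type": "research",
      "model_id": "claude-opus",
      "task_count": 128,
      "avg_verifier_score": 0.87,
      "avg_cost_usd": 0.38,
      "acceptance_rate": 0.92
    }
  ]
}
```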

10.3 Human Review Endpoints
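The endpoint definitions were not included in this chunk. A plausible pair, consistent with the human.reviewed (task-level) and human.verdict (run-level) events in Section 3.4 — paths and body fields are assumptions:

```
POST /api/missions/{mission_id}/tasks/{task_id}/review
  body: {"accepted": true, "feedback": "optional free text"}

POST /api/missions/{mission_id}/verdict
  body: {"verdict": "accepted"}
```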

Both endpoints emit human.reviewed / human.verdict telemetry events.


11. Cross-PRD Integration

11.1 Dependencies

| PRD | Integration Point | Direction |
|---|---|---|
| PRD-101 | mission_runs and mission_tasks tables receive summary columns | PRD-106 extends PRD-101 schema |
| PRD-102 | Coordinator emits mission.created/plan_generated/replanned events, uses request_type='coordinator' | PRD-102 writes → PRD-106 captures |
| PRD-103 | VerificationService sets verifier_score, emits task.verified, uses request_type='verifier' | PRD-103 writes → PRD-106 captures |
| PRD-104 | Ephemeral contractors must log agent_id before cleanup. Contractor LLM calls must include mission_task_id | PRD-104 must preserve telemetry before destroying contractor |
| PRD-105 | Budget enforcement uses real-time cost_usd from same increment path. budget.warning/throttled events | Shared write path — PRD-105 reads what PRD-106 writes |
| PRD-107 | Context interface exposes context_tokens and sections_trimmed for telemetry capture | PRD-107 provides → PRD-106 persists |

11.2 Telemetry for Ephemeral Contractors (PRD-104 Constraint)

Ephemeral contractors are destroyed after task completion. The telemetry must be captured before cleanup:

  1. All LLM calls during contractor execution include mission_task_id → llm_usage rows survive contractor deletion

  2. mission_tasks.agent_id is set at assignment time → survives contractor deletion (FK is nullable, not cascading)

  3. mission_events reference agent_id as an integer field, not a FK → no referential integrity issue when contractor is cleaned up

11.3 Propensity Logging (Future-Proofing)

The schema supports future action_probability without migration:

The attributes JSONB field accepts any key in the task.assigned allowlist. Adding action_probability to the allowlist is a one-line code change, not a migration.
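A sketch of that one-line change, reusing the allowlist pattern from Section 4 (the constant name is an assumption):

```python
# Hypothetical allowlist for task.assigned, per Section 3.4:
EVENT_ATTRIBUTE_ALLOWLIST = {
    "task.assigned": {"agent_id", "model_id", "task_type", "priority"},
}

# The one-line future change — no schema migration, just a new permitted key:
EVENT_ATTRIBUTE_ALLOWLIST["task.assigned"].add("action_probability")

# The event payload is plain JSONB, so the propensity rides along for free:
event_attributes = {
    "agent_id": 3,
    "model_id": "claude-opus",
    "task_type": "research",
    "action_probability": 0.8,  # probability the router chose this model
}
```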


12. Batch Reconciliation

12.1 Purpose

Real-time increments in UsageTracker.track() can drift from truth if:

  • A DB session fails after INSERT but before COMMIT

  • The UPDATE mission_tasks SET tokens_in = tokens_in + :delta races with a concurrent update

  • A retry loop double-counts

12.2 Reconciliation Job

Run daily as part of task_reconciler.py:
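The job body was not included in this chunk. A sketch of the core UPDATE, recomputing summaries from the llm_usage source of truth — the token column names on llm_usage are assumptions:

```sql
UPDATE mission_tasks mt
SET tokens_in  = src.tokens_in,
    tokens_out = src.tokens_out,
    cost_usd   = src.cost
FROM (
    SELECT mission_task_id,
           SUM(tokens_in)  AS tokens_in,
           SUM(tokens_out) AS tokens_out,
           SUM(total_cost) AS cost
    FROM llm_usage
    WHERE mission_task_id IS NOT NULL
    GROUP BY mission_task_id
) src
WHERE mt.id = src.mission_task_id
  AND (mt.tokens_in  IS DISTINCT FROM src.tokens_in
    OR mt.tokens_out IS DISTINCT FROM src.tokens_out
    OR mt.cost_usd   IS DISTINCT FROM src.cost);  -- only touch drifted rows
```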


13. Risk Register

| # | Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|---|
| 1 | Event log volume overwhelms Postgres | High | Medium | Monthly partitioning + 90-day retention + S3 archive. At 1000 events/day, 90 days = 90K rows (~18MB) — well within Postgres comfort zone |
| 2 | Telemetry write overhead slows LLM call path | High | Low | UsageTracker already writes in a separate DB session. mission_task_id is one extra field on existing INSERT + one UPDATE. No new network call |
| 3 | Schema stores too little — can't answer questions not yet thought of | Medium | Medium | Hybrid approach (summary + events + JSONB attributes) hedges this. attributes is the escape hatch for ad-hoc data |
| 4 | GDPR/privacy concerns with outcome data | Medium | Low | Strict "no content" policy (Section 4). Telemetry is metadata only. No prompt text, no model output, no PII |
| 5 | Premature optimization — building learning engine too early | Medium | High | PRD-100 explicitly forbids this. PRD-106 captures data and defines queries. It does NOT build recommendation or bandit algorithms |
| 6 | llm_usage table is write-hot — adding FK + index may slow writes | Medium | Medium | Nullable FK, index only on non-null values (WHERE mission_task_id IS NOT NULL). Existing queries are unaffected — they never filter by mission_task_id |
| 7 | Real-time increment drifts from raw data | Medium | Medium | Daily batch reconciliation (Section 12) recomputes from llm_usage source of truth |
| 8 | Prometheus cardinality explosion from agent_id × model labels | Medium | Low | Agent label uses string ID (bounded by roster size ~20). Model label uses model_id (bounded by installed models ~10). Max cardinality: 200 series |


14. Acceptance Criteria

Must Have

Should Have

Nice to Have


Appendix A: SQLAlchemy Model for mission_events
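The model body was not included in this chunk. A sketch consistent with the event log design in Sections 3.3 and 11.2 — field names and types are assumptions, not the shipped model:

```python
from datetime import datetime, timezone

from sqlalchemy import BigInteger, Column, DateTime, ForeignKey, Integer, String
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class MissionEvent(Base):
    """Append-only event log row (Section 3.3). Sketch; names are assumptions."""

    __tablename__ = "mission_events"

    id = Column(BigInteger, primary_key=True)
    mission_run_id = Column(
        BigInteger, ForeignKey("mission_runs.id", ondelete="CASCADE"), nullable=False
    )
    mission_task_id = Column(
        BigInteger, ForeignKey("mission_tasks.id", ondelete="CASCADE"), nullable=True
    )
    agent_id = Column(Integer, nullable=True)  # plain integer, not a FK (Section 11.2)
    event_type = Column(String(50), nullable=False)
    attributes = Column(JSONB, nullable=False, default=dict)
    created_at = Column(
        DateTime(timezone=True),
        nullable=False,
        default=lambda: datetime.now(timezone.utc),
    )
```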

Appendix B: Research Sources

| Source | What It Informed |
|---|---|
| MLflow entity model (mlflow/mlflow) | Three-tier storage: metrics (append-only) vs params (immutable) vs tags (mutable) |
| W&B summary/history split (wandb/wandb) | Summary columns for dashboards + event log for deep analysis = hybrid approach |
| OpenTelemetry trace/span model (opentelemetry.io) | Mission=trace / task=span mental model; gen_ai.* attribute naming conventions |
| Honeycomb high-cardinality querying (docs.honeycomb.io) | JSONB metadata for flexible GROUP BY; no pre-aggregation needed at our scale |
| Eppo assignment/metric model (docs.geteppo.com) | Attribution windows, raw event storage, join at analysis time |
| Deng, Microsoft ExP | Sufficient statistics (sum + sum_sq) for variance without raw data re-read |
| Eugene Yan, counterfactual evaluation | Propensity logging (action_probability) for future offline model evaluation |
| Automatos codebase audit | llm_usage, heartbeat_results, tool_execution_logs, votes, automatos_metrics.py — 7+ existing telemetry touchpoints identified |
