PRD-106 — Outcome Telemetry & Learning Foundation
Version: 1.0
Type: Research + Design
Status: Complete — Ready for Peer Review
Priority: P1
Dependencies: PRD-101 (Mission Schema), PRD-103 (Verification & Quality), PRD-105 (Budget & Governance)
Author: Gerard Kavanagh + Claude
Date: 2026-03-15
1. Problem Statement
1.1 The Gap
Automatos captures per-call LLM usage (llm_usage table — core/models/core.py:138) and stores JSONB findings in heartbeat results, but nothing correlates a multi-step mission to its aggregate cost, duration, quality, or human acceptance. The platform records individual API calls — it never answers "which model performs best for research tasks?" or "what's the average cost of a compliance mission?"
| Gap | Evidence | Impact |
|---|---|---|
| No mission-level outcome record | llm_usage has agent_id but no mission_task_id | Cannot compute total cost/tokens/duration for a mission |
| No per-task structured outcome | board_tasks.result is free text | No machine-readable quality score, token spend, or retry count |
| No agent performance attribution | llm_usage.agent_id exists but unlinked to task outcomes | Cannot rank agents by task-type performance |
| No model comparison data | llm_usage.model_id exists but unlinked to quality scores | Cannot answer "Claude Opus or GPT-4 for research?" |
| No human feedback loop closure | votes.is_upvoted (core/models/core.py:1157) captures chat-level thumbs up/down | No mission/task-level acceptance signal |
| Prometheus metrics defined but unwired | 6 counters/histograms in automatos_metrics.py:63-132 | Zero application metrics in Grafana |
| heartbeat_results.cost always 0.0 | Column exists but _store_heartbeat_result() never populates it | Heartbeat costs invisible |
| Context window telemetry ephemeral | ContextResult.token_estimate computed per-request | Never persisted — context optimization is blind |
1.2 Why This Matters Now
PRD-100 Section 3: "No fancy learning engine — just data. Query it for patterns later."
This PRD defines what "just data" means concretely — the schema, capture points, and query patterns that make future optimization possible without building the optimization engine now. Without structured telemetry:
- Mission Mode ships blind — no way to measure if it works
- Model routing remains manual forever — no data to automate it
- Cost optimization is guesswork — no per-task-type cost benchmarks
- Phase 3 learning foundation (bandit-style selection) has no training signal
2. Prior Art Analysis
2.1 Systems Studied
| System | Core pattern | What we adopt | What we skip |
|---|---|---|---|
| MLflow | Three-tier storage: metrics (append-only time-series), params (immutable config), tags (mutable state) | Immutable config vs. mutable outcome separation on mission_tasks | Metric history as a separate entity — too much schema for v1; use the event log instead |
| Weights & Biases | Every run has a summary (final/aggregate for cross-run comparison) and a history (per-step time-series); summary uses configurable aggregation (min, max, mean, last) | Summary columns on mission_tasks + event log for deep analysis = hybrid storage | W&B's custom aggregation DSL — our summary columns are explicit, not computed |
| OpenTelemetry | Trace/span model with gen_ai.* semantic conventions: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens | Attribute naming conventions for event metadata; mission=trace / task=span mental model | OTel transport layer — a Postgres event log is simpler and matches our infra |
| Honeycomb | Schema-free columnar storage enables high-cardinality GROUP BY without pre-aggregation; no cardinality explosion because there is no pre-defined rollup | JSONB metadata escape hatch on mission_events for flexible querying via jsonb_path_query | Full columnar engine — Postgres JSONB with GIN indexes is sufficient at our scale |
| Eppo | Outcomes joined to treatments at analysis time via SQL, not stored as FK relations; attribution window lives on the metric definition, not the event | Store raw events with timestamps; join mission→outcome at query time for flexibility | Complex attribution-window DSL — our missions have natural start/end boundaries |
| Deng (Microsoft ExP) | Minimal sufficient statistics: store metric_sum + metric_sum_squares alongside the mean for variance computation without re-reading raw data | Aggregation views include sum and sum-of-squares for numeric metrics | Online variance algorithms — batch recomputation from llm_usage is sufficient |
2.2 Key Design Decision
Hybrid storage (Option C from outline): Summary columns on mission_tasks for dashboard queries + separate mission_events append-only table for detailed per-step telemetry.
Rationale:
- Summary columns enable fast GROUP BY model_id, task_type without JSONB path extraction
- Event log enables deep debugging, retry analysis, and future ML training
- Matches the W&B summary/history split — the most proven pattern in ML experiment tracking
- llm_usage remains the source of truth for per-call data — mission telemetry aggregates it, never duplicates it
3. Telemetry Schema
3.1 Summary Columns on mission_tasks (PRD-101 Extension)
These columns live on the mission_tasks table defined in PRD-101. They are the "W&B summary" — final/aggregate values for fast dashboard queries.
Field sourcing:

| Field | Source | When set |
|---|---|---|
| model_id | LLMManager.generate_response() → first call determines primary model | Task assignment |
| task_type | Coordinator decomposition (PRD-102) | Task creation |
| tokens_in/out | Running total from UsageTracker.track() | Incremented per LLM call |
| cost_usd | Running total from UsageTracker.track() | Incremented per LLM call |
| duration_ms | completed_at - started_at | Task completion |
| verifier_score | VerificationService.verify() (PRD-103) | Post-execution verification |
| human_accepted | Human review API endpoint | Post-review |
| error_type | Exception handler in AgentFactory.execute_with_prompt() | On failure |
| retry_count | AgentFactory tool loop (core/agents/factory.py) | Incremented per retry |
| context_tokens | ContextService.get_context() return value | Task execution start |
| tools_used | _execute_tool_calls() in AgentFactory | Appended per tool call |
3.2 Mission Run Summary Columns (PRD-101 Extension)
These columns live on the mission_runs table. They are aggregate-of-aggregates — computed from mission_tasks rows.
These are maintained via two mechanisms:
- Real-time increment: when UsageTracker.track() fires with a mission_task_id, it also increments the parent mission_runs totals.
- Batch reconciliation: a daily job recomputes all mission_runs summaries from llm_usage WHERE mission_task_id IN (SELECT id FROM mission_tasks WHERE mission_run_id = ?) — catching any drift from failed increments.
3.3 Event Log Table: mission_events
The "W&B history" — append-only, per-step telemetry for debugging and future ML training.
3.4 Event Types
| Event type | Scope | Emitter | Attributes |
|---|---|---|---|
| mission.created | Run | Coordinator | {goal, plan_hash, config} |
| mission.plan_generated | Run | Coordinator | {task_count, estimated_cost, plan_version} |
| mission.replanned | Run | Coordinator | {reason, tasks_added, tasks_removed, plan_version} |
| mission.completed | Run | Coordinator | {status, total_cost, total_duration_ms} |
| task.assigned | Task | Coordinator | {agent_id, model_id, task_type, priority} |
| task.started | Task | AgentFactory | {context_tokens, tools_available} |
| task.llm_call | Task | UsageTracker | {model_id, tokens_in, tokens_out, cost_usd, latency_ms} |
| task.tool_called | Task | UnifiedExecutor | {tool_name, duration_ms, status, cache_hit} |
| task.retry | Task | AgentFactory | {attempt, reason, error_type} |
| task.verified | Task | VerificationService | {score, verdict, verifier_model, verifier_cost} |
| task.completed | Task | AgentFactory | {status, tokens_total, cost_total, duration_ms} |
| task.failed | Task | AgentFactory | {error_type, error_message_truncated, attempt} |
| budget.warning | Run | BudgetManager | {threshold_pct, spent, limit} |
| budget.throttled | Run | BudgetManager | {spent, limit, action} |
| human.reviewed | Task | API endpoint | {accepted, feedback_text_length} |
| human.verdict | Run | API endpoint | {verdict, tasks_accepted, tasks_rejected} |
3.5 llm_usage Extension
Add a nullable FK to link per-call data to mission tasks:
Critical constraints:
- Nullable: thousands of existing rows (chatbot, heartbeat, routing, embedding calls) have no mission context. Non-nullable would require an impossible backfill.
- No backfill: historical llm_usage rows predate missions. Backfilling would create false attributions. Only new LLM calls made within mission execution get the FK set.
- ON DELETE SET NULL: if a mission_task is deleted (cascaded from mission_run deletion), the llm_usage rows survive for historical billing queries.
SQLAlchemy model change (core/models/core.py:138):
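The code excerpt for this change is not reproduced in this extract. A sketch of what the addition plausibly looks like, illustrated on a minimal declarative model — everything except the mission_task_id column is a stand-in for the real core.py classes:

```python
# Sketch of the nullable FK addition on LLMUsage. Only mission_task_id is the
# PRD-106 change; the surrounding models are simplified stand-ins.
from sqlalchemy import Column, ForeignKey, Integer, create_engine, inspect
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class MissionTask(Base):
    __tablename__ = "mission_tasks"
    id = Column(Integer, primary_key=True)

class LLMUsage(Base):
    __tablename__ = "llm_usage"
    id = Column(Integer, primary_key=True)
    # PRD-106 addition: nullable so existing non-mission rows stay valid,
    # ON DELETE SET NULL so usage rows survive mission deletion for billing.
    mission_task_id = Column(
        Integer,
        ForeignKey("mission_tasks.id", ondelete="SET NULL"),
        nullable=True,
        index=True,
    )

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
cols = {c["name"] for c in inspect(engine).get_columns("llm_usage")}
```

In production this would ship as an Alembic migration against Postgres, ideally with a partial index (WHERE mission_task_id IS NOT NULL) per the risk register.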
3.6 request_type Extension
The existing request_type column on llm_usage (core/models/core.py:151) is a VARCHAR(50) — not a database enum, so no migration is needed to add values. The application code must use two new values:
| Value | Emitter | Purpose |
|---|---|---|
| coordinator | CoordinatorService (PRD-102) planning and monitoring calls | Separate coordination overhead from task execution cost |
| verifier | VerificationService (PRD-103) scoring calls | Separate verification cost from task execution cost |
This enables computing coordination_cost_usd and verification_cost_usd as separate line items on mission_runs via:
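The query itself is not reproduced in this extract. A sketch of the line-item split, run against SQLite stand-ins for the real tables — the 'agent' request_type value and the column subset are illustrative:

```python
# Sketch: split coordination and verification spend out of a mission run's
# total cost, keyed on request_type. Tables are simplified stand-ins.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE mission_tasks (id INTEGER PRIMARY KEY, mission_run_id INTEGER);
CREATE TABLE llm_usage (
    id INTEGER PRIMARY KEY,
    mission_task_id INTEGER,
    request_type TEXT,
    total_cost REAL
);
INSERT INTO mission_tasks VALUES (1, 10), (2, 10);
INSERT INTO llm_usage (mission_task_id, request_type, total_cost) VALUES
    (1, 'coordinator', 0.02),
    (1, 'agent',       0.50),
    (2, 'verifier',    0.05),
    (2, 'agent',       0.40);
""")

row = con.execute("""
    SELECT
        SUM(CASE WHEN u.request_type = 'coordinator' THEN u.total_cost ELSE 0 END),
        SUM(CASE WHEN u.request_type = 'verifier'    THEN u.total_cost ELSE 0 END)
    FROM llm_usage u
    JOIN mission_tasks t ON t.id = u.mission_task_id
    WHERE t.mission_run_id = 10
""").fetchone()
coordination_cost_usd, verification_cost_usd = row
```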
4. Data Privacy Policy
Rule: Never store prompt content or model outputs in telemetry.
Telemetry captures metadata only. Content belongs in agent reports (PRD-76) and workspace files.
| Stored (metadata) | Never stored (content) |
|---|---|
| Model ID, provider | Prompt text |
| Token counts (in/out) | Model output text |
| Cost in USD | User messages |
| Duration in ms | File contents |
| Tool names | Tool parameter values |
| Error type enum | Full error stack traces |
| Error message (truncated to 500 chars) | PII or IP |
| Verifier score (0.0-1.0) | Verifier reasoning text |
| Success criteria text (from mission definition) | Task output content |
JSONB attributes on mission_events: Must not contain prompt content, model output, or PII. Allowed keys are defined per event_type in Section 3.4. The TelemetryService validates keys against an allowlist before insertion.
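A minimal sketch of that allowlist check, assuming a module-level constant and a free function rather than the real TelemetryService internals (the two event types shown are taken from Section 3.4):

```python
# Sketch of per-event-type attribute validation. The constant and function
# names are illustrative, not the real TelemetryService API.
EVENT_ATTRIBUTE_ALLOWLIST = {
    "task.llm_call": {"model_id", "tokens_in", "tokens_out", "cost_usd", "latency_ms"},
    "task.retry": {"attempt", "reason", "error_type"},
}

def validate_attributes(event_type: str, attributes: dict) -> None:
    """Reject unknown event types and any key outside the per-type allowlist."""
    allowed = EVENT_ATTRIBUTE_ALLOWLIST.get(event_type)
    if allowed is None:
        raise ValueError(f"unknown event_type: {event_type}")
    extra = set(attributes) - allowed
    if extra:
        raise ValueError(f"disallowed attribute keys for {event_type}: {sorted(extra)}")

validate_attributes("task.retry", {"attempt": 2, "reason": "timeout", "error_type": "llm_timeout"})
```

A key-level allowlist cannot prove a value contains no content, but combined with the "metadata only" attribute sets above it makes accidental prompt/output leakage a code-review-visible change rather than a silent one.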
5. Query Patterns
The schema is designed to answer specific questions. Every query below has been validated against the DDL in Section 3.
5.1 Operational (Day 1)
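The operational query list itself is not reproduced in this extract. As a representative sketch of the kind of question the summary columns are built for — fast GROUP BY model_id, task_type with no JSONB extraction — using invented data in SQLite:

```python
# Sketch: cost and quality per model/task-type straight off the summary
# columns. Column subset and data are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE mission_tasks (
    id INTEGER PRIMARY KEY,
    mission_run_id INTEGER,
    task_type TEXT,
    model_id TEXT,
    cost_usd REAL,
    duration_ms INTEGER,
    verifier_score REAL
);
INSERT INTO mission_tasks VALUES
    (1, 10, 'research',  'claude-opus', 0.60, 42000, 0.9),
    (2, 10, 'research',  'gpt-4',       0.35, 30000, 0.7),
    (3, 11, 'summarize', 'claude-opus', 0.10,  8000, 0.8);
""")
rows = con.execute("""
    SELECT model_id,
           task_type,
           COUNT(*)                      AS tasks,
           ROUND(SUM(cost_usd), 2)       AS total_cost,
           ROUND(AVG(verifier_score), 2) AS avg_score
    FROM mission_tasks
    GROUP BY model_id, task_type
    ORDER BY model_id, task_type
""").fetchall()
```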
5.2 Analytical (Week 1+)
5.3 Strategic (Month 1+)
5.4 Materialized Views for Dashboards
For queries that scan large tables (Q10-Q13), create materialized views refreshed by a daily batch job:
6. Capture Points & Data Flow
6.1 Sequence Diagram: Telemetry Capture During Task Execution
6.2 Code Changes Required
| File | Current state | Required change | Lines |
|---|---|---|---|
| core/llm/usage_tracker.py | track() has no mission context | Add mission_task_id: Optional[int] = None parameter. When set: (1) pass to the LLMUsage row, (2) increment mission_tasks.tokens_in/out/cost_usd via UPDATE ... SET tokens_in = tokens_in + :delta | 21-80 |
| core/llm/manager.py | Calls _track_usage() | Pass mission_task_id through the generate_response() → _track_usage() call chain. Source: the execution_context dict already threaded through | 643-671 |
| core/agents/factory.py | Tool loop tracks retries | (1) Set mission_task_id on the execution context, (2) append to the tools_used array, (3) increment retry_count, (4) emit task.started/completed/failed events | Tool loop |
| core/monitoring/automatos_metrics.py | Prometheus counters defined, never incremented | Wire into UsageTracker.track(): AGENT_TOKEN_USAGE.labels(agent_id, model, 'input').inc(tokens_in), LLM_REQUEST_DURATION.labels(model, provider).observe(latency_ms/1000) | 75-132 |
| services/heartbeat_service.py | _store_heartbeat_result() inserts with cost=0.0 | After insert, query SUM(total_cost) FROM llm_usage WHERE execution_id = :eid and update heartbeat_results.cost | 913-942 |
| api/llm_analytics.py | Usage/cost endpoints only | Add /api/missions/{id}/telemetry and /api/analytics/mission-outcomes endpoints | New |
6.3 UsageTracker Extension
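The extension body is not reproduced in this extract. A plausible sketch of the two writes track() would perform when mission_task_id is set, with SQLite standing in for the real SQLAlchemy session (schemas simplified):

```python
# Sketch of the extended track(): (1) the per-call llm_usage row, as before,
# now carrying the nullable FK; (2) a real-time increment of the task's
# summary columns. Table shapes are simplified stand-ins.
import sqlite3
from typing import Optional

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE mission_tasks (
    id INTEGER PRIMARY KEY, tokens_in INTEGER DEFAULT 0,
    tokens_out INTEGER DEFAULT 0, cost_usd REAL DEFAULT 0.0
);
CREATE TABLE llm_usage (
    id INTEGER PRIMARY KEY, model_id TEXT, tokens_in INTEGER,
    tokens_out INTEGER, total_cost REAL, mission_task_id INTEGER
);
INSERT INTO mission_tasks (id) VALUES (1);
""")

def track(model_id: str, tokens_in: int, tokens_out: int, cost: float,
          mission_task_id: Optional[int] = None) -> None:
    con.execute(
        "INSERT INTO llm_usage (model_id, tokens_in, tokens_out, total_cost, mission_task_id) "
        "VALUES (?, ?, ?, ?, ?)",
        (model_id, tokens_in, tokens_out, cost, mission_task_id),
    )
    if mission_task_id is not None:
        # Relative UPDATE (col = col + delta) so concurrent calls don't clobber
        con.execute(
            "UPDATE mission_tasks SET tokens_in = tokens_in + ?, "
            "tokens_out = tokens_out + ?, cost_usd = cost_usd + ? WHERE id = ?",
            (tokens_in, tokens_out, cost, mission_task_id),
        )
    con.commit()

track("claude-opus", 1200, 300, 0.05, mission_task_id=1)
track("claude-opus", 800, 200, 0.03, mission_task_id=1)
summary = con.execute(
    "SELECT tokens_in, tokens_out, cost_usd FROM mission_tasks WHERE id = 1"
).fetchone()
```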
6.4 TelemetryService
New service — single write path for all mission_events rows:
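The service body is not reproduced in this extract. A sketch of what that single write path could look like — an append-only insert guarded by the Section 4 key allowlist (SQLite stands in for Postgres; the allowlist is abbreviated to one event type, and emit() is an assumed method name):

```python
# Sketch of a TelemetryService-style emit(): validate attribute keys, then
# append one mission_events row. Names and schema are illustrative.
import json
import sqlite3

ALLOWLIST = {"task.retry": {"attempt", "reason", "error_type"}}

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE mission_events (
    id INTEGER PRIMARY KEY,
    mission_run_id INTEGER NOT NULL,
    mission_task_id INTEGER,
    event_type TEXT NOT NULL,
    attributes TEXT NOT NULL,          -- JSONB in Postgres
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
)
""")

def emit(mission_run_id, event_type, attributes, mission_task_id=None):
    """Append-only insert; rejects attribute keys outside the allowlist."""
    allowed = ALLOWLIST.get(event_type, set())
    if set(attributes) - allowed:
        raise ValueError(f"disallowed attributes for {event_type}")
    con.execute(
        "INSERT INTO mission_events (mission_run_id, mission_task_id, event_type, attributes) "
        "VALUES (?, ?, ?, ?)",
        (mission_run_id, mission_task_id, event_type, json.dumps(attributes)),
    )
    con.commit()

emit(10, "task.retry",
     {"attempt": 2, "reason": "timeout", "error_type": "llm_timeout"},
     mission_task_id=1)
stored = json.loads(
    con.execute("SELECT attributes FROM mission_events").fetchone()[0])
```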
7. Retention Policy
7.1 Three-Tier Retention
| Tier | Data | Retention | Notes |
|---|---|---|---|
| Permanent | mission_runs and mission_tasks summary columns | Forever | Tiny rows (~500 bytes each). Essential for long-term analytics; storage cost negligible |
| Hot | mission_events rows | 90 days in Postgres | Event rows are small (~200 bytes each) but numerous: 1000 events/day × 90 days ≈ 18MB — manageable, but grows linearly with mission volume |
| Cold archive | mission_events rows older than 90 days | S3 archive (s3://automatos-ai/telemetry/archive/) | JSONL export before partition drop. Available for future ML training or forensic analysis |
7.2 Retention Automation
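The automation details are not reproduced in this extract. A sketch of the export-then-delete step under stated assumptions — io.StringIO stands in for the S3 JSONL object the real job would write under s3://automatos-ai/telemetry/archive/, and the cutoff handling is simplified to ISO-string comparison:

```python
# Sketch: archive mission_events older than the cutoff as JSONL, then delete.
# The real job would drop whole monthly partitions rather than DELETE rows.
import io
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mission_events (id INTEGER PRIMARY KEY, event_type TEXT, created_at TEXT)")
con.executemany(
    "INSERT INTO mission_events (event_type, created_at) VALUES (?, ?)",
    [("task.started",   "2025-11-01T00:00:00"),   # older than the cutoff
     ("task.completed", "2026-03-14T00:00:00")],  # inside the hot window
)

def archive_old_events(cutoff_iso: str, sink) -> int:
    """Export rows older than cutoff as JSONL, then delete them."""
    old = con.execute(
        "SELECT id, event_type, created_at FROM mission_events WHERE created_at < ?",
        (cutoff_iso,)).fetchall()
    for rid, etype, ts in old:
        sink.write(json.dumps({"id": rid, "event_type": etype, "created_at": ts}) + "\n")
    con.execute("DELETE FROM mission_events WHERE created_at < ?", (cutoff_iso,))
    con.commit()
    return len(old)

sink = io.StringIO()
archived = archive_old_events("2025-12-15T00:00:00", sink)
remaining = con.execute("SELECT COUNT(*) FROM mission_events").fetchone()[0]
```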
8. Prometheus Metrics Wiring
8.1 Current State
Six counters/histograms are defined in core/monitoring/automatos_metrics.py:63-132 but never incremented anywhere in the codebase:
| Metric | Defined at | Status |
|---|---|---|
| automatos_agent_heartbeat_total | Line 63 | Unwired |
| automatos_agent_heartbeat_duration_seconds | Line 69 | Unwired |
| automatos_agent_token_usage_total | Line 75 | Unwired |
| automatos_active_agents | Line 81 | Unwired |
| automatos_llm_request_duration_seconds | Line 121 | Unwired |
| automatos_llm_tokens_total | Line 128 | Unwired |
8.2 Wiring Plan
| Metric | Wire in | Call |
|---|---|---|
| agent_heartbeat_total | heartbeat_service.py, after each tick | .labels(agent_id=str(id), status=status).inc() |
| agent_heartbeat_duration_seconds | heartbeat_service.py, around tick execution | .labels(agent_id=str(id)).observe(duration) |
| agent_token_usage_total | usage_tracker.py track() (see Section 6.3) | .labels(agent_id, model, direction).inc(tokens) |
| active_agents | heartbeat_service.py, at service start | .set(count_of_active_agents) |
| llm_request_duration_seconds | usage_tracker.py track() (see Section 6.3) | .labels(model, provider).observe(latency/1000) |
| llm_tokens_total | usage_tracker.py track() (see Section 6.3) | .labels(model, provider, direction).inc(tokens) |
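The wiring for one counter and one histogram can be sketched end to end with prometheus_client. Metric names mirror Section 8.1, but the example uses a private registry so it stands alone, and the label values are invented:

```python
# Sketch: wire token and latency metrics inside a track()-style call path.
# A private CollectorRegistry keeps the example isolated from the default one.
from prometheus_client import CollectorRegistry, Counter, Histogram

registry = CollectorRegistry()
AGENT_TOKEN_USAGE = Counter(
    "automatos_agent_token_usage", "Tokens consumed per agent/model",
    ["agent_id", "model", "direction"], registry=registry)
LLM_REQUEST_DURATION = Histogram(
    "automatos_llm_request_duration_seconds", "LLM request latency",
    ["model", "provider"], registry=registry)

def track(agent_id: str, model: str, provider: str,
          tokens_in: int, tokens_out: int, latency_ms: float) -> None:
    AGENT_TOKEN_USAGE.labels(agent_id, model, "input").inc(tokens_in)
    AGENT_TOKEN_USAGE.labels(agent_id, model, "output").inc(tokens_out)
    LLM_REQUEST_DURATION.labels(model, provider).observe(latency_ms / 1000)

track("7", "claude-opus", "anthropic", 1200, 300, 850.0)
input_tokens = registry.get_sample_value(
    "automatos_agent_token_usage_total",
    {"agent_id": "7", "model": "claude-opus", "direction": "input"})
```

Bounded label sets (agent roster ~20, models ~10) keep cardinality at roughly 200 series, consistent with risk #8.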
8.3 New Mission-Specific Metrics
9. Heartbeat Cost Fix
9.1 Current Bug
heartbeat_service.py:913-942 inserts heartbeat_results rows but never populates the cost column — it's always 0.0.
The cost column exists on the heartbeat_results table, but it never appears in the raw SQL INSERT's column list — so it defaults to whatever the column default is (likely 0.0 or NULL).
9.2 Fix
After the heartbeat LLM call completes, query the llm_usage table for the cost of the call just made (using execution_id correlation), then include cost in the INSERT:
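A sketch of that fix, with SQLite standing in for the real session (the column subset and execution_id values are illustrative):

```python
# Sketch: correlate the heartbeat's LLM spend via execution_id and persist it
# on the heartbeat_results row instead of leaving the default 0.0.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE llm_usage (id INTEGER PRIMARY KEY, execution_id TEXT, total_cost REAL);
CREATE TABLE heartbeat_results (id INTEGER PRIMARY KEY, execution_id TEXT, cost REAL DEFAULT 0.0);
INSERT INTO llm_usage (execution_id, total_cost) VALUES ('hb-42', 0.012), ('hb-42', 0.004);
INSERT INTO heartbeat_results (execution_id) VALUES ('hb-42');
""")

def backfill_heartbeat_cost(execution_id: str) -> float:
    """Sum llm_usage for the execution and persist it on the heartbeat row."""
    (total,) = con.execute(
        "SELECT COALESCE(SUM(total_cost), 0.0) FROM llm_usage WHERE execution_id = ?",
        (execution_id,)).fetchone()
    con.execute("UPDATE heartbeat_results SET cost = ? WHERE execution_id = ?",
                (total, execution_id))
    con.commit()
    return total

cost = backfill_heartbeat_cost("hb-42")
```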
10. API Endpoints
10.1 Mission Telemetry
Response:
10.2 Mission Outcome Analytics
Response:
10.3 Human Review Endpoints
Both endpoints emit human.reviewed / human.verdict telemetry events.
11. Cross-PRD Integration
11.1 Dependencies
| PRD | Integration | Direction |
|---|---|---|
| PRD-101 | mission_runs and mission_tasks tables receive summary columns | PRD-106 extends PRD-101 schema |
| PRD-102 | Coordinator emits mission.created/plan_generated/replanned events, uses request_type='coordinator' | PRD-102 writes → PRD-106 captures |
| PRD-103 | VerificationService sets verifier_score, emits task.verified, uses request_type='verifier' | PRD-103 writes → PRD-106 captures |
| PRD-104 | Ephemeral contractors must log agent_id before cleanup; contractor LLM calls must include mission_task_id | PRD-104 must preserve telemetry before destroying the contractor |
| PRD-105 | Budget enforcement uses real-time cost_usd from the same increment path; budget.warning/throttled events | Shared write path — PRD-105 reads what PRD-106 writes |
| PRD-107 | Context interface exposes context_tokens and sections_trimmed for telemetry capture | PRD-107 provides → PRD-106 persists |
11.2 Telemetry for Ephemeral Contractors (PRD-104 Constraint)
Ephemeral contractors are destroyed after task completion. The telemetry must be captured before cleanup:
- All LLM calls during contractor execution include mission_task_id → llm_usage rows survive contractor deletion
- mission_tasks.agent_id is set at assignment time → survives contractor deletion (the FK is nullable, not cascading)
- mission_events reference agent_id as an integer field, not a FK → no referential-integrity issue when the contractor is cleaned up
11.3 Propensity Logging (Future-Proofing)
The schema supports future action_probability without migration:
The attributes JSONB field accepts any key in the task.assigned allowlist. Adding action_probability to the allowlist is a one-line code change, not a migration.
12. Batch Reconciliation
12.1 Purpose
Real-time increments in UsageTracker.track() can drift from truth if:
- A DB session fails after INSERT but before COMMIT
- The UPDATE mission_tasks SET tokens_in = tokens_in + :delta races with a concurrent update
- A retry loop double-counts
12.2 Reconciliation Job
Run daily as part of task_reconciler.py:
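The job body is not reproduced in this extract. A sketch of the recompute-and-overwrite pass at the mission_tasks level, under the same source-of-truth rule (llm_usage wins), with SQLite standing in for Postgres:

```python
# Sketch: find tasks whose cost_usd summary drifted from the llm_usage truth
# and overwrite them. mission_runs rollups would be recomputed the same way.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE mission_tasks (id INTEGER PRIMARY KEY, cost_usd REAL DEFAULT 0.0);
CREATE TABLE llm_usage (id INTEGER PRIMARY KEY, mission_task_id INTEGER, total_cost REAL);
-- drifted summary: a failed increment left cost_usd at 0.05, truth is 0.08
INSERT INTO mission_tasks VALUES (1, 0.05);
INSERT INTO llm_usage (mission_task_id, total_cost) VALUES (1, 0.05), (1, 0.03);
""")

def reconcile() -> int:
    """Overwrite drifted summaries; return the number of corrected rows."""
    drifted = con.execute("""
        SELECT t.id, COALESCE(SUM(u.total_cost), 0.0) AS truth
        FROM mission_tasks t
        LEFT JOIN llm_usage u ON u.mission_task_id = t.id
        GROUP BY t.id
        HAVING ABS(t.cost_usd - truth) > 1e-9
    """).fetchall()
    for task_id, truth in drifted:
        con.execute("UPDATE mission_tasks SET cost_usd = ? WHERE id = ?",
                    (truth, task_id))
    con.commit()
    return len(drifted)

fixed = reconcile()
cost = con.execute("SELECT cost_usd FROM mission_tasks WHERE id = 1").fetchone()[0]
```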
13. Risk Register
| # | Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|---|
| 1 | Event log volume overwhelms Postgres | High | Medium | Monthly partitioning + 90-day retention + S3 archive. At 1000 events/day, 90 days = 90K rows (~18MB) — well within Postgres comfort zone |
| 2 | Telemetry write overhead slows the LLM call path | High | Low | UsageTracker already writes in a separate DB session. mission_task_id is one extra field on an existing INSERT plus one UPDATE. No new network call |
| 3 | Schema stores too little — can't answer questions not yet thought of | Medium | Medium | The hybrid approach (summary + events + JSONB attributes) hedges this. attributes is the escape hatch for ad-hoc data |
| 4 | GDPR/privacy concerns with outcome data | Medium | Low | Strict "no content" policy (Section 4). Telemetry is metadata only: no prompt text, no model output, no PII |
| 5 | Premature optimization — building the learning engine too early | Medium | High | PRD-100 explicitly forbids this. PRD-106 captures data and defines queries; it does NOT build recommendation or bandit algorithms |
| 6 | llm_usage table is write-hot — adding FK + index may slow writes | Medium | Medium | Nullable FK; index only non-null values (WHERE mission_task_id IS NOT NULL). Existing queries are unaffected — they never filter by mission_task_id |
| 7 | Real-time increment drifts from raw data | Medium | Medium | Daily batch reconciliation (Section 12) recomputes from the llm_usage source of truth |
| 8 | Prometheus cardinality explosion from agent_id × model labels | Medium | Low | The agent label uses a string ID (bounded by roster size, ~20); the model label uses model_id (bounded by installed models, ~10). Max cardinality: ~200 series |
14. Acceptance Criteria
Must Have
Should Have
Nice to Have
Appendix A: SQLAlchemy Model for mission_events
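The appendix model itself is not reproduced in this extract. A plausible reconstruction from the Section 3.3 description — column names beyond those the PRD names (mission_run_id, mission_task_id, agent_id, event_type, attributes) are assumptions:

```python
# Reconstructed sketch of the mission_events model. agent_id is deliberately a
# plain integer, not a FK, per Section 11.2 (contractors are deleted).
from datetime import datetime, timezone
from sqlalchemy import JSON, Column, DateTime, Integer, String, create_engine, inspect
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class MissionEvent(Base):
    __tablename__ = "mission_events"

    id = Column(Integer, primary_key=True)
    mission_run_id = Column(Integer, nullable=False, index=True)
    mission_task_id = Column(Integer, nullable=True, index=True)
    agent_id = Column(Integer, nullable=True)  # plain int, not a FK (Section 11.2)
    event_type = Column(String(50), nullable=False, index=True)
    attributes = Column(JSON, nullable=False, default=dict)  # JSONB in Postgres
    created_at = Column(DateTime, nullable=False,
                        default=lambda: datetime.now(timezone.utc))

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
cols = {c["name"] for c in inspect(engine).get_columns("mission_events")}
```

In Postgres the real model would use JSONB (with a GIN index) and monthly partitioning per Section 7; SQLAlchemy's generic JSON type is the portable stand-in here.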
Appendix B: Research Sources
| Source | Takeaway |
|---|---|
| MLflow entity model (mlflow/mlflow) | Three-tier storage: metrics (append-only) vs params (immutable) vs tags (mutable) |
| W&B summary/history split (wandb/wandb) | Summary columns for dashboards + event log for deep analysis = hybrid approach |
| OpenTelemetry trace/span model (opentelemetry.io) | Mission=trace / task=span mental model; gen_ai.* attribute naming conventions |
| Honeycomb high-cardinality querying (docs.honeycomb.io) | JSONB metadata for flexible GROUP BY; no pre-aggregation needed at our scale |
| Eppo assignment/metric model (docs.geteppo.com) | Attribution windows, raw event storage, join at analysis time |
| Deng, Microsoft ExP | Sufficient statistics (sum + sum_sq) for variance without raw-data re-read |
| Eugene Yan, counterfactual evaluation | Propensity logging (action_probability) for future offline model evaluation |
| Automatos codebase audit | llm_usage, heartbeat_results, tool_execution_logs, votes, automatos_metrics.py — 7+ existing telemetry touchpoints identified |