PRD-121 — HARNESS: Self-Optimizing Organization Loop
Version: 1.0 | Type: Implementation | Status: Draft | Priority: P1
Research Base: Meta-Harness (arXiv 2603.28052v1), Mission Zero (Mission-0.1), PRDs 82A (Coordinator), 76 (Reports), 64 (Action Discovery), 12 (Playbook Patterns), 59 (Workflow Engine V2)
Author: Gerard Kavanagh + Claude
Date: 2026-03-31
1. Goal
User-facing: Auto continuously tunes your team — you see better agent performance, lower costs, and occasional board suggestions for bigger changes. No setup required.
Technical: Close the optimization loop on Mission Zero. Today the organization is built once and drifts. HARNESS is a hardcoded system service that runs weekly, invisible to the user. It collects org-wide metrics, diagnoses regressions against a stored baseline, prescribes configuration changes with risk scores, auto-applies safe ones, queues risky ones as board tasks for human review, and snapshots a new baseline for next week's comparison.
The result: a Meta-Harness-style iterative optimization loop applied to organizational configuration. Agents, models, heartbeats, tools, costs, and quality converge toward optimal over 4-6 weekly runs.
2. Background — Why This Matters
2.1 The Meta-Harness Paper
Stanford's Meta-Harness (Lee et al., March 2026) proves three things relevant to Automatos:
The harness matters more than the model. Changing the code/config around a fixed LLM produces a 6x performance gap. Automatos' tool routing, memory injection, agent configuration, and mission coordination IS the harness.
Full diagnostic traces beat summaries. Giving an optimizer access to raw execution traces (not compressed summaries) improved accuracy from 34.6% to 50.0% — a 44% gain.
Iterative search converges fast. Meta-Harness matches competitors in 4 evaluations vs 60, because the proposer sees everything.
2.2 Mission Zero Gap
Mission Zero (Mission-0.1) is a one-shot organizational build:
Research marketplace → Build agent specs → Execute configuration → Validate
There is no closed loop. After the initial build:
Agent models may be cost-inefficient for their actual workload
Heartbeat intervals may be too frequent (wasting tokens) or too rare (stale data)
Tools get assigned but never called
Success rates drift without anyone noticing
Cost creeps without attribution
HARNESS closes this loop.
3. What Ships
HarnessService: Orchestrator-level service, registered at startup like the coordinator tick. Runs on a weekly cron. 5-phase pipeline hardcoded in the service (Collect → Diagnose → Prescribe → Apply → Baseline). Same for every workspace.
3 platform tools: platform_harness_status, platform_harness_trigger, platform_harness_history — called by Auto, not user-facing.
Workspace file layout: /harness/ directory with baselines, traces, changelogs, prescriptions.
Risk framework: 5-tier risk scoring for prescriptions (auto-apply ≤ 2, queue ≥ 3).
Convergence detection: Track delta magnitude across runs; detect when the org config stabilizes.
4. What Does NOT Ship (Deferred)
Prompt optimization / A/B testing (v2): Needs shadow-mode execution infrastructure
Tool/skill assignment optimization (v2): Needs deeper impact analysis before auto-applying
Blueprint rule modifications (v2): Governance changes need more human oversight initially
Explicit rollback mechanism (v2): Baseline diff is sufficient for v1; next run can prescribe reversions
Auto-cadence switching (weekly ↔ biweekly) (v2): Convergence detection ships in v1, but cadence change is manual
Frontend dashboard for HARNESS (v2): Existing Reports tab + Board are sufficient
Cross-workspace pattern sharing (v3): Requires marketplace-level analytics
Agent creation/retirement (v3): High-risk structural changes
HARNESS self-prompt-tuning (v3): Needs safeguards around modifying hardcoded phase prompts
Grafana integration for infra metrics (v3): Platform tools cover application-level metrics for now
5. User Experience
5.1 The User Never Sees HARNESS
HARNESS is orchestrator infrastructure — same level as the coordinator tick or the heartbeat scheduler. It is:
Registered at server startup alongside the coordinator's 5s tick and heartbeat scheduler
Not in any UI — no playbook list, no settings page, no toggle
Same for every workspace — hardcoded phases, hardcoded schedule, no per-workspace config
Cannot be deleted, disabled, or modified by users
The user's experience is:
They sign up, create a workspace, add agents
HARNESS runs silently in the background (dormant until ≥ 3 agents with ≥ 7 days of data)
First meaningful run produces a baseline and an audit report in the Reports tab
Subsequent runs may produce board tasks like [HARNESS] Suggest: reduce SCOUT heartbeat to 180min — the user approves or dismisses like any other task
Over time, agent configs quietly improve. Costs trend down. Success rates trend up.
5.2 Auto Surfaces HARNESS Naturally
When a user asks Auto:
"How's the team performing?" → Auto calls platform_harness_status, references latest report
"Any optimization ideas?" → Auto calls platform_harness_history, summarizes recent prescriptions
"Why did SCOUT's model change?" → Auto reads /harness/changelog/ for the applied change + rationale
The user never needs to know the word "HARNESS." It's just Auto being a good CTO.
5.3 Lifecycle
Dormant (< 3 agents OR < 7 days of data): Cron fires but Step 2 detects insufficient data, writes baseline-only, no prescriptions
Exploring (first 3 runs with sufficient data): Full optimization pass, learning the org's profile
Converging (runs 4-6+): Deltas shrinking, prescriptions fewer and more targeted
Converged (delta magnitude < 2.0 for 2+ runs): Monitoring mode; only flags regressions, suggests biweekly cadence
Diverging (external change: new agents, model swap, etc.): Re-enters exploring mode, weekly cadence resumes
5.4 System-Level Registration
HARNESS registers at server startup in main.py lifespan, same pattern as:
CoordinatorService.start() — 5s mission tick
HeartbeatService.start() — agent/orchestrator heartbeats
PlaybookSchedulerService.start() — cron playbooks
No per-workspace provisioning. No seed migration. The service queries all active workspaces and registers a job for each. New workspaces get picked up on the next scheduler reload.
6. Design Decisions
6.1 Orchestrator Heartbeat Schedule, NOT Playbook
Playbooks are user-visible — they appear in the playbook/recipe list, can be deleted, renamed, or misconfigured by non-technical users. HARNESS must be invisible infrastructure.
HARNESS registers as a named orchestrator heartbeat schedule (job_id = f"harness_{workspace_id}"), same pattern as the existing orchestrator heartbeat but with its own weekly cron. It's configured in code, not in the database.
Agent heartbeats: one per agent (Auto's is taken) — not suitable
Playbooks: user-visible, deletable — not suitable
Orchestrator heartbeat schedule: system infrastructure, registered at startup — correct
Schedule: 0 2 * * 0 (Sunday 2AM UTC weekly)
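A minimal sketch of that registration, modeling the scheduler as a plain dict so the job-id and idempotency logic stand alone (a real implementation would call add_job on the existing APScheduler instance; function and variable names here are illustrative, not confirmed code):

```python
HARNESS_CRON = "0 2 * * 0"  # Sunday 02:00 UTC, weekly

def register_harness_jobs(scheduler: dict, workspace_ids: list) -> list:
    """Register one weekly HARNESS job per active workspace (idempotent).

    Assumes job_id = f"harness_{workspace_id}", the same naming pattern
    used for heartbeat jobs. Overwriting an existing key mirrors the
    replace_existing=True semantics of APScheduler's add_job.
    """
    registered = []
    for ws in workspace_ids:
        job_id = f"harness_{ws}"
        scheduler[job_id] = HARNESS_CRON  # replace_existing: safe across scheduler reloads
        registered.append(job_id)
    return registered
```

New workspaces are picked up on the next scheduler reload, since registration is keyed on the stable job_id rather than appended blindly.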
6.2 Hardcoded Service, NOT Configurable Template
HARNESS has a fixed 5-phase pipeline. The phases, prompts, and risk thresholds are hardcoded in HarnessService, not stored in a DB template. This means:
No seed migration needed
No user can modify, delete, or misconfigure it
Same behavior for every workspace
Updates ship with code deploys
Missions are for user goals with LLM-planned DAGs — wrong primitive for a fixed system pipeline
6.3 Workspace Files, NOT New DB Tables
Meta-Harness stores full traces on filesystem for LLM readability. Full diagnostic traces are too large for JSONB columns. Workspace files are:
Agent-readable via workspace_read_file
Human-browsable in the UI
Naturally organized by date
No schema migration required
6.4 Platform Tools + NL2SQL Hybrid
Platform analytics tools provide pre-computed aggregations (good for dashboards). But platform_query_data (NL2SQL) can query raw data from heartbeat_results, recipe_executions, orchestration_runs, llm_usage, and agent_reports for richer cross-table analysis that isn't exposed via platform tools. Using both maximizes coverage.
6.5 Board Tasks for Human Review
Prescriptions with risk ≥ 3 become board_tasks with tags: ["harness"]. The human moves them to "done" (approved) or "blocked" (rejected). The next HARNESS run reads board task statuses to close the loop. No new approval mechanism needed — the existing board lifecycle is sufficient.
7. The 5-Phase Pipeline
7.1 Phase 1: COLLECT
Agent: Auto (CTO) Mode: Read-only Purpose: Gather comprehensive raw metrics across all agents and systems.
Tool calls:
1. platform_list_agents — All 14 agents with configs, models, tools, skills
2. platform_get_agent_ranking — Composite scoring: success, speed, volume
3. platform_get_success_rate — Overall success rate + 7-day trend
4. platform_get_error_rates — Failures by agent type, severity (30d)
5. platform_get_cost_breakdown — Cost by agent, provider, model (7d + 30d)
6. platform_get_sla_compliance — Completion rate + response time vs targets
7. platform_get_efficiency_score — Composite 0-100 score
8. platform_get_bottlenecks — Failure rates, queue buildup, slow executions
9. platform_get_llm_usage — Token counts by model
10. platform_board_summary — Tasks by status, priority, busiest agents
11. workspace_read_file — /harness/baseline_latest.json (previous baseline)
12. platform_query_data — Per-agent heartbeat costs (7d)
13. platform_query_data — Per-playbook success rates (7d)
14. platform_query_data — Recent mission outcomes (7d)
15. platform_list_tasks — Prior HARNESS prescriptions (tags=["harness"])
Design principle: Preserve full diagnostic data, not summaries. This is the Meta-Harness insight — raw traces enable the proposer to do causal reasoning about failures.
Output: Raw metrics JSON → scratchpad → Step 2
7.2 Phase 2: DIAGNOSE
Agent: Auto (CTO) Mode: Read-only Purpose: Compare current metrics against previous baseline, produce per-agent health cards.
Per-agent health card:
success_rate_delta, cost_delta, efficiency_delta, error_rate_delta, token_usage_delta
Classification: REGRESSION (>10% worse), IMPROVEMENT (>10% better), ANOMALY (zero activity, cost spike, new error patterns), STABLE
Cross-cutting analysis:
Department-level aggregate performance
Model cost-efficiency: are we paying premium prices for budget-tier output?
Tool utilization: tools assigned but never called in 7d
Heartbeat health: missed beats, excessive cost per heartbeat
Playbook quality trends: declining quality_score on any playbook
Root cause classification for each issue: model_mismatch | prompt_drift | tool_gap | overload | underutilized | config_stale | cost_inefficient
Self-optimization: Also reads prior HARNESS run results from heartbeat_results (see 12.1) to factor in patterns from earlier runs (e.g., "Step 1 COLLECT is slow due to NL2SQL timeout").
Output: Structured diagnosis JSON → scratchpad → Step 3
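The ±10% classification rule above can be expressed as a pure function (thresholds from the Classification line; the parameter names are illustrative):

```python
def classify(delta_pct: float, activity: int) -> str:
    """Classify a per-agent metric delta per the DIAGNOSE rules.

    delta_pct: signed percent change vs baseline (positive = better).
    activity: event count in the window; zero activity is an anomaly.
    """
    if activity == 0:
        return "ANOMALY"      # agent produced nothing in the window
    if delta_pct < -10:
        return "REGRESSION"   # >10% worse than baseline
    if delta_pct > 10:
        return "IMPROVEMENT"  # >10% better than baseline
    return "STABLE"
```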
7.3 Phase 3: PRESCRIBE
Agent: Auto (CTO) Mode: Read-only (generates prescriptions but does not apply) Purpose: Generate prioritized, risk-scored configuration change proposals.
Prescription schema:
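The schema body did not survive in this draft. A plausible shape, inferred from the fields referenced in Phases 4 and 5 (change_type, risk score, rationale) and not a confirmed schema, might be:

```python
# Hypothetical prescription record; every field name here is inferred
# from this PRD's prose, not taken from actual code.
example_prescription = {
    "id": "rx-2026-04-05-001",
    "agent_name": "SCOUT",
    "change_type": "heartbeat_tune",   # one of the Phase 4 change types
    "current_value": {"heartbeat_interval_min": 120},
    "proposed_value": {"heartbeat_interval_min": 180},
    "risk": 2,                         # 1-5; ≤ 2 auto-applies, ≥ 3 queues
    "rationale": "Heartbeat cost up 40% over 7d with no data-freshness gain",
    "expected_impact": {"cost_delta_pct": -15},
}
```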
Risk scoring rules:
Risk 1 (auto-apply): Description, tags, team, job_title updates
Risk 2 (auto-apply): Heartbeat interval ±30min, temperature ±0.1, active_hours shift
Risk 3 (queue for human): Model change within same cost tier, tool addition
Risk 4 (queue for human): Model tier change (haiku→sonnet), prompt rewrite, skill changes
Risk 5 (queue for human): Agent deactivation, proactive_level→autonomous, deletion
Pareto filter: Prefer accuracy improvements over cost cuts when success_rate < 85%. Only optimize cost when quality is healthy. This prevents the optimizer from racing to the cheapest model at the expense of output quality.
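That gating rule can be sketched as a filter over the prescription list (the "category" field on each prescription is an assumed name for illustration):

```python
def pareto_filter(prescriptions: list, success_rate: float) -> list:
    """Prefer accuracy improvements over cost cuts until quality is healthy.

    While org success_rate < 0.85, drop cost-only prescriptions so the
    optimizer cannot race to the cheapest model at the expense of quality.
    """
    if success_rate >= 0.85:
        return prescriptions  # quality is healthy: cost optimization allowed
    return [p for p in prescriptions if p.get("category") != "cost"]
```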
Rejected-change awareness: Skip prescriptions matching prior "blocked" board tasks to avoid re-prescribing what the human already rejected.
Output: Prescriptions JSON array → scratchpad → Step 4
7.4 Phase 4: APPLY
Agent: Auto (CTO) Mode: Write (conditional on risk score) Purpose: Execute safe changes, queue risky ones for human review.
Auto-apply (risk ≤ 2):
model_change (same tier) → platform_update_agent with model_config
temperature_adjust → platform_update_agent with model_config.temperature
heartbeat_tune → platform_configure_agent_heartbeat
tag_update → platform_update_agent with tags
description_update → platform_update_agent with description
Queue for review (risk ≥ 3):
Creates board_task via platform_create_task with:
Title: [HARNESS] {change_type} for {agent_name}
Description: Full prescription details + rationale + risk score
Tags: ["harness", "org-review", "risk-{N}"]
Priority: "high" if risk ≥ 4, else "medium"
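A queued prescription's board-task payload might be assembled like this (field names follow the list above; the helper itself and the prescription dict keys are illustrative, not actual code):

```python
def to_board_task(rx: dict) -> dict:
    """Build the platform_create_task payload for a risk >= 3 prescription."""
    risk = rx["risk"]
    return {
        "title": f"[HARNESS] {rx['change_type']} for {rx['agent_name']}",
        "description": f"{rx['rationale']} (risk {risk})",
        "tags": ["harness", "org-review", f"risk-{risk}"],
        "priority": "high" if risk >= 4 else "medium",
    }
```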
Also applies: Previously approved board tasks (status=done, tag=harness) that weren't yet applied from prior runs.
Output: Changelog (applied + queued + failed) → scratchpad → Step 5
7.5 Phase 5: BASELINE
Agent: Auto (CTO) Mode: Write Purpose: Snapshot new state for next week's comparison, publish artifacts, submit audit report.
Writes:
workspace_write_file → /harness/baseline_latest.json (overwrite — current state)
workspace_write_file → /harness/baselines/{YYYY-MM-DD}.json (archive — append-only)
workspace_write_file → /harness/traces/{YYYY-MM-DD}_trace.json (full diagnostic trace)
workspace_write_file → /harness/changelog/{YYYY-MM-DD}.md (human-readable changes)
workspace_write_file → /harness/prescriptions/{YYYY-MM-DD}_rx.json (all prescriptions)
platform_submit_report → type="audit", title="HARNESS Weekly Org Review — Run #{N}"
Convergence tracking (stored in baseline):
8. Baseline Schema
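The schema body is missing from this draft. A plausible shape, assembled only from fields this PRD references elsewhere (per-agent metrics in Phase 2, convergence state and iteration count in Section 11), would be:

```python
# Hypothetical baseline_latest.json contents; field names and the model
# string are illustrative, reconstructed from this PRD's prose.
example_baseline = {
    "run_date": "2026-04-05",
    "iteration_count": 4,
    "convergence": {
        "state": "converging",            # exploring|converging|converged|diverging
        "total_delta_magnitude": 3.7,
    },
    "agents": {
        "SCOUT": {
            "model": "claude-haiku",
            "success_rate": 0.91,
            "cost_7d": 4.20,
            "efficiency": 78,
            "error_rate": 0.04,
            "heartbeat_interval_min": 180,
        },
    },
}
```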
9. Workspace File Layout
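The layout body is missing from this draft; reconstructed from the Phase 5 write list, the tree would be:

```
/harness/
  baseline_latest.json            # current baseline (overwritten each run)
  baselines/
    2026-04-05.json               # append-only archive, one file per run
  traces/
    2026-04-05_trace.json         # full diagnostic trace per run
  changelog/
    2026-04-05.md                 # human-readable applied/queued changes
  prescriptions/
    2026-04-05_rx.json            # all prescriptions per run
```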
10. Risk Framework
Risk 1 (auto-apply): Description, tags, team — cosmetic, zero operational impact
Risk 2 (auto-apply): Heartbeat interval ±30min, temperature ±0.1 — low impact, easily reversible
Risk 3 (queue): Model change within tier, add tool — could affect output quality
Risk 4 (queue): Model tier change, prompt rewrite, skill change — significant behavioral change
Risk 5 (queue): Agent deactivation, proactive→autonomous — irreversible or high-risk
Rollback strategy (v1): If an auto-applied change causes regression (detected in next HARNESS run), the DIAGNOSE step will identify it via delta computation, and PRESCRIBE will generate a reversion prescription. The baseline diff serves as the rollback reference.
11. Convergence Detection
exploring (iteration_count < 3): Full optimization pass, weekly cadence
converging (total_delta_magnitude decreasing run-over-run): Continue weekly, standard risk thresholds
converged (delta < 2.0 AND variance < 0.02 for 2+ consecutive runs): Note in report, suggest biweekly switch
diverging (total_delta_magnitude increasing): Keep weekly, flag in report for human attention
total_delta_magnitude = sum of absolute deltas across all agents across all metrics (success_rate, cost, efficiency, error_rate). When this approaches zero, the org configuration is stable.
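That aggregation can be sketched directly from the definition (the nested-dict shape of the inputs is an assumption for illustration):

```python
def total_delta_magnitude(current: dict, baseline: dict) -> float:
    """Sum of absolute per-agent metric deltas across the tracked metrics.

    current/baseline map agent -> {metric: value} for success_rate, cost,
    efficiency, and error_rate. Agents absent from the baseline contribute
    nothing, so a first run yields 0.0.
    """
    total = 0.0
    for agent, metrics in current.items():
        base = baseline.get(agent, {})
        for name, value in metrics.items():
            if name in base:
                total += abs(value - base[name])
    return total
```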
12. Self-Learning
12.1 Heartbeat Results as Learning Data
Each HARNESS run stores its results in heartbeat_results (same as any orchestrator heartbeat):
findings JSONB — diagnosis summary, prescription count, convergence state
actions_taken JSONB — applied changes, queued changes, failures
tokens_used, cost — resource consumption per run
Phase 2 (DIAGNOSE) reads prior HARNESS heartbeat_results via platform_query_data to detect multi-run patterns:
Is COLLECT getting slower? (tokens_used trending up)
Are prescriptions being rejected repeatedly? (same board tasks reappearing as blocked)
Is the cost of running HARNESS itself growing?
12.2 Convergence as Self-Assessment
The convergence signal (total_delta_magnitude) IS the quality metric. If HARNESS runs are producing changes but metrics aren't improving (delta not shrinking), the system is thrashing. The DIAGNOSE phase detects this and reduces prescription aggressiveness.
12.3 Trace History as Memory
The /harness/traces/ workspace files are HARNESS's long-term memory. Phase 2 can read all prior traces (non-Markovian access) to identify:
"We tried upgrading SCOUT to sonnet in run 3 but it was reverted in run 4 — don't retry"
"Heartbeat tuning for NEXUS has stabilized at 150min across 3 runs — leave it alone"
This mirrors Meta-Harness's key finding: the proposer reads 82+ files of prior history, not just the last run.
13. Feedback Loop — How Run N+1 Reads Run N
Baseline file: Step 1 reads /harness/baseline_latest.json — previous week's per-agent metrics + convergence state
Board task resolution: Step 1 queries platform_list_tasks(tags=["harness"]) — which prescriptions were approved (done) or rejected (blocked)
Prior run results: Step 2 queries heartbeat_results via platform_query_data — patterns from prior HARNESS runs
Trace history: Step 2 reads /harness/traces/ — multi-week trend detection (e.g., a 3-week decline)
Report history: platform_get_latest_report for Auto — prior audit report summaries
14. Platform Tools
14.1 platform_harness_status (read)
Returns current HARNESS state: last run date, convergence status, quality score, iteration count, next scheduled run.
14.2 platform_harness_trigger (write)
Manually trigger a HARNESS run outside the weekly cron schedule. Useful for post-incident optimization or after major org changes.
14.3 platform_harness_history (read)
List past HARNESS runs with dates, prescription counts, applied/queued counts, convergence state per run.
15. Implementation — Files
15.1 New Files
orchestrator/services/harness_service.py (~150 LOC): HarnessService — startup registration of per-workspace cron jobs, get_status(), trigger_now()
orchestrator/modules/tools/discovery/actions_harness.py (~80 LOC): 3 ActionDefinitions registered via register_harness_actions()
orchestrator/modules/tools/discovery/handlers_harness.py (~120 LOC): Handler functions for the 3 platform tools
15.2 Modified Files
orchestrator/modules/tools/discovery/platform_actions.py: Add from .actions_harness import register_harness_actions + call in register_all_actions()
orchestrator/modules/tools/execution/unified_executor.py: Add 3 handler entries to _handlers dict
orchestrator/consumers/chatbot/auto.py: Add HARNESS keywords to _PLATFORM_KEYWORDS
orchestrator/main.py: Add HarnessService.start(scheduler) to lifespan startup
15.3 3-File Pattern (Platform Tools)
Every platform tool in Automatos follows the same pattern:
actions_*.py — ActionDefinition registration (name, description, parameters, permission_level)
handlers_*.py — Handler function that does the work
platform_actions.py — Wires registrar into register_all_actions()
Plus unified_executor.py gets the handler entry in its _handlers dict. The 3 HARNESS platform tools follow this exactly.
15.4 What's NOT a Platform Tool
The 5-phase pipeline itself is NOT platform tools. It's hardcoded methods in HarnessService:
_phase_collect(workspace_id) — gathers metrics via platform tool calls
_phase_diagnose(workspace_id, metrics, baseline) — LLM-powered analysis
_phase_prescribe(workspace_id, diagnosis) — LLM-powered prescription generation
_phase_apply(workspace_id, prescriptions) — executes safe changes, queues risky ones
_phase_baseline(workspace_id, metrics, changelog) — writes workspace files + submits report
These run sequentially inside a single _harness_tick(workspace_id) method, called by the APScheduler cron job.
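A skeleton of that sequential flow, with the phase methods named above stubbed out (bodies and the _load_baseline helper are placeholders, not actual implementation):

```python
import asyncio

class HarnessService:
    """Sketch of the hardcoded 5-phase pipeline; phase bodies are stubs."""

    async def _phase_collect(self, ws): return {"metrics": {}}
    async def _phase_diagnose(self, ws, metrics, baseline): return {"cards": []}
    async def _phase_prescribe(self, ws, diagnosis): return []
    async def _phase_apply(self, ws, prescriptions): return {"applied": [], "queued": []}
    async def _phase_baseline(self, ws, metrics, changelog): return None
    async def _load_baseline(self, ws): return {}  # assumed helper, reads baseline_latest.json

    async def _harness_tick(self, workspace_id: str) -> None:
        """Weekly cron entry point: Collect -> Diagnose -> Prescribe -> Apply -> Baseline."""
        metrics = await self._phase_collect(workspace_id)
        baseline = await self._load_baseline(workspace_id)
        diagnosis = await self._phase_diagnose(workspace_id, metrics, baseline)
        prescriptions = await self._phase_prescribe(workspace_id, diagnosis)
        changelog = await self._phase_apply(workspace_id, prescriptions)
        await self._phase_baseline(workspace_id, metrics, changelog)
```

Keeping the phases as private methods on one service, rather than separate tools or playbook steps, is what makes the pipeline undeletable and identical across workspaces.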
16. Migration
No migration needed. No new tables, no new columns, no seed data.
HARNESS uses only existing infrastructure:
heartbeat_results — stores execution results per run
board_tasks — queued prescriptions for human review
agent_reports — weekly audit report
Workspace files — baselines, traces, changelogs
The service registers itself at startup. Deploy the code, restart the server, HARNESS is live.
17. Phasing
v1 — Core Loop (this PRD)
Orchestrator heartbeat schedule, weekly cron, hardcoded 5-phase pipeline
Metrics collection via platform tools + NL2SQL
Per-agent diagnosis with delta computation
Prescriptions for: model_change, heartbeat_tune, temperature_adjust, tag/description
Auto-apply risk ≤ 2, board_task queue risk ≥ 3
Workspace file storage (baselines, traces, changelog)
Convergence detection (basic)
3 platform tools for Auto (status, trigger, history)
No migration, no seed data, no user-facing config
v2 — Expanded Optimization Surface
Prompt optimization: generate variants, A/B test via shadow mode
Tool/skill assignment optimization
Blueprint rule suggestions
Explicit rollback mechanism with change_id tracking
Convergence auto-cadence switching (weekly ↔ biweekly)
Pareto frontier visualization in frontend
v3 — Autonomous Adaptation
Cross-workspace pattern sharing (marketplace-level insights)
Agent creation/retirement recommendations
HARNESS self-prompt-tuning from learning_data
Cost budget enforcement (auto-downgrade models when budget exceeded)
Grafana integration for infrastructure-level metrics
18. Meta-Harness Mapping
How HARNESS maps to the Stanford Meta-Harness architecture:
Proposer agent (Claude Code Opus) → Auto (CTO agent, Opus model)
Filesystem of prior candidates → /harness/ workspace files (baselines, traces, changelogs)
Execution traces → metrics.json + per-agent health cards + heartbeat results
Score function → Agent rankings + SLA compliance + cost delta
Pareto frontier → Risk tiers (auto-apply vs human-review)
Code-space search → Config-space search (models, heartbeats, temperatures, tools)
Non-Markovian access → Step 2 reads ALL prior traces, not just last run
Convergence detection → total_delta_magnitude tracking across runs
Self-learning → PlaybookLearningService (Stage 6) + PlaybookQualityService (Stage 7)
Key difference: Meta-Harness operates in a sandbox (evaluation set). HARNESS operates on live agents with real consequences. The risk framework ensures only safe changes are auto-applied.
19. Verification Plan
1. Deploy to a dev environment; confirm HarnessService registers a harness job per workspace at startup
2. Trigger manually via platform_harness_trigger in Auto chat
3. Verify COLLECT: /harness/traces/{date}_trace.json has all metric categories
4. Verify DIAGNOSE: trace has per-agent health cards with deltas (first run: all baselines are "new")
5. Verify PRESCRIBE: prescriptions JSON has risk scores and rationale
6. Verify APPLY: board has queued tasks with the harness tag; agents have auto-applied changes
7. Verify BASELINE: /harness/baseline_latest.json exists with correct schema
8. Verify REPORT: Reports tab shows the "HARNESS Weekly Org Review" audit report
9. Verify self-learning: the run's heartbeat_results row contains findings and actions_taken
10. Run a second time: Step 1 reads the previous baseline, Step 2 computes real deltas
11. Cron test: the harness_{workspace_id} job is re-registered on server restart
20. Success Criteria