PRD-78: Autonomous Test Coverage & Quality Mesh
Version: 1.1 Status: Active Priority: P0 Author: Gar Kavanagh + Auto CTO Created: 2026-03-10 Updated: 2026-03-12 Dependencies: PRD-05 (Memory & Knowledge), PRD-55 (Autonomous Assistant Platform), PRD-68 (Progressive Complexity Routing), PRD-69 (Agent Intelligence Layer), PRD-72 (Activity Command Centre), PRD-73 (Observability & Monitoring Stack), PRD-77 (Agent Self-Scheduling & Memory Dashboard)
Executive Summary
Automatos already has the foundations for autonomous quality engineering: API tests, workflow execution, memory, scheduling, skills, platform observability, and agents that can raise Jira tickets and fix bugs. What it does not yet have is a deliberately designed quality mesh: a structured testing architecture that grows with the platform, runs at different cadences, produces machine-readable artifacts, and allows specialist agents to cooperate reliably.
Current baseline (2026-03-12): 124 API integration tests across 23 domain files, 3 runner scripts, 1 audit/gap-finder tool. No unit tests, no browser tests, no regression-pin tests.
Today, testing is partially present:
tests/run_nightly.py runs the broad suite and produces summaries.
tests/run_health_regression.py runs a curated high-signal subset.
tests/run_gap_finder.py inventories the suite and detects gaps.
Skills exist for QA analysis, Jira administration, and automated bug fixing.
The missing piece is the system design that turns these into a coordinated quality program:
A test taxonomy that grows organically as the platform stabilizes
Scheduled execution lanes by speed, risk, and purpose
Consistent artifacts that downstream agents can consume
A clear role for browser automation, internal tool validation, API contracts, and worker-level regressions
Optional specialist agents and model routing for quality tasks
This PRD defines that system and its growth milestones.
Growth Philosophy
This is a living architecture, not a "write 500 tests" project. The test suite grows in lockstep with platform maturity:
Pilot Readiness (now) — Flush bugs, pin known regressions, deepen P0 coverage so 10-15 pilot users hit a stable platform.
Post-Pilot Hardening — As real users surface issues, each bug becomes a regression test. Coverage grows from usage, not quotas.
Mature Platform — When the platform stops changing rapidly, fill out the full taxonomy. The 500+ target is a north star, not a Phase 1 deliverable.
What We're Building
A scalable test coverage architecture spanning API integration tests, regression pins, internal tool validation, stateful journeys, and eventually browser tests — growing as the platform matures
Three production testing recipes with clean runner scripts, schedules, artifact outputs, and downstream handoffs:
Nightly Self-Test Suite
API Health Check & Regression Detector
Weekly Test Coverage Gap Finder
A specialist-agent quality workflow where QA, Jira, and Bug Fixer agents cooperate with rigid artifact contracts
A scheduling model that separates PR checks, hourly health, nightly full confidence, weekly gap analysis, and release validation
A quality artifact standard that lets agents create Jira issues, classify severity, trace source files, and drive bug-fixing without ad hoc reasoning
What We're NOT Building
A separate test repository right now
Full load/performance engineering in this PRD
Chaos testing or production incident automation
A new CI platform
A full synthetic monitoring platform for deployed environments
500 tests before the platform is stable enough to warrant them
Those can become future PRDs once the in-repo testing mesh is stable.
1. Problem Statement
Automatos needs more than "more tests." It needs structured, scalable, agent-operable testing.
Current pain points:
Coverage is present but uneven. Many current tests are smoke checks per endpoint. There are fewer stateful user journeys and fewer end-to-end contract validations than the platform now needs.
No formal test taxonomy exists. Without grouping by risk, speed, and schedule, a 500+ suite will become slow, noisy, and hard to reason about.
The QA → Jira → Bug Fixer handoff was previously implicit. Artifact names, scratchpad payloads, and runner outputs were not rigid enough for reliable automation.
Browser simulation is underdeveloped. The product has rich UI and multi-step agent workflows, but browser-level user journeys are not yet a first-class layer.
Internal tools need deeper validation. Platform actions, scratchpad contracts, scheduler behavior, memory logic, recipe execution, and agent orchestration require test coverage beyond HTTP route checks.
Testing must stay version-aligned with the product. Since routes, payloads, workflows, runner outputs, and agent behaviors change rapidly, test logic should evolve in the same repo and release cycle as the code.
2. Strategic Decision: Keep Testing In-Repo
The test system remains inside automatos-ai.
Why
Version alignment matters more than separation right now.
API routes, response schemas, internal tools, UI behavior, and workflow contracts change together.
Bug fixes and test fixes often belong in the same branch and PR.
The downstream agents depend on repo-relative evidence.
qa-engineer emits repo-relative source_files
jira-admin puts those references into tickets
bug-fixer reproduces directly from those paths
Agent workflows are easier when code and tests share one lifecycle.
A separate repo would add drift, duplicated review overhead, and slower iteration.
Future Split Line
Only split into a separate quality repo later for:
load testing
deployed-environment black-box validation
synthetic monitoring
chaos testing
cross-repo certification
Core regression, journey, API, worker, and internal tool tests stay in automatos-ai.
3. Coverage Architecture
We will build a test pyramid plus schedule matrix.
3.1 Layer A: Fast deterministic tests
Target: 180-220 tests
Purpose:
Catch refactor regressions quickly
Validate pure logic and service behavior
Keep PR feedback fast
Coverage examples:
memory classification and formatting
model config validation
severity and category mapping
runner output builders
route payload validators
tool registration and formatting
scratchpad contract builders
workflow stage/event formatting
scheduler utilities
3.2 Layer B: API integration tests
Target: 140-180 tests
Purpose:
Validate backend route contracts
Verify real state changes and resource lifecycles
Coverage examples:
agents CRUD + model config + execute
chat create/history/rename/delete
memory stats/search/recent
workflows execute/status/cancel
heartbeat endpoints
routing, channels, tools, skills, personas
documents, knowledge, webhooks, keys, analytics
3.3 Layer C: Browser / Playwright journeys
Target: 80-120 tests
Purpose:
Simulate real user behavior
Catch UI regressions, stale state, broken navigation, and role-based visibility issues
Coverage examples:
login / workspace landing
create and configure agent
edit model config
enable heartbeat
create and run recipe
open chat, follow up, rename, revisit history
document upload + processing + search
routing rule management
activity feed and command centre flows
settings pages
3.4 Layer D: Internal tools, worker, and orchestration tests
Target: 60-90 tests
Purpose:
Validate Automatos-specific machinery not visible through simple route smoke tests
Coverage examples:
platform_* executor behavior
tool routing
scratchpad read/write contracts
scheduler / recipe scheduler / heartbeat service
memory daily logs and access logging
workflow execution streaming and SSE/AI SDK event shape
task runner queued vs local behavior
Jira/QA/Bug Fixer artifact compatibility
3.5 Layer E: Regression contracts
Target: 30-50 tests
Purpose:
Pin the expensive bugs
Prevent repeated break/fix cycles
Coverage examples:
memory scoping regressions
null-handling bugs
response-shape regressions
bad SQL column/path regressions
runner artifact contract regressions
Jira evidence handoff regressions
workflow execution handle regressions
3.6 Coverage Growth by Milestone
Tests grow with the platform. Targets are approximate and driven by need, not quotas.
| Layer | Pilot Readiness | Post-Pilot | Mature Platform |
|---|---|---|---|
| API integration (Layer B) | 160+ | 200+ | 250+ |
| Regression contracts (Layer E) | 15+ | 40+ | 60+ |
| Internal tools/worker (Layer D) | 10+ | 30+ | 70+ |
| Stateful journeys (Layer B+) | 10+ | 30+ | 50+ |
| Playwright/browser (Layer C) | 0 | 10+ | 80+ |
| Fast deterministic (Layer A) | 0 | 50+ | 150+ |
| **Total** | ~195 | ~360 | ~660 |
Pilot Readiness = what we need before 10-15 users hit the platform. Post-Pilot = bugs surfaced by real usage become regression tests; stable areas get deeper journeys. Mature Platform = full taxonomy, browser coverage, unit test layer, release validation lane.
4. Test Grouping by Risk
P0 Critical Areas
These get the deepest coverage first:
authentication and workspace scoping
chat and orchestration
memory and Mem0 integration
workflows and recipes
heartbeat and scheduler
internal tool execution
runner artifact contracts
document and knowledge retrieval
P1 High-Value Product Areas
agent configuration and model config
routing and channels
skills/plugins assignment
analytics and activity
settings pages
workspace file/exec
P2 Nice-to-Have Areas
UI polish states
rare edge cases
non-blocking admin screens
long-tail validation and degraded fallback flows
5. Schedule Matrix
Not all 500+ tests run all the time.
5.1 PR / Pre-Merge Lane
Runtime target: 5-10 minutes
Run:
fast deterministic tests
critical API smoke and key regressions
selected internal contract tests
optional tiny browser smoke pack
Purpose:
block obvious breakage
fast developer feedback
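One way to implement lane selection is marker-based filtering over the test inventory. The sketch below is an assumption, not the current runner logic, and the marker names ("fast", "smoke", "regression", "journey", "p0", "slow") are illustrative:

```python
# Sketch of schedule-lane selection by pytest-style markers.
# Lane definitions and marker names are assumptions for illustration.

LANES = {
    "pr": {"fast", "smoke", "regression"},         # 5-10 min pre-merge lane
    "health": {"smoke", "regression", "journey"},  # curated high-signal lane
    "nightly": None,                               # None = run everything
}

def select_tests(tests: list[dict], lane: str) -> list[dict]:
    """Return the tests whose markers intersect the lane's allow-set."""
    allowed = LANES[lane]
    if allowed is None:
        return list(tests)
    return [t for t in tests if allowed & set(t["markers"])]

inventory = [
    {"nodeid": "tests/test_chat.py::test_create", "markers": ["smoke", "p0"]},
    {"nodeid": "tests/test_memory.py::test_scoping_pin", "markers": ["regression"]},
    {"nodeid": "tests/test_documents.py::test_bulk_upload", "markers": ["slow"]},
]

pr_lane = select_tests(inventory, "pr")  # slow bulk-upload test excluded
```

The same inventory can feed every lane, which keeps lane membership a data question rather than a copy-paste question.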
5.2 API Health Check & Regression Detector
Script: python3 tests/run_health_regression.py
Runtime target: 3-8 minutes
Run:
curated high-signal API subset
chat, agent, memory, workflow, heartbeat checks
critical user journeys
required orchestrator regression tests
Outputs:
health-regression-report.json
health-regression-summary.json
qa-report.json
Purpose:
detect fresh regressions
drive QA analysis and Jira filing
5.3 Nightly Self-Test Suite
Script: python3 tests/run_nightly.py
Runtime target: 20-45 minutes
Run:
full API suite
required orchestrator regressions
growing internal platform validation
selected slower integration tests
Outputs:
test-report.jsontest-summary.json
Purpose:
broad nightly confidence
historical trend and run-level status
5.4 Weekly Test Coverage Gap Finder
Script: python3 tests/run_gap_finder.py
Runtime target: 1-5 minutes for audit mode, longer if later expanded with execution checks
Run:
test inventory scan
domain coverage analysis
journey vs smoke classification
missing-domain and weak-coverage detection
Outputs:
coverage-gap-summary.json
Purpose:
identify test debt
create weekly planning work
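The journey-vs-smoke classification could work as a simple heuristic over how many API steps each test performs. This sketch and its threshold are assumptions, not the actual run_gap_finder.py logic:

```python
# Rough sketch of journey-vs-smoke classification: a test that chains
# several API steps is a journey; a single-call test is smoke.
# The threshold of 3 is an illustrative assumption.

def classify_module(test_steps: dict[str, int], journey_threshold: int = 3) -> dict[str, str]:
    """Map test name -> 'journey' or 'smoke' by how many API steps it performs."""
    return {
        name: "journey" if steps >= journey_threshold else "smoke"
        for name, steps in test_steps.items()
    }

modules = {
    "test_chat_lifecycle": 5,  # create -> send -> history -> rename -> delete
    "test_agents_list": 1,     # single GET
    "test_memory_search": 2,
}
labels = classify_module(modules)
```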
5.5 Release Validation Lane
Runtime target: 45-120 minutes
Run:
nightly suite
full browser pack
role/permission matrix
worker/scheduler deep checks
document and channel flows
optional deployment-specific validation
Purpose:
release confidence
6. Runner Scripts and Artifact Contracts
6.1 Existing / Planned Runner Scripts
| Recipe | Script | Purpose |
|---|---|---|
| Nightly Self-Test Suite | tests/run_nightly.py | Broad full-suite validation |
| API Health Check & Regression Detector | tests/run_health_regression.py | Fast high-signal regression lane |
| Weekly Test Coverage Gap Finder | tests/run_gap_finder.py | Coverage inventory and planning |
6.2 Artifact Contracts
test-summary.json
Audience:
nightly recipes
high-level status reporting
Shape:
total tests
pass/fail counts
duration
failures with
nodeid,assertion_message,source_files
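A minimal sketch of building this artifact from raw results follows. Only nodeid, assertion_message, and source_files come from the contract above; the other field names are assumptions:

```python
# Sketch of assembling a test-summary.json-shaped artifact from raw results.
# Top-level field names beyond the documented failure fields are assumptions.
import json

def build_summary(results: list[dict], duration_s: float) -> dict:
    failures = [r for r in results if not r["passed"]]
    return {
        "total": len(results),
        "passed": len(results) - len(failures),
        "failed": len(failures),
        "duration_seconds": duration_s,
        "failures": [
            {
                "nodeid": f["nodeid"],
                "assertion_message": f.get("message", ""),
                "source_files": f.get("source_files", []),  # repo-relative paths
            }
            for f in failures
        ],
    }

raw = [
    {"nodeid": "tests/test_chat.py::test_rename", "passed": True},
    {"nodeid": "tests/test_memory.py::test_scope", "passed": False,
     "message": "expected workspace-scoped results",
     "source_files": ["app/memory/service.py"]},
]
summary = build_summary(raw, duration_s=412.8)
print(json.dumps(summary, indent=2))
```

Keeping the builder pure and deterministic is what makes the artifact a contract rather than a log: downstream agents can rely on every key being present on every run.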
health-regression-summary.json
Audience:
run-level health reporting
Jira overview
Shape:
targeted suite summary
curated test target list
high-signal failure list
qa-report.json
Audience:
qa-engineer
jira-admin
bug-fixer
Shape:
coverage-gap-summary.json
Audience:
weekly planning recipes
jira-admin
future test planner agents
Shape:
covered domains
missing expected domains
journey files
smoke files
module inventory
action items
7. Specialist Agents and Skills
7.1 Core Specialist Agents
QA Engineer
Primary responsibilities:
execute runner scripts
classify failures
enrich failures with logs
generate qa_report
Primary skill:
automatos-skills/qa-engineer
Preferred model profile:
balanced reasoning, high reliability, moderate cost
Why:
needs structured analysis, classification, and evidence correlation
Jira Admin
Primary responsibilities:
create/update issues from test artifacts
map severity to Jira priorities
export rich issue_details
Primary skill:
automatos-skills/jira-admin
Preferred model profile:
concise, schema-following, low hallucination
Why:
operational accuracy matters more than creativity
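Severity-to-priority mapping is one place where the "schema-following, low hallucination" requirement can be enforced in code rather than left to the model. The mapping below is illustrative; the actual table in automatos-skills/jira-admin may differ:

```python
# Illustrative severity -> Jira priority mapping for the jira-admin skill.
# These severity labels and priority names are assumptions, not the
# confirmed automatos-skills/jira-admin values.

SEVERITY_TO_PRIORITY = {
    "critical": "Highest",  # data loss, auth bypass, workspace leakage
    "high": "High",         # broken P0 journey
    "medium": "Medium",     # degraded P1 feature
    "low": "Low",           # cosmetic or P2 edge case
}

def jira_priority(severity: str) -> str:
    """Translate a qa-report severity into a Jira priority, defaulting safely."""
    return SEVERITY_TO_PRIORITY.get(severity.lower(), "Medium")
```

Making the mapping a lookup table means the agent never has to "reason" about priority, which is exactly the property this role calls for.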
Bug Fixer
Primary responsibilities:
reproduce
write failing test
apply minimal fix
verify and prepare PR
Primary skill:
automatos-skills/bug-fixer
Preferred model profile:
stronger code reasoning and repo navigation
Why:
this is the most expensive but most value-dense step
7.2 Future Specialist Agents (Post-Pilot)
These agents are not needed until the suite exceeds ~300 tests and real users are generating regression data. Do not build them prematurely.
Playwright Runner (Post-Pilot)
Responsibilities:
browser automation
UI regressions
journey validation
screenshot and trace capture
Contract Auditor (Post-Pilot)
Responsibilities:
verify response schemas
compare API output against expected structures
detect artifact drift
Flake Hunter (Mature Platform)
Responsibilities:
detect unstable tests across repeated runs
classify infra flake vs product regression
quarantine candidates
Coverage Planner (Mature Platform)
Responsibilities:
read coverage-gap-summary.json
group missing areas into actionable themes
propose next test additions and schedules
7.3 Model Routing by Task Type
| Agent | Model tier | Rationale |
|---|---|---|
| QA Engineer | Standard | classification + evidence synthesis |
| Jira Admin | Economy/Standard | structured summarization and ticketing |
| Bug Fixer | Premium | code reasoning, test-first fixes |
| Playwright Runner | Standard | interaction reliability |
| Coverage Planner | Standard | structured planning |
| Flake Hunter | Standard/Premium | pattern detection across failures |
8. Skill and Recipe Coordination
8.1 Dynamic Skill Use
The skills are context-aware, not hardcoded to only the three internal test recipes.
Rules:
If dedicated runner outputs exist, prefer them.
If they do not exist, operate from raw pytest, route output, browser output, or platform logs.
Skills must remain useful outside the recipe chain.
8.2 Preferred Recipe Flow
Flow A: Regression Bug Flow
qa-engineer runs run_health_regression.py
qa-engineer enriches qa-report.json with platform logs when failures exist
jira-admin reads qa_report
jira-admin creates Jira issues and exports issue_details
bug-fixer reads issue_details
bug-fixer reproduces, tests, fixes, verifies, PRs
Flow B: Nightly Oversight Flow
qa-engineer runs run_nightly.py
qa-engineer reviews test-summary.json
jira-admin creates aggregated nightly bugs/tasks as needed
Flow C: Weekly Planning Flow
qa-engineer or coverage planner runs run_gap_finder.py
jira-admin creates Tasks/Stories from coverage-gap-summary.json
planning agent groups future coverage work by domain and priority
9. Browser and Playwright Strategy
9.1 Browser Test Groups
Organize by journey, not by page:
tests/playwright/smoke/
tests/playwright/chat/
tests/playwright/agents/
tests/playwright/recipes/
tests/playwright/knowledge/
tests/playwright/admin/
tests/playwright/error_states/
9.2 Browser Priorities
Phase 1:
login / landing
create agent
chat basic flow
create/run recipe
document upload
Phase 2:
routing/channels/settings
activity feed
memory dashboard
browser error handling and retries
Phase 3:
role matrix
long multi-step workflows
mobile and cross-browser coverage
10. File and Folder Strategy
10.1 Recommended In-Repo Structure
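The tree below is a sketch of one plausible layout. Only the three runner scripts, tests/regressions/, and the tests/playwright/* groups are specified elsewhere in this PRD; the remaining folder names are illustrative assumptions:

```text
tests/
  run_nightly.py               # broad suite runner
  run_health_regression.py     # curated high-signal lane
  run_gap_finder.py            # coverage inventory
  api/                         # Layer B domain suites (assumed name)
  journeys/                    # stateful multi-step flows (assumed name)
  regressions/                 # Layer E pins
  internal/                    # Layer D tool/worker tests (assumed name)
  playwright/                  # Layer C browser journeys
    smoke/  chat/  agents/  recipes/  knowledge/  admin/  error_states/
```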
10.2 Naming Rules
test_<domain>.py for domain suites
test_<journey>.py for stateful multi-step flows
test_<bug_or_contract>.py for regressions
browser suites grouped by user intent, not component name
11. Implementation Plan
Milestone 1: Pilot Readiness (current priority)
Goal: Flush bugs and pin regressions so 10-15 pilot users hit a stable platform.
Exit criteria: All P0 areas have at least one stateful journey test. Known regressions from memory system, multi-tenancy, and cloud doc sync are pinned. Runners produce clean artifacts. ~160-200 total tests.
Stabilize the three runner scripts and lock artifact schemas
Create tests/regressions/ directory and pin known bugs:
Memory scoping regressions (Mem0 search, user_id format mismatch)
Multi-tenancy isolation (workspace fallback bug)
Cloud document sync (silent processing failure)
AgentFactory tool source divergence
Deepen P0 API tests with stateful journeys:
Chat: create → send → history → rename → delete
Memory: store → search → stats consistency
Workflows: create → execute → status → cancel
Heartbeat: enable → trigger → verify results
Agents: create → configure → assign tools → execute → delete
Add test data cleanup strategy (teardown fixtures or dedicated test workspace)
Add internal tool contract tests for platform_* executor paths
Ensure run_health_regression.py targets include new regression tests
Verify QA → Jira → Bug Fixer artifact handoff works end-to-end
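The stateful-journey items above can be sketched as a create-through-delete test with guaranteed cleanup. FakeChatAPI below is a self-contained stand-in so the pattern is runnable here; a real journey test would use the platform's HTTP client, and every method name is an assumption:

```python
# Sketch of the chat lifecycle journey (create -> send -> history -> rename
# -> delete) with teardown-style cleanup. FakeChatAPI is a stand-in for the
# real client; all names here are illustrative assumptions.
import uuid

class FakeChatAPI:
    def __init__(self):
        self.chats: dict[str, dict] = {}

    def create(self, title: str) -> str:
        chat_id = str(uuid.uuid4())
        self.chats[chat_id] = {"title": title, "messages": []}
        return chat_id

    def send(self, chat_id: str, text: str) -> None:
        self.chats[chat_id]["messages"].append(text)

    def history(self, chat_id: str) -> list[str]:
        return self.chats[chat_id]["messages"]

    def rename(self, chat_id: str, title: str) -> None:
        self.chats[chat_id]["title"] = title

    def delete(self, chat_id: str) -> None:
        del self.chats[chat_id]

def run_chat_journey(api: FakeChatAPI) -> None:
    chat_id = api.create("pilot-journey")
    try:
        api.send(chat_id, "hello")
        assert api.history(chat_id) == ["hello"]   # state persisted
        api.rename(chat_id, "renamed")
        assert api.chats[chat_id]["title"] == "renamed"
    finally:
        api.delete(chat_id)                        # cleanup always runs
    assert chat_id not in api.chats                # no leaked test data
```

The try/finally shape is the important part: it is what keeps journey tests from polluting a shared workspace, which is the cleanup concern flagged in this milestone.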
Milestone 2: Post-Pilot Hardening
Gate: 10-15 users actively using the platform and generating feedback.
Goal: Every user-reported bug becomes a regression test. Deepen coverage in areas users actually exercise.
Convert each pilot bug into a regression test in tests/regressions/
Expand API journeys to cover user-reported flows
Add internal tool and worker tests for scheduler, recipe execution, scratchpad
Begin Playwright smoke pack (login, create agent, basic chat) — only if UI bugs are a real problem
Grow to ~300-360 tests
Milestone 3: Mature Platform
Gate: Platform is stable, release cadence is predictable, API surface is not changing weekly.
Stand up full Playwright browser journey suite
Add fast deterministic unit tests for pure logic
Add flake detection and release validation lane
Reach 500+ tests across all layers
Evaluate specialist agents (Contract Auditor, Flake Hunter, Coverage Planner)
Review split-line for external quality repo only if lifecycle divergence demands it
12. Success Metrics
Pilot Readiness (Milestone 1)
| Metric | Target |
|---|---|
| P0 domains with at least one stateful journey test | 100% |
| Known regressions pinned as tests | 100% |
| qa-report.json directly usable by Jira Admin | 100% of runs |
| Health regression runtime | < 8 min |
| Nightly full runtime | < 30 min |
| Total tests | ~160-200 |
Post-Pilot (Milestone 2)
| Metric | Target |
|---|---|
| User-reported bugs encoded as regression tests | 90%+ |
| Bug Fixer able to reproduce from issue_details | 90%+ |
| All P1 domains covered | 100% |
| Total tests | ~300-360 |
Mature Platform (Milestone 3)
| Metric | Target |
|---|---|
| Total automated tests | 500+ |
| Browser journey coverage for critical flows | 80%+ |
| PR lane runtime | < 10 min |
| Nightly full runtime | < 45 min |
| Weekly gap finder output contains actionable tasks | 100% |
13. Open Questions and Decisions
Resolved
Should platform logs be fetched inside runner scripts or remain agent-enriched? Decision: Runner-enriched (deterministic). Runner scripts fetch logs when failures occur and embed them in qa-report.json. Agents can augment further but the baseline artifact must be self-contained.
Do we need a dedicated Coverage Planner agent now or later? Decision: Later (Milestone 3). The weekly gap finder output is consumed manually or by jira-admin until the suite is large enough to warrant a dedicated planner.
When do we add the 4 new specialist agents? Decision: Not until Milestone 2 at earliest. The 3 existing agents (QA Engineer, Jira Admin, Bug Fixer) handle Milestone 1. New agents are gated on real need, not PRD ambition.
Still Open
When should browser smoke move into the health regression lane? Only after Playwright tests are stable for 2+ weeks with <5% flake rate.
How do we quarantine flaky tests without hiding real instability? Needs a dedicated flake policy — likely a @pytest.mark.quarantine marker with a weekly review cadence.
When do we add load/performance lanes? After Milestone 2, only if user feedback indicates performance issues.
Test data cleanup strategy: Dedicated test workspace? Teardown fixtures? Or both? Needs decision before Milestone 1 regression tests create persistent test data.
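If the @pytest.mark.quarantine approach mentioned above is adopted, one possible shape is registering the marker so strict-marker enforcement accepts it:

```ini
; pytest.ini sketch — register the quarantine marker (an assumption about
; how the flake policy would be wired up, not an existing config)
[pytest]
markers =
    quarantine: flaky test under weekly review; excluded from gating lanes
```

Gating lanes would then run `pytest -m "not quarantine"` while a dedicated flake lane runs `pytest -m quarantine` and reports pass rates, so quarantined tests stay visible rather than silently skipped.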
14. Bottom Line
This PRD does not propose "write 500 tests." It proposes a quality operating system that grows with Automatos:
tests grouped by purpose and risk
scripts grouped by schedule
artifacts grouped by downstream agent
specialist skills aligned with those artifacts
coverage that deepens as the platform stabilizes and users surface real issues
Right now, the priority is Milestone 1: flush bugs, pin regressions, deepen P0 coverage, and make the platform solid for 10-15 pilot users. Everything else follows from that foundation.
The result is not a test count. It is a testing platform that your own agents can run, interpret, escalate, and repair — growing naturally from pilot to production.