PRD-56: Infrastructure Scaling, Physical Workspaces & Ephemeral Agent Compute
Changelog
Version | Date | Changes
Executive Summary
The Problem
The Solution
Phase | Infrastructure | Timeline | User Scale
Key Architecture Decisions
Table of Contents
1. Current Architecture Analysis
Execution Flow (As-Is)
Current Limitations
Limitation | Impact | Risk Level
Key Files Affected
File | Role | Lines
2. Target Architecture
Control Plane / Data Plane Separation
TaskRunner Interface (Core Abstraction)
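As a rough illustration of what a shared runner abstraction could look like, the sketch below defines an abstract base class, one in-process implementation, and a small factory. Only the names `TaskRunner` and `LocalTaskRunner` come from this document; the `run` method, the `TaskResult` fields, the `get_task_runner` helper, and the `"local"` mode string are assumptions made for the example, not the actual interface.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class TaskResult:
    # Hypothetical result shape; the field names are assumptions.
    task_id: str
    status: str
    output: Any = None


class TaskRunner(ABC):
    """Assumed common interface that every execution backend implements."""

    @abstractmethod
    async def run(self, task_id: str, payload: dict) -> TaskResult:
        """Execute one task and return its result."""


class LocalTaskRunner(TaskRunner):
    """Runs the task in-process; the work here is a placeholder."""

    async def run(self, task_id: str, payload: dict) -> TaskResult:
        # Placeholder: echo the payload back instead of running an agent.
        return TaskResult(task_id=task_id, status="completed", output=payload)


def get_task_runner(mode: str = "local") -> TaskRunner:
    """Hypothetical factory: map a configuration string to a runner."""
    runners = {"local": LocalTaskRunner}
    try:
        return runners[mode]()
    except KeyError:
        raise ValueError(f"unknown runner mode: {mode!r}") from None


if __name__ == "__main__":
    result = asyncio.run(get_task_runner("local").run("task-1", {"x": 1}))
    print(result.status)
```

With this shape, later backends (queued or Kubernetes-based) would register in the factory mapping without changes to callers.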
3. Phase 1: TaskRunner Abstraction (This Week)
Goal
User Stories
US-01: TaskRunner Abstract Interface
US-02: LocalTaskRunner Implementation
US-03: TaskRunner Factory & Configuration
US-04: Integration Points
Phase 1 File Structure
Phase 1 Data Models
4. Phase 2: Queue-Based Worker + Physical Workspaces (Weeks 2-6)
Goal
Architecture
Technology Choice: ARQ (Async Redis Queue)
Factor | ARQ | Celery
Task Lifecycle
Progress Streaming
User Stories
US-05: QueuedTaskRunner Implementation
US-06: Workspace Worker Service
US-07: Workspace Worker Dockerfile
US-08: Task Persistence & Recovery
US-09: Per-Workspace Queue & Storage Limits
US-10: Docker Compose Worker Profile
Phase 2 Infrastructure
Phase 2 on Railway
5. Physical Workspace Architecture (NEW in v2.0)
Goal
Workspace Filesystem Layout
Persistence Model
Directory | Lifecycle | Purpose | Size Impact
Repo Caching (Key Performance Win)
Storage Quotas
Workspace Metadata
6. Tool Execution Routing (NEW in v2.0)
Goal
Routing Matrix
Detailed Tool Classification
Tool | Location | Security Level | Rationale
SSH Execute — Pilot Decision
Tool Routing Implementation
Command Whitelist
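A minimal sketch of an allowlist check, assuming commands arrive as a single string and are later executed as an argument list (never through a shell). The `ALLOWED_COMMANDS` set and the `is_command_allowed` helper are hypothetical; the real list of permitted commands is not given here.

```python
import shlex

# Hypothetical allowlist; the actual set of permitted commands is an assumption.
ALLOWED_COMMANDS = frozenset({"git", "ls", "cat", "grep"})


def is_command_allowed(command_line: str) -> bool:
    """Return True only if the first token is an allowlisted executable."""
    try:
        argv = shlex.split(command_line)
    except ValueError:
        # Unbalanced quotes or similar parse errors are rejected outright.
        return False
    if not argv:
        return False
    return argv[0] in ALLOWED_COMMANDS
```

This check only inspects the executable name. It is safe only if the command is then run from the parsed argument list (for example with `subprocess.run(argv)` and no `shell=True`), so that tokens such as `&&` are passed as literal arguments instead of being interpreted by a shell.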
7. Workspace Worker Service (NEW in v2.0)
Service Location
Dockerfile
Worker Main (ARQ Consumer)
Workspace Manager
8. Pilot Security Model (NEW in v2.0)
Threat Model (15 Trusted Beta Users)
Path Traversal Prevention (Critical)
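One common way to contain file access to a workspace is to resolve the requested path and verify that it still lies under the workspace root. The sketch below shows that check under stated assumptions: the `resolve_in_workspace` name and its parameters are hypothetical, and `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path


def resolve_in_workspace(workspace_root: str, user_path: str) -> Path:
    """Resolve user_path under workspace_root, rejecting any escape."""
    root = Path(workspace_root).resolve()
    # resolve() collapses ".." segments and follows symlinks, so the
    # containment check below runs on the real target path.
    candidate = (root / user_path).resolve()
    if not candidate.is_relative_to(root):
        raise PermissionError(f"path escapes workspace: {user_path!r}")
    return candidate
```

Absolute inputs such as `/etc/passwd` replace the root when joined, so they fail the containment check as well.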
What Users CAN Do (Within Their Workspace)
What Users CANNOT Do
9. Phase 3: Kubernetes Ephemeral Pods (Months 3-6)
Goal
Architecture
K8s Primitives Mapping
Automatos Concept | K8s Primitive | Purpose
User Stories
US-11: KubernetesTaskRunner Implementation
US-12: Task Controller
US-13: Workspace Namespace Provisioning
US-14: Agent Task Pod Spec
US-15: KEDA Auto-Scaling
US-16: Agent-to-Agent Communication
K8s Job Manifest Template
10. Phase 4: Enterprise Multi-Tenant (Month 6+)
Goal
Capabilities
Dedicated Clusters
Bring-Your-Own-Cloud (BYOC)
Air-Gapped Deployments
Enterprise Features Matrix
Feature | Pro | Enterprise | Enterprise+ (BYOC)
11. Data Models & Schema
New Database Table: task_executions
Redis Key Structure (Phase 2+)
12. API Changes
New Endpoints (Phase 2+)
Existing Endpoint Changes
13. Security Model (Full)
Phase 2 Security (Workers)
Concern | Mitigation
Phase 3 Security (K8s)
Concern | Mitigation
Compliance Alignment
Standard | Phase 2 Coverage | Phase 3 Coverage
14. Cost Analysis
Phase 1: No Change
Phase 2: Railway + Physical Workspaces (Pilot — 15 Users)
Phase 3: Managed Kubernetes
Cost Per Workspace (Phase 3)
Plan | Est. Monthly Compute | Charge to Customer
15. Implementation Roadmap
Phase 1: TaskRunner Abstraction (Week 1)
Day | Task | Deliverable
Phase 2: Physical Workspaces + Queue Workers (Weeks 2-6)
Week | Task | Deliverable
Phase 3: Kubernetes (Months 3-6)
Month | Task | Deliverable
Phase 4: Enterprise (Month 6+)
Quarter | Task | Deliverable
16. Risk Assessment
Risk | Likelihood | Impact | Phase | Mitigation
Appendix A: Technology Decisions
Why GKE Autopilot (Recommended for Phase 3)
Factor | GKE Autopilot | EKS + Karpenter | AKS
Why ARQ over Celery (Phase 2)
Why Not Serverless Functions (Lambda/Cloud Functions)
Appendix B: Monitoring & Observability
Metrics to Track
Dashboard (Grafana)
Last updated

