PRD-56: Infrastructure Scaling, Physical Workspaces & Ephemeral Agent Compute

Version: 2.0
Status: Planning Phase → Phase 2 Implementation
Date: February 15, 2026 (v1.0) · February 25, 2026 (v2.0 — Physical Workspaces)
Author: Automatos Core Team
Prerequisites: PRD-37 (SaaS Foundation), PRD-54 (LLM Marketplace)
Blocks: None (Foundation for enterprise scaling)

Changelog

| Version | Date | Changes |
| --- | --- | --- |
| 1.0 | 2026-02-15 | Initial 4-phase roadmap: TaskRunner → ARQ Workers → K8s → Enterprise |
| 2.0 | 2026-02-25 | Major extension: Physical Workspace architecture for pilot launch. Added: persistent workspace volumes, workspace filesystem model, tool execution routing (API vs Worker), command sandboxing & whitelist, storage quotas (5GB default), credential injection, Railway Volume integration, services/workspace-worker/ layout, security model for 15-user pilot. Extended Phase 2 from ephemeral /tmp/ to persistent /workspaces/{id}/ with repo caching. |


Executive Summary

This PRD defines the infrastructure evolution path for Automatos AI — from the current Railway-hosted pilot to a fully scalable, workspace-isolated, enterprise-grade compute platform. The core architectural change: agent tasks execute in isolated compute environments with persistent physical workspaces rather than in-process with the API server.

The Problem

Today, all agent execution (workflows, subtasks, tool calls) runs inside the FastAPI process via asyncio.create_task(). This means:

  • No isolation — One workspace's heavy agent task starves all others

  • No persistence — Tasks are lost if the server restarts

  • No physical workspace — Agents can't clone repos, run tests, or persist build artifacts between tasks

  • No resource limits — Can't enforce plan-tier CPU/memory caps

  • No security boundary — Agent code execution (shell tools, file ops) shares the API server's filesystem and network

  • No horizontal scaling — Everything runs in one process on one container

  • No auditability — No infrastructure-level task lifecycle tracking

The Solution

A 4-phase migration introducing a TaskRunner abstraction that decouples task dispatch from task execution, with physical workspaces as the foundation for agent compute:

| Phase | Infrastructure | Timeline | User Scale |
| --- | --- | --- | --- |
| Phase 1 (Now) | Railway + LocalTaskRunner | Week 1 | Pilot (<50 users) |
| Phase 2 (Soft Launch) | Railway + ARQ Workers + Physical Workspaces + Persistent Volume | Weeks 2-6 | Pilot (15 users) |
| Phase 3 (Scale) | Managed Kubernetes + Ephemeral Pods | Months 3-6 | Growth (500+ workspaces) |
| Phase 4 (Enterprise) | Multi-cluster / Bring-Your-Own-Cloud | Month 6+ | Enterprise tenants |

Key Architecture Decisions

1. TaskRunner Interface: Abstract boundary between task dispatch and execution. Swap implementations without touching business logic:

2. Physical Workspaces (NEW in v2.0): Each workspace gets a persistent filesystem on the worker volume. Repos clone once and persist. Test results, build artifacts, and data survive between tasks. Agents work in a real development environment — not a throwaway /tmp/ dir.

3. Tool Execution Routing: Tools split between API (instant, stateless) and Worker (filesystem, subprocess). Agent code is unaware of which backend runs which tool.
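Decision 1's interface can be sketched in outline. The field and method names below follow the models US-01 lists (AgentTask, TaskHandle, TaskResult, TaskStatus, TaskEvent); the exact signatures are illustrative assumptions, not the final API:

```python
import abc
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Optional


class TaskStatus(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


@dataclass
class AgentTask:
    task_id: str
    workspace_id: str
    task_type: str  # e.g. "workflow", "subtask", "tool_call"
    payload: dict = field(default_factory=dict)


@dataclass
class TaskHandle:
    task_id: str
    status: TaskStatus = TaskStatus.QUEUED


@dataclass
class TaskEvent:
    task_id: str
    event_type: str  # e.g. "progress", "log"
    data: dict = field(default_factory=dict)


@dataclass
class TaskResult:
    task_id: str
    status: TaskStatus
    output: Any = None
    error: Optional[str] = None


class TaskRunner(abc.ABC):
    """Abstract boundary between task dispatch and task execution."""

    @abc.abstractmethod
    async def submit_task(self, task: AgentTask) -> TaskHandle: ...

    @abc.abstractmethod
    async def get_result(self, task_id: str) -> TaskResult: ...

    @abc.abstractmethod
    async def cancel(self, task_id: str) -> bool: ...
```

Because the interface is async end-to-end, LocalTaskRunner (Phase 1), QueuedTaskRunner (Phase 2), and KubernetesTaskRunner (Phase 3) can be swapped via configuration alone.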


Table of Contents


1. Current Architecture Analysis

Execution Flow (As-Is)

Current Limitations

| Limitation | Impact | Risk Level |
| --- | --- | --- |
| In-process execution (asyncio.create_task) | Tasks lost on restart/deploy | High |
| No resource isolation between workspaces | Noisy neighbor, DoS risk | High |
| Shared filesystem for tool execution | Cross-tenant data leakage | Critical (enterprise blocker) |
| No task queue persistence | Cannot retry failed tasks | Medium |
| Single-process concurrency limit | ~50 concurrent agent tasks max | Medium |
| No per-workspace resource quotas | Can't enforce plan limits | Medium |
| No task priority system | Free-tier tasks block paid | Low (pilot only) |

Key Files Affected

| File | Role | Lines |
| --- | --- | --- |
| modules/agents/execution/execution_manager.py | Agent task dispatch & tracking | 1,309 |
| modules/agents/factory/agent_factory.py | Agent runtime & LLM calls | 2,499 |
| modules/orchestrator/service.py | 9-stage workflow pipeline | ~800 |
| api/workflows.py | Workflow execution endpoints | ~1,100 |
| api/workflow_recipes.py | Recipe execution endpoints | ~800 |
| consumers/chatbot/service.py | Chat-triggered agent execution | ~1,300 |


2. Target Architecture

Control Plane / Data Plane Separation

TaskRunner Interface (Core Abstraction)


3. Phase 1: TaskRunner Abstraction (This Week)

Goal

Introduce the TaskRunner interface and LocalTaskRunner implementation without changing any runtime behavior. All existing agent execution paths route through the new abstraction.

User Stories

US-01: TaskRunner Abstract Interface

Description: Define the core TaskRunner ABC with data models for AgentTask, TaskHandle, TaskResult, TaskStatus, and TaskEvent.

Acceptance Criteria:

US-02: LocalTaskRunner Implementation

Description: Implement LocalTaskRunner that wraps current asyncio.create_task() behavior behind the TaskRunner interface. Zero behavior change.

Acceptance Criteria:

US-03: TaskRunner Factory & Configuration

Description: Factory function that returns the correct TaskRunner based on environment configuration.

Acceptance Criteria:
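A minimal sketch of such a factory, assuming a TASK_RUNNER environment variable and a registry keyed by runner name (both hypothetical; the real configuration mechanism may differ):

```python
import os

# Hypothetical registry of runner implementations, keyed by name.
# Phase 2/3 runners register themselves here as they land.
_RUNNERS: dict = {}


def register_runner(name: str):
    def decorator(cls):
        _RUNNERS[name] = cls
        return cls
    return decorator


@register_runner("local")
class LocalTaskRunner:
    """Phase 1 default: wraps asyncio.create_task() behind the interface."""


def get_task_runner():
    """Return the runner selected by the TASK_RUNNER env var (default: local)."""
    name = os.environ.get("TASK_RUNNER", "local")
    try:
        return _RUNNERS[name]()
    except KeyError:
        raise ValueError(f"Unknown TASK_RUNNER: {name!r}") from None
```

Failing loudly on an unknown runner name keeps a misconfigured deploy from silently falling back to in-process execution.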

US-04: Integration Points

Description: Identify (but don't yet modify) all call sites that will route through TaskRunner in Phase 2.

Acceptance Criteria:

Phase 1 File Structure

Phase 1 Data Models


4. Phase 2: Queue-Based Worker + Physical Workspaces (Weeks 2-6)

Goal

Move agent task execution from the API process to a dedicated workspace worker container connected via a Redis task queue. Each workspace gets a persistent physical filesystem on a Railway Volume. Agents can clone repos, run tests, save artifacts, and work in a real development environment.

Architecture

Key difference from v1.0: The worker mounts a persistent Railway Volume instead of using throwaway /tmp/ dirs. Repos survive between tasks. Each workspace gets its own directory tree.

Technology Choice: ARQ (Async Redis Queue)

Why ARQ over Celery:

| Factor | ARQ | Celery |
| --- | --- | --- |
| Async native | Yes (asyncio) | No (sync workers, needs eventlet/gevent) |
| Dependencies | Just redis | Heavy (kombu, billiard, vine, amqp) |
| FastAPI compatibility | Native (same event loop) | Requires adapter |
| Memory footprint | ~30MB per worker | ~80MB per worker |
| Configuration | Minimal | Complex (broker, backend, serializer) |
| Our stack | Already using Redis | Would need Redis anyway |

Task Lifecycle

Progress Streaming
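One plausible shape for the streaming path, assuming one Redis pub/sub channel per task (the channel naming and event fields below are illustrative, not the shipped protocol):

```python
import json


def progress_channel(task_id: str) -> str:
    """Pub/sub channel the worker publishes task events to."""
    return f"task:{task_id}:events"


def format_sse(event: dict) -> str:
    """Serialize one task event as a Server-Sent Events frame for the API side."""
    return f"event: {event.get('type', 'progress')}\ndata: {json.dumps(event)}\n\n"


# Worker side (sketch):
#   await redis.publish(progress_channel(task_id),
#                       json.dumps({"type": "progress", "pct": 40}))
# API side (sketch), inside the SSE endpoint's generator:
#   async for msg in pubsub.listen():
#       if msg["type"] == "message":
#           yield format_sse(json.loads(msg["data"]))
```

The API process never touches the worker filesystem; it only relays events, so the existing SSE endpoints keep working unchanged.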

User Stories

US-05: QueuedTaskRunner Implementation

Description: TaskRunner implementation that enqueues tasks to Redis and returns results via Redis pub/sub.

Acceptance Criteria:
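A sketch of the enqueue side, assuming an ARQ-style pool object exposing enqueue_job (the worker function name run_agent_task is hypothetical):

```python
import uuid


class QueuedTaskRunner:
    """Sketch: enqueue agent tasks to Redis via an ARQ-style pool.

    `pool` is anything exposing `await enqueue_job(name, *args, _job_id=...)`,
    matching arq's ArqRedis interface.
    """

    def __init__(self, pool):
        self.pool = pool

    async def submit_task(self, task: dict) -> str:
        task_id = task.get("task_id") or uuid.uuid4().hex
        # Route everything through one worker entry point; the worker
        # dispatches internally on task["task_type"].
        await self.pool.enqueue_job("run_agent_task", task, _job_id=task_id)
        return task_id
```

Using the task ID as the job ID makes re-submission idempotent: ARQ rejects a second job with the same ID while the first is pending.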

US-06: Workspace Worker Service

Description: Standalone worker process that consumes tasks from Redis queue and executes agent work within persistent physical workspaces.

Acceptance Criteria:

US-07: Workspace Worker Dockerfile

Description: Docker image for the workspace worker with full DevOps toolchain.

Acceptance Criteria:

US-08: Task Persistence & Recovery

Description: Tasks survive API server restarts and worker crashes.

Acceptance Criteria:

US-09: Per-Workspace Queue & Storage Limits

Description: Enforce concurrent task limits and storage quotas based on workspace plan tier.

Acceptance Criteria:

US-10: Docker Compose Worker Profile

Description: Add workspace worker to docker-compose for local development.

Acceptance Criteria:

Phase 2 Infrastructure

Phase 2 on Railway

Railway supports multiple services per project. The workspace worker deploys as a separate service with a persistent volume:

Isolation guarantee: The API, workspace-worker, agent-opt-worker, and frontend are separate Railway containers. They share Postgres and Redis via internal networking, but their filesystems are completely isolated. A task running in the workspace-worker cannot access the API container's filesystem, and vice versa.

Cost impact: ~$10-20/mo additional for 1 workspace-worker replica + ~$2/mo for persistent volume.


5. Physical Workspace Architecture (NEW in v2.0)

Goal

Give each workspace a persistent, isolated filesystem on the worker volume. Agents work in a real development environment — they can clone repos, run tests, build projects, and persist results between tasks. This is the foundation for Automatos AI's DevOps capabilities.

Workspace Filesystem Layout

Persistence Model

| Directory | Lifecycle | Purpose | Size Impact |
| --- | --- | --- | --- |
| repos/ | Persistent — survives across all tasks | Cloned repos; git pull instead of re-clone | High (biggest consumer) |
| tasks/ | Ephemeral — cleaned after each task completes | Scratch space for active execution | Low (auto-cleaned) |
| artifacts/ | Persistent — kept until workspace cleanup | Test reports, coverage, build outputs | Medium (user-managed) |
| .ssh/ | Persistent — injected from credential store | Deploy keys for private repo access | Negligible |
| .gitconfig | Persistent — set once | Git author identity for commits | Negligible |

Repo Caching (Key Performance Win)
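The pull-instead-of-re-clone behavior can be sketched as a small decision function; the paths and git flags are illustrative, not confirmed worker internals:

```python
import subprocess
from pathlib import Path


def repo_dir(workspace_root: str, repo_name: str) -> Path:
    return Path(workspace_root) / "repos" / repo_name


def sync_command(workspace_root: str, repo_url: str, repo_name: str) -> list:
    """Pull if the repo is already cached; clone only on first use."""
    dest = repo_dir(workspace_root, repo_name)
    if (dest / ".git").is_dir():
        return ["git", "-C", str(dest), "pull", "--ff-only"]
    return ["git", "clone", "--depth", "1", repo_url, str(dest)]


def sync_repo(workspace_root: str, repo_url: str, repo_name: str) -> None:
    subprocess.run(sync_command(workspace_root, repo_url, repo_name), check=True)
```

On a warm workspace this turns a multi-minute clone of a large repo into a seconds-long fetch, which is the performance win the persistent repos/ directory exists for.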

Storage Quotas

Each workspace has a configurable storage limit. Enforced before task execution starts:

Enforcement flow:
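A sketch of the pre-task check described above; the helper names and the use of a plain RuntimeError are assumptions:

```python
import os

DEFAULT_QUOTA_BYTES = 5 * 1024**3  # 5GB pilot default


def workspace_usage_bytes(root: str) -> int:
    """Walk the workspace tree and sum file sizes (symlinks not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def check_quota(root: str, quota_bytes: int = DEFAULT_QUOTA_BYTES) -> None:
    """Raise before task execution starts if the workspace is over quota."""
    used = workspace_usage_bytes(root)
    if used >= quota_bytes:
        raise RuntimeError(
            f"Workspace over quota: {used} / {quota_bytes} bytes; "
            "clean up repos/ or artifacts/ before running new tasks."
        )
```

Checking before execution (rather than during) keeps a task from dying mid-clone; the monitoring alert at 80% mentioned in the risk assessment gives users time to clean up first.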

For the 15-user pilot at 5GB each, the ceiling is 75GB. Railway persistent volumes support up to 100GB on the Pro plan, leaving ample headroom.

Workspace Metadata

Each workspace stores metadata for tracking and quota enforcement:


6. Tool Execution Routing (NEW in v2.0)

Goal

Define which tools execute in the API process (instant, stateless) vs. the workspace worker (filesystem, subprocess). Agents are unaware of the routing — the TaskRunner handles dispatch transparently.

Routing Matrix

Detailed Tool Classification

| Tool | Location | Security Level | Rationale |
| --- | --- | --- | --- |
| search_codebase | API | SAFE | Reads from CodeGraph index in Postgres, no filesystem |
| semantic_search | API | SAFE | Reads from pgvector, no filesystem |
| search_documents | API | SAFE | Reads from document index in Postgres |
| search_images | API | SAFE | Reads from image index in Postgres |
| search_tables | API | SAFE | Reads structured data from Postgres |
| database_query | API | CAUTIOUS | NL2SQL against Postgres (read-only) |
| composio_execute | API | CAUTIOUS | Calls external APIs (Jira, Slack, GitHub) |
| http_request | API | CAUTIOUS | Whitelisted HTTP calls to internal/platform URLs |
| read_file | Worker | CAUTIOUS | Reads files from workspace filesystem |
| write_file | Worker | CAUTIOUS | Writes files to workspace filesystem |
| create_directory | Worker | CAUTIOUS | Creates dirs in workspace filesystem |
| list_directory | Worker | SAFE | Lists workspace directory contents |
| execute_command | Worker | DANGEROUS | Runs shell commands (git, pytest, npm, etc.) |
| ssh_execute | Disabled for pilot | DANGEROUS | See notes below |

SSH Execute — Pilot Decision

ssh_execute lets agents SSH into arbitrary hosts. For the 15-user pilot:

Decision: DISABLE for pilot. Agents use execute_command locally in their workspace instead. Re-enable in Phase 3 with per-workspace allowed-host configuration.

Future (post-pilot): Each workspace registers allowed SSH targets via credentials. Agents can only SSH to hosts that workspace has credentials for.

Tool Routing Implementation

The WorkspaceToolExecutor wraps all worker-side tools with path validation and sandboxing:
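A sketch of that wrapper; only read_file/write_file are shown, and Path.is_relative_to stands in for the full resolve_safe_path() check described in the security sections:

```python
from pathlib import Path


class WorkspaceToolExecutor:
    """Every worker-side tool resolves its paths inside the workspace root."""

    def __init__(self, workspace_root: str):
        self.root = Path(workspace_root).resolve()

    def _safe(self, relative: str) -> Path:
        # resolve() collapses ".." and follows symlinks before the
        # containment check, so traversal via either is rejected.
        candidate = (self.root / relative).resolve()
        if not candidate.is_relative_to(self.root):
            raise PermissionError(f"Path escapes workspace: {relative}")
        return candidate

    def read_file(self, path: str) -> str:
        return self._safe(path).read_text()

    def write_file(self, path: str, content: str) -> None:
        target = self._safe(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
```

Because validation happens inside the executor rather than in each tool, every file tool gets the same boundary for free.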

Command Whitelist
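The whitelist might look like the following; the exact command set and blocked patterns are placeholders, not the shipped list:

```python
import shlex

# Hypothetical pilot whitelist; the real list ships with the worker config.
ALLOWED_COMMANDS = {
    "git", "python", "pip", "pytest", "node", "npm", "npx",
    "go", "cargo", "make", "ls", "cat", "grep", "curl", "wget",
}
# Reject shell metacharacters outright; there is no shell to expand them,
# but defense in depth costs nothing.
BLOCKED_PATTERNS = ("sudo", "docker", "kubectl", "rm -rf /",
                    "&&", "||", ";", "|", "`", "$(")


def validate_command(command: str) -> list:
    """Split without shell expansion and enforce the whitelist."""
    for pattern in BLOCKED_PATTERNS:
        if pattern in command:
            raise PermissionError(f"Blocked pattern in command: {pattern!r}")
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        raise PermissionError(f"Command not whitelisted: {argv[0] if argv else ''}")
    return argv  # pass to subprocess with shell=False and cwd pinned to the workspace
```

Returning an argv list (never a string) means the subprocess is launched with shell=False, which is what makes the "no shell expansion on user input" mitigation in the risk table hold.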


7. Workspace Worker Service (NEW in v2.0)

Service Location

Follows existing pattern alongside services/agent-opt-worker/:

Key difference from agent-opt-worker: The agent-opt-worker is a FastAPI HTTP service (request/response). The workspace-worker is an ARQ queue consumer (pull-based, long-running tasks).

Dockerfile

Worker Main (ARQ Consumer)

Workspace Manager


8. Pilot Security Model (NEW in v2.0)

Threat Model (15 Trusted Beta Users)

Not bulletproof, but reasonable for a trusted pilot. Hardening continues progressively in Phase 3.

Path Traversal Prevention (Critical)
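A sketch of the resolve_safe_path() check referenced in the risk assessment: symlink resolution via Path.resolve plus a null-byte guard. The exact signature is an assumption:

```python
from pathlib import Path


def resolve_safe_path(workspace_root: str, user_path: str) -> Path:
    """Resolve a tool-supplied path, rejecting traversal, symlink escapes,
    and null bytes."""
    if "\x00" in user_path:
        raise PermissionError("Null byte in path")
    root = Path(workspace_root).resolve()
    # resolve() follows symlinks, so a link pointing outside the
    # workspace fails the containment check below.
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise PermissionError(f"Path escapes workspace: {user_path}")
    return candidate
```

The null-byte check matters because some C-backed syscalls truncate at the first null, which would let "safe\x00/../../etc" validate as one path and open as another.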

What Users CAN Do (Within Their Workspace)

  • Clone any public or credentialed-private repo

  • Run test suites (pytest, vitest, jest, go test, cargo test)

  • Install dependencies (pip install, npm install — into workspace)

  • Read and write any file in their workspace

  • Run linters, formatters, type checkers

  • Build projects (make, npm run build, cargo build)

  • Save test reports, coverage data, build artifacts

  • Use curl/wget for API testing

What Users CANNOT Do

  • Access another workspace's files (path validation)

  • Run privileged commands (sudo, docker, kubectl)

  • Access system files outside /workspaces/{their_id}

  • SSH to external servers (disabled for pilot)

  • Send HTTP requests to non-whitelisted domains

  • Exceed their storage quota

  • Run tasks longer than the timeout (killed)

  • Access Postgres/Redis connection strings (not in subprocess env)


9. Phase 3: Kubernetes Ephemeral Pods (Months 3-6)

Goal

Replace static worker containers with dynamically scheduled Kubernetes Jobs. Each agent task runs in its own pod with workspace-scoped resource limits, network policies, and ephemeral storage.

Architecture

K8s Primitives Mapping

| Automatos Concept | K8s Primitive | Purpose |
| --- | --- | --- |
| Agent task | Job | Run-to-completion workload |
| Task workspace | Pod with emptyDir volume | Isolated filesystem |
| Workspace isolation | Namespace per workspace | Resource & network boundary |
| Plan limits | ResourceQuota | CPU/memory caps per workspace |
| Per-task limits | LimitRange | Default CPU/memory per pod |
| Task timeout | activeDeadlineSeconds | Kill runaway tasks |
| Auto-cleanup | ttlSecondsAfterFinished | Remove completed job pods |
| Security boundary | NetworkPolicy | Restrict pod network access |
| Repo cloning | emptyDir with sizeLimit | Temp disk for git clone |
| Inter-agent comms | Redis pub/sub (existing) | Cross-pod messaging |
| Task scaling | KEDA ScaledJob | Scale from zero on queue depth |

User Stories

US-11: KubernetesTaskRunner Implementation

Description: TaskRunner that creates K8s Jobs for agent tasks.

Acceptance Criteria:

US-12: Task Controller

Description: Long-running controller that watches the Redis queue and creates K8s Jobs.

Acceptance Criteria:

US-13: Workspace Namespace Provisioning

Description: Automatic K8s namespace creation and configuration per workspace.

Acceptance Criteria:

US-14: Agent Task Pod Spec

Description: Pod template for agent task execution.

Acceptance Criteria:

US-15: KEDA Auto-Scaling

Description: Scale agent pods from zero based on queue depth.

Acceptance Criteria:

US-16: Agent-to-Agent Communication

Description: Enable pods to communicate with other agent tasks in the same workspace.

Acceptance Criteria:

K8s Job Manifest Template
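Since the template itself isn't reproduced here, a sketch of the Job spec as a Python dict (the kubernetes client's create_namespaced_job accepts dict bodies); the image, resource limits, and timeouts are illustrative defaults, not confirmed platform settings:

```python
def agent_job_manifest(workspace_id: str, task_id: str, image: str) -> dict:
    """Build a Job spec mapping the primitives in the table above."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"agent-task-{task_id}",
            "namespace": f"ws-{workspace_id}",   # namespace per workspace
            "labels": {"app": "agent-task", "workspace": workspace_id},
        },
        "spec": {
            "backoffLimit": 0,
            "activeDeadlineSeconds": 1800,       # kill runaway tasks
            "ttlSecondsAfterFinished": 300,      # auto-cleanup of finished pods
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "automountServiceAccountToken": False,  # no K8s API access
                    "securityContext": {"runAsNonRoot": True},
                    "containers": [{
                        "name": "agent",
                        "image": image,
                        "args": ["--task-id", task_id],
                        "resources": {
                            "requests": {"cpu": "500m", "memory": "512Mi"},
                            "limits": {"cpu": "2", "memory": "2Gi"},
                        },
                        "volumeMounts": [
                            {"name": "workdir", "mountPath": "/workspace"},
                        ],
                    }],
                    "volumes": [{
                        "name": "workdir",
                        "emptyDir": {"sizeLimit": "5Gi"},  # temp disk for git clone
                    }],
                },
            },
        },
    }
```

ResourceQuota, LimitRange, and NetworkPolicy live at the namespace level and are applied once during workspace provisioning (US-13), not per Job.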


10. Phase 4: Enterprise Multi-Tenant (Month 6+)

Goal

Support enterprise customers with dedicated compute, compliance requirements, and optional bring-your-own-cloud deployments.

Capabilities

Dedicated Clusters

  • Enterprise tenants get their own K8s cluster (or dedicated node pool)

  • Full network isolation from other tenants

  • Custom retention, compliance, and audit policies

  • SOC 2 / ISO 27001 scope per cluster

Bring-Your-Own-Cloud (BYOC)

  • Deploy agent worker pods into customer's cloud account

  • Customer provides K8s cluster credentials

  • Automatos control plane remains hosted

  • Agent tasks execute within customer's network perimeter

  • Data never leaves customer's environment

Air-Gapped Deployments

  • Full Automatos stack as Helm chart

  • Runs entirely within customer infrastructure

  • Offline LLM support (local models via Ollama/vLLM)

  • Manual update distribution

Enterprise Features Matrix

| Feature | Pro | Enterprise | Enterprise+ (BYOC) |
| --- | --- | --- | --- |
| Workspace namespaces | Shared cluster | Dedicated node pool | Customer cluster |
| Data residency | Multi-region | Specific region | Customer-controlled |
| Network isolation | NetworkPolicy | VPC peering | Customer VPC |
| Compliance | SOC 2 shared | SOC 2 dedicated | Customer-audited |
| SLA | 99.5% | 99.9% | Customer-managed |
| Agent image customization | No | Base + extensions | Full control |
| Max concurrent tasks | 10 | 50 | Unlimited |


11. Data Models & Schema

New Database Table: task_executions

Redis Key Structure (Phase 2+)
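Illustrative key builders, assuming the conventions hinted at elsewhere in this PRD; the actual key names may differ:

```python
# Hypothetical Phase 2 key conventions.

def task_key(task_id: str) -> str:
    return f"task:{task_id}"           # task payload + status hash

def workspace_queue_key(workspace_id: str) -> str:
    return f"ws:{workspace_id}:queue"  # per-workspace pending tasks

def workspace_active_key(workspace_id: str) -> str:
    return f"ws:{workspace_id}:active"  # concurrent-task counter for plan limits

def task_events_channel(task_id: str) -> str:
    return f"task:{task_id}:events"    # pub/sub progress channel
```

Centralizing key construction in one module keeps the API, the worker, and the future K8s task controller from drifting apart on naming.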


12. API Changes

New Endpoints (Phase 2+)

Existing Endpoint Changes

No breaking changes. The POST /api/workflows/{id}/execute endpoint behaves identically — internally it calls TaskRunner.submit_task() instead of asyncio.create_task(). Execution IDs and SSE streaming are unchanged.


13. Security Model (Full)

Phase 2 Security (Workers)

| Concern | Mitigation |
| --- | --- |
| Cross-workspace data | Per-workspace directory isolation with path validation; tasks/ scratch cleaned after each task completes |
| Credential leakage | LLM keys loaded per-task from workspace credentials (Fernet-encrypted) |
| Resource exhaustion | Docker resource limits per worker container |
| Network access | Workers connect to Redis + Postgres + LLM APIs only |

Phase 3 Security (K8s)

| Concern | Mitigation |
| --- | --- |
| Cross-workspace data | Namespace isolation + NetworkPolicy |
| Pod escape | runAsNonRoot, readOnlyRootFilesystem, no privileged containers |
| K8s API access | No service account token mounted, RBAC minimal |
| Network lateral movement | NetworkPolicy: deny all ingress, egress allow-list only |
| Secret management | K8s Secrets + External Secrets Operator (AWS SM / Vault) |
| Image supply chain | Signed images, vulnerability scanning (Trivy) |
| DDoS via task submission | Per-workspace rate limits + queue depth limits |

Compliance Alignment

| Standard | Phase 2 Coverage | Phase 3 Coverage |
| --- | --- | --- |
| SOC 2 Type II | Partial (audit logs, encryption) | Full (isolation, access controls) |
| GDPR | Data residency via region selection | Per-namespace data isolation |
| ISO 27001 | Encryption at rest/transit | Full security controls |
| HIPAA | Not applicable yet | Dedicated clusters (Phase 4) |


14. Cost Analysis

Phase 1: No Change

  • Railway: ~$20-40/mo (current pilot)

  • No additional infra cost

Phase 2: Railway + Physical Workspaces (Pilot — 15 Users)

  • Backend service (API): ~$10/mo

  • Workspace worker (1 replica): ~$10-15/mo

  • Agent-opt worker (existing): ~$5/mo

  • Persistent volume (100GB): ~$2/mo

  • Postgres: ~$10/mo

  • Redis: ~$5/mo

  • Total: ~$45-55/mo

  • Storage per workspace: 5GB default (15 users × 5GB = 75GB max capacity)

  • Per-workspace cost: ~$3/mo (amortized across pilot users)

Phase 3: Managed Kubernetes

  • GKE Autopilot (recommended):

    • Control plane: $72/mo (free tier available)

    • Pods: $0.0445/vCPU-hour + $0.0049/GB-hour

    • Estimated for 100 workspaces (avg 2 tasks/day, 10 min each):

      • ~$150-250/mo compute

      • ~$50/mo networking

      • Total: ~$300-400/mo

  • AWS EKS + Karpenter:

    • Control plane: $72/mo

    • Spot instances for workers: ~$200/mo

    • Total: ~$350-500/mo

Cost Per Workspace (Phase 3)

| Plan | Est. Monthly Compute | Charge to Customer |
| --- | --- | --- |
| Starter (2 tasks/day) | ~$1.50 | $29/mo |
| Pro (20 tasks/day) | ~$15 | $99/mo |
| Enterprise (100 tasks/day) | ~$75 | $499/mo |

Healthy margins at scale. The ephemeral model means idle workspaces cost $0.


15. Implementation Roadmap

Phase 1: TaskRunner Abstraction (Week 1)

Effort: 2-3 days

| Day | Task | Deliverable |
| --- | --- | --- |
| 1 | Models + ABC + LocalTaskRunner | core/task_runner/ package |
| 2 | Factory + Configuration | get_task_runner(), env config |
| 2 | Integration point documentation | Call site inventory |
| 3 | Tests | Unit tests for LocalTaskRunner |

Phase 2: Physical Workspaces + Queue Workers (Weeks 2-6)

Effort: 3-4 weeks

| Week | Task | Deliverable |
| --- | --- | --- |
| 2 | ARQ integration + QueuedTaskRunner | core/task_runner/queued.py |
| 2 | WorkspaceManager (dir provisioning, quotas) | services/workspace-worker/workspace_manager.py |
| 2-3 | WorkspaceToolExecutor (sandboxed commands) | services/workspace-worker/executor.py |
| 3 | Worker Dockerfile + DevOps toolchain | services/workspace-worker/Dockerfile |
| 3 | ARQ consumer entry point | services/workspace-worker/main.py |
| 3-4 | Wire TaskRunner into execution pipeline | Replace asyncio.create_task() calls |
| 4 | Tool routing (API vs Worker split) | Tool registry update + dispatcher |
| 4 | Storage quota enforcement + command whitelist | Security layer |
| 5 | Credential injection (SSH keys, git config) | Per-workspace credential flow |
| 5 | Docker Compose + Railway deployment | Multi-service with persistent volume |
| 5-6 | Path traversal hardening + security testing | Penetration test workspace isolation |
| 6 | End-to-end testing (clone → test → fix → push) | DevOps workflow validation |

Phase 3: Kubernetes (Months 3-6)

Effort: 4-6 weeks

| Month | Task | Deliverable |
| --- | --- | --- |
| 3 | KubernetesTaskRunner | core/task_runner/kubernetes.py |
| 3 | Task Controller | worker/controller.py |
| 3-4 | Namespace provisioning | Auto-namespace per workspace |
| 3-4 | Migrate workspace volumes to PersistentVolumeClaims | Per-workspace PVCs |
| 4 | NetworkPolicy + RBAC | Security boundaries |
| 4-5 | KEDA autoscaling | Scale from zero |
| 5 | Helm chart | Deployment package |
| 5-6 | Load testing + hardening | Production readiness |

Phase 4: Enterprise (Month 6+)

Effort: Ongoing

| Quarter | Task | Deliverable |
| --- | --- | --- |
| Q3 2026 | Dedicated node pools | Enterprise isolation |
| Q3 2026 | External Secrets Operator | Vault/AWS SM integration |
| Q3 2026 | Per-workspace PersistentVolumeClaims | True storage isolation |
| Q4 2026 | BYOC agent deployment | Customer-cluster support |
| Q4 2026 | Helm chart for air-gap | Self-hosted package |


16. Risk Assessment

| Risk | Likelihood | Impact | Phase | Mitigation |
| --- | --- | --- | --- | --- |
| Phase 2 introduces latency (queue overhead) | Medium | Low | 2 | Queue adds ~50-100ms; acceptable for agent tasks (seconds-long) |
| Worker container crashes during task | Medium | Medium | 2 | Task heartbeat + auto-requeue; result idempotency |
| Path traversal escape from workspace | Low | Critical | 2 | resolve_safe_path() with symlink resolution + null byte check; security testing before pilot launch |
| Storage exhaustion (large repo clones) | Medium | Medium | 2 | Quota enforcement before each task; cleanup tooling for old repos; monitoring alerts at 80% |
| Cross-workspace data leak (shared worker) | Low | High | 2 | All paths validated per-request; subprocess cwd pinned; credentials cleaned per-task |
| Command injection via agent tool calls | Low | High | 2 | Whitelist enforcement; blocked patterns list; no shell expansion on user input |
| Railway volume data loss | Low | High | 2 | Railway volumes persist across deploys; backup strategy: periodic tar to S3 for critical workspaces |
| Single worker bottleneck (15 users) | Medium | Low | 2 | 1 worker handles ~3 concurrent tasks; pilot users unlikely to saturate; scale to 2 replicas if needed |
| K8s complexity slows feature development | Medium | High | 3 | Phase 3 only when revenue justifies; managed K8s (Autopilot) reduces ops |
| Pod startup latency (cold start) | Medium | Medium | 3 | Pre-pull images on nodes; KEDA warm pool |
| Redis as task queue: message loss | Low | High | 2-3 | Redis AOF persistence; critical tasks also written to Postgres |
| Namespace proliferation (1000+ workspaces) | Low | Medium | 3 | Lazy provisioning; cleanup inactive namespaces after 30 days |
| Cost overrun on K8s | Medium | Medium | 3 | KEDA scale-to-zero; spot instances; per-workspace billing |


Appendix A: Technology Decisions

| Factor | GKE Autopilot | EKS + Karpenter | AKS |
| --- | --- | --- | --- |
| Node management | Fully managed | Self-managed (Karpenter helps) | Mostly managed |
| Pay-per-pod | Yes | No (pay per node) | No |
| Scale to zero | Yes | Yes (with Karpenter) | Partial |
| Setup complexity | Low | Medium | Medium |
| Cost (small scale) | Lowest | Higher (min node) | Medium |
| GPU support | Yes | Yes | Yes |
| Banking compliance | GCP FedRAMP | AWS GovCloud | Azure Gov |

Given the team's banking IT background and existing familiarity with Azure/AWS, either GKE Autopilot (lowest ops burden) or EKS + Karpenter (most flexible) is a strong choice.

Why ARQ over Celery (Phase 2)

  • Native asyncio (matches FastAPI)

  • Minimal dependencies (just arq + redis)

  • Result backend built-in

  • Simple configuration

  • Lower memory footprint

  • We already depend on Redis

Why Not Serverless Functions (Lambda/Cloud Functions)

  • 15-minute timeout limit (agent tasks can run longer)

  • Cold start latency (3-10s)

  • No persistent filesystem (can't clone repos)

  • Limited to 10GB memory

  • No GPU access

  • Vendor lock-in


Appendix B: Monitoring & Observability

Metrics to Track

Dashboard (Grafana)

  • Task throughput (tasks/min by workspace and type)

  • Queue depth over time (P2)

  • Pod scheduling latency (P3)

  • Per-workspace resource consumption

  • Error rates and failure reasons

  • Cost attribution per workspace


This PRD is a living document. Update as phases progress.

Last updated