PRD-73: Observability & Monitoring Stack
Version: 1.2
Status: Draft
Priority: P0
Author: Gar Kavanagh + Auto CTO
Created: 2026-03-08
Updated: 2026-03-08
Dependencies: PRD-06 (Monitoring & Analytics — COMPLETE), PRD-55 (Agent Heartbeats — COMPLETE), PRD-72 (Activity Command Centre — DRAFT)
Repository: automatos-monitoring
Branch: feat/observability-stack
Deployment: Railway (all services deployed as Railway services within a single project)
Executive Summary
Automatos has internal application-level monitoring (PRD-06) and agent heartbeats (PRD-55), but zero infrastructure observability. No log aggregation, no metrics history, no alerting pipeline, no dashboards showing system health over time. When the backend OOMs at 3am, nobody knows until users complain. When an agent heartbeat silently fails, it's invisible. When Redis memory hits 256MB and starts evicting, there's no alert.
This PRD defines the external observability stack for the Automatos AI Platform — deployed as Railway services within the same Railway project, using Railway's private networking for inter-service communication and Railway's log drain for log forwarding.
What We're Building
Metrics Pipeline — Prometheus scraping all Automatos services (backend, postgres, redis, workspace-worker) via Railway private networking, with 15-day retention and Railway volume persistence
Log Aggregation — Loki receiving logs via Railway's HTTP log drain + direct push from the backend, queryable from Grafana with 7-day retention
Dashboards — Grafana with purpose-built dashboards: Platform Overview, Agent Performance, Database Health, Redis & Queues, Workspace Worker
Alerting Pipeline — AlertManager routing alerts by severity to webhooks that Automatos agents can consume for investigation and reporting
Agent-Readable Alert API — Structured webhook payloads that the orchestrator can ingest, allowing agents to investigate, classify, and recommend actions (automated remediation is out of scope for v1)
What We're NOT Building
Application-level metrics instrumentation (that's PRD-06, already done)
A replacement for the Activity Command Centre (PRD-72 handles operational visibility)
Custom exporters for Automatos-specific metrics (Phase 2)
Distributed tracing (OpenTelemetry — future PRD)
External uptime monitoring / synthetic checks
Promtail / Docker socket log collection (Railway doesn't expose Docker socket — we use Railway log drains instead)
1. Architecture Overview
Railway Networking Model
All services live in the same Railway project. Railway provides automatic private networking:
Private DNS: Every service is reachable at <service-name>.railway.internal:<port>
No Docker networks needed — Railway handles inter-service routing automatically
No Docker socket access — Log collection uses Railway's HTTP log drain → Loki, not Promtail Docker SD
Volumes: Railway persistent volumes for Prometheus data, Grafana data, and Loki storage
Public access: Grafana (dashboard access) and log-relay (Railway log drain target) get public Railway domains. Loki stays private — log-relay forwards to it internally.
Service DNS Map (private network):
| Service | Private DNS | Port |
|---|---|---|
| Backend | backend.railway.internal | 8000 |
| PostgreSQL | postgres.railway.internal | 5432 |
| Redis | redis.railway.internal | 6379 |
| Workspace Worker | workspace-worker.railway.internal | 8081 |
| Prometheus | prometheus.railway.internal | 9090 |
| Grafana | grafana.railway.internal | 3000 |
| Loki | loki.railway.internal | 3100 |
| AlertManager | alertmanager.railway.internal | 9093 |
| Postgres Exporter | postgres-exporter.railway.internal | 9187 |
| Redis Exporter | redis-exporter.railway.internal | 9121 |
2. Services & Configuration
2.1 Prometheus (Metrics Collection)
Image: prom/prometheus:v2.51.0 Port: 9090 Retention: 15 days Scrape Interval: 15s
Scrape Targets (via Railway private networking):
| Job | Target | What It Covers | Notes |
|---|---|---|---|
| automatos-backend | backend.railway.internal:8000/health | HTTP health, response time | Custom JSON → needs adapter or /metrics endpoint |
| postgres-exporter | postgres-exporter.railway.internal:9187 | Connections, query duration, DB size, replication lag | Separate exporter service (see 2.6) |
| redis-exporter | redis-exporter.railway.internal:9121 | Memory, connected clients, ops/sec, keyspace, evictions | Separate exporter service (see 2.6) |
| workspace-worker | workspace-worker.railway.internal:8081/health | Worker health, task queue depth | Custom JSON |
| prometheus | localhost:9090/metrics | Self-monitoring | Built-in |
| loki | loki.railway.internal:3100/metrics | Log ingestion rate, storage | Built-in |
| alertmanager | alertmanager.railway.internal:9093/metrics | Alert pipeline health | Built-in |
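The exporter-backed jobs above can be sketched as a prometheus.yml fragment using the private DNS names (a sketch, not the final config; the backend and worker jobs are excluded until a /metrics adapter exists, per the notes below):

```yaml
# prometheus.yml (sketch) — exporter and self-monitoring jobs only
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: postgres-exporter
    static_configs:
      - targets: ["postgres-exporter.railway.internal:9187"]

  - job_name: redis-exporter
    static_configs:
      - targets: ["redis-exporter.railway.internal:9121"]

  - job_name: loki
    static_configs:
      - targets: ["loki.railway.internal:3100"]

  - job_name: alertmanager
    static_configs:
      - targets: ["alertmanager.railway.internal:9093"]
```

Note that the 15-day retention is a launch flag rather than config: `--storage.tsdb.retention.time=15d`.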
No node-exporter: Railway manages the underlying infrastructure. Host-level CPU/memory/disk metrics are available via Railway's built-in metrics dashboard. We monitor application-level resource usage through the backend's /health endpoint and process metrics instead.
Strongly recommended for Phase 1 completeness: Minimal Prometheus-format /metrics endpoints for backend and workspace-worker. Without them, application alerting is limited to health probing and log-derived signals — meaning HighErrorRate, SlowResponses, QueueBacklog, HeartbeatFailureRate, and other application alerts in Section 3.2 cannot fire until /metrics is implemented.
Minimum /metrics surface (Phase 1):
http_requests_total{method, path, status} — request counter with status codes
http_request_duration_seconds{method, path} — histogram of response times
automatos_heartbeat_executions_total / automatos_heartbeat_failures_total — heartbeat counters
automatos_workspace_queue_depth{priority} — current queue sizes
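This surface can be sketched with prometheus_client (the library named in the Phase 1 task list); the helper function names here are illustrative, not part of the backend API:

```python
# Minimal Prometheus metrics surface for the backend (sketch).
# Metric names match the Phase 1 list above; helper names are illustrative.
from prometheus_client import Counter, Gauge, Histogram, generate_latest

HTTP_REQUESTS = Counter(
    "http_requests_total", "HTTP requests by method/path/status",
    ["method", "path", "status"])
HTTP_DURATION = Histogram(
    "http_request_duration_seconds", "Response time histogram",
    ["method", "path"])
HEARTBEAT_EXECS = Counter(
    "automatos_heartbeat_executions_total", "Heartbeat executions")
HEARTBEAT_FAILS = Counter(
    "automatos_heartbeat_failures_total", "Heartbeat failures")
QUEUE_DEPTH = Gauge(
    "automatos_workspace_queue_depth", "Current queue sizes", ["priority"])

def record_request(method: str, path: str, status: int, duration_s: float) -> None:
    """Call from request middleware after each response."""
    HTTP_REQUESTS.labels(method, path, str(status)).inc()
    HTTP_DURATION.labels(method, path).observe(duration_s)

def metrics_endpoint() -> bytes:
    # Serve this as GET /metrics with content type "text/plain; version=0.0.4"
    return generate_latest()
```

Wiring `metrics_endpoint()` to a GET /metrics route is all Prometheus needs to start scraping the backend job.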
Phase 2 expansion: Workflow execution times, agent token usage, per-model cost tracking, embedding throughput.
2.2 Grafana (Visualisation)
Image: grafana/grafana:10.4.0 Port: 3000 (Railway assigns public domain, e.g. grafana-automatos.up.railway.app) Auth: Admin password from Railway environment variable
Datasources (auto-provisioned):
| Datasource | UID | URL | Default |
|---|---|---|---|
| Prometheus | prometheus | http://prometheus.railway.internal:9090 | Yes |
| Loki | loki | http://loki.railway.internal:3100 | No |
Dashboards (auto-provisioned):
| Dashboard | Purpose | Key Panels |
|---|---|---|
| Platform Overview | Single-pane-of-glass for the whole platform | Service health matrix, CPU/Memory/Disk gauges, alert count, uptime |
| Agent Performance | Agent heartbeat and execution monitoring | Heartbeat success rate, execution duration, token spend, agent status grid |
| Database Health | PostgreSQL deep-dive | Active connections, query duration p95, DB size growth, dead tuples, cache hit ratio |
| Redis & Queues | Redis + workspace task queue health | Memory usage, ops/sec, evicted keys, queue depth per priority, task throughput |
| Workspace Worker | Worker process monitoring | Active tasks, task duration histogram, error rate, queue backlog |
| Logs Explorer | Pre-configured Loki log views | Error log stream, service-filtered views, log volume over time |
2.3 Loki (Log Aggregation)
Image: grafana/loki:2.9.4 Port: 3100 Retention: 7 days (168h) Storage: Local filesystem (/loki/chunks)
Configuration highlights:
Schema v13 with TSDB index
Ingestion rate limit: 16MB/s (burst 24MB)
Max streams per tenant: 10,000
Compaction: Every 10 minutes
Single-tenant mode (no auth required internally)
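These highlights map onto a loki-config.yml roughly as follows (a sketch; directory paths and the schema `from` date are assumptions to fix at deploy time):

```yaml
# loki-config.yml (sketch) — mirrors the highlights above
auth_enabled: false            # single-tenant, reachable only on the private network

schema_config:
  configs:
    - from: "2026-01-01"       # assumption: any date before first deploy
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 168h       # 7 days
  ingestion_rate_mb: 16
  ingestion_burst_size_mb: 24
  max_streams_per_user: 10000

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
```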
2.4 Log Collection (Railway Log Drain → Loki)
Railway doesn't expose Docker sockets, so we replace Promtail with two complementary approaches:
A. Railway HTTP Log Drain (all container stdout/stderr)
Railway supports HTTP log drains that forward all service logs to a URL. We configure the drain to point at the log-relay service (public), NOT Loki directly — Loki stays private. The log-relay transforms Railway's JSON format into Loki's push format and forwards it internally. See Section 2.4.1.
B. Direct Push from Backend (structured application logs)
The backend pushes structured logs directly to Loki via the python-logging-loki library or a simple HTTP handler:
This gives us structured JSON logs with labels: level, module, request_id, agent_id, workspace_id.
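The "simple HTTP handler" option can be sketched with the stdlib alone (the push URL and label set here are assumptions; python-logging-loki provides an equivalent handler off the shelf):

```python
# Minimal Loki push handler for Python logging (sketch).
import json
import logging
import time
import urllib.request

class LokiHandler(logging.Handler):
    """Push each log record to Loki's HTTP push API with a fixed label set."""

    def __init__(self, url: str, labels: dict):
        super().__init__()
        self.url = url          # e.g. http://loki.railway.internal:3100/loki/api/v1/push
        self.labels = labels    # base labels, e.g. {"service": "backend"}

    def build_payload(self, record: logging.LogRecord) -> dict:
        ts = str(int(time.time() * 1e9))   # Loki expects nanosecond timestamps as strings
        line = json.dumps({
            "level": record.levelname,
            "module": record.module,
            "message": record.getMessage(),
        })
        return {"streams": [{
            "stream": {**self.labels, "level": record.levelname},
            "values": [[ts, line]],
        }]}

    def emit(self, record: logging.LogRecord) -> None:
        body = json.dumps(self.build_payload(record)).encode()
        req = urllib.request.Request(
            self.url, data=body, headers={"Content-Type": "application/json"})
        try:
            urllib.request.urlopen(req, timeout=2)
        except OSError:
            pass  # never let logging failures crash the app
```

A production version would batch records and add request_id/agent_id/workspace_id from context rather than sending one record per POST.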
2.4.1 Log Relay Service (Railway Log Drain → Loki)
Image: Custom lightweight container or grafana/promtail:2.9.4 in HTTP receiver mode Purpose: Receives Railway log drain webhooks, transforms to Loki push format, forwards to Loki
Railway log drain sends JSON in this format:
The relay:
Listens on an HTTP port for Railway log drain POSTs
Extracts service and severity as Loki labels
Batches and pushes to http://loki.railway.internal:3100/loki/api/v1/push
Drops health check noise (GET /health lines)
Alternative: If Railway's log drain format evolves to support Loki natively, this relay can be removed.
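The transform step can be sketched as a pure function (the input field names message/service/severity are assumptions about Railway's drain format; verify them against a captured payload before building the relay):

```python
# Railway log drain JSON -> Loki push payload (sketch).
# Input field names are assumptions about Railway's drain format.
import time

def railway_to_loki(entries: list[dict]) -> dict:
    """Convert a batch of Railway log drain entries to a Loki push body."""
    streams: dict[tuple, list] = {}
    for e in entries:
        message = e.get("message", "")
        if "GET /health" in message:       # drop health check noise
            continue
        key = (e.get("service", "unknown"), e.get("severity", "info"))
        ts = str(int(time.time() * 1e9))   # Loki wants nanosecond-string timestamps
        streams.setdefault(key, []).append([ts, message])
    return {
        "streams": [
            {"stream": {"service": svc, "severity": sev}, "values": values}
            for (svc, sev), values in streams.items()
        ]
    }
```

The HTTP wrapper around this is thin: validate X-Railway-Secret, parse the body, call the transform, POST the result to Loki's push endpoint.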
2.5 AlertManager (Alert Routing)
Image: prom/alertmanager:v0.27.0 Port: 9093
Alert Routing Tree:
Webhook Target: http://backend.railway.internal:8000/api/alerts/ingest
Inhibition Rules:
Critical service-down alerts suppress warning-level performance alerts for the same service
PostgreSQLDown suppresses all database performance alerts
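An alertmanager.yml sketch of this routing tree and the inhibition rules (grouping keys, timing values, and the `component` label on database alerts are assumptions, not agreed config):

```yaml
# alertmanager.yml (sketch)
route:
  receiver: automatos-webhook
  group_by: [alertname, service]
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = critical
      receiver: automatos-webhook
      group_wait: 0s            # deliver critical alerts immediately
    - matchers:
        - severity = warning
      receiver: automatos-webhook

receivers:
  - name: automatos-webhook
    webhook_configs:
      - url: http://backend.railway.internal:8000/api/alerts/ingest
        send_resolved: true
        http_config:
          authorization:
            type: Bearer
            credentials: REPLACE_AT_DEPLOY   # AlertManager does not expand env vars;
                                             # template the token in at deploy time

inhibit_rules:
  # Critical service-down suppresses warnings for the same service
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: [service]
  # PostgreSQLDown suppresses all database performance alerts
  - source_matchers:
      - alertname = PostgreSQLDown
    target_matchers:
      - component = database     # assumes database alerts carry this label
```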
Webhook Payload Format (AlertManager standard):
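An illustrative payload using the standard AlertManager v4 webhook fields (the label and annotation values are made up for the example):

```json
{
  "version": "4",
  "groupKey": "{}:{alertname=\"RedisHighMemory\"}",
  "status": "firing",
  "receiver": "automatos-webhook",
  "groupLabels": { "alertname": "RedisHighMemory" },
  "commonLabels": { "alertname": "RedisHighMemory", "severity": "warning", "service": "redis" },
  "commonAnnotations": { "summary": "Redis memory above threshold" },
  "externalURL": "http://alertmanager.railway.internal:9093",
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "RedisHighMemory", "severity": "warning", "service": "redis" },
      "annotations": { "summary": "Redis memory above threshold" },
      "startsAt": "2026-03-08T03:12:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://prometheus.railway.internal:9090/graph",
      "fingerprint": "b0c4a4a0f1e2d3c4"
    }
  ]
}
```

The per-alert `fingerprint` field is what the ingest endpoint uses as its primary dedupe key (Section 4.1).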
2.6 Exporters (Separate Railway Services)
On Railway, exporters run as independent services (not Docker sidecars), connecting to targets via private networking:
| Exporter | Image | Target | Metrics Port | Railway Service |
|---|---|---|---|---|
| postgres-exporter | prometheuscommunity/postgres-exporter:0.15.0 | postgres.railway.internal:5432 | 9187 | postgres-exporter |
| redis-exporter | oliver006/redis_exporter:v1.58.0 | redis.railway.internal:6379 | 9121 | redis-exporter |
No node-exporter: Railway doesn't expose host-level metrics to containers. Use Railway's built-in metrics dashboard for infrastructure-level visibility. Application process metrics (memory RSS, CPU time, open FDs) can be exposed via the backend's /metrics endpoint in Phase 2.
3. Alert Rules
3.1 Infrastructure Alerts
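As an illustration of the exporter-based rules, two entries sketched in Prometheus rule format (alert names are from the capability matrix in 3.3; thresholds and `for` durations are placeholders, not agreed values):

```yaml
# infrastructure-alerts.yml (sketch)
groups:
  - name: infrastructure
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} target is down"

      - alert: RedisHighMemory
        expr: redis_memory_used_bytes > 200 * 1024 * 1024   # placeholder: 200MB of the 256MB cap
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage high ({{ $value | humanize }}B)"
```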
3.2 Application Alerts
3.3 Alert Capability Matrix
Which alerts can fire on day one vs. after /metrics instrumentation:
| Alert | Data Source | Phase 1 (health + exporters) | Phase 1 (with /metrics) | Phase 2 |
|---|---|---|---|---|
| ServiceDown | up metric | Yes | Yes | |
| BackendDown | up{job=automatos-backend} | Yes | Yes | |
| WorkerDown | up{job=automatos-workspace-worker} | Yes | Yes | |
| PostgreSQLDown | pg_up | Yes | Yes | |
| HighConnections | pg_stat_activity_count | Yes | Yes | |
| SlowQueries | pg_stat_activity_max_tx_duration | Yes | Yes | |
| DeadTuples | pg_stat_user_tables_n_dead_tup | Yes | Yes | |
| CacheHitLow | pg_stat_database_blks_* | Yes | Yes | |
| DBSizeGrowth | pg_database_size_bytes | Yes | Yes | |
| RedisDown | redis_up | Yes | Yes | |
| RedisHighMemory | redis_memory_used_bytes | Yes | Yes | |
| RedisEvictedKeys | redis_evicted_keys_total | Yes | Yes | |
| RedisHighLatency | redis_commands_duration_seconds_total | Yes | Yes | |
| RedisHighClients | redis_connected_clients | Yes | Yes | |
| HighErrorRate | http_requests_total | No | Yes | |
| SlowResponses | http_request_duration_seconds | No | Yes | |
| QueueBacklog | automatos_workspace_queue_depth | No | Yes | |
| QueueCritical | automatos_workspace_queue_depth | No | Yes | |
| HeartbeatFailureRate | automatos_heartbeat_* | No | Yes | |
| ErrorSpike | Loki log query | No | No | Yes |
| OOMKill | Loki log query | No | No | Yes |
| DatabaseError | Loki log query | No | No | Yes |
Takeaway: Without /metrics, only infrastructure alerts (exporters + health probes) fire on day one. With minimal /metrics instrumentation (strongly recommended for Phase 1), application alerts also become operational.
3.4 Log-Based Alerts (Loki Ruler — Phase 2)
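A sketch of what these Phase 2 rules could look like in Loki ruler format (selectors assume the service/severity labels applied by the log-relay; thresholds and match strings are placeholders):

```yaml
# loki-ruler.yml (sketch) — Phase 2 log-based alerts
groups:
  - name: log-alerts
    rules:
      - alert: ErrorSpike
        expr: sum(rate({service="backend"} |= "ERROR" [5m])) > 1   # placeholder threshold
        for: 5m
        labels:
          severity: warning

      - alert: OOMKill
        expr: sum(count_over_time({service=~".+"} |~ "Out of memory|OOMKilled" [10m])) > 0
        labels:
          severity: critical
```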
4. Agent Alert Integration (SENTINEL — Read-Only in v1)
Alerts feed back into Automatos agents for investigation and reporting only. Automated remediation is explicitly out of scope for v1 — agents observe, classify, and recommend; humans approve actions.
4.1 Alert Ingestion Endpoint
New endpoint on the backend:
Authentication: Bearer token validated against ALERT_INGEST_TOKEN env var. Requests without a valid token return 401.
Deduplication: Alerts are deduplicated by (alertname, service, instance, fingerprint). If a firing alert with the same key already exists and is still active, the last_seen_at timestamp is updated instead of creating a new row. AlertManager's fingerprint field (included in webhook payloads) is used as the primary dedupe key.
Resolved alert handling: When AlertManager sends status: "resolved", the matching active alert row is updated with resolved_at timestamp and status = "resolved". No new row is created.
This endpoint:
Validates auth token
Deduplicates against existing active alerts
Stores new alerts in the infrastructure_alerts table
Updates resolved alerts when resolution webhook arrives
For critical severity: triggers an investigation agent heartbeat (read-only)
For warning severity: logs and surfaces in Activity Command Centre (PRD-72)
For info severity: logs only
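The dedupe and resolve behaviour can be sketched against an in-memory store (field names follow Section 4.1; the real implementation writes to the infrastructure_alerts table instead of a dict):

```python
# Sketch of /api/alerts/ingest dedupe/resolve logic (in-memory stand-in).
from datetime import datetime, timezone

ACTIVE: dict[str, dict] = {}   # fingerprint -> active alert row

def ingest_alert(payload: dict) -> list[dict]:
    """Process an AlertManager webhook body; return the affected rows."""
    rows = []
    now = datetime.now(timezone.utc)
    for alert in payload.get("alerts", []):
        fp = alert["fingerprint"]             # primary dedupe key
        if alert["status"] == "resolved":
            row = ACTIVE.pop(fp, None)
            if row:                           # update, never create, on resolve
                row.update(status="resolved", resolved_at=now)
                rows.append(row)
            continue
        if fp in ACTIVE:                      # duplicate of a still-active alert
            ACTIVE[fp]["last_seen_at"] = now
            rows.append(ACTIVE[fp])
            continue
        row = {                               # new firing alert
            "fingerprint": fp,
            "alertname": alert["labels"].get("alertname"),
            "service": alert["labels"].get("service"),
            "severity": alert["labels"].get("severity", "info"),
            "status": "firing",
            "first_seen_at": now,
            "last_seen_at": now,
            "resolved_at": None,
        }
        ACTIVE[fp] = row
        rows.append(row)
    return rows
```

Severity routing (critical → investigation heartbeat, warning → Activity feed, info → log only) hangs off the returned rows.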
infrastructure_alerts table schema:
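A sketch of the table (column names inferred from the dedupe and resolution rules in 4.1; exact types and indexes belong to the migration):

```sql
-- infrastructure_alerts (sketch)
CREATE TABLE infrastructure_alerts (
    id            BIGSERIAL PRIMARY KEY,
    fingerprint   TEXT NOT NULL,
    alertname     TEXT NOT NULL,
    service       TEXT,
    instance      TEXT,
    severity      TEXT NOT NULL,            -- critical | warning | info
    status        TEXT NOT NULL,            -- firing | resolved
    labels        JSONB NOT NULL DEFAULT '{}',
    annotations   JSONB NOT NULL DEFAULT '{}',
    first_seen_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    last_seen_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    resolved_at   TIMESTAMPTZ
);

-- At most one active row per dedupe key
CREATE UNIQUE INDEX uq_infra_alerts_active
    ON infrastructure_alerts (fingerprint)
    WHERE status = 'firing';
```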
4.2 Agent Investigation Actions (Read-Only)
When a critical alert fires, the platform triggers an agent heartbeat to investigate and report — not to take remediation actions.
| Alert | Investigation | Output |
|---|---|---|
| RedisHighMemory | Query Redis INFO memory, identify large keys | Evidence summary + impact classification + recommended action |
| QueueBacklog | Query Redis queue lengths, check for stuck tasks | Stuck task report + queue depth breakdown |
| HighErrorRate | Query Loki for recent ERROR logs, group by module | Error summary + top error classes + recommended investigation |
| PostgreSQLHighConnections | Query pg_stat_activity for idle/active breakdown | Connection audit + idle connection list + recommendation |
| ServiceRestart | Check health endpoints, compare before/after | Recovery status report |
4.3 Alert → Agent Flow
5. Railway Deployment Structure
Each monitoring component deploys as a separate Railway service. The repo contains a docker-compose.yml for local development and individual Dockerfiles/configs for Railway deployment.
5.1 Railway Services (Production)
| Service | Image | Volume | Public | Notes |
|---|---|---|---|---|
| prometheus | prom/prometheus:v2.51.0 | 5GB | No | Internal only |
| grafana | grafana/grafana:10.4.0 | 1GB | Yes | Public domain for dashboard access |
| loki | grafana/loki:2.9.4 | 10GB | No | Private — log-relay forwards to it internally |
| log-relay | Custom (see 2.4.1) | None | Yes | Public — receives Railway log drain webhooks |
| alertmanager | prom/alertmanager:v0.27.0 | 500MB | No | Internal only |
| postgres-exporter | prometheuscommunity/postgres-exporter:0.15.0 | None | No | Stateless |
| redis-exporter | oliver006/redis_exporter:v1.58.0 | None | No | Stateless |
Total services: 7 (lean, purpose-built)
5.2 Docker Compose (Local Development)
For local development and testing, a docker-compose.yml is provided that mirrors the Railway setup using Docker networks instead of Railway private networking:
Local vs Railway: Locally, services use Docker DNS (prometheus:9090). On Railway, they use prometheus.railway.internal:9090. Config files use environment variable substitution to handle both.
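A docker-compose.yml excerpt illustrating the local mirror (three of the seven services shown; image tags from 5.1, environment variable names are assumptions):

```yaml
# docker-compose.yml (sketch) — service names double as Docker DNS,
# standing in for the *.railway.internal names used in production.
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=15d"
    volumes:
      - prometheus-data:/prometheus
    ports: ["9090:9090"]

  loki:
    image: grafana/loki:2.9.4
    volumes:
      - loki-data:/loki
    ports: ["3100:3100"]

  grafana:
    image: grafana/grafana:10.4.0
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}   # assumed var name
    ports: ["3000:3000"]

volumes:
  prometheus-data:
  loki-data:
```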
6. File Structure
Files to DELETE from current repo:
config/pgadmin/ — Not needed (adminer is in automatos-ai)
monitoring/prometheus/rules/xplaincrypto-alerts.yml — Crypto-specific
monitoring/prometheus/rules/phase1-alerts.yml — Crypto-specific references
monitoring/grafana/dashboards/crypto-overview.json — Crypto-specific
monitoring/grafana/dashboards/xplaincrypto-overview.json — Crypto-specific
monitoring/grafana/dashboards/n8n-*.json — No n8n in Automatos
monitoring/grafana/dashboards/unified-platform.json — Crypto references
monitoring/grafana/dashboards/infrastructure-testing.json — Crypto-specific
monitoring/grafana/dashboards/platform-status-comprehensive.json — Crypto-specific
monitoring/enhanced-n8n-exporter.py — No n8n
monitoring/promtail/ — Replaced by Railway log drain + log-relay service
nginx/ — Railway handles routing, no nginx needed
tests/test_infrastructure.py — Crypto-specific tests
docs/ — Will be rewritten
scripts/ — Will be rewritten for Railway
7. Configuration Requirements
7.1 Environment Variables (Railway Service Variables)
Each Railway service gets its own environment variables. Shared variables use Railway's variable referencing (${{service.variable}}).
Prometheus:
Grafana:
Postgres Exporter:
Redis Exporter:
Log Relay:
AlertManager:
Local Development (.env):
7.2 Prerequisites
Railway project exists with automatos-ai services already deployed (backend, postgres, redis)
Private networking enabled on the Railway project (enabled by default)
Railway CLI installed for deployment scripts (railway login, railway link)
Backend must expose /api/alerts/ingest endpoint (new — implemented as part of this PRD)
Railway log drain configured to point at the log-relay service's public URL
8. Deploy Runbook
See DEPLOY-RUNBOOK.md for the full day-of execution plan with:
Exact deploy order (Loki → log-relay → Prometheus → AlertManager → exporters → Grafana → backend ingest → log drain)
Validation commands and pass/fail criteria per service
Rollback points after each batch
Common Railway gotchas
Smoke tests for log pipeline, metrics pipeline, alert pipeline, and restart tolerance
9. Implementation Phases
Phase 1: Core Stack (This PRD)
Safe Phase 1 — must ship:
| Task | Priority | Size |
|---|---|---|
| Clean repo — remove all xplaincrypto/crypto references | P0 | S |
| Write docker-compose.yml for local dev (7 services) | P0 | M |
| Create Railway service configs (.toml files) | P0 | M |
| Build log-relay service (Railway log drain → Loki) | P0 | M |
| Configure Prometheus scrape targets (Railway private DNS) | P0 | S |
| Configure Loki with 7-day retention | P0 | S |
| Configure AlertManager with webhook routing + auth | P0 | M |
| Create infrastructure + database alert rules (exporter-based) | P0 | M |
| Deploy monitoring services to Railway | P0 | M |
| Configure Railway log drain → log-relay → Loki | P0 | S |
| Build Platform Overview dashboard | P1 | M |
| Build Database Health dashboard | P1 | M |
| Build Redis & Queues dashboard | P1 | S |
| Build Logs Explorer dashboard | P1 | S |
| Implement /api/alerts/ingest endpoint (with auth + dedupe) | P1 | M |
| Create infrastructure_alerts DB table + migration | P1 | S |
| Minimal /metrics endpoint on backend (prometheus_client) | P1 | M |
| README with local dev + Railway deployment instructions | P2 | S |
Can defer without killing v1 (nice-to-haves):
| Item | Why It Can Wait |
|---|---|
| Agent auto-investigation flows on critical alerts | Requires stable alert pipeline first |
| Application alert rules (HighErrorRate, SlowResponses) | Requires /metrics — can slip if /metrics isn't ready |
| Queue depth metrics (QueueBacklog, QueueCritical) | Requires worker /metrics or custom Redis key queries |
| Agent Performance dashboard | Placeholder until heartbeat metrics are instrumented |
| Workspace Worker dashboard | Placeholder until worker /metrics exists |
| Setup scripts (local + Railway) | Nice automation, not blocking |
| Health check script | Nice automation, not blocking |
Phase 2: Enhanced Observability (Future PRD)
Loki alerting rules (log-based alerts — ErrorSpike, OOMKill)
OpenTelemetry distributed tracing
Grafana alerting (unified with AlertManager)
Dashboard annotations from deployments
SLA/uptime tracking dashboard
Cost monitoring dashboard (LLM token spend trends)
Automated remediation agent actions (graduated from read-only)
10. Success Criteria
| Criterion | Target |
|---|---|
| All 7 Railway monitoring services healthy | 100% uptime when automatos-ai is running |
| Prometheus scrape targets up | All configured targets returning metrics via Railway private network |
| Alert firing → webhook delivery | < 60s for critical, < 120s for warning |
| Log ingestion latency (Railway drain → Loki) | < 10s from log write to Loki queryable |
| Dashboard load time | < 3s for any dashboard |
| Zero crypto/xplaincrypto references | Clean repo audit passes |
| Agent can read alerts | /api/alerts/ingest stores and triggers correctly |
| Railway deployment reproducible | Fresh deploy from repo works in < 15 min |
11. Security Considerations
No hardcoded credentials — All passwords via Railway environment variables (encrypted at rest)
Private networking — Prometheus, AlertManager, exporters communicate only via Railway private network (not publicly accessible)
Public services hardened — Only Grafana and log-relay have public domains. Loki is private (no public access). Grafana requires admin auth. Log-relay validates the X-Railway-Secret header.
Alert ingest auth — /api/alerts/ingest requires an Authorization: Bearer <ALERT_INGEST_TOKEN> header. Token stored as Railway env var on AlertManager.
Grafana auth — Admin password required, anonymous access disabled, consider SSO in Phase 2
Railway variable references — Database passwords use ${{service.VAR}} references, never copied as plain text
Exporters — Read-only database access (create a monitoring Postgres role with pg_monitor grants)
Log relay auth — Validate X-Railway-Secret header on incoming log drain requests to prevent spoofing
.env in .gitignore — Never commit credentials (local dev only)
12. Relationship to Existing PRDs
PRD-06 (Monitoring & Analytics)
PRD-06 = application-level metrics in the UI. PRD-73 = infrastructure observability. Complementary, not competing.
PRD-55 (Agent Heartbeats)
PRD-73 alerts can trigger heartbeats. Heartbeat results are scraped as metrics.
PRD-72 (Activity Command Centre)
Infrastructure alerts surface in the Activity feed as system events.
PRD-70 (Security Hardening)
PRD-73 implements monitoring best practices from the security audit.