PRD-73: Agent Monitoring Integration Guide

For: Auto CTO & Agent Configuration
Status: Live (stack deployed on Railway)
Date: 2026-03-09


What's Running

Seven monitoring services are live on Railway, all in the same project as the Automatos platform:

| Service | Internal Address | Purpose |
| --- | --- | --- |
| Prometheus | prometheus.railway.internal:9090 | Scrapes metrics every 15s from all services |
| Grafana | grafana.railway.internal:3000 | Dashboards; also public at https://grafana-production-5f61.up.railway.app |
| Loki | loki.railway.internal:3100 | Log storage (7-day retention) |
| Log-Relay | log-relay.railway.internal:8080 | Receives logs from services, forwards to Loki |
| AlertManager | alertmanager.railway.internal:9093 | Routes alerts by severity to backend webhook |
| Postgres Exporter | postgres-exporter.railway.internal:9187 | Exports PostgreSQL metrics |
| Redis Exporter | redis-exporter.railway.internal:9121 | Exports Redis metrics |


1. How Logs Flow (Agent-Readable)

Any Service (Python logger)
    ↓ POST /push (JSON)
log-relay.railway.internal:8080
    ↓ batches + transforms
loki.railway.internal:3100
    ↓ queryable via
Grafana Logs Explorer dashboard

Sending Logs from Any Service

Every Python service uses the automatos_logging handler installed at orchestrator/core/monitoring/automatos_logging.py:
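The handler itself is not reproduced here; the following is a minimal, self-contained sketch of the shape such a handler takes. The class name, payload fields, and the commented-out POST target are assumptions, not the real implementation:

```python
import json
import logging

class LogRelayHandler(logging.Handler):
    """Sketch of a handler that ships log records to the log-relay.
    Payload field names mirror the Loki labels listed below (assumption)."""

    def __init__(self, service_name: str):
        super().__init__()
        self.service_name = service_name

    def format_payload(self, record: logging.LogRecord) -> dict:
        # Build the JSON body the relay is expected to accept.
        return {
            "service": self.service_name,
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "module": record.module,
        }

    def emit(self, record: logging.LogRecord) -> None:
        payload = self.format_payload(record)
        json.dumps(payload)  # ensure it serializes
        # Real handler would do something like:
        # requests.post(os.environ["LOG_RELAY_URL"], json=payload, timeout=2)
```

Attach it to the root logger once at startup so every module's logs are shipped.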

Environment variables (set on each Railway service):
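The three shipping-related variables (values as documented in section 7):

```
LOG_RELAY_URL=http://log-relay.railway.internal:8080/push
LOG_RELAY_ENABLED=true
SERVICE_NAME=automatos-backend   # or agent-opt-worker, etc.
```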

Log-Relay Push Format

Direct POST to http://log-relay.railway.internal:8080/push:

Supports arrays for batching:
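A hypothetical batched payload; the field names are assumptions inferred from the label filters listed under Querying Logs, so confirm them against the log-relay code. A single object (not wrapped in an array) works the same way:

```json
[
  {"service": "automatos-backend", "level": "error",
   "message": "db connection failed", "environment": "production"},
  {"service": "agent-opt-worker", "level": "info",
   "message": "task complete", "environment": "production"}
]
```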

Querying Logs (for Agents)

Agents can query Loki directly via its HTTP API:
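Loki's standard range-query endpoint is /loki/api/v1/query_range (start defaults to one hour ago, end to now). A small sketch that builds a query URL; the selector shown is illustrative:

```python
import urllib.parse

LOKI_URL = "http://loki.railway.internal:3100"

def loki_query_url(selector: str, limit: int = 100) -> str:
    """Build a Loki range-query URL for a LogQL stream selector."""
    params = urllib.parse.urlencode({"query": selector, "limit": limit})
    return f"{LOKI_URL}/loki/api/v1/query_range?{params}"

url = loki_query_url('{service="automatos-backend", level="error"}')
# Fetch with: requests.get(url).json()["data"]["result"]
```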

Label filters available:

  • service — service name (automatos-backend, agent-opt-worker, etc.)

  • level — debug, info, warning, error, critical

  • environment — production, development

  • source — direct-push, railway-drain

  • module — extracted from extra fields


2. How Metrics Flow

Prometheus Scrape Targets

| Target | Address | Metrics Available |
| --- | --- | --- |
| Backend API | automatos-ai.railway.internal:8000/metrics | HTTP request count, latency, error rate, in-progress requests |
| Agent Worker | agent-opt-worker.railway.internal:8080/metrics | Same HTTP metrics (when instrumented) |
| PostgreSQL | postgres-exporter.railway.internal:9187/metrics | Connections, dead tuples, cache hit ratio, DB size, replication lag |
| Redis | redis-exporter.railway.internal:9121/metrics | Memory usage, connected clients, evicted keys, command latency |
| Prometheus | prometheus.railway.internal:9090/metrics | Self-monitoring |
| Loki | loki.railway.internal:3100/metrics | Ingestion rate, query performance |
| AlertManager | alertmanager.railway.internal:9093/metrics | Alert delivery stats |

Custom Application Metrics (in code)

Available in orchestrator/core/monitoring/automatos_metrics.py:
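The module's actual metric names aren't reproduced here. As an illustration of the pattern, this stdlib-only sketch shows labelled counters rendered in the Prometheus text exposition format, which is what prometheus_client produces under the hood. The metric name below is an assumption:

```python
from collections import defaultdict

# Counter storage keyed by (metric name, sorted label pairs).
_counters: dict = defaultdict(float)

def inc(name: str, value: float = 1.0, **labels) -> None:
    """Increment a labelled counter."""
    _counters[(name, tuple(sorted(labels.items())))] += value

def render() -> str:
    """Render all counters in Prometheus text exposition format."""
    lines = []
    for (name, labels), value in sorted(_counters.items()):
        body = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{body}}} {value}")
    return "\n".join(lines)

inc("automatos_agent_tasks_total", agent="opt-worker", status="success")
```

The real module presumably uses prometheus_client, which handles registration, HELP/TYPE lines, and the /metrics endpoint for you.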

Querying Metrics (for Agents)

Agents can query Prometheus directly:
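Prometheus's standard HTTP API endpoints:

```
GET /api/v1/query?query=<promql>                      # instant query
GET /api/v1/query_range?query=<promql>&start=<ts>&end=<ts>&step=15s
GET /api/v1/alerts                                    # currently firing rules
```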

Useful PromQL queries for agents:
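A few starting points. The http_* series names assume standard FastAPI-style instrumentation; the pg_* and redis_* names are the exporters' defaults:

```promql
# Is anything down?
up == 0

# 5xx ratio over the last 5m
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# p95 request latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# PostgreSQL connection count
sum(pg_stat_activity_count)

# Redis memory in use
redis_memory_used_bytes
```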


3. How Alerts Flow

Active Alert Rules

Infrastructure (fire immediately)

| Alert | Severity | Condition | Meaning |
| --- | --- | --- | --- |
| ServiceDown | critical | up == 0 for 1m | Any monitored service unreachable |
| BackendDown | critical | Backend up == 0 for 30s | API server down |
| WorkerDown | critical | Worker up == 0 for 2m | Agent worker down |

PostgreSQL

| Alert | Severity | Condition |
| --- | --- | --- |
| PostgreSQLDown | critical | Exporter can't reach PG for 30s |
| HighConnections | warning | >150 connections for 5m |
| SlowQueries | warning | Transactions active >1s for 5m |
| DeadTuples | warning | >10k dead tuples for 10m |
| CacheHitLow | warning | Cache hit ratio <95% for 10m |
| DBSizeGrowth | info | >1GB growth in 24h |

Redis

| Alert | Severity | Condition |
| --- | --- | --- |
| RedisDown | critical | Exporter can't reach Redis for 30s |
| RedisHighMemory | warning | >200MB (of 256MB) for 5m |
| RedisHighMemoryCritical | critical | >240MB for 2m |
| RedisEvictedKeys | warning | Key evictions detected in 5m |
| RedisHighLatency | warning | Avg command latency >10ms |
| RedisHighClients | warning | >100 connected clients for 5m |

Application (requires /metrics — now deployed)

| Alert | Severity | Condition |
| --- | --- | --- |
| HighErrorRate | warning | >5% 5xx responses for 5m |
| SlowResponses | warning | p95 response time >5s |

Querying Alerts (for Agents)

From the database (preferred — has full history):
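A sketch of the kind of query to run. The column names are assumptions; check the infrastructure_alerts schema:

```sql
SELECT alert_name, severity, status, fired_at, agent_response
FROM infrastructure_alerts
WHERE severity = 'critical'
ORDER BY fired_at DESC
LIMIT 20;
```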

Via API:
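The backend exposes the same data over HTTP (the query parameters shown are assumptions):

```
GET /api/alerts?severity=critical&limit=20
```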

From AlertManager directly (current state only):
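AlertManager's standard v2 API returns only alerts that are currently firing or suppressed:

```
GET http://alertmanager.railway.internal:9093/api/v2/alerts
```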


4. SENTINEL Pattern — Agent Investigation (Read-Only v1)

When a critical alert is stored, agents should investigate but NOT auto-remediate.

Investigation Playbook per Alert

| Alert | Agent Action | Tools to Use |
| --- | --- | --- |
| RedisHighMemory | Query Redis INFO, find large keys, report top consumers | Redis CLI via workspace tools |
| RedisHighMemoryCritical | Same as above + check eviction policy + flag urgency | Redis CLI |
| PostgreSQLHighConnections | Query pg_stat_activity, find idle connections, report breakdown | SQL query tool |
| PostgreSQLDown | Check if exporter is up, check PG logs in Loki, report status | Loki query + health endpoint |
| HighErrorRate | Query Loki for ERROR logs in last 15m, group by module, summarize | Loki HTTP API |
| SlowResponses | Query Prometheus for slow endpoints, check for correlation with load | Prometheus HTTP API |
| ServiceDown | Check health endpoint, query Loki for crash logs, report last known state | Health endpoint + Loki |
| BackendDown | Escalate immediately: check Railway deployment status, recent deploys | Railway API / health check |

Agent Response Format

Store investigation results in infrastructure_alerts.agent_response (JSONB):
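A hypothetical shape for the stored document; every field name here is an assumption, not a fixed schema:

```json
{
  "investigated_at": "2026-03-09T12:00:00Z",
  "alert": "RedisHighMemory",
  "summary": "Three workspace cache keys hold most of the used memory",
  "findings": [
    {"key": "workspace:cache:123", "bytes": 61234567}
  ],
  "recommended_action": "Expire stale workspace caches",
  "auto_remediated": false
}
```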

Triggering Investigation

When the alerts ingest endpoint receives a critical alert, trigger an agent heartbeat:


5. Grafana Dashboards

URL: https://grafana-production-5f61.up.railway.app
Login: admin / (check GF_SECURITY_ADMIN_PASSWORD env var on the grafana service)

Available Dashboards

| Dashboard | Status | What It Shows |
| --- | --- | --- |
| Platform Overview | Active | All services up/down, high-level health |
| Database Health | Active | PostgreSQL connections, cache hit, dead tuples, DB size |
| Redis & Queues | Active | Redis memory, clients, evictions, latency |
| Logs Explorer | Active | Full-text log search across all services |
| Agent Performance | Placeholder | Waiting for custom metrics instrumentation |
| Workspace Worker | Placeholder | Waiting for worker metrics instrumentation |

Embedding Dashboards

Grafana panels can be embedded in the Automatos UI:

Set GF_SECURITY_ALLOW_EMBEDDING=true and configure CSP if embedding.
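With embedding enabled, a single panel can be dropped in via Grafana's standard d-solo URL. The dashboard UID, slug, and panelId below are placeholders:

```html
<iframe
  src="https://grafana-production-5f61.up.railway.app/d-solo/<dashboard-uid>/<slug>?orgId=1&panelId=2&refresh=30s"
  width="600" height="300" frameborder="0">
</iframe>
```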


6. Wiring Checklist for New Services

When adding a new Automatos service to monitoring:

1. Logging

In the service code:
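Roughly two lines; the entry-point name is an assumption, so check automatos_logging.py for the real one:

```python
from orchestrator.core.monitoring.automatos_logging import setup_logging  # assumed name
setup_logging()
```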

2. Metrics (if Python/FastAPI)

3. Add to Prometheus Scrape Config

Edit services/prometheus/prometheus-railway.yml in automatos-monitoring:
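A new scrape job follows Prometheus's standard config shape; the job name and target below are placeholders:

```yaml
scrape_configs:
  - job_name: "new-service"
    metrics_path: /metrics
    static_configs:
      - targets: ["new-service.railway.internal:8000"]
```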

4. Add Alert Rules (if needed)

Create or update rules in services/prometheus/rules/ in automatos-monitoring.
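Rules use Prometheus's standard alerting-rule format. A minimal example, with placeholder names and thresholds:

```yaml
groups:
  - name: new-service
    rules:
      - alert: NewServiceDown
        expr: up{job="new-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "new-service has been unreachable for 1 minute"
```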

5. Add Grafana Dashboard (if needed)

Create JSON dashboard in services/grafana/dashboards/ in automatos-monitoring.


7. Environment Variables Reference

On Backend (automatos-ai-api)

| Variable | Value | Purpose |
| --- | --- | --- |
| LOG_RELAY_URL | http://log-relay.railway.internal:8080/push | Where to ship logs |
| LOG_RELAY_ENABLED | true | Enable/disable log shipping |
| SERVICE_NAME | automatos-backend | Service identifier in logs/metrics |
| ALERT_INGEST_TOKEN | <token> | Shared secret with AlertManager |

On Agent Worker

| Variable | Value | Purpose |
| --- | --- | --- |
| LOG_RELAY_URL | http://log-relay.railway.internal:8080/push | Where to ship logs |
| LOG_RELAY_ENABLED | true | Enable/disable log shipping |
| SERVICE_NAME | agent-opt-worker | Service identifier |

On Monitoring Services

| Service | Key Variables |
| --- | --- |
| log-relay | LOKI_PUSH_URL, LOG_RELAY_SECRET, PORT=8080 |
| grafana | GF_SECURITY_ADMIN_PASSWORD, GF_SERVER_ROOT_URL, PORT=3000 |
| alertmanager | ALERT_INGEST_TOKEN, PORT=9093 |
| prometheus | PORT=9090 |
| loki | PORT=3100 |
| postgres-exporter | DATA_SOURCE_NAME=<postgres_dsn> |
| redis-exporter | REDIS_ADDR=<redis_url> |


8. Repository Structure (automatos-monitoring)

All monitoring infrastructure is defined as code in AutomatosAI/automatos-monitoring:


TL;DR for Auto

  1. Logs → Query Loki at loki.railway.internal:3100 or use Grafana Logs Explorer

  2. Metrics → Query Prometheus at prometheus.railway.internal:9090 with PromQL

  3. Alerts → Read from infrastructure_alerts table or GET /api/alerts

  4. Investigation → On critical alerts, investigate using read-only tools, store findings in agent_response column

  5. New services → Set 3 env vars + 2 lines of Python + add to prometheus scrape config

  6. Everything is code → All config lives in automatos-monitoring repo, rebuildable anywhere

Last updated