System Health Monitoring
Purpose and Scope
This document describes the system health monitoring infrastructure in Automatos AI, which provides real-time visibility into the operational status of all platform components. The system implements comprehensive health checks, system metrics collection, and time-series monitoring to ensure platform reliability and performance.
For information about analytics and usage tracking, see Analytics & Monitoring. For cost tracking and LLM usage analytics specifically, see LLM Usage Tracking.
Health Monitoring Architecture
The health monitoring system performs two primary functions:
Component Health Checks - Verifies connectivity and operational status of critical services
System Metrics Collection - Collects and stores CPU, memory, disk, and network metrics
High-Level Health Check Flow
Sources: orchestrator/api/system.py:328-534
Health Check Endpoint
GET /api/system/health
Returns comprehensive health status for all platform components with individual component metrics.
Response Model: SystemHealthResponse
Sources: orchestrator/api/system.py:328-534, orchestrator/core/models/enhanced.py:248-300
Component Health Checks
Each component health check follows a consistent pattern with latency measurement and error capture.
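The pattern described above can be sketched as a small wrapper. This is an illustration, not the code in orchestrator/api/system.py: the `run_check` name and the `probe` callable are hypothetical, and the result keys simply mirror the ComponentHealth fields described later in this document.

```python
import time
from datetime import datetime, timezone
from typing import Any, Callable, Dict


def run_check(name: str, probe: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Run one probe, time it, and turn any exception into an unhealthy result.

    `probe` is a hypothetical callable that returns component-specific metrics
    and raises if the component is unavailable.
    """
    started = time.perf_counter()
    try:
        metrics = dict(probe())
        status = "healthy"
    except Exception as exc:  # capture the error instead of propagating it
        metrics = {"error": str(exc)}
        status = "unhealthy"
    # Latency is recorded for both successful and failed checks.
    metrics["latency_ms"] = round((time.perf_counter() - started) * 1000, 2)
    return {
        "name": name,
        "status": status,
        "last_check": datetime.now(timezone.utc).isoformat(),
        "metrics": metrics,
    }
```

A probe that raises still produces a well-formed result, so the caller never has to handle exceptions from an individual check.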
Database Health Check
Tests PostgreSQL connectivity with a simple query.
Status: healthy if query succeeds, unhealthy on failure
Check: SELECT 1 execution
Metrics: connection: active or connection: failed
Implementation: orchestrator/api/system.py:337-345
Redis Health Check
Verifies Redis connectivity and measures ping latency.
Status: healthy if PING returns true, unhealthy otherwise
Check: redis_client.ping()
Metrics: ping: success/failed, latency_ms, connection: active/failed
Implementation: orchestrator/api/system.py:347-371
API Health Check
Verifies internal API readiness by loading core modules.
Status: healthy if modules load, unhealthy on import failure
Check: Import LLMManager and RAGService
Metrics: readiness: ready/not_ready, latency_ms, core_modules: loaded/failed
Implementation: orchestrator/api/system.py:373-402
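An import-based readiness probe of this kind might look roughly like the following. The function name and the idea of passing module paths as strings are assumptions made for illustration. The real check imports LLMManager and RAGService, whose module paths are not given here.

```python
import importlib
import time
from typing import Dict, Iterable, Union


def check_core_modules(module_paths: Iterable[str]) -> Dict[str, Union[str, float]]:
    """Try to import each module and report readiness in the documented metric keys."""
    started = time.perf_counter()
    try:
        for path in module_paths:
            importlib.import_module(path)
        loaded = True
    except ImportError:  # also covers ModuleNotFoundError
        loaded = False
    return {
        "readiness": "ready" if loaded else "not_ready",
        "latency_ms": round((time.perf_counter() - started) * 1000, 2),
        "core_modules": "loaded" if loaded else "failed",
    }
```

Calling it with standard-library modules, for example `check_core_modules(["json", "math"])`, reports `readiness: ready`. A path that cannot be imported yields `not_ready` without raising.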
Document Processor Health Check
Verifies document processing worker accessibility and database connectivity.
Status: healthy if worker accessible, unhealthy on error
Check: Import process_document, access RAG service, query Document table
Metrics: status: operational/error, documents_in_db, worker: accessible/unavailable
Implementation: orchestrator/api/system.py:404-437
RAG System Health Check
Verifies RAG system operational status and database connectivity.
Status: healthy if service accessible, unhealthy on error
Check: Access RAG service, query RAGConfiguration and document_chunks tables
Metrics: rag_configs, document_chunks, service: accessible/unavailable
Implementation: orchestrator/api/system.py:439-475
Component Health Model
The ComponentHealth model standardizes health information across all components.
ComponentHealth Schema
name (String): Component identifier (database, redis, api, etc.)
status (String): Health status: healthy, degraded, or unhealthy
last_check (DateTime): Timestamp of last health check
metrics (Dict): Component-specific metrics and diagnostic info
Sources: orchestrator/core/models/enhanced.py:248-268, orchestrator/api/system.py:478-510
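As a rough picture of the schema, the four fields can be expressed as a plain dataclass. The actual model lives in orchestrator/core/models/enhanced.py and may be defined differently. The class name `ComponentHealthSketch` and the status validation are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict

# The three status values listed in the schema above.
ALLOWED_STATUSES = {"healthy", "degraded", "unhealthy"}


@dataclass
class ComponentHealthSketch:
    """Illustrative stand-in for ComponentHealth; not the project's class."""

    name: str
    status: str
    last_check: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    metrics: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject statuses outside the documented set (an added safeguard).
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")
```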
System Metrics Collection
The /api/system/metrics endpoint provides detailed system performance metrics with optional time-series history.
GET /api/system/metrics
Returns current system metrics and optionally historical time-series data from the database.
Query Parameters:
timeRange (optional): 1h, 24h, 7d, or 30d. Returns historical data from the database.
Metrics Collection Architecture
Sources: orchestrator/api/system.py:577-775
Metrics Storage
The system stores metrics to the system_metrics table for historical analysis and trending.
Storage Schema
Stored Metrics
cpu_usage (percent): Average CPU usage across all cores
memory_usage (percent): Memory utilization percentage
memory_available (bytes): Available memory in bytes
disk_usage (percent): Disk usage percentage
disk_read_bytes (bytes): Cumulative bytes read from disk
disk_write_bytes (bytes): Cumulative bytes written to disk
network_sent (bytes): Cumulative bytes sent over network
network_recv (bytes): Cumulative bytes received over network
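One way to picture how a metrics snapshot maps onto stored records is sketched below. The metric names and units come from the list above. The row keys (`metric_name`, `value`, `unit`, `timestamp`) and the function name are hypothetical, since the system_metrics column layout is not reproduced in this document.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List

# Metric names and units as listed under "Stored Metrics".
METRIC_UNITS = {
    "cpu_usage": "percent",
    "memory_usage": "percent",
    "memory_available": "bytes",
    "disk_usage": "percent",
    "disk_read_bytes": "bytes",
    "disk_write_bytes": "bytes",
    "network_sent": "bytes",
    "network_recv": "bytes",
}


def build_metric_rows(snapshot: Dict[str, float], recorded_at: datetime) -> List[Dict[str, Any]]:
    """Turn a snapshot into one row per known metric; unknown keys are ignored."""
    rows = []
    for name, unit in METRIC_UNITS.items():
        if name in snapshot:
            rows.append(
                {
                    "metric_name": name,  # hypothetical column name
                    "value": float(snapshot[name]),
                    "unit": unit,
                    "timestamp": recorded_at,
                }
            )
    return rows
```

Storing one row per metric with a shared timestamp keeps time-range queries simple, though the real table may be organized differently.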
Storage Implementation
The _store_current_metrics() function collects and stores metrics asynchronously without blocking the response:
Implementation: orchestrator/api/system.py:536-574
Time-Series Data Retrieval
When the timeRange parameter is provided, the endpoint queries historical metrics from the database.
Time-Series Query Flow
Time Range Mapping
1h: 1 hour (last hour of data)
24h: 24 hours (last 24 hours)
7d: 168 hours (last 7 days)
30d: 720 hours (last 30 days)
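The mapping above amounts to a lookup followed by a subtraction. A minimal sketch, assuming the query filters on a cutoff timestamp. The `cutoff_for` helper is hypothetical, and how the endpoint treats unsupported values is not specified here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# timeRange value -> window length in hours, as in the mapping above.
TIME_RANGE_HOURS = {"1h": 1, "24h": 24, "7d": 168, "30d": 720}


def cutoff_for(time_range: str, now: Optional[datetime] = None) -> datetime:
    """Return the earliest timestamp included in the requested window.

    Raises KeyError for values outside the documented set.
    """
    hours = TIME_RANGE_HOURS[time_range]
    now = now or datetime.now(timezone.utc)
    return now - timedelta(hours=hours)
```

A historical query would then select rows whose timestamp is at or after `cutoff_for(time_range)`.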
Time-Series Response Format
Implementation: orchestrator/api/system.py:675-769
Analytics Integration
The metrics endpoint integrates with AnalyticsEngine to provide context optimization and learning metrics.
Context Optimization Metrics
Tracks token savings from the RecipeScratchpad system:
tokens_saved: Total tokens saved through context compression
compression_ratio: Average compression ratio (typically 0.1-0.2 for 80-90% savings)
total_optimizations: Number of optimization operations performed
efficiency: Overall efficiency score
Source: orchestrator/api/system.py:610-620
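The relationship between compression_ratio and savings can be checked with a little arithmetic. This sketch assumes the ratio is compressed size divided by original size, which is what the "0.1-0.2 for 80-90% savings" note implies. Both helper names are illustrative.

```python
def savings_percent(compression_ratio: float) -> float:
    """Percentage of tokens saved, assuming ratio = compressed / original."""
    return round((1.0 - compression_ratio) * 100, 1)


def tokens_saved(original_tokens: int, compression_ratio: float) -> int:
    """Tokens removed when a context of `original_tokens` is compressed."""
    return original_tokens - round(original_tokens * compression_ratio)
```

Under that assumption, a ratio of 0.1 corresponds to 90% savings and a ratio of 0.2 to 80%, matching the range quoted above.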
Learning Metrics
Tracks memory and knowledge accumulation:
total_memories: Total Mem0 memory items stored
recent_memories: Recent memory items (last 24h)
knowledge_nodes: Knowledge graph nodes
active_collaborations: Active agent collaborations
memory_consolidations: Memory consolidation operations
avg_improvement: Average improvement from learning
Source: orchestrator/api/system.py:622-642
Frontend Integration
The health monitoring system is consumed by the frontend dashboard for real-time status visualization.
React Query Integration
The frontend follows the use-unified-analytics.ts hook pattern for data fetching:
Dashboard Visualization
Sources: Diagram 5 from high-level architecture
Agent Status Monitoring
Individual agent health is tracked separately through agent-specific endpoints.
GET /api/agents/{agent_id}/status
Returns operational status for a specific agent:
Implementation: orchestrator/api/agents.py:481-505
GET /api/agents/stats
Returns aggregated statistics across all agents in the workspace:
Implementation: orchestrator/api/agents.py:269-297
Error Handling and Resilience
The health monitoring system is designed to never fail the application, following a defensive programming pattern.
Fault Isolation
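Fault isolation here means that one failing check must not prevent the others from reporting. A minimal sketch of that idea, with a hypothetical `collect_health` function and probes passed in as callables:

```python
from typing import Any, Callable, Dict, List


def collect_health(probes: Dict[str, Callable[[], Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Run every probe in its own try/except so failures stay local."""
    results: List[Dict[str, Any]] = []
    for name, probe in probes.items():
        try:
            results.append({"name": name, "status": "healthy", "metrics": probe()})
        except Exception as exc:
            # The failure is recorded for this component only; the loop continues.
            results.append(
                {"name": name, "status": "unhealthy", "metrics": {"error": str(exc)}}
            )
    return results
```

Because exceptions are converted into results inside the loop, the endpoint can always return a response, even when every component is down.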
Degraded State Detection
The overall system status is calculated based on component health:
All components healthy: healthy
Any component degraded or unhealthy: degraded
Critical components failing: unhealthy (depends on component criticality)
Implementation: orchestrator/api/system.py:513
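The aggregation rules can be read as a short decision function. This is one interpretation, not the project's implementation: which components count as critical is not stated in this document, so the `critical` parameter is a hypothetical input.

```python
from typing import Dict, FrozenSet


def overall_status(
    component_statuses: Dict[str, str],
    critical: FrozenSet[str] = frozenset(),
) -> str:
    """Derive an overall status from per-component statuses.

    `critical` names the components whose failure makes the whole system
    unhealthy; the actual set is an assumption left to the caller.
    """
    if all(status == "healthy" for status in component_statuses.values()):
        return "healthy"
    if any(component_statuses.get(name) == "unhealthy" for name in critical):
        return "unhealthy"
    return "degraded"
```

With no critical components configured, any non-healthy component yields `degraded`, which matches the first two rules above.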
Usage Tracking Integration
While not part of health monitoring directly, the UsageTracker provides complementary monitoring for LLM operations.
LLM Call Tracking
Every LLM call is tracked for analytics:
Note: Usage tracking runs in a separate database session and never fails the parent transaction.
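The "never fails the parent transaction" behavior can be approximated as follows. This is a sketch under assumptions: `session_factory` stands for whatever creates the separate database session, the session is only expected to offer `add`, `commit`, `rollback`, and `close`, and `track_llm_call` is an illustrative name rather than the UsageTracker API.

```python
import contextlib
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("usage_tracking_sketch")


def track_llm_call(session_factory: Callable[[], Any], record: Dict[str, Any]) -> bool:
    """Persist one usage record in its own session; return False instead of raising."""
    try:
        session = session_factory()  # separate session, independent of the caller's
    except Exception:
        logger.warning("could not open usage-tracking session", exc_info=True)
        return False
    try:
        session.add(record)
        session.commit()
        return True
    except Exception:
        logger.warning("usage tracking failed; continuing", exc_info=True)
        with contextlib.suppress(Exception):
            session.rollback()
        return False
    finally:
        with contextlib.suppress(Exception):
            session.close()
```

The caller can ignore the return value. Its own transaction is unaffected either way because the tracking work happens in a different session.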
Sources: orchestrator/core/llm/usage_tracker.py:17-116
Best Practices
For Platform Operators
Monitor /api/system/health regularly - Set up automated alerts when overall_status is not healthy
Check component-specific metrics - Each component provides diagnostic information in its metrics field
Track time-series trends - Use the timeRange parameter to identify gradual degradation
Review error messages - Component metrics include an error field with failure details
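For operators wiring up alerts, the evaluation step might look like the sketch below. It works on an already-decoded JSON payload. The `overall_status` field is named in this document, while the `components` list key and the `alerts_from_health` function are assumptions about the response shape.

```python
from typing import Any, Dict, List


def alerts_from_health(payload: Dict[str, Any]) -> List[str]:
    """Return human-readable alert lines for a decoded health response."""
    alerts: List[str] = []
    overall = payload.get("overall_status", "unknown")
    if overall != "healthy":
        alerts.append(f"overall_status is {overall}")
    # "components" is an assumed key holding ComponentHealth-like dicts.
    for component in payload.get("components", []):
        if component.get("status") != "healthy":
            detail = component.get("metrics", {}).get("error", "no error detail")
            alerts.append(f"{component.get('name')}: {component.get('status')} ({detail})")
    return alerts
```

An empty list means nothing needs attention. Anything else can be forwarded to whatever alerting channel is in use.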
For Developers
Add new components to health checks - Follow the pattern in orchestrator/api/system.py:328-534
Use try/except for isolation - Each component check should catch exceptions independently
Include diagnostic metrics - Return actionable information in the metrics dict
Measure latency - Track check duration for performance monitoring
For Frontend Developers
Poll /api/system/health periodically - Recommended: every 30-60 seconds
Use /api/system/metrics?timeRange=24h - For dashboard charts
Cache metric queries - Use React Query with appropriate stale times
Handle degraded states gracefully - Show warnings without blocking user actions
Related Systems
Analytics Engine - See Analytics & Monitoring for overall analytics architecture
LLM Usage Tracking - See LLM Usage Tracking for detailed cost tracking
Agent Status - See Agent Lifecycle & Status for agent-specific health monitoring
Sources: orchestrator/api/system.py:1-842, orchestrator/api/agents.py:269-297, 481-505, orchestrator/core/models/enhanced.py:248-300, orchestrator/core/llm/usage_tracker.py:1-116