MUTX Production Monitoring
This document describes the monitoring infrastructure, endpoints, and procedures for MUTX.
Overview
MUTX uses Prometheus for metrics collection and Grafana for visualization. Prometheus evaluates the alert rules, and Alertmanager routes the resulting notifications.
Monitoring Stack
| Component | Purpose | Port |
|---|---|---|
| Prometheus | Metrics collection & alerting | 9090 |
| Alertmanager | Alert routing & notification | 9093 |
| Grafana | Dashboards & visualization | 3001 |
| Node Exporter | System metrics | 9100 |
| PostgreSQL Exporter | Database metrics | 9187 |
| Redis Exporter | Cache metrics | 9121 |
Service Endpoints
Health Check Endpoints
| Endpoint | Description | Auth |
|---|---|---|
| GET /health | Basic health status | None |
| GET /ready | Readiness probe (includes DB) | None |
| GET /metrics | Prometheus metrics | None |
Base URL
The monitoring stack is available at:
- Prometheus: http://localhost:9090 (development)
- Grafana: http://localhost:3001 (development)
- Alertmanager: http://localhost:9093 (development)
For production, replace localhost with your server's hostname or IP.
Prometheus Metrics
Available Metrics
HTTP Metrics
- http_requests_total - Total HTTP requests by method, path, status
- http_request_duration_seconds - Request latency histogram
Agent Metrics
- mutx_agents_total - Total number of agents
- mutx_agents_active - Number of active agents
- mutx_agent_tasks_total - Agent tasks processed by status
- mutx_agent_task_duration_seconds - Task duration histogram
Deployment Metrics
- mutx_deployments_total - Total deployments
- mutx_deployments_running - Running deployments
- mutx_deployments_by_status - Deployments by status
Queue Metrics
- mutx_queue_size - Current queue size
Query Examples
# API request rate
rate(http_requests_total[5m])
# p95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Active agents
mutx_agents_active
# Failed tasks rate
rate(mutx_agent_tasks_total{status="failed"}[5m])
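These expressions can also be evaluated outside the UI through the Prometheus HTTP API, which accepts an instant query at /api/v1/query with the expression in the query parameter. The sketch below only builds the request URL. The helper name build_query_url is hypothetical, and the base URL assumes the development setup described above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    # urlencode escapes the braces, quotes, and brackets in the expression.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Assumed development base URL; replace the host for production.
expression = 'rate(mutx_agent_tasks_total{status="failed"}[5m])'
url = build_query_url("http://localhost:9090", expression)
print(url)
```

Fetching that URL with any HTTP client returns a JSON document whose data field holds the query result.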
Alert Rules
Critical Alerts
| Alert | Expression | Description |
|---|---|---|
| MutxApiDown | up{job="mutx-api"} == 0 for 2m | API is down |
| HostDiskAlmostFull | Disk available < 15% for 15m | Disk space critical |
Warning Alerts
| Alert | Expression | Description |
|---|---|---|
| HighApiP95Latency | p95 latency > 1s for 10m | High latency |
| RedisExporterDown | Redis exporter down for 5m | Redis monitoring unavailable |
| PostgresExporterDown | Postgres exporter down for 5m | DB monitoring unavailable |
| NodeExporterDown | Node exporter down for 5m | System monitoring unavailable |
| HostHighMemoryUsage | Memory usage > 90% for 10m | High memory pressure |
Custom Alerts
Add new alerts to infrastructure/monitoring/prometheus/alerts.yml:
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate detected"
    description: "Error rate is above 10% for 5 minutes"
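The description speaks of a percentage, so the quantity of interest is the share of requests that returned 5xx, not a raw per-second count. As a rough illustration of that arithmetic (the sample values and helper names are made up for this example, and the rate calculation ignores counter resets and the extrapolation Prometheus applies):

```python
def per_second_rate(first: float, last: float, window_seconds: float) -> float:
    """Average per-second increase of a counter over a window."""
    return (last - first) / window_seconds


def error_ratio(error_rate: float, total_rate: float) -> float:
    """Fraction of requests that failed; 0.0 when there is no traffic."""
    return error_rate / total_rate if total_rate > 0 else 0.0


# Hypothetical counter samples taken 300 seconds (5m) apart.
errors = per_second_rate(first=120, last=210, window_seconds=300)
total = per_second_rate(first=10_000, last=10_600, window_seconds=300)
ratio = error_ratio(errors, total)

# 0.3 errors/s out of 2.0 requests/s is a 15% error ratio, above the 10% threshold.
print(f"{errors:.1f} errors/s, {total:.1f} requests/s, ratio {ratio:.0%}")
```

A low-traffic service can have a high error ratio while its absolute 5xx rate stays far below 0.1 per second, which is why the two thresholds are not interchangeable.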
Grafana Dashboards
Access dashboards at: http://localhost:3001/dashboards
Available Dashboards
- MUTX API Overview - Main API metrics
- Agent Performance - Agent task metrics
- Deployment Status - Deployment tracking
- System Overview - Node, Redis, PostgreSQL
Default Credentials
- Username: admin
- Password: Set via the GRAFANA_ADMIN_PASSWORD environment variable
Alert Notifications
Alertmanager is configured to send notifications. Update infrastructure/monitoring/prometheus/alertmanager.yml to configure notification receivers:
route:
  group_by: ['alertname']
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'alerts@mutx.dev'
        send_resolved: true
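With group_by set to alertname, Alertmanager batches alerts that share the same alertname into a single notification. The following is a simplified illustration of that grouping, not Alertmanager's implementation; the group_alerts helper and the instance label values are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Group alert label sets by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # Alerts missing a grouping label fall into the empty-string bucket.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical firing alerts, reduced to the labels that matter here.
alerts = [
    {"alertname": "NodeExporterDown", "instance": "host-a"},
    {"alertname": "NodeExporterDown", "instance": "host-b"},
    {"alertname": "MutxApiDown", "instance": "api-1"},
]

groups = group_alerts(alerts, ["alertname"])
for key, members in groups.items():
    print(key, len(members))
```

Here the two NodeExporterDown alerts would arrive in one email and MutxApiDown in another. Adding more labels to group_by produces smaller, more specific groups.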
Application Health Checks
/health Endpoint
Returns overall health status:
{
  "status": "healthy",
  "timestamp": "2026-03-15T12:00:00Z",
  "database": "ready",
  "error": null
}
/ready Endpoint
Returns readiness including database connectivity:
{
  "status": "ready",
  "timestamp": "2026-03-15T12:00:00Z",
  "database": "ready",
  "error": null
}
Returns 503 if not ready.
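A client such as a deploy script can combine the status code and the JSON body to decide whether the service is usable. A minimal sketch, assuming the response shape shown above and that a ready service answers with HTTP 200 (the text only states the 503 case); is_ready is a hypothetical helper.

```python
import json


def is_ready(status_code: int, body: str) -> bool:
    """Interpret a /ready response: ready only on HTTP 200 with status "ready"."""
    if status_code != 200:
        return False  # for example 503 while the service is not ready
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    return payload.get("status") == "ready" and payload.get("error") is None


ready_body = (
    '{"status": "ready", "timestamp": "2026-03-15T12:00:00Z", '
    '"database": "ready", "error": null}'
)
print(is_ready(200, ready_body))
print(is_ready(503, ready_body))
```

Checking the status code first keeps the helper safe against error pages that are not JSON.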
Production Deployment
Starting Monitoring Stack
# Using Docker Compose
cd infrastructure/docker
docker-compose -f docker-compose.monitoring.yml up -d
# Or use the Makefile
make up-monitoring
Verify Services
# Check Prometheus
curl http://localhost:9090/-/healthy
# Check Alertmanager
curl http://localhost:9093/-/healthy
# Check exporters
curl http://localhost:9100/metrics | head
curl http://localhost:9187/metrics | head
curl http://localhost:9121/metrics | head
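The same checks can be scripted. This is a sketch assuming the development ports listed above; check_services and http_status are hypothetical helpers, and the fetch function is injectable so the logic can be exercised without a running stack.

```python
import urllib.error
import urllib.request

# Development URLs taken from the curl commands above.
ENDPOINTS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "alertmanager": "http://localhost:9093/-/healthy",
    "node-exporter": "http://localhost:9100/metrics",
    "postgres-exporter": "http://localhost:9187/metrics",
    "redis-exporter": "http://localhost:9121/metrics",
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except OSError:
        return 0  # connection refused, DNS failure, timeout, and similar


def check_services(endpoints, fetch=http_status):
    """Map each service name to True when its endpoint answers with 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}
```

Calling check_services(ENDPOINTS) against a running stack returns a dictionary of service names to booleans, which a deploy script can use to fail early.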
Troubleshooting
Prometheus not scraping targets
- Check target status: http://localhost:9090/targets
- Verify network connectivity
- Check the Prometheus logs: docker logs mutx-prometheus
Alerts not firing
- Check alert rules: http://localhost:9090/rules
- Verify Alertmanager connectivity: http://localhost:9090/status
- Check alert notifications in Alertmanager UI
High latency on queries
- Increase the scrape interval if needed, since scraping less often stores fewer samples per series
- Check retention settings
- Review query performance in Prometheus UI
Integration with External Monitoring
Adding New Exporters
- Add exporter service to docker-compose.monitoring.yml
- Add scrape config to prometheus.yml
- Restart Prometheus: docker restart mutx-prometheus
Custom Metrics
Add custom metrics in your code:
from src.api.metrics import http_requests_total, http_request_duration_seconds
# Track a request
http_requests_total.labels(method="GET", path="/api/agents", status=200).inc()
# Track duration
http_request_duration_seconds.labels(method="GET", path="/api/agents").observe(0.125)
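Hand-written observe() calls are easy to miss on error paths. One option is a small context manager that times a block and reports the elapsed seconds to any callable, for example the observe method returned by http_request_duration_seconds.labels(...). This is a sketch and not part of src/api/metrics.py; track_duration is a hypothetical name, and a plain list stands in for the histogram so the example runs without the metrics module.

```python
import time
from contextlib import contextmanager


@contextmanager
def track_duration(observe):
    """Time the enclosed block and pass the elapsed seconds to observe()."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs on success and on exceptions, so failed requests are measured too.
        observe(time.perf_counter() - start)


recorded = []  # stand-in for a histogram's observe method

with track_duration(recorded.append):
    time.sleep(0.01)  # simulated request handling

print(f"observed {len(recorded)} sample(s)")
```

In application code the list would be replaced by the labelled histogram, so each handler only wraps its body in the with block.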
Environment Variables
| Variable | Description | Default |
|---|---|---|
| PROMETHEUS_PORT | Prometheus port | 9090 |
| ALERTMANAGER_PORT | Alertmanager port | 9093 |
| GRAFANA_PORT | Grafana port | 3001 |
| GRAFANA_ADMIN_PASSWORD | Grafana admin password | (required) |
| NODE_EXPORTER_PORT | Node exporter port | 9100 |
| POSTGRES_EXPORTER_PORT | Postgres exporter port | 9187 |
| REDIS_EXPORTER_PORT | Redis exporter port | 9121 |
| REDIS_PASSWORD | Redis password | (required) |
| POSTGRES_PASSWORD | Postgres password | (required) |
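A sketch of how a helper script might resolve these settings, applying the defaults from the table and failing fast when a required secret is missing. The function load_monitoring_config is hypothetical; the variable names and default ports come from the table above, and the values in example_env are placeholders.

```python
import os

PORT_DEFAULTS = {
    "PROMETHEUS_PORT": 9090,
    "ALERTMANAGER_PORT": 9093,
    "GRAFANA_PORT": 3001,
    "NODE_EXPORTER_PORT": 9100,
    "POSTGRES_EXPORTER_PORT": 9187,
    "REDIS_EXPORTER_PORT": 9121,
}
REQUIRED = ("GRAFANA_ADMIN_PASSWORD", "REDIS_PASSWORD", "POSTGRES_PASSWORD")


def load_monitoring_config(env=os.environ):
    """Return the settings with defaults applied; raise if a secret is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    config = {name: int(env.get(name, default)) for name, default in PORT_DEFAULTS.items()}
    config.update({name: env[name] for name in REQUIRED})
    return config


# Placeholder environment: secrets set, one port overridden.
example_env = {
    "GRAFANA_ADMIN_PASSWORD": "change-me",
    "REDIS_PASSWORD": "change-me",
    "POSTGRES_PASSWORD": "change-me",
    "GRAFANA_PORT": "3000",
}
config = load_monitoring_config(example_env)
print(config["GRAFANA_PORT"], config["PROMETHEUS_PORT"])
```

Passing the environment as an argument keeps the function testable; calling it with no argument reads the real process environment.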
Related Files
- infrastructure/docker/docker-compose.monitoring.yml - Monitoring stack
- infrastructure/monitoring/prometheus/prometheus.yml - Prometheus config
- infrastructure/monitoring/prometheus/alerts.yml - Alert rules
- infrastructure/monitoring/prometheus/alertmanager.yml - Alert routing
- src/api/metrics.py - Application metrics
