MUTX Infrastructure
Production-readiness baseline for Terraform, Ansible, Docker Compose, and monitoring.
Layout
- terraform/: DigitalOcean VPC + droplet + firewall provisioning
- ansible/: host hardening and container deployment playbooks
- helm/: Kubernetes Helm chart for MUTX deployment
- monitoring/: Prometheus + Grafana config
Terraform (DigitalOcean)
cd infrastructure
make tf-fmt
make tf-validate
make ansible-inventory
# Staging
make tf-plan-staging
# Production
make tf-plan-production
# Apply from terraform/ after reviewing plan
cd terraform
terraform apply -var-file=environments/production/terraform.tfvars
Health Checks
The API exposes two health check endpoints for Kubernetes/production use:
- /health - Liveness probe, returns 200 if the service is alive
- /ready - Readiness probe, returns 503 if not ready to accept traffic
Both endpoints check database connectivity status.
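A minimal sketch of how a script might interpret these endpoints from outside Kubernetes. The helper names probe_result and check_endpoint are hypothetical and not part of this repo. Treating any 2xx as healthy is an assumption; this README only states 200 for /health and 503 for /ready.

```shell
# probe_result CODE: classify an HTTP status code from a health endpoint.
# Assumption: any 2xx counts as healthy, 503 means "not ready".
probe_result() {
  case "$1" in
    2[0-9][0-9]) echo "ok" ;;
    503)         echo "not-ready" ;;
    *)           echo "unexpected" ;;
  esac
}

# check_endpoint URL: fetch only the status code with curl, then classify it.
# If curl cannot connect, the code falls back to 000 and is reported as "unexpected".
check_endpoint() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1") || code=000
  probe_result "$code"
}
```

For example, check_endpoint http://localhost:8000/ready could gate a deploy step. The host and port here are assumptions, borrowed from the port 8000 scrape target mentioned in the monitoring section.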
Notes
- Backend is environment-scoped via terraform/environments/<env>/backend.hcl.
- Keep admin_cidr scoped to VPN/home-office CIDR. Avoid 0.0.0.0/0 in production.
- Customer IDs must be unique.
- Customer VPC CIDRs are validated at plan time.
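The admin_cidr note above could be enforced with a small pre-flight guard before planning. This is only a sketch: is_open_cidr and guard_admin_cidr are hypothetical helpers, and the check is a plain string match on a /0 prefix rather than real CIDR parsing.

```shell
# is_open_cidr CIDR: succeed when the prefix length is /0,
# i.e. the range covers every address (0.0.0.0/0 or ::/0).
is_open_cidr() {
  case "$1" in
    */0) return 0 ;;
    *)   return 1 ;;
  esac
}

# guard_admin_cidr ENV CIDR: fail only for a wide-open CIDR in production.
guard_admin_cidr() {
  if [ "$1" = "production" ] && is_open_cidr "$2"; then
    echo "refusing admin_cidr=$2 in production" >&2
    return 1
  fi
  return 0
}
```

A wrapper might call guard_admin_cidr production "$ADMIN_CIDR" and abort on a non-zero status before running make tf-plan-production.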
Ansible Provisioning
cd infrastructure
make ansible-provision
make ansible-deploy-agent
Install Ansible collections and lint before applying:
cd infrastructure
make ansible-deps
make ansible-lint
Required environment variables before running playbooks:
- POSTGRES_PASSWORD
- REDIS_PASSWORD
- AGENT_API_KEY
- AGENT_SECRET_KEY
- ADMIN_CIDR (recommended, defaults to 0.0.0.0/0)
- PRIVATE_CIDR (optional, defaults to 10.0.0.0/8)
- TAILSCALE_AUTH_KEY (optional)
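A pre-flight check for the variables listed above might look like the following sketch. The function names are hypothetical, and it assumes only the four variables without a stated default or "optional" marker are strictly required.

```shell
# missing_vars NAME...: print the names of variables that are unset or empty.
missing_vars() {
  for name in "$@"; do
    # Indirect lookup of the variable named by $name (POSIX sh has no ${!name}).
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# require_playbook_env: fail if any assumed-required variable is missing.
require_playbook_env() {
  missing=$(missing_vars POSTGRES_PASSWORD REDIS_PASSWORD AGENT_API_KEY AGENT_SECRET_KEY)
  if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "missing required variables:" $missing >&2
    return 1
  fi
}
```

Running require_playbook_env && make ansible-provision would stop early with a list of missing names instead of failing partway through a playbook.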
Monitoring Stack
cp .env.monitoring.example .env.monitoring
# set strong credentials/passwords
cd infrastructure
make monitor-up
The monitoring compose file binds UI and exporter ports to localhost by default. Stop the stack with cd infrastructure && make monitor-down.
- Prometheus loads alerting rules from infrastructure/monitoring/prometheus/alerts.yml and scrapes the API via host.docker.internal:8000 (works with local Docker Desktop and with a Linux host-gateway mapping).
- Grafana provisioning and dashboard mounts are resolved relative to infrastructure/docker/docker-compose.monitoring.yml, so the stack reads from ../monitoring/grafana/... when launched from infrastructure/.
CI Drift Detection
A scheduled GitHub Actions workflow now checks Terraform drift daily for both staging and production using terraform plan -detailed-exitcode:
- Workflow: .github/workflows/infrastructure-drift.yml
- Trigger: daily cron + manual dispatch
- Behavior: uploads plan artifacts and fails when drift is detected
Required GitHub secrets:
- DO_TOKEN
- TF_STATE_ACCESS_KEY_ID
- TF_STATE_SECRET_ACCESS_KEY
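The drift check described above relies on the exit status of terraform plan -detailed-exitcode: 0 means the plan succeeded with no changes, 2 means it succeeded with changes, and 1 means the plan itself failed. A rough sketch of how a wrapper could interpret that status follows; classify_plan_exit and run_drift_check are hypothetical names and are not taken from the actual workflow file.

```shell
# classify_plan_exit CODE: map a `terraform plan -detailed-exitcode` status to a label.
classify_plan_exit() {
  case "$1" in
    0) echo "no-drift" ;;
    2) echo "drift" ;;
    *) echo "error" ;;   # 1 (or anything else) means the plan did not complete
  esac
}

# run_drift_check VARFILE: run the plan quietly and print the classification.
# Capturing the status with `|| rc=$?` keeps the function usable under `set -e`.
run_drift_check() {
  rc=0
  terraform plan -detailed-exitcode -input=false -var-file="$1" >/dev/null || rc=$?
  classify_plan_exit "$rc"
}
```

From terraform/, run_drift_check environments/production/terraform.tfvars would print one of the three labels. Keeping "error" separate from "drift" avoids reporting a broken backend or expired credentials as infrastructure drift.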
Extra Validation Targets
From infrastructure/:
# Drift-style exit codes (0=no changes, 2=changes)
make tf-plan-staging-detailed
make tf-plan-production-detailed
# Validate Prometheus config + alert rules
make monitor-validate
These infrastructure checks are authoritative for infra changes, but they are not the same thing as the app CI lane in .github/workflows/ci.yml. Keep infra-specific validation explicit so application PRs do not fail on hidden cross-surface assumptions.
Next Hardening Items
- Replace static inventory usage with Terraform-generated inventory by default in Ansible playbook wrappers.
- Wire alert delivery (Alertmanager/notification channel) for critical rules.
- Default local Alertmanager routing now drops alerts unless they are explicitly labeled notify="webhook", preventing noisy failed webhook retries in dev setups without a receiver.
- Add automated backup/restore verification for PostgreSQL and Redis volumes.
