# MUTX Infrastructure
Production-readiness baseline for Terraform, Ansible, Docker Compose, and monitoring.
## Layout
* `terraform/` β DigitalOcean VPC + droplet + firewall provisioning
* `ansible/` β host hardening and container deployment playbooks
* `helm/` β Kubernetes Helm chart for MUTX deployment
* `monitoring/` β Prometheus + Grafana config
## Terraform (DigitalOcean)
```bash
cd infrastructure
make tf-fmt
make tf-validate
make ansible-inventory
# Staging
make tf-plan-staging
# Production
make tf-plan-production
# Apply from terraform/ after reviewing plan
cd terraform
terraform apply -var-file=environments/production/terraform.tfvars
```
## Health Checks
The API exposes two health check endpoints for Kubernetes/production use:
* `/health` - Liveness probe, returns 200 if the service is alive
* `/ready` - Readiness probe, returns 503 if not ready to accept traffic
Both endpoints check database connectivity status.
### Notes
* Backend is environment-scoped via `terraform/environments//backend.hcl`.
* Keep `admin_cidr` scoped to VPN/home-office CIDR. Avoid `0.0.0.0/0` in production.
* Customer IDs must be unique.
* Customer VPC CIDRs are validated at plan time.
## Ansible Provisioning
```bash
cd infrastructure
make ansible-provision
make ansible-deploy-agent
```
Install Ansible collections and lint before applying:
```bash
cd infrastructure
make ansible-deps
make ansible-lint
```
Required environment variables before running playbooks:
* `POSTGRES_PASSWORD`
* `REDIS_PASSWORD`
* `AGENT_API_KEY`
* `AGENT_SECRET_KEY`
* `ADMIN_CIDR` (recommended, defaults to `0.0.0.0/0`)
* `PRIVATE_CIDR` (optional, defaults to `10.0.0.0/8`)
* `TAILSCALE_AUTH_KEY` (optional)
## Monitoring Stack
```bash
cp .env.monitoring.example .env.monitoring
# set strong credentials/passwords
cd infrastructure
make monitor-up
```
The monitoring compose file binds UI and exporter ports to localhost by default. Stop it with `cd infrastructure && make monitor-down`. Prometheus now loads alerting rules from `infrastructure/monitoring/prometheus/alerts.yml` and scrapes the API via `host.docker.internal:8000` (works for local Docker Desktop + Linux host-gateway mapping). Grafana provisioning and dashboard mounts are resolved relative to `infrastructure/docker/docker-compose.monitoring.yml`, so the stack reads from `../monitoring/grafana/...` when launched from `infrastructure/`.
## CI Drift Detection
A scheduled GitHub Actions workflow now checks Terraform drift daily for both `staging` and `production` using `terraform plan -detailed-exitcode`:
* Workflow: `.github/workflows/infrastructure-drift.yml`
* Trigger: daily cron + manual dispatch
* Behavior: uploads plan artifacts and fails when drift is detected
Required GitHub secrets:
* `DO_TOKEN`
* `TF_STATE_ACCESS_KEY_ID`
* `TF_STATE_SECRET_ACCESS_KEY`
## Extra Validation Targets
From `infrastructure/`:
```bash
# Drift-style exit codes (0=no changes, 2=changes)
make tf-plan-staging-detailed
make tf-plan-production-detailed
# Validate Prometheus config + alert rules
make monitor-validate
```
These infrastructure checks are authoritative for infra changes, but they are not the same thing as the app CI lane in `.github/workflows/ci.yml`. Keep infra-specific validation explicit so application PRs do not fail on hidden cross-surface assumptions.
## Next Hardening Items
* Replace static inventory usage with Terraform-generated inventory by default in Ansible playbook wrappers.
* Wire alert delivery (Alertmanager/notification channel) for critical rules.
* Default local Alertmanager routing now drops alerts unless they are explicitly labeled `notify="webhook"`, preventing noisy failed webhook retries in dev setups without a receiver.
* Add automated backup/restore verification for PostgreSQL and Redis volumes.