MUTX Infrastructure

Production-readiness baseline for Terraform, Ansible, Docker Compose, and monitoring.

Layout

terraform/ – DigitalOcean VPC + droplet + firewall provisioning
ansible/ – host hardening and container deployment playbooks
helm/ – Kubernetes Helm chart for MUTX deployment
monitoring/ – Prometheus + Grafana config

Terraform (DigitalOcean)

cd infrastructure
make tf-fmt
make tf-validate
make ansible-inventory

# Staging
make tf-plan-staging

# Production
make tf-plan-production

# Apply from terraform/ after reviewing the plan
cd terraform
terraform apply -var-file=environments/production/terraform.tfvars

Health Checks

The API exposes two health check endpoints for Kubernetes/production use:

/health - Liveness probe, returns 200 if the service is alive
/ready - Readiness probe, returns 503 if not ready to accept traffic

Both endpoints check database connectivity status.
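
As a rough illustration of how the two probes could be checked by hand, the sketch below maps an endpoint and HTTP status code to a verdict. The function names and the base URL are hypothetical, and treating 200 from /ready as "ready" is an assumption; the text above only documents the 503 case.

```shell
#!/bin/sh
# Sketch only: function names and the base URL are illustrative assumptions.

# Map an endpoint and its HTTP status code to a short verdict.
classify_probe() {
  endpoint=$1
  status=$2
  case "$endpoint:$status" in
    /health:200) echo "alive" ;;
    /ready:200)  echo "ready" ;;      # assumed: 200 means ready
    /ready:503)  echo "not-ready" ;;
    *)           echo "unexpected" ;;
  esac
}

# Fetch only the status code for an endpoint and classify it.
# curl prints 000 when the connection itself fails.
probe() {
  base_url=$1
  endpoint=$2
  status=$(curl -s -o /dev/null -w '%{http_code}' "$base_url$endpoint")
  classify_probe "$endpoint" "$status"
}
```

For example, `probe http://localhost:8000 /ready` would print one of the verdicts above, assuming the API listens on port 8000 locally.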

Notes

Backend is environment-scoped via terraform/environments/<env>/backend.hcl.
Keep admin_cidr scoped to a VPN or home-office CIDR. Avoid 0.0.0.0/0 in production.
Customer IDs must be unique.
Customer VPC CIDRs are validated at plan time.
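
The admin_cidr advice above could be enforced with a small guard before planning against production. This is a minimal sketch, not part of the repository: the function name check_admin_cidr is hypothetical, and it only rejects empty values, values without a prefix length, and a prefix length of 0.

```shell
#!/bin/sh
# Sketch only: check_admin_cidr is a hypothetical helper.

# Return 0 when the CIDR looks acceptable for production, 1 otherwise.
check_admin_cidr() {
  cidr=$1
  case "$cidr" in
    "")  echo "admin_cidr is empty" >&2; return 1 ;;
    */0) echo "admin_cidr '$cidr' is open to every address" >&2; return 1 ;;
    */*) return 0 ;;
    *)   echo "admin_cidr '$cidr' is not in CIDR notation" >&2; return 1 ;;
  esac
}
```

A call such as `check_admin_cidr "203.0.113.0/24" && make tf-plan-production` would then skip the plan when the value is too broad. The check is purely textual and does not validate the address itself.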

Ansible Provisioning

cd infrastructure
make ansible-provision
make ansible-deploy-agent

Install Ansible collections and lint before applying:

cd infrastructure
make ansible-deps
make ansible-lint

Required environment variables before running playbooks:

POSTGRES_PASSWORD
REDIS_PASSWORD
AGENT_API_KEY
AGENT_SECRET_KEY
ADMIN_CIDR (recommended, defaults to 0.0.0.0/0)
PRIVATE_CIDR (optional, defaults to 10.0.0.0/8)
TAILSCALE_AUTH_KEY (optional)
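
A preflight check along the following lines could catch missing variables before a playbook run. It is a sketch under assumptions: the function names are hypothetical, and it inspects variables visible in the current shell, so they still need to be exported for make and Ansible to see them.

```shell
#!/bin/sh
# Sketch only: missing_required_vars and preflight are hypothetical helpers.

# Print the name of every required variable that is unset or empty.
missing_required_vars() {
  for name in POSTGRES_PASSWORD REDIS_PASSWORD AGENT_API_KEY AGENT_SECRET_KEY; do
    # Names come from the fixed list above, so eval is safe here.
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name"
  done
}

# Fail when a required variable is missing; warn about an open ADMIN_CIDR.
preflight() {
  missing=$(missing_required_vars)
  if [ -n "$missing" ]; then
    echo "Missing required variables:" $missing >&2
    return 1
  fi
  # ADMIN_CIDR defaults to 0.0.0.0/0 when unset, so warn in that case too.
  if [ "${ADMIN_CIDR:-0.0.0.0/0}" = "0.0.0.0/0" ]; then
    echo "Warning: ADMIN_CIDR is unset or 0.0.0.0/0" >&2
  fi
  return 0
}
```

Running `preflight && make ansible-provision` would stop early instead of failing partway through a playbook.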

Monitoring Stack

cp .env.monitoring.example .env.monitoring
# set strong credentials in .env.monitoring
cd infrastructure
make monitor-up

The monitoring compose file binds UI and exporter ports to localhost by default. Stop the stack with:

cd infrastructure && make monitor-down

Prometheus loads alerting rules from infrastructure/monitoring/prometheus/alerts.yml and scrapes the API via host.docker.internal:8000, which works with local Docker Desktop and with a Linux host-gateway mapping.

Grafana provisioning and dashboard mounts are resolved relative to infrastructure/docker/docker-compose.monitoring.yml, so the stack reads from ../monitoring/grafana/... when launched from infrastructure/.

CI Drift Detection

A scheduled GitHub Actions workflow checks for Terraform drift daily in both staging and production using terraform plan -detailed-exitcode:

Workflow: .github/workflows/infrastructure-drift.yml
Trigger: daily cron + manual dispatch
Behavior: uploads plan artifacts and fails when drift is detected
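
The drift check relies on the exit status of terraform plan -detailed-exitcode: 0 means no changes, 1 means the plan failed, and 2 means changes are pending. The sketch below shows one way to interpret that status locally. The function names are hypothetical, and the staging var-file path is assumed to mirror the production one shown earlier; the actual workflow may be structured differently.

```shell
#!/bin/sh
# Sketch only: interpret_plan_exit and check_drift are hypothetical helpers.

# Translate a `terraform plan -detailed-exitcode` status into a verdict.
interpret_plan_exit() {
  case "$1" in
    0) echo "no-drift" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Run a plan for one environment and report the verdict.
# Must be run from terraform/ after the backend has been initialized.
check_drift() {
  env_name=$1
  rc=0
  terraform plan -detailed-exitcode -input=false \
    -var-file="environments/$env_name/terraform.tfvars" >/dev/null || rc=$?
  interpret_plan_exit "$rc"
}
```

For example, `check_drift staging` would print "drift" when the plan reports pending changes. Keeping the error case separate from the drift case avoids mistaking a failed plan for real drift.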

Required GitHub secrets:

DO_TOKEN
TF_STATE_ACCESS_KEY_ID
TF_STATE_SECRET_ACCESS_KEY

Extra Validation Targets

From infrastructure/:

# Drift-style exit codes (0=no changes, 2=changes)
make tf-plan-staging-detailed
make tf-plan-production-detailed

# Validate Prometheus config + alert rules
make monitor-validate

These infrastructure checks are authoritative for infra changes, but they are separate from the app CI lane in .github/workflows/ci.yml. Keep infra-specific validation explicit so application PRs do not fail on hidden cross-surface assumptions.

Next Hardening Items

Replace static inventory usage with Terraform-generated inventory by default in Ansible playbook wrappers.
Wire alert delivery (Alertmanager/notification channel) for critical rules.
Note: the default local Alertmanager routing already drops alerts unless they are explicitly labeled notify="webhook", which prevents noisy failed webhook retries in dev setups without a receiver.
Add automated backup/restore verification for PostgreSQL and Redis volumes.
