# Terraform Datadog Monitors Module

## Overview

This Terraform module creates basic host metrics monitors for CPU and disk usage with an accompanying visualization timeboard in Datadog.

## Features

- **CPU Monitoring**: Track EC2 instance CPU utilization
- **Disk Monitoring**: Monitor disk usage across hosts
- **Automated Alerting**: No-data notifications included
- **Visualization**: Read-only timeboard with alert thresholds
- **Configurable Thresholds**: Customizable warning and critical levels

## Resources Created

- `datadog_monitor` (disk_usage): Metric alert for disk usage
- `datadog_monitor` (cpu_usage): Query alert for CPU usage
- `datadog_timeboard` (host_metrics): Read-only visualization dashboard

## Requirements

| Name | Version |
|------|---------|
| terraform | >= 0.12 |
| datadog | >= 3.2.0 |

## Usage

```hcl
module "datadog_monitors" {
  source = "./terraform-datadog-monitors"

  datadog_api_key = var.datadog_api_key
  datadog_app_key = var.datadog_app_key
  api_url         = "https://api.datadoghq.eu"

  disk_usage = {
    query     = "max:system.disk.in_use"
    threshold = "85"
  }

  cpu_usage = {
    query     = "avg:aws.ec2.cpuutilization"
    threshold = "85"
  }
}
```

## Inputs

| Name | Description | Type | Required | Default |
|------|-------------|------|----------|---------|
| `datadog_api_key` | Datadog API key | `string` | yes | - |
| `datadog_app_key` | Datadog APP key | `string` | yes | - |
| `api_url` | API endpoint | `string` | no | `"https://api.datadoghq.eu"` |
| `http_client_retry_enabled` | Enable request retries (429, 5xx) | `bool` | no | `true` |
| `http_client_retry_timeout` | HTTP retry timeout | `string` | no | `""` |
| `validate` | Validate API/APP keys on init | `bool` | no | `true` |
| `disk_usage` | Query and threshold for disk monitor | `map` | no | See default |
| `cpu_usage` | Query and threshold for CPU monitor | `map` | no | See default |
| `datadog_alert_footer` | Alert message footer | `string` | no | PagerDuty + Slack template |
| `trigger_by` | Grouping for alerts | `string` | no | `"{host,env}"` |

## Monitor Configuration

### Disk Usage Monitor

- **Query**: `avg(last_5m):max:system.disk.in_use{*} by {host,env} * 100 > 85`
- **Type**: Metric alert
- **Threshold**: 85% (configurable)
- **Evaluation**: Last 5 minutes average
- **Grouping**: By host and env
- **No Data**: Notifies after 10 minutes

### CPU Usage Monitor

- **Query**: `avg(last_5m):avg:aws.ec2.cpuutilization{*} by {host,env} > 85`
- **Type**: Query alert
- **Threshold**: 85% (configurable)
- **Evaluation**: Last 5 minutes average
- **Grouping**: By host and env
- **No Data**: Notifies after 10 minutes

## Timeboard

The module creates a read-only timeboard with:

- CPU usage graph with alert threshold marker
- Disk usage graph with alert threshold marker
- Alert overlay showing when thresholds are breached

## Alert Message Template

Default alert footer includes integration with:

- PagerDuty: `@pagerduty-service_name`
- Slack: `@slack-channel_name`

Customize via the `datadog_alert_footer` variable.

## Outputs

Currently, this module does not export any outputs.

## Customization

### Custom Thresholds

```hcl
disk_usage = {
  query     = "max:system.disk.in_use"
  threshold = "90" # Raise to 90%
}

cpu_usage = {
  query     = "avg:aws.ec2.cpuutilization"
  threshold = "75" # Lower to 75%
}
```

### Custom Grouping

```hcl
trigger_by = "{host,env,service}"
```

## Notes

- Monitors include no-data alerting by default
- Timeboard is read-only to prevent accidental modifications
- Uses 5-minute evaluation windows
- Supports HTTP client retries for reliability
- Can be reused across multiple environments via variable configuration

## License

Internal use only - Sanoma/WeBuildYourCloud

## Authors

Created and maintained by the Platform Engineering team.
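
## Example: How the Monitor Queries Could Be Assembled

The queries listed under Monitor Configuration look like a combination of the `disk_usage` / `cpu_usage` maps and the `trigger_by` string. The following is a minimal sketch of how such strings could be built with Terraform interpolation. The variable defaults mirror the values documented above, but the `local` names, the `output` blocks, and the interpolation itself are assumptions for illustration and are not taken from the module's source.

```hcl
# Illustrative sketch only: not the module's actual implementation.
variable "disk_usage" {
  type = map(string)
  default = {
    query     = "max:system.disk.in_use"
    threshold = "85"
  }
}

variable "cpu_usage" {
  type = map(string)
  default = {
    query     = "avg:aws.ec2.cpuutilization"
    threshold = "85"
  }
}

variable "trigger_by" {
  type    = string
  default = "{host,env}"
}

locals {
  # The documented disk query multiplies by 100 before comparing against the
  # percentage threshold, so the same scaling is reproduced here.
  disk_query = "avg(last_5m):${var.disk_usage["query"]}{*} by ${var.trigger_by} * 100 > ${var.disk_usage["threshold"]}"

  # The documented CPU query compares the metric directly, without scaling.
  cpu_query = "avg(last_5m):${var.cpu_usage["query"]}{*} by ${var.trigger_by} > ${var.cpu_usage["threshold"]}"
}

# Hypothetical outputs, present only so the assembled strings can be inspected.
output "disk_query" {
  value = local.disk_query
}

output "cpu_query" {
  value = local.cpu_query
}
```

With the defaults shown, the two locals evaluate to the query strings listed under Disk Usage Monitor and CPU Usage Monitor. Changing `trigger_by` to `"{host,env,service}"` would change only the `by` clause of each query.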