# Terraform Datadog Monitors Module
## Overview
This Terraform module creates basic host metric monitors for CPU and disk usage in Datadog, together with a timeboard that visualizes both metrics.
## Features
- **CPU Monitoring**: Track EC2 instance CPU utilization
- **Disk Monitoring**: Monitor disk usage across hosts
- **Automated Alerting**: No-data notifications included
- **Visualization**: Read-only timeboard with alert thresholds
- **Configurable Thresholds**: Customizable warning and critical levels
## Resources Created
- `datadog_monitor` (disk_usage): Metric alert for disk usage
- `datadog_monitor` (cpu_usage): Query alert for CPU usage
- `datadog_timeboard` (host_metrics): Read-only visualization dashboard
## Requirements
| Name | Version |
|------|---------|
| terraform | >= 0.12 |
| datadog | >= 3.2.0 |
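
The constraints above could be pinned in the calling configuration roughly as follows. This is a minimal sketch: the `DataDog/datadog` registry address and the `source` attribute are assumptions based on the Terraform 0.13+ `required_providers` syntax, and older 0.12 releases may need a version-only form instead.

```hcl
terraform {
  # Matches the Requirements table above
  required_version = ">= 0.12"

  required_providers {
    datadog = {
      # Assumed registry address; the source attribute is Terraform 0.13+ syntax
      source  = "DataDog/datadog"
      version = ">= 3.2.0"
    }
  }
}
```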
## Usage
```hcl
module "datadog_monitors" {
  source = "./terraform-datadog-monitors"

  datadog_api_key = var.datadog_api_key
  datadog_app_key = var.datadog_app_key
  api_url         = "https://api.datadoghq.eu"

  disk_usage = {
    query     = "max:system.disk.in_use"
    threshold = "85"
  }

  cpu_usage = {
    query     = "avg:aws.ec2.cpuutilization"
    threshold = "85"
  }
}
```
## Inputs
| Name | Description | Type | Required | Default |
|------|-------------|------|----------|---------|
| `datadog_api_key` | Datadog API key | `string` | yes | - |
| `datadog_app_key` | Datadog application key | `string` | yes | - |
| `api_url` | API endpoint | `string` | no | `"https://api.datadoghq.eu"` |
| `http_client_retry_enabled` | Enable request retries (429, 5xx) | `bool` | no | `true` |
| `http_client_retry_timeout` | HTTP retry timeout | `string` | no | `""` |
| `validate` | Validate the API and application keys on init | `bool` | no | `true` |
| `disk_usage` | Query and threshold for disk monitor | `map` | no | See default |
| `cpu_usage` | Query and threshold for CPU monitor | `map` | no | See default |
| `datadog_alert_footer` | Alert message footer | `string` | no | PagerDuty + Slack template |
| `trigger_by` | Grouping for alerts | `string` | no | `"{host,env}"` |
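
The connection-related inputs mirror arguments of the Datadog provider. The sketch below shows how the module might pass them through, assuming it configures the provider itself. The actual wiring is not shown in this README, and the handling of the empty-string timeout is a guess.

```hcl
# Hypothetical provider block inside the module
provider "datadog" {
  api_key  = var.datadog_api_key
  app_key  = var.datadog_app_key
  api_url  = var.api_url
  validate = var.validate

  http_client_retry_enabled = var.http_client_retry_enabled

  # Assumption: an empty string means "use the provider default", so pass null
  http_client_retry_timeout = var.http_client_retry_timeout != "" ? var.http_client_retry_timeout : null
}
```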
## Monitor Configuration
### Disk Usage Monitor
- **Query**: `avg(last_5m):max:system.disk.in_use{*} by {host,env} * 100 > 85`
- **Type**: Metric alert
- **Threshold**: 85% (configurable)
- **Evaluation**: Average over the last 5 minutes
- **Grouping**: By host and env
- **No Data**: Notifies after 10 minutes
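
The settings above map onto a `datadog_monitor` resource roughly as sketched here. The monitor name, the message text, and the way the query string is assembled from the inputs are assumptions, not the module's confirmed implementation. The `* 100` factor is there because `system.disk.in_use` is reported as a fraction between 0 and 1.

```hcl
# Sketch only: arguments are inferred from the settings listed above
resource "datadog_monitor" "disk_usage" {
  name = "Disk usage on {{host.name}}" # hypothetical name
  type = "metric alert"

  # With the defaults this renders as:
  # avg(last_5m):max:system.disk.in_use{*} by {host,env} * 100 > 85
  query = "avg(last_5m):${var.disk_usage["query"]}{*} by ${var.trigger_by} * 100 > ${var.disk_usage["threshold"]}"

  # Hypothetical message body followed by the configurable footer
  message = "Disk usage is above the configured threshold.\n${var.datadog_alert_footer}"

  monitor_thresholds {
    # Must match the threshold used in the query
    critical = var.disk_usage["threshold"]
  }

  # Notify when no data has arrived for 10 minutes
  notify_no_data    = true
  no_data_timeframe = 10
}
```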
### CPU Usage Monitor
- **Query**: `avg(last_5m):avg:aws.ec2.cpuutilization{*} by {host,env} > 85`
- **Type**: Query alert
- **Threshold**: 85% (configurable)
- **Evaluation**: Average over the last 5 minutes
- **Grouping**: By host and env
- **No Data**: Notifies after 10 minutes
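
The CPU monitor can be sketched the same way. Here the query is built with `format()` in a local value; the local name, the monitor name, and the message are hypothetical. No `* 100` factor is needed because `aws.ec2.cpuutilization` is already a percentage.

```hcl
locals {
  # Hypothetical local. With the defaults this renders as:
  # avg(last_5m):avg:aws.ec2.cpuutilization{*} by {host,env} > 85
  cpu_query = format(
    "avg(last_5m):%s{*} by %s > %s",
    var.cpu_usage["query"],
    var.trigger_by,
    var.cpu_usage["threshold"]
  )
}

# Sketch only: arguments are inferred from the settings listed above
resource "datadog_monitor" "cpu_usage" {
  name    = "CPU usage on {{host.name}}" # hypothetical name
  type    = "query alert"
  query   = local.cpu_query
  message = var.datadog_alert_footer

  monitor_thresholds {
    # Must match the threshold used in the query
    critical = var.cpu_usage["threshold"]
  }

  # Notify when no data has arrived for 10 minutes
  notify_no_data    = true
  no_data_timeframe = 10
}
```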
## Timeboard
The module creates a read-only timeboard with:
- CPU usage graph with alert threshold marker
- Disk usage graph with alert threshold marker
- Alert overlay showing when thresholds are breached
## Alert Message Template
The default alert footer notifies the following integrations:
- PagerDuty: `@pagerduty-service_name`
- Slack: `@slack-channel_name`
Customize via the `datadog_alert_footer` variable.
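
As an illustration, a custom footer could be passed as a heredoc. The handles `@pagerduty-platform-oncall` and `@slack-platform-alerts` are hypothetical placeholders. The `{{#is_alert}}` and `{{#is_recovery}}` conditionals are standard Datadog message template variables, but whether the module's default footer uses them is not stated here.

```hcl
module "datadog_monitors" {
  source = "./terraform-datadog-monitors"

  datadog_api_key = var.datadog_api_key
  datadog_app_key = var.datadog_app_key

  # Hypothetical handles: page on alert and recovery, always post to Slack
  datadog_alert_footer = <<-EOT
    {{#is_alert}}@pagerduty-platform-oncall{{/is_alert}}
    {{#is_recovery}}@pagerduty-platform-oncall{{/is_recovery}}
    @slack-platform-alerts
  EOT
}
```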
## Outputs
Currently, this module does not export any outputs.
## Customization
### Custom Thresholds
```hcl
disk_usage = {
  query     = "max:system.disk.in_use"
  threshold = "90" # Raise to 90%
}

cpu_usage = {
  query     = "avg:aws.ec2.cpuutilization"
  threshold = "75" # Lower to 75%
}
```
### Custom Grouping
```hcl
trigger_by = "{host,env,service}"
```
## Notes
- Monitors include no-data alerting by default
- Timeboard is read-only to prevent accidental modifications
- Uses 5-minute evaluation windows
- Supports HTTP client retries for reliability
- Can be reused across multiple environments via variable configuration
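
One way to reuse the module across environments is to look up per-environment values from a map. This is a sketch that assumes each environment is applied with its own state and credentials; the `environment` variable, the local name, and the threshold values are hypothetical.

```hcl
variable "datadog_api_key" {
  type = string
}

variable "datadog_app_key" {
  type = string
}

# Hypothetical selector, for example set through a per-environment tfvars file
variable "environment" {
  type = string
}

locals {
  # Hypothetical per-environment CPU thresholds
  cpu_threshold = {
    staging    = "90"
    production = "75"
  }
}

module "datadog_monitors" {
  source = "./terraform-datadog-monitors"

  datadog_api_key = var.datadog_api_key
  datadog_app_key = var.datadog_app_key

  cpu_usage = {
    query     = "avg:aws.ec2.cpuutilization"
    threshold = local.cpu_threshold[var.environment]
  }
}
```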
## License
Internal use only - Sanoma/WeBuildYourCloud
## Authors
Created and maintained by the Platform Engineering team.