diff --git a/.gitignore b/.gitignore
old mode 100644
new mode 100755
diff --git a/.terraform.lock.hcl b/.terraform.lock.hcl
old mode 100644
new mode 100755
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..29069aa
--- /dev/null
+++ b/README.md
@@ -0,0 +1,140 @@
+# Terraform Datadog Monitors Module
+
+## Overview
+
+This Terraform module creates basic host metrics monitors for CPU and disk usage, with an accompanying visualization timeboard in Datadog.
+
+## Features
+
+- **CPU Monitoring**: Track EC2 instance CPU utilization
+- **Disk Monitoring**: Monitor disk usage across hosts
+- **Automated Alerting**: No-data notifications included
+- **Visualization**: Read-only timeboard with alert thresholds
+- **Configurable Thresholds**: Customizable warning and critical levels
+
+## Resources Created
+
+- `datadog_monitor` (disk_usage): Metric alert for disk usage
+- `datadog_monitor` (cpu_usage): Query alert for CPU usage
+- `datadog_timeboard` (host_metrics): Read-only visualization dashboard
+
+## Requirements
+
+| Name | Version |
+|------|---------|
+| terraform | >= 0.12 |
+| datadog | >= 3.2.0 |
+
+## Usage
+
+```hcl
+module "datadog_monitors" {
+  source = "./terraform-datadog-monitors"
+
+  datadog_api_key = var.datadog_api_key
+  datadog_app_key = var.datadog_app_key
+  api_url         = "https://api.datadoghq.eu"
+
+  disk_usage = {
+    query     = "max:system.disk.in_use"
+    threshold = "85"
+  }
+
+  cpu_usage = {
+    query     = "avg:aws.ec2.cpuutilization"
+    threshold = "85"
+  }
+}
+```
+
+## Inputs
+
+| Name | Description | Type | Required | Default |
+|------|-------------|------|----------|---------|
+| `datadog_api_key` | Datadog API key | `string` | yes | - |
+| `datadog_app_key` | Datadog application key | `string` | yes | - |
+| `api_url` | API endpoint | `string` | no | `"https://api.datadoghq.eu"` |
+| `http_client_retry_enabled` | Enable request retries (429, 5xx) | `bool` | no | `true` |
+| `http_client_retry_timeout` | HTTP retry timeout | `string` | no | `""` |
+| `validate` | Validate API/APP keys on init | `bool` | no | `true` |
+| `disk_usage` | Query and threshold for disk monitor | `map` | no | See default |
+| `cpu_usage` | Query and threshold for CPU monitor | `map` | no | See default |
+| `datadog_alert_footer` | Alert message footer | `string` | no | PagerDuty + Slack template |
+| `trigger_by` | Grouping for alerts | `string` | no | `"{host,env}"` |
+
+## Monitor Configuration
+
+### Disk Usage Monitor
+
+- **Query**: `avg(last_5m):max:system.disk.in_use{*} by {host,env} * 100 > 85`
+- **Type**: Metric alert
+- **Threshold**: 85% (configurable)
+- **Evaluation**: Last 5 minutes average
+- **Grouping**: By host and env
+- **No Data**: Notifies after 10 minutes
+
+### CPU Usage Monitor
+
+- **Query**: `avg(last_5m):avg:aws.ec2.cpuutilization{*} by {host,env} > 85`
+- **Type**: Query alert
+- **Threshold**: 85% (configurable)
+- **Evaluation**: Last 5 minutes average
+- **Grouping**: By host and env
+- **No Data**: Notifies after 10 minutes
+
+## Timeboard
+
+The module creates a read-only timeboard with:
+- CPU usage graph with alert threshold marker
+- Disk usage graph with alert threshold marker
+- Alert overlay showing when thresholds are breached
+
+## Alert Message Template
+
+Default alert footer includes integration with:
+- PagerDuty: `@pagerduty-service_name`
+- Slack: `@slack-channel_name`
+
+Customize via the `datadog_alert_footer` variable.
+
+## Outputs
+
+Currently, this module does not export any outputs.
+
+## Customization
+
+### Custom Thresholds
+
+```hcl
+disk_usage = {
+  query     = "max:system.disk.in_use"
+  threshold = "90" # Raise to 90%
+}
+
+cpu_usage = {
+  query     = "avg:aws.ec2.cpuutilization"
+  threshold = "75" # Lower to 75%
+}
+```
+
+### Custom Grouping
+
+```hcl
+trigger_by = "{host,env,service}"
+```
+
+## Notes
+
+- Monitors include no-data alerting by default
+- Timeboard is read-only to prevent accidental modifications
+- Uses 5-minute evaluation windows
+- Supports HTTP client retries for reliability
+- Can be reused across multiple environments via variable configuration
+
+## License
+
+Internal use only - Sanoma/WeBuildYourCloud
+
+## Authors
+
+Created and maintained by the Platform Engineering team.
diff --git a/main.tf b/main.tf
old mode 100644
new mode 100755
diff --git a/provider.tf b/provider.tf
old mode 100644
new mode 100755
diff --git a/variables.tf b/variables.tf
old mode 100644
new mode 100755
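
The README documents the final disk monitor query, but `main.tf` changes only its file mode in this diff, so how the query string is assembled is not visible here. The sketch below shows one way the documented inputs (`disk_usage`, `trigger_by`) could combine into that query. It is an assumption for illustration, not the module's actual code: the `disk_query` local and output are hypothetical names, and the variable defaults are taken from the README's usage example.

```hcl
# Hypothetical scratch configuration; not part of the module in this diff.

variable "disk_usage" {
  type = map(string)
  default = {
    query     = "max:system.disk.in_use"
    threshold = "85"
  }
}

variable "trigger_by" {
  type    = string
  default = "{host,env}"
}

locals {
  # Assumed composition: evaluation window, metric query, scope, grouping,
  # scaling to a percentage, then the threshold comparison.
  # With the defaults above this renders as:
  #   avg(last_5m):max:system.disk.in_use{*} by {host,env} * 100 > 85
  disk_query = "avg(last_5m):${var.disk_usage["query"]}{*} by ${var.trigger_by} * 100 > ${var.disk_usage["threshold"]}"
}

# Exposed only so the rendered string can be inspected with `terraform output`.
output "disk_query" {
  value = local.disk_query
}
```

Running `terraform apply` on this scratch configuration and comparing the output with the query listed under "Disk Usage Monitor" is a quick way to check a custom `trigger_by` or threshold before changing the real monitors.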