From 372fa8fabc0916b8e3310f9a4b6fddd8054b5d1b Mon Sep 17 00:00:00 2001
From: Patrick de Ruiter
Date: Sat, 1 Nov 2025 10:43:46 +0100
Subject: [PATCH] Initial commit with README and module files

---
 .gitignore          |   0
 .terraform.lock.hcl |   0
 README.md           | 140 ++++++++++++++++++++++++++++++++++++++++++++
 main.tf             |   0
 provider.tf         |   0
 variables.tf        |   0
 6 files changed, 140 insertions(+)
 mode change 100644 => 100755 .gitignore
 mode change 100644 => 100755 .terraform.lock.hcl
 create mode 100644 README.md
 mode change 100644 => 100755 main.tf
 mode change 100644 => 100755 provider.tf
 mode change 100644 => 100755 variables.tf

diff --git a/.gitignore b/.gitignore
old mode 100644
new mode 100755
diff --git a/.terraform.lock.hcl b/.terraform.lock.hcl
old mode 100644
new mode 100755
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..29069aa
--- /dev/null
+++ b/README.md
@@ -0,0 +1,140 @@
+# Terraform Datadog Monitors Module
+
+## Overview
+
+This Terraform module creates basic host metrics monitors for CPU and disk usage with an accompanying visualization timeboard in Datadog.
+
+## Features
+
+- **CPU Monitoring**: Track EC2 instance CPU utilization
+- **Disk Monitoring**: Monitor disk usage across hosts
+- **Automated Alerting**: No-data notifications included
+- **Visualization**: Read-only timeboard with alert thresholds
+- **Configurable Thresholds**: Customizable warning and critical levels
+
+## Resources Created
+
+- `datadog_monitor` (disk_usage): Metric alert for disk usage
+- `datadog_monitor` (cpu_usage): Query alert for CPU usage
+- `datadog_timeboard` (host_metrics): Read-only visualization dashboard
+
+## Requirements
+
+| Name | Version |
+|------|---------|
+| terraform | >= 0.12 |
+| datadog | >= 3.2.0 |
+
+## Usage
+
+```hcl
+module "datadog_monitors" {
+  source = "./terraform-datadog-monitors"
+
+  datadog_api_key = var.datadog_api_key
+  datadog_app_key = var.datadog_app_key
+  api_url = "https://api.datadoghq.eu"
+
+  disk_usage = {
+    query = "max:system.disk.in_use"
+    threshold = "85"
+  }
+
+  cpu_usage = {
+    query = "avg:aws.ec2.cpuutilization"
+    threshold = "85"
+  }
+}
+```
+
+## Inputs
+
+| Name | Description | Type | Required | Default |
+|------|-------------|------|----------|---------|
+| `datadog_api_key` | Datadog API key | `string` | yes | - |
+| `datadog_app_key` | Datadog APP key | `string` | yes | - |
+| `api_url` | API endpoint | `string` | no | `"https://api.datadoghq.eu"` |
+| `http_client_retry_enabled` | Enable request retries (429, 5xx) | `bool` | no | `true` |
+| `http_client_retry_timeout` | HTTP retry timeout | `string` | no | `""` |
+| `validate` | Validate API/APP keys on init | `bool` | no | `true` |
+| `disk_usage` | Query and threshold for disk monitor | `map` | no | See default |
+| `cpu_usage` | Query and threshold for CPU monitor | `map` | no | See default |
+| `datadog_alert_footer` | Alert message footer | `string` | no | PagerDuty + Slack template |
+| `trigger_by` | Grouping for alerts | `string` | no | `"{host,env}"` |
+
+## Monitor Configuration
+
+### Disk Usage Monitor
+
+- **Query**: `avg(last_5m):max:system.disk.in_use{*} by {host,env} * 100 > 85`
+- **Type**: Metric alert
+- **Threshold**: 85% (configurable)
+- **Evaluation**: Last 5 minutes average
+- **Grouping**: By host and env
+- **No Data**: Notifies after 10 minutes
+
+### CPU Usage Monitor
+
+- **Query**: `avg(last_5m):avg:aws.ec2.cpuutilization{*} by {host,env} > 85`
+- **Type**: Query alert
+- **Threshold**: 85% (configurable)
+- **Evaluation**: Last 5 minutes average
+- **Grouping**: By host and env
+- **No Data**: Notifies after 10 minutes
+
+## Timeboard
+
+The module creates a read-only timeboard with:
+- CPU usage graph with alert threshold marker
+- Disk usage graph with alert threshold marker
+- Alert overlay showing when thresholds are breached
+
+## Alert Message Template
+
+Default alert footer includes integration with:
+- PagerDuty: `@pagerduty-service_name`
+- Slack: `@slack-channel_name`
+
+Customize via the `datadog_alert_footer` variable.
+
+## Outputs
+
+Currently, this module does not export any outputs.
+
+## Customization
+
+### Custom Thresholds
+
+```hcl
+disk_usage = {
+  query = "max:system.disk.in_use"
+  threshold = "90" # Raise to 90%
+}
+
+cpu_usage = {
+  query = "avg:aws.ec2.cpuutilization"
+  threshold = "75" # Lower to 75%
+}
+```
+
+### Custom Grouping
+
+```hcl
+trigger_by = "{host,env,service}"
+```
+
+## Notes
+
+- Monitors include no-data alerting by default
+- Timeboard is read-only to prevent accidental modifications
+- Uses 5-minute evaluation windows
+- Supports HTTP client retries for reliability
+- Can be reused across multiple environments via variable configuration
+
+## License
+
+Internal use only - Sanoma/WeBuildYourCloud
+
+## Authors
+
+Created and maintained by the Platform Engineering team.
diff --git a/main.tf b/main.tf
old mode 100644
new mode 100755
diff --git a/provider.tf b/provider.tf
old mode 100644
new mode 100755
diff --git a/variables.tf b/variables.tf
old mode 100644
new mode 100755
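
Note: this patch only changes the file mode of `main.tf` and `variables.tf`, so the monitor definitions that the README describes are not visible here. As a rough illustration only, the documented disk query (`avg(last_5m):max:system.disk.in_use{*} by {host,env} * 100 > 85`) could be assembled from the `disk_usage` and `trigger_by` inputs along the following lines. The resource layout, the monitor name, and the message text are assumptions made for this sketch, not the module's actual contents.

```hcl
# Hypothetical sketch; the real main.tf and variables.tf are not shown in this patch.

variable "disk_usage" {
  type = map(string)
  default = {
    query     = "max:system.disk.in_use"
    threshold = "85"
  }
}

variable "trigger_by" {
  type    = string
  default = "{host,env}"
}

resource "datadog_monitor" "disk_usage" {
  name    = "Disk usage high on {{host.name}}" # illustrative name, not from the module
  type    = "metric alert"
  message = "Disk usage exceeded the configured threshold." # illustrative message

  # Builds: avg(last_5m):max:system.disk.in_use{*} by {host,env} * 100 > 85
  query = "avg(last_5m):${var.disk_usage["query"]}{*} by ${var.trigger_by} * 100 > ${var.disk_usage["threshold"]}"

  monitor_thresholds {
    critical = var.disk_usage["threshold"]
  }

  # README: "No Data: Notifies after 10 minutes"
  notify_no_data    = true
  no_data_timeframe = 10
}
```

Under these assumptions, overriding `threshold` or `trigger_by` as shown in the README's Customization section changes both the query string and the critical threshold in one place, which keeps the two from drifting apart.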