
# Terraform Datadog Monitors Module

## Overview

This Terraform module creates basic host metric monitors for CPU and disk usage in Datadog, along with an accompanying timeboard that visualizes both metrics.

## Features

- **CPU Monitoring**: Track EC2 instance CPU utilization
- **Disk Monitoring**: Monitor disk usage across hosts
- **Automated Alerting**: No-data notifications included
- **Visualization**: Read-only timeboard with alert thresholds
- **Configurable Thresholds**: Customizable warning and critical levels

## Resources Created

- `datadog_monitor` (disk_usage): Metric alert for disk usage
- `datadog_monitor` (cpu_usage): Query alert for CPU usage
- `datadog_timeboard` (host_metrics): Read-only visualization dashboard

## Requirements

| Name | Version |
|------|---------|
| terraform | >= 0.12 |
| datadog | >= 3.2.0 |

## Usage

```hcl
module "datadog_monitors" {
  source = "./terraform-datadog-monitors"

  datadog_api_key = var.datadog_api_key
  datadog_app_key = var.datadog_app_key
  api_url         = "https://api.datadoghq.eu"

  disk_usage = {
    query     = "max:system.disk.in_use"
    threshold = "85"
  }

  cpu_usage = {
    query     = "avg:aws.ec2.cpuutilization"
    threshold = "85"
  }
}
```

## Inputs

| Name | Description | Type | Required | Default |
|------|-------------|------|----------|---------|
| `datadog_api_key` | Datadog API key | `string` | yes | - |
| `datadog_app_key` | Datadog application key | `string` | yes | - |
| `api_url` | Datadog API endpoint URL | `string` | no | `"https://api.datadoghq.eu"` |
| `http_client_retry_enabled` | Enable request retries (429, 5xx) | `bool` | no | `true` |
| `http_client_retry_timeout` | HTTP retry timeout | `string` | no | `""` |
| `validate` | Validate API and application keys when the provider initializes | `bool` | no | `true` |
| `disk_usage` | Query and threshold for disk monitor | `map` | no | See default |
| `cpu_usage` | Query and threshold for CPU monitor | `map` | no | See default |
| `datadog_alert_footer` | Alert message footer | `string` | no | PagerDuty + Slack template |
| `trigger_by` | Grouping for alerts | `string` | no | `"{host,env}"` |
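The optional provider-related inputs can be set alongside the required keys. The following is a minimal sketch, not taken from the module source: the value `"120"` is illustrative, and reading it as seconds is an assumption based on how the Datadog provider treats its own retry timeout.

```hcl
module "datadog_monitors" {
  source = "./terraform-datadog-monitors"

  datadog_api_key = var.datadog_api_key
  datadog_app_key = var.datadog_app_key

  # Retry requests that fail with 429 or 5xx responses (module default: true).
  http_client_retry_enabled = true

  # Illustrative value; assumed to be interpreted as seconds.
  http_client_retry_timeout = "120"

  # Skip key validation, for example when planning without network access.
  validate = false
}
```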

## Monitor Configuration

### Disk Usage Monitor

- **Query**: `avg(last_5m):max:system.disk.in_use{*} by {host,env} * 100 > 85`
- **Type**: Metric alert
- **Threshold**: 85% (configurable)
- **Evaluation**: Average over the last 5 minutes
- **Grouping**: By host and env
- **No Data**: Notifies after 10 minutes without data
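The settings above could map onto a `datadog_monitor` resource roughly as follows. This is a sketch under assumptions, not the module's actual code: the monitor name and message text are hypothetical, the `var.*` references are the inputs listed in the Inputs table, and the argument names follow the Datadog provider 3.x schema.

```hcl
resource "datadog_monitor" "disk_usage" {
  # Hypothetical name and message; the module may use different wording.
  name    = "Disk usage is high on {{host.name}}"
  type    = "metric alert"
  message = "Disk usage is above ${var.disk_usage.threshold}%.\n${var.datadog_alert_footer}"

  # With the defaults this renders to:
  #   avg(last_5m):max:system.disk.in_use{*} by {host,env} * 100 > 85
  query = "avg(last_5m):${var.disk_usage.query}{*} by ${var.trigger_by} * 100 > ${var.disk_usage.threshold}"

  monitor_thresholds {
    # Must match the threshold used in the query.
    critical = var.disk_usage.threshold
  }

  # Notify when no data has arrived for 10 minutes.
  notify_no_data    = true
  no_data_timeframe = 10
}
```

The `* 100` factor is needed because `system.disk.in_use` is reported as a fraction between 0 and 1, while the threshold is expressed as a percentage. A CPU monitor would follow the same pattern without the factor, since `aws.ec2.cpuutilization` is already a percentage.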

### CPU Usage Monitor

- **Query**: `avg(last_5m):avg:aws.ec2.cpuutilization{*} by {host,env} > 85`
- **Type**: Query alert
- **Threshold**: 85% (configurable)
- **Evaluation**: Average over the last 5 minutes
- **Grouping**: By host and env
- **No Data**: Notifies after 10 minutes without data

## Timeboard

The module creates a read-only timeboard with:
- CPU usage graph with alert threshold marker
- Disk usage graph with alert threshold marker
- Alert overlay showing when thresholds are breached

## Alert Message Template

The default alert footer includes notification handles for:
- PagerDuty: `@pagerduty-service_name`
- Slack: `@slack-channel_name`

Customize via the `datadog_alert_footer` variable.
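As an illustration, a custom footer could route alerts and no-data notifications to different handles using Datadog's conditional template variables. The handles `@pagerduty-platform-oncall` and `@slack-platform-alerts` below are hypothetical placeholders; replace them with the names of your own PagerDuty service and Slack channel.

```hcl
module "datadog_monitors" {
  source = "./terraform-datadog-monitors"

  datadog_api_key = var.datadog_api_key
  datadog_app_key = var.datadog_app_key

  # Hypothetical handles: page on alerts, post to Slack when data is missing.
  datadog_alert_footer = <<-EOT
    {{#is_alert}}@pagerduty-platform-oncall{{/is_alert}}
    {{#is_no_data}}@slack-platform-alerts{{/is_no_data}}
  EOT
}
```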

## Outputs

Currently, this module does not export any outputs.

## Customization

### Custom Thresholds

```hcl
disk_usage = {
  query     = "max:system.disk.in_use"
  threshold = "90"  # Raise to 90%
}

cpu_usage = {
  query     = "avg:aws.ec2.cpuutilization"
  threshold = "75"  # Lower to 75%
}
```

### Custom Grouping

```hcl
trigger_by = "{host,env,service}"
```
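Assuming the module interpolates `trigger_by` directly into the `by` clause of each query, as the default queries under Monitor Configuration suggest, a narrower grouping would change the rendered queries as sketched in the comments below. The rendered forms are an expectation, not verified against the module source.

```hcl
# Alert per host only, ignoring the env tag.
trigger_by = "{host}"

# Expected rendered queries with the default thresholds (assumption):
#   avg(last_5m):max:system.disk.in_use{*} by {host} * 100 > 85
#   avg(last_5m):avg:aws.ec2.cpuutilization{*} by {host} > 85
```

Every tag named in `trigger_by` needs to be present on the underlying metric; otherwise Datadog cannot split the alert by that tag.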

## Notes

- Monitors include no-data alerting by default
- Timeboard is read-only to prevent accidental modifications
- Uses 5-minute evaluation windows
- Supports HTTP client retries for reliability
- Can be reused across multiple environments via variable configuration
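Reuse across environments might look like the sketch below, which instantiates the module twice with different credentials and thresholds. It assumes each environment reports to its own Datadog organization; the `var.staging_*` and `var.production_*` variables are hypothetical and would need to be declared in the calling configuration.

```hcl
module "datadog_monitors_staging" {
  source = "./terraform-datadog-monitors"

  # Hypothetical per-environment credentials.
  datadog_api_key = var.staging_datadog_api_key
  datadog_app_key = var.staging_datadog_app_key

  # Looser disk threshold for staging.
  disk_usage = {
    query     = "max:system.disk.in_use"
    threshold = "90"
  }
}

module "datadog_monitors_production" {
  source = "./terraform-datadog-monitors"

  # Hypothetical per-environment credentials.
  datadog_api_key = var.production_datadog_api_key
  datadog_app_key = var.production_datadog_app_key

  # Stricter CPU threshold for production.
  cpu_usage = {
    query     = "avg:aws.ec2.cpuutilization"
    threshold = "75"
  }
}
```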

## License

Internal use only - Sanoma/WeBuildYourCloud

## Authors

Created and maintained by the Platform Engineering team.