module "datadog_monitors" {
  source = "./terraform-datadog-monitors"

  # Datadog credentials and API endpoint
  datadog_api_key = var.datadog_api_key
  datadog_app_key = var.datadog_app_key
  api_url         = "https://api.datadoghq.eu"

  # Disk monitor: metric query and alert threshold (percent)
  disk_usage = {
    query     = "max:system.disk.in_use"
    threshold = "85"
  }

  # CPU monitor: metric query and alert threshold (percent)
  cpu_usage = {
    query     = "avg:aws.ec2.cpuutilization"
    threshold = "85"
  }
}
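
The example above reads both keys from `var.datadog_api_key` and `var.datadog_app_key`. A minimal sketch of how the calling configuration might declare them follows. These declarations are an assumption, not part of the module, and `sensitive` requires Terraform 0.14 or later.

```hcl
# Hypothetical declarations in the calling (root) configuration.
variable "datadog_api_key" {
  description = "Datadog API key"
  type        = string
  sensitive   = true # keeps the value out of plan and apply output
}

variable "datadog_app_key" {
  description = "Datadog APP key"
  type        = string
  sensitive   = true
}
```

The values can then be supplied without committing them, for example through the `TF_VAR_datadog_api_key` and `TF_VAR_datadog_app_key` environment variables.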

Inputs

| Name | Description | Type | Required | Default |
|------|-------------|------|----------|---------|
| `datadog_api_key` | Datadog API key | `string` | yes | - |
| `datadog_app_key` | Datadog APP key | `string` | yes | - |
| `api_url` | API endpoint | `string` | no | `"https://api.datadoghq.eu"` |
| `http_client_retry_enabled` | Enable request retries (429, 5xx) | `bool` | no | `true` |
| `http_client_retry_timeout` | HTTP retry timeout | `string` | no | `""` |
| `validate` | Validate API/APP keys on init | `bool` | no | `true` |
| `disk_usage` | Query and threshold for disk monitor | `map` | no | See default |
| `cpu_usage` | Query and threshold for CPU monitor | `map` | no | See default |
| `datadog_alert_footer` | Alert message footer | `string` | no | PagerDuty + Slack template |
| `trigger_by` | Grouping for alerts | `string` | no | `"{host,env}"` |

Monitor Configuration

Disk Usage Monitor

Query: avg(last_5m):max:system.disk.in_use{*} by {host,env} * 100 > 85
Type: Metric alert
Threshold: 85% (configurable)
Evaluation: Average over the last 5 minutes
Grouping: By host and env
No Data: Notifies after 10 minutes
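
The settings listed for the disk monitor could map onto a `datadog_monitor` resource roughly as follows. This is a sketch under assumptions: the module's actual resource definition is not shown here, the resource name and monitor name are hypothetical, and the block is meant to sit inside the module where `disk_usage`, `trigger_by`, and `datadog_alert_footer` are declared.

```hcl
# Sketch only: not the module's actual source.
resource "datadog_monitor" "disk_usage" {
  name    = "Disk usage is high on {{host.name}}" # hypothetical name
  type    = "metric alert"
  message = "Disk usage is above ${var.disk_usage["threshold"]}%.\n${var.datadog_alert_footer}"

  # With the defaults this builds:
  # avg(last_5m):max:system.disk.in_use{*} by {host,env} * 100 > 85
  query = "avg(last_5m):${var.disk_usage["query"]}{*} by ${var.trigger_by} * 100 > ${var.disk_usage["threshold"]}"

  # The critical threshold must match the value used in the query.
  monitor_thresholds {
    critical = var.disk_usage["threshold"]
  }

  # Notify when no data has arrived for 10 minutes.
  notify_no_data    = true
  no_data_timeframe = 10
}
```

Building the query from the inputs is what makes the threshold and grouping configurable without editing the monitor itself.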

CPU Usage Monitor

Query: avg(last_5m):avg:aws.ec2.cpuutilization{*} by {host,env} > 85
Type: Query alert
Threshold: 85% (configurable)
Evaluation: Average over the last 5 minutes
Grouping: By host and env
No Data: Notifies after 10 minutes

Timeboard

The module creates a read-only timeboard with:

CPU usage graph with alert threshold marker
Disk usage graph with alert threshold marker
Alert overlay showing when thresholds are breached
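
As an illustration of a graph with a threshold marker, the sketch below uses the `datadog_dashboard` resource with an ordered layout, which corresponds to a timeboard. The resource type, titles, and marker styling are assumptions, since the module's own definition is not shown here. The read-only setting and the alert overlay are left out because their attribute names depend on the provider version.

```hcl
# Sketch only: resource type, titles, and styling are assumptions.
resource "datadog_dashboard" "overview" {
  title       = "CPU and disk usage" # hypothetical title
  layout_type = "ordered"            # ordered layout corresponds to a timeboard

  widget {
    timeseries_definition {
      title = "Disk usage (%)"

      request {
        q            = "${var.disk_usage["query"]}{*} by ${var.trigger_by} * 100"
        display_type = "line"
      }

      # Horizontal line drawn at the alert threshold
      marker {
        value        = "y = ${var.disk_usage["threshold"]}"
        display_type = "error dashed"
        label        = "Alert threshold"
      }
    }
  }

  widget {
    timeseries_definition {
      title = "CPU usage (%)"

      request {
        q            = "${var.cpu_usage["query"]}{*} by ${var.trigger_by}"
        display_type = "line"
      }

      marker {
        value        = "y = ${var.cpu_usage["threshold"]}"
        display_type = "error dashed"
        label        = "Alert threshold"
      }
    }
  }
}
```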

Alert Message Template

The default alert footer notifies the following integrations:

PagerDuty: @pagerduty-service_name
Slack: @slack-channel_name

Customize the footer via the `datadog_alert_footer` variable.
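
A minimal sketch of overriding the footer is shown below. The PagerDuty service and Slack channel handles are hypothetical placeholders, and the footer wording is only an example.

```hcl
module "datadog_monitors" {
  source = "./terraform-datadog-monitors"

  datadog_api_key = var.datadog_api_key
  datadog_app_key = var.datadog_app_key

  # Hypothetical handles: replace with your own PagerDuty service and Slack channel.
  datadog_alert_footer = <<-EOT
    Notify: @pagerduty-platform-oncall @slack-platform-alerts
  EOT
}
```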

Outputs

Currently, this module does not export any outputs.

Customization

Custom Thresholds

disk_usage = {
  query     = "max:system.disk.in_use"
  threshold = "90"  # Raise to 90%
}

cpu_usage = {
  query     = "avg:aws.ec2.cpuutilization"
  threshold = "75"  # Lower to 75%
}

Custom Grouping

trigger_by = "{host,env,service}"

Notes

Both monitors alert on missing data by default, notifying after 10 minutes without data
The timeboard is read-only to prevent accidental modifications
Both monitors evaluate over a 5-minute window
HTTP client retries (429, 5xx) are supported and enabled by default
The module can be reused across environments by changing its input variables
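
Reuse across environments could look like the sketch below, with one module block per environment. It assumes each environment reports to its own Datadog organization with its own keys. The `staging_*` and `production_*` variable names are hypothetical and would need to be declared in the calling configuration.

```hcl
# Sketch only: variable names for the per-environment keys are hypothetical.
module "datadog_monitors_staging" {
  source = "./terraform-datadog-monitors"

  datadog_api_key = var.staging_datadog_api_key
  datadog_app_key = var.staging_datadog_app_key

  # Looser CPU threshold for a noisier environment
  cpu_usage = {
    query     = "avg:aws.ec2.cpuutilization"
    threshold = "90"
  }
}

module "datadog_monitors_production" {
  source = "./terraform-datadog-monitors"

  datadog_api_key = var.production_datadog_api_key
  datadog_app_key = var.production_datadog_app_key

  # Retries are on by default; shown here only to make the setting explicit
  http_client_retry_enabled = true

  cpu_usage = {
    query     = "avg:aws.ec2.cpuutilization"
    threshold = "75"
  }
}
```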

License

Internal use only - Sanoma/WeBuildYourCloud

Authors

Created and maintained by the Platform Engineering team.