Terraform Datadog Monitors Module

Overview

This Terraform module creates basic host-metric monitors for CPU and disk usage in Datadog, along with an accompanying timeboard that visualizes them.

Features

  • CPU Monitoring: Track EC2 instance CPU utilization
  • Disk Monitoring: Monitor disk usage across hosts
  • Automated Alerting: No-data notifications included
  • Visualization: Read-only timeboard with alert thresholds
  • Configurable Thresholds: Customizable warning and critical levels

Resources Created

  • datadog_monitor (disk_usage): Metric alert for disk usage
  • datadog_monitor (cpu_usage): Query alert for CPU usage
  • datadog_timeboard (host_metrics): Read-only visualization dashboard

Requirements

Name       Version
terraform  >= 0.12
datadog    >= 3.2.0
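
As a rough illustration, these constraints could be pinned in the calling configuration as shown below. This is a sketch rather than the module's actual versions file: the source address DataDog/datadog is the public registry address of the provider, and the source attribute assumes Terraform 0.13 or later syntax.

```hcl
terraform {
  # Minimum Terraform version from the table above
  required_version = ">= 0.12"

  required_providers {
    datadog = {
      # Assumption: provider is pulled from the public Terraform Registry
      source  = "DataDog/datadog"
      version = ">= 3.2.0"
    }
  }
}
```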

Usage

module "datadog_monitors" {
  source = "./terraform-datadog-monitors"

  datadog_api_key = var.datadog_api_key
  datadog_app_key = var.datadog_app_key
  api_url         = "https://api.datadoghq.eu"
  
  disk_usage = {
    query     = "max:system.disk.in_use"
    threshold = "85"
  }
  
  cpu_usage = {
    query     = "avg:aws.ec2.cpuutilization"
    threshold = "85"
  }
}
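
The example above reads the keys from var.datadog_api_key and var.datadog_app_key, so the calling configuration needs matching variable declarations. A minimal sketch, assuming the keys are supplied from outside the code (the descriptions are illustrative):

```hcl
# Declared in the root configuration that calls the module
variable "datadog_api_key" {
  description = "Datadog API key, passed through to the module"
  type        = string
}

variable "datadog_app_key" {
  description = "Datadog application key, passed through to the module"
  type        = string
}
```

The values can then be provided without committing them, for example through the TF_VAR_datadog_api_key and TF_VAR_datadog_app_key environment variables.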

Inputs

Name                       Description                           Type    Required  Default
datadog_api_key            Datadog API key                       string  yes       -
datadog_app_key            Datadog APP key                       string  yes       -
api_url                    API endpoint                          string  no        "https://api.datadoghq.eu"
http_client_retry_enabled  Enable request retries (429, 5xx)     bool    no        true
http_client_retry_timeout  HTTP retry timeout                    string  no        ""
validate                   Validate API/APP keys on init         bool    no        true
disk_usage                 Query and threshold for disk monitor  map     no        See default
cpu_usage                  Query and threshold for CPU monitor   map     no        See default
datadog_alert_footer       Alert message footer                  string  no        PagerDuty + Slack template
trigger_by                 Grouping for alerts                   string  no        "{host,env}"
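
Several of these inputs share their names with arguments of the Datadog provider, which suggests they are passed straight through to a provider block inside the module. The sketch below shows what that wiring might look like; the module source is not reproduced here, so treat the mapping as an assumption.

```hcl
# Assumed provider wiring inside the module (not copied from its source)
provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
  api_url = var.api_url

  # Check the API/APP keys when the provider initializes
  validate = var.validate

  # Retry requests that fail with HTTP 429 or 5xx responses
  http_client_retry_enabled = var.http_client_retry_enabled
}
```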

Monitor Configuration

Disk Usage Monitor

  • Query: avg(last_5m):max:system.disk.in_use{*} by {host,env} * 100 > 85
  • Type: Metric alert
  • Threshold: 85% (configurable)
  • Evaluation: Average over the last 5 minutes
  • Grouping: By host and env
  • No Data: Notifies after 10 minutes without data
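
The settings above could map onto a datadog_monitor resource roughly as follows. This is a sketch under stated assumptions, not the module's actual code: the monitor name and message text are hypothetical, and the way the query string is assembled from var.disk_usage and var.trigger_by is inferred from the rendered query shown above.

```hcl
resource "datadog_monitor" "disk_usage" {
  name = "Disk usage high on {{host.name}}" # hypothetical name
  type = "metric alert"

  # Hypothetical message; the footer variable carries the notification handles
  message = "Disk usage is above ${var.disk_usage["threshold"]}%.\n${var.datadog_alert_footer}"

  # With the defaults this renders to:
  # avg(last_5m):max:system.disk.in_use{*} by {host,env} * 100 > 85
  query = "avg(last_5m):${var.disk_usage["query"]}{*} by ${var.trigger_by} * 100 > ${var.disk_usage["threshold"]}"

  monitor_thresholds {
    # Must match the threshold used in the query
    critical = var.disk_usage["threshold"]
  }

  # Notify when no data has arrived for 10 minutes
  notify_no_data    = true
  no_data_timeframe = 10
}
```

The multiplication by 100 converts system.disk.in_use, which Datadog reports as a fraction between 0 and 1, into a percentage so it can be compared against the threshold.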

CPU Usage Monitor

  • Query: avg(last_5m):avg:aws.ec2.cpuutilization{*} by {host,env} > 85
  • Type: Query alert
  • Threshold: 85% (configurable)
  • Evaluation: Average over the last 5 minutes
  • Grouping: By host and env
  • No Data: Notifies after 10 minutes without data

Timeboard

The module creates a read-only timeboard with:

  • CPU usage graph with alert threshold marker
  • Disk usage graph with alert threshold marker
  • Alert overlay showing when thresholds are breached

Alert Message Template

The default alert footer includes notification handles for:

  • PagerDuty: @pagerduty-service_name
  • Slack: @slack-channel_name

Customize via the datadog_alert_footer variable.
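
For instance, a custom footer might be set in the module block as sketched below. The handles platform-oncall and platform-alerts are hypothetical placeholders; replace them with the PagerDuty service and Slack channel configured in your Datadog account.

```hcl
# Hypothetical handles, shown for illustration only
datadog_alert_footer = "Notify: @pagerduty-platform-oncall @slack-platform-alerts"
```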

Outputs

Currently, this module does not export any outputs.

Customization

Custom Thresholds

disk_usage = {
  query     = "max:system.disk.in_use"
  threshold = "90"  # Raise to 90%
}

cpu_usage = {
  query     = "avg:aws.ec2.cpuutilization"
  threshold = "75"  # Lower to 75%
}

Custom Grouping

trigger_by = "{host,env,service}"

Notes

  • Monitors include no-data alerting by default
  • Timeboard is read-only to prevent accidental modifications
  • Uses 5-minute evaluation windows
  • Supports HTTP client retries for reliability
  • Can be reused across multiple environments via variable configuration

License

Internal use only - Sanoma/WeBuildYourCloud

Authors

Created and maintained by the Platform Engineering team.