Terraform Datadog Monitors Module

Overview

This Terraform module creates basic host metric monitors for CPU and disk usage, along with an accompanying Datadog timeboard that visualizes them.

Features

  • CPU Monitoring: Track EC2 instance CPU utilization
  • Disk Monitoring: Monitor disk usage across hosts
  • Automated Alerting: No-data notifications included
  • Visualization: Read-only timeboard with alert thresholds
  • Configurable Thresholds: Customizable warning and critical levels

Resources Created

  • datadog_monitor (disk_usage): Metric alert for disk usage
  • datadog_monitor (cpu_usage): Query alert for CPU usage
  • datadog_timeboard (host_metrics): Read-only visualization dashboard

Requirements

Name       Version
terraform  >= 0.12
datadog    >= 3.2.0
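
The version constraints above could be pinned in the calling configuration roughly as follows. This is a minimal sketch rather than the module's actual versions file: the object form of required_providers with a source attribute needs Terraform 0.13 or newer, so on 0.12 the older string form (datadog = ">= 3.2.0") would apply instead.

```hcl
terraform {
  # Matches the minimum Terraform version listed in Requirements.
  required_version = ">= 0.12"

  required_providers {
    datadog = {
      # Registry address of the Datadog provider; the object form
      # shown here requires Terraform 0.13+.
      source  = "DataDog/datadog"
      version = ">= 3.2.0"
    }
  }
}
```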

Usage

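module "datadog_monitors" {
  source = "./terraform-datadog-monitors"

  # Credentials and API endpoint for the Datadog provider
  datadog_api_key = var.datadog_api_key
  datadog_app_key = var.datadog_app_key
  api_url         = "https://api.datadoghq.eu"

  # Disk monitor: metric query and alert threshold (percent)
  disk_usage = {
    query     = "max:system.disk.in_use"
    threshold = "85"
  }

  # CPU monitor: metric query and alert threshold (percent)
  cpu_usage = {
    query     = "avg:aws.ec2.cpuutilization"
    threshold = "85"
  }
}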

Inputs

Name                       Description                           Type    Required  Default
datadog_api_key            Datadog API key                       string  yes       -
datadog_app_key            Datadog APP key                       string  yes       -
api_url                    API endpoint                          string  no        "https://api.datadoghq.eu"
http_client_retry_enabled  Enable request retries (429, 5xx)     bool    no        true
http_client_retry_timeout  HTTP retry timeout                    string  no        ""
validate                   Validate API/APP keys on init         bool    no        true
disk_usage                 Query and threshold for disk monitor  map     no        See default
cpu_usage                  Query and threshold for CPU monitor   map     no        See default
datadog_alert_footer       Alert message footer                  string  no        PagerDuty + Slack template
trigger_by                 Grouping for alerts                   string  no        "{host,env}"
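
Several of these inputs share names with arguments of the Datadog provider, so they are presumably forwarded to a provider block inside the module. The sketch below shows one way that wiring might look; the mapping is an assumption, not taken from the module source, and http_client_retry_timeout is left out because this README does not show how its empty default is handled.

```hcl
provider "datadog" {
  # Credentials and endpoint taken from the module inputs.
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
  api_url = var.api_url

  # Check the API/APP keys when the provider initializes.
  validate = var.validate

  # Retry requests that fail with HTTP 429 or 5xx.
  http_client_retry_enabled = var.http_client_retry_enabled
}
```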

Monitor Configuration

Disk Usage Monitor

  • Query: avg(last_5m):max:system.disk.in_use{*} by {host,env} * 100 > 85
  • Type: Metric alert
  • Threshold: 85% (configurable)
  • Evaluation: Average over the last 5 minutes
  • Grouping: By host and env
  • No Data: Notifies after 10 minutes without data
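
As an illustration of how the settings above could map onto a datadog_monitor resource, here is a hedged sketch. The monitor name and message text are hypothetical, and the way the query is assembled from disk_usage and trigger_by is an assumption based on the query shown above, not the module's actual code.

```hcl
resource "datadog_monitor" "disk_usage" {
  name = "Disk usage is high on {{host.name}}" # hypothetical name
  type = "metric alert"

  # Assemble the query from the inputs; trigger_by already includes its braces.
  query = "avg(last_5m):${var.disk_usage["query"]}{*} by ${var.trigger_by} * 100 > ${var.disk_usage["threshold"]}"

  # Hypothetical message body followed by the configurable footer.
  message = <<-EOT
    Disk usage is above ${var.disk_usage["threshold"]}% on {{host.name}}.
    ${var.datadog_alert_footer}
  EOT

  monitor_thresholds {
    # Must match the threshold used in the query string.
    critical = var.disk_usage["threshold"]
  }

  # Notify when no data has arrived for 10 minutes.
  notify_no_data    = true
  no_data_timeframe = 10
}
```

Deriving both the query and the critical threshold from the same input keeps the two values in sync, which Datadog requires for the monitor to be accepted.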

CPU Usage Monitor

  • Query: avg(last_5m):avg:aws.ec2.cpuutilization{*} by {host,env} > 85
  • Type: Query alert
  • Threshold: 85% (configurable)
  • Evaluation: Average over the last 5 minutes
  • Grouping: By host and env
  • No Data: Notifies after 10 minutes without data

Timeboard

The module creates a read-only timeboard with:

  • CPU usage graph with alert threshold marker
  • Disk usage graph with alert threshold marker
  • Alert overlay showing when thresholds are breached

Alert Message Template

The default alert footer includes notification handles for:

  • PagerDuty: @pagerduty-service_name
  • Slack: @slack-channel_name

Customize via the datadog_alert_footer variable.
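
A custom footer might look like the following argument inside the module block. This is only a sketch: service_name and channel_name are placeholders for real PagerDuty and Slack handles, and the use of Datadog's is_alert and is_recovery conditionals is an illustrative choice, not the module's default template.

```hcl
datadog_alert_footer = <<-EOT
  {{#is_alert}}@pagerduty-service_name{{/is_alert}}
  {{#is_recovery}}@pagerduty-service_name{{/is_recovery}}
  @slack-channel_name
EOT
```

With this footer, PagerDuty is paged only on alert and recovery transitions, while Slack receives every notification.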

Outputs

Currently, this module does not export any outputs.

Customization

Custom Thresholds

disk_usage = {
  query     = "max:system.disk.in_use"
  threshold = "90"  # Raise to 90%
}

cpu_usage = {
  query     = "avg:aws.ec2.cpuutilization"
  threshold = "75"  # Lower to 75%
}

Custom Grouping

trigger_by = "{host,env,service}"

Notes

  • Monitors include no-data alerting by default
  • Timeboard is read-only to prevent accidental modifications
  • Uses 5-minute evaluation windows
  • Supports HTTP client retries for reliability
  • Can be reused across multiple environments via variable configuration
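
One possible way to reuse the module across environments is a variable file per environment. The sketch below assumes a hypothetical production.tfvars and a root configuration that declares variables with these names and passes them to the module; the threshold values are examples only.

```hcl
# production.tfvars (hypothetical file name)
api_url    = "https://api.datadoghq.eu"
trigger_by = "{host,env}"

# Stricter disk threshold for production (example value)
disk_usage = {
  query     = "max:system.disk.in_use"
  threshold = "90"
}

# Example CPU threshold for production
cpu_usage = {
  query     = "avg:aws.ec2.cpuutilization"
  threshold = "80"
}
```

Such a file would be applied with terraform apply -var-file=production.tfvars, keeping the API and APP keys out of the file and supplying them separately.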

License

Internal use only - Sanoma/WeBuildYourCloud

Authors

Created and maintained by the Platform Engineering team.
