Terraform Datadog Old Monitors Module

Overview

This repository is a collection of Terraform modules, based on Claranet's Datadog monitors repository, that provides pre-configured Datadog monitors for infrastructure components: middleware, databases, cloud services, and container platforms.

Features

  • Enterprise Monitoring Templates: Production-ready monitor configurations
  • Multi-Platform Support: AWS, Azure, GCP cloud providers
  • Component Coverage: Middleware, databases, containers, networking
  • Flexible Configuration: Extensive customization options
  • Best Practices: Based on industry standards and real-world deployments

Structure

This module contains multiple sub-modules organized by component type:

terraform-datadog-old-monitors/
├── middleware/    # Nginx, Kong, Apache, PHP-FPM
├── database/      # PostgreSQL, MySQL, Redis, MongoDB, etc.
├── system/        # Generic system and unreachable monitors
├── network/       # HTTP, DNS, TLS monitoring
├── cloud/         # AWS, Azure, GCP specific monitors
│   ├── aws/       # ECS, RDS, Lambda, ALB, etc.
│   ├── azure/     # App Services, Functions, SQL, etc.
│   └── gcp/       # Compute, Cloud SQL, Pub/Sub, etc.
├── caas/          # Docker, Kubernetes monitoring
└── common/        # Shared alerting and filtering modules

Requirements

Name        Version
terraform   >= 0.12
datadog     >= 2.0
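
A version constraint block matching these requirements might look like the following sketch. It assumes Terraform 0.13 or later, where `required_providers` accepts a `source` address; on Terraform 0.12 the shorthand `datadog = ">= 2.0"` is used instead. The variable names `datadog_api_key` and `datadog_app_key` are illustrative and do not come from this repository.

```hcl
terraform {
  # The requirements allow 0.12; 0.13+ is assumed here for the source syntax.
  required_version = ">= 0.13"

  required_providers {
    datadog = {
      source  = "DataDog/datadog"
      version = ">= 2.0"
    }
  }
}

# Hypothetical variable names; supply the keys however your setup prefers.
variable "datadog_api_key" {
  type = string
}

variable "datadog_app_key" {
  type = string
}

provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
}
```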

Usage

Basic Monitor Configuration

module "nginx_monitor" {
  source = "./terraform-datadog-old-monitors/middleware/nginx"

  environment      = "production"
  message          = "Nginx issue detected @slack-channel"
  evaluation_delay = 15
  new_host_delay   = 300

  # Enable/disable specific monitors
  nginx_connect_enabled = "true"
  nginx_dropped_enabled = "true"
  
  # Customize thresholds
  nginx_dropped_connections_critical = 5
  nginx_dropped_connections_warning  = 3
}

AWS RDS Monitoring

module "rds_monitor" {
  source = "./terraform-datadog-old-monitors/cloud/aws/rds/common"

  environment = "production"
  message     = "RDS alert @pagerduty"
  
  # CPU monitoring
  cpu_enabled  = "true"
  cpu_critical = 90
  cpu_warning  = 75

  # Disk monitoring
  disk_space_enabled  = "true"
  disk_space_critical = 90
  disk_space_warning  = 80
}

Kubernetes Monitoring

module "k8s_pod_monitor" {
  source = "./terraform-datadog-old-monitors/caas/kubernetes/pod"

  environment = "production"
  message     = "Kubernetes pod issue @slack-ops"
  
  pod_crash_enabled         = "true"
  pod_not_running_enabled   = "true"
  container_restart_enabled = "true"
}

Common Variables

Most sub-modules share these common variables:

Name                      Description                                      Type    Default
environment               Architecture environment                         string  none (required)
message                   Alert message with notification channels         string  none (required)
evaluation_delay          Metric evaluation delay (seconds)                number  15
new_host_delay            Delay before monitoring new resources (seconds)  number  300
prefix_slug               Prefix for monitor names                         string  ""
notify_no_data            Alert on no data                                 bool    true
filter_tags_use_defaults  Use default filter convention                    bool    true
filter_tags_custom        Custom filter tags                               string  ""
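
Inside a sub-module, these shared inputs could be declared roughly as follows. This is a sketch reconstructed from the table alone, not copied from the module sources; the actual declarations may differ (for example, the usage examples pass the `_enabled` flags as the string "true").

```hcl
# Required inputs: no default, so the caller must set them.
variable "environment" {
  description = "Architecture environment"
  type        = string
}

variable "message" {
  description = "Alert message with notification channels"
  type        = string
}

# Optional inputs with the defaults listed in the table.
variable "evaluation_delay" {
  description = "Metric evaluation delay (seconds)"
  type        = number
  default     = 15
}

variable "new_host_delay" {
  description = "Delay before monitoring new resources (seconds)"
  type        = number
  default     = 300
}

variable "prefix_slug" {
  description = "Prefix for monitor names"
  type        = string
  default     = ""
}

variable "notify_no_data" {
  description = "Alert on no data"
  type        = bool
  default     = true
}
```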

Available Monitor Types

Middleware Monitors

  • Nginx: Connection, dropped connections, workers
  • Apache: Server status, connections
  • Kong: API gateway health and performance
  • PHP-FPM: Pool status, slow requests

Database Monitors

  • PostgreSQL: Connections, replication lag, locks
  • MySQL: Connections, slow queries, replication
  • Redis: Memory, connections, evictions
  • MongoDB: Connections, replication lag, operations
  • Elasticsearch: Cluster health, JVM heap
  • SQL Server: Connections, locks, performance

Cloud Services

AWS

  • RDS (Aurora PostgreSQL, Aurora MySQL, common)
  • EC2 / ECS (Fargate, EC2 cluster)
  • Lambda
  • ALB / ELB / NLB
  • ElastiCache (Redis, Memcached)
  • SQS
  • API Gateway
  • Elasticsearch

Azure

  • App Services
  • Functions
  • SQL Database / Elastic Pool
  • PostgreSQL
  • Storage
  • Key Vault
  • Event Hub
  • Service Bus

GCP

  • Compute Engine
  • Cloud SQL (MySQL, common)
  • Pub/Sub (topics, subscriptions)
  • Load Balancer
  • Memorystore Redis

Container Platforms

  • Docker: Container status, resource usage
  • Kubernetes:
    • Pod monitors (crash, restart, not running)
    • Node monitors (resource usage, status)
    • Cluster monitors (API server, scheduler)
    • Workload monitors (deployments, statefulsets)
    • Velero/Ark backup monitors

Network Monitors

  • HTTP: Webcheck, SSL certificate expiry
  • DNS: Query response time, availability
  • TLS: Certificate expiration

Monitor Configuration Pattern

Each monitor module follows this pattern:

module "service_monitor" {
  source = "./path/to/monitor"

  # Environment and messaging
  environment = var.environment
  message     = var.alert_message
  
  # Timing configuration
  evaluation_delay = 15
  new_host_delay   = 300
  
  # Enable/disable monitors
  monitor_name_enabled = "true"
  
  # Thresholds
  monitor_name_critical = 90
  monitor_name_warning  = 75
  
  # Filtering
  filter_tags_custom = "env:production,team:platform"
}
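
How an `_enabled` flag and a pair of thresholds might be consumed inside a monitor module can be sketched as below. The resource name, metric, and query are hypothetical, and the `monitor_thresholds` block assumes a 3.x Datadog provider (earlier 2.x releases used a `thresholds` argument instead).

```hcl
variable "environment" { type = string }
variable "message" { type = string }

variable "cpu_enabled" {
  type    = string
  default = "true"
}

variable "cpu_critical" {
  type    = number
  default = 90
}

variable "cpu_warning" {
  type    = number
  default = 75
}

# Hypothetical monitor; the real modules define their own queries.
resource "datadog_monitor" "cpu" {
  # Create the monitor only when the flag is the string "true".
  count = var.cpu_enabled == "true" ? 1 : 0

  name    = "[${var.environment}] CPU usage too high"
  type    = "metric alert"
  message = var.message

  # The value compared in the query must match the critical threshold.
  query = "avg(last_5m):avg:system.cpu.user{env:${var.environment}} by {host} > ${var.cpu_critical}"

  monitor_thresholds {
    critical = var.cpu_critical
    warning  = var.cpu_warning
  }
}
```

Gating the resource with `count` means a disabled monitor is absent from the plan entirely, so turning a flag off removes the monitor from Datadog on the next apply.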

Alerting Integration

The common/alerting-message module provides templates for:

  • PagerDuty integration
  • Slack notifications
  • Email alerts
  • Webhook notifications

Example:

module "alerting" {
  source = "./terraform-datadog-old-monitors/common/alerting-message"
  
  message_alert   = "@pagerduty-critical"
  message_warning = "@slack-warnings"
  message_nodata  = "@slack-monitoring"
}
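
The composed message would then typically be passed to a monitor module. The sketch below assumes the alerting module exposes an output named `alerting_message`; that name is a guess, so check the module's outputs.tf for the real one.

```hcl
module "alerting" {
  source = "./terraform-datadog-old-monitors/common/alerting-message"

  message_alert   = "@pagerduty-critical"
  message_warning = "@slack-warnings"
  message_nodata  = "@slack-monitoring"
}

module "nginx_monitor" {
  source = "./terraform-datadog-old-monitors/middleware/nginx"

  environment = "production"

  # Assumed output name; verify it in common/alerting-message/outputs.tf.
  message = module.alerting.alerting_message
}
```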

Filter Tags

The common/filter-tags module helps with tag-based filtering:

module "filter_tags" {
  source = "./terraform-datadog-old-monitors/common/filter-tags"
  
  environment              = "production"
  filter_tags_use_defaults = true
  filter_tags_custom       = "service:api,tier:backend"
}
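
The choice between default and custom filters can be pictured as a single conditional. This is a sketch of the idea only: the default tag string `dd_monitoring:enabled,env:<environment>` and the output name `filter_tags` are assumptions, not taken from the module source.

```hcl
variable "environment" { type = string }

variable "filter_tags_use_defaults" {
  type    = bool
  default = true
}

variable "filter_tags_custom" {
  type    = string
  default = ""
}

locals {
  # Assumed default convention; the real module may use different tags.
  default_tags = "dd_monitoring:enabled,env:${var.environment}"

  # Use the defaults unless the caller opts out, then fall back to the custom string.
  filter_tags = var.filter_tags_use_defaults ? local.default_tags : var.filter_tags_custom
}

output "filter_tags" {
  value = local.filter_tags
}
```

Under this reading, `filter_tags_custom` only takes effect when `filter_tags_use_defaults` is false. Confirm the actual behavior against the module before relying on it.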

Best Practices

  1. Start with defaults: Use default thresholds first, then customize
  2. Gradual rollout: Enable monitors incrementally
  3. Tag strategy: Use consistent tagging across infrastructure
  4. Alert fatigue: Tune thresholds to reduce false positives
  5. Documentation: Document custom threshold decisions
  6. Testing: Test monitors in non-production first
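
Points 2 and 6 can be combined by driving the enable flags and the notification target from the environment. A minimal sketch, assuming a caller-side `environment` variable and a hypothetical `@slack-monitoring-test` channel:

```hcl
variable "environment" {
  type    = string
  default = "staging"
}

locals {
  is_production = var.environment == "production"
}

module "nginx_monitor" {
  source = "./terraform-datadog-old-monitors/middleware/nginx"

  environment = var.environment

  # Notify the real channel only in production; elsewhere use a test channel.
  message = local.is_production ? "Nginx issue detected @slack-channel" : "Nginx issue detected @slack-monitoring-test"

  # Already tuned: enabled everywhere.
  nginx_connect_enabled = "true"

  # Still being tuned: enabled outside production only.
  nginx_dropped_enabled = local.is_production ? "false" : "true"
}
```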

Customization Examples

Custom Thresholds

# More aggressive CPU monitoring
cpu_critical = 85
cpu_warning  = 70

# Relaxed disk space monitoring
disk_space_critical = 95
disk_space_warning  = 90

Conditional Monitoring

# Only monitor specific services
filter_tags_custom = "service:critical-app"

# Skip new hosts for longer period
new_host_delay = 600  # 10 minutes

Custom Alert Messages

message = <<-EOT
  {{#is_alert}}
  CRITICAL: {{check}} on {{host.name}}
  @pagerduty-critical
  {{/is_alert}}
  
  {{#is_warning}}
  WARNING: {{check}} on {{host.name}}
  @slack-warnings
  {{/is_warning}}
EOT
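
The heredoc is easier to reuse when held in a local value and passed to each module. The sketch below does that; the local name `alert_message` is illustrative. Terraform only interpolates `${...}` and `%{...}` sequences, so the `{{...}}` Datadog template markers pass through unchanged.

```hcl
locals {
  # <<- strips the common leading indentation from every line.
  alert_message = <<-EOT
    {{#is_alert}}
    CRITICAL: {{check}} on {{host.name}}
    @pagerduty-critical
    {{/is_alert}}

    {{#is_warning}}
    WARNING: {{check}} on {{host.name}}
    @slack-warnings
    {{/is_warning}}
  EOT
}

module "rds_monitor" {
  source = "./terraform-datadog-old-monitors/cloud/aws/rds/common"

  environment = "production"
  message     = local.alert_message
}
```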

Maintenance

This module appears to be a legacy or archived version, as the "old-monitors" name suggests. Consider:

  • Reviewing for updates from the Claranet repository
  • Migrating to newer monitoring solutions if available
  • Documenting which monitors are actively used
  • Deprecating unused monitor configurations

Outputs

Each sub-module may export:

  • Monitor IDs
  • Monitor names
  • Alert status

Check individual module outputs.tf files for specifics.
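
Reading an exported value from the calling configuration could look like this sketch. The output name `nginx_dropped_id` is hypothetical; substitute whatever the sub-module's outputs.tf actually defines.

```hcl
module "nginx_monitor" {
  source = "./terraform-datadog-old-monitors/middleware/nginx"

  environment = "production"
  message     = "Nginx issue detected @slack-channel"
}

# Re-export the monitor ID so other configurations or tooling can reference it.
output "nginx_dropped_monitor_id" {
  # Hypothetical output name; check middleware/nginx/outputs.tf.
  value = module.nginx_monitor.nginx_dropped_id
}
```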

Notes

  • This is a comprehensive library of monitor templates
  • Based on Claranet's open-source Datadog monitors
  • Covers most common infrastructure components
  • Highly customizable with sensible defaults
  • May contain more monitors than needed for your use case
  • Review and enable only required monitors to avoid alert fatigue

Migration Path

If migrating from this module:

  1. Audit currently active monitors
  2. Document custom thresholds
  3. Test new monitoring solutions in parallel
  4. Gradually migrate monitor by monitor
  5. Keep this module for reference

License

Based on Claranet's open-source work. Internal use: Sanoma/WeBuildYourCloud

Authors

  • Original: Claranet team
  • Maintained by: Platform Engineering team