Initial commit with README and module files
parent 94ba287166
commit 372fa8fabc
0
.gitignore
vendored
Normal file → Executable file
0
.terraform.lock.hcl
generated
Normal file → Executable file
140
README.md
Normal file
@@ -0,0 +1,140 @@
# Terraform Datadog Monitors Module

## Overview

This Terraform module creates basic host metric monitors for CPU and disk usage in Datadog, along with an accompanying timeboard that visualizes them.

## Features

- **CPU Monitoring**: Track EC2 instance CPU utilization
- **Disk Monitoring**: Monitor disk usage across hosts
- **Automated Alerting**: No-data notifications included
- **Visualization**: Read-only timeboard with alert thresholds
- **Configurable Thresholds**: Customizable warning and critical levels

## Resources Created

- `datadog_monitor` (disk_usage): Metric alert for disk usage
- `datadog_monitor` (cpu_usage): Query alert for CPU usage
- `datadog_timeboard` (host_metrics): Read-only visualization dashboard

## Requirements

| Name | Version |
|------|---------|
| terraform | >= 0.12 |
| datadog | >= 3.2.0 |
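
The constraints above could be pinned in configuration roughly as follows. This is a minimal sketch, not the module's actual `provider.tf`: it assumes Terraform 0.13 or later (for the `source` attribute) and the public `DataDog/datadog` registry source, neither of which is stated in this README.

```hcl
terraform {
  # Assumption: 0.13+ so that the `source` attribute below is honored.
  # The README itself only requires >= 0.12.
  required_version = ">= 0.13"

  required_providers {
    datadog = {
      # Assumption: provider installed from the public Terraform Registry.
      source  = "DataDog/datadog"
      version = ">= 3.2.0"
    }
  }
}
```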

## Usage

```hcl
module "datadog_monitors" {
  source = "./terraform-datadog-monitors"

  datadog_api_key = var.datadog_api_key
  datadog_app_key = var.datadog_app_key
  api_url         = "https://api.datadoghq.eu"

  disk_usage = {
    query     = "max:system.disk.in_use"
    threshold = "85"
  }

  cpu_usage = {
    query     = "avg:aws.ec2.cpuutilization"
    threshold = "85"
  }
}
```

## Inputs

| Name | Description | Type | Required | Default |
|------|-------------|------|----------|---------|
| `datadog_api_key` | Datadog API key | `string` | yes | - |
| `datadog_app_key` | Datadog APP key | `string` | yes | - |
| `api_url` | API endpoint | `string` | no | `"https://api.datadoghq.eu"` |
| `http_client_retry_enabled` | Enable request retries (429, 5xx) | `bool` | no | `true` |
| `http_client_retry_timeout` | HTTP retry timeout | `string` | no | `""` |
| `validate` | Validate API/APP keys on init | `bool` | no | `true` |
| `disk_usage` | Query and threshold for disk monitor | `map` | no | See default |
| `cpu_usage` | Query and threshold for CPU monitor | `map` | no | See default |
| `datadog_alert_footer` | Alert message footer | `string` | no | PagerDuty + Slack template |
| `trigger_by` | Grouping for alerts | `string` | no | `"{host,env}"` |
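
To make the shape of the map inputs concrete, the declarations behind two of these rows might look like the sketch below. These are hypothetical declarations inferred from the table and the Usage example. The module's real `variables.tf` may differ, and `map(string)` is an assumption (the table only says `map`).

```hcl
# Hypothetical declarations inferred from the Inputs table.
variable "disk_usage" {
  description = "Query and threshold for disk monitor"
  type        = map(string) # assumption: both values are strings

  default = {
    query     = "max:system.disk.in_use"
    threshold = "85"
  }
}

variable "trigger_by" {
  description = "Grouping for alerts"
  type        = string
  default     = "{host,env}"
}
```

Because `disk_usage` and `cpu_usage` are passed as whole maps, overriding one key in a module call means supplying both `query` and `threshold`.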

## Monitor Configuration

### Disk Usage Monitor

- **Query**: `avg(last_5m):max:system.disk.in_use{*} by {host,env} * 100 > 85`
- **Type**: Metric alert
- **Threshold**: 85% (configurable)
- **Evaluation**: Average over the last 5 minutes
- **Grouping**: By host and env
- **No Data**: Notifies after 10 minutes without data
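
The settings above could be expressed as a `datadog_monitor` resource along the following lines. This is a sketch under stated assumptions, not the module's actual source: the monitor name and message text are hypothetical, and only the query parts, type, and no-data settings come from this README.

```hcl
# Sketch only: name and message are hypothetical placeholders.
resource "datadog_monitor" "disk_usage" {
  name    = "Disk usage is high on {{host.name}}"
  type    = "metric alert"
  message = "Disk usage exceeded ${var.disk_usage.threshold}% on {{host.name}}.\n${var.datadog_alert_footer}"

  # With the defaults this renders to:
  # avg(last_5m):max:system.disk.in_use{*} by {host,env} * 100 > 85
  query = "avg(last_5m):${var.disk_usage.query}{*} by ${var.trigger_by} * 100 > ${var.disk_usage.threshold}"

  # The critical threshold must match the value used in the query.
  monitor_thresholds {
    critical = var.disk_usage.threshold
  }

  # Notify when no data has arrived for 10 minutes.
  notify_no_data    = true
  no_data_timeframe = 10
}
```

The `* 100` in the query converts `system.disk.in_use`, which Datadog reports as a fraction between 0 and 1, into a percentage so it can be compared against the 85 threshold.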

### CPU Usage Monitor

- **Query**: `avg(last_5m):avg:aws.ec2.cpuutilization{*} by {host,env} > 85`
- **Type**: Query alert
- **Threshold**: 85% (configurable)
- **Evaluation**: Average over the last 5 minutes
- **Grouping**: By host and env
- **No Data**: Notifies after 10 minutes without data
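
As an illustration of how this query can be assembled from the inputs, the sketch below builds it in a `locals` block and passes it to the monitor. The local name `cpu_query` and the monitor name are hypothetical, and the module's real implementation may be structured differently.

```hcl
locals {
  # With the defaults this renders to:
  # avg(last_5m):avg:aws.ec2.cpuutilization{*} by {host,env} > 85
  cpu_query = "avg(last_5m):${var.cpu_usage.query}{*} by ${var.trigger_by} > ${var.cpu_usage.threshold}"
}

# Sketch only: name is a hypothetical placeholder.
resource "datadog_monitor" "cpu_usage" {
  name    = "CPU usage is high on {{host.name}}"
  type    = "query alert"
  message = var.datadog_alert_footer
  query   = local.cpu_query

  # The critical threshold must match the value used in the query.
  monitor_thresholds {
    critical = var.cpu_usage.threshold
  }

  # Notify when no data has arrived for 10 minutes.
  notify_no_data    = true
  no_data_timeframe = 10
}
```

Unlike the disk query, no `* 100` is needed here because `aws.ec2.cpuutilization` is already reported as a percentage.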

## Timeboard

The module creates a read-only timeboard with:
- CPU usage graph with alert threshold marker
- Disk usage graph with alert threshold marker
- Alert overlay showing when thresholds are breached

## Alert Message Template

The default alert footer includes notification handles for:
- PagerDuty: `@pagerduty-service_name`
- Slack: `@slack-channel_name`

Customize via the `datadog_alert_footer` variable.
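
For example, the footer could be overridden in the module call with a heredoc. This is a minimal sketch: the handles `platform-oncall` and `platform-alerts` are hypothetical and should be replaced with a real PagerDuty service and Slack channel configured in Datadog.

```hcl
module "datadog_monitors" {
  source = "./terraform-datadog-monitors"

  datadog_api_key = var.datadog_api_key
  datadog_app_key = var.datadog_app_key

  # Hypothetical handles: substitute your own service and channel names.
  datadog_alert_footer = <<-EOT
    Notify: @pagerduty-platform-oncall @slack-platform-alerts
  EOT
}
```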

## Outputs

Currently, this module does not export any outputs.
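
If other configuration ever needs to reference the monitors, outputs along these lines could be added. This is a sketch only, not part of the module: it assumes the resource names `disk_usage` and `cpu_usage` listed under Resources Created, and the output names are hypothetical.

```hcl
# Hypothetical outputs; the module does not define these today.
output "disk_usage_monitor_id" {
  description = "ID of the disk usage monitor"
  value       = datadog_monitor.disk_usage.id
}

output "cpu_usage_monitor_id" {
  description = "ID of the CPU usage monitor"
  value       = datadog_monitor.cpu_usage.id
}
```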

## Customization

### Custom Thresholds

```hcl
disk_usage = {
  query     = "max:system.disk.in_use"
  threshold = "90" # Raise to 90%
}

cpu_usage = {
  query     = "avg:aws.ec2.cpuutilization"
  threshold = "75" # Lower to 75%
}
```

### Custom Grouping

```hcl
trigger_by = "{host,env,service}"
```

## Notes

- Monitors include no-data alerting by default
- Timeboard is read-only to prevent accidental modifications
- Monitors use 5-minute evaluation windows
- The provider supports HTTP client retries for reliability
- The module can be reused across multiple environments via variable configuration

## License

Internal use only - Sanoma/WeBuildYourCloud

## Authors

Created and maintained by the Platform Engineering team.
0
provider.tf
Normal file → Executable file
0
variables.tf
Normal file → Executable file