Initial commit with README and module files
This commit is contained in:
parent
94ba287166
commit
372fa8fabc
0
.gitignore
vendored
Normal file → Executable file
0
.terraform.lock.hcl
generated
Normal file → Executable file
140
README.md
Normal file
@ -0,0 +1,140 @@
# Terraform Datadog Monitors Module

## Overview

This Terraform module creates basic host metric monitors for CPU and disk usage, along with an accompanying Datadog timeboard that visualizes them.

## Features

- **CPU Monitoring**: Track EC2 instance CPU utilization
- **Disk Monitoring**: Monitor disk usage across hosts
- **Automated Alerting**: No-data notifications included
- **Visualization**: Read-only timeboard with alert thresholds
- **Configurable Thresholds**: Customizable warning and critical levels

## Resources Created

- `datadog_monitor` (disk_usage): Metric alert for disk usage
- `datadog_monitor` (cpu_usage): Query alert for CPU usage
- `datadog_timeboard` (host_metrics): Read-only visualization dashboard

## Requirements

| Name | Version |
|------|---------|
| terraform | >= 0.12 |
| datadog | >= 3.2.0 |
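
As a sketch, the constraints in the table could be pinned in the calling configuration like this. `DataDog/datadog` is the provider's registry address. The block form with `source` assumes Terraform 0.13 or later, which is stricter than the `>= 0.12` listed above, so treat the `required_version` line as an assumption rather than the module's own setting.

```hcl
terraform {
  # Assumption: Terraform 0.13+ so that the `source` attribute is accepted.
  required_version = ">= 0.13"

  required_providers {
    datadog = {
      source  = "DataDog/datadog"
      version = ">= 3.2.0"
    }
  }
}
```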

## Usage

```hcl
module "datadog_monitors" {
  source = "./terraform-datadog-monitors"

  datadog_api_key = var.datadog_api_key
  datadog_app_key = var.datadog_app_key
  api_url         = "https://api.datadoghq.eu"

  disk_usage = {
    query     = "max:system.disk.in_use"
    threshold = "85"
  }

  cpu_usage = {
    query     = "avg:aws.ec2.cpuutilization"
    threshold = "85"
  }
}
```
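
The example above reads both keys from `var.datadog_api_key` and `var.datadog_app_key`, so the calling configuration has to declare those variables itself. A minimal sketch, with illustrative descriptions:

```hcl
# Declared in the root module that calls "datadog_monitors".
variable "datadog_api_key" {
  type        = string
  description = "Datadog API key, passed through to the monitors module"
}

variable "datadog_app_key" {
  type        = string
  description = "Datadog application key, passed through to the monitors module"
}
```

The values can then be supplied through the `TF_VAR_datadog_api_key` and `TF_VAR_datadog_app_key` environment variables instead of a committed `.tfvars` file.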

## Inputs

| Name | Description | Type | Required | Default |
|------|-------------|------|----------|---------|
| `datadog_api_key` | Datadog API key | `string` | yes | - |
| `datadog_app_key` | Datadog APP key | `string` | yes | - |
| `api_url` | API endpoint | `string` | no | `"https://api.datadoghq.eu"` |
| `http_client_retry_enabled` | Enable request retries (429, 5xx) | `bool` | no | `true` |
| `http_client_retry_timeout` | HTTP retry timeout | `string` | no | `""` |
| `validate` | Validate API/APP keys on init | `bool` | no | `true` |
| `disk_usage` | Query and threshold for disk monitor | `map` | no | See default |
| `cpu_usage` | Query and threshold for CPU monitor | `map` | no | See default |
| `datadog_alert_footer` | Alert message footer | `string` | no | PagerDuty + Slack template |
| `trigger_by` | Grouping for alerts | `string` | no | `"{host,env}"` |
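
The connection-related inputs share their names with arguments of the Datadog provider. The following is a sketch of how `provider.tf` might wire them up. It is an assumption about the module's internals, whose source is not shown here. `http_client_retry_timeout` is left out because its documented default is an empty string.

```hcl
provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
  api_url = var.api_url

  # Retry requests that fail with 429 or 5xx responses.
  http_client_retry_enabled = var.http_client_retry_enabled

  # Check the API and application keys when the provider initializes.
  validate = var.validate
}
```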

## Monitor Configuration

### Disk Usage Monitor

- **Query**: `avg(last_5m):max:system.disk.in_use{*} by {host,env} * 100 > 85`
- **Type**: Metric alert
- **Threshold**: 85% (configurable)
- **Evaluation**: Last 5 minutes average
- **Grouping**: By host and env
- **No Data**: Notifies after 10 minutes
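
The settings listed for the disk monitor could be expressed roughly as the resource below. This is a sketch that assumes the query is assembled from the `disk_usage` and `trigger_by` inputs. The `name` and `message` texts are illustrative and not taken from the module.

```hcl
resource "datadog_monitor" "disk_usage" {
  # Name and message text are illustrative assumptions.
  name    = "Disk usage is high on {{host.name}}"
  type    = "metric alert"
  message = "Disk usage is above ${var.disk_usage["threshold"]} percent.\n${var.datadog_alert_footer}"

  # With the documented values this renders to:
  # avg(last_5m):max:system.disk.in_use{*} by {host,env} * 100 > 85
  query = "avg(last_5m):${var.disk_usage["query"]}{*} by ${var.trigger_by} * 100 > ${var.disk_usage["threshold"]}"

  # The critical threshold has to match the value used in the query.
  monitor_thresholds {
    critical = var.disk_usage["threshold"]
  }

  # Notify when no data has arrived for 10 minutes.
  notify_no_data    = true
  no_data_timeframe = 10
}
```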

### CPU Usage Monitor

- **Query**: `avg(last_5m):avg:aws.ec2.cpuutilization{*} by {host,env} > 85`
- **Type**: Query alert
- **Threshold**: 85% (configurable)
- **Evaluation**: Last 5 minutes average
- **Grouping**: By host and env
- **No Data**: Notifies after 10 minutes

## Timeboard

The module creates a read-only timeboard with:
- CPU usage graph with alert threshold marker
- Disk usage graph with alert threshold marker
- Alert overlay showing when thresholds are breached

## Alert Message Template

The default alert footer includes notification handles for:
- PagerDuty: `@pagerduty-service_name`
- Slack: `@slack-channel_name`

Customize it via the `datadog_alert_footer` variable.
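
As an illustration, a custom footer could be passed inside the `module` block as a heredoc. The handles `platform-oncall` and `platform-alerts` are hypothetical placeholders. `{{#is_alert}}` is Datadog's template conditional, used here so that PagerDuty is only paged on alerts.

```hcl
# Fragment: goes inside the module "datadog_monitors" block.
datadog_alert_footer = <<-EOT
  {{#is_alert}}
  @pagerduty-platform-oncall
  {{/is_alert}}
  @slack-platform-alerts
EOT
```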

## Outputs

Currently, this module does not export any outputs.
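
If callers need the monitor IDs, for example to reference them from other dashboards, outputs along these lines could be added to the module. This is a sketch only. The output names are hypothetical, and the resource addresses assume the names listed under Resources Created.

```hcl
output "disk_usage_monitor_id" {
  description = "ID of the disk usage monitor"
  value       = datadog_monitor.disk_usage.id
}

output "cpu_usage_monitor_id" {
  description = "ID of the CPU usage monitor"
  value       = datadog_monitor.cpu_usage.id
}
```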

## Customization

### Custom Thresholds

```hcl
disk_usage = {
  query     = "max:system.disk.in_use"
  threshold = "90" # Raise to 90%
}

cpu_usage = {
  query     = "avg:aws.ec2.cpuutilization"
  threshold = "75" # Lower to 75%
}
```

### Custom Grouping

```hcl
trigger_by = "{host,env,service}"
```

## Notes

- Monitors include no-data alerting by default
- Timeboard is read-only to prevent accidental modifications
- Uses 5-minute evaluation windows
- Supports HTTP client retries for reliability
- Can be reused across multiple environments via variable configuration
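
Reuse across environments might look like the sketch below. It assumes each environment reports to its own Datadog organization with its own keys. The documented queries use the `{*}` scope, so two instances pointed at the same organization would create overlapping monitors. The `staging_*` and `production_*` variable names are hypothetical.

```hcl
module "monitors_staging" {
  source = "./terraform-datadog-monitors"

  datadog_api_key = var.staging_datadog_api_key
  datadog_app_key = var.staging_datadog_app_key

  # Looser CPU threshold for staging.
  cpu_usage = {
    query     = "avg:aws.ec2.cpuutilization"
    threshold = "90"
  }
}

module "monitors_production" {
  source = "./terraform-datadog-monitors"

  datadog_api_key = var.production_datadog_api_key
  datadog_app_key = var.production_datadog_app_key

  # Stricter CPU threshold for production.
  cpu_usage = {
    query     = "avg:aws.ec2.cpuutilization"
    threshold = "75"
  }
}
```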

## License

Internal use only - Sanoma/WeBuildYourCloud

## Authors

Created and maintained by the Platform Engineering team.
0
provider.tf
Normal file → Executable file
0
variables.tf
Normal file → Executable file