# Terraform Datadog Old Monitors Module

## Overview

This repository is a collection of Terraform modules, based on Claranet's Datadog monitors repository, that define pre-configured Datadog monitors for infrastructure components including middleware, databases, cloud services, and container platforms.

## Features

- **Enterprise Monitoring Templates**: Production-ready monitor configurations
- **Multi-Platform Support**: AWS, Azure, GCP cloud providers
- **Component Coverage**: Middleware, databases, containers, networking
- **Flexible Configuration**: Extensive customization options
- **Best Practices**: Based on industry standards and real-world deployments

## Structure

This module contains multiple sub-modules organized by component type:

```
terraform-datadog-old-monitors/
├── middleware/     # Nginx, Kong, Apache, PHP-FPM
├── database/       # PostgreSQL, MySQL, Redis, MongoDB, etc.
├── system/         # Generic system and unreachable monitors
├── network/        # HTTP, DNS, TLS monitoring
├── cloud/          # AWS, Azure, GCP specific monitors
│   ├── aws/        # ECS, RDS, Lambda, ALB, etc.
│   ├── azure/      # App Services, Functions, SQL, etc.
│   └── gcp/        # Compute, Cloud SQL, Pub/Sub, etc.
├── caas/           # Docker, Kubernetes monitoring
└── common/         # Shared alerting and filtering modules
```

## Requirements

| Name | Version |
|------|---------|
| terraform | >= 0.12 |
| datadog | >= 2.0 |

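
The version constraints above could be pinned in the root configuration roughly as follows. This is a minimal sketch that assumes Terraform 0.13 or later (the `source` attribute in `required_providers` is not part of the original 0.12 syntax); the variable names `datadog_api_key` and `datadog_app_key` are hypothetical and not defined by this module.

```hcl
terraform {
  required_version = ">= 0.12"

  required_providers {
    datadog = {
      # Registry address of the Datadog provider (needs Terraform 0.13+)
      source  = "DataDog/datadog"
      version = ">= 2.0"
    }
  }
}

# Hypothetical variables holding the Datadog credentials
variable "datadog_api_key" {
  type = string
}

variable "datadog_app_key" {
  type = string
}

provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
}
```

On Terraform 0.12 itself, the older form `required_providers { datadog = ">= 2.0" }` would be the closest equivalent.
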
## Usage

### Basic Monitor Configuration

```hcl
module "nginx_monitor" {
  source = "./terraform-datadog-old-monitors/middleware/nginx"

  environment      = "production"
  message          = "Nginx issue detected @slack-channel"
  evaluation_delay = 15
  new_host_delay   = 300

  # Enable/disable specific monitors
  nginx_connect_enabled = "true"
  nginx_dropped_enabled = "true"

  # Customize thresholds
  nginx_dropped_connections_critical = 5
  nginx_dropped_connections_warning  = 3
}
```

### AWS RDS Monitoring

```hcl
module "rds_monitor" {
  source = "./terraform-datadog-old-monitors/cloud/aws/rds/common"

  environment = "production"
  message     = "RDS alert @pagerduty"

  # CPU monitoring
  cpu_enabled  = "true"
  cpu_critical = 90
  cpu_warning  = 75

  # Disk monitoring
  disk_space_enabled  = "true"
  disk_space_critical = 90
  disk_space_warning  = 80
}
```

### Kubernetes Monitoring

```hcl
module "k8s_pod_monitor" {
  source = "./terraform-datadog-old-monitors/caas/kubernetes/pod"

  environment = "production"
  message     = "Kubernetes pod issue @slack-ops"

  pod_crash_enabled         = "true"
  pod_not_running_enabled   = "true"
  container_restart_enabled = "true"
}
```

## Common Variables

Most sub-modules share these common variables:

| Name | Description | Type | Default |
|------|-------------|------|---------|
| `environment` | Architecture environment | `string` | Required |
| `message` | Alert message with notification channels | `string` | Required |
| `evaluation_delay` | Metric evaluation delay (seconds) | `number` | `15` |
| `new_host_delay` | Delay before monitoring new resources (seconds) | `number` | `300` |
| `prefix_slug` | Prefix for monitor names | `string` | `""` |
| `notify_no_data` | Alert on no data | `bool` | `true` |
| `filter_tags_use_defaults` | Use default filter convention | `bool` | `true` |
| `filter_tags_custom` | Custom filter tags | `string` | `""` |

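
As an illustration of the less commonly set variables in the table, the sketch below overrides the naming prefix, no-data alerting, and tag filtering on the Nginx sub-module. The values are illustrative, and the assumption that `filter_tags_custom` only takes effect when `filter_tags_use_defaults` is `false` should be checked against the sub-module's `variables.tf`.

```hcl
module "nginx_monitor_staging" {
  source = "./terraform-datadog-old-monitors/middleware/nginx"

  environment = "staging"
  message     = "Nginx issue detected @slack-channel"

  # Prefix prepended to monitor names (illustrative value)
  prefix_slug = "[staging]"

  # Do not alert when the metric stops reporting
  notify_no_data = false

  # Assumption: custom tags replace the default filter convention
  filter_tags_use_defaults = false
  filter_tags_custom       = "service:api,tier:backend"
}
```
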
## Available Monitor Types

### Middleware Monitors

- **Nginx**: Connection, dropped connections, workers
- **Apache**: Server status, connections
- **Kong**: API gateway health and performance
- **PHP-FPM**: Pool status, slow requests

### Database Monitors

- **PostgreSQL**: Connections, replication lag, locks
- **MySQL**: Connections, slow queries, replication
- **Redis**: Memory, connections, evictions
- **MongoDB**: Connections, replication lag, operations
- **Elasticsearch**: Cluster health, JVM heap
- **SQL Server**: Connections, locks, performance

### Cloud Services

#### AWS
- RDS (Aurora PostgreSQL, Aurora MySQL, common)
- EC2 / ECS (Fargate, EC2 cluster)
- Lambda
- ALB / ELB / NLB
- ElastiCache (Redis, Memcached)
- SQS
- API Gateway
- Elasticsearch

#### Azure
- App Services
- Functions
- SQL Database / Elastic Pool
- PostgreSQL
- Storage
- Key Vault
- Event Hub
- Service Bus

#### GCP
- Compute Engine
- Cloud SQL (MySQL, common)
- Pub/Sub (topics, subscriptions)
- Load Balancer
- Memorystore Redis

### Container Platforms

- **Docker**: Container status, resource usage
- **Kubernetes**:
  - Pod monitors (crash, restart, not running)
  - Node monitors (resource usage, status)
  - Cluster monitors (API server, scheduler)
  - Workload monitors (deployments, statefulsets)
  - Velero/Ark backup monitors

### Network Monitors

- **HTTP**: Webcheck, SSL certificate expiry
- **DNS**: Query response time, availability
- **TLS**: Certificate expiration

## Monitor Configuration Pattern

Each monitor module follows this pattern:

```hcl
module "service_monitor" {
  source = "./path/to/monitor"

  # Environment and messaging
  environment = var.environment
  message     = var.alert_message

  # Timing configuration
  evaluation_delay = 15
  new_host_delay   = 300

  # Enable/disable monitors
  monitor_name_enabled = "true"

  # Thresholds
  monitor_name_critical = 90
  monitor_name_warning  = 75

  # Filtering
  filter_tags_custom = "env:production,team:platform"
}
```

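
The pattern above reads `var.environment` and `var.alert_message` from the calling configuration. A minimal sketch of how those two input variables might be declared in the root module follows; the descriptions are illustrative and not taken from this repository.

```hcl
# Declared in the root module that calls the monitor sub-modules
variable "environment" {
  type        = string
  description = "Environment name passed to every monitor module, for example production"
}

variable "alert_message" {
  type        = string
  description = "Alert text including notification handles, for example @slack-ops"
}
```

Declaring them once at the root keeps the environment and notification targets consistent across every monitor module that is instantiated.
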
## Alerting Integration

The `common/alerting-message` module provides templates for:
- PagerDuty integration
- Slack notifications
- Email alerts
- Webhook notifications

Example:
```hcl
module "alerting" {
  source = "./terraform-datadog-old-monitors/common/alerting-message"

  message_alert   = "@pagerduty-critical"
  message_warning = "@slack-warnings"
  message_nodata  = "@slack-monitoring"
}
```

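
The rendered message would typically be passed to a monitor module through its `message` variable. The sketch below assumes the module exposes an output named `alerting-message`, as the upstream Claranet module does; confirm the name in `common/alerting-message/outputs.tf` before relying on it.

```hcl
module "alerting" {
  source = "./terraform-datadog-old-monitors/common/alerting-message"

  message_alert   = "@pagerduty-critical"
  message_warning = "@slack-warnings"
  message_nodata  = "@slack-monitoring"
}

module "nginx_monitor" {
  source = "./terraform-datadog-old-monitors/middleware/nginx"

  environment = "production"

  # Assumed output name; check outputs.tf of the alerting-message module
  message = module.alerting.alerting-message
}
```
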
## Filter Tags

The `common/filter-tags` module helps with tag-based filtering:

```hcl
module "filter_tags" {
  source = "./terraform-datadog-old-monitors/common/filter-tags"

  environment              = "production"
  filter_tags_use_defaults = true
  filter_tags_custom       = "service:api,tier:backend"
}
```

## Best Practices

1. **Start with defaults**: Use default thresholds first, then customize
2. **Gradual rollout**: Enable monitors incrementally
3. **Tag strategy**: Use consistent tagging across infrastructure
4. **Alert fatigue**: Tune thresholds to reduce false positives
5. **Documentation**: Document custom threshold decisions
6. **Testing**: Test monitors in non-production first

## Customization Examples

### Custom Thresholds

```hcl
# More aggressive CPU monitoring
cpu_critical = 85
cpu_warning  = 70

# Relaxed disk space monitoring
disk_space_critical = 95
disk_space_warning  = 90
```

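
When the same sub-module is deployed to several environments, the thresholds can be looked up per environment instead of being hard-coded. The following is a sketch under the assumption that `var.environment` and `var.alert_message` are supplied by the caller; the `cpu_thresholds` map and its staging values are illustrative.

```hcl
variable "environment" {
  type = string
}

variable "alert_message" {
  type = string
}

locals {
  # Illustrative per-environment CPU thresholds
  cpu_thresholds = {
    production = { critical = 85, warning = 70 }
    staging    = { critical = 95, warning = 90 }
  }
}

module "rds_monitor" {
  source = "./terraform-datadog-old-monitors/cloud/aws/rds/common"

  environment = var.environment
  message     = var.alert_message

  # Fails at plan time if the environment has no entry in the map
  cpu_critical = local.cpu_thresholds[var.environment].critical
  cpu_warning  = local.cpu_thresholds[var.environment].warning
}
```

Keeping the thresholds in one map also gives a single place to document why a given environment deviates from the defaults.
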
### Conditional Monitoring

```hcl
# Only monitor specific services
filter_tags_custom = "service:critical-app"

# Skip new hosts for a longer period
new_host_delay = 600 # 10 minutes
```

### Custom Alert Messages

```hcl
message = <<-EOT
{{#is_alert}}
CRITICAL: {{check}} on {{host.name}}
@pagerduty-critical
{{/is_alert}}

{{#is_warning}}
WARNING: {{check}} on {{host.name}}
@slack-warnings
{{/is_warning}}
EOT
```

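
The same template approach can cover no-data and recovery notifications using Datadog's `is_no_data` and `is_recovery` conditionals. This is a sketch only: the local name `alert_message` is hypothetical, and the routing of each state to a particular handle is an assumption.

```hcl
locals {
  # Hypothetical local holding a message with extra notification states
  alert_message = <<-EOT
    {{#is_alert}}
    CRITICAL on {{host.name}} @pagerduty-critical
    {{/is_alert}}

    {{#is_no_data}}
    No data received from {{host.name}} @slack-monitoring
    {{/is_no_data}}

    {{#is_recovery}}
    Recovered on {{host.name}} @slack-warnings
    {{/is_recovery}}
  EOT
}

module "nginx_monitor" {
  source = "./terraform-datadog-old-monitors/middleware/nginx"

  environment = "production"
  message     = local.alert_message
}
```
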
## Maintenance

This module appears to be a legacy or archived version, hence the "old-monitors" name. Consider:
- Reviewing the Claranet repository for updates
- Migrating to newer monitoring solutions if available
- Documenting which monitors are actively used
- Deprecating unused monitor configurations

## Outputs

Each sub-module may export:
- Monitor IDs
- Monitor names
- Alert status

Check each sub-module's `outputs.tf` file for specifics.

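
Sub-module outputs are read with the usual `module.<name>.<output>` syntax. In the sketch below, `nginx_connect_id` is a hypothetical output name used only for illustration; substitute the name actually declared in the sub-module's `outputs.tf`.

```hcl
module "nginx_monitor" {
  source = "./terraform-datadog-old-monitors/middleware/nginx"

  environment = "production"
  message     = "Nginx issue detected @slack-channel"
}

# Re-export a monitor ID from the root module
output "nginx_connect_monitor_id" {
  description = "ID of the Nginx connection monitor"

  # Hypothetical output name; verify against the sub-module's outputs.tf
  value = module.nginx_monitor.nginx_connect_id
}
```
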
## Notes

- This is a comprehensive library of monitor templates
- Based on Claranet's open-source Datadog monitors
- Covers most common infrastructure components
- Highly customizable with sensible defaults
- May contain more monitors than needed for your use case
- Review and enable only required monitors to avoid alert fatigue

## Migration Path

If migrating from this module:
1. Audit currently active monitors
2. Document custom thresholds
3. Test new monitoring solutions in parallel
4. Migrate gradually, one monitor at a time
5. Keep this module for reference

## Resources

- Original Claranet repository: [terraform-datadog-monitors](https://github.com/claranet/terraform-datadog-monitors)
- Datadog monitor documentation: [Datadog Monitors](https://docs.datadoghq.com/monitors/)

## License

Based on Claranet's open-source work.
Internal use: Sanoma/WeBuildYourCloud

## Authors

- Original: Claranet team
- Maintained by: Platform Engineering team