Patrick de Ruiter 553f03285a
Initial commit with README and module files
2025-11-01 10:43:48 +01:00


# Terraform Datadog Old Monitors Module
## Overview
This repository is a collection of Terraform modules, based on Claranet's Datadog monitors repository, that provide pre-configured Datadog monitors for middleware, databases, cloud services, and container platforms.
## Features
- **Enterprise Monitoring Templates**: Production-ready monitor configurations
- **Multi-Platform Support**: AWS, Azure, GCP cloud providers
- **Component Coverage**: Middleware, databases, containers, networking
- **Flexible Configuration**: Extensive customization options
- **Best Practices**: Based on industry standards and real-world deployments
## Structure
This module contains multiple sub-modules organized by component type:
```
terraform-datadog-old-monitors/
├── middleware/ # Nginx, Kong, Apache, PHP-FPM
├── database/ # PostgreSQL, MySQL, Redis, MongoDB, etc.
├── system/ # Generic system and unreachable monitors
├── network/ # HTTP, DNS, TLS monitoring
├── cloud/ # AWS, Azure, GCP specific monitors
│ ├── aws/ # ECS, RDS, Lambda, ALB, etc.
│ ├── azure/ # App Services, Functions, SQL, etc.
│ └── gcp/ # Compute, Cloud SQL, Pub/Sub, etc.
├── caas/ # Docker, Kubernetes monitoring
└── common/ # Shared alerting and filtering modules
```
## Requirements
| Name | Version |
|------|---------|
| terraform | >= 0.12 |
| datadog | >= 2.0 |
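
The constraints above could be pinned in the root configuration roughly as follows. This is a minimal sketch, not taken from the module itself: it assumes Terraform 0.13 or later (required for the `source` syntax in `required_providers`), and the variable names `datadog_api_key` and `datadog_app_key` are illustrative.

```hcl
terraform {
  # The module states >= 0.12; the provider source syntax below needs 0.13+.
  required_version = ">= 0.13"

  required_providers {
    datadog = {
      source  = "DataDog/datadog"
      version = ">= 2.0"
    }
  }
}

# Illustrative variable names (assumptions), supplied via tfvars or environment.
variable "datadog_api_key" {
  type = string
}

variable "datadog_app_key" {
  type = string
}

provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
}
```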
## Usage
### Basic Monitor Configuration
```hcl
module "nginx_monitor" {
  source      = "./terraform-datadog-old-monitors/middleware/nginx"
  environment = "production"
  message     = "Nginx issue detected @slack-channel"

  evaluation_delay = 15
  new_host_delay   = 300

  # Enable/disable specific monitors
  nginx_connect_enabled = "true"
  nginx_dropped_enabled = "true"

  # Customize thresholds
  nginx_dropped_connections_critical = 5
  nginx_dropped_connections_warning  = 3
}
```
### AWS RDS Monitoring
```hcl
module "rds_monitor" {
  source      = "./terraform-datadog-old-monitors/cloud/aws/rds/common"
  environment = "production"
  message     = "RDS alert @pagerduty"

  # CPU monitoring
  cpu_enabled  = "true"
  cpu_critical = 90
  cpu_warning  = 75

  # Disk monitoring
  disk_space_enabled  = "true"
  disk_space_critical = 90
  disk_space_warning  = 80
}
```
### Kubernetes Monitoring
```hcl
module "k8s_pod_monitor" {
  source      = "./terraform-datadog-old-monitors/caas/kubernetes/pod"
  environment = "production"
  message     = "Kubernetes pod issue @slack-ops"

  pod_crash_enabled         = "true"
  pod_not_running_enabled   = "true"
  container_restart_enabled = "true"
}
```
## Common Variables
Most sub-modules share these common variables:
| Name | Description | Type | Default |
|------|-------------|------|---------|
| `environment` | Deployment environment (e.g. `production`) | `string` | Required |
| `message` | Alert message with notification channels | `string` | Required |
| `evaluation_delay` | Metric evaluation delay (seconds) | `number` | `15` |
| `new_host_delay` | Delay before monitoring new resources (seconds) | `number` | `300` |
| `prefix_slug` | Prefix for monitor names | `string` | `""` |
| `notify_no_data` | Alert on no data | `bool` | `true` |
| `filter_tags_use_defaults` | Use default filter convention | `bool` | `true` |
| `filter_tags_custom` | Custom filter tags | `string` | `""` |
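
As an illustration of how the shared variables fit together, the sketch below sets every variable from the table in a single module call. The `database/postgresql` path is an assumption based on the directory layout, and not every sub-module necessarily accepts all of these variables, so check the sub-module's `variables.tf` first.

```hcl
module "postgresql_monitor" {
  # Assumed path: the exact sub-directory name is not confirmed by this README.
  source = "./terraform-datadog-old-monitors/database/postgresql"

  # Required variables
  environment = "staging"
  message     = "PostgreSQL issue detected @slack-channel"

  # Timing (seconds); these match the documented defaults
  evaluation_delay = 15
  new_host_delay   = 300

  # Naming and no-data behavior, overriding the documented defaults
  prefix_slug    = "staging"
  notify_no_data = false

  # Filtering
  filter_tags_use_defaults = true
  filter_tags_custom       = "service:api,tier:backend"
}
```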
## Available Monitor Types
### Middleware Monitors
- **Nginx**: Connection, dropped connections, workers
- **Apache**: Server status, connections
- **Kong**: API gateway health and performance
- **PHP-FPM**: Pool status, slow requests
### Database Monitors
- **PostgreSQL**: Connections, replication lag, locks
- **MySQL**: Connections, slow queries, replication
- **Redis**: Memory, connections, evictions
- **MongoDB**: Connections, replication lag, operations
- **Elasticsearch**: Cluster health, JVM heap
- **SQL Server**: Connections, locks, performance
### Cloud Services
#### AWS
- RDS (Aurora PostgreSQL, Aurora MySQL, common)
- EC2 / ECS (Fargate, EC2 cluster)
- Lambda
- ALB / ELB / NLB
- ElastiCache (Redis, Memcached)
- SQS
- API Gateway
- Elasticsearch
#### Azure
- App Services
- Functions
- SQL Database / Elastic Pool
- PostgreSQL
- Storage
- Key Vault
- Event Hub
- Service Bus
#### GCP
- Compute Engine
- Cloud SQL (MySQL, common)
- Pub/Sub (topics, subscriptions)
- Load Balancer
- Memorystore Redis
### Container Platforms
- **Docker**: Container status, resource usage
- **Kubernetes**:
  - Pod monitors (crash, restart, not running)
  - Node monitors (resource usage, status)
  - Cluster monitors (API server, scheduler)
  - Workload monitors (deployments, statefulsets)
  - Velero/Ark backup monitors
### Network Monitors
- **HTTP**: Webcheck, SSL certificate expiry
- **DNS**: Query response time, availability
- **TLS**: Certificate expiration
## Monitor Configuration Pattern
Each monitor module follows this pattern:
```hcl
module "service_monitor" {
  source = "./path/to/monitor"

  # Environment and messaging
  environment = var.environment
  message     = var.alert_message

  # Timing configuration
  evaluation_delay = 15
  new_host_delay   = 300

  # Enable/disable monitors
  monitor_name_enabled = "true"

  # Thresholds
  monitor_name_critical = 90
  monitor_name_warning  = 75

  # Filtering
  filter_tags_custom = "env:production,team:platform"
}
```
## Alerting Integration
The `common/alerting-message` module provides templates for:
- PagerDuty integration
- Slack notifications
- Email alerts
- Webhook notifications
Example:
```hcl
module "alerting" {
  source          = "./terraform-datadog-old-monitors/common/alerting-message"
  message_alert   = "@pagerduty-critical"
  message_warning = "@slack-warnings"
  message_nodata  = "@slack-monitoring"
}
```
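
The generated message would typically be passed to a monitor module's `message` variable. The sketch below shows one way this might look. The output name `alerting_message` is hypothetical, since this README does not list the module's outputs, so confirm the real name in `common/alerting-message/outputs.tf`.

```hcl
module "alerting" {
  source          = "./terraform-datadog-old-monitors/common/alerting-message"
  message_alert   = "@pagerduty-critical"
  message_warning = "@slack-warnings"
  message_nodata  = "@slack-monitoring"
}

module "nginx_monitor" {
  source      = "./terraform-datadog-old-monitors/middleware/nginx"
  environment = "production"

  # Hypothetical output name (assumption); replace with the name
  # actually declared in the alerting-message module's outputs.tf.
  message = module.alerting.alerting_message
}
```

Building the message in one place keeps notification routing consistent across every monitor module that consumes it.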
## Filter Tags
The `common/filter-tags` module helps with tag-based filtering:
```hcl
module "filter_tags" {
  source                   = "./terraform-datadog-old-monitors/common/filter-tags"
  environment              = "production"
  filter_tags_use_defaults = true
  filter_tags_custom       = "service:api,tier:backend"
}
```
## Best Practices
1. **Start with defaults**: Use default thresholds first, then customize
2. **Gradual rollout**: Enable monitors incrementally
3. **Tag strategy**: Use consistent tagging across infrastructure
4. **Alert fatigue**: Tune thresholds to reduce false positives
5. **Documentation**: Document custom threshold decisions
6. **Testing**: Test monitors in non-production first
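
Gradual rollout and non-production testing can be combined by deriving the enable flags from the environment. The following is a sketch under stated assumptions: it reuses the Nginx variables from the Usage section and introduces a root-level `environment` variable and a local value that are not part of the module.

```hcl
# Root-level input (assumption): the environment this configuration targets.
variable "environment" {
  type = string
}

locals {
  # Keep the dropped-connections monitor off in production until it has
  # been validated elsewhere; the flags are strings, as in the Usage examples.
  nginx_dropped_enabled = var.environment == "production" ? "false" : "true"
}

module "nginx_monitor" {
  source      = "./terraform-datadog-old-monitors/middleware/nginx"
  environment = var.environment
  message     = "Nginx issue detected @slack-channel"

  # Already validated: enabled in every environment
  nginx_connect_enabled = "true"

  # Still being rolled out: enabled outside production only
  nginx_dropped_enabled = local.nginx_dropped_enabled
}
```

Once the monitor has proven quiet in non-production, replacing the local value with `"true"` enables it everywhere.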
## Customization Examples
### Custom Thresholds
```hcl
# More aggressive CPU monitoring
cpu_critical = 85
cpu_warning = 70
# Relaxed disk space monitoring
disk_space_critical = 95
disk_space_warning = 90
```
### Conditional Monitoring
```hcl
# Only monitor specific services
filter_tags_custom = "service:critical-app"
# Skip new hosts for longer period
new_host_delay = 600 # 10 minutes
```
### Custom Alert Messages
```hcl
message = <<-EOT
{{#is_alert}}
CRITICAL: {{check}} on {{host.name}}
@pagerduty-critical
{{/is_alert}}
{{#is_warning}}
WARNING: {{check}} on {{host.name}}
@slack-warnings
{{/is_warning}}
EOT
```
## Maintenance
This module appears to be a legacy or archived version, as the "old-monitors" name suggests. Consider:
- Reviewing for updates from Claranet repository
- Migrating to newer monitoring solutions if available
- Documenting which monitors are actively used
- Deprecating unused monitor configurations
## Outputs
Each sub-module may export:
- Monitor IDs
- Monitor names
- Alert status
Check each sub-module's `outputs.tf` file for specifics.
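
A sub-module output can be re-exported from the root configuration so that other tooling can reference the monitor. This sketch assumes the `nginx_monitor` module is declared as in the Usage section; `nginx_connect_id` is a hypothetical output name used only for illustration.

```hcl
output "nginx_connect_monitor_id" {
  description = "ID of the Nginx connection monitor"

  # Hypothetical output name (assumption); use the name declared in
  # middleware/nginx/outputs.tf.
  value = module.nginx_monitor.nginx_connect_id
}
```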
## Notes
- This is a comprehensive library of monitor templates
- Based on Claranet's open-source Datadog monitors
- Covers most common infrastructure components
- Highly customizable with sensible defaults
- May contain more monitors than needed for your use case
- Review and enable only required monitors to avoid alert fatigue
## Migration Path
If migrating from this module:
1. Audit currently active monitors
2. Document custom thresholds
3. Test new monitoring solutions in parallel
4. Gradually migrate monitor by monitor
5. Keep this module for reference
## Resources
- Original Claranet repository: [terraform-datadog-monitors](https://github.com/claranet/terraform-datadog-monitors)
- Datadog monitor documentation: [Datadog Monitors](https://docs.datadoghq.com/monitors/)
## License
Based on Claranet's open-source work.
Internal use: Sanoma/WeBuildYourCloud
## Authors
- Original: Claranet team
- Maintained by: Platform Engineering team