MON-366 Change static healthy host check to a ratio

Laurent Piroelle 2019-11-05 12:05:16 +01:00
parent 0097fd15ac
commit 1de02e53d8
4 changed files with 52 additions and 33 deletions
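
For callers of the module, the change amounts to renaming the `appgateway_healthy_host_count_*` inputs to `appgateway_unhealthy_host_ratio_*` and, optionally, overriding the two new thresholds. The output is also renamed, from `appgateway_healthy_host_count_id` to `appgateway_healthy_host_ratio_id`. A minimal sketch of a caller follows. The `source` value, the `environment` value, and the notification handle are placeholders, and any other inputs the module requires are omitted.

```hcl
# Hypothetical caller configuration; values marked as placeholders are not taken from this commit.
module "datadog-monitors-cloud-azure-app-gateway" {
  source = "<path or URL of the app-gateway monitors module>" # placeholder

  environment = "prod"            # placeholder; used in the monitor name and tags
  message     = "@example-oncall" # placeholder; fallback when no per-monitor message is set

  # New inputs from this commit: percentage of unhealthy hosts per backend pool.
  # The values shown are the defaults declared in the variables file.
  appgateway_unhealthy_host_ratio_threshold_warning  = 50
  appgateway_unhealthy_host_ratio_threshold_critical = 75
}
```

As a worked case under these defaults: a pool with 3 unhealthy hosts and 1 healthy host evaluates to 3 / (3 + 1) * 100 = 75. The query uses a strict `>`, so this should exceed the warning threshold (50) but not the critical one (75). The previous monitor alerted only when the healthy host count dropped below 1.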


@@ -17,9 +17,9 @@ module "datadog-monitors-cloud-azure-app-gateway" {
 Creates DataDog monitors with the following checks:
 - App Gateway backend connect time is too high
-- App Gateway backend has no healthy host
 - App Gateway backend HTTP 4xx errors rate is too high
 - App Gateway backend HTTP 5xx errors rate is too high
+- App Gateway backend unhealthy host ratio is too high
 - App Gateway failed requests
 - App Gateway has no connection
 - App Gateway HTTP 4xx errors rate is too high
@@ -33,8 +33,8 @@ Creates DataDog monitors with the following checks:
 | appgateway\_backend\_connect\_time\_enabled | Flag to enable App Gateway backend_connect_time monitor | string | `"true"` | no |
 | appgateway\_backend\_connect\_time\_extra\_tags | Extra tags for App Gateway backend_connect_time monitor | list(string) | `[]` | no |
 | appgateway\_backend\_connect\_time\_message | Custom message for App Gateway backend_connect_time monitor | string | `""` | no |
-| appgateway\_backend\_connect\_time\_threshold\_critical | Maximum critical backend_connect_time errors in seconds | string | `"50"` | no |
-| appgateway\_backend\_connect\_time\_threshold\_warning | Warning regarding backend_connect_time errors in seconds | string | `"40"` | no |
+| appgateway\_backend\_connect\_time\_threshold\_critical | Maximum critical backend_connect_time errors in milliseconds | string | `"50"` | no |
+| appgateway\_backend\_connect\_time\_threshold\_warning | Warning regarding backend_connect_time errors in milliseconds | string | `"40"` | no |
 | appgateway\_backend\_connect\_time\_time\_aggregator | Monitor aggregator for App Gateway backend_connect_time [available values: min, max or avg] | string | `"max"` | no |
 | appgateway\_backend\_connect\_time\_timeframe | Monitor timeframe for App Gateway backend_connect_time [available values: `last_#m` (1, 5, 10, 15, or 30), `last_#h` (1, 2, or 4), or `last_1d`] | string | `"last_5m"` | no |
 | appgateway\_backend\_http\_4xx\_errors\_enabled | Flag to enable App Gateway http 4xx errors monitor | string | `"true"` | no |
@@ -58,11 +58,6 @@ Creates DataDog monitors with the following checks:
 | appgateway\_failed\_requests\_threshold\_warning | Warning regarding acceptable percent of failed errors | string | `"80"` | no |
 | appgateway\_failed\_requests\_time\_aggregator | Monitor aggregator for App Gateway failed requests [available values: min, max or avg] | string | `"min"` | no |
 | appgateway\_failed\_requests\_timeframe | Monitor timeframe for App Gateway failed requests [available values: `last_#m` (1, 5, 10, 15, or 30), `last_#h` (1, 2, or 4), or `last_1d`] | string | `"last_5m"` | no |
-| appgateway\_healthy\_host\_count\_enabled | Flag to enable App Gateway healthy host monitor | string | `"true"` | no |
-| appgateway\_healthy\_host\_count\_extra\_tags | Extra tags for App Gateway healthy host monitor | list(string) | `[]` | no |
-| appgateway\_healthy\_host\_count\_message | Custom message for App Gateway healthy host monitor | string | `""` | no |
-| appgateway\_healthy\_host\_count\_time\_aggregator | Monitor aggregator for App Gateway healthy host [available values: min, max or avg] | string | `"max"` | no |
-| appgateway\_healthy\_host\_count\_timeframe | Monitor timeframe for App Gateway healthy host [available values: `last_#m` (1, 5, 10, 15, or 30), `last_#h` (1, 2, or 4), or `last_1d`] | string | `"last_5m"` | no |
 | appgateway\_http\_4xx\_errors\_enabled | Flag to enable App Gateway http 4xx errors monitor | string | `"true"` | no |
 | appgateway\_http\_4xx\_errors\_extra\_tags | Extra tags for App Gateway http 4xx errors monitor | list(string) | `[]` | no |
 | appgateway\_http\_4xx\_errors\_message | Custom message for App Gateway http 4xx errors monitor | string | `""` | no |
@@ -77,6 +72,13 @@ Creates DataDog monitors with the following checks:
 | appgateway\_http\_5xx\_errors\_threshold\_warning | Warning regarding acceptable percent of 5xx error | string | `"80"` | no |
 | appgateway\_http\_5xx\_errors\_time\_aggregator | Monitor aggregator for App Gateway http 5xx errors [available values: min, max or avg] | string | `"max"` | no |
 | appgateway\_http\_5xx\_errors\_timeframe | Monitor timeframe for App Gateway http 5xx errors [available values: `last_#m` (1, 5, 10, 15, or 30), `last_#h` (1, 2, or 4), or `last_1d`] | string | `"last_5m"` | no |
+| appgateway\_unhealthy\_host\_ratio\_enabled | Flag to enable App Gateway unhealthy host ratio monitor | string | `"true"` | no |
+| appgateway\_unhealthy\_host\_ratio\_extra\_tags | Extra tags for App Gateway unhealthy host ratio monitor | list(string) | `[]` | no |
+| appgateway\_unhealthy\_host\_ratio\_message | Custom message for App Gateway unhealthy host ratio monitor | string | `""` | no |
+| appgateway\_unhealthy\_host\_ratio\_threshold\_critical | Maximum critical acceptable ratio of unhealthy host | string | `"75"` | no |
+| appgateway\_unhealthy\_host\_ratio\_threshold\_warning | Warning regarding acceptable ratio of unhealthy host | string | `"50"` | no |
+| appgateway\_unhealthy\_host\_ratio\_time\_aggregator | Monitor aggregator for App Gateway unhealthy host ratio [available values: min, max or avg] | string | `"max"` | no |
+| appgateway\_unhealthy\_host\_ratio\_timeframe | Monitor timeframe for App Gateway unhealthy host ratio [available values: `last_#m` (1, 5, 10, 15, or 30), `last_#h` (1, 2, or 4), or `last_1d`] | string | `"last_5m"` | no |
 | current\_connection\_enabled | Flag to enable App Gateway current connections monitor | string | `"true"` | no |
 | current\_connection\_extra\_tags | Extra tags for App Gateway current connections monitor | list(string) | `[]` | no |
 | current\_connection\_message | Custom message for App Gateway current connections monitor | string | `""` | no |
@@ -104,7 +106,7 @@ Creates DataDog monitors with the following checks:
 | appgateway\_backend\_http\_4xx\_errors\_id | id for monitor appgateway_backend_http_4xx_errors |
 | appgateway\_backend\_http\_5xx\_errors\_id | id for monitor appgateway_backend_http_5xx_errors |
 | appgateway\_failed\_requests\_id | id for monitor appgateway_failed_requests |
-| appgateway\_healthy\_host\_count\_id | id for monitor appgateway_healthy_host_count |
+| appgateway\_healthy\_host\_ratio\_id | id for monitor appgateway_healthy_host_ratio |
 | appgateway\_http\_4xx\_errors\_id | id for monitor appgateway_http_4xx_errors |
 | appgateway\_http\_5xx\_errors\_id | id for monitor appgateway_http_5xx_errors |
 | appgateway\_status\_id | id for monitor appgateway_status |


@@ -134,12 +134,12 @@ variable "appgateway_backend_connect_time_timeframe" {
 variable "appgateway_backend_connect_time_threshold_critical" {
   default = 50
-  description = "Maximum critical backend_connect_time errors in seconds"
+  description = "Maximum critical backend_connect_time errors in milliseconds"
 }
 variable "appgateway_backend_connect_time_threshold_warning" {
   default = 40
-  description = "Warning regarding backend_connect_time errors in seconds"
+  description = "Warning regarding backend_connect_time errors in milliseconds"
 }
 # Monitoring App Gateway failed_requests
@@ -183,37 +183,47 @@ variable "appgateway_failed_requests_threshold_warning" {
   description = "Warning regarding acceptable percent of failed errors"
 }
-# Monitoring App Gateway healthy_host_count
-variable "appgateway_healthy_host_count_enabled" {
-  description = "Flag to enable App Gateway healthy host monitor"
+# Monitoring App Gateway unhealthy_host_ratio
+variable "appgateway_unhealthy_host_ratio_enabled" {
+  description = "Flag to enable App Gateway unhealthy host ratio monitor"
   type = string
   default = "true"
 }
-variable "appgateway_healthy_host_count_extra_tags" {
-  description = "Extra tags for App Gateway healthy host monitor"
+variable "appgateway_unhealthy_host_ratio_extra_tags" {
+  description = "Extra tags for App Gateway unhealthy host ratio monitor"
   type = list(string)
   default = []
 }
-variable "appgateway_healthy_host_count_message" {
-  description = "Custom message for App Gateway healthy host monitor"
+variable "appgateway_unhealthy_host_ratio_message" {
+  description = "Custom message for App Gateway unhealthy host ratio monitor"
   type = string
   default = ""
 }
-variable "appgateway_healthy_host_count_time_aggregator" {
-  description = "Monitor aggregator for App Gateway healthy host [available values: min, max or avg]"
+variable "appgateway_unhealthy_host_ratio_time_aggregator" {
+  description = "Monitor aggregator for App Gateway unhealthy host ratio [available values: min, max or avg]"
   type = string
   default = "max"
 }
-variable "appgateway_healthy_host_count_timeframe" {
-  description = "Monitor timeframe for App Gateway healthy host [available values: `last_#m` (1, 5, 10, 15, or 30), `last_#h` (1, 2, or 4), or `last_1d`]"
+variable "appgateway_unhealthy_host_ratio_timeframe" {
+  description = "Monitor timeframe for App Gateway unhealthy host ratio [available values: `last_#m` (1, 5, 10, 15, or 30), `last_#h` (1, 2, or 4), or `last_1d`]"
   type = string
   default = "last_5m"
 }
+variable "appgateway_unhealthy_host_ratio_threshold_critical" {
+  default = 75
+  description = "Maximum critical acceptable ratio of unhealthy host"
+}
+variable "appgateway_unhealthy_host_ratio_threshold_warning" {
+  default = 50
+  description = "Warning regarding acceptable ratio of unhealthy host"
+}
 # Monitoring App Gateway response_status 4xx
 variable "appgateway_http_4xx_errors_enabled" {
   description = "Flag to enable App Gateway http 4xx errors monitor"


@@ -127,18 +127,25 @@ EOQ
   }
 }
-# Monitoring App Gateway healthy_host_count
-resource "datadog_monitor" "appgateway_healthy_host_count" {
-  count = var.appgateway_healthy_host_count_enabled == "true" ? 1 : 0
-  name = "${var.prefix_slug == "" ? "" : "[${var.prefix_slug}]"}[${var.environment}] App Gateway backend has no healthy host"
-  message = coalesce(var.appgateway_healthy_host_count_message, var.message)
+# Monitoring App Gateway unhealthy_host_ratio
+resource "datadog_monitor" "appgateway_healthy_host_ratio" {
+  count = var.appgateway_unhealthy_host_ratio_enabled == "true" ? 1 : 0
+  name = "${var.prefix_slug == "" ? "" : "[${var.prefix_slug}]"}[${var.environment}] App Gateway backend unhealthy host ratio is too high {{#is_alert}}{{{comparator}}} {{threshold}}% ({{value}}%){{/is_alert}}{{#is_warning}}{{{comparator}}} {{warn_threshold}}% ({{value}}%){{/is_warning}}"
+  message = coalesce(var.appgateway_unhealthy_host_ratio_message, var.message)
   type = "query alert"
   query = <<EOQ
-    ${var.appgateway_healthy_host_count_time_aggregator}(${var.appgateway_healthy_host_count_timeframe}):
-      sum:azure.network_applicationgateways.healthy_host_count${module.filter-tags.query_alert} by {resource_group,region,name,backendsettingspool} < 1
+    ${var.appgateway_unhealthy_host_ratio_time_aggregator}(${var.appgateway_unhealthy_host_ratio_timeframe}):
+      sum:azure.network_applicationgateways.unhealthy_host_count${module.filter-tags.query_alert} by {resource_group,region,name,backendsettingspool} /
+      (sum:azure.network_applicationgateways.unhealthy_host_count${module.filter-tags.query_alert} by {resource_group,region,name,backendsettingspool} +
+      sum:azure.network_applicationgateways.healthy_host_count${module.filter-tags.query_alert} by {resource_group,region,name,backendsettingspool})
+      * 100 > ${var.appgateway_unhealthy_host_ratio_threshold_critical}
 EOQ
+  thresholds = {
+    critical = var.appgateway_unhealthy_host_ratio_threshold_critical
+    warning = var.appgateway_unhealthy_host_ratio_threshold_warning
+  }
   evaluation_delay = var.evaluation_delay
   new_host_delay = var.new_host_delay
   notify_no_data = false
@@ -149,7 +156,7 @@ EOQ
   locked = false
   require_full_window = false
-  tags = concat(["env:${var.environment}", "type:cloud", "provider:azure", "resource:app-gateway", "team:claranet", "created-by:terraform"], var.appgateway_healthy_host_count_extra_tags)
+  tags = concat(["env:${var.environment}", "type:cloud", "provider:azure", "resource:app-gateway", "team:claranet", "created-by:terraform"], var.appgateway_unhealthy_host_ratio_extra_tags)
   lifecycle {
     ignore_changes = ["silenced"]


@@ -18,9 +18,9 @@ output "appgateway_failed_requests_id" {
   value = datadog_monitor.appgateway_failed_requests.*.id
 }
-output "appgateway_healthy_host_count_id" {
-  description = "id for monitor appgateway_healthy_host_count"
-  value = datadog_monitor.appgateway_healthy_host_count.*.id
+output "appgateway_healthy_host_ratio_id" {
+  description = "id for monitor appgateway_healthy_host_ratio"
+  value = datadog_monitor.appgateway_healthy_host_ratio.*.id
 }
 output "appgateway_http_4xx_errors_id" {