Alerting on Webhook Delivery Failures Without Drowning in Noise
The goal of webhook alerting is to page a human exactly when delivery is failing fast enough to matter, and never otherwise — a balance this guide strikes by combining burn-rate, queue-depth, and retry-exhaustion alerts, extending the practices in Webhook Observability & Monitoring. The scenario is a team that already publishes a delivery SLO and now needs alerting that catches both sudden outages and slow budget bleed without firing on every transient blip. We will build multi-window burn-rate alerts against the error budget from defining SLOs for webhook delivery, add dead-letter and retry-exhaustion alerts, and route them sensibly. The underlying signals are the spans and metrics produced by instrumenting webhooks with OpenTelemetry.
Prerequisites
- The SLO recording rules from the SLO guide, especially webhook:delivery_success_ratio:30d.
- Metrics for DLQ depth and retry exhaustion: a webhook_dlq_depth gauge and a webhook_retries_exhausted_total counter.
- Prometheus Alertmanager (or equivalent) with routing and inhibition configured.
- Familiarity with the dead-letter queue and exponential backoff so alert thresholds reflect the real retry budget.
Step 1: Define Burn-Rate Expressions
Burn rate is how fast you are consuming the error budget relative to the rate that would exhaust it exactly at the window’s end. A burn rate of 1 means you will spend the whole month’s budget in a month; a burn rate of 14.4 means you would spend it in roughly two days. Compute the short-window failure ratio and divide by the allowed budget.
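# webhook_burn_rate.yml
groups:
  - name: webhook_burn_rate
    rules:
      # Failure ratio over a short window divided by the 0.1% budget = burn rate.
      - record: webhook:burn_rate:5m
        expr: |
          (1 - (
            sum(rate(webhook_deliveries_total{outcome="acked"}[5m]))
            / sum(rate(webhook_deliveries_total[5m]))
          )) / 0.001
      # The 30m window is the confirmation window for the slow-burn alert in Step 2.
      - record: webhook:burn_rate:30m
        expr: |
          (1 - (
            sum(rate(webhook_deliveries_total{outcome="acked"}[30m]))
            / sum(rate(webhook_deliveries_total[30m]))
          )) / 0.001
      - record: webhook:burn_rate:1h
        expr: |
          (1 - (
            sum(rate(webhook_deliveries_total{outcome="acked"}[1h]))
            / sum(rate(webhook_deliveries_total[1h]))
          )) / 0.001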
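The arithmetic behind these recording rules is small enough to check by hand. The sketch below is illustrative only: the function names are hypothetical, and it assumes the 0.1% budget and 30-day SLO window used in this guide.

```python
def burn_rate(failure_ratio: float, budget: float = 0.001) -> float:
    """Observed failure ratio divided by the allowed failure ratio."""
    return failure_ratio / budget


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until the whole budget is spent if this burn rate holds."""
    if rate <= 0:
        return float("inf")  # nothing is failing, so the budget is never spent
    return window_days / rate


if __name__ == "__main__":
    # 1.44% of deliveries failing against a 0.1% budget.
    rate = burn_rate(0.0144)
    print(round(rate, 1), round(days_to_exhaustion(rate), 2))
```

A failure ratio equal to the budget gives a burn rate of 1 and exhausts the budget in exactly 30 days, which matches the definition above.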
Step 2: Add Multi-Window Burn-Rate Alerts
A single short window is twitchy; a single long window is slow. The multi-window technique pages only when a fast window and a slower confirmation window both exceed the threshold, which catches real outages quickly while suppressing one-off spikes. Use a high-burn pair for fast pages and a lower-burn pair for slow-bleed tickets.
# webhook_slo_alerts.yml
groups:
  - name: webhook_slo_alerts
    rules:
      - alert: WebhookFastBudgetBurn
        expr: webhook:burn_rate:5m > 14.4 and webhook:burn_rate:1h > 14.4
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Webhook delivery burning error budget fast"
          description: "5m and 1h burn rate both exceed 14.4x; budget exhausts in ~2 days."
      - alert: WebhookSlowBudgetBurn
        # Needs a webhook:burn_rate:30m recording rule alongside the 5m and 1h ones.
        expr: webhook:burn_rate:1h > 3 and webhook:burn_rate:30m > 3
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Webhook delivery slowly missing SLO"
          description: "Sustained elevated failure ratio; investigate before budget depletes."
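To see why the two-window condition suppresses one-off spikes, and what the thresholds mean in budget terms, here is a minimal sketch. The helper names are hypothetical and the 30-day window is this guide's assumption. It mirrors the `and` in the alert expressions and does not model the `for:` delay.

```python
def should_alert(short_burn: float, long_burn: float, threshold: float) -> bool:
    """Fire only when both windows are above the threshold."""
    return short_burn > threshold and long_burn > threshold


def budget_fraction_spent(rate: float, window_hours: float, slo_days: float = 30.0) -> float:
    """Share of the SLO-period budget consumed by holding `rate` for `window_hours`."""
    return rate * window_hours / (slo_days * 24)


if __name__ == "__main__":
    # A brief spike: the 5m window is hot, the 1h window has barely moved.
    print(should_alert(short_burn=40.0, long_burn=2.0, threshold=14.4))
    # A real outage: both windows are hot.
    print(should_alert(short_burn=40.0, long_burn=20.0, threshold=14.4))
    # Holding a 14.4x burn for one hour spends about 2% of a 30-day budget.
    print(round(budget_fraction_spent(14.4, window_hours=1), 4))
```

Read this way, the fast-burn threshold says "page when about 2% of the monthly budget has gone in an hour", which is a useful sanity check when tuning thresholds for a different SLO.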
Step 3: Add DLQ-Depth and Retry-Exhaustion Alerts
Burn-rate alerts can lag when traffic is low, so back them with direct queue-health alerts. A dead-letter queue that is growing means events are permanently failing; a spike in retry exhaustion means an endpoint is down and is about to flood the DLQ.
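      # These rules go under the rules: key of an alerting group.
      - alert: WebhookDLQGrowing
        # webhook_dlq_depth is a gauge, so use delta(); increase() is for counters.
        expr: delta(webhook_dlq_depth[15m]) > 50
        for: 5m
        labels:
          severity: page
        annotations:
          # Assumes the gauge carries an endpoint label.
          summary: "Dead-letter queue growing for {{ $labels.endpoint }}"
          description: "DLQ depth grew by more than 50 events in 15m; consumer likely broken."
      - alert: WebhookRetriesExhausting
        # rate() is per second: this fires above 1 exhausted event/s per endpoint.
        expr: sum(rate(webhook_retries_exhausted_total[10m])) by (endpoint) > 1
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Retry budget exhausting for {{ $labels.endpoint }}"
          description: "Events are giving up after full backoff; endpoint may be down."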
Step 4: Route and Silence Alerts
Routing turns alerts into the right action. Send severity: page to the on-call pager and severity: ticket to a queue; route per-endpoint DLQ alerts to the team that owns that integration. Use inhibition so a broad outage page suppresses the dozens of per-endpoint alerts it would otherwise spawn.
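route:
  receiver: tickets
  group_by: [alertname, endpoint]
  routes:
    - matchers: [severity="page"]
      receiver: oncall-pager
      group_wait: 30s
      repeat_interval: 4h
inhibit_rules:
  - source_matchers: [alertname="WebhookFastBudgetBurn"]
    target_matchers: [alertname="WebhookDLQGrowing"]
    # equal only matches when both alerts carry the same endpoint value. The
    # burn-rate rules in Step 1 aggregate across endpoints, so either record
    # them by (endpoint) or drop this line for a platform-wide inhibition.
    equal: [endpoint]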
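The effect of the equal list is easy to misjudge, so here is a simplified model of the label comparison. It is a sketch of the documented semantics, not Alertmanager's code: it assumes the source and target matchers have already matched, treats a missing label as an empty value, and uses a hypothetical endpoint name.

```python
def is_inhibited(target: dict, firing_sources: list, equal: list) -> bool:
    """True if any firing source agrees with the target on every `equal` label.

    A missing label is treated as an empty string.
    """
    return any(
        all(src.get(name, "") == target.get(name, "") for name in equal)
        for src in firing_sources
    )


if __name__ == "__main__":
    dlq = {"alertname": "WebhookDLQGrowing", "endpoint": "acme"}  # hypothetical endpoint

    # Source aggregated across endpoints: it has no endpoint label.
    global_burn = {"alertname": "WebhookFastBudgetBurn"}
    print(is_inhibited(dlq, [global_burn], equal=["endpoint"]))  # labels differ
    print(is_inhibited(dlq, [global_burn], equal=[]))            # nothing to compare

    # Source recorded per endpoint: the labels line up.
    acme_burn = {"alertname": "WebhookFastBudgetBurn", "endpoint": "acme"}
    print(is_inhibited(dlq, [acme_burn], equal=["endpoint"]))
```

Under this model, a source alert without an endpoint label never suppresses a per-endpoint target while endpoint is listed in equal, so it is worth checking which labels the source alert actually carries.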
Verification and Testing
Validate alert logic with Prometheus’s promtool test rules, which runs sample series through your rules and asserts which alerts fire — no real outage required:
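# alert_tests.yml
rule_files: [webhook_burn_rate.yml, webhook_slo_alerts.yml]
tests:
  - interval: 1m
    input_series:
      - series: 'webhook_deliveries_total{outcome="acked"}'
        values: '0+0x60'   # nothing acked = 100% failure
      - series: 'webhook_deliveries_total'
        values: '0+100x60'
    alert_rule_test:
      - eval_time: 5m
        alertname: WebhookFastBudgetBurn
        exp_alerts:
          - exp_labels: { severity: page }
            # promtool compares annotations as well, so list them exactly.
            exp_annotations:
              summary: "Webhook delivery burning error budget fast"
              description: "5m and 1h burn rate both exceed 14.4x; budget exhausts in ~2 days."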
Run it in CI with promtool test rules alert_tests.yml, and do a quarterly live fire drill by pausing a test endpoint to confirm the page actually reaches the on-call phone.
Failure Modes and Gotchas
- Single-window alerts that flap. A lone 5m rule pages on every transient 500. Always require a confirming longer window before paging.
- Burn-rate blind spots at low volume. With few deliveries, ratios are noisy and slow to trip. DLQ-depth and retry-exhaustion alerts cover that gap; keep both.
- No inhibition between broad and narrow alerts. A platform outage pages once for the SLO and once per endpoint, burying the signal. Add inhibition rules keyed on endpoint.
- Thresholds divorced from the retry budget. A DLQ-growth threshold tuned without knowing the backoff schedule fires too early or too late. Set it relative to the configured exponential backoff attempts.
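To tie a DLQ threshold to the retry budget, it helps to know how long an event spends retrying before it is dead-lettered. The sketch below uses a hypothetical schedule (30 s base delay, doubling, 6 retries) and ignores jitter and request timeouts. Substitute the values from your own backoff configuration.

```python
def seconds_until_dead_letter(base_delay_s: float, factor: float, max_retries: int) -> float:
    """Total time spent waiting between retries before an event is given up on."""
    return sum(base_delay_s * factor**i for i in range(max_retries))


if __name__ == "__main__":
    # Hypothetical schedule: waits of 30, 60, 120, 240, 480 and 960 seconds.
    total = seconds_until_dead_letter(base_delay_s=30, factor=2, max_retries=6)
    print(total / 60)  # minutes between the first failure and the DLQ
```

With that schedule an endpoint that goes down produces no DLQ growth for roughly half an hour, so a DLQ alert with a 15m window cannot be the first signal of the outage; the burn-rate page has to cover that gap.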