Alerting on Webhook Delivery Failures Without Drowning in Noise

The goal of webhook alerting is to page a human exactly when delivery is failing fast enough to matter, and never otherwise. This guide strikes that balance by combining burn-rate, queue-depth, and retry-exhaustion alerts, extending the practices in Webhook Observability & Monitoring. The scenario is a team that already publishes a delivery SLO and now needs alerting that catches both sudden outages and slow budget bleed without firing on every transient blip. We will build multi-window burn-rate alerts against the error budget set when defining SLOs for webhook delivery, add dead-letter and retry-exhaustion alerts, and route them sensibly. The underlying signals are the spans and metrics produced by instrumenting webhooks with OpenTelemetry.

[Figure: Multi-window alert routing. Fast window (5m burn > 14.4x) AND slow window (1h burn > 14.4x) -> page on-call; independent DLQ-depth (rising over 15m) and retry-exhaustion alerts are routed by endpoint and severity.]
A page fires only when both windows agree; DLQ and retry-exhaustion alerts run independently and route by severity.

Prerequisites

Step 1: Define Burn-Rate Expressions

Burn rate is how fast you are consuming the error budget relative to the rate that would exhaust it exactly at the end of the SLO period. With a 30-day month, a burn rate of 1 means you will spend the whole month’s budget in a month; a burn rate of 14.4 means you would spend it in roughly two days (30 / 14.4 ≈ 2.1). To compute it, take the failure ratio over a short window and divide by the allowed failure ratio, here the 0.1% budget of a 99.9% SLO.

groups:
  - name: webhook_burn_rate
    rules:
      # Failure ratio over a short window divided by the 0.1% budget = burn rate.
      - record: webhook:burn_rate:5m
        expr: |
          (1 - (
            sum(rate(webhook_deliveries_total{outcome="acked"}[5m]))
            / sum(rate(webhook_deliveries_total[5m]))
          )) / 0.001
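      - record: webhook:burn_rate:1h
        expr: |
          (1 - (
            sum(rate(webhook_deliveries_total{outcome="acked"}[1h]))
            / sum(rate(webhook_deliveries_total[1h]))
          )) / 0.001
      # 30m window, referenced by the slow-burn alert.
      - record: webhook:burn_rate:30m
        expr: |
          (1 - (
            sum(rate(webhook_deliveries_total{outcome="acked"}[30m]))
            / sum(rate(webhook_deliveries_total[30m]))
          )) / 0.001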
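
To sanity-check the numbers these rules produce, the arithmetic can be sketched in Python. This is a minimal illustration, assuming a 30-day SLO period and the 0.1% budget used above; the function names are hypothetical and not part of any library.

```python
# Hypothetical helpers illustrating burn-rate arithmetic; not a library API.
SLO_PERIOD_DAYS = 30      # assumption: 30-day SLO period
ERROR_BUDGET = 0.001      # 0.1% budget, i.e. a 99.9% delivery SLO


def burn_rate(acked: float, total: float, budget: float = ERROR_BUDGET) -> float:
    """Failure ratio divided by the allowed failure ratio."""
    if total <= 0:
        return 0.0  # no traffic: treated as no burn in this sketch
    return (1 - acked / total) / budget


def days_to_exhaustion(rate: float, period_days: float = SLO_PERIOD_DAYS) -> float:
    """Days until the whole budget is spent if this burn rate is sustained."""
    return float("inf") if rate <= 0 else period_days / rate


def budget_spent(rate: float, hours: float, period_days: float = SLO_PERIOD_DAYS) -> float:
    """Fraction of the period's budget consumed by `hours` at this burn rate."""
    return rate * hours / (period_days * 24)


# 9,856 of 10,000 deliveries acked means 1.44% failed.
rate = burn_rate(acked=9_856, total=10_000)
print(round(rate, 1))                         # 14.4
print(round(days_to_exhaustion(rate), 2))     # 2.08
print(round(budget_spent(rate, hours=1), 3))  # 0.02
```

Under these assumptions, one hour at a 14.4x burn consumes about 2% of the monthly budget, which is why that threshold is commonly paired with a one-hour window.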

Step 2: Add Multi-Window Burn-Rate Alerts

A single short window is twitchy; a single long window is slow. The multi-window technique pages only when a fast window and a slower confirmation window both exceed the threshold, which catches real outages quickly while suppressing one-off spikes. Use a high-burn pair for fast pages and a lower-burn pair for slow-bleed tickets.

  - name: webhook_slo_alerts
    rules:
      - alert: WebhookFastBudgetBurn
        expr: webhook:burn_rate:5m > 14.4 and webhook:burn_rate:1h > 14.4
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Webhook delivery burning error budget fast"
          description: "5m and 1h burn rate both exceed 14.4x; budget exhausts in ~2 days."

      - alert: WebhookSlowBudgetBurn
        expr: webhook:burn_rate:1h > 3 and webhook:burn_rate:30m > 3
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Webhook delivery slowly missing SLO"
          description: "Sustained elevated failure ratio; investigate before budget depletes."
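
The effect of requiring both windows to agree can be illustrated with a small simulation. This is a simplified sketch, not how Prometheus evaluates rules: it assumes one failure-ratio sample per minute with equal traffic in every minute, ignores the `for:` clause, and uses hypothetical function names.

```python
# Simplified model of the fast-burn gate; window lengths are in minutes.
def window_burn(failure_ratios, end, width, budget=0.001):
    """Mean failure ratio over the `width` minutes ending at `end`, over budget."""
    window = failure_ratios[max(0, end - width):end]
    return (sum(window) / len(window)) / budget if window else 0.0


def should_page(failure_ratios, end, threshold=14.4):
    """Page only when the 5m and the 1h window both exceed the threshold."""
    fast = window_burn(failure_ratios, end, 5)
    slow = window_burn(failure_ratios, end, 60)
    return fast > threshold and slow > threshold


healthy = [0.0005] * 60                                 # 0.05% failures every minute
blip = [0.0005] * 59 + [0.2]                            # one bad minute just now
outage = [0.0005] * 50 + [0.5] * 10                     # ten bad minutes, still ongoing
recovered = [0.0005] * 20 + [0.5] * 10 + [0.0005] * 30  # outage ended 30 minutes ago

for name, series in [("healthy", healthy), ("blip", blip),
                     ("outage", outage), ("recovered", recovered)]:
    print(name, should_page(series, end=60))
```

In this model only the ongoing outage pages. The blip trips the 5m window but not the 1h window, and the recovered case still shows a high 1h burn but a quiet 5m window, so the alert resolves soon after delivery recovers.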

Step 3: Add DLQ-Depth and Retry-Exhaustion Alerts

Burn-rate alerts can lag when traffic is low, so back them with direct queue-health alerts. A dead-letter queue that is growing means events are permanently failing; a spike in retry exhaustion means an endpoint is down and is about to flood the DLQ.

      - alert: WebhookDLQGrowing
        # Queue depth is assumed to be a gauge, so use delta(); increase() is for counters.
        expr: delta(webhook_dlq_depth[15m]) > 50
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Dead-letter queue growing for {{ $labels.endpoint }}"
          description: "DLQ depth grew by more than 50 events in 15m; consumer likely broken."

      - alert: WebhookRetriesExhausting
        # rate() is per second: > 1 means more than one exhausted event per second per endpoint.
        expr: sum(rate(webhook_retries_exhausted_total[10m])) by (endpoint) > 1
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Retry budget exhausting for {{ $labels.endpoint }}"
          description: "Events are giving up after full backoff; endpoint may be down."
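
Because PromQL's `rate()` yields a per-second average, it helps to translate a rate threshold into an event count before settling on it. The sketch below does that conversion; the helper names and the 30-event target are hypothetical and only illustrate the arithmetic.

```python
# Hypothetical helpers for reading PromQL rate() thresholds; not a library API.
def events_in_window(rate_per_second: float, window_minutes: float) -> float:
    """Events implied by a sustained per-second rate over the window."""
    return rate_per_second * window_minutes * 60


def rate_for_events(events: float, window_minutes: float) -> float:
    """Per-second rate threshold equivalent to `events` over the window."""
    return events / (window_minutes * 60)


# The retry-exhaustion rule compares a 10m rate against 1 event per second.
print(events_in_window(1.0, 10.0))   # 600.0
# A hypothetical quieter target: 30 exhausted events in 10 minutes.
print(rate_for_events(30.0, 10.0))   # 0.05
```

A threshold of 1 per second therefore corresponds to roughly 600 exhausted events per endpoint in ten minutes. If an endpoint rarely receives that much traffic, a lower threshold may be more appropriate.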

Step 4: Route and Silence Alerts

Routing turns alerts into the right action. Send severity: page to the on-call pager and severity: ticket to a queue; route per-endpoint DLQ alerts to the team that owns that integration. Use inhibition so a broad outage page suppresses the dozens of per-endpoint alerts it would otherwise spawn.

route:
  receiver: tickets
  group_by: [alertname, endpoint]
  routes:
    - matchers: [severity="page"]
      receiver: oncall-pager
      group_wait: 30s
      repeat_interval: 4h
inhibit_rules:
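  # The burn-rate alert is aggregated across endpoints and carries no endpoint
  # label, so no `equal` clause is used: while it fires, it suppresses every
  # per-endpoint DLQ alert.
  - source_matchers: [alertname="WebhookFastBudgetBurn"]
    target_matchers: [alertname="WebhookDLQGrowing"]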
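
How an alert's labels select a receiver, and when an inhibition applies, can be modelled in a few lines of Python. This is a deliberately simplified sketch of the behaviour described above, not Alertmanager's implementation; the function names are hypothetical, and the one documented rule it relies on is that a missing label is treated the same as an empty one when comparing `equal` labels.

```python
# Simplified model of routing and inhibition; not Alertmanager's actual code.
def pick_receiver(labels: dict) -> str:
    """First matching child route wins; otherwise the root receiver applies."""
    child_routes = [({"severity": "page"}, "oncall-pager")]
    for matchers, receiver in child_routes:
        if all(labels.get(k) == v for k, v in matchers.items()):
            return receiver
    return "tickets"


def is_inhibited(target: dict, firing: list, source_name: str,
                 target_name: str, equal: tuple = ()) -> bool:
    """True if a firing source alert agrees with the target on every `equal` label."""
    if target.get("alertname") != target_name:
        return False
    return any(
        src.get("alertname") == source_name
        # A missing label compares as the empty string.
        and all(src.get(name, "") == target.get(name, "") for name in equal)
        for src in firing
    )


burn = {"alertname": "WebhookFastBudgetBurn", "severity": "page"}
dlq = {"alertname": "WebhookDLQGrowing", "severity": "page", "endpoint": "billing"}

print(pick_receiver(burn))                    # oncall-pager
print(pick_receiver({"severity": "ticket"}))  # tickets
print(is_inhibited(dlq, [burn], "WebhookFastBudgetBurn", "WebhookDLQGrowing"))
print(is_inhibited(dlq, [burn], "WebhookFastBudgetBurn", "WebhookDLQGrowing",
                   equal=("endpoint",)))
```

In this model a source alert without an `endpoint` label suppresses a per-endpoint target only when `endpoint` is not listed in `equal`, which is worth checking against the labels your own alerts actually carry.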

Verification and Testing

Validate alert logic with Prometheus’s promtool test rules, which runs sample series through your rules and asserts which alerts fire, so no real outage is required:

# alert_tests.yml
rule_files: [webhook_burn_rate.yml, webhook_slo_alerts.yml]
tests:
  - interval: 1m
    input_series:
      - series: 'webhook_deliveries_total{outcome="acked"}'
        values: '0+0x60'   # nothing acked = 100% failure
      - series: 'webhook_deliveries_total'
        values: '0+100x60'
    alert_rule_test:
      - eval_time: 5m
        alertname: WebhookFastBudgetBurn
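        exp_alerts:
          - exp_labels: { severity: page }
            # promtool compares annotations as well as labels, so they must match the rule.
            exp_annotations:
              summary: "Webhook delivery burning error budget fast"
              description: "5m and 1h burn rate both exceed 14.4x; budget exhausts in ~2 days."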

Run it in CI with promtool test rules alert_tests.yml, and do a quarterly live fire drill by pausing a test endpoint to confirm the page actually reaches the on-call phone.

Failure Modes and Gotchas