Alerting on Webhook Delivery Failures Without Drowning in Noise

The goal of webhook alerting is to page a human exactly when delivery is failing fast enough to matter, and never otherwise. This guide strikes that balance by combining burn-rate, queue-depth, and retry-exhaustion alerts, extending the practices in Webhook Observability & Monitoring. The scenario is a team that already publishes a delivery SLO and now needs alerting that catches both sudden outages and slow budget bleed without firing on every transient blip. We will build multi-window burn-rate alerts against the error budget established in the guide on defining SLOs for webhook delivery, add dead-letter and retry-exhaustion alerts, and route them sensibly. The underlying signals are the spans and metrics produced by instrumenting webhooks with OpenTelemetry.

A page fires only when both windows agree; DLQ and retry-exhaustion alerts run independently and route by severity.

Prerequisites

The SLO recording rules from the SLO guide, especially webhook:delivery_success_ratio:30d.
Metrics for DLQ depth and retry exhaustion: a webhook_dlq_depth gauge and a webhook_retries_exhausted_total counter.
Prometheus Alertmanager (or equivalent) with routing and inhibition configured.
Familiarity with the dead-letter queue and exponential backoff so alert thresholds reflect the real retry budget.

Step 1: Define Burn-Rate Expressions

Burn rate is how fast you are consuming the error budget relative to the rate that would exhaust it exactly at the window’s end. A burn rate of 1 means you will spend the whole month’s budget in a month; a burn rate of 14.4 means you would spend it in roughly two days. Compute the short-window failure ratio and divide by the allowed budget.

groups:
  - name: webhook_burn_rate
    rules:
      # Failure ratio over a short window divided by the 0.1% budget = burn rate.
      - record: webhook:burn_rate:5m
        expr: |
          (1 - (
            sum(rate(webhook_deliveries_total{outcome="acked"}[5m]))
            / sum(rate(webhook_deliveries_total[5m]))
          )) / 0.001
      - record: webhook:burn_rate:1h
        expr: |
          (1 - (
            sum(rate(webhook_deliveries_total{outcome="acked"}[1h]))
            / sum(rate(webhook_deliveries_total[1h]))
          )) / 0.001

The specific numbers people copy (14.4, 6, 3, 1) are not folklore; they fall out of one equation. A burn rate of r sustained for a window w consumes r × w / 30d of a 30-day budget. Solve for the pairs you care about: burning 2% of the budget in one hour needs r = 0.02 × 720h / 1h = 14.4; burning 5% in six hours needs r = 0.05 × 720h / 6h = 6; burning 10% in one day needs r = 0.10 × 720h / 24h = 3; burning 10% in three days needs r = 0.10 × 720h / 72h = 1. Choose the pairs by how much budget you are willing to lose before a human is involved, then let the arithmetic produce the threshold. This is also why the same thresholds serve both a 99.9% and a 99.95% target: only the budget divisor in the recording rules changes (0.001 becomes 0.0005); the multipliers do not.

Long window	Short window	Burn rate	Budget spent when it fires	Routing
1 hour	5 minutes	14.4	2%	Page immediately
6 hours	30 minutes	6	5%	Page immediately
1 day	2 hours	3	10%	Ticket, same business day
3 days	6 hours	1	10%	Ticket, next planning cycle

The short window in each row exists only to make recovery fast. Without it, a 1-hour rule that fired at minute 10 of an outage keeps firing for a full hour after the outage ends, because the window still contains the failures — an hour of a resolved incident sitting on someone’s pager is how people learn to ignore the pager. Requiring the short window to also be breaching means the alert clears within roughly the short window’s length of the fix landing. The conventional ratio is a short window one twelfth the length of the long one, which keeps the short window long enough to avoid single-scrape noise.
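The table's other windows need recorded series of their own before any alert can reference them. The sketch below follows the same pattern as the 5m and 1h rules and still assumes the 0.1% budget; the group name and the record names are assumptions chosen to match the existing naming scheme.

```yaml
groups:
  - name: webhook_burn_rate_extra_windows   # hypothetical group name
    rules:
      - record: webhook:burn_rate:30m
        expr: |
          (1 - (
            sum(rate(webhook_deliveries_total{outcome="acked"}[30m]))
            / sum(rate(webhook_deliveries_total[30m]))
          )) / 0.001
      - record: webhook:burn_rate:2h
        expr: |
          (1 - (
            sum(rate(webhook_deliveries_total{outcome="acked"}[2h]))
            / sum(rate(webhook_deliveries_total[2h]))
          )) / 0.001
      - record: webhook:burn_rate:6h
        expr: |
          (1 - (
            sum(rate(webhook_deliveries_total{outcome="acked"}[6h]))
            / sum(rate(webhook_deliveries_total[6h]))
          )) / 0.001
      - record: webhook:burn_rate:1d
        expr: |
          (1 - (
            sum(rate(webhook_deliveries_total{outcome="acked"}[1d]))
            / sum(rate(webhook_deliveries_total[1d]))
          )) / 0.001
      - record: webhook:burn_rate:3d
        expr: |
          (1 - (
            sum(rate(webhook_deliveries_total{outcome="acked"}[3d]))
            / sum(rate(webhook_deliveries_total[3d]))
          )) / 0.001
```

The 1d and 3d ranges read many samples on every evaluation, so giving this group a longer evaluation interval than the short-window group may be worth considering.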

Step 2: Add Multi-Window Burn-Rate Alerts

A single short window is twitchy; a single long window is slow. The multi-window technique pages only when a fast window and a slower confirmation window both exceed the threshold, which catches real outages quickly while suppressing one-off spikes. Use a high-burn pair for fast pages and a lower-burn pair for slow-bleed tickets.

Watching one incident play out makes the mechanism obvious. An endpoint starts returning 5xx twenty minutes in; the 5m window crosses the threshold almost immediately, but nothing pages until the 1h window has accumulated enough failures to agree, plus the two-minute for clause. A transient blip that resolves inside ten minutes never reaches that second condition and never wakes anyone.

Fifteen minutes of confirmation is the price of not paging on every transient 5xx spike.

  - name: webhook_slo_alerts
    rules:
      - alert: WebhookFastBudgetBurn
        expr: webhook:burn_rate:5m > 14.4 and webhook:burn_rate:1h > 14.4
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Webhook delivery burning error budget fast"
          description: "5m and 1h burn rate both exceed 14.4x; budget exhausts in ~2 days."

      - alert: WebhookSlowBudgetBurn
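        # Table row 3: a 1-day window confirmed by a 2-hour window at burn rate 3.
        # Assumes webhook:burn_rate:1d and webhook:burn_rate:2h are recorded with
        # the same expression as the 5m and 1h series in Step 1.
        expr: webhook:burn_rate:1d > 3 and webhook:burn_rate:2h > 3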
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Webhook delivery slowly missing SLO"
          description: "Sustained elevated failure ratio; investigate before budget depletes."
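The table's second row (a 6-hour window confirmed by 30 minutes at burn rate 6) has no rule in the group above. A sketch of what it might look like, assuming webhook:burn_rate:6h and webhook:burn_rate:30m are recorded with the same expression as the 5m and 1h series; the alert name and annotation wording are hypothetical.

```yaml
      - alert: WebhookMediumBudgetBurn   # hypothetical name
        expr: webhook:burn_rate:6h > 6 and webhook:burn_rate:30m > 6
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Webhook delivery burning error budget at 6x"
          description: "30m and 6h burn rate both exceed 6x; roughly 5% of the 30-day budget is spent."
```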

Two details in those rules deserve more than a glance. The for: 2m on the fast pair is not a safety margin against noise — the two-window and already provides that — it is protection against a single bad scrape. If one Prometheus scrape misses and rate() extrapolates oddly, the expression can be true for exactly one evaluation cycle, and for: 2m requires the condition to hold across several. Keep it short: every second added to for is a second added to detection time on a real outage, and past about five minutes you have quietly widened your incident by more than the alert saves you.

The second detail is the denominator. sum(rate(...)) over a five-minute window on an endpoint receiving three deliveries an hour is statistically meaningless: a single failure produces a failure ratio of 1.0 and a burn rate of 1000. Two mitigations work together. Guard the burn-rate rules with a minimum-volume clause so they only evaluate where the arithmetic is sound — appending and sum(rate(webhook_deliveries_total[5m])) > 0.05 requires roughly 15 deliveries in the window — and accept that low-volume endpoints simply are not covered by ratio alerts. They are covered by the absolute alerts in the next step, which is exactly why both families exist. A platform where 90% of endpoints are low volume and 10% carry 99% of traffic needs the absolute rules more than the burn-rate rules, even though the burn-rate rules are the ones tied to the published objective.
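Applied to the fast-burn alert, the minimum-volume guard might look like the sketch below. The 0.05 deliveries-per-second floor is the figure from the text; whether it suits a given platform is an assumption to check against real traffic.

```yaml
      - alert: WebhookFastBudgetBurn
        # The third clause restricts the alert to periods where the ratio is
        # meaningful: 0.05/s over 5m is about 15 deliveries.
        expr: |
          webhook:burn_rate:5m > 14.4
          and webhook:burn_rate:1h > 14.4
          and sum(rate(webhook_deliveries_total[5m])) > 0.05
        for: 2m
        labels:
          severity: page
```

All three operands are aggregated down to a single label-free series, so the and operators match them without an on() clause.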

Step 3: Add DLQ-Depth and Retry-Exhaustion Alerts

Burn-rate alerts can lag when traffic is low, so back them with direct queue-health alerts. A dead-letter queue that is growing means events are permanently failing; a spike in retry exhaustion means an endpoint is down and is about to flood the DLQ. If you also alert on slow-but-successful delivery, drive that rule off a pre-computed quantile series rather than an ad-hoc histogram_quantile in the alert expression — see tracking webhook delivery latency percentiles for why the aggregation order matters.

      - alert: WebhookDLQGrowing
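        # webhook_dlq_depth is a gauge, so measure growth with delta();
        # increase() is only meaningful for counters.
        expr: delta(webhook_dlq_depth[15m]) > 50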
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Dead-letter queue growing (endpoint label carries the destination)"
          description: "More than 50 events dead-lettered in 15m; consumer likely broken."

      - alert: WebhookRetriesExhausting
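        # rate() is per second, so > 1 means more than 600 exhausted events per
        # endpoint across the 10m window; scale the threshold to each tier's traffic.
        expr: sum(rate(webhook_retries_exhausted_total[10m])) by (endpoint) > 1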
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Retry budget exhausting (endpoint label carries the destination)"
          description: "Events are giving up after full backoff; endpoint may be down."

Prometheus can interpolate the firing series’ labels into these annotation strings using its rule templating syntax; the summaries above are written as static text so they read unambiguously here, and adding the label interpolation is a one-line change in your own rule files.
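With interpolation added, the retry-exhaustion rule might read as follows. $labels and $value are the variables available to alerting-rule templates, and humanize is one of the built-in template functions; the endpoint label is present here because the expression aggregates by it.

```yaml
      - alert: WebhookRetriesExhausting
        expr: sum(rate(webhook_retries_exhausted_total[10m])) by (endpoint) > 1
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Retry budget exhausting for {{ $labels.endpoint }}"
          description: "{{ $value | humanize }} events/s are giving up after full backoff; endpoint may be down."
```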

A third alert belongs in this group and is the one most teams discover the hard way: no deliveries at all. Every rule above is a ratio or a rate over deliveries. When the dispatcher stops dispatching, the ratios evaluate to NaN (zero divided by zero) and, once the series go stale, every expression returns no data. Neither outcome crosses a threshold, so a dispatcher that crash-loops, loses its database lease, or deadlocks on a connection pool produces total silence from the entire alerting stack. Guard it with an absolute floor on throughput, sized from your quietest hour rather than your average.

      - alert: WebhookDispatchStalled
        # "or vector(0)" keeps the rule evaluable when the series vanish entirely.
        expr: (sum(rate(webhook_deliveries_total[10m])) or vector(0)) < 0.2
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Webhook dispatch rate near zero"
          description: "Fewer than 12 deliveries per minute over 10m; dispatcher may be stalled or wedged."

      - alert: WebhookOldestEventAging
        expr: max(webhook_oldest_undelivered_age_seconds) > 900
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Oldest undelivered webhook event is over 15 minutes old"
          description: "Queue may be draining slower than intake, or one shard is stuck."

WebhookOldestEventAging is the complement to depth-based alerting and catches the case depth cannot. A dispatcher stalled while intake also stalled shows a perfectly flat queue depth, and a depth threshold will never trip; only the age of the oldest undelivered row keeps climbing. Emit that gauge from a cheap periodic query over the outbox, and set the threshold at roughly half your latency objective so the page arrives with time left to act.

Sizing the DLQ threshold requires the retry schedule, not intuition. If backoff is 1s, 4s, 16s, 64s, 256s, 1024s, with six retries after the initial attempt, an endpoint that goes hard-down at time zero produces its first dead-lettered event about 23 minutes later (1365 seconds of cumulative backoff), and events dead-letter thereafter at the rate they were originally created. For an endpoint receiving 200 events an hour that is 50 dead letters roughly 15 minutes after the first one appears, so the 15-minute growth threshold of 50 above is crossed about 38 minutes into a complete outage and, once the for: 5m clause is satisfied, pages at roughly 43 minutes. If that is too slow for your tier, do not lower the threshold blindly: lower it and simultaneously check what a routine deploy on the consumer side produces. Under this schedule a five-minute rolling restart dead-letters nothing at all, but a shorter schedule or a longer restart can, and a threshold below that noise floor turns every consumer deploy into a page.
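The timing in that paragraph can be written out step by step. This assumes all six listed delays are served before an event is dead-lettered, a constant 200 events per hour, and the for: 5m clause on WebhookDLQGrowing.

```latex
\begin{aligned}
t_{\text{first DLQ}} &= \sum_{k=0}^{5} 4^{k}\ \text{s} = \frac{4^{6}-1}{3}\ \text{s} = 1365\ \text{s} \approx 22.75\ \text{min} \\
t_{50\ \text{events}} &= \frac{50}{200/60\ \text{min}^{-1}} = 15\ \text{min} \\
t_{\text{threshold}} &\approx 22.75 + 15 = 37.75\ \text{min} \\
t_{\text{page}} &\approx 37.75 + 5 = 42.75\ \text{min}
\end{aligned}
```

Changing the schedule changes only the first line, so it is the one to recompute whenever the backoff configuration moves.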

Step 4: Route and Silence Alerts

Routing turns alerts into the right action. Send severity: page to the on-call pager and severity: ticket to a queue; route per-endpoint DLQ alerts to the team that owns that integration. Use inhibition so a broad outage page suppresses the dozens of per-endpoint alerts it would otherwise spawn. Every alert that reaches Alertmanager therefore passes through the same two questions before it becomes a notification.

Severity decides the channel and inhibition decides whether the notification is redundant — both run before anything reaches a phone.

route:
  receiver: tickets
  group_by: [alertname, endpoint]
  routes:
    - matchers: [severity="page"]
      receiver: oncall-pager
      group_wait: 30s
      repeat_interval: 4h
inhibit_rules:
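  # The burn-rate page is aggregated across endpoints and carries no endpoint
  # label, so there is no equal clause: while it fires, it mutes the
  # per-endpoint alerts listed as targets.
  - source_matchers: [alertname="WebhookFastBudgetBurn"]
    target_matchers:
      - alertname=~"WebhookDLQGrowing|WebhookRetriesExhausting"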
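Per-endpoint ownership can be expressed as a more specific child route placed ahead of the general page route, since the first matching child wins. In this sketch the endpoint value and the team receiver name are hypothetical, and every receiver named here would need a matching entry under receivers.

```yaml
route:
  receiver: tickets
  group_by: [alertname, endpoint]
  routes:
    # Hypothetical: pages for one integration go to the team that owns it.
    - matchers: [severity="page", endpoint="billing-partner"]
      receiver: billing-integrations-pager
    # Everything else that pages goes to the central rotation.
    - matchers: [severity="page"]
      receiver: oncall-pager
      group_wait: 30s
      repeat_interval: 4h
```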

Two routing decisions matter more than the config that expresses them. First, every severity: page rule needs a runbook_url annotation pointing at a document that names the three most likely causes and the command to check each. An engineer woken at 03:00 by “webhook delivery burning error budget fast” has to reconstruct the entire mental model before touching anything; the runbook link is what turns a 40-minute incident into a 10-minute one, and its absence is the single most common reason a technically correct alert produces a bad outcome. Second, set repeat_interval deliberately. Four hours is right for a page: long enough that a person working the incident is not re-notified every fifteen minutes, short enough that an alert acknowledged and then forgotten resurfaces before the next shift. Tickets should not repeat at all — they should be grouped, because an integration failing for three days should produce one ticket with a growing count, not seventy-two.
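Adding the runbook link is a single extra annotation. A sketch on the fast-burn rule, where the URL is a placeholder rather than a real location:

```yaml
      - alert: WebhookFastBudgetBurn
        expr: webhook:burn_rate:5m > 14.4 and webhook:burn_rate:1h > 14.4
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Webhook delivery burning error budget fast"
          description: "5m and 1h burn rate both exceed 14.4x; budget exhausts in ~2 days."
          # Placeholder URL: point this at your own runbook.
          runbook_url: "https://runbooks.example.com/webhooks/fast-budget-burn"
```

If the rule is covered by a promtool unit test, the expected annotations in that test likely need the new key as well, because promtool compares annotations along with labels.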

Tuning Thresholds Against Your Own Incident History

The thresholds above are defensible starting points, not answers. The way to turn them into answers is to replay history: export twelve months of delivery metrics, run each candidate rule against them offline, and record two numbers per rule — how long after the true incident start it would have fired, and how many times it would have fired when nothing was wrong. That converts an argument about whether 50 dead letters is the right number into a measurement, and it usually reveals that the rules disagree productively rather than redundantly.

Each rule is kept because of its blind-spot column, not its detection column — the set is chosen so the blind spots do not overlap.

Read that table as a coverage argument. The fast burn-rate rule is the quickest detector but is blind wherever traffic is too thin for a ratio to mean anything. The dead-letter rule is slow — it cannot fire until the retry budget has actually been exhausted — but it is the only rule that proves events were permanently lost rather than merely delayed. The dispatch-rate floor almost never fires, and the one time a year it does is the time every other rule is silent. Delete any rule whose blind spot is already covered by a faster rule, and keep every rule that is the sole detector of some failure shape, even if it fires twice a year.

A page budget makes the trade-off concrete. If the on-call rotation can absorb roughly three pages a month before people start pre-emptively acknowledging without reading, the four rules above at 3.0 pages a month are at the limit, and adding a fifth means removing or demoting one. That is a real constraint and it should be applied before the rule is merged, not after the team has quietly built a habit of ignoring one of them.

Verification and Testing

Validate alert logic with Prometheus’s promtool test rules, which runs sample series through your rules and asserts which alerts fire — no real outage required:

# alert_tests.yml
rule_files: [webhook_burn_rate.yml, webhook_slo_alerts.yml]
tests:
  - interval: 1m
    input_series:
      - series: 'webhook_deliveries_total{outcome="acked"}'
        values: '0+0x60'   # nothing acked = 100% failure
      - series: 'webhook_deliveries_total'
        values: '0+100x60'
    alert_rule_test:
      - eval_time: 5m
        alertname: WebhookFastBudgetBurn
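        exp_alerts:
          # promtool compares annotations as well as labels, so list both.
          - exp_labels: { severity: page }
            exp_annotations:
              summary: "Webhook delivery burning error budget fast"
              description: "5m and 1h burn rate both exceed 14.4x; budget exhausts in ~2 days."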

Run it in CI with promtool test rules alert_tests.yml, and do a quarterly live fire drill by pausing a test endpoint to confirm the page actually reaches the on-call phone.

Failure Modes and Gotchas

Single-window alerts that flap. A lone 5m rule pages on every transient 500. Always require a confirming longer window before paging.
Burn-rate blind spots at low volume. With few deliveries, ratios are noisy and slow to trip. DLQ-depth and retry-exhaustion alerts cover that gap; keep both.
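No inhibition between broad and narrow alerts. A platform outage pages once for the SLO and once per endpoint, burying the signal. Add inhibition rules so the platform-wide page mutes the per-endpoint alerts it implies, and use an equal clause on endpoint only when both the source and the target alert carry that label.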
Thresholds divorced from the retry budget. A DLQ-growth threshold tuned without knowing the backoff schedule fires too early or too late. Set it relative to the configured exponential backoff attempts.
Rules that evaluate to no data during a total outage. Every ratio-based expression silently stops firing when the denominator disappears, so the worst failure produces the quietest alerting stack. Pair each ratio rule with an absolute throughput floor and treat absent() on your core series as pageable in its own right.
A stale gauge read as a healthy gauge. If the exporter that publishes dead-letter depth dies, the last scraped value persists for the staleness window and then vanishes — either way the threshold stops being crossed. Alert on the gauge’s own freshness, not only on its value.
Silences that outlive the incident. A four-hour silence created during a deploy and never revoked is indistinguishable from working alerting until the next outage. Cap silence duration at one shift by policy and review active silences at handover.
Alerts that fire before the automatic remedy has had a chance. If circuit breakers already isolate a failing endpoint within two minutes, an alert with a one-minute for clause pages for something the system fixes on its own. Set the for clause longer than the automated recovery it overlaps with.
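The no-data and stale-gauge gotchas above can both be covered with absence checks. A minimal sketch, in which the alert names and the job label value are hypothetical; absent() and the up series are standard Prometheus features.

```yaml
      - alert: WebhookDeliveryMetricsAbsent   # hypothetical name
        expr: absent(webhook_deliveries_total)
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "No webhook delivery series are being scraped"
          description: "webhook_deliveries_total has vanished; ratio alerts are blind until it returns."

      - alert: WebhookDLQGaugeMissing   # hypothetical name
        # Fires when the gauge disappears or its exporter target is down.
        expr: absent(webhook_dlq_depth) or up{job="webhook-dlq-exporter"} == 0
        for: 5m
        labels:
          severity: ticket
        annotations:
          summary: "Dead-letter depth gauge is missing or its exporter is down"
          description: "DLQ alerts cannot fire while the gauge is absent."
```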

Frequently Asked Questions

Where do the burn-rate numbers 14.4 and 6 actually come from?

They are solutions to a single equation: a burn rate of r sustained for a window w consumes r × w / 30d of a 30-day budget. Two percent of the budget in one hour needs a rate of 14.4, and five percent in six hours needs 6.

Because only the divisor changes when your target changes, the same multipliers work unmodified whether you are defending 99.9% or 99.95%.

What should page an on-call engineer for an endpoint that receives three events an hour?

Not a ratio. At that volume one failure produces a failure ratio of 1.0 and, against a 0.1% budget, a burn rate of 1000, so any ratio-based rule is pure noise. Guard the burn-rate rules with a minimum-volume clause and let low-traffic endpoints be covered by absolute rules instead.

Dead-letter growth and retry exhaustion both work on counts rather than proportions, which makes them the correct instruments at low volume even though they detect more slowly.

Why did nothing page when the dispatcher was completely down?

Because every rule was a ratio and the denominator went to zero. An expression that returns no data cannot cross a threshold, so a wedged or crash-looping dispatcher produces total silence from an alerting stack that otherwise looks well designed.

Add an absolute floor on delivery throughput sized from your quietest hour, and an alert on the age of the oldest undelivered event so a stalled queue is visible even when depth is flat.

How long should the for clause on a burn-rate alert be?

Just long enough to survive one bad scrape, which usually means two to five minutes. The two-window condition is already doing the noise suppression, so a long for clause adds detection delay without adding precision.

The exception is an alert that overlaps an automated remedy such as a circuit breaker; there the for clause should exceed the time the automation needs, or you page for something the system resolves without you.

How do I decide whether a proposed alert is worth adding?

Replay it against a year of stored metrics and record two numbers: how long after each real incident began it would have fired, and how often it would have fired with nothing wrong. Then ask which failure shape it is the only detector for.

If a faster existing rule already covers that shape, the new rule is redundant regardless of how sensible it looks; if it is the sole detector of something, keep it even at one firing a year.

Should dead-letter alerts page or open a ticket?

It depends on whether the events are recoverable. If your dead-letter store retains payloads and supports replay, growth is urgent but not irreversible, and a ticket with a same-day service level is usually right. If the store expires entries or the events are time-sensitive, the deadline is real and it should page.

Route by the owning team rather than to a central rotation where possible, since a single broken integration is almost always fixed by the people who own that integration.

How many pages a month is too many for one rotation?

Roughly three is the practical ceiling before engineers start acknowledging without reading, which destroys the value of every rule at once. Treat the page budget as a fixed resource: adding a fifth rule to a set that already spends the budget means demoting or deleting one of the four.

Tickets and dashboard-only signals have no such ceiling, so most new detection work belongs there rather than on the pager.