Defining SLOs for Webhook Delivery: SLIs, Targets, and Error Budgets
Promising a customer that “webhooks are reliable” means nothing until you can express reliability as a number you measure and defend — which is the gap this guide closes, building on the signals established in Webhook Observability & Monitoring. The specific scenario here is a webhook product that integrators depend on for time-sensitive flows (payment notifications, provisioning, sync), where you must publish a service level objective and hold yourself to it. We will choose two service level indicators — delivery success ratio and end-to-end latency — set defensible targets, and turn the gap between target and reality into an error budget that governs how aggressively you ship. The indicators come straight from the spans and metrics you wired up in instrumenting webhooks with OpenTelemetry, and a depleting budget is what triggers alerting on webhook delivery failures.
Prerequisites
- Delivery metrics already emitted: an attempts counter labeled by outcome and a latency histogram measured from event creation, as built in the observability overview.
- A Prometheus-compatible store with recording-rule support, or an equivalent SLO platform.
- Agreement with stakeholders on the measurement window (28–30 days rolling is typical) before publishing any number.
- An understanding of delivery guarantee levels, because an at-least-once system’s “success” means eventual acknowledgement, not first-attempt success.
Step 1: Choose Service Level Indicators
Two indicators capture what integrators actually feel. The delivery success ratio is the share of events that reach a terminal acknowledged state within their retry budget, not the share of first attempts that succeed — a webhook that succeeds on retry 3 is a success from the customer’s perspective. The end-to-end latency is the time from event creation to terminal acknowledgement, reported at p95 (and p99 for stricter tiers). Measuring latency from creation rather than first dispatch is essential: events that age in the outbox before the first attempt are the ones customers complain about.
Define each SLI as a precise sentence with a clear good-event definition. For example: “The proportion of events created in the window that were acknowledged with a 2xx within 15 minutes.” That sentence is the contract; everything else derives from it.
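One way to keep the sentence and the measurement in agreement is to write the good-event definition as a predicate. The following is a minimal sketch, assuming each event record carries a creation timestamp and an optional acknowledgement timestamp; the names `is_good_event`, `success_sli`, and `ACK_DEADLINE` are illustrative and do not come from any existing codebase.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Tuple

# Assumed deadline taken from the example SLI sentence (15 minutes).
ACK_DEADLINE = timedelta(minutes=15)


def is_good_event(
    created_at: datetime,
    acked_at: Optional[datetime],
    deadline: timedelta = ACK_DEADLINE,
) -> bool:
    """True if the event was acknowledged (2xx) within the deadline of its creation."""
    if acked_at is None:
        # Never acknowledged: retries exhausted or still pending.
        return False
    return acked_at - created_at <= deadline


def success_sli(events: Iterable[Tuple[datetime, Optional[datetime]]]) -> Optional[float]:
    """Proportion of good events; None when the window holds no events."""
    events = list(events)
    if not events:
        return None
    good = sum(is_good_event(created, acked) for created, acked in events)
    return good / len(events)
```

Returning `None` for an empty window is a design choice in this sketch: an empty window is neither a success nor a failure, and reporting 0 or 1 would distort the dashboard.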
Step 2: Set Realistic SLO Targets
Do not invent targets; read them off your history. Compute the last quarter’s success ratio and latency distribution, then set the SLO slightly looser than your steady-state performance so that normal incidents fit inside the budget. A 99.9% success target over 30 days allows roughly 43 minutes of total failure if expressed as time (0.1% of 43,200 minutes), or 0.1% of events if expressed as a count. Tier targets by importance, for example 99.95% for payment events and 99.5% for low-stakes notifications, so you are not paying for reliability nobody needs.
| SLI | Good-event definition | Example target | Window |
|---|---|---|---|
| Delivery success ratio | Event acknowledged with 2xx within retry budget | 99.9% | 30 days rolling |
| End-to-end latency (p95) | Creation to ack under 15 minutes | 95% of events | 30 days rolling |
| End-to-end latency (p99) | Creation to ack under 60 minutes | 99% of events | 30 days rolling |
| First-attempt success ratio | Acknowledged on attempt 1 | 99.0% | 7 days rolling |
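The budget arithmetic behind these targets is simple enough to check by hand, and a short sketch makes the tiers easy to compare. The helper names below are hypothetical. Targets are passed as strings so that `Fraction` can represent them exactly and avoid floating-point drift in the counts.

```python
from fractions import Fraction


def error_budget(target: str) -> Fraction:
    """Allowed failure fraction for a target such as "0.999"."""
    return 1 - Fraction(target)


def budget_minutes(target: str, window_days: int = 30) -> float:
    """Error budget expressed as minutes of total failure over the window."""
    return float(error_budget(target) * window_days * 24 * 60)


def allowed_bad_events(target: str, total_events: int) -> int:
    """Error budget expressed as a count of events, rounded down."""
    return int(error_budget(target) * total_events)


if __name__ == "__main__":
    # Tiers taken from the targets discussed in the text.
    for tier, target in [("payments", "0.9995"), ("default", "0.999"), ("low-stakes", "0.995")]:
        print(tier, budget_minutes(target), allowed_bad_events(target, 1_000_000))
```

For the 99.9% tier this gives 43.2 minutes over 30 days, or 1,000 failed events per million.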
Step 3: Encode the SLIs as Recording Rules
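Express each indicator as a Prometheus recording rule so dashboards and alerts read a pre-computed series rather than recomputing heavy queries. The success ratio divides acknowledged deliveries by total deliveries over the window. For that to match the per-event SLI from Step 1, webhook_deliveries_total must be incremented once per event when it reaches a terminal outcome; if it is incremented on every attempt, retries inflate the denominator and the ratio understates delivery success.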
```yaml
groups:
  - name: webhook_slo
    interval: 1m
    rules:
      # Rolling 30d delivery success ratio (acked / attempts).
      - record: webhook:delivery_success_ratio:30d
        expr: |
          sum(rate(webhook_deliveries_total{outcome="acked"}[30d]))
          /
          sum(rate(webhook_deliveries_total[30d]))
      # Fraction of events whose end-to-end latency was under 900s (15m).
      - record: webhook:latency_under_15m_ratio:30d
        expr: |
          sum(rate(webhook_delivery_latency_seconds_bucket{le="900"}[30d]))
          /
          sum(rate(webhook_delivery_latency_seconds_count[30d]))
```
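The latency rule relies on a property of classic Prometheus histograms: each `le` bucket is cumulative, so the fraction under a threshold can be read directly only when that threshold is one of the configured bucket bounds. The sketch below illustrates the same calculation outside Prometheus, with made-up bucket counts and a hypothetical helper name. The exact `le` label string is also worth checking against what your client library emits (some emit `900`, others `900.0`).

```python
import math
from typing import Mapping


def fraction_under(cumulative: Mapping[float, int], threshold: float) -> float:
    """Fraction of observations at or below `threshold`.

    `cumulative` maps each bucket upper bound (the `le` label) to its
    cumulative count and must include math.inf as the total.
    """
    if threshold not in cumulative:
        # Without an exact bucket bound the ratio can only be estimated.
        raise KeyError(f"no bucket with le={threshold}; add it to the histogram bounds")
    total = cumulative[math.inf]
    if total == 0:
        raise ValueError("no observations in window")
    return cumulative[threshold] / total


# Illustrative data only: bounds in seconds, cumulative counts.
example_buckets = {60: 700, 300: 900, 900: 960, 3600: 995, math.inf: 1000}
```

With these example counts, `fraction_under(example_buckets, 900)` yields 0.96, which would meet the 95% objective for the 15-minute threshold.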
Step 4: Compute the Error Budget
The error budget is 1 - SLO_target, and budget remaining is how much of that allowance you have left. Encode it so you can show “67% of the monthly budget remains” on a dashboard and gate releases on it.
```yaml
      # Error budget remaining as a fraction of the allowed failure budget.
      # target = 0.999, so allowed failure = 0.001.
      - record: webhook:error_budget_remaining:30d
        expr: |
          1 - (
            (1 - webhook:delivery_success_ratio:30d)
            /
            (1 - 0.999)
          )
```
When webhook:error_budget_remaining:30d approaches zero, you are about to miss the SLO; that is the signal to freeze risky changes and prioritize reliability work. A healthy budget, by contrast, is permission to ship faster.
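The same formula can drive a release gate. The following is a minimal sketch, not a prescribed policy: the function names and the 10% freeze threshold are assumptions chosen for illustration, and a real gate would read the recorded series rather than take a ratio as an argument.

```python
def error_budget_remaining(success_ratio: float, target: float = 0.999) -> float:
    """Fraction of the error budget still unspent; negative once the SLO is missed."""
    allowed_failure = 1 - target
    if allowed_failure <= 0:
        raise ValueError("target must be below 1.0 to leave any budget")
    observed_failure = 1 - success_ratio
    return 1 - observed_failure / allowed_failure


def release_policy(remaining: float, freeze_below: float = 0.10) -> str:
    """Map budget remaining to a coarse decision (threshold is an assumed example)."""
    if remaining <= 0:
        return "slo_missed"
    if remaining < freeze_below:
        return "freeze"
    return "ship"
```

For instance, an observed success ratio of 0.99967 against a 0.999 target leaves about 67% of the budget, which this sketch maps to `"ship"`.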
Verification and Testing
Backtest the rules against historical data before publishing the SLO. Load a representative window into a scratch Prometheus and confirm the computed success ratio matches an independent count from your event store. A simple assertion in CI guards against label drift breaking the math:
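```python
def test_success_ratio_matches_event_store(prom, event_store):
    # `prom` and `event_store` are assumed pytest fixtures; `prom.query`
    # is assumed to return the rule's current value as a float.
    actual = prom.query("webhook:delivery_success_ratio:30d")
    acked = event_store.count(status="acked", window_days=30)
    total = event_store.count(window_days=30)
    assert total > 0, "no events in window; ratio is undefined"
    expected = acked / total
    # Allow small tolerance for scrape timing differences.
    assert abs(actual - expected) < 0.0005
```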
You can also spot-check latency on the wire by timing a synthetic event from creation to the acknowledgement log line and confirming it lands inside the 15-minute objective.
Failure Modes and Gotchas
- Measuring success per attempt, not per event. An at-least-once system retries; counting first attempts makes a healthy pipeline look broken. Define the good event as terminal acknowledgement within the retry budget.
- Latency from dispatch instead of creation. Events stuck in the outbox are invisible if you start the clock at first dispatch. Always measure from creation time.
- Targets pulled from thin air. An aspirational 99.99% you have never hit produces a permanently exhausted budget and alert fatigue. Derive targets from history.
- One global SLO for mixed traffic. Bursty low-priority events drag down the number that matters for payment events. Tier SLOs by event class so the budget reflects real customer impact, in line with message ordering guarantees for ordered streams.
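The first gotcha is easiest to see with numbers. The sketch below uses a small, invented attempt log (an event ID paired with whether that attempt was acknowledged) to compare the two ways of counting; the function names are illustrative.

```python
from typing import Dict, List, Tuple

Attempt = Tuple[str, bool]  # (event_id, acknowledged_with_2xx)


def per_attempt_ratio(attempts: List[Attempt]) -> float:
    """Share of individual attempts that succeeded (penalizes every retry)."""
    return sum(ok for _, ok in attempts) / len(attempts)


def per_event_ratio(attempts: List[Attempt]) -> float:
    """Share of events that were eventually acknowledged on any attempt."""
    acked: Dict[str, bool] = {}
    for event_id, ok in attempts:
        acked[event_id] = acked.get(event_id, False) or ok
    return sum(acked.values()) / len(acked)


# Invented log: e1 succeeds on its third attempt, e4 exhausts its retries.
attempt_log: List[Attempt] = [
    ("e1", False), ("e1", False), ("e1", True),
    ("e2", True),
    ("e3", True),
    ("e4", False), ("e4", False), ("e4", False),
]
```

On this log the per-attempt ratio is 0.375 while the per-event ratio is 0.75: three of the four events were delivered, yet the per-attempt view suggests most of the pipeline is failing.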