Defining SLOs for Webhook Delivery: SLIs, Targets, and Error Budgets

Promising a customer that “webhooks are reliable” means nothing until you can express reliability as a number you measure and defend — which is the gap this guide closes, building on the signals established in Webhook Observability & Monitoring. The specific scenario here is a webhook product that integrators depend on for time-sensitive flows (payment notifications, provisioning, sync), where you must publish a service level objective and hold yourself to it. We will choose two service level indicators — delivery success ratio and end-to-end latency — set defensible targets, and turn the gap between target and reality into an error budget that governs how aggressively you ship. The indicators come straight from the spans and metrics you wired up in instrumenting webhooks with OpenTelemetry, and a depleting budget is what triggers alerting on webhook delivery failures.

[Figure: Error budget over a rolling window. A budget bar depletes as failed deliveries accumulate against a 99.9% target (example: 67% remaining, 33% consumed). SLO target: 99.9% success over 30 days, so 0.1% of attempts may fail. Two SLIs: success ratio = acked / attempts; latency = p95 from event creation to ack. Budget burn drives release and alerting decisions.]
The error budget is the allowable failure share under the SLO target; burning it fast is the signal to slow down and investigate.

Prerequisites

Step 1: Choose Service Level Indicators

Two indicators capture what integrators actually feel. The delivery success ratio is the share of events that reach a terminal acknowledged state within their retry budget, not the share of first attempts that succeed — a webhook that succeeds on retry 3 is a success from the customer’s perspective. The end-to-end latency is the time from event creation to terminal acknowledgement, reported at p95 (and p99 for stricter tiers). Measuring latency from creation rather than first dispatch is essential: events that age in the outbox before the first attempt are the ones customers complain about.

Define each SLI as a precise sentence with a clear good-event definition. For example: “The proportion of events created in the window that were acknowledged with a 2xx within 15 minutes.” That sentence is the contract; everything else derives from it.
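
As a minimal sketch of how that example sentence becomes a computation, assume each event record carries a creation timestamp and an optional timestamp for its terminal 2xx acknowledgement. The `Event` type and its field names are hypothetical, chosen for illustration rather than taken from any schema described here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; field names are illustrative."""
    created_at: datetime
    acked_at: Optional[datetime]  # time of the terminal 2xx ack, or None


def is_good(event: Event, deadline: timedelta = timedelta(minutes=15)) -> bool:
    """Good event: acknowledged with a 2xx within `deadline` of creation."""
    return (
        event.acked_at is not None
        and event.acked_at - event.created_at <= deadline
    )


def success_ratio(events: Iterable[Event]) -> float:
    """Share of events in the window that count as good."""
    events = list(events)
    if not events:
        raise ValueError("no events in window; ratio is undefined")
    return sum(is_good(e) for e in events) / len(events)
```

Keeping the good-event test in one small function means the deadline appears in exactly one place, which makes it easier to keep dashboards, alerts, and backtests on the same definition.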

Step 2: Set Realistic SLO Targets

Do not invent targets; read them off your history. Compute the last quarter’s success ratio and latency distribution, then set the SLO slightly looser than your steady-state performance, so that normal incidents fit inside the budget, but tight enough that a real regression shows up as budget burn. A 99.9% success target over 30 days allows roughly 43 minutes of total failure if the budget is expressed as time (0.1% of 43,200 minutes), or 0.1% of attempts if expressed as a count. Tier targets by importance (for example, 99.95% for payment events and 99.5% for low-stakes notifications) so you are not paying for reliability nobody needs.
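
The budget arithmetic can be checked with a short sketch. The helper name and the tier labels below are illustrative assumptions; the percentages are the ones quoted above.

```python
def error_budget(target: float, window_days: int = 30) -> dict:
    """Return the allowed failure share and its time equivalent for a target."""
    allowed_failure = 1.0 - target
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return {
        "allowed_failure_ratio": allowed_failure,
        "budget_minutes": allowed_failure * window_minutes,
    }


# Tier names are hypothetical labels for the targets discussed in the text.
for tier, target in [("payments", 0.9995), ("default", 0.999), ("notifications", 0.995)]:
    budget = error_budget(target)
    print(f"{tier}: {budget['budget_minutes']:.1f} min per 30 days")
```

Under these targets the time-based budgets come to about 21.6, 43.2, and 216 minutes per 30 days, which shows how quickly the allowance shrinks as another "nine" is added.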

| SLI | Good-event definition | Example target | Window |
|---|---|---|---|
| Delivery success ratio | Event acknowledged with 2xx within retry budget | 99.9% | 30 days rolling |
| End-to-end latency (p95) | Creation to ack under 15 minutes | 95% of events | 30 days rolling |
| End-to-end latency (p99) | Creation to ack under 60 minutes | 99% of events | 30 days rolling |
| First-attempt success ratio | Acknowledged on attempt 1 | 99.0% | 7 days rolling |

Step 3: Encode the SLIs as Recording Rules

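Express each indicator as a Prometheus recording rule so dashboards and alerts read a pre-computed series rather than recomputing heavy queries. The success ratio divides acknowledged deliveries by total deliveries over the window. This matches the Step 1 definition only if webhook_deliveries_total records one terminal outcome per event; if it is incremented on every attempt, retries inflate the denominator and the ratio understates per-event success.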

groups:
  - name: webhook_slo
    interval: 1m
    rules:
      # Rolling 30d delivery success ratio (acked / attempts).
      - record: webhook:delivery_success_ratio:30d
        expr: |
          sum(rate(webhook_deliveries_total{outcome="acked"}[30d]))
          /
          sum(rate(webhook_deliveries_total[30d]))

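      # Fraction of events whose end-to-end latency was under 900s (15m).
      # Needs a histogram bucket boundary at exactly 900s; some client
      # libraries render the label as le="900.0", so match what is exposed.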
      - record: webhook:latency_under_15m_ratio:30d
        expr: |
          sum(rate(webhook_delivery_latency_seconds_bucket{le="900"}[30d]))
          /
          sum(rate(webhook_delivery_latency_seconds_count[30d]))
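
The latency rule works because Prometheus histogram buckets are cumulative: the series for a given `le` counts every observation at or below that bound, and the `+Inf` bucket equals the total count. A minimal sketch of the same division, using made-up bucket counts and a hypothetical helper name:

```python
from typing import Mapping


def fraction_under(cumulative_buckets: Mapping[float, int], threshold: float) -> float:
    """Fraction of observations at or below `threshold`.

    `cumulative_buckets` maps each upper bound (`le`) to the cumulative count
    of observations at or below it; float("inf") holds the total count.
    """
    if threshold not in cumulative_buckets:
        raise KeyError(f"no bucket boundary at {threshold}; add one to the histogram")
    total = cumulative_buckets[float("inf")]
    if total == 0:
        raise ValueError("no observations in window")
    return cumulative_buckets[threshold] / total


# Illustrative counts only, not measured data.
buckets = {60.0: 700, 300.0: 900, 900.0: 960, 3600.0: 995, float("inf"): 1000}
print(fraction_under(buckets, 900.0))
```

The explicit `KeyError` mirrors a real constraint of the rule: if the histogram has no boundary at 900 seconds, the `le="900"` selector matches nothing and the recorded series is empty rather than approximate.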

Step 4: Compute the Error Budget

The error budget is 1 - SLO_target, and budget remaining is how much of that allowance you have left. Encode it so you can show “67% of the monthly budget remains” on a dashboard and gate releases on it.

      # Error budget remaining as a fraction of the allowed failure budget.
      # target = 0.999, so allowed failure = 0.001.
      - record: webhook:error_budget_remaining:30d
        expr: |
          1 - (
            (1 - webhook:delivery_success_ratio:30d)
            /
            (1 - 0.999)
          )

When webhook:error_budget_remaining:30d approaches zero, you are about to miss the SLO; that is the signal to freeze risky changes and prioritize reliability work. A healthy budget, by contrast, is permission to ship faster.
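
The same formula can be mirrored outside Prometheus, for instance in a deploy pipeline that gates releases on the budget. This is a sketch under stated assumptions: the function names and the 10% freeze threshold are hypothetical policy choices, not something the recording rule prescribes.

```python
def budget_remaining(success_ratio: float, target: float = 0.999) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return 1.0 - (1.0 - success_ratio) / (1.0 - target)


def release_allowed(
    success_ratio: float,
    target: float = 0.999,
    freeze_below: float = 0.10,  # hypothetical policy: freeze under 10% remaining
) -> bool:
    """Allow risky releases only while enough budget remains."""
    return budget_remaining(success_ratio, target) >= freeze_below


# A 99.967% success ratio against a 99.9% target leaves about 67% of the budget.
print(round(budget_remaining(0.99967), 2))
```

Note that the value goes negative once the SLO is missed, which is useful on a dashboard because it shows how far over budget the window is rather than clamping at zero.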

Verification and Testing

Backtest the rules against historical data before publishing the SLO. Load a representative window into a scratch Prometheus and confirm the computed success ratio matches an independent count from your event store. A simple assertion in CI guards against label drift breaking the math:

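def test_success_ratio_matches_event_store(prom, event_store):
    # `prom` and `event_store` are test fixtures; prom.query is assumed
    # to return the current value of the recorded series as a float.
    measured = prom.query("webhook:delivery_success_ratio:30d")
    acked = event_store.count(status="acked", window_days=30)
    total = event_store.count(window_days=30)
    assert total > 0, "no events in window; ratio is undefined"
    expected = acked / total
    # Allow small tolerance for scrape timing differences.
    assert abs(measured - expected) < 0.0005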

You can also spot-check latency on the wire by timing a synthetic event from creation to the acknowledgement log line and confirming it lands inside the 15-minute objective.
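
One way to structure such a spot check is a small polling probe. The sketch below assumes you supply two callables of your own: one that creates a synthetic event and returns its ID, and one that reports whether the acknowledgement for that ID has been observed (for example, by searching the log). Both are hypothetical hooks, and the measured latency is only as precise as the polling interval.

```python
import time
from typing import Callable, Optional


def probe_latency(
    create_event: Callable[[], str],
    is_acked: Callable[[str], bool],
    objective_s: float = 900.0,  # 15-minute objective
    poll_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Create a synthetic event and return seconds until its ack is observed.

    Returns None if no ack is seen within `objective_s`.
    """
    start = clock()
    event_id = create_event()
    while clock() - start <= objective_s:
        if is_acked(event_id):
            return clock() - start
        sleep(poll_s)
    return None
```

Passing `clock` and `sleep` as parameters keeps the probe testable with a fake clock, so the check itself can run in CI without waiting fifteen minutes.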

Failure Modes and Gotchas