Defining SLOs for Webhook Delivery: SLIs, Targets, and Error Budgets

Promising a customer that “webhooks are reliable” means nothing until you can express reliability as a number you measure and defend — which is the gap this guide closes, building on the signals established in Webhook Observability & Monitoring. The specific scenario here is a webhook product that integrators depend on for time-sensitive flows (payment notifications, provisioning, sync), where you must publish a service level objective and hold yourself to it. We will choose two service level indicators — delivery success ratio and end-to-end latency — set defensible targets, and turn the gap between target and reality into an error budget that governs how aggressively you ship. The indicators come straight from the spans and metrics you wired up in instrumenting webhooks with OpenTelemetry, and a depleting budget is what triggers alerting on webhook delivery failures.

The error budget is the allowable failure share under the SLO target; burning it fast is the signal to slow down and investigate.

Prerequisites

Delivery metrics already emitted: an attempts counter labeled by outcome and a latency histogram measured from event creation, as built in the observability overview.
A Prometheus-compatible store with recording-rule support, or an equivalent SLO platform.
Agreement with stakeholders on the measurement window (28–30 days rolling is typical) before publishing any number.
An understanding of delivery guarantee levels, because an at-least-once system’s “success” means eventual acknowledgement, not first-attempt success.

Step 1: Choose Service Level Indicators

Two indicators capture what integrators actually feel. The delivery success ratio is the share of events that reach a terminal acknowledged state within their retry budget, not the share of first attempts that succeed — a webhook that succeeds on retry 3 is a success from the customer’s perspective. The end-to-end latency is the time from event creation to terminal acknowledgement, reported at p95 (and p99 for stricter tiers). Measuring latency from creation rather than first dispatch is essential: events that age in the outbox before the first attempt are the ones customers complain about. Reporting that distribution correctly — bucket boundaries, quantile error, aggregation across endpoints — is its own discipline, covered in tracking webhook delivery latency percentiles.

Starting the clock at first dispatch would have reported 34 seconds for an event the integrator waited 42 seconds to receive.

Define each SLI as a precise sentence with a clear good-event definition. For example: “The proportion of events created in the window that were acknowledged with a 2xx within 15 minutes.” That sentence is the contract; everything else derives from it.

Read that sentence again with an adversarial eye, because three phrases in it decide numbers that will later be argued about in a review meeting. “Created” fixes the denominator as events, not attempts: an event that took six attempts contributes exactly one unit to both numerator and denominator, so a retry storm cannot dilute the ratio into looking healthy. “Acknowledged” fixes the success condition at the consumer’s response rather than at your dispatcher’s belief, which matters because the response is the only evidence of receipt you can actually observe. And “within 15 minutes” makes the indicator a joint success-and-latency measure: an event delivered after two hours is counted as a failure even though it eventually arrived. That last choice is the one teams most often get wrong by omission, and omitting it produces the pathological result that a pipeline delivering everything a day late scores 100%.
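The good-event definition can be sketched as a small classifier. This is an illustration only: the `EventRecord` shape and its field names are assumptions made for the example, not something the pipeline described here is known to expose.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional

DEADLINE = timedelta(minutes=15)  # taken from the example SLI sentence


@dataclass(frozen=True)
class EventRecord:
    """One row per event, not per attempt (hypothetical shape)."""
    created_at: datetime
    acked_at: Optional[datetime]  # None if no 2xx was ever received
    attempts: int                 # informational only; never enters the ratio


def is_good(event: EventRecord, deadline: timedelta = DEADLINE) -> bool:
    """Good event: acknowledged, and within the deadline measured from creation."""
    if event.acked_at is None:
        return False
    return event.acked_at - event.created_at <= deadline


def success_ratio(events: Iterable[EventRecord]) -> float:
    """Share of created events that were good; each event counts exactly once."""
    events = list(events)
    if not events:
        raise ValueError("no events in the window; the ratio is undefined")
    return sum(is_good(e) for e in events) / len(events)


t0 = datetime(2024, 1, 1, 12, 0)
sample = [
    EventRecord(t0, t0 + timedelta(seconds=42), attempts=1),  # good
    EventRecord(t0, t0 + timedelta(minutes=9), attempts=6),   # good despite six attempts
    EventRecord(t0, t0 + timedelta(hours=2), attempts=4),     # arrived, but too late
    EventRecord(t0, None, attempts=8),                        # never acknowledged
]
print(success_ratio(sample))  # 0.5
```

The attempt count is carried along but deliberately ignored, which is what keeps a retry storm from moving the ratio.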

The denominator also needs an explicit position on events that were never dispatched at all. An event dropped by a validation bug before it reached the outbox is invisible to a metric derived from the outbox, so the SLI silently excludes exactly the failure class with the worst customer impact. Anchor the denominator to the business-side count — rows in the source table, or a counter incremented at the point the domain decides an event should exist — and reconcile it against outbox rows daily. A persistent gap of even 0.05% between “events that should exist” and “events in the outbox” is a bug, not noise, and no amount of delivery-side reliability compensates for it.
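A daily reconciliation of this kind might look like the sketch below. The 0.05% threshold comes from the paragraph above; reading "persistent" as three consecutive days, along with the function names and the shape of the input, is an assumption made for illustration.

```python
from typing import Sequence, Tuple

GAP_THRESHOLD = 0.0005  # 0.05%, the figure quoted in the text
MIN_DAYS = 3            # assumption: what "persistent" means here


def gap_ratio(should_exist: int, in_outbox: int) -> float:
    """Share of business-side events that never appeared in the outbox."""
    if should_exist <= 0:
        raise ValueError("no business-side events counted for this day")
    return max(should_exist - in_outbox, 0) / should_exist


def has_persistent_gap(
    daily_counts: Sequence[Tuple[int, int]],
    threshold: float = GAP_THRESHOLD,
    min_days: int = MIN_DAYS,
) -> bool:
    """True when each of the last `min_days` days shows a gap at or above the threshold."""
    recent = daily_counts[-min_days:]
    if len(recent) < min_days:
        return False
    return all(gap_ratio(expected, seen) >= threshold for expected, seen in recent)


# Each pair is (events that should exist, rows found in the outbox) for one day.
steady_leak = [(40_000, 40_000), (41_000, 40_970), (39_500, 39_470), (40_200, 40_175)]
one_bad_day = [(40_000, 40_000), (40_000, 39_900), (40_000, 40_000)]
print(has_persistent_gap(steady_leak))  # True
print(has_persistent_gap(one_bad_day))  # False
```

A single bad day is still worth investigating; the sketch only separates a steady leak from a one-off.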

Step 2: Set Realistic SLO Targets

Do not invent targets; read them off your history. Compute the last quarter’s success ratio and latency distribution, then set the SLO slightly looser than your steady-state performance, so that it absorbs normal incidents without being trivially met. A 99.9% success target over 30 days allows roughly 43 minutes of total budget if expressed as time, or 0.1% of events if expressed as a count. Tier targets by importance (for example, 99.95% for payment events and 99.5% for low-stakes notifications) so you are not paying for reliability nobody needs.
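The time-budget arithmetic is easy to check by hand or with a few lines of code. The helper below is a sketch with a hypothetical name; it takes the target as a string and uses `Fraction` so that values such as 0.9995 are not distorted by binary floating point.

```python
from fractions import Fraction


def budget_minutes(target: str, window_days: int = 30) -> float:
    """Minutes of total failure a target allows over the window."""
    allowed_failure = 1 - Fraction(target)  # e.g. 1/1000 for "0.999"
    return float(allowed_failure * window_days * 24 * 60)


for target in ("0.9995", "0.999", "0.995"):
    print(target, budget_minutes(target))
# 0.9995 21.6
# 0.999 43.2
# 0.995 216.0
```

The 99.9% row is the "roughly 43 minutes" quoted above, and the 99.95% tier gets half of that.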

Before committing to a number, check that your volume can support it. A ratio SLO is a statistical estimate, and at low volume the estimate is dominated by noise. An event class producing 2,000 events a month against a 99.9% target has a budget of exactly two failed events; the third failure is a breach, and a single unlucky consumer deploy can spend the whole month’s allowance in ninety seconds. The useful rule of thumb is that you need at least 10 permitted failures in the window for the objective to describe reliability rather than luck, which means roughly 10,000 events a month at 99.9% and 100,000 at 99.99%. Below that threshold, do not lower your ambition — change the instrument. Lengthen the window to 90 days, express the objective as a count (“no more than three undelivered events per quarter”), or fold the low-volume class into a broader tier whose aggregate volume is sufficient.
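The rule of thumb reduces to two small calculations: how many failures a target permits at a given volume, and how much volume is needed to permit ten. The function names below are illustrative, and the default of ten permitted failures is the threshold suggested in the text rather than a standard.

```python
from fractions import Fraction
from math import ceil, floor


def permitted_failures(target: str, events: int) -> int:
    """Whole failed events the target allows at this volume."""
    return floor((1 - Fraction(target)) * events)


def min_events(target: str, min_failures: int = 10) -> int:
    """Smallest window volume at which the target permits `min_failures` failures."""
    return ceil(min_failures / (1 - Fraction(target)))


def volume_supports(target: str, events: int, min_failures: int = 10) -> bool:
    """True when the objective describes reliability rather than luck."""
    return permitted_failures(target, events) >= min_failures


print(permitted_failures("0.999", 2_000))  # 2
print(min_events("0.999"))                 # 10000
print(min_events("0.9999"))                # 100000
print(volume_supports("0.999", 2_000))     # False
```

Exact fractions matter here: with plain floats, `1 - 0.9999` comes out slightly below one ten-thousandth, and a floor or ceiling can then land on the wrong integer.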

Window shape matters as much as window length. A rolling 30-day window is the right default because it never resets: a bad Tuesday keeps costing you for 30 days, which is exactly the incentive you want, and there is no end-of-month cliff where a team spends the remaining budget because it is about to expire. Calendar-month windows are easier to explain to finance and produce visibly worse engineering behaviour. The cost of rolling windows is that recovery feels slow — an incident that burned 40% of the budget will still be visible a month later — and people will ask why the dashboard is red when everything is fine today. Answer that with a second panel showing the last 24 hours alongside the 30-day figure, rather than by shortening the window.

SLI	Good-event definition	Example target	Window
Delivery success ratio	Event acknowledged with 2xx within retry budget	99.9%	30 days rolling
End-to-end latency (p95)	Creation to ack under 15 minutes	95% of events	30 days rolling
End-to-end latency (p99)	Creation to ack under 60 minutes	99% of events	30 days rolling
First-attempt success ratio	Acknowledged on attempt 1	99.0%	7 days rolling

Tiering makes the cost of reliability explicit. Laid out side by side, the same four columns show why a payment event and a marketing event cannot share one number: the strictest tier buys twenty-two minutes of monthly budget and a pager, while the loosest buys seven hours and a dashboard nobody watches at 3am.

Four tiers priced separately: each row buys a different amount of failure and a different response when the budget runs out.

Step 3: Encode the SLIs as Recording Rules

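Express each indicator as a Prometheus recording rule so dashboards and alerts read a pre-computed series rather than recomputing heavy queries. The success ratio divides acknowledged deliveries by all deliveries over the window. For that to match the per-event SLI from Step 1, webhook_deliveries_total has to be incremented once per event when it reaches a terminal outcome; if it is incremented on every attempt, the rule computes a per-attempt ratio instead.

groups:
  - name: webhook_slo
    interval: 1m
    rules:
      # Rolling 30d delivery success ratio (acked / all terminal outcomes).
      # Assumes the counter is incremented once per event, not once per attempt.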
      - record: webhook:delivery_success_ratio:30d
        expr: |
          sum(rate(webhook_deliveries_total{outcome="acked"}[30d]))
          /
          sum(rate(webhook_deliveries_total[30d]))

      # Fraction of events whose end-to-end latency was under 900s (15m).
      - record: webhook:latency_under_15m_ratio:30d
        expr: |
          sum(rate(webhook_delivery_latency_seconds_bucket{le="900"}[30d]))
          /
          sum(rate(webhook_delivery_latency_seconds_count[30d]))

Step 4: Compute the Error Budget

The error budget is 1 - SLO_target, and budget remaining is how much of that allowance you have left. Encode it so you can show “67% of the monthly budget remains” on a dashboard and gate releases on it.

      # Error budget remaining as a fraction of the allowed failure budget.
      # target = 0.999, so allowed failure = 0.001.
      - record: webhook:error_budget_remaining:30d
        expr: |
          1 - (
            (1 - webhook:delivery_success_ratio:30d)
            /
            (1 - 0.999)
          )

When webhook:error_budget_remaining:30d approaches zero, you are about to miss the SLO; that is the signal to freeze risky changes and prioritize reliability work. A healthy budget, by contrast, is permission to ship faster.
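To sanity-check the recording rule against known inputs, the same formula can be written as a plain function. This is a sketch that mirrors the expression above; the function name is made up for the example.

```python
def error_budget_remaining(success_ratio: float, target: float = 0.999) -> float:
    """Fraction of the allowed failure budget still unspent (negative once overspent)."""
    allowed = 1.0 - target
    if allowed <= 0:
        raise ValueError("target must be below 1.0")
    spent = (1.0 - success_ratio) / allowed
    return 1.0 - spent


print(round(error_budget_remaining(0.99967), 4))  # 0.67
print(round(error_budget_remaining(0.999), 4))    # 0.0
print(round(error_budget_remaining(0.998), 4))    # -1.0
```

The last line shows that the value is not clamped at zero: a month at 99.8% against a 99.9% target has spent the budget twice over, and a dashboard that floors the figure at zero hides how deep the breach is.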

Make that policy explicit before you need it, because a budget with no attached consequence is a dashboard decoration. A workable ladder: above 50% remaining, ship normally; between 20% and 50%, require reliability work to be represented in the sprint and ban changes to the dispatch path on Fridays; below 20%, freeze all non-reliability changes to the delivery path until the trailing seven-day burn is under 1.0. The threshold that matters is the last one, and the reason it is 20% rather than 0% is that the budget is a trailing measure — by the time it reads zero, the incidents that spent it are already a fortnight in the past and the freeze arrives too late to prevent the breach it was meant to prevent.
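The ladder can be encoded so that a release gate reads it instead of a human interpreting a dashboard. The sketch below makes two assumptions: burn rate is the observed failure ratio divided by the allowed one (so 1.0 spends the budget exactly on schedule), and the freeze applies only while the budget is under 20% and the seven-day burn is still at or above 1.0. All names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    SHIP_NORMALLY = "ship normally"
    RELIABILITY_IN_SPRINT = "reliability work in sprint; no dispatch-path changes on Fridays"
    FREEZE = "freeze non-reliability changes to the delivery path"


def burn_rate(failure_ratio: float, target: float = 0.999) -> float:
    """Observed failure ratio over a period, relative to the allowed failure ratio."""
    return failure_ratio / (1.0 - target)


def release_policy(remaining: float, seven_day_burn: float) -> Action:
    """Map budget remaining (0..1) and trailing 7d burn rate to a ladder rung."""
    if remaining < 0.20 and seven_day_burn >= 1.0:
        return Action.FREEZE
    if remaining <= 0.50:
        # Also covers a low budget whose burn has recovered: the freeze is lifted,
        # but reliability work stays on the sprint.
        return Action.RELIABILITY_IN_SPRINT
    return Action.SHIP_NORMALLY


print(release_policy(0.67, 0.4).name)  # SHIP_NORMALLY
print(release_policy(0.35, 1.2).name)  # RELIABILITY_IN_SPRINT
print(release_policy(0.10, 2.0).name)  # FREEZE
print(release_policy(0.10, 0.8).name)  # RELIABILITY_IN_SPRINT
```

Whether a recovered burn rate should lift the freeze while the budget is still under 20% is a policy choice; the text supports this reading, but a team may prefer to hold the freeze until the budget itself recovers.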

Deciding Which Failures Burn the Budget

The most contested question in any webhook SLO review is not the target, it is which failures count against it. Charge yourself for everything and the number becomes hostage to your integrators: one customer pointing a webhook at a decommissioned host can eat a payment-tier budget by themselves. Exclude too freely and the SLO stops describing anything a customer recognizes. The resolution is a written, small, and auditable exclusion list, applied as a metric label rather than a manual adjustment.

Three exclusions, each capped and each recorded as a label on the metric so the decision is auditable months later.

Each branch has a justification and a cap. A 400 or 422 returned after a valid signature check means the consumer rejected a payload you delivered correctly; retrying cannot help, and charging it to the delivery budget would make your reliability a function of their schema drift. Track it as a separate contract_error_ratio indicator instead — it is a real problem, just not a delivery problem, and separating it is what lets you tell an integrator “your endpoint rejected 4,102 valid payloads this month” with a straight face. An endpoint that has been auto-disabled after 24 hours of continuous failure stops generating budget-consuming events from the disable point onward, because otherwise a single abandoned URL generates unbounded failures forever; the endpoint is counted once, as one lost integration, in a separate count. And a maintenance window registered in advance through your API excludes failures for its duration, capped at four hours a month per endpoint so the mechanism cannot be used to hide chronic instability.

Three rules make these exclusions safe rather than corrosive. They must be decided by the metric label at write time, never by editing history — a retroactive exclusion is indistinguishable from cooking the books, and once one has been granted the next incident review will ask for another. They must be capped, so no single exclusion category can absorb an unbounded amount of failure. And the excluded volume must be shown on the same dashboard as the SLO, because a success ratio of 99.95% next to “12% of failures excluded this month” is honest, while the same number alone is not. If the excluded share ever exceeds the budget itself, the exclusions have become the story and the SLI needs redesigning.
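Deciding the label at write time might look like the sketch below. The three categories and the four-hour cap come from the text; the `FailureContext` fields, the label strings, and the order in which the checks run are assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional

MAINTENANCE_CAP_SECONDS = 4 * 3600  # four hours a month per endpoint


@dataclass(frozen=True)
class FailureContext:
    """What the dispatcher knows when it records a failure (hypothetical shape)."""
    status_code: Optional[int]     # None for timeouts and connection errors
    signature_valid: bool          # payload passed the signature check
    endpoint_disabled: bool        # auto-disabled after 24h of continuous failure
    in_maintenance_window: bool    # window registered in advance via the API
    maintenance_seconds_used: int  # excluded seconds already consumed this month


def budget_label(ctx: FailureContext) -> str:
    """Return the label written with the failure; only "counted" burns the budget."""
    if ctx.endpoint_disabled:
        return "excluded_disabled_endpoint"
    if ctx.in_maintenance_window and ctx.maintenance_seconds_used < MAINTENANCE_CAP_SECONDS:
        return "excluded_maintenance"
    if ctx.signature_valid and ctx.status_code in (400, 422):
        return "excluded_contract_error"
    return "counted"


def excluded_share(labels: Iterable[str]) -> float:
    """Share of failures that were excluded, for display next to the SLO."""
    counts = Counter(labels)
    total = sum(counts.values())
    return 0.0 if total == 0 else (total - counts["counted"]) / total


failures = [
    FailureContext(422, True, False, False, 0),        # consumer rejected a valid payload
    FailureContext(503, True, False, True, 3 * 3600),  # inside a registered window
    FailureContext(503, True, False, True, 4 * 3600),  # window cap already used up
    FailureContext(None, True, False, False, 0),       # timeout
]
labels = [budget_label(f) for f in failures]
print(labels)
print(excluded_share(labels))  # 0.5
```

Because the label is attached when the failure is recorded, changing the rules later affects only new data, which is the property that keeps past numbers credible.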

Publishing the Number: SLA Versus Internal SLO

The number you commit to contractually and the number you run the system against should not be the same, and the gap between them is deliberate engineering headroom, not dishonesty. If your internal SLO is 99.9% and you publish 99.9%, then any month you miss internally is a month you owe service credits, and your engineering team’s reliability target becomes a finance conversation. Publishing 99.5% against an internal 99.9% gives you a factor of five in failure allowance: you can miss the internal objective badly — 99.6% for a month — and still be comfortably inside the commitment, which means the internal number can stay aggressive enough to actually drive work.

The published number also needs a simpler definition than the internal one, because it will be read by people who cannot inspect your metrics. “99.5% of events acknowledged within 24 hours, measured monthly, excluding endpoints unreachable for more than one hour” is defensible in a contract; a definition that leans on p95 latency, tiered exclusions, and a rolling window is not, because every clause is a future dispute. Keep the internal objective sophisticated and the external one boring, publish the external one on a status page fed by the same recording rules, and never let the two definitions drift into different codebases — compute both from the identical series, with the external one as a coarser query over the same data.
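One way to keep the two definitions from drifting is to make them two parameterizations of a single function over the same data. The sketch below assumes per-event latencies in seconds, with `None` for events never acknowledged; it ignores the unreachable-endpoint exclusion and the monthly window in the published wording, and the names are illustrative.

```python
from typing import Optional, Sequence


def ratio_within(latencies_s: Sequence[Optional[float]], deadline_s: float) -> float:
    """Share of events acknowledged within the deadline; None means never acknowledged."""
    if not latencies_s:
        raise ValueError("no events in the window; the ratio is undefined")
    good = sum(1 for lat in latencies_s if lat is not None and lat <= deadline_s)
    return good / len(latencies_s)


def internal_slo_met(latencies_s: Sequence[Optional[float]]) -> bool:
    """Internal objective: 99.9% acknowledged within 15 minutes."""
    return ratio_within(latencies_s, 15 * 60) >= 0.999


def published_sla_met(latencies_s: Sequence[Optional[float]]) -> bool:
    """Published commitment: 99.5% acknowledged within 24 hours."""
    return ratio_within(latencies_s, 24 * 3600) >= 0.995


# 1,000 events: 996 fast, 3 delivered after an hour, 1 never acknowledged.
month = [30.0] * 996 + [3600.0] * 3 + [None]
print(ratio_within(month, 15 * 60))     # 0.996
print(ratio_within(month, 24 * 3600))   # 0.999
print(internal_slo_met(month), published_sla_met(month))  # False True
```

The example month misses the internal objective and still sits inside the commitment, which is the headroom the section describes.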

Verification and Testing

Backtest the rules against historical data before publishing the SLO. Load a representative window into a scratch Prometheus and confirm the computed success ratio matches an independent count from your event store. A simple assertion in CI guards against label drift breaking the math:

def test_success_ratio_matches_event_store(prom, event_store):
    # prom and event_store are test fixtures; prom.query is assumed to return a float.
    measured = prom.query("webhook:delivery_success_ratio:30d")
    acked = event_store.count(status="acked", window_days=30)
    total = event_store.count(window_days=30)
    assert total > 0, "no events in the window; the expected ratio is undefined"
    expected = acked / total
    # Allow small tolerance for scrape timing differences.
    assert abs(measured - expected) < 0.0005

You can also spot-check latency on the wire by timing a synthetic event from creation to the acknowledgement log line and confirming it lands inside the 15-minute objective.

Failure Modes and Gotchas

Measuring success per attempt, not per event. An at-least-once system retries; counting first attempts makes a healthy pipeline look broken. Define the good event as terminal acknowledgement within the retry budget.
Latency from dispatch instead of creation. Events stuck in the outbox are invisible if you start the clock at first dispatch. Always measure from creation time.
Targets pulled from thin air. An aspirational 99.99% you have never hit produces a permanently exhausted budget and alert fatigue. Derive targets from history.
One global SLO for mixed traffic. Bursty low-priority events drag down the number that matters for payment events. Tier SLOs by event class so the budget reflects real customer impact, in line with message ordering guarantees for ordered streams.
A target the volume cannot support. At 2,000 events a month a 99.9% objective permits two failures, so the metric measures luck rather than reliability. Require at least ten permitted failures in the window, or switch to a count-based objective over a longer period.
A success ratio with no latency clause. An indicator that only asks “did it eventually arrive” scores a pipeline delivering everything a day late at 100%. Bind the good-event definition to a deadline.
Retroactive exclusions. Editing history after an incident to remove failures destroys the credibility of every past number too. Decide exclusions at write time via a metric label, cap each category, and publish the excluded share alongside the ratio.
Publishing the internal target as the contractual one. With no gap between SLO and SLA, every internal miss becomes a credit negotiation and the internal target quietly gets loosened. Keep the published number coarser and looser, computed from the same series.

Frequently Asked Questions

Should a delivery that succeeded on the fourth attempt count as a success?

Yes, provided it landed inside the deadline your good-event definition names. In an at-least-once system retries are the mechanism, not the failure, and an indicator that counts attempts instead of events will report a healthy pipeline as broken every time a consumer restarts.

Keep first-attempt success as a separate operational indicator on a shorter window; it is useful for spotting consumer degradation early, but it is not what an integrator experiences.

How much traffic do I need before a 99.9% objective means anything?

Work backwards from how many failures the objective permits. When that allowance drops into single digits, the published figure is governed by chance rather than by engineering, and one unlucky restart on the consumer side can spend all of it in under two minutes.

In practice that puts the floor at five figures of monthly volume for a three-nines target and an order of magnitude higher for four nines. Underneath it, prefer a longer measurement period or an absolute cap such as three undelivered events per quarter.

Do failures caused by a customer's broken endpoint count against my budget?

Not if you want the number to mean anything, but the exclusion has to be narrow, capped, and applied as a label at write time rather than as a manual correction afterwards. A 4xx returned after a valid signature check is a contract error, not a delivery failure, and belongs in its own indicator.

Publish the excluded share next to the ratio. A success figure that depends on discarding a tenth of all failures is only honest when both numbers are visible.

Rolling window or calendar month?

Rolling, almost always. A calendar window resets on the first of the month, which creates a cliff where a team that has budget left spends it and a team that has none waits out the clock. A rolling window keeps the cost of an incident visible for its full duration.

The complaint you will get is that the dashboard stays red after the problem is fixed; answer it with a second panel showing the trailing 24 hours rather than by shortening the window.

What should actually happen when the budget runs out?

Something specific and agreed in advance, or the budget is decoration. A workable ladder freezes non-reliability changes to the delivery path below 20% remaining and lifts the freeze when the trailing seven-day burn rate drops under 1.0.

Trigger at 20% rather than zero because the measure is trailing: at zero, the incidents that spent the budget are already weeks old and a freeze cannot prevent the breach it was meant to prevent.

Should the number I publish to customers be the same as my internal target?

No. Publish something coarser and looser — 99.5% acknowledged within 24 hours, measured monthly — against an internal objective of 99.9% within 15 minutes. The gap is headroom that lets the internal target stay aggressive without every internal miss becoming a service-credit conversation.

Compute both from the same recording rules so the two definitions cannot drift apart, and keep the contractual wording free of percentiles and tiered exclusions, since every clause is a future dispute.

Why does my success ratio look fine while customers report missing events?

Almost always because the measurement begins too late in the pipeline. If your ratio is computed from rows that already exist in the dispatch table, anything lost earlier — a rolled-back transaction, an over-eager filter, a serialization error — simply never enters the calculation.

The fix is a second count taken at the moment the business decides an event exists, compared against dispatch-table inserts by a daily job. Any stable discrepancy, however small it looks, is a defect in the emit path rather than measurement error.