Choosing a Webhook Delivery Guarantee Level
Every webhook system makes a delivery guarantee, whether it is chosen deliberately or not. The guarantee is the contract you offer consumers about how often each event arrives: never more than once, at least once, or exactly once in effect. Picking the right tier is the most consequential design decision in Delivery Guarantee Levels, because it dictates your retry policy, your storage footprint, and how much work consumers must do. This guide gives you a concrete procedure and a comparison table to choose, then shows how to encode the choice in code. For the deeper theory behind why exactly-once is effectively a deduplication problem, read at-least-once vs exactly-once delivery trade-offs.
The short version: at-least-once is the right default for almost all webhooks, paired with consumer-side idempotency. At-most-once and effectively-exactly-once are specializations you reach for only when a specific requirement forces them.
The three guarantee levels compared
There is no true exactly-once over an unreliable network — “exactly-once” in practice means at-least-once delivery plus deduplication that makes reprocessing a no-op. The three operational tiers are:
| Dimension | At-most-once | At-least-once | Effectively-exactly-once |
|---|---|---|---|
| Duplicates | Never | Possible (must be tolerated) | Suppressed by a dedup store |
| Lost events | Possible on any failure | Not silently (retried until acked, then dead-lettered once attempts run out) | Not silently (same retry and dead-letter path) |
| Retries | None | Bounded retries + backoff | Bounded retries + backoff |
| Where dedup lives | N/A | Consumer (its responsibility) | Producer dedup store + consumer idempotency |
| Storage cost | Lowest | Low (delivery log only) | Highest (durable dedup keys with long TTL) |
| Latency overhead | Lowest | Low | Added lookup/write per event |
| Typical use | Metrics, presence pings, ephemeral signals | Most business events | Payments, billing, inventory |
The decisive trade-off is duplicates versus loss. You cannot rule out both without unbounded cost, so you choose which one your consumers can tolerate cheaply. Most can absorb duplicates with a small idempotency check far more easily than they can recover from a silently dropped event.
Step 1: Classify each event’s loss tolerance
Run the decision per event type, not per system. A single dispatcher often carries metric.sampled (loss fine) alongside invoice.paid (loss catastrophic). For each type, ask: if this event vanishes and no one notices for an hour, what breaks?
```python
from enum import Enum


class Guarantee(Enum):
    AT_MOST_ONCE = "at_most_once"
    AT_LEAST_ONCE = "at_least_once"
    EFFECTIVELY_EXACTLY_ONCE = "eeo"


EVENT_POLICY = {
    "metric.sampled": Guarantee.AT_MOST_ONCE,  # cheap, replaceable
    "user.updated": Guarantee.AT_LEAST_ONCE,  # dedupe on consumer
    "invoice.paid": Guarantee.EFFECTIVELY_EXACTLY_ONCE,  # money
}
```
If loss is acceptable, choose at-most-once: fire once, no retries, no delivery log. Everything else continues to Step 2.
Step 2: Decide who owns deduplication
For events that must not be lost, the next question is whether the consumer can deduplicate. A consumer that writes through a unique idempotency key (e.g. an INSERT ... ON CONFLICT DO NOTHING keyed on the event ID) makes at-least-once safe with almost no extra machinery on the producer side. This is the sweet spot and your default:
```python
def deliver_at_least_once(dispatcher, event, max_attempts=6):
    # Retry until acknowledged or attempts run out; the consumer dedupes on event["id"].
    for attempt in range(max_attempts):
        ok = dispatcher.post(event, headers={"X-Idempotency-Key": event["id"]})
        if ok:
            return "delivered"
        if attempt < max_attempts - 1:  # no point waiting after the final attempt
            dispatcher.backoff(attempt)  # exponential backoff + jitter
    dispatcher.dead_letter(event)  # surface the failure instead of dropping it
    return "exhausted"
```
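On the consumer side, the idempotency check described above might look like the following minimal sketch. It uses Python's built-in `sqlite3` module so it runs without a database server; `INSERT OR IGNORE` is SQLite's spelling of the same idea as `INSERT ... ON CONFLICT DO NOTHING`. The table names `processed_events` and `ledger`, the `handle_event` function, and the event fields `account` and `cents` are assumptions made for illustration, not part of any particular consumer.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    # In-memory database for the sketch; a real consumer would use its own store.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE ledger (event_id TEXT, account TEXT, cents INTEGER)")
    conn.commit()
    return conn


def handle_event(conn: sqlite3.Connection, event: dict) -> bool:
    """Apply the event's side effect once. Returns False for a duplicate."""
    with conn:  # dedup marker and side effect commit (or roll back) together
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event["id"],),
        )
        if cur.rowcount == 0:
            return False  # event ID already recorded: redelivery is a no-op
        conn.execute(
            "INSERT INTO ledger (event_id, account, cents) VALUES (?, ?, ?)",
            (event["id"], event["account"], event["cents"]),
        )
    return True
```

Recording the event ID and applying the side effect in one transaction is the point of the design: if the two committed separately, a crash between them could either lose the side effect or let a retry apply it twice.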
If you cannot rely on the consumer to dedupe — for example a third-party endpoint with side effects you don’t control — escalate to Step 3 and provide deduplication on the producer side.
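Steps 1 and 2 reduce to two yes/no questions per event type. The sketch below writes them down as a function; `choose_guarantee` and its parameter names are hypothetical, and the `Guarantee` enum is repeated only so the snippet runs on its own. The last branch names a candidate tier, which still has to pass the cost check in Step 3.

```python
from enum import Enum


class Guarantee(Enum):
    AT_MOST_ONCE = "at_most_once"
    AT_LEAST_ONCE = "at_least_once"
    EFFECTIVELY_EXACTLY_ONCE = "eeo"


def choose_guarantee(loss_acceptable: bool, consumer_dedupes: bool) -> Guarantee:
    # Step 1: if losing the event is fine, skip retries entirely.
    if loss_acceptable:
        return Guarantee.AT_MOST_ONCE
    # Step 2: an idempotent consumer makes plain at-least-once safe.
    if consumer_dedupes:
        return Guarantee.AT_LEAST_ONCE
    # Otherwise the producer has to dedupe; cost this tier before committing.
    return Guarantee.EFFECTIVELY_EXACTLY_ONCE
```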
Step 3: Cost the strongest tier before committing
Effectively-exactly-once is not free. It requires a durable dedup store whose key TTL must outlive your maximum retry window and any DLQ retention, so a replayed event from a dead-letter queue is still recognized as a duplicate weeks later. Budget for the storage and the per-event lookup latency:
```python
import redis


class ExactlyOnceGate:
    def __init__(self, client: redis.Redis, ttl_days: int = 30):
        self.r = client
        self.ttl = ttl_days * 24 * 3600  # key lifetime in seconds

    def first_delivery(self, event_id: str) -> bool:
        # SET NX is the dedup primitive; returns True only the first time.
        return bool(self.r.set(f"eeo:{event_id}", "1", nx=True, ex=self.ttl))


def deliver_eeo(dispatcher, gate: ExactlyOnceGate, event):
    # The key is claimed before delivery, so an event that ends up dead-lettered
    # stays marked: replaying it through this function is suppressed unless the
    # key is cleared or the gate is bypassed.
    if not gate.first_delivery(event["id"]):
        return "suppressed_duplicate"
    return deliver_at_least_once(dispatcher, event)
```
Only adopt this tier where a duplicate genuinely causes harm that the consumer cannot undo cheaply — double charges, double shipments, double-counted balances.
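To put rough numbers on that budget, the arithmetic can be sketched as follows. Everything here is an assumption for illustration: the function names, the backoff schedule (30-second base, doubling, capped at one hour), the one-day safety margin, and especially the 100 bytes per key, which is a placeholder rather than a measured figure for any store.

```python
from datetime import timedelta


def max_retry_window(max_attempts: int = 6, base_seconds: float = 30.0,
                     factor: float = 2.0, cap_seconds: float = 3600.0) -> timedelta:
    # Upper bound on time spent waiting: one capped delay per gap between attempts.
    gaps = max_attempts - 1
    total = sum(min(cap_seconds, base_seconds * factor ** i) for i in range(gaps))
    return timedelta(seconds=total)


def min_dedup_ttl(retry_window: timedelta, dlq_retention: timedelta,
                  safety: timedelta = timedelta(days=1)) -> timedelta:
    # The key must outlive both the retry window and the DLQ retention.
    return retry_window + dlq_retention + safety


def dedup_storage_bytes(events_per_day: int, ttl: timedelta,
                        bytes_per_key: int = 100) -> int:
    # Steady state: every event delivered within the TTL still holds a key.
    ttl_days = ttl / timedelta(days=1)
    return int(events_per_day * ttl_days * bytes_per_key)


if __name__ == "__main__":
    ttl = min_dedup_ttl(max_retry_window(), dlq_retention=timedelta(days=14))
    print("minimum TTL:", ttl)
    print("approx. bytes held:", dedup_storage_bytes(1_000_000, ttl))
```

Measure the real per-key footprint in your own store before relying on the storage estimate; the TTL calculation only depends on your retry schedule and DLQ retention.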
Step 4: Encode the choice in the dispatcher
Wire the per-event policy into one place so the guarantee is explicit and testable rather than emergent from scattered retry settings:
```python
def dispatch(event, dispatcher, gate):
    level = EVENT_POLICY.get(event["type"], Guarantee.AT_LEAST_ONCE)
    if level is Guarantee.AT_MOST_ONCE:
        dispatcher.post(event)  # fire and forget, no retry
        return "sent_best_effort"
    if level is Guarantee.AT_LEAST_ONCE:
        return deliver_at_least_once(dispatcher, event)
    return deliver_eeo(dispatcher, gate, event)
```
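Because the dispatcher falls back to at-least-once for any type missing from the policy table, an unclassified event type is easy to overlook. One way to keep the choice explicit is a small check, run in CI or at startup, that lists the gaps. The helper name `unclassified_event_types` and the idea of keeping a separate list of known event types are assumptions for this sketch; policy values are plain strings here so the snippet stands alone.

```python
def unclassified_event_types(known_types, policy):
    """Return event types that would fall through to the dispatcher's default."""
    return sorted(set(known_types) - set(policy))


if __name__ == "__main__":
    policy = {
        "metric.sampled": "at_most_once",
        "user.updated": "at_least_once",
        "invoice.paid": "eeo",
    }
    known = ["metric.sampled", "user.updated", "invoice.paid", "order.shipped"]
    print(unclassified_event_types(known, policy))
```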
Verification
Assert that each tier behaves as contracted under a forced failure.
```python
import fakeredis  # in-memory stand-in for Redis, used only by the tests


def test_at_least_once_retries_then_delivers():
    calls = {"n": 0}

    class D:
        def post(self, e, headers=None):
            calls["n"] += 1
            return calls["n"] >= 3  # fail twice, then succeed

        def backoff(self, a): pass
        def dead_letter(self, e): pass

    assert deliver_at_least_once(D(), {"id": "e1"}) == "delivered"
    assert calls["n"] == 3


def test_eeo_suppresses_duplicate():
    gate = ExactlyOnceGate(fakeredis.FakeStrictRedis())
    assert gate.first_delivery("evt-9") is True
    assert gate.first_delivery("evt-9") is False  # second time: duplicate
```
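The at-most-once tier can be checked the same way: under a forced failure it should make exactly one attempt and never retry. In this sketch, `send_at_most_once` is a hypothetical stand-alone copy of the at-most-once branch of the dispatcher, extracted so the test runs on its own.

```python
def send_at_most_once(dispatcher, event):
    # Mirrors the at-most-once branch: one attempt, result ignored.
    dispatcher.post(event)
    return "sent_best_effort"


def test_at_most_once_never_retries():
    calls = {"n": 0}

    class FailingDispatcher:
        def post(self, e, headers=None):
            calls["n"] += 1
            return False  # simulate a rejected delivery

    assert send_at_most_once(FailingDispatcher(), {"id": "e2"}) == "sent_best_effort"
    assert calls["n"] == 1  # exactly one attempt, no retry
```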
Operationally, confirm the contract with a black-box probe: send the same event twice and inspect the consumer.
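```shell
# Send a duplicate; an EEO consumer must show exactly one applied side effect.
# Without an explicit Content-Type, curl -d sends the body as form-encoded data.
curl -s -X POST "$ENDPOINT" -H 'Content-Type: application/json' -H 'X-Idempotency-Key: evt-9' -d @event.json
curl -s -X POST "$ENDPOINT" -H 'Content-Type: application/json' -H 'X-Idempotency-Key: evt-9' -d @event.json
```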
Failure modes and gotchas
- Defaulting the whole system to one tier. Forcing effectively-exactly-once on every event type pays the dedup-store tax on metrics and pings that never needed it. Classify per event type (Step 1); mixing tiers in one dispatcher is correct, not messy.
- At-least-once without consumer dedup. Choosing at-least-once but having a consumer that is not idempotent silently becomes “at-least-once with duplicate side effects.” Verify the consumer dedupes before you rely on this tier.
- Dedup TTL shorter than retention. If the EEO key expires before the DLQ’s retention window, a late replay re-applies the event. Set the dedup TTL longer than max_retry_window + dlq_retention, as in Step 3.
- Treating at-most-once as a performance optimization. Dropping retries to “speed things up” silently downgrades a business event’s guarantee. At-most-once is a deliberate choice for replaceable data only, never a tuning knob for important events.
- Unbounded at-least-once retries. “At least once” does not mean “forever.” Without a retry cap and a DLQ, a permanently broken endpoint pins workers and never surfaces the failure. Always cap attempts and dead-letter the remainder.
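The guide describes the dispatcher's backoff only as exponential with jitter. One common way to compute such a delay is "full jitter": pick a random wait between zero and a capped exponential ceiling. The function name `backoff_delay` and the default base and cap below are assumptions for illustration, not the dispatcher's actual settings.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng: random.Random = None) -> float:
    """Seconds to wait after a failed attempt (0-indexed), using full jitter."""
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))  # exponential growth, bounded by cap
    return rng.uniform(0.0, ceiling)  # jitter spreads retries from many workers


if __name__ == "__main__":
    for attempt in range(6):
        print(attempt, round(backoff_delay(attempt), 2))
```

The cap bounds each individual wait; the attempt limit in the delivery loop is what bounds the total, so both are needed.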