Choosing a Webhook Delivery Guarantee Level

Every webhook system makes a delivery guarantee, whether it is chosen deliberately or not. The guarantee is the contract you offer consumers about how often each event arrives: never more than once, at least once, or exactly once in effect. Picking the right tier is the most consequential design decision in Delivery Guarantee Levels, because it dictates your retry policy, your storage footprint, and how much work consumers must do. This guide gives you a concrete procedure and a comparison table to choose, then shows how to encode the choice in code. For the deeper theory behind why exactly-once is effectively a deduplication problem, read at-least-once vs exactly-once delivery trade-offs.

The short version: at-least-once is the right default for almost all webhooks, paired with consumer-side idempotency. At-most-once and effectively-exactly-once are specializations you reach for only when a specific requirement forces them.

The decision tree: loss tolerance picks at-most-once; otherwise dedup capability decides between at-least-once and an effectively-exactly-once dedup layer.

The three guarantee levels compared

There is no true exactly-once over an unreliable network — “exactly-once” in practice means at-least-once delivery plus deduplication that makes reprocessing a no-op. The three operational tiers are:

Dimension	At-most-once	At-least-once	Effectively-exactly-once
Duplicates	Never	Possible (must be tolerated)	Suppressed by a dedup store
Lost events	Possible on any failure	Not silently (retried until acked, then dead-lettered)	Not silently (same retries and DLQ)
Retries	None	Bounded retries + backoff	Bounded retries + backoff
Where dedup lives	N/A	Consumer (its responsibility)	Producer dedup store + consumer idempotency
Storage cost	Lowest	Low (delivery log only)	Highest (durable dedup keys with long TTL)
Latency overhead	Lowest	Low	Added lookup/write per event
Typical use	Metrics, presence pings, ephemeral signals	Most business events	Payments, billing, inventory

The decisive trade-off is duplicates versus loss. You cannot rule out both without unbounded cost, so you choose which one your consumers can tolerate cheaply. Most can absorb duplicates with a small idempotency check far more easily than they can recover from a silently dropped event.

Step 1: Classify each event’s loss tolerance

Run the decision per event type, not per system. A single dispatcher often carries metric.sampled (loss fine) alongside invoice.paid (loss catastrophic). For each type, ask: if this event vanishes and no one notices for an hour, what breaks?

from enum import Enum

class Guarantee(Enum):
    AT_MOST_ONCE = "at_most_once"
    AT_LEAST_ONCE = "at_least_once"
    EFFECTIVELY_EXACTLY_ONCE = "eeo"

EVENT_POLICY = {
    "metric.sampled":   Guarantee.AT_MOST_ONCE,      # cheap, replaceable
    "user.updated":     Guarantee.AT_LEAST_ONCE,     # dedupe on consumer
    "invoice.paid":     Guarantee.EFFECTIVELY_EXACTLY_ONCE,  # money
}

If loss is acceptable, choose at-most-once: fire once, no retries, no delivery log. Everything else continues to Step 2.
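Because the classification is per event type, it can help to check that no type the system emits has been left without an explicit level. The following is a minimal sketch, not part of the original design: `unclassified_types` and the `KNOWN_EVENT_TYPES` registry are hypothetical names, and the policy uses plain strings as a stand-in for `EVENT_POLICY` so the snippet runs on its own.

```python
def unclassified_types(known_types, policy):
    """Return event types with no explicit guarantee level, sorted for stable output."""
    return sorted(t for t in known_types if t not in policy)

# Hypothetical registry of every event type the dispatcher can emit.
KNOWN_EVENT_TYPES = {"metric.sampled", "user.updated", "invoice.paid", "order.shipped"}

# Stand-in for EVENT_POLICY; plain strings keep the sketch standalone.
policy = {
    "metric.sampled": "at_most_once",
    "user.updated": "at_least_once",
    "invoice.paid": "eeo",
}

missing = unclassified_types(KNOWN_EVENT_TYPES, policy)
```

Here `missing` contains only `order.shipped`, the one type that was never classified. Running a check like this in a test suite is one way to keep a new event type from picking up a guarantee level by accident.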

Step 2: Decide who owns deduplication

For events that must not be lost, the next question is whether the consumer can deduplicate. A consumer that writes through a unique idempotency key (e.g. an INSERT ... ON CONFLICT DO NOTHING keyed on the event ID) makes at-least-once safe with almost no extra machinery on the producer side. This is the sweet spot and your default:

def deliver_at_least_once(dispatcher, event, max_attempts=6):
    # Retry until acknowledged; the consumer dedupes on event["id"].
    for attempt in range(max_attempts):
        ok = dispatcher.post(event, headers={"X-Idempotency-Key": event["id"]})
        if ok:
            return "delivered"
        if attempt < max_attempts - 1:    # no point waiting after the final attempt
            dispatcher.backoff(attempt)   # exponential backoff + jitter
    dispatcher.dead_letter(event)         # attempts exhausted: park the event for replay
    return "exhausted"
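The consumer half of this contract can be sketched as follows. This is an illustration under stated assumptions, not a prescribed implementation: it uses SQLite from the Python standard library, and the `processed_events` and `ledger` tables and the `make_store` and `handle_event` names are hypothetical. The important detail is that the dedup row and the side effect commit in the same transaction.

```python
import sqlite3

def make_store():
    """In-memory store with a dedup table and an example side-effect table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE ledger (event_id TEXT NOT NULL, cents INTEGER NOT NULL)")
    return conn

def handle_event(conn, event):
    """Apply the event once; return False when it was already processed."""
    with conn:  # one transaction: dedup row and side effect commit together
        cur = conn.execute(
            "INSERT INTO processed_events (event_id) VALUES (?) "
            "ON CONFLICT (event_id) DO NOTHING",
            (event["id"],),
        )
        if cur.rowcount == 0:   # the key already existed, so this is a duplicate
            return False
        conn.execute(
            "INSERT INTO ledger (event_id, cents) VALUES (?, ?)",
            (event["id"], event["cents"]),
        )
        return True

conn = make_store()
event = {"id": "evt-9", "cents": 500}
first = handle_event(conn, event)    # applies the side effect
second = handle_event(conn, event)   # redelivery is a no-op
```

If the side effect raises, the transaction rolls back and the dedup row disappears with it, so a later retry can still apply the event.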

If you cannot rely on the consumer to dedupe — for example a third-party endpoint with side effects you don’t control — escalate to Step 3 and provide deduplication on the producer side.

Step 3: Cost the strongest tier before committing

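Effectively-exactly-once is not free. It requires a durable dedup store whose key TTL must outlive your maximum retry window and any DLQ retention, so that a late retry or a replay of an already-delivered event is still recognized as a duplicate weeks later. The gate must also release its claim when delivery is exhausted; otherwise a replay of the dead-lettered event would be suppressed and the event lost. Budget for the storage and the per-event lookup latency:

import redis

class ExactlyOnceGate:
    def __init__(self, client: redis.Redis, ttl_days: int = 30):
        self.r = client
        self.ttl = ttl_days * 24 * 3600

    def first_delivery(self, event_id: str) -> bool:
        # SET NX is the dedup primitive; returns True only the first time.
        return bool(self.r.set(f"eeo:{event_id}", "1", nx=True, ex=self.ttl))

    def release(self, event_id: str) -> None:
        # Drop a claim whose delivery never succeeded so a replay can try again.
        self.r.delete(f"eeo:{event_id}")

def deliver_eeo(dispatcher, gate: ExactlyOnceGate, event):
    if not gate.first_delivery(event["id"]):
        return "suppressed_duplicate"
    result = deliver_at_least_once(dispatcher, event)
    if result != "delivered":
        gate.release(event["id"])   # keep the dead-lettered event replayable
    return result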

Only adopt this tier where a duplicate genuinely causes harm that the consumer cannot undo cheaply — double charges, double shipments, double-counted balances.
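To make the TTL requirement concrete, the arithmetic can be sketched as below. All numbers and names here are assumptions for illustration (`max_retry_window`, `dedup_ttl`, a 30-second base delay doubling up to a one-hour cap, a 10-second request timeout, a 14-day DLQ retention, and a one-day safety margin); the sketch also assumes jitter never lengthens a delay beyond its nominal value.

```python
from datetime import timedelta

def max_retry_window(max_attempts=6, base=timedelta(seconds=30), factor=2.0,
                     cap=timedelta(hours=1), request_timeout=timedelta(seconds=10)):
    """Upper bound on the time from the first attempt starting to the last one finishing."""
    total = request_timeout * max_attempts          # every attempt may run to its timeout
    for gap in range(max_attempts - 1):             # waits happen only between attempts
        total += min(base * (factor ** gap), cap)   # capped exponential delay
    return total

def dedup_ttl(dlq_retention, margin=timedelta(days=1), **retry_kwargs):
    """Dedup key TTL that outlives the retry window plus the DLQ retention."""
    return max_retry_window(**retry_kwargs) + dlq_retention + margin

window = max_retry_window()                        # 16 minutes 30 seconds with the defaults
ttl = dedup_ttl(dlq_retention=timedelta(days=14))
ttl_seconds = int(ttl.total_seconds())             # value to pass as the key expiry
```

With these assumed defaults the retry window is small next to the DLQ retention, which is typical: retention, not retries, usually dominates the TTL and therefore the storage cost.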

Step 4: Encode the choice in the dispatcher

Wire the per-event policy into one place so the guarantee is explicit and testable rather than emergent from scattered retry settings:

def dispatch(event, dispatcher, gate):
    level = EVENT_POLICY.get(event["type"], Guarantee.AT_LEAST_ONCE)
    if level is Guarantee.AT_MOST_ONCE:
        dispatcher.post(event)              # fire and forget, no retry
        return "sent_best_effort"
    if level is Guarantee.AT_LEAST_ONCE:
        return deliver_at_least_once(dispatcher, event)
    return deliver_eeo(dispatcher, gate, event)

Verification

Assert that each tier behaves as contracted under a forced failure.

import fakeredis  # third-party in-memory stand-in for redis, used by the EEO test

def test_at_least_once_retries_then_delivers():
    calls = {"n": 0}
    class D:
        def post(self, e, headers=None):
            calls["n"] += 1
            return calls["n"] >= 3        # fail twice, then succeed
        def backoff(self, a): pass
        def dead_letter(self, e): pass
    assert deliver_at_least_once(D(), {"id": "e1"}) == "delivered"
    assert calls["n"] == 3

def test_eeo_suppresses_duplicate():
    gate = ExactlyOnceGate(fakeredis.FakeStrictRedis())
    assert gate.first_delivery("evt-9") is True
    assert gate.first_delivery("evt-9") is False   # second time: duplicate
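The retry cap is part of the at-least-once contract too, so it is worth a test of its own. The sketch below is an assumption-laden example rather than part of the original suite: `AlwaysFailing` is a hypothetical fake dispatcher, and `deliver_at_least_once` is restated compactly from Step 2 only so the snippet runs standalone.

```python
def deliver_at_least_once(dispatcher, event, max_attempts=6):
    # Compact restatement of the Step 2 loop so this snippet is self-contained.
    for attempt in range(max_attempts):
        if dispatcher.post(event, headers={"X-Idempotency-Key": event["id"]}):
            return "delivered"
        if attempt < max_attempts - 1:
            dispatcher.backoff(attempt)
    dispatcher.dead_letter(event)
    return "exhausted"

class AlwaysFailing:
    """Fake dispatcher whose endpoint never acknowledges; records what happened."""
    def __init__(self):
        self.posts = 0
        self.dead = []

    def post(self, event, headers=None):
        self.posts += 1
        return False

    def backoff(self, attempt):
        pass  # no real sleeping in tests

    def dead_letter(self, event):
        self.dead.append(event["id"])

def test_at_least_once_dead_letters_after_cap():
    d = AlwaysFailing()
    assert deliver_at_least_once(d, {"id": "e2"}, max_attempts=4) == "exhausted"
    assert d.posts == 4          # attempts are capped, not unbounded
    assert d.dead == ["e2"]      # the failure is surfaced via the DLQ
```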

Operationally, confirm the contract with a black-box probe: send the same event twice and inspect the consumer.

# Send a duplicate; an EEO consumer must show exactly one applied side effect.
curl -s -X POST "$ENDPOINT" -H 'Content-Type: application/json' -H 'X-Idempotency-Key: evt-9' -d @event.json
curl -s -X POST "$ENDPOINT" -H 'Content-Type: application/json' -H 'X-Idempotency-Key: evt-9' -d @event.json

Failure modes and gotchas

Defaulting the whole system to one tier. Forcing effectively-exactly-once on every event type pays the dedup-store tax on metrics and pings that never needed it. Classify per event type (Step 1); mixing tiers in one dispatcher is correct, not messy.
At-least-once without consumer dedup. Choosing at-least-once but having a consumer that is not idempotent silently becomes “at-least-once with duplicate side effects.” Verify the consumer dedupes before you rely on this tier.
Dedup TTL shorter than retention. If the EEO key expires before the DLQ’s retention window, a late replay re-applies the event. Set the dedup TTL longer than max_retry_window + dlq_retention, as in Step 3.
Treating at-most-once as a performance optimization. Dropping retries to “speed things up” silently downgrades a business event’s guarantee. At-most-once is a deliberate choice for replaceable data only, never a tuning knob for important events.
Unbounded at-least-once retries. “At least once” does not mean “forever.” Without a retry cap and a DLQ, a permanently broken endpoint pins workers and never surfaces the failure. Always cap attempts and dead-letter the remainder.
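The capped retries in the last item pair naturally with a capped backoff schedule. The following is a sketch of one common variant, exponential backoff with "full jitter"; it is not necessarily what `dispatcher.backoff` does, and the `backoff_delay` name, the 1-second base, and the 300-second cap are illustrative assumptions.

```python
import random

def backoff_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Seconds to wait after failed attempt `attempt` (0-based), using full jitter."""
    ceiling = min(cap, base * (2 ** attempt))   # exponential growth, bounded by the cap
    return rng.uniform(0.0, ceiling)            # random point in [0, ceiling] spreads retries

# Upper bounds of the delay for the first six attempts, before jitter is applied.
ceilings = [min(300.0, 1.0 * 2 ** a) for a in range(6)]
```

Jitter keeps many deliveries that failed at the same moment from retrying in lockstep against an endpoint that is just recovering, while the cap keeps any single wait bounded.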