At-least-once vs exactly-once webhook delivery trade-offs

Every webhook platform must pick a delivery semantic, and the choice cascades into retry design, consumer complexity, and operational cost. This comparison builds on message ordering guarantees and sits alongside implementing strict ordering for financial webhooks, which shows ordering on top of these semantics. For the full taxonomy of guarantees a platform can offer, see the cross-cutting treatment in delivery guarantee levels.

The short version: true exactly-once delivery over an unreliable network is impossible, because the sender can never be certain whether a lost acknowledgement means the receiver got the message or not. What real systems ship is at-least-once delivery plus idempotent consumers, which together produce an effect that is observed exactly once. Understanding why is the difference between chasing an unattainable guarantee and building one that works.

A lost acknowledgement is indistinguishable from a lost message, so the sender retries; the consumer dedupes to make the effect observed once.

What at-least-once delivery guarantees

At-least-once means the sender keeps retrying until it receives an acknowledgement, so the consumer is guaranteed to see every event — but possibly more than once. It is the default for essentially every commercial webhook provider because it is the only semantic that survives network partitions without dropping events. The sender’s contract is simple: store the event, deliver, and on any timeout or non-2xx response, retry with backoff.
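
The sender's contract described above can be sketched in a few lines. This is a minimal illustration, not any provider's actual implementation: `deliver_at_least_once`, the `send` callable, and the default attempt count and delays are all hypothetical, and `None` is used here to model a timeout or connection error.

```python
import time
from typing import Callable, Optional


def deliver_at_least_once(
    send: Callable[[bytes], Optional[int]],
    body: bytes,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry until a 2xx arrives; return False once attempts are exhausted."""
    for attempt in range(max_attempts):
        status = send(body)  # None models a timeout or connection error
        if status is not None and 200 <= status < 300:
            return True  # acknowledged: stop retrying
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))  # exponential backoff
    return False  # caller should route the event to a dead-letter queue
```

Note that the loop retries on a timeout even though the consumer may already have done the work; that is exactly where duplicates come from.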

The burden moves to the consumer, which must tolerate duplicates. That is the whole reason idempotency matters in webhooks; with an idempotent consumer, at-least-once delivery yields effectively-once processing.

Where duplicates actually come from

“The network is unreliable” is true but useless when you are staring at two identical charges. In practice duplicates come from four specific windows, each with a distinct signature in the logs, and knowing which one you are in tells you whether to fix the consumer, the timeout, or nothing at all.

The dangerous region starts the instant the side effect commits and ends only when the sender has recorded the acknowledgement — every retry that lands inside it is a duplicate the consumer must absorb.

The consumer is slower than the sender’s read timeout. This is by far the largest source, and it produces duplicates while everything reports success. The sender gives up at 10 seconds, the handler finishes at 11, the work is committed, and the retry arrives to do it all again. The signature is a p99 handler latency sitting right on the sender’s timeout value, with duplicates that trail their originals by exactly one retry interval. Deduplication absorbs the symptom but does not remove the cause; the fix is to persist the event durably, acknowledge immediately, and do the work asynchronously so the response is measured in milliseconds.

A proxy returned an error the application never saw. A gateway with a 30-second idle timeout, or a load balancer draining a node mid-request, returns 504 to the sender while the upstream request completed normally. You can confirm it in one query: 504s in the proxy log with matching 200s in the application log for the same request id. Nothing is wrong with the consumer here; the timeout budget of the proxy simply has to exceed the sender’s, or the handler has to get faster.

The sender’s own broker redelivered. If the sender pulls from a queue with a visibility timeout, a delivery task that runs longer than that timeout makes the message visible again and a second worker dispatches the same event. The signature is duplicate bursts spaced at exactly the visibility timeout rather than at the retry backoff intervals, and it is entirely a sender-side bug — but the consumer still has to absorb it.

Someone replayed on purpose. Draining a dead-letter queue, re-running a backfill, or re-sending a day of events after an outage all produce duplicates by design, often thousands at once. This is the case that proves why deduplication keys must outlive the retry window: a replay run three days later against a 24-hour deduplication window will re-apply every side effect it touches.
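
The proxy case above is the easiest of the four to confirm mechanically. As a rough sketch, assuming both logs have already been parsed into (request id, status) pairs, the check is a simple join. `LogEntry` and `masked_successes` are hypothetical names, and real log formats will differ.

```python
from typing import Iterable, NamedTuple, Set


class LogEntry(NamedTuple):
    request_id: str
    status: int


def masked_successes(proxy_log: Iterable[LogEntry], app_log: Iterable[LogEntry]) -> Set[str]:
    """Request ids where the proxy answered 504 but the application returned 2xx."""
    app_ok = {e.request_id for e in app_log if 200 <= e.status < 300}
    return {
        e.request_id
        for e in proxy_log
        if e.status == 504 and e.request_id in app_ok
    }
```

A non-empty result means the sender was told about failures the application never experienced, so the retries that follow are duplicates of committed work.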

Why exactly-once delivery is unattainable

Exactly-once delivery would require the sender to know, with certainty, that the consumer received and durably stored each event exactly once: never zero times, never twice. This is the Two Generals’ Problem: after the consumer processes an event it sends an ack, but if that ack is lost the sender cannot tell the difference between “consumer never got it” and “consumer got it, ack vanished.” Its only safe move is to retry, which produces a duplicate. No additional acknowledgement round solves this; the last message in any finite exchange can always be the one that’s lost.
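
A toy simulation makes the ambiguity concrete. It is only a sketch: `Consumer` and `attempt` are hypothetical names, and the lossy link is modelled with two boolean flags rather than a real network.

```python
from typing import Optional


class Consumer:
    def __init__(self) -> None:
        self.applied = 0

    def receive(self, event_id: str) -> str:
        self.applied += 1  # the side effect happens here
        return "ack"


def attempt(consumer: Consumer, event_id: str,
            drop_request: bool, drop_ack: bool) -> Optional[str]:
    """One delivery attempt over a lossy link; None is what a timeout looks like."""
    if drop_request:
        return None  # request lost: the consumer never saw the event
    ack = consumer.receive(event_id)
    if drop_ack:
        return None  # ack lost: the consumer did the work anyway
    return ack


lost_request = Consumer()
lost_ack = Consumer()
seen_a = attempt(lost_request, "evt_1", drop_request=True, drop_ack=False)
seen_b = attempt(lost_ack, "evt_1", drop_request=False, drop_ack=True)
```

From the sender's side `seen_a` and `seen_b` are the same value, yet one consumer has applied the event and the other has not. Nothing the sender can observe separates the two cases.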

So exactly-once is achievable only as exactly-once processing — the effect happens once even though the message may arrive several times. That is delivered by at-least-once transport plus a consumer that discards duplicates.

At-most-once for contrast

At-most-once delivery fires each event once and never retries: zero duplicates, but events are silently lost on any failure. It suits pure telemetry or best-effort notifications where a missed event costs nothing. It is the wrong choice for anything stateful, which is why it rarely appears in webhook platforms beyond fire-and-forget pings.

Comparison

Dimension	At-most-once	At-least-once	Exactly-once (effective)
Duplicates	Never	Possible	Suppressed by consumer
Lost events	Possible	Never (with retries)	Never
Sender complexity	Trivial (fire and forget)	Retry + backoff + DLQ	Same as at-least-once
Consumer complexity	None	Must tolerate duplicates	Must be idempotent + store keys
Network-partition behavior	Drops events	Survives, retries later	Survives, retries later
Realistic to implement	Yes	Yes	Only as effectively-once
Typical fit	Telemetry, pings	The default for webhooks	Money movement, account state

Read the table as a single axis rather than three isolated columns: each step to the right buys a stronger correctness property and pays for it with work on the consumer side.

Correctness is bought with consumer work: only the right-hand column both keeps every event and applies it once, and only because the consumer deduplicates.

Choosing for a workload

Default to at-least-once delivery with an idempotent consumer for almost every webhook integration. It is the only combination that neither drops events nor double-applies them, and it degrades gracefully under partitions. Add a durable idempotency key when the side effect is irreversible, and back retries with exponential backoff and a dead-letter queue so a permanently failing consumer does not stall the pipeline.

Pick at-most-once only for genuinely disposable signals. Never reach for it on anything that mutates state, because the lost-event mode is invisible until reconciliation surfaces the gap. When ordering also matters, layer the techniques from implementing strict ordering for financial webhooks on top of the at-least-once base, or scope the guarantee down using per-key ordering with partitioned queues so a single slow consumer cannot stall unrelated resources.

Implementing effectively-once on the consumer

An event moves through a small, fully explicit lifecycle on the consumer: it is delivered, its id is claimed, the side effect is applied, and the whole thing commits together. Only two exits exist — a conflict on the claim, which means a duplicate, and a crash before commit, which rolls back to the delivered state so the next retry starts clean.

There is no state in which the claim exists without its side effect, which is exactly what stops a crashed retry from silently swallowing an event.

import hashlib
import json
import psycopg2

# A UNIQUE constraint on event_id turns at-least-once into effectively-once:
#   CREATE TABLE processed_events (
#     event_id TEXT PRIMARY KEY,
#     processed_at TIMESTAMPTZ NOT NULL DEFAULT now()
#   );

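def stable_event_id(headers: dict, payload: dict) -> str:
    # Prefer the provider's event id; it is stable across retries.
    # (Assumes header names have already been normalised to this casing.)
    if eid := headers.get("X-Event-Id"):
        return eid
    # Fallback only: a digest of re-serialised JSON is stable only while this
    # parser and serialiser stay unchanged. Hashing the raw request bytes is
    # safer when they are available.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()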

def handle(conn, headers: dict, payload: dict) -> str:
    eid = stable_event_id(headers, payload)
    with conn:  # one transaction: claim + side effect commit together
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO processed_events (event_id) VALUES (%s) "
                "ON CONFLICT (event_id) DO NOTHING RETURNING event_id",
                (eid,),
            )
            if cur.fetchone() is None:
                # The row already existed: this is a retry of a processed event.
                return "duplicate-ignored"
            apply_side_effect(cur, payload)  # runs in the same transaction
    return "processed"

def apply_side_effect(cur, payload: dict) -> None:
    ...  # the actual work, committed atomically with the claim

Claiming the event_id and applying the side effect in one transaction is what makes it correct: either both commit or neither does, so a crash mid-flight is safely retried rather than leaving a claimed-but-unprocessed event.

Sizing and retiring the claim table

The claim table is the cost of effectively-once processing, and it grows at the full delivery rate rather than the duplicate rate. At a modest 200 deliveries per second that is 17.3 million rows per day; retaining 72 hours leaves roughly 52 million rows, and at about 100 bytes per row including the index entry that is 5.2 GB of storage doing nothing but remembering what you have already seen.
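
The arithmetic is easy to redo for your own rates. The helper below is a back-of-the-envelope sketch; `claim_table_size` is a hypothetical name, and the 100 bytes per row is the same rough estimate used above rather than a measured figure.

```python
from typing import Tuple


def claim_table_size(deliveries_per_second: float,
                     retention_hours: float,
                     bytes_per_row: int = 100) -> Tuple[int, int]:
    """Return (rows retained, approximate bytes) for a steady delivery rate."""
    rows = int(deliveries_per_second * retention_hours * 3600)
    return rows, rows * bytes_per_row


rows_72h, bytes_72h = claim_table_size(200, 72)      # 72-hour retention
rows_8d, bytes_8d = claim_table_size(200, 8 * 24)    # eight-day retention
```

At 200 deliveries per second this gives 51,840,000 rows and about 5.2 GB for 72 hours, and 138,240,000 rows and about 13.8 GB for eight days.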

Retention should be set by the longest replay you will realistically perform, not by the provider’s retry horizon. If your team drains dead-letter queues weekly, a 72-hour window means a Monday replay of Thursday’s failures re-applies every side effect it touches, because the claims expired in between. Eight days of retention costs about 14 GB at the rate above and removes the entire class of incident; that is a good trade in almost every system.

How you delete matters more than how much you keep. A nightly DELETE ... WHERE processed_at < now() - interval '8 days' on a table taking 200 inserts per second produces dead tuples faster than autovacuum reclaims them, and the observable result is an insert p99 that drifts from 2 ms to 40 ms over a few weeks while the table size grows even though the row count is flat. Use daily range partitions and drop the expired partition instead: reclaiming space becomes a metadata operation and the index stays shallow.
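
One way to drive partition retirement is a small job that works out which daily partitions have aged out and emits the statements to drop them. This is a sketch under assumptions: the `processed_events_YYYYMMDD` naming scheme and the function names are hypothetical, and the statements are only generated here, not executed. One caveat to check before adopting this layout: PostgreSQL requires a unique constraint on a partitioned table to include the partition key columns, so how uniqueness on event_id alone is preserved across partitions needs its own design decision.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, List

PREFIX = "processed_events_"  # hypothetical naming: processed_events_YYYYMMDD


def partition_name(day: date) -> str:
    return PREFIX + day.strftime("%Y%m%d")


def drop_statements(existing: Iterable[str], today: date,
                    retention_days: int = 8) -> List[str]:
    """DROP statements for daily partitions that lie entirely before the cutoff."""
    cutoff = today - timedelta(days=retention_days)
    statements = []
    for name in sorted(existing):
        day = datetime.strptime(name[len(PREFIX):], "%Y%m%d").date()
        if day < cutoff:  # every row in this partition is older than the window
            statements.append(f"DROP TABLE IF EXISTS {name};")
    return statements
```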

The primary key type is the other quiet cost. Random text ids give a b-tree with no insert locality, so every insert dirties a different page. Because PostgreSQL writes a full page image to the write-ahead log the first time a page is modified after each checkpoint, the log grows several times faster than the data; the symptom is WAL volume wildly out of proportion to row volume and checkpoints firing every few seconds. Hashing the event id down to 16 raw bytes does not restore locality, since a hash is just as random, but it shrinks the index by more than half and makes the working set far more likely to stay in cache.
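
A 16-byte key of this kind could be derived as follows. This is one possible sketch, assuming the key is stored in a binary column; `compact_event_key` is a hypothetical name, and BLAKE2b is simply a convenient standard-library hash that supports a 16-byte digest directly.

```python
import hashlib


def compact_event_key(event_id: str) -> bytes:
    """Fixed-width 16-byte key derived from an arbitrary-length event id."""
    return hashlib.blake2b(event_id.encode("utf-8"), digest_size=16).digest()
```

The trade-off is that the stored key is no longer human-readable, so keep the original id somewhere searchable if operators need to look events up by it.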

The cheapest version of this table is the one you never create. If the side effect already writes a row keyed by something derived from the event — a ledger entry, a state transition record — put a unique constraint on (resource_id, provider_event_id) there and the deduplication is free: no second table, no retention policy, and no possibility of the claim and the effect diverging. Reach for a dedicated claim table only when the side effect has no natural row of its own.
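
A minimal sketch of the natural-key approach follows. It uses SQLite from the standard library so that it runs anywhere; the `ledger_entries` table, its columns, and the function names are assumptions for illustration, and the same idea carries over to PostgreSQL with ON CONFLICT DO NOTHING.

```python
import sqlite3


def open_ledger() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ledger_entries ("
        " resource_id TEXT NOT NULL,"
        " provider_event_id TEXT NOT NULL,"
        " amount INTEGER NOT NULL,"
        " UNIQUE (resource_id, provider_event_id))"
    )
    return conn


def record_entry(conn: sqlite3.Connection, resource_id: str,
                 provider_event_id: str, amount: int) -> str:
    # The side-effect row is its own claim: a duplicate hits the unique
    # constraint and inserts nothing.
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO ledger_entries "
            "(resource_id, provider_event_id, amount) VALUES (?, ?, ?)",
            (resource_id, provider_event_id, amount),
        )
    return "processed" if cur.rowcount == 1 else "duplicate-ignored"
```

Because the claim and the effect are the same row, there is nothing to keep in sync and nothing to expire separately.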

Verification

def test_duplicate_is_ignored(conn):
    headers = {"X-Event-Id": "evt_123"}
    payload = {"amount": 100}
    assert handle(conn, headers, payload) == "processed"
    # An identical retry must not re-run the side effect.
    assert handle(conn, headers, payload) == "duplicate-ignored"

You can also force the duplicate path with curl by replaying the same delivery twice; the second call should be acknowledged with a 200 but leave state unchanged:

BODY='{"amount":100}'
for i in 1 2; do
  curl -fsS -X POST localhost:8000/webhooks \
    -H 'X-Event-Id: evt_123' -H 'content-type: application/json' --data "$BODY"
done
# Expect one state change in the database, two 200 responses.

Failure modes and gotchas

Claiming and applying in separate transactions. If you insert the event_id, commit, then crash before the side effect, the retry sees the row and skips the work — the event is lost despite at-least-once transport. Keep both in one transaction.
Treating “exactly-once” as a transport setting. No retry tuning or ack scheme buys exactly-once on the wire. Chasing it wastes effort; invest in consumer idempotency instead.
Using a non-stable event id. Deriving the id from a field the provider rewrites on retry (a timestamp, a delivery attempt counter) defeats deduplication entirely. Use the provider’s event id, or hash only retry-stable fields.
Unbounded retries with no dead-letter path. At-least-once without a ceiling will retry a poison event forever, blocking the queue. Cap attempts and route exhausted events to a dead-letter queue for out-of-band handling.
Returning 5xx for an already-processed event. A duplicate that answers with an error convinces the sender the delivery failed, so it retries harder and for longer — turning successful deduplication into a self-inflicted retry storm. Acknowledge duplicates with 200 and count them; the metric is the alerting signal, not the status code.
Hashing a re-serialised body. If the id is derived by parsing JSON and dumping it again, any change in key order, unicode escaping, or float formatting between your parser and the sender’s produces a different digest for the same event. Hash the raw request bytes, or use the provider’s event id and skip the problem entirely.
Acknowledging before the work is durable. Responding 200 and then handing the event to a background task means a worker restart drops it with the sender convinced it succeeded — at-least-once transport cannot help you once you have lied about receipt. Commit to a durable queue before responding, then process from there.
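
The re-serialisation gotcha above is easy to reproduce. The sketch below derives an id three ways from the same body; the function names and the sample payload are made up for the demonstration, and hashing raw bytes assumes the sender retries with a byte-identical body.

```python
import hashlib
import json


def id_from_raw(body: bytes) -> str:
    """Digest of the request body exactly as it arrived."""
    return hashlib.sha256(body).hexdigest()


def id_from_reserialised(body: bytes, ensure_ascii: bool = True) -> str:
    """Digest of parse-then-dump output; depends on serialiser settings."""
    canonical = json.dumps(
        json.loads(body),
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=ensure_ascii,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


raw = b'{"amount": 1.10, "name": "caf\\u00e9"}'

raw_id = id_from_raw(raw)
escaped_id = id_from_reserialised(raw, ensure_ascii=True)
unescaped_id = id_from_reserialised(raw, ensure_ascii=False)
```

All three ids describe the same event, and all three differ: the float is reformatted, whitespace is dropped, and the unicode escape is either kept or expanded depending on one serialiser flag.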

Frequently Asked Questions

If exactly-once is impossible, why do some brokers advertise it?

They are describing exactly-once semantics inside their own closed system, where the producer, log, and consumer offset all commit through one transactional protocol. The guarantee stops at the boundary: the moment an event leaves over HTTP to a third party, the sender is back to not knowing whether a missing acknowledgement means the work happened. Webhook delivery is always the boundary case.

How do I choose the deduplication retention window?

Let your replay habits set it, not the sender's documented retry schedule. Teams that drain dead-letter queues weekly need at least eight days, otherwise a Monday replay of Thursday's failures re-applies every side effect. Storage is cheap relative to a duplicate refund run.

Is a duplicate rate of two percent a problem?

No, that is a normal baseline for HTTP delivery and it is exactly what the consumer's deduplication exists to absorb. Investigate when the rate moves rather than when it exists: a jump into double digits almost always means handler latency has crossed the sender's read timeout and healthy work is being retried.

Does an idempotent consumer remove the need for retries at the sender?

No, they solve opposite halves of the problem. Retries stop events from being lost; idempotency stops them from being applied twice. Drop the retries and you have at-most-once delivery with silent data loss, which is far harder to detect than a duplicate.

What if the side effect is a call to a third-party API rather than a database write?

Then you cannot put it in the same transaction as the claim, and the honest options are to pass your event id through as that API's own idempotency key, or to record the outbound attempt durably before making the call and reconcile afterwards. Most payment and messaging APIs accept an idempotency key precisely because this problem has no local solution.
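
The first option can be sketched with a stand-in for the remote API. `FakePaymentsAPI`, `create_refund`, and the `idempotency_key` parameter are hypothetical; real providers expose this under their own names and usually as a request header, so check the provider's documentation for the exact mechanism.

```python
from typing import Dict


class FakePaymentsAPI:
    """Stand-in for a third-party API that honours an idempotency key."""

    def __init__(self) -> None:
        self._by_key: Dict[str, dict] = {}
        self.refunds_issued = 0

    def create_refund(self, amount: int, idempotency_key: str) -> dict:
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]  # replay: same result, no new refund
        self.refunds_issued += 1
        result = {"refund_id": f"rf_{self.refunds_issued}", "amount": amount}
        self._by_key[idempotency_key] = result
        return result


def handle_refund_event(api: FakePaymentsAPI, event_id: str, amount: int) -> dict:
    # The webhook event id doubles as the downstream idempotency key, so a
    # redelivered event maps onto the same remote operation.
    return api.create_refund(amount, idempotency_key=event_id)
```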

How do I test that a consumer is genuinely idempotent?

Replay a captured delivery twice in a row and assert the database state is byte-identical after the second run, not merely that the response was 200. Then repeat with the two deliveries running concurrently, which is the case a naive check-then-insert passes in serial testing and fails in production.