Adding Jitter to Webhook Retry Backoff

Exponential backoff alone does not prevent a retry storm — it can cause one. If a downstream endpoint blips and a thousand webhook deliveries fail at the same instant, plain base * 2^n backoff schedules all thousand retries for the same future moment. They stampede the recovering endpoint in lockstep, fail again, and re-synchronize for the next round. The fix is jitter: deliberate randomization of the delay so retries spread out instead of clustering. This guide, part of Exponential Backoff Algorithms, compares the three standard jitter strategies — full, equal, and decorrelated — with runnable code for each, and shows how to pick one. It builds directly on implementing exponential backoff in Python webhook handlers; read that first for the retry-loop scaffolding.

The thundering herd is the core problem jitter solves. Without randomization, every client that failed at time T retries at T + delay, recreating the spike. Put numbers on it: 1,000 deliveries fail together on attempt 4, whose exponential ceiling is 16 seconds. Un-jittered, all 1,000 arrive inside the same one-second bucket at T+16s. With full jitter drawing uniformly from [0, 16], those same 1,000 retries land at roughly 62 per second across sixteen buckets — a peak sixteen times lower for exactly the same total volume.

Both lanes are drawn to the same vertical scale and carry the same 1,000 retries: jitter does not reduce load, it removes the instant at which all of it arrives.

Prerequisites

Python 3.10+ with asyncio and (optionally) httpx for the integration example.
A working bounded retry loop with an exponential schedule and a delay ceiling — the structure from implementing exponential backoff in Python webhook handlers.
A monotonic clock for sleeps (asyncio.sleep is fine) so NTP adjustments cannot produce negative delays.
A clear picture of how many independent dispatchers can fail simultaneously — the larger the fleet, the more jitter matters.

Step 1: Establish the base backoff schedule

Jitter randomizes around an exponential ceiling, so start with the un-jittered delay. Cap it so deep retries don’t sleep for hours:

def base_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    # The exponential ceiling jitter will draw from. attempt is 0-indexed.
    return min(cap, base * (2 ** attempt))

For base=1.0, cap=60.0 this yields 1, 2, 4, 8, 16, 32, 60, 60… seconds. Every strategy below transforms this number into an actual, randomized sleep. Note where the curve flattens: from attempt 6 onwards the ceiling binds and every further attempt shares the same 60-second window, which is also the point at which jitter stops widening and the herd risk stops improving — one reason choosing retry budgets and max attempts treats the ceiling and the attempt cap as a single decision.

The two capped bars are the ones to watch: once the ceiling binds, extra attempts buy time but no additional spread, so the attempt cap becomes the real control.

Step 2: Full jitter

Full jitter draws the delay uniformly from zero up to the exponential ceiling. It maximizes spread and is the AWS-recommended default for high-contention systems, because the expected concurrency at any instant is minimized:

import random

def full_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    # Uniform in [0, ceiling]; maximum spread, lowest collision probability.
    return random.uniform(0, base_delay(attempt, base, cap))

The trade-off: a draw can be near zero, so an individual client may retry almost immediately. That is fine for spreading load but means full jitter offers no minimum spacing between a client’s own attempts.

Step 3: Equal jitter

Equal jitter keeps half the ceiling as a fixed floor and randomizes the other half. This guarantees a minimum delay (so a client never hammers instantly) while still decorrelating retries across clients:

import random

def equal_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    half = base_delay(attempt, base, cap) / 2
    # Floor of `half` plus a uniform `half`; delay is always >= half the ceiling.
    return half + random.uniform(0, half)

Equal jitter is a good middle ground when you want predictable lower-bound spacing for latency budgeting but still need herd protection. Its spread is narrower than full jitter, so under extreme contention it protects the endpoint slightly less.

Step 4: Decorrelated jitter

Decorrelated jitter derives each delay from the previous delay rather than from the attempt number, letting the window grow organically and spread widely over successive retries. It tends to drain a backlog faster than full jitter while still avoiding synchronization:

import random

def decorrelated_jitter(prev_delay: float, base: float = 1.0, cap: float = 60.0) -> float:
    # Grow from the last delay; sleep = min(cap, uniform(base, prev*3)).
    return min(cap, random.uniform(base, prev_delay * 3))

Because it is stateful, you carry prev_delay across iterations, seeding it with base on the first attempt. This is the strategy to reach for when total time-to-recovery matters as much as smoothness. Tracing one client through five sleeps makes the growth concrete: each draw widens the window to three times the value that was actually slept, so two dispatchers that started in lockstep diverge immediately and keep diverging.

Because the next window is a function of the last sleep rather than the attempt index, two dispatchers that failed at the same instant separate on the first draw and stay separated.

Step 5: Choose a strategy and integrate it

Pick based on your dominant concern, then wire it into the retry loop. The choice is not a ranking — each strategy optimizes a different quantity, so the branch you take depends on which one your incident reviews keep pointing at.

Each branch optimizes something different — spread, a guaranteed floor, or drain time — which is why the right answer changes per system rather than per fashion.

Strategy	Spread	Minimum delay	Best when
Full jitter	Widest	None (can be ~0)	Large fleets, herd avoidance is paramount
Equal jitter	Moderate	Half the ceiling	You need a predictable lower bound for latency budgets
Decorrelated jitter	Wide, growing	`base`	Draining a backlog quickly while staying smooth
No jitter	None	Full ceiling	Single client, no contention (rarely correct)

import asyncio, random

async def deliver_with_jitter(send, *, base=1.0, cap=60.0, max_attempts=6):
    prev = base
    for attempt in range(max_attempts):
        try:
            return await send()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Swap in full_jitter / equal_jitter as needed; decorrelated shown:
            delay = decorrelated_jitter(prev, base, cap)
            prev = delay
            await asyncio.sleep(delay)
    raise RuntimeError("unreachable")

When many endpoints fail at once and breakers trip across the fleet, combine jitter with per-endpoint circuit breaker state machines so probes during recovery are also staggered.

Verification

Test the two invariants that matter: delays stay within bounds, and a population of retries actually spreads.

import statistics

def test_full_jitter_bounds():
    for attempt in range(8):
        d = full_jitter(attempt, base=1.0, cap=60.0)
        assert 0 <= d <= min(60.0, 2 ** attempt)

def test_equal_jitter_has_floor():
    d = equal_jitter(3, base=1.0, cap=60.0)   # ceiling = 8
    assert d >= 4.0                            # never below half the ceiling

def test_jitter_spreads_population():
    # 1000 clients all on attempt 3 should NOT cluster on one value.
    delays = [full_jitter(3) for _ in range(1000)]
    assert statistics.pstdev(delays) > 1.0     # real spread, not a spike

To confirm the effect end to end, send a burst of failing deliveries and watch when the retries actually arrive at a stub endpoint:

# Count retry arrivals per second; jittered retries should be roughly flat,
# not a single tall bar at one timestamp.
grep 'retry_attempt' dispatcher.log | awk '{print $1}' | uniq -c

A correctly jittered system shows retry counts distributed across many seconds; a broken (un-jittered) one shows nearly all retries on a single second.

Failure modes and gotchas

Jitter applied to the wrong value. Adding random.uniform(0, 1) on top of the full exponential delay barely changes anything — the spread must span a large fraction of the ceiling. Full jitter draws from [0, ceiling], not [ceiling, ceiling+1].
Re-synchronization after the first round. Equal jitter with a too-large fixed floor lets clients drift back into alignment over several attempts. If you see periodic spikes, widen the random component or switch to full or decorrelated jitter.
Forgetting to carry state for decorrelated jitter. Recomputing from base every attempt instead of from prev_delay collapses decorrelated jitter into a fixed small range. Persist prev across iterations as in Step 5.
Capping after jittering. Apply the min(cap, ...) to the exponential before jittering (full/equal) or inside the formula (decorrelated). Clamping the post-jitter value can pile probability mass exactly on the cap, re-creating a mini herd at the ceiling.
No jitter on circuit-breaker probes. Even with jittered retries, an un-jittered Half-Open probe storm re-trips a recovering endpoint. Jitter probe timing too, and limit concurrent probes across the fleet.

Frequently Asked Questions

Does adding jitter change the delivery window I promised consumers?

Yes, and usually in the direction people do not expect. Full jitter sleeps half the ceiling on average, so a schedule whose un-jittered sum was an hour now burns through its attempts in roughly half that time while the worst case is unchanged. If the window was derived from the un-jittered sum you are covering a long outage for less time than you think, so recompute the attempt cap against the expected sum rather than the maximum.

Is random.random() good enough, or do workers need their own seed?

The module seeds itself from OS entropy at import, so separate processes are already independent — the case that bites is forking. A pre-forking server or a multiprocessing pool that forks after import hands every child an identical generator state, and every child then draws the same delay sequence, which is exactly the synchronized herd jitter exists to prevent. Re-seed in the post-fork hook, or use random.SystemRandom inside the dispatcher.

Does jitter still apply when the delay is a scheduled timestamp rather than an in-process sleep?

It applies, and the scheduler is the better place for it. Randomize the visibility delay or the scheduled-for timestamp at the moment you enqueue the next attempt, so the spread is durable and survives a worker restart that would otherwise realign every pending retry on the same reboot instant. Sleeping in the worker also pins a concurrency slot for the whole delay, which is the more expensive way to wait.

Should a Retry-After value be jittered too?

Treat it as a floor and add jitter on top of it, never subtract. Every dispatcher that received the same 429 received the same number, so honouring it exactly recreates a synchronized arrival at the precise second the endpoint claimed it would be ready — often the second it is most fragile. A uniform draw across ten to twenty percent of the value breaks the alignment without meaningfully delaying recovery.

With decorrelated jitter, does the delay ceiling still do anything?

It is the only thing bounding the walk. Each draw is taken from a window three times the last sampled delay, so a run of high draws escalates fast and, without the cap in the formula, one unlucky delivery ends up scheduled well past the window it was supposed to land in. The cap is also the point at which the strategy stops widening, after which further attempts buy time but no additional spread.

Can I draw the random offset once per delivery instead of once per attempt?

No — a single offset reused across a delivery's attempts spreads the first retry and nothing afterwards. Because every client keeps its constant offset, the population is just as tightly grouped on attempt five as it would be with no jitter at all, merely shifted along the time axis. Draw fresh on every attempt; a random call costs nothing next to the retry it is spacing out.