Adding Jitter to Webhook Retry Backoff
Exponential backoff alone does not prevent a retry storm — it can cause one. If a downstream endpoint blips and a thousand webhook deliveries fail at the same instant, plain base * 2^n backoff schedules all thousand retries for the same future moment. They stampede the recovering endpoint in lockstep, fail again, and re-synchronize for the next round. The fix is jitter: deliberate randomization of the delay so retries spread out instead of clustering. This guide, part of Exponential Backoff Algorithms, compares the three standard jitter strategies — full, equal, and decorrelated — with runnable code for each, and shows how to pick one. It builds directly on implementing exponential backoff in Python webhook handlers; read that first for the retry-loop scaffolding.
The thundering herd is the core problem jitter solves. Without randomization, every client that failed at time T retries at T + delay, recreating the spike. Jitter decorrelates those retries so the load on the recovering endpoint stays smooth.
Prerequisites
- Python 3.10+ with
asyncioand (optionally)httpxfor the integration example. - A working bounded retry loop with an exponential schedule and a delay ceiling — the structure from implementing exponential backoff in Python webhook handlers.
- A monotonic clock for sleeps (
asyncio.sleepis fine) so NTP adjustments cannot produce negative delays. - A clear picture of how many independent dispatchers can fail simultaneously — the larger the fleet, the more jitter matters.
Step 1: Establish the base backoff schedule
Jitter randomizes around an exponential ceiling, so start with the un-jittered delay. Cap it so deep retries don’t sleep for hours:
def base_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
# The exponential ceiling jitter will draw from. attempt is 0-indexed.
return min(cap, base * (2 ** attempt))
For base=1.0, cap=60.0 this yields 1, 2, 4, 8, 16, 32, 60, 60… seconds. Every strategy below transforms this number into an actual, randomized sleep.
Step 2: Full jitter
Full jitter draws the delay uniformly from zero up to the exponential ceiling. It maximizes spread and is the AWS-recommended default for high-contention systems, because the expected concurrency at any instant is minimized:
import random
def full_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
# Uniform in [0, ceiling]; maximum spread, lowest collision probability.
return random.uniform(0, base_delay(attempt, base, cap))
The trade-off: a draw can be near zero, so an individual client may retry almost immediately. That is fine for spreading load but means full jitter offers no minimum spacing between a client’s own attempts.
Step 3: Equal jitter
Equal jitter keeps half the ceiling as a fixed floor and randomizes the other half. This guarantees a minimum delay (so a client never hammers instantly) while still decorrelating retries across clients:
import random
def equal_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
half = base_delay(attempt, base, cap) / 2
# Floor of `half` plus a uniform `half`; delay is always >= half the ceiling.
return half + random.uniform(0, half)
Equal jitter is a good middle ground when you want predictable lower-bound spacing for latency budgeting but still need herd protection. Its spread is narrower than full jitter, so under extreme contention it protects the endpoint slightly less.
Step 4: Decorrelated jitter
Decorrelated jitter derives each delay from the previous delay rather than from the attempt number, letting the window grow organically and spread widely over successive retries. It tends to drain a backlog faster than full jitter while still avoiding synchronization:
import random
def decorrelated_jitter(prev_delay: float, base: float = 1.0, cap: float = 60.0) -> float:
# Grow from the last delay; sleep = min(cap, uniform(base, prev*3)).
return min(cap, random.uniform(base, prev_delay * 3))
Because it is stateful, you carry prev_delay across iterations, seeding it with base on the first attempt. This is the strategy to reach for when total time-to-recovery matters as much as smoothness.
Step 5: Choose a strategy and integrate it
Pick based on your dominant concern, then wire it into the retry loop:
| Strategy | Spread | Minimum delay | Best when |
|---|---|---|---|
| Full jitter | Widest | None (can be ~0) | Large fleets, herd avoidance is paramount |
| Equal jitter | Moderate | Half the ceiling | You need a predictable lower bound for latency budgets |
| Decorrelated jitter | Wide, growing | base |
Draining a backlog quickly while staying smooth |
| No jitter | None | Full ceiling | Single client, no contention (rarely correct) |
import asyncio, random
async def deliver_with_jitter(send, *, base=1.0, cap=60.0, max_attempts=6):
prev = base
for attempt in range(max_attempts):
try:
return await send()
except Exception:
if attempt == max_attempts - 1:
raise
# Swap in full_jitter / equal_jitter as needed; decorrelated shown:
delay = decorrelated_jitter(prev, base, cap)
prev = delay
await asyncio.sleep(delay)
raise RuntimeError("unreachable")
When many endpoints fail at once and breakers trip across the fleet, combine jitter with per-endpoint circuit breaker state machines so probes during recovery are also staggered.
Verification
Test the two invariants that matter: delays stay within bounds, and a population of retries actually spreads.
import statistics
def test_full_jitter_bounds():
for attempt in range(8):
d = full_jitter(attempt, base=1.0, cap=60.0)
assert 0 <= d <= min(60.0, 2 ** attempt)
def test_equal_jitter_has_floor():
d = equal_jitter(3, base=1.0, cap=60.0) # ceiling = 8
assert d >= 4.0 # never below half the ceiling
def test_jitter_spreads_population():
# 1000 clients all on attempt 3 should NOT cluster on one value.
delays = [full_jitter(3) for _ in range(1000)]
assert statistics.pstdev(delays) > 1.0 # real spread, not a spike
To confirm the effect end to end, send a burst of failing deliveries and watch when the retries actually arrive at a stub endpoint:
# Count retry arrivals per second; jittered retries should be roughly flat,
# not a single tall bar at one timestamp.
grep 'retry_attempt' dispatcher.log | awk '{print $1}' | uniq -c
A correctly jittered system shows retry counts distributed across many seconds; a broken (un-jittered) one shows nearly all retries on a single second.
Failure modes and gotchas
- Jitter applied to the wrong value. Adding
random.uniform(0, 1)on top of the full exponential delay barely changes anything — the spread must span a large fraction of the ceiling. Full jitter draws from[0, ceiling], not[ceiling, ceiling+1]. - Re-synchronization after the first round. Equal jitter with a too-large fixed floor lets clients drift back into alignment over several attempts. If you see periodic spikes, widen the random component or switch to full or decorrelated jitter.
- Forgetting to carry state for decorrelated jitter. Recomputing from
baseevery attempt instead of fromprev_delaycollapses decorrelated jitter into a fixed small range. Persistprevacross iterations as in Step 5. - Capping after jittering. Apply the
min(cap, ...)to the exponential before jittering (full/equal) or inside the formula (decorrelated). Clamping the post-jitter value can pile probability mass exactly on the cap, re-creating a mini herd at the ceiling. - No jitter on circuit-breaker probes. Even with jittered retries, an un-jittered Half-Open probe storm re-trips a recovering endpoint. Jitter probe timing too, and limit concurrent probes across the fleet.