Adding Jitter to Webhook Retry Backoff

Exponential backoff alone does not prevent a retry storm — it can cause one. If a downstream endpoint blips and a thousand webhook deliveries fail at the same instant, plain base * 2^n backoff schedules all thousand retries for the same future moment. They stampede the recovering endpoint in lockstep, fail again, and re-synchronize for the next round. The fix is jitter: deliberate randomization of the delay so retries spread out instead of clustering. This guide, part of Exponential Backoff Algorithms, compares the three standard jitter strategies — full, equal, and decorrelated — with runnable code for each, and shows how to pick one. It builds directly on implementing exponential backoff in Python webhook handlers; read that first for the retry-loop scaffolding.

The thundering herd is the core problem jitter solves. Without randomization, every client that failed at time T retries at T + delay, recreating the spike. Jitter decorrelates those retries so the load on the recovering endpoint stays smooth.

Retry spike versus jittered spread Without jitter all retries land in one spike that re-overwhelms the endpoint; with jitter the same retries spread across the window. No jitter all retries land at once With jitter retries spread across the window time
Without jitter, exponential backoff still synchronizes retries into a spike; jitter spreads the same retries across the window so the endpoint is not re-overwhelmed.

Prerequisites

Step 1: Establish the base backoff schedule

Jitter randomizes around an exponential ceiling, so start with the un-jittered delay. Cap it so deep retries don’t sleep for hours:

def base_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    # The exponential ceiling jitter will draw from. attempt is 0-indexed.
    return min(cap, base * (2 ** attempt))

For base=1.0, cap=60.0 this yields 1, 2, 4, 8, 16, 32, 60, 60… seconds. Every strategy below transforms this number into an actual, randomized sleep.

Step 2: Full jitter

Full jitter draws the delay uniformly from zero up to the exponential ceiling. It maximizes spread and is the AWS-recommended default for high-contention systems, because the expected concurrency at any instant is minimized:

import random

def full_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    # Uniform in [0, ceiling]; maximum spread, lowest collision probability.
    return random.uniform(0, base_delay(attempt, base, cap))

The trade-off: a draw can be near zero, so an individual client may retry almost immediately. That is fine for spreading load but means full jitter offers no minimum spacing between a client’s own attempts.

Step 3: Equal jitter

Equal jitter keeps half the ceiling as a fixed floor and randomizes the other half. This guarantees a minimum delay (so a client never hammers instantly) while still decorrelating retries across clients:

import random

def equal_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    half = base_delay(attempt, base, cap) / 2
    # Floor of `half` plus a uniform `half`; delay is always >= half the ceiling.
    return half + random.uniform(0, half)

Equal jitter is a good middle ground when you want predictable lower-bound spacing for latency budgeting but still need herd protection. Its spread is narrower than full jitter, so under extreme contention it protects the endpoint slightly less.

Step 4: Decorrelated jitter

Decorrelated jitter derives each delay from the previous delay rather than from the attempt number, letting the window grow organically and spread widely over successive retries. It tends to drain a backlog faster than full jitter while still avoiding synchronization:

import random

def decorrelated_jitter(prev_delay: float, base: float = 1.0, cap: float = 60.0) -> float:
    # Grow from the last delay; sleep = min(cap, uniform(base, prev*3)).
    return min(cap, random.uniform(base, prev_delay * 3))

Because it is stateful, you carry prev_delay across iterations, seeding it with base on the first attempt. This is the strategy to reach for when total time-to-recovery matters as much as smoothness.

Step 5: Choose a strategy and integrate it

Pick based on your dominant concern, then wire it into the retry loop:

Strategy Spread Minimum delay Best when
Full jitter Widest None (can be ~0) Large fleets, herd avoidance is paramount
Equal jitter Moderate Half the ceiling You need a predictable lower bound for latency budgets
Decorrelated jitter Wide, growing base Draining a backlog quickly while staying smooth
No jitter None Full ceiling Single client, no contention (rarely correct)
import asyncio, random

async def deliver_with_jitter(send, *, base=1.0, cap=60.0, max_attempts=6):
    prev = base
    for attempt in range(max_attempts):
        try:
            return await send()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Swap in full_jitter / equal_jitter as needed; decorrelated shown:
            delay = decorrelated_jitter(prev, base, cap)
            prev = delay
            await asyncio.sleep(delay)
    raise RuntimeError("unreachable")

When many endpoints fail at once and breakers trip across the fleet, combine jitter with per-endpoint circuit breaker state machines so probes during recovery are also staggered.

Verification

Test the two invariants that matter: delays stay within bounds, and a population of retries actually spreads.

import statistics

def test_full_jitter_bounds():
    for attempt in range(8):
        d = full_jitter(attempt, base=1.0, cap=60.0)
        assert 0 <= d <= min(60.0, 2 ** attempt)

def test_equal_jitter_has_floor():
    d = equal_jitter(3, base=1.0, cap=60.0)   # ceiling = 8
    assert d >= 4.0                            # never below half the ceiling

def test_jitter_spreads_population():
    # 1000 clients all on attempt 3 should NOT cluster on one value.
    delays = [full_jitter(3) for _ in range(1000)]
    assert statistics.pstdev(delays) > 1.0     # real spread, not a spike

To confirm the effect end to end, send a burst of failing deliveries and watch when the retries actually arrive at a stub endpoint:

# Count retry arrivals per second; jittered retries should be roughly flat,
# not a single tall bar at one timestamp.
grep 'retry_attempt' dispatcher.log | awk '{print $1}' | uniq -c

A correctly jittered system shows retry counts distributed across many seconds; a broken (un-jittered) one shows nearly all retries on a single second.

Failure modes and gotchas