Webhook Rate Limiting and Backpressure

Sustaining Resilient Delivery & Retry Strategies at scale means a dispatcher must throttle itself: rate limiting caps how fast it offers events to a consumer, while backpressure is the feedback that slows the producer when the consumer cannot keep up. Without both, a healthy spike in events becomes a self-inflicted outage — the dispatcher saturates a slow endpoint, retries pile on top of the original load, and a recoverable slowdown cascades into total failure. This page covers the two halves together in Python: a token bucket governs the steady-state send rate, concurrency caps and queue-depth signals convert downstream slowness into producer-side slowdown, and HTTP 429 Retry-After responses let the consumer assert its own limit.

[Figure: Token bucket and backpressure feedback loop. Producer events enter a token bucket (refill r/sec), which grants tokens into a bounded queue (depth = signal); workers under a concurrency cap deliver to the consumer, and 429 / Retry-After plus queue depth feed back to slow the producer.]
The token bucket sets the steady-state rate; a bounded queue and 429 responses feed back to slow the producer when the consumer falls behind.

Token Bucket Rate Limiting

A token bucket is the workhorse for outbound dispatch because it permits short bursts while bounding the long-run average. The bucket holds up to capacity tokens and refills at rate tokens per second; each send consumes one token, and an empty bucket forces the dispatcher to wait. Tuning is two-dimensional: rate matches the consumer’s documented sustained throughput, while capacity controls how large a burst you allow before the average reasserts itself. Over any interval of t seconds the bucket admits at most capacity + rate × t sends. A capacity equal to one second of rate behaves almost like a fixed-window limiter; a larger capacity tolerates spiky producers while the long-run average still stays at or below the consumer’s ceiling.
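The effect of capacity is easiest to see with the clock taken out of the picture. The following is a minimal sketch, not the dispatcher's own bucket: SimTokenBucket and sent_in_window are hypothetical names, and timestamps are passed in explicitly so the arithmetic is deterministic.

```python
class SimTokenBucket:
    """Token bucket driven by explicit timestamps (illustrative only)."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum tokens the bucket can hold
        self.tokens = capacity    # start full, so an initial burst is allowed
        self.updated = 0.0        # timestamp of the last refill

    def try_take(self, now: float) -> bool:
        """Refill for the elapsed time, then take one token if available."""
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def sent_in_window(rate: float, capacity: float, seconds: int, offered_per_sec: int) -> int:
    """Offer events at evenly spaced instants and count how many are admitted."""
    bucket = SimTokenBucket(rate, capacity)
    sent = 0
    for i in range(seconds * offered_per_sec):
        if bucket.try_take(now=i / offered_per_sec):
            sent += 1
    return sent


# Producer offers 4 events/sec for 10 s against a 2 tokens/sec bucket.
print(sent_in_window(rate=2, capacity=1, seconds=10, offered_per_sec=4))  # → 20
print(sent_in_window(rate=2, capacity=5, seconds=10, offered_per_sec=4))  # → 24
```

Both runs settle at the same 2 sends per second; the larger capacity only adds the four extra sends of the initial burst.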

The alternatives trade smoothness for simplicity. A fixed-window counter (N requests per minute) is trivial but admits double-rate bursts at window boundaries. A leaky bucket enforces a perfectly smooth output rate with no burst allowance, which is ideal for strict per-endpoint contracts but wasteful when the consumer could absorb occasional spikes. For multi-instance dispatchers, the bucket state must be shared — a Redis-backed token bucket (often a single atomic Lua script) keeps the global rate correct no matter how many workers draw from it, the same coordination pattern used for nonce-based replay protection state.
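The boundary problem with fixed windows can be shown in a few lines. This sketch compares a fixed-window counter with a token bucket of the same nominal limit (10 per minute); both classes and the BURST timestamps are hypothetical and exist only for the comparison.

```python
class FixedWindowCounter:
    """Allow at most `limit` requests per `window` seconds, reset at each boundary."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.window_id = None
        self.count = 0

    def allow(self, now: float) -> bool:
        window_id = int(now // self.window)
        if window_id != self.window_id:   # a new window starts: reset the counter
            self.window_id = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SimpleBucket:
    """Token bucket with explicit timestamps, starting full at t = 0."""

    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.updated = capacity, 0.0

    def allow(self, now: float) -> bool:
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def admitted(limiter, timestamps) -> int:
    return sum(1 for t in timestamps if limiter.allow(t))


# Ten requests just before the minute boundary and ten just after it.
BURST = [59.0 + i * 0.1 for i in range(10)] + [60.0 + i * 0.1 for i in range(10)]

print(admitted(FixedWindowCounter(limit=10, window=60.0), BURST))   # → 20
print(admitted(SimpleBucket(rate=10 / 60, capacity=10), BURST))     # → 10
```

The fixed window admits all 20 requests within two seconds, twice the intended rate, while the bucket admits its burst of 10 and then makes the rest wait for refill.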

Concurrency Caps and Queue-Depth Signals

Rate limiting bounds how often you start a send; a concurrency cap bounds how many are in flight at once. The two are distinct controls and you need both: a generous rate with no concurrency cap lets a sudden batch of slow responses open thousands of simultaneous connections and exhaust sockets and memory. An asyncio.Semaphore (or a fixed worker pool) caps in-flight deliveries; sizing it to the consumer’s connection limit is what actually protects a slow endpoint.
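A small experiment makes the concurrency cap visible. In this sketch, peak_inflight is a hypothetical helper and asyncio.sleep stands in for a slow HTTP call; it launches many deliveries at once and records the highest number that were ever in flight together.

```python
import asyncio


async def peak_inflight(total: int, max_inflight: int) -> int:
    """Run `total` fake deliveries and return the peak number in flight."""
    sem = asyncio.Semaphore(max_inflight)
    inflight = 0
    peak = 0

    async def deliver() -> None:
        nonlocal inflight, peak
        async with sem:                  # blocks once max_inflight are active
            inflight += 1
            peak = max(peak, inflight)
            await asyncio.sleep(0.01)    # stand-in for a slow consumer response
            inflight -= 1

    await asyncio.gather(*(deliver() for _ in range(total)))
    return peak


print(asyncio.run(peak_inflight(total=50, max_inflight=5)))  # → 5
```

All 50 coroutines start immediately, but the semaphore never lets more than five past the gate, which is the property that protects the consumer's connection limit.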

Queue depth is the primary backpressure signal. When dispatch work lands in a bounded queue, a full queue means consumers are draining slower than producers are filling — the signal to push back arrives automatically as the enqueue operation blocks or fails. An unbounded queue hides this until the process runs out of memory, converting a slow consumer into a crash. Watch the high-water mark: when depth crosses a threshold, slow the producer (shed, pause, or delay intake) rather than letting latency grow without bound. This shares the failure-isolation goal of Circuit Breaker Patterns, but where a breaker stops traffic on errors, backpressure modulates traffic on saturation.
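One way to act on the signal is to shed at intake instead of blocking. The sketch below assumes a hypothetical Intake wrapper around asyncio.Queue, and the 80% high-water mark is an arbitrary illustrative threshold, not a recommended value.

```python
import asyncio


class Intake:
    """Bounded intake that sheds load when the queue is full (illustrative)."""

    def __init__(self, maxsize: int, high_water: float = 0.8):
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
        self.high_water = int(maxsize * high_water)  # depth that counts as saturated
        self.shed = 0                                # events rejected so far

    def saturated(self) -> bool:
        """True once depth reaches the high-water mark: time to slow the producer."""
        return self.queue.qsize() >= self.high_water

    def offer(self, event) -> bool:
        """Non-blocking enqueue; returns False when the event was shed."""
        try:
            self.queue.put_nowait(event)
            return True
        except asyncio.QueueFull:
            self.shed += 1
            return False


async def demo() -> tuple:
    intake = Intake(maxsize=10)
    accepted = sum(intake.offer(i) for i in range(15))  # nobody is draining
    return accepted, intake.shed, intake.saturated()


print(asyncio.run(demo()))  # → (10, 5, True)
```

Blocking with `await queue.put(...)` is the alternative when the producer can afford to wait; shedding suits producers that must never stall, at the cost of needing a replay path for rejected events.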

Honoring 429 and Retry-After

A well-behaved consumer asserts its own limit by returning HTTP 429 with a Retry-After header. Treating 429 as a generic failure is a common and damaging bug: feeding it into Exponential Backoff Algorithms ignores the explicit instruction the consumer just gave you. Honor Retry-After as a minimum: the value is either a number of seconds or an HTTP-date, so parse both forms and wait at least that long. Fall back to exponential backoff with jitter only when the header is absent or unparseable. Critically, a 429 should also lower the bucket’s own rate for that endpoint, so the next burst does not immediately re-trigger the limit. Apply adaptive concurrency as well: on repeated 429s, shrink the in-flight cap; on a clean streak, grow it back toward the ceiling.
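The two decisions described here, how long to wait and how far to shrink the in-flight cap, can be sketched separately from any HTTP client. The names parse_retry_after, retry_delay, and AdaptiveLimit are hypothetical, and the halve-on-429, add-one-on-success policy is one common choice rather than the only reasonable one.

```python
import random
import time
from datetime import timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str], now: Optional[float] = None) -> Optional[float]:
    """Return the wait in seconds, or None if the header is absent or invalid."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():                       # delta-seconds form, e.g. "120"
        return float(value)
    try:
        when = parsedate_to_datetime(value)   # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:                   # treat a zone-less date as UTC
        when = when.replace(tzinfo=timezone.utc)
    current = time.time() if now is None else now
    return max(0.0, when.timestamp() - current)


def retry_delay(header: Optional[str], attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Prefer the consumer's hint; otherwise use capped backoff with full jitter."""
    hinted = parse_retry_after(header)
    if hinted is not None:
        return hinted
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


class AdaptiveLimit:
    """In-flight cap that halves on a 429 and creeps back up on success."""

    def __init__(self, ceiling: int, floor: int = 1):
        self.ceiling, self.floor, self.limit = ceiling, floor, ceiling

    def on_429(self) -> None:
        self.limit = max(self.floor, self.limit // 2)

    def on_success(self) -> None:
        self.limit = min(self.ceiling, self.limit + 1)
```

Note that asyncio.Semaphore cannot be resized, so applying AdaptiveLimit.limit to live traffic needs a gate that compares the current in-flight count against the limit; that part is left out of this sketch.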

Failure Mode Analysis & Mitigation

| Failure Mode | Impact | Mitigation Strategy |
| --- | --- | --- |
| Unbounded dispatch queue | Slow consumer drives the producer out of memory and crashes it | Use a bounded queue; treat a full queue as the backpressure signal |
| 429 fed into blind retry | Dispatcher ignores the consumer’s stated limit and amplifies the overload | Honor Retry-After exactly; lower the per-endpoint token rate on 429 |
| Rate cap without concurrency cap | A burst of slow responses opens thousands of connections, exhausting sockets | Add an asyncio.Semaphore sized to the consumer’s connection limit |
| Per-instance rate limits | N workers each enforce the limit, so the global rate is N× the target | Share token-bucket state in Redis with an atomic refill-and-take script |
| Retry storm on recovery | All paused deliveries fire at once when the consumer returns | Drain through the same token bucket and add jitter to release timing |

Runnable Implementation Example

This async dispatcher combines a token bucket, a concurrency cap, a bounded queue, and 429 handling.

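import asyncio
import random
import time
from email.utils import parsedate_to_datetime

import httpx


class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self._tokens = capacity
        self._updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        # The lock is held while sleeping on purpose: waiters line up behind
        # it and are served one at a time instead of racing for a token.
        async with self._lock:
            while True:
                now = time.monotonic()
                # Refill based on elapsed monotonic time.
                self._tokens = min(
                    self.capacity,
                    self._tokens + (now - self._updated) * self.rate,
                )
                self._updated = now
                if self._tokens >= 1:
                    self._tokens -= 1
                    return
                # Sleep just long enough for one token to accrue.
                await asyncio.sleep((1 - self._tokens) / self.rate)

    def throttle(self, factor: float = 0.5, floor: float = 1.0) -> None:
        """Shrink the steady-state rate after a 429, but not below `floor`."""
        # min() keeps a rate that is already under the floor from being raised.
        self.rate = max(min(self.rate, floor), self.rate * factor)


def _retry_after(resp: httpx.Response, attempt: int) -> float:
    """Seconds to wait after a 429: the consumer's hint if present, else backoff."""
    value = resp.headers.get("Retry-After", "").strip()
    if value.isdigit():
        return float(value)                          # delta-seconds form
    try:
        when = parsedate_to_datetime(value)          # HTTP-date form
        return max(0.0, when.timestamp() - time.time())
    except (TypeError, ValueError):
        # Header absent or unparseable: capped exponential backoff with jitter.
        return random.uniform(0.0, min(60.0, 2.0 ** attempt))


class Dispatcher:
    def __init__(self, rate: float, max_inflight: int, queue_size: int):
        self.bucket = TokenBucket(rate, capacity=rate)
        self.sem = asyncio.Semaphore(max_inflight)   # concurrency cap
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=queue_size)  # bounded
        self.client = httpx.AsyncClient(timeout=10.0)

    async def submit(self, event: dict) -> None:
        # A full queue is the backpressure signal: block the producer here.
        await self.queue.put(event)

    async def _send(self, event: dict, max_attempts: int = 5) -> None:
        for attempt in range(max_attempts):
            await self.bucket.acquire()              # rate limit
            async with self.sem:                     # concurrency cap
                resp = await self.client.post(event["url"], json=event["body"])
            if resp.status_code != 429:
                return
            # Honor the consumer's explicit limit, then lower our rate.
            self.bucket.throttle()
            # Wait outside the semaphore so the pause does not hold an
            # in-flight slot, and retry in place: re-queueing from a worker
            # can deadlock once the bounded queue is full.
            await asyncio.sleep(_retry_after(resp, attempt))
        # Attempts exhausted: a real dispatcher would dead-letter the event here.

    async def run(self, workers: int = 4) -> None:
        # The semaphore only binds when workers > max_inflight.
        async def worker():
            while True:
                event = await self.queue.get()
                try:
                    await self._send(event)
                except httpx.HTTPError:
                    # Transport error or timeout: leave it to the retry policy
                    # rather than letting one failure kill the worker.
                    pass
                finally:
                    self.queue.task_done()
        await asyncio.gather(*(worker() for _ in range(workers)))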

Operational Workflows & CI/CD Integration

Export the levers as runtime configuration, not constants: per-endpoint rate, capacity, max_inflight, and queue_size should be tunable without a redeploy so operators can throttle a misbehaving integration in seconds. Emit queue depth, token-bucket fill level, in-flight count, and 429-rate as metrics, and alert when queue depth trends toward its bound — that is the leading indicator of an impending backlog, well before latency SLOs are breached. In load tests, drive the dispatcher against a deliberately slow endpoint and assert the queue stays bounded and memory stays flat; a test that only exercises the happy path will never catch an unbounded-queue regression.
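The load-test assertion can be prototyped without a network. This is a rough sketch under simplifying assumptions: run_load_test is a hypothetical helper, the slow endpoint is simulated with asyncio.sleep, and only queue depth is checked (a real test would also sample process memory).

```python
import asyncio


async def run_load_test(events: int, queue_size: int, consumer_delay: float) -> int:
    """Push `events` through a bounded queue and return the deepest it ever got."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=queue_size)
    max_depth = 0

    async def producer() -> None:
        nonlocal max_depth
        for i in range(events):
            await queue.put(i)                       # blocks while the queue is full
            max_depth = max(max_depth, queue.qsize())

    async def slow_consumer() -> None:
        while True:
            await queue.get()
            await asyncio.sleep(consumer_delay)      # deliberately slow endpoint
            queue.task_done()

    consumer = asyncio.create_task(slow_consumer())
    await producer()
    await queue.join()                               # wait until everything is drained
    consumer.cancel()
    return max_depth


depth = asyncio.run(run_load_test(events=200, queue_size=8, consumer_delay=0.001))
assert depth <= 8, "queue grew past its bound: backpressure is not working"
```

If someone later swaps the bounded queue for an unbounded one, the producer stops blocking, the recorded depth climbs toward the event count, and the assertion fails.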

Debugging Checklist