Webhook Rate Limiting and Backpressure

Sustaining Resilient Delivery & Retry Strategies at scale means a dispatcher must throttle itself: rate limiting caps how fast it offers events to a consumer, while backpressure is the feedback that slows the producer when the consumer cannot keep up. Without both, a healthy spike in events becomes a self-inflicted outage — the dispatcher saturates a slow endpoint, retries pile on top of the original load, and a recoverable slowdown cascades into total failure. This page covers the two halves together in Python: a token bucket governs the steady-state send rate, concurrency caps and queue-depth signals convert downstream slowness into producer-side slowdown, and HTTP 429 Retry-After responses let the consumer assert its own limit.

The token bucket sets the steady-state rate; a bounded queue and 429 responses feed back to slow the producer when the consumer falls behind.

Token Bucket Rate Limiting

A token bucket is the workhorse for outbound dispatch because it permits short bursts while bounding the long-run average. The bucket holds up to capacity tokens and refills at rate tokens per second; each send consumes one token, and an empty bucket forces the dispatcher to wait. Tuning is two-dimensional: rate matches the consumer’s documented sustained throughput, while capacity controls how large a burst you allow before the average reasserts itself. A capacity equal to one second of rate behaves almost like a fixed-window limiter; a larger capacity tolerates spiky producers without dropping below the consumer’s ceiling. The full per-endpoint implementation, including the atomic Lua refill script, is worked through in token bucket rate limiting for webhook senders.

The alternatives trade smoothness for simplicity. A fixed-window counter (N requests per minute) is trivial but admits double-rate bursts at window boundaries. A leaky bucket enforces a perfectly smooth output rate with no burst allowance, which is ideal for strict per-endpoint contracts but wasteful when the consumer could absorb occasional spikes. For multi-instance dispatchers, the bucket state must be shared — a Redis-backed token bucket (often a single atomic Lua script) keeps the global rate correct no matter how many workers draw from it, the same coordination pattern used for nonce-based replay protection state.

Only the token bucket buys burst tolerance without a window edge, which is why it is the default for outbound webhook dispatch.

Sizing Rate and Capacity from a Consumer’s Real Limit

Rate and capacity are two independent decisions and conflating them is the most common sizing error. Rate is a contract with the consumer: if an integration documents 100 requests per minute, the sustainable rate is 1.67 per second and no amount of burst tolerance changes that. Capacity is a decision about your own smoothness: it is the number of deliveries you are willing to emit back-to-back after an idle period. Setting capacity equal to one second of rate makes the limiter behave like a metronome, which is safe but wastes headroom whenever the consumer has been idle. Setting capacity to ten seconds of rate lets a batch of accumulated events go out in one burst, which is usually what a consumer’s own limiter tolerates and what makes fan-out feel responsive.

The arithmetic to run before choosing capacity is the burst-duration calculation. A bucket with capacity C and rate R that starts full can emit C requests instantly, and then sustains R per second thereafter; the burst is absorbed by the consumer over roughly C / R seconds of its own queueing. With R = 20/s and C = 200, the consumer receives 200 requests in a few hundred milliseconds and must buffer ten seconds of work — fine for a queue-backed consumer, fatal for one that processes synchronously with a 30-connection pool and a 400 ms handler, because 200 concurrent arrivals against 30 slots means the 31st request waits and the 200th waits nearly three seconds. When you do not know how the consumer is built, assume synchronous and keep capacity near 2 × R.
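The burst arithmetic above can be sketched in a few lines of Python. The function names and the fixed-handler-time model are assumptions made for illustration; real consumers have variable latency, so treat the result as a rough bound rather than a prediction.

```python
import math


def burst_absorb_seconds(capacity: float, rate: float) -> float:
    """Seconds of sustained-rate work released when a full bucket empties (C / R)."""
    return capacity / rate


def last_request_finish_seconds(burst: int, pool_size: int, handler_seconds: float) -> float:
    """When the last of `burst` simultaneous requests completes at a synchronous
    consumer with `pool_size` slots and a fixed handler time (assumed model)."""
    waves = math.ceil(burst / pool_size)  # requests are served in waves of pool_size
    return waves * handler_seconds


print(f"{burst_absorb_seconds(200, 20):.1f} s of buffered work")
print(f"{last_request_finish_seconds(200, 30, 0.4):.1f} s until the last request finishes")
```

With the figures from the text, 200 arrivals against 30 slots take seven waves of 0.4 s each, which is the "nearly three seconds" case.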

Consumer’s stated limit	Rate to configure	Capacity	Burst absorbed	Why
100 requests / minute	1.67 / s	10	~6 s of work	Small burst keeps a fragile endpoint smooth
20 requests / second	20 / s	40	2 s of work	Default shape: two seconds of burst tolerance
500 requests / second, queue-backed	500 / s	2000	4 s of work	Consumer buffers internally, so bursts are cheap
Undocumented, synchronous handler	10 / s	20	2 s of work	Deliberately conservative until measured
Undocumented, known queue-backed	50 / s	500	10 s of work	Burst is absorbed by the consumer’s queue, not its handlers

Two refinements pay for themselves quickly. First, tokens should be taken before the connection is opened rather than after, so that a burst never manifests as a burst of TCP connects — the rate limiter and the connection pool must agree about what a unit of work is. Second, refill must be computed from elapsed time rather than from a periodic timer. Timer-driven refills drift under load precisely when accuracy matters, and a missed tick silently lowers your effective rate; the elapsed-time form in the implementation below is self-correcting because it derives tokens from the monotonic clock on every acquisition.

Concurrency Caps and Queue-Depth Signals

Rate limiting bounds how often you start a send; a concurrency cap bounds how many are in flight at once. The two are distinct controls and you need both: a generous rate with no concurrency cap lets a sudden batch of slow responses open thousands of simultaneous connections and exhaust sockets and memory. An asyncio.Semaphore (or a fixed worker pool) caps in-flight deliveries; sizing it to the consumer’s connection limit is what actually protects a slow endpoint.

Queue depth is the primary backpressure signal. When dispatch work lands in a bounded queue, a full queue means consumers are draining slower than producers are filling — the signal to push back arrives automatically as the enqueue operation blocks or fails. An unbounded queue hides this until the process runs out of memory, converting a slow consumer into a crash. Watch the high-water mark: when depth crosses a threshold, slow the producer (shed, pause, or delay intake) rather than letting latency grow without bound. This shares the failure-isolation goal of Circuit Breaker Patterns, but where a breaker stops traffic on errors, backpressure modulates traffic on saturation.
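A minimal sketch of the two ways a bounded `asyncio.Queue` surfaces that signal: `put_nowait` fails immediately with `asyncio.QueueFull`, while `put` blocks and can be bounded with a timeout. The helper names and the shed-on-full policy are assumptions for illustration, not a recommendation for every producer.

```python
import asyncio


def offer(queue: asyncio.Queue, event: dict) -> bool:
    """Non-blocking enqueue: a full queue is reported to the caller at once."""
    try:
        queue.put_nowait(event)
        return True
    except asyncio.QueueFull:
        return False  # caller decides whether to shed, pause, or delay intake


async def offer_with_wait(queue: asyncio.Queue, event: dict, wait_seconds: float) -> bool:
    """Blocking enqueue with an upper bound on how long the producer waits."""
    try:
        await asyncio.wait_for(queue.put(event), timeout=wait_seconds)
        return True
    except asyncio.TimeoutError:
        return False


async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)
    results = [offer(queue, {"id": i}) for i in range(3)]   # third offer is refused
    results.append(await offer_with_wait(queue, {"id": 3}, wait_seconds=0.05))
    return results


demo_results = asyncio.run(demo())
print(demo_results)
```

Nothing drains the queue in this demo, so both the third immediate offer and the timed offer report a full queue.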

Sizing the concurrency cap is an empirical exercise, not a guess. Start from Little’s law: the in-flight count you need to sustain a target rate is rate × mean_latency. A consumer that answers in 200 ms at a target of 50 events/sec needs about 10 concurrent connections; if the same consumer degrades to 2-second responses, holding 50/sec would demand 100 connections, which is exactly the moment you want the cap to bite instead of scaling with the damage. Set the ceiling from the consumer’s documented connection limit where one is published, and from the healthy-state calculation plus a small margin where it is not. A cap that is never reached costs nothing; a cap that is reached is doing its job.
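The sizing rule can be written down directly. This is a sketch under stated assumptions: the function names are hypothetical, and the 25% default margin is an arbitrary stand-in for the "small margin" in the text.

```python
import math
from typing import Optional


def required_inflight(rate_per_s: float, mean_latency_s: float) -> int:
    """Little's law: in-flight count needed to sustain a rate at a given latency."""
    # round() guards against float noise pushing an exact product over an integer
    return math.ceil(round(rate_per_s * mean_latency_s, 9))


def concurrency_cap(
    rate_per_s: float,
    healthy_latency_s: float,
    documented_limit: Optional[int] = None,
    margin: float = 0.25,  # assumed value for "a small margin"
) -> int:
    """Prefer the consumer's published connection limit; otherwise size from
    the healthy-state calculation plus a margin."""
    if documented_limit is not None:
        return documented_limit
    return math.ceil(required_inflight(rate_per_s, healthy_latency_s) * (1 + margin))


print(required_inflight(50, 0.2))   # healthy consumer
print(required_inflight(50, 2.0))   # degraded consumer: what the cap should refuse
print(concurrency_cap(50, 0.2))     # cap derived from the healthy state
```

The cap is computed from the healthy latency on purpose, so that it stays fixed when latency degrades instead of growing with the damage.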

The two controls also fail differently, which is why neither substitutes for the other. Hitting the rate limit means “we are going too fast” and resolves itself as tokens accrue. Hitting the concurrency cap means “the consumer is not answering” and resolves only when responses come back — so a saturated semaphore, unlike an empty bucket, is a health signal worth alerting on. Track both: tokens_available near zero is normal under load, but semaphore_available pinned at zero for more than a few seconds means latency has moved, and the right response is usually to trip a breaker rather than to keep queuing.

Drain Time Is the Alert, Not Queue Depth

Queue depth on its own is a number without a unit of urgency. Ten thousand queued deliveries is a non-event for an endpoint draining at 500 per second and a four-hour incident for one draining at 0.7 per second. The metric to put on the dashboard and on the alert is therefore derived: drain_time = depth / (drain_rate - arrival_rate) when the drain rate exceeds the arrival rate, and undefined — meaning “growing without bound” — when it does not. That undefined case is the one that should page, immediately and regardless of depth, because a queue whose arrival rate exceeds its drain rate has already failed; the only open question is how long until it hits its bound.

Work an example. An endpoint limited to 20 per second is receiving 35 events per second during a bulk import. The backlog grows at 15 per second, so a 50,000-message bound is reached in 3,333 seconds: 55 minutes of warning, which is plenty if anyone is looking and useless if the only alert fires when the queue is already full. After the import stops and arrivals fall to 5 per second, the backlog drains at a net 15 per second, so 50,000 messages take another 3,333 seconds to clear. From the start of the import to an empty queue is therefore about 6,700 seconds, close to an hour and fifty minutes, and an event enqueued at peak depth in a FIFO queue waits roughly 50,000 / 20 = 2,500 seconds (about 42 minutes) behind the backlog. Those durations are the numbers that belong in the incident review, not the peak depth.
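The derived metric and the worked example can be reproduced with two small functions. The names are hypothetical, and `None` is used here as one possible encoding of the "undefined, growing without bound" case.

```python
from typing import Optional


def drain_time_seconds(depth: float, drain_rate: float, arrival_rate: float) -> Optional[float]:
    """Projected seconds to empty the queue, or None when it is not shrinking."""
    net = drain_rate - arrival_rate
    if net <= 0:
        return None  # arrival >= drain: the backlog never clears
    return depth / net


def seconds_until_full(depth: float, bound: float, drain_rate: float, arrival_rate: float) -> Optional[float]:
    """Seconds until the queue hits its bound, or None when it is not growing."""
    growth = arrival_rate - drain_rate
    if growth <= 0:
        return None
    return (bound - depth) / growth


# Bulk import: 35/s arriving at an endpoint limited to 20/s, bound of 50,000.
print(drain_time_seconds(0, 20, 35))            # None: this is the case that should page
print(f"{seconds_until_full(0, 50_000, 20, 35):.0f} s until the bound is reached")
# After the import: arrivals fall to 5/s with 50,000 messages queued.
print(f"{drain_time_seconds(50_000, 20, 5):.0f} s to clear the backlog")
```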

Depth alone tells you nothing actionable; the projected drain time is what converts a rising queue into a decision about whether to shed, throttle the producer, or wait.

Three thresholds cover the useful cases. Alert at a projected drain time above the delivery promise for that endpoint — that is the point at which queuing has already broken the contract even though nothing has failed. Alert at any sustained period where the arrival rate exceeds the drain rate for longer than five minutes, since that is unbounded growth regardless of the current depth. And alert when the queue has been at its bound for more than a few seconds, because from that moment onward the system is shedding or blocking and the behaviour has changed qualitatively. Depth itself belongs on a dashboard as context, not on a pager.
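One way to express those three conditions as a single check is sketched below. The function name, the alert labels, and the 5-second reading of "a few seconds" are assumptions; the five-minute growth window comes from the text.

```python
from typing import List, Optional


def drain_alerts(
    projected_drain_s: Optional[float],   # None means the queue is not shrinking
    delivery_promise_s: float,
    growth_duration_s: float,             # how long arrival rate has exceeded drain rate
    at_bound_duration_s: float,           # how long the queue has sat at its bound
    growth_limit_s: float = 300.0,        # five minutes, as in the text
    at_bound_limit_s: float = 5.0,        # assumed meaning of "a few seconds"
) -> List[str]:
    """Return the names of the alert conditions that currently hold."""
    alerts = []
    if projected_drain_s is not None and projected_drain_s > delivery_promise_s:
        alerts.append("drain_time_exceeds_promise")
    if growth_duration_s > growth_limit_s:
        alerts.append("sustained_unbounded_growth")
    if at_bound_duration_s > at_bound_limit_s:
        alerts.append("queue_at_bound")
    return alerts


print(drain_alerts(3333.0, delivery_promise_s=600.0, growth_duration_s=0, at_bound_duration_s=0))
print(drain_alerts(None, delivery_promise_s=600.0, growth_duration_s=400, at_bound_duration_s=10))
```

Queue depth does not appear as an input at all, which matches its role as dashboard context.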

Honoring 429 and Retry-After

A well-behaved consumer asserts its own limit by returning HTTP 429 with a Retry-After header. Treating 429 as a generic failure is a common and damaging bug: feeding it into Exponential Backoff Algorithms ignores the explicit instruction the consumer just gave you. Honor Retry-After precisely, whether it arrives as a number of seconds or as an HTTP date: wait exactly that long, and only fall back to exponential backoff with jitter when the header is absent or cannot be parsed. Critically, a 429 should also lower the bucket’s own rate for that endpoint, so the next burst does not immediately re-trigger the limit. Apply adaptive concurrency: on repeated 429s, shrink the in-flight cap; on a clean streak, grow it back toward the ceiling.
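A sketch of the header handling, using only the standard library. The function names, the full-jitter fallback, and its `base` and `cap` values are assumptions for illustration; the header itself may carry either delta-seconds or an HTTP date.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Return the stated wait in seconds, or None if absent or unparseable."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():                      # delta-seconds form
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def delay_for_429(header: Optional[str], attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Prefer the consumer's instruction; fall back to jittered exponential backoff."""
    stated = parse_retry_after(header)
    if stated is not None:
        return stated
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


print(delay_for_429("120", attempt=3))   # the header wins over the backoff schedule
```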

A 429 is an instruction, not an error: honor the stated interval exactly, shrink the local rate, and only widen it again after a clean streak.
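The shrink-then-regrow behaviour for the in-flight cap can be sketched as a small counter. The class name, the halving step, and the 20-success streak are assumptions; `asyncio.Semaphore` has no public resize operation, so this sketch only tracks the target number and leaves enforcement to whatever limiter consults it.

```python
class AdaptiveLimit:
    """Multiplicative decrease on 429, additive increase after a clean streak."""

    def __init__(self, ceiling: int, floor: int = 1, clean_streak: int = 20):
        self.ceiling = ceiling
        self.floor = floor
        self.clean_streak = clean_streak   # successes required before growing by one
        self.limit = ceiling
        self._clean = 0

    def on_throttled(self) -> None:
        """A 429 arrived: halve the cap and restart the streak."""
        self.limit = max(self.floor, self.limit // 2)
        self._clean = 0

    def on_success(self) -> None:
        """A delivery succeeded: grow by one after enough consecutive successes."""
        self._clean += 1
        if self._clean >= self.clean_streak and self.limit < self.ceiling:
            self.limit += 1
            self._clean = 0


cap = AdaptiveLimit(ceiling=16)
cap.on_throttled()
cap.on_throttled()
for _ in range(20):
    cap.on_success()
print(cap.limit)
```

Growing one step at a time keeps recovery slower than the cut, so a consumer that is still fragile is not pushed straight back into its limit.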

Fairness and Head-of-Line Blocking Across Tenants

Rate limiting protects consumers from you. Fair queuing protects your consumers from each other, and a dispatcher that has the first without the second will eventually deliver a very unfair service. The mechanism is head-of-line blocking: a single FIFO queue drained by a fixed worker pool gives whichever tenant enqueued the most work a proportional share of every worker, so a customer running a 400,000-event backfill occupies the pipeline and a different customer’s password-reset webhook waits behind it. Nothing is failing, no alert fires, and the affected tenant experiences it as an outage.

The arithmetic makes the severity concrete. Thirty-two workers draining a shared FIFO at 20 deliveries per second per worker handle 640 per second. If one tenant enqueues 400,000 events, that tenant’s work alone occupies the queue for 625 seconds, and every event enqueued behind it inherits that delay — a ten-minute time-to-delivery for tenants who contributed a handful of events each. Splitting into per-tenant queues drained by a round-robin scheduler changes the outcome completely: the small tenants are served on every scheduling pass, so their latency is bounded by the number of active tenants rather than by the largest backlog.

Fair queuing changes who waits, not how fast you go: the same 640 deliveries per second serve small tenants in seconds instead of in minutes.

Implementations differ mainly in how they pick the next queue. Strict round-robin over active tenants is trivial and adequate when tenants are roughly comparable. Weighted round-robin, with weights derived from plan tier or from committed volume, is the usual production choice because it lets a large customer legitimately receive more throughput without being able to take all of it. Deficit round-robin is worth the extra complexity when payload sizes vary by an order of magnitude, since it schedules by bytes rather than by messages. Whatever the policy, the scheduler must skip tenants whose token bucket is empty rather than blocking on them — otherwise a single rate-limited endpoint stalls the rotation and you have reinvented head-of-line blocking one level up.
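A minimal sketch of strict round-robin with the skip rule, kept in memory for clarity. The class, the `has_token` callback (a non-consuming peek at the tenant's bucket), and the event shape are all hypothetical; weighted and deficit variants would change only how the next tenant is picked.

```python
from collections import deque
from typing import Callable, Deque, Dict, Optional, Tuple


class RoundRobinScheduler:
    def __init__(self, has_token: Callable[[str], bool]):
        self._queues: Dict[str, Deque[dict]] = {}
        self._order: Deque[str] = deque()      # active tenants in rotation order
        self._has_token = has_token

    def enqueue(self, tenant_id: str, event: dict) -> None:
        if tenant_id not in self._queues:
            self._queues[tenant_id] = deque()
            self._order.append(tenant_id)
        self._queues[tenant_id].append(event)

    def next_event(self) -> Optional[Tuple[str, dict]]:
        """Visit each active tenant at most once; skip rate-limited ones."""
        for _ in range(len(self._order)):
            tenant_id = self._order[0]
            self._order.rotate(-1)             # this tenant goes to the back
            if not self._has_token(tenant_id):
                continue                       # do not block the rotation on it
            queue = self._queues[tenant_id]
            event = queue.popleft()
            if not queue:                      # tenant drained: leave the rotation
                del self._queues[tenant_id]
                self._order.remove(tenant_id)
            return tenant_id, event
        return None                            # nothing sendable right now


sched = RoundRobinScheduler(has_token=lambda tenant: True)
for i in range(4):
    sched.enqueue("backfill", {"seq": i})
sched.enqueue("small", {"seq": 0})
served = [sched.next_event()[0] for _ in range(5)]
print(served)
```

The small tenant is served on the second pick even though it enqueued last, which is the property the per-tenant split is meant to buy.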

The operational tell that you need this is a latency distribution with a long, lumpy tail that correlates with nothing on your own dashboards. Add a tenant_id label to the time-to-delivery histogram and the picture resolves immediately: a handful of tenants sitting at seconds while the fleet median sits at hundreds of milliseconds is head-of-line blocking, and no amount of extra workers will fix it, because more workers drain the same FIFO in the same order.

Failure Mode Analysis & Mitigation

Failure Mode	Impact	Mitigation Strategy
Unbounded dispatch queue	Slow consumer drives the producer out of memory and crashes it	Use a bounded queue; treat a full queue as the backpressure signal
429 fed into blind retry	Dispatcher ignores the consumer’s stated limit and amplifies the overload	Honor `Retry-After` exactly; lower the per-endpoint token rate on 429
Rate cap without concurrency cap	A burst of slow responses opens thousands of connections, exhausting sockets	Add an `asyncio.Semaphore` sized to the consumer’s connection limit
Per-instance rate limits	N workers each enforce the limit, so the global rate is N× the target	Share token-bucket state in Redis with an atomic refill-and-take script
Retry storm on recovery	All paused deliveries fire at once when the consumer returns	Drain through the same token bucket and add jitter to release timing

Runnable Implementation Example

This async dispatcher combines a token bucket, a concurrency cap, a bounded queue, and 429 handling.

import asyncio
import time
import httpx

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self._tokens = capacity
        self._updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            while True:
                now = time.monotonic()
                # Refill based on elapsed time from the monotonic clock.
                self._tokens = min(
                    self.capacity,
                    self._tokens + (now - self._updated) * self.rate,
                )
                self._updated = now
                if self._tokens >= 1:
                    self._tokens -= 1
                    return
                # Sleep just long enough for one token to accrue.
                await asyncio.sleep((1 - self._tokens) / self.rate)

    def throttle(self, factor: float = 0.5) -> None:
        """Shrink the steady-state rate after a 429."""
        # Floor at 1 token/s, but never raise a rate that is already below it.
        self.rate = max(min(1.0, self.rate), self.rate * factor)


class Dispatcher:
    def __init__(self, rate: float, max_inflight: int, queue_size: int):
        # One second of burst; capacity below 1 would never yield a token.
        self.bucket = TokenBucket(rate, capacity=max(1.0, rate))
        self.sem = asyncio.Semaphore(max_inflight)   # concurrency cap
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=queue_size)  # bounded
        self.client = httpx.AsyncClient(timeout=10.0)

    async def submit(self, event: dict) -> None:
        # A full queue is the backpressure signal: block the producer here.
        await self.queue.put(event)

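    async def _send(self, event: dict) -> None:
        while True:
            await self.bucket.acquire()              # rate limit (retries too)
            async with self.sem:                     # concurrency cap
                resp = await self.client.post(event["url"], json=event["body"])
            if resp.status_code != 429:
                return
            # Honor the consumer's explicit limit, then lower our rate.
            # Retry-After may be seconds or an HTTP-date; only the seconds
            # form is parsed here, anything else falls back to 5 s.
            try:
                delay = max(0.0, float(resp.headers.get("Retry-After", "5")))
            except ValueError:
                delay = 5.0
            self.bucket.throttle()
            # Sleep after releasing the semaphore so a throttled delivery
            # does not hold an in-flight slot. Retry in place rather than
            # re-enqueueing: workers blocked on put() into a full queue
            # could deadlock the pool.
            await asyncio.sleep(delay)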

    async def run(self, workers: int = 4) -> None:
        async def worker():
            while True:
                event = await self.queue.get()
                try:
                    await self._send(event)
                finally:
                    self.queue.task_done()
        await asyncio.gather(*(worker() for _ in range(workers)))

Operational Workflows & CI/CD Integration

Export the levers as runtime configuration, not constants: per-endpoint rate, capacity, max_inflight, and queue_size should be tunable without a redeploy so operators can throttle a misbehaving integration in seconds. Emit queue depth, token-bucket fill level, in-flight count, and 429-rate as metrics, and alert when the projected drain time shows the queue trending toward its bound, since that is the leading indicator of an impending backlog, well before latency SLOs are breached. In load tests, drive the dispatcher against a deliberately slow endpoint and assert the queue stays bounded and memory stays flat; a test that only exercises the happy path will never catch an unbounded-queue regression.

Treat a rate change as a deploy-grade event even when it does not ship code. Record who changed a per-endpoint rate, from what to what, and why, and emit the current effective rate as a gauge so a dashboard shows the value in force rather than the value in the config repository — the two diverge the moment an on-call engineer throttles an endpoint at 3 a.m. and nobody reverts it. Adaptive throttling makes this worse in a useful way: if a 429 halves the rate automatically, the effective rate can drift far below the configured one and stay there silently, so alert on the ratio of effective to configured rate rather than on the raw number.

Two CI checks catch most regressions cheaply. First, a static assertion that every queue construction in the dispatch path passes a maxsize; this is a one-line lint rule and it prevents the single most damaging failure on this page from ever reaching production. Second, a soak test that runs the dispatcher against a stub endpoint which returns 429 for the first thirty seconds and then recovers, asserting that the drain rate afterwards never exceeds the configured ceiling. That second test is what proves the recovery path releases work through the bucket instead of dumping the whole backlog the instant the consumer comes back.
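The first check could look roughly like this when written against Python's `ast` module. It is a heuristic sketch, not a complete linter: the function name and the set of queue class names are assumptions, it matches on the class name alone, and it only recognises a literal `maxsize`.

```python
import ast
from typing import List

QUEUE_NAMES = {"Queue", "LifoQueue", "PriorityQueue"}  # assumed names to police


def unbounded_queue_lines(source: str) -> List[int]:
    """Line numbers of queue constructions with no maxsize, or a literal maxsize <= 0."""
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Attribute):      # asyncio.Queue(...)
            name = func.attr
        elif isinstance(func, ast.Name):         # Queue(...)
            name = func.id
        else:
            continue
        if name not in QUEUE_NAMES:
            continue
        size = node.args[0] if node.args else None
        for kw in node.keywords:
            if kw.arg == "maxsize":
                size = kw.value
        unbounded = size is None or (
            isinstance(size, ast.Constant)
            and isinstance(size.value, int)
            and size.value <= 0                  # maxsize <= 0 means infinite
        )
        if unbounded:
            offenders.append(node.lineno)
    return sorted(offenders)


sample = (
    "import asyncio\n"
    "a = asyncio.Queue()\n"
    "b = asyncio.Queue(maxsize=100)\n"
    "c = asyncio.Queue(0)\n"
)
print(unbounded_queue_lines(sample))
```

A CI job would run this over the files in the dispatch path and fail the build when the returned list is non-empty.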

Debugging Checklist

Confirm the dispatch queue has a maxsize and that a full queue blocks or sheds the producer.
Verify 429 responses read Retry-After and lower the per-endpoint rate, rather than entering blind retry.
Check that a concurrency cap exists independently of the rate limit.
For multi-instance dispatchers, confirm token-bucket state is shared (Redis), not per-process.
Ensure paused or requeued deliveries drain through the bucket with jitter, not all at once.
Validate that queue depth and bucket fill level are exported as metrics with alerts.

Shedding and Coalescing When the Backlog Cannot Be Drained

Every control described so far assumes the backlog is eventually drainable. Sometimes it is not: a consumer limited to 20 per second that has accumulated four million events would need 55 hours to catch up, and by then most of those events describe states that changed hours ago. At that point the honest engineering decision is to reduce the work rather than to keep queuing it, and having decided that in advance is the difference between an orderly recovery and an all-night drain that delivers stale data.

Events fall into two classes and the classification must be attached to the event type, not decided during the incident. Superseded-state events — periodic status snapshots, presence updates, “current balance” notifications, aggregate counters — carry a value that the next event of the same type for the same subject completely replaces. These can be coalesced: keep only the newest per (subscription_id, subject_id, event_type) and discard the rest. On a real backlog dominated by a polling integration this routinely removes 80–95% of the queue, because the same fifty subjects appear thousands of times. Transactional events — payments, state transitions, anything a consumer will reconcile against its own ledger — are independently meaningful and must never be coalesced; the correct treatment for those is to keep queuing, or to dead-letter them with a clear replay path if the delivery window has already been blown.

Shedding needs a policy that is visible after the fact. Whatever you drop must leave a record: a counter labelled by event type and reason, and ideally a compact row noting which event IDs were coalesced away and which surviving event supersedes them. Without that, the first consumer to ask “why did I not receive event 8f3a” gets no answer, and the trust cost of an unexplained gap far exceeds the cost of the gap itself. Coalescing also interacts with any ordering promise you have made: collapsing events reorders nothing, but it does mean the consumer sees fewer transitions than occurred, so document coalescing as part of the event type’s contract rather than treating it as an internal optimisation.
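A sketch of coalescing that also produces the audit record described above. The event field names (`id`, `subscription_id`, `subject_id`, `event_type`) and the function signature are assumptions; the backlog is taken to be ordered oldest first, and only event types explicitly listed as coalescable are ever dropped.

```python
from typing import Dict, List, Set, Tuple


def coalesce(backlog: List[dict], coalescable_types: Set[str]) -> Tuple[List[dict], Dict[str, str]]:
    """Keep the newest superseded-state event per key; never touch other types.

    Returns (kept_events, superseded_by), where superseded_by maps each
    dropped event ID to the ID of the surviving event that replaces it.
    """
    def key(event: dict) -> tuple:
        return (event["subscription_id"], event["subject_id"], event["event_type"])

    newest: Dict[tuple, dict] = {}
    for event in backlog:                       # later entries overwrite earlier ones
        if event["event_type"] in coalescable_types:
            newest[key(event)] = event

    kept: List[dict] = []
    superseded_by: Dict[str, str] = {}
    for event in backlog:                       # second pass preserves original order
        if event["event_type"] not in coalescable_types:
            kept.append(event)                  # transactional: always kept
        elif newest[key(event)] is event:
            kept.append(event)                  # the surviving snapshot
        else:
            superseded_by[event["id"]] = newest[key(event)]["id"]
    return kept, superseded_by


backlog = [
    {"id": "e1", "subscription_id": "s1", "subject_id": "u1", "event_type": "presence"},
    {"id": "e2", "subscription_id": "s1", "subject_id": "u1", "event_type": "payment"},
    {"id": "e3", "subscription_id": "s1", "subject_id": "u1", "event_type": "presence"},
    {"id": "e4", "subscription_id": "s1", "subject_id": "u2", "event_type": "presence"},
]
kept, superseded_by = coalesce(backlog, coalescable_types={"presence"})
print([e["id"] for e in kept], superseded_by)
```

Persisting `superseded_by` is what lets a later question about a missing event ID be answered with the ID of the event that replaced it.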

Finally, decide the trigger in advance and wire it to the drain-time metric rather than to human judgement. A reasonable default is to coalesce automatically whenever projected drain time exceeds four times the delivery promise for that endpoint, and to alert rather than act when the backlog is transactional. Automatic coalescing under a clearly documented condition is defensible; the same action taken ad hoc at 3 a.m. by whoever is on call is not, and it is the kind of thing that surfaces in a customer’s audit six months later.

Frequently Asked Questions

Should the rate limit be scoped per endpoint, per tenant, or globally?

Per endpoint URL is the unit that matters, because that is what the consumer's own limiter counts. Add a per-tenant ceiling above it when one tenant registers many endpoints that resolve to the same infrastructure, and keep a global ceiling only as a self-protection bound on your own egress capacity.

What rate should we use for a consumer that publishes no limit at all?

Start deliberately low, around 10 requests per second with a capacity of 20, and raise it only on evidence. Measure the endpoint's p95 latency and its 429 rate for a week, then increase in steps while watching both; an endpoint whose latency starts climbing before it returns any 429s is telling you its real limit is lower than the rate you are currently sending at.

Is it safe to block the producer when the dispatch queue is full?

It is safe only when the producer is a background job that can afford to wait. If the producer is an HTTP request path serving a user, blocking on a full queue converts a slow webhook consumer into a slow API for everyone, so that path should enqueue durably and return, letting a separate relay absorb the backpressure.

Does a 429 from a consumer count against our delivery success rate?

Not as a failure, but it must not be counted as a success either. Record it as a distinct throttled outcome so that a rising 429 rate is visible without polluting the error budget, and page only when throttling starts pushing time-to-delivery past the promised window rather than on the 429 count itself.

How does backpressure differ from a circuit breaker in practice?

A breaker reacts to errors and stops traffic entirely for a cooldown; backpressure reacts to saturation and slows traffic while keeping it flowing. The distinction matters most for a consumer that is slow but returning 200, which never trips an error-rate breaker and can only be handled by throttling and concurrency limits.

Should we ever drop events rather than queue them?

Yes, for event types whose latest value supersedes earlier ones, such as periodic status snapshots or presence updates. Collapsing a backlog of superseded snapshots into the newest one per subject loses no current state from the consumer's point of view, although the consumer does see fewer transitions than occurred, and it can shrink a backlog by an order of magnitude. It is never acceptable for transactional events where each occurrence is independently meaningful.

Why not just add more dispatch workers instead of rate limiting?

Because the constraint is on the consumer's side, not yours. Adding workers to a consumer that is already at its limit converts queued work into 429s and timeouts, raises that consumer's error rate, and slows every other tenant sharing your worker pool. Scale out only after the limiter shows tokens are consistently available and the bottleneck is genuinely local.

Circuit Breaker Patterns — stop traffic on errors, the complement to throttling on saturation.
Exponential Backoff Algorithms — the fallback when no Retry-After is supplied.
Applying backpressure to webhook consumers — detecting slow consumers and pausing intake step by step.
Token bucket rate limiting for webhook senders — the Redis Lua refill script and per-endpoint rate config.
Resilient Delivery & Retry Strategies — the broader resilience context.