Applying Backpressure to Webhook Consumers

This guide implements consumer-side backpressure for a single Python webhook receiver under load, the concrete companion to Webhook Rate Limiting & Backpressure. The scenario: a FastAPI endpoint accepts events, hands them to background workers that do slow processing (a database write, a downstream call), and during a traffic spike the workers fall behind. Without backpressure the receiver either runs out of memory buffering or silently drops events. The fix is to make saturation explicit — a bounded queue, a 429 Retry-After response that tells the sender to slow down, and pause/resume control with hysteresis. Because the producer must honor that 429 to break the loop, pair this with sender-side protection such as per-endpoint circuit breaker state machines and the dispatch throttle described in token bucket rate limiting for webhook senders.

The receiver returns 429 once the queue crosses the high-water mark and resumes accepting only after it drains below the low-water mark.

Prerequisites

Python 3.11+ with asyncio, a FastAPI/Starlette receiver, and httpx for the test sender.
A receiver that already does its real work in background workers off the request path (synchronous processing inside the handler cannot be paused independently of intake).
Senders that honor HTTP 429 and Retry-After. If you control the dispatcher, see Webhook Rate Limiting & Backpressure for the producer side.
Idempotent processing, because backpressure relies on the sender retrying rejected deliveries.

Step 1: Bound the work queue

Replace any unbounded buffer with a fixed-size asyncio.Queue. The size is a deliberate memory budget: roughly the number of in-flight events you can hold without risking the process. A bounded queue turns “we are overloaded” from a silent memory leak into an observable, actionable condition.

import asyncio

QUEUE_MAX = 1000
HIGH_WATER = int(QUEUE_MAX * 0.8)   # start shedding
LOW_WATER = int(QUEUE_MAX * 0.4)    # resume accepting

queue: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_MAX)

Pick QUEUE_MAX from two numbers you can actually measure: the average serialized size of one event and the headroom you are willing to give the process. A thousand 4 KB events is 4 MB of payload plus per-object overhead — trivial. A thousand 400 KB events is 400 MB, which on a 512 MB container is the outage you were trying to avoid. Size the queue in bytes first, then convert to a count. If your events vary wildly in size, store only the event id and a pointer in the queue and fetch the body in the worker.
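
The byte-first sizing above can be sketched as a small helper. The function name and the `overhead_factor` multiplier are illustrative assumptions, not part of the receiver code; the 2x default stands in for a per-object overhead you would measure on your own payloads.

```python
def queue_max_for_budget(budget_bytes: int, avg_event_bytes: int,
                         overhead_factor: float = 2.0) -> int:
    """Convert a memory budget into a queue length.

    overhead_factor is an assumed multiplier for in-memory overhead on top
    of the serialized size; replace it with a measured value.
    """
    if budget_bytes <= 0 or avg_event_bytes <= 0 or overhead_factor <= 0:
        raise ValueError("budget, event size and overhead must be positive")
    per_event = avg_event_bytes * overhead_factor
    return max(1, int(budget_bytes // per_event))


# The same 64 MB budget gives very different counts for 4 KB and 400 KB events.
small_events = queue_max_for_budget(64 * 1024 * 1024, 4 * 1024)
large_events = queue_max_for_budget(64 * 1024 * 1024, 400 * 1024)
```

Under these assumptions the small-event queue can hold thousands of entries while the large-event queue holds fewer than a hundred, which is why the count should be derived from bytes rather than chosen directly.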

The second number that matters is drain time. A queue that holds sixty seconds of work at full drain rate is a shock absorber; one that holds twenty minutes is a latency generator that will still be delivering stale events long after the sender has given up and retried them elsewhere. As a rule of thumb, set QUEUE_MAX to no more than 30–60 seconds of measured throughput, and let the sender’s retry machinery hold anything beyond that — it is designed for durable backlog and your process memory is not.
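
A minimal sketch of the drain-time cap, assuming you already have a memory-derived count and a measured drain rate. The helper names and the 60-second default are illustrative; the default simply takes the upper end of the rule of thumb above.

```python
def cap_by_drain_time(count_from_memory: int, drain_per_sec: float,
                      max_backlog_seconds: float = 60.0) -> int:
    """Return the smaller of the memory-derived count and the time-derived cap."""
    if drain_per_sec <= 0:
        raise ValueError("drain_per_sec must be positive")
    time_cap = int(drain_per_sec * max_backlog_seconds)
    return max(1, min(count_from_memory, time_cap))


def backlog_seconds(depth: int, drain_per_sec: float) -> float:
    """Seconds needed to clear `depth` events at the measured drain rate."""
    if drain_per_sec <= 0:
        raise ValueError("drain_per_sec must be positive")
    return depth / drain_per_sec


# Memory allows 8192 events, but 60 s at 50 events/sec is only 3000.
capped = cap_by_drain_time(8192, drain_per_sec=50)
full_queue_delay = backlog_seconds(1000, drain_per_sec=50)
```

Whichever limit is smaller wins: memory protects the process, drain time protects freshness.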

Step 2: Signal backpressure with 429 and Retry-After

When the queue is at or above the high-water mark, reject new deliveries with 429 Too Many Requests and a Retry-After header. This is the entire point of consumer backpressure: you are telling the sender to come back later instead of accepting work you cannot process.

from fastapi import FastAPI, Request, Response

app = FastAPI()

def shed() -> Response:
    # Suggest a wait proportional to the backlog drain time, in whole seconds.
    retry_after = max(1, queue.qsize() // max(1, DRAIN_PER_SEC))
    return Response(status_code=429, headers={"Retry-After": str(retry_after)})

@app.post("/webhooks")
async def receive(request: Request):
    # `accepting` and DRAIN_PER_SEC are defined in Step 3.
    if queue.qsize() >= HIGH_WATER or not accepting.is_set():
        return shed()                  # reject before reading the body
    event = await request.json()
    try:
        queue.put_nowait(event)        # never block the request on a full queue
    except asyncio.QueueFull:
        return shed()                  # queue filled while the body was being read
    return Response(status_code=202)

Use put_nowait rather than await queue.put(...): blocking the HTTP handler on a full queue holds a connection open and converts backpressure into latency. Reject fast and let the sender retry.

The Retry-After value is a promise, so derive it rather than hard-coding a constant. qsize() // DRAIN_PER_SEC gives the sender the honest number of seconds until the backlog clears at the current rate; a fixed Retry-After: 60 either wastes capacity when you recover in five seconds or invites a stampede when you need five minutes. Two details are easy to get wrong. First, the header is defined in whole seconds (or an HTTP date) — a float serializes to something many clients silently discard, and a discarded header means the sender falls back to its own retry timing. Second, the value should carry a small random spread when many senders are being rejected at once, otherwise every one of them returns at the same instant and re-saturates the queue you just drained.
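
Both details can be sketched in one helper: an integer header value plus a small random spread. The function name, the `jitter_fraction` parameter, and the 20% default are assumptions for illustration. The spread is only ever added, so no sender is told to return before the estimated drain time.

```python
from __future__ import annotations

import random


def retry_after_header(depth: int, drain_per_sec: float,
                       jitter_fraction: float = 0.2,
                       rng: random.Random | None = None) -> str:
    """Build a whole-second Retry-After value with upward jitter."""
    source = rng if rng is not None else random
    # Estimated seconds until the backlog clears, never less than one.
    base = max(1.0, depth / max(1.0, drain_per_sec))
    # Spread rejected senders over [base, base * (1 + jitter_fraction)].
    spread = source.uniform(0.0, base * jitter_fraction)
    # The header must be an integer number of seconds, so round before formatting.
    return str(max(1, round(base + spread)))
```

In the Step 2 handler this would replace the `str(retry_after)` expression, called as `retry_after_header(queue.qsize(), DRAIN_PER_SEC)`.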

Return 429 as early in the request lifecycle as you can. Checking the queue before await request.json() avoids parsing a body you are about to discard, which matters when the payloads are large and the rejection rate is high — during a real spike the parsing cost of shed traffic can be the thing that keeps you saturated.

Step 3: Pause and resume intake with hysteresis

A single threshold causes flapping — the receiver toggles 202/429 on every event near the boundary. Use two thresholds: stop accepting at the high-water mark, and only resume once the queue drains below the low-water mark. An asyncio.Event makes the accepting/paused state explicit and cheap to check.

Two marks, not one: the gap between them is the only thing preventing the receiver from oscillating between 202 and 429 on every event.

accepting = asyncio.Event()
accepting.set()                         # start in the accepting state
DRAIN_PER_SEC = 50                      # measured worker throughput

async def watermark_monitor():
    while True:
        depth = queue.qsize()
        if depth >= HIGH_WATER and accepting.is_set():
            accepting.clear()           # pause intake
        elif depth <= LOW_WATER and not accepting.is_set():
            accepting.set()             # resume intake
        await asyncio.sleep(0.25)

async def worker():
    while True:
        event = await queue.get()
        try:
            await process(event)        # the slow work
        finally:
            queue.task_done()

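background_tasks: set[asyncio.Task] = set()   # hold references so the tasks are not garbage-collected

@app.on_event("startup")                 # newer FastAPI releases prefer a lifespan handler
async def startup():
    background_tasks.add(asyncio.create_task(watermark_monitor()))
    for _ in range(8):                   # concurrency cap = worker count
        background_tasks.add(asyncio.create_task(worker()))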

The gap between HIGH_WATER and LOW_WATER is the hysteresis band; widen it if you observe rapid accept/reject oscillation. How wide is wide enough depends on how fast the queue drains relative to how fast the sender comes back, so start from measured throughput rather than a fixed percentage:

Worker drain rate	Suggested QUEUE_MAX	HIGH_WATER	LOW_WATER	Rationale
10 events/sec	400	320 (80%)	120 (30%)	Slow drain needs a wide band; 200 events is 20 s of recovery before intake resumes
50 events/sec	1000	800 (80%)	400 (40%)	The default in the code above: 200 events (4 s at the drain rate) of headroom above the high-water mark, 400 events (8 s) of recovery
200 events/sec	3000	2400 (80%)	1500 (50%)	Fast drain tolerates a narrower band because recovery is quick
1000 events/sec	10000	8500 (85%)	6000 (60%)	At this rate the queue is a shock absorber, not a buffer; keep it shallow in time
Highly variable	2 × p99 burst	75%	35%	When drain rate is unstable, buy width instead of precision
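
The arithmetic behind the table can be sketched as two small helpers, so the marks are derived from measured throughput rather than copied. The function names are illustrative assumptions.

```python
from __future__ import annotations


def water_marks(queue_max: int, high_frac: float, low_frac: float) -> tuple[int, int]:
    """Return (HIGH_WATER, LOW_WATER) as event counts."""
    if not 0 < low_frac < high_frac <= 1:
        raise ValueError("need 0 < low_frac < high_frac <= 1")
    return round(queue_max * high_frac), round(queue_max * low_frac)


def recovery_seconds(high: int, low: int, drain_per_sec: float) -> float:
    """Time to drain from the high mark to the low mark with intake paused."""
    if drain_per_sec <= 0:
        raise ValueError("drain_per_sec must be positive")
    return (high - low) / drain_per_sec


# First table row: 10 events/sec, QUEUE_MAX 400, marks at 80% and 30%.
slow_high, slow_low = water_marks(400, 0.80, 0.30)
slow_recovery = recovery_seconds(slow_high, slow_low, drain_per_sec=10)
```

If the recovery time comes out shorter than the typical Retry-After your senders honor, the band is probably too narrow for that drain rate.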

The watermark_monitor poll interval is the other flap control. Polling every 250 ms means the receiver can be at most a quarter-second late noticing that it crossed a mark, which at 50 events/sec is roughly a dozen events of overshoot — acceptable. At 1000 events/sec the same interval overshoots by 250 events, so either poll faster or check the marks inline on the enqueue path where the cost is a single integer comparison.
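
A minimal sketch of the inline alternative: the same two-mark logic as `watermark_monitor`, evaluated with the current depth on each enqueue attempt instead of on a timer. The `IntakeGate` class is a hypothetical name introduced for illustration.

```python
class IntakeGate:
    """Hysteresis gate: pause at the high mark, resume at the low mark."""

    def __init__(self, high_water: int, low_water: int) -> None:
        if low_water >= high_water:
            raise ValueError("low_water must be below high_water")
        self.high_water = high_water
        self.low_water = low_water
        self.accepting = True

    def check(self, depth: int) -> bool:
        """Update the state from the current depth and return it."""
        if self.accepting and depth >= self.high_water:
            self.accepting = False      # pause intake
        elif not self.accepting and depth <= self.low_water:
            self.accepting = True       # resume intake
        return self.accepting


gate = IntakeGate(high_water=800, low_water=400)
trace = [gate.check(d) for d in (799, 800, 600, 400, 600)]
```

Because the state only matters when a request arrives, evaluating it on the request path loses nothing compared with polling, and the overshoot drops to at most the requests already in flight.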

Step 4: Detect slow consumers from drain rate

Backpressure is reactive; detection lets you act before the queue saturates. Track per-event dwell time (enqueue-to-dequeue) and processing latency. A rising dwell time with steady intake means workers are falling behind — the leading indicator of an impending 429 storm.

Dwell time starts climbing roughly twenty seconds before depth reaches the high-water mark, which is the window in which you can still add capacity instead of shedding.

import time
from collections import deque

# Bounded, so the instrumentation cannot become the memory leak it is watching for.
dwell_samples: deque[float] = deque(maxlen=1000)

async def worker_instrumented():
    while True:
        event = await queue.get()
        dwell = time.monotonic() - event["_enqueued_at"]
        dwell_samples.append(dwell)
        try:
            await process(event)
        finally:
            queue.task_done()

def health_snapshot() -> dict:
    recent = list(dwell_samples)[-100:] or [0.0]
    return {
        "queue_depth": queue.qsize(),
        "accepting": accepting.is_set(),
        "p95_dwell_seconds": sorted(recent)[int(len(recent) * 0.95)],
    }

Stamp _enqueued_at = time.monotonic() when you enqueue in Step 2. Export health_snapshot() to your metrics pipeline and alert on p95_dwell_seconds trending up. If the dwell time is dominated by one downstream dependency rather than by intake volume, the fix belongs on the dispatch side instead — see handling slow webhook consumers for per-endpoint concurrency lanes.
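
A sketch of the stamping step and of one way to turn "trending up" into a boolean. Both helper names, the window of 50 samples, and the 1.5x ratio are assumptions for illustration; a metrics pipeline would normally do the trend comparison for you.

```python
from __future__ import annotations

import time


def stamp(event: dict) -> dict:
    """Return a copy of the event carrying its enqueue time."""
    return {**event, "_enqueued_at": time.monotonic()}


def dwell_rising(samples: list[float], window: int = 50, ratio: float = 1.5) -> bool:
    """True when the latest window's mean dwell exceeds the previous window's by `ratio`."""
    if len(samples) < 2 * window:
        return False                     # not enough data to compare two windows
    previous = samples[-2 * window:-window]
    latest = samples[-window:]
    previous_mean = sum(previous) / window
    latest_mean = sum(latest) / window
    return latest_mean > 0 and latest_mean > previous_mean * ratio
```

In the Step 2 handler the call would be `queue.put_nowait(stamp(event))`, which assumes the webhook body is a JSON object rather than a list or scalar.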

Verification and testing

Drive the receiver faster than it drains and assert it sheds load instead of growing without bound.

import asyncio, httpx

async def flood():
    async with httpx.AsyncClient(base_url="http://localhost:8000") as c:
        codes = await asyncio.gather(*[
            c.post("/webhooks", json={"i": i}) for i in range(5000)
        ], return_exceptions=True)
    statuses = [r.status_code for r in codes if hasattr(r, "status_code")]
    assert 429 in statuses, "receiver never applied backpressure"
    assert 202 in statuses, "receiver rejected everything"
    print("202:", statuses.count(202), "429:", statuses.count(429))

asyncio.run(flood())

A passing test shows a mix of 202 and 429, and the receiver’s memory stays flat throughout the flood — confirming the bounded queue, not the heap, absorbs the spike.

Two assertions are worth adding once the basic shape passes. Assert that every 429 carries a parseable integer Retry-After, because a missing or malformed header is invisible in aggregate status-code counts and turns a cooperative sender into a hot-looping one. And assert that the receiver returns to accepting after the flood stops: run the flood, sleep long enough for the queue to drain past LOW_WATER, then post a single event and assert a 202. A receiver that pauses correctly but never resumes passes the first test and fails in production the same way an outage does.

async def resumes_after_drain():
    async with httpx.AsyncClient(base_url="http://localhost:8000") as c:
        await asyncio.gather(*[c.post("/webhooks", json={"i": i}) for i in range(5000)],
                             return_exceptions=True)
        await asyncio.sleep(30)                    # longer than QUEUE_MAX / DRAIN_PER_SEC
        resp = await c.post("/webhooks", json={"i": "probe"})
    assert resp.status_code == 202, f"still paused after drain: {resp.status_code}"

asyncio.run(resumes_after_drain())
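
The Retry-After assertion can be sketched as a helper that works on any response-like object with `status_code` and `headers`. The helper name is an assumption, and the sample responses are stand-ins built with `SimpleNamespace` rather than real httpx responses.

```python
from types import SimpleNamespace


def invalid_retry_after(responses) -> list:
    """Return the 429 responses whose Retry-After is missing or not an integer."""
    bad = []
    for response in responses:
        if getattr(response, "status_code", None) != 429:
            continue                      # only rejections need the header
        value = response.headers.get("Retry-After", "")
        if not (value.isascii() and value.isdigit()):
            bad.append(response)
    return bad


sample = [
    SimpleNamespace(status_code=202, headers={}),
    SimpleNamespace(status_code=429, headers={"Retry-After": "16"}),
    SimpleNamespace(status_code=429, headers={"Retry-After": "2.5"}),  # float: rejected
    SimpleNamespace(status_code=429, headers={}),                      # missing: rejected
]
offenders = invalid_retry_after(sample)
```

In the flood test this would be called on the gathered responses, followed by `assert not invalid_retry_after(codes)`; exceptions in that list are skipped because they have no `status_code`.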

Failure modes and gotchas

Blocking the handler on a full queue — using await queue.put() instead of put_nowait holds the request open under load, turning backpressure into mounting latency and connection exhaustion. Reject with 429 immediately.
Single threshold flapping — without the high/low-water hysteresis band, the receiver oscillates between accepting and rejecting near the boundary, producing noisy 429s. Widen the gap between the marks.
Senders ignoring 429 — if the dispatcher retries instantly regardless of Retry-After, your 429s amplify load instead of relieving it. Confirm sender behavior, and pair with the producer-side controls in Webhook Rate Limiting & Backpressure.
Non-idempotent processing — backpressure depends on rejected deliveries being retried, which means some events arrive twice. Deduplicate downstream so retries are safe.
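
A minimal deduplication sketch for the last point, assuming each event carries a stable identifier supplied by the sender. The `SeenIds` class is hypothetical and in-memory only, so it forgets everything on restart; a durable store keyed on the same id would be the usual production choice.

```python
from collections import OrderedDict


class SeenIds:
    """Bounded record of recently processed event ids (least recently seen evicted)."""

    def __init__(self, capacity: int = 10_000) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._seen: OrderedDict = OrderedDict()

    def first_time(self, event_id: str) -> bool:
        """Return True once per id; False for a repeat delivery."""
        if event_id in self._seen:
            self._seen.move_to_end(event_id)   # refresh recency
            return False
        self._seen[event_id] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)     # evict the oldest id
        return True


seen = SeenIds(capacity=2)
results = [seen.first_time(i) for i in ("a", "a", "b", "c", "a")]
```

The final `"a"` is treated as new because the capacity of two has already evicted it, which shows why the window must be at least as long as the sender's retry horizon.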

Frequently Asked Questions

Should a paused receiver also fail its readiness probe?

No. Failing readiness pulls the instance out of the load balancer and redirects its share of traffic onto peers that are usually saturated for the same reason, which turns one busy replica into a rolling brownout. Keep readiness tied to process health and let the 429 carry the load signal. Tying a liveness probe to queue depth is worse still, because the restart discards every event currently buffered.

What happens to queued events when the process receives SIGTERM?

Anything still in the in-memory queue is lost unless you drain it deliberately. On shutdown, clear the accepting event first so new deliveries get a 429 rather than joining a queue nobody will finish, then await queue.join() with a timeout that fits inside the container's termination grace period. Sizing QUEUE_MAX to 30-60 seconds of throughput is what makes that drain finish in time.
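
The shutdown sequence can be sketched as follows. The `drain_on_shutdown` name and the way it is wired to your shutdown hook are assumptions; the calls themselves (`Event.clear`, `Queue.join`, `asyncio.wait_for`) are standard asyncio.

```python
from __future__ import annotations

import asyncio


async def drain_on_shutdown(queue: asyncio.Queue, accepting: asyncio.Event,
                            grace_seconds: float) -> bool:
    """Stop intake, then wait for the queue to drain. Returns False on timeout."""
    accepting.clear()                     # new deliveries now get a 429
    try:
        await asyncio.wait_for(queue.join(), timeout=grace_seconds)
        return True
    except asyncio.TimeoutError:
        return False                      # events were still queued at the deadline


async def _demo() -> tuple[bool, bool]:
    accepting = asyncio.Event()
    accepting.set()
    queue: asyncio.Queue = asyncio.Queue(maxsize=10)
    for i in range(3):
        queue.put_nowait(i)

    async def worker() -> None:
        while True:
            await queue.get()
            queue.task_done()             # join() returns once every item is done

    task = asyncio.create_task(worker())
    drained = await drain_on_shutdown(queue, accepting, grace_seconds=1.0)
    task.cancel()
    return drained, accepting.is_set()


demo_result = asyncio.run(_demo())
```

Pick `grace_seconds` a little shorter than the container's termination grace period so there is time left to log whatever did not drain.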

Does this still work when the receiver runs as several replicas behind a load balancer?

Yes, because each replica sheds on its own depth and a round-robin balancer will simply get a cheap 429 from whichever replica is full. The thing to fix is the metric: alert on the maximum queue depth across replicas rather than the average, since one hot replica returning 429 to a tenth of your traffic is invisible in a mean that includes idle peers.
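
A small worked example of why the mean hides a hot replica. The replica names and depths are made up for illustration, with HIGH_WATER taken as 800 from the default configuration.

```python
from __future__ import annotations

from statistics import mean


def replicas_shedding(depths: dict[str, int], high_water: int) -> list[str]:
    """Names of replicas at or above the high-water mark."""
    return sorted(name for name, depth in depths.items() if depth >= high_water)


# Nine nearly idle replicas and one that is shedding.
depths = {f"replica-{i}": 50 for i in range(1, 10)}
depths["replica-0"] = 900

fleet_mean = mean(depths.values())   # far below HIGH_WATER
fleet_max = max(depths.values())     # above HIGH_WATER
hot = replicas_shedding(depths, high_water=800)
```

An alert on `fleet_mean` stays quiet here while `fleet_max` fires, which matches the one-in-ten rejection rate the senders are actually seeing.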

Can some event types be exempt from shedding?

Yes, by checking the type before the water-mark comparison and letting a small reserved class use the band between HIGH_WATER and QUEUE_MAX. Keep the reserve genuinely small, on the order of ten percent of capacity, or the exemption simply moves the saturation point without removing it. This only works when the priority signal is in a header or the URL path, since reading it from the body forfeits the cheap early rejection.
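
A sketch of the admission check with a reserved band, assuming the priority flag has already been read from a header or the path. The `admit` function and its parameters are hypothetical, and the sketch leaves out the `accepting` hysteresis state from Step 3 to keep the comparison visible.

```python
def admit(depth: int, is_priority: bool, high_water: int, queue_max: int) -> bool:
    """Ordinary events stop at HIGH_WATER; priority events may use the band above it."""
    if not 0 < high_water <= queue_max:
        raise ValueError("need 0 < high_water <= queue_max")
    limit = queue_max if is_priority else high_water
    return depth < limit


ordinary_at_850 = admit(850, is_priority=False, high_water=800, queue_max=1000)
priority_at_850 = admit(850, is_priority=True, high_water=800, queue_max=1000)
priority_when_full = admit(1000, is_priority=True, high_water=800, queue_max=1000)
```

Even the priority class is rejected once the queue is full, so the exemption raises the ceiling for a few events without removing it.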

If a sender ignores our 429, have we simply dropped the event?

Yes. Shedding is only a backpressure mechanism when the other side retries; against a sender that treats 429 as terminal it is data loss with a polite status code. For third-party senders, check their documented retry policy before enabling shedding, and if they do not retry, accept into a durable log instead and pay for the storage rather than the incident.

Why return 429 rather than 503 when the receiver is saturated?

The two codes mean different things to a well-built dispatcher. A 503 reads as a broken endpoint and typically counts toward an error rate that trips a circuit breaker, cutting delivery entirely for a cooldown. A 429 with Retry-After reads as a request to slow down and is handled by throttling, which is exactly the behaviour you want; reserve 503 for the case where a dependency you need is genuinely unavailable.

Token bucket rate limiting for webhook senders — the sender-side throttle that keeps your 429s rare.
Handling slow webhook consumers — containing an endpoint whose latency, not volume, is the problem.
Per-endpoint circuit breaker state machines — sender-side protection that complements consumer backpressure.
Webhook Rate Limiting & Backpressure — the producer-side rate limiting and feedback model behind this guide.