Load Testing Webhook Endpoints

Load testing a webhook receiver is the practice of subjecting your ingestion path to synthetic delivery traffic so that capacity limits are discovered in a controlled environment rather than during a provider’s production fan-out, and it belongs to the broader discipline of Webhook Testing & Local Development. This guide assumes you already operate a receiver that verifies signatures and enqueues work, and that you want to know its true throughput ceiling, its tail latency under stress, and the exact request rate at which it begins shedding or corrupting events. Readers should be comfortable with HTTP semantics, percentile latency, and a scripting language for the load generator.

The load generator drives concurrent signed requests into the endpoint and its async queue while a metrics pipeline records tail latency and error rate.

Modeling Realistic Delivery Traffic

The single most common load-testing mistake is generating a smooth, constant request rate that no real provider ever produces. Production webhook traffic is bursty: a provider batches events, opens a pool of concurrent connections, and delivers a spike that decays over seconds. A test that ignores this shape will report a throughput number your endpoint cannot actually sustain when a real burst arrives. Build at least three traffic profiles and run each separately.

Steady-state soak: a constant arrival rate held for 30–60 minutes to surface memory leaks, connection-pool exhaustion, and slow queue drain that only appear over time.
Ramp to breaking point: a linearly increasing arrival rate that climbs until error rate crosses a threshold (for example, 1% non-2xx). The rate at the crossover is your endpoint’s saturation point.
Burst / thundering herd: a near-instantaneous jump from baseline to many multiples of it, holding briefly, then dropping. This models a provider replaying a backlog and is covered in depth in simulating webhook traffic spikes.

Each profile answers a different question, so blending them into a single run leaves you unable to attribute any failure to a cause.
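The three profiles can be written as plain functions from elapsed time to target arrival rate, a shape most generators accept in some form. This is a minimal sketch: the function names and every rate and duration in it are illustrative assumptions, not values taken from any provider.

```python
def soak_rate(t: float, rate: float = 200.0) -> float:
    """Steady-state soak: the same arrival rate for the whole run."""
    return rate


def ramp_rate(t: float, start: float = 50.0, slope: float = 5.0) -> float:
    """Ramp to breaking point: the rate climbs by `slope` rps every second."""
    return start + slope * t


def burst_rate(t: float, baseline: float = 50.0, multiple: float = 10.0,
               burst_start: float = 60.0, burst_length: float = 15.0) -> float:
    """Burst: jump from baseline to a multiple of it, hold briefly, drop back."""
    in_burst = burst_start <= t < burst_start + burst_length
    return baseline * multiple if in_burst else baseline
```

Keeping each profile in its own function makes it harder to blend them by accident, since a run picks exactly one.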

Drive load with open-model arrival rates, not a fixed number of looping virtual users. A closed model (each user waits for a response before sending the next request) artificially throttles itself when the endpoint slows down, hiding the queueing collapse you are trying to find. k6’s constant-arrival-rate and ramping-arrival-rate executors express load as requests per second independent of response time, which is the behavior a fire-and-forget webhook sender exhibits. Locust’s constant_throughput wait time approximates the same thing by pacing each user to a fixed per-user rate, so the aggregate rate is the per-user rate multiplied by the user count, and it holds only while enough users are spawned that none is still waiting on a response when its next request is due.
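The self-throttling effect shows up in simple arithmetic. A closed pool of looping users can offer at most users divided by (response time plus think time) requests per second, so the offered rate falls exactly when the endpoint slows. A minimal sketch, where the function name and the numbers are illustrative assumptions rather than measurements:

```python
def closed_model_rps(users: int, response_ms: float, think_ms: float = 0.0) -> float:
    """Upper bound on the request rate a closed pool of looping users can offer."""
    return users * 1000 / (response_ms + think_ms)


# The same 50 users offer 20x less load once responses stretch to two seconds.
healthy = closed_model_rps(users=50, response_ms=100)
stalled = closed_model_rps(users=50, response_ms=2000)
```

An arrival-rate executor would keep offering the original rate during the stall, which is what exposes the queue growth.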

Coordinated Omission and Generator-Side Distortion

Open-model executors remove the worst measurement bias, but they do not remove all of it. Coordinated omission is the effect where the generator fails to issue a request at its scheduled instant — because every virtual user is blocked on a slow response, or because the process itself was descheduled — and then never records the latency those delayed requests would have shown. The requests that would have suffered most are precisely the ones that never got sent, so the surviving samples produce a percentile table that looks better the worse the endpoint behaves. A run reporting a clean p99 of 180 ms while the receiver was stalling for two seconds is not a lucky result; it is a missing-sample artifact, and acting on it means sizing your capacity from data the outage deleted.

The arithmetic that catches it is trivial and belongs in every run report. Multiply the configured arrival rate by the scenario duration to get the number of requests the run intended to issue, then compare that against the completed-iteration count the tool prints. A 30-minute soak at 400 requests per second intends 720,000 requests; a summary showing 703,000 means 2.4% of the schedule was never issued and the reported p99 describes only the easy 97.6%. k6 states this outright by incrementing dropped_iterations whenever no virtual user is free at the moment the scheduler wants to start one. Locust is quieter — constant_throughput simply falls behind — so you have to read the achieved requests-per-second in the stats table and compare it to the rate you asked for.

A stall silently deletes the exact samples that would have proved the stall, which is why the completed-iteration count is a precondition for trusting any percentile.
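The intended-versus-completed comparison is small enough to automate at the end of every run. The sketch below assumes the completed-iteration count can be read from the tool's summary; the function name and the default 1% cutoff parameter are illustrative.

```python
def schedule_completeness(rate_rps: float, duration_s: float,
                          completed: int, max_missing: float = 0.01) -> dict:
    """Compare the requests a run intended to issue with those it completed."""
    intended = rate_rps * duration_s
    missing = max(intended - completed, 0)
    fraction = missing / intended
    return {
        "intended": int(intended),
        "missing": int(missing),
        "missing_fraction": fraction,
        # Percentiles are only trustworthy when almost nothing was skipped.
        "latency_valid": fraction <= max_missing,
    }


# The 30-minute soak at 400 rps from the text, with 703,000 completions.
report = schedule_completeness(400, 30 * 60, 703_000)
```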

The fix has two halves. First, give the pool enough headroom that a slow response cannot starve the scheduler: preallocate at least target_rate × worst_case_latency virtual users, so a 500 rps scenario that must survive a 2-second worst case needs 1,000 concurrent users and should be configured with roughly 1,200 to leave slack. Second, separate the two questions you are asking. A latency run must be held below the drop threshold so every sample is real; a saturation run deliberately pushes past it and is read through dropped_iterations and error codes rather than percentiles. Publishing one number from a run that did both is how teams end up defending a throughput figure that has never once been achieved in production.
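The headroom rule translates directly into a sizing helper. This is a sketch under the assumption that slack is expressed as a whole-number percentage; the function and parameter names are hypothetical.

```python
import math


def preallocated_users(target_rps: float, worst_case_latency_s: float,
                       slack_percent: int = 20) -> tuple:
    """Return (minimum, recommended) virtual users for an arrival-rate scenario."""
    # Little's Law: in-flight requests = arrival rate x time in system.
    minimum = math.ceil(target_rps * worst_case_latency_s)
    recommended = minimum + math.ceil(minimum * slack_percent / 100)
    return minimum, recommended


# 500 rps that must survive a 2-second worst case.
minimum, recommended = preallocated_users(500, 2.0)
```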

Treat any run with more than 1% dropped iterations as void for latency purposes. It still tells you something — that the endpoint could not absorb the offered rate — but its percentile table is not evidence and should never be pasted into a capacity document. When you need defensible tail numbers under heavy load, run the generator at a fixed rate you know it can sustain and scale out horizontally rather than asking one process to both saturate the endpoint and time it accurately.

Sizing and Isolating the Load Generator

More load tests are limited by the machine running them than by the system under test, and the failure is indistinguishable from endpoint saturation on a graph: throughput flattens, latency rises, errors appear. The difference is only visible if you instrument the generator host as carefully as the receiver. Four resources run out first, in a predictable order.

Ephemeral ports go first when connections are not reused. A default Linux range of roughly 28,000 ports combined with a 60-second TIME_WAIT caps a single source address at about 470 new connections per second. That is well below the rates people expect from a laptop, and it manifests as EADDRNOTAVAIL or a hard plateau at a suspiciously round number. Enabling keep-alive collapses the problem: at 500 requests per second with a 50 ms response time, Little’s Law says only 25 requests are in flight, so 25 reused sockets carry the whole load instead of 500 fresh handshakes every second. File descriptors go next: a default ulimit -n of 1024 fails long before any interesting rate. Then TLS handshake CPU: each new HTTPS connection costs roughly 1–3 ms of CPU on the generator, so a no-keep-alive test at 500 rps burns between half a core and one and a half cores purely on cryptography you are not trying to measure. Finally, the scheduler itself slips once the process exceeds roughly 60% CPU, which reintroduces coordinated omission through the back door.

Keep-alive, a raised descriptor limit and CPU headroom are not tuning niceties — without them the run measures the generator and reports it as your endpoint's ceiling.
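The generator-side ceilings can be estimated before a run rather than discovered during it. The sketch below assumes the common Linux default ephemeral range of 32768 to 60999 (28,232 ports) and a 60-second TIME_WAIT; check both on the actual host, since either may have been tuned. Function names are illustrative.

```python
def max_new_connections_per_s(ephemeral_ports: int = 28_232,
                              time_wait_s: int = 60) -> float:
    """Ceiling on fresh connections per second from one source address."""
    return ephemeral_ports / time_wait_s


def sockets_in_flight(rate_rps: float, response_ms: float) -> float:
    """Little's Law: reused sockets needed to carry a rate with keep-alive on."""
    return rate_rps * response_ms / 1000


def tls_handshake_cores(new_conns_per_s: float, cpu_ms_per_handshake: float) -> float:
    """CPU cores spent on handshakes when every request opens a new connection."""
    return new_conns_per_s * cpu_ms_per_handshake / 1000
```

Running these three with the planned rate gives a quick answer to whether keep-alive is optional for the run (it rarely is).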

Placement is the other half of isolation, and it is a deliberate choice rather than a default. Running the generator inside the same region or VPC as the receiver strips 30–80 ms of wide-area round trip out of every sample, which is what you want when the question is “how much work can this service absorb”. Running it from a region comparable to the provider’s egress keeps that latency in, which is what you want when the question is “what will the provider’s client actually experience, and will it hit its timeout”. Pick one, write it down next to the result, and never compare numbers gathered under the two placements. When one node genuinely runs out of headroom, scale horizontally — k6 across several instances with aggregated output, or Locust in master and worker mode — rather than accepting a plateau you have not attributed.

Reading the Results: Utilisation, the Knee, and Recovery Time

A load test produces a graph; capacity planning needs a number, and the translation is where most reports go wrong. The rate at which throughput stops increasing is not the rate you can run at. Queueing behaviour means the interesting boundary is the knee — the offered rate at which p99 has roughly doubled from its unloaded baseline while throughput has flattened. Past that point every additional request buys queue depth instead of work, and the system’s behaviour becomes dominated by how quickly it can recover rather than how fast it can serve.
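One way to pull the knee out of a stepped ramp is to scan for the first step where both conditions hold. This is a sketch, not a standard algorithm: "flattened" is assumed here to mean achieved throughput grew by less than 5% over the previous step, and the sample data is invented for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class Step(NamedTuple):
    offered_rps: float
    achieved_rps: float
    p99_ms: float


def find_knee(steps: Sequence[Step], flat_gain: float = 0.05) -> Optional[Step]:
    """First step where p99 has doubled from baseline and throughput is flat."""
    if not steps:
        return None
    baseline_p99 = steps[0].p99_ms
    for prev, cur in zip(steps, steps[1:]):
        achieved_gain = (cur.achieved_rps - prev.achieved_rps) / prev.achieved_rps
        p99_doubled = cur.p99_ms >= 2 * baseline_p99
        if p99_doubled and achieved_gain < flat_gain:
            return cur
    return None


# Hypothetical ramp results, one entry per step of the ramp.
ramp = [
    Step(200, 200, 40), Step(400, 400, 44), Step(560, 558, 60),
    Step(620, 575, 85), Step(700, 578, 400),
]
knee = find_knee(ramp)
```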

Little’s Law turns the graph into arithmetic you can defend. With an arrival rate of 800 events per second and a mean end-to-end processing time of 40 ms, the number of items in flight is 800 × 0.04 = 32; a worker pool with 24 concurrent slots therefore cannot keep up, and the queue grows at the difference between arrival and drain rate. If that pool actually drains 600 events per second, a 60-second overload at 800 rps accumulates 60 × 200 = 12,000 events of backlog. Recovery is the same sum in reverse: with the burst over and baseline traffic at 500 rps, spare capacity is 100 events per second, so the backlog needs 120 seconds to clear — twice as long as the incident that created it. That asymmetry is the single most useful thing a load test can tell an on-call engineer, and it never appears in a percentile table.

Peak requests per second is the number a run happens to reach; the knee is the number you can operate at with a burst still absorbable.
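The backlog and recovery sums above are worth keeping as a small calculator so they can be rerun with measured values. A minimal sketch with hypothetical function names, using the numbers from the worked example:

```python
def pool_drain_rps(slots: int, service_time_ms: float) -> float:
    """Maximum events per second a pool of concurrent slots can complete."""
    return slots * 1000 / service_time_ms


def backlog_after_overload(arrival_rps: float, drain_rps: float,
                           overload_s: float) -> float:
    """Events queued by the end of an overload window."""
    return max(arrival_rps - drain_rps, 0) * overload_s


def recovery_seconds(backlog: float, drain_rps: float, baseline_rps: float) -> float:
    """Time to clear a backlog using only the capacity baseline traffic leaves spare."""
    spare = drain_rps - baseline_rps
    return float("inf") if spare <= 0 else backlog / spare


drain = pool_drain_rps(slots=24, service_time_ms=40)
backlog = backlog_after_overload(arrival_rps=800, drain_rps=drain, overload_s=60)
recovery = recovery_seconds(backlog, drain_rps=drain, baseline_rps=500)
```

The infinite result for zero or negative spare capacity is deliberate: a pool already saturated by baseline traffic never recovers on its own.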

Convert the knee into an operating target by discounting it. A commonly workable default is to plan for 60–70% of the knee rate as sustained load, which leaves roughly a third of capacity as burst absorption and keeps the system on the flat part of the latency curve where small increases in traffic cause small increases in latency. Feed the same number into your alerting: if the knee is 620 rps, an alert at 400 rps sustained gives you the margin to add workers before the queue starts growing, and it pairs naturally with the delivery targets described in defining SLOs for webhook delivery.

Equally important is knowing when a run is simply invalid and must be thrown away rather than interpreted. Discard the first 60–120 seconds of every run: JIT warm-up, connection-pool fill, cold caches and lazily-initialised clients make early samples systematically pessimistic. Discard any run where an autoscaler changed the replica count mid-flight, because you measured two different systems averaged together. Discard any run where a shared staging database was simultaneously serving another team’s test, where the receiver’s downstream dependency was rate-limited (you measured the dependency’s quota), or where the generator reported dropped iterations above 1%. Re-run the ramp three times and require the saturation rate to agree within about 10%; a spread wider than that means something in the environment is varying more than the load you are applying, and no single run from that set is worth quoting.
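Two of these validity rules are mechanical enough to enforce in code. The sketch below assumes samples arrive as (elapsed_seconds, latency_ms) pairs and measures spread against the lowest of the repeated saturation rates; both choices, and the function names, are assumptions.

```python
from typing import List, Sequence, Tuple


def trim_warmup(samples: Sequence[Tuple[float, float]],
                warmup_s: float = 120.0) -> List[Tuple[float, float]]:
    """Drop (elapsed_seconds, latency_ms) samples recorded during warm-up."""
    return [(t, latency) for t, latency in samples if t >= warmup_s]


def saturation_spread(rates: Sequence[float]) -> float:
    """Relative spread of repeated saturation rates, against the lowest one."""
    return (max(rates) - min(rates)) / min(rates)


def ramp_is_quotable(rates: Sequence[float], tolerance: float = 0.10) -> bool:
    """Require at least three ramps that agree within the tolerance."""
    return len(rates) >= 3 and saturation_spread(rates) <= tolerance
```

For example, ramps that saturate at 610, 620 and 640 rps agree within about 5% and can be quoted; 500, 620 and 640 cannot.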

Operating Load Tests Against Shared Staging

The receiver is rarely the only thing a synthetic delivery touches. A verified webhook usually writes rows, calls a downstream sandbox API, emits metrics, and sometimes sends notifications — and at 400 requests per second for 30 minutes that is 720,000 events, each multiplying into whatever your handler does next. Mapping that blast radius before the first run is the difference between a load test and an incident with your name on it.

Draw this map before the first run: the guards, not the receiver, are what stop a capacity experiment turning into a page for another team.

Three guards cover almost every case. Route the run through a dedicated synthetic tenant so its rows, quotas and metrics are separable from everyone else’s, and so cleanup is a single scoped delete rather than a forensic exercise. Replace genuinely external side effects — notifications, card authorisations, partner callbacks — with sink adapters selected by configuration, and assert in the run’s setup that the sink is active, because the check you skip is the run that emails 700,000 customers. And prefix every synthetic record with a marker such as loadtest_ so a truncate is trivial and so any dashboard can exclude the traffic afterwards. A downstream sandbox with a 100 rps quota is worth calling out separately: past that rate you are measuring the sandbox’s rate limiter, and the receiver’s own ceiling is invisible behind it.
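The guards are most useful when the run refuses to start without them. The following sketch assumes a plain dictionary of run configuration; the keys tenant and notification_adapter and the value "sink" are hypothetical names, to be replaced with whatever your configuration actually uses.

```python
SYNTHETIC_PREFIX = "loadtest_"


def synthetic_id(raw_id: str) -> str:
    """Prefix a record id so synthetic rows can be excluded or truncated later."""
    return raw_id if raw_id.startswith(SYNTHETIC_PREFIX) else SYNTHETIC_PREFIX + raw_id


def assert_safe_to_run(config: dict) -> None:
    """Abort the run unless the synthetic tenant and sink adapter are active."""
    if not config.get("tenant", "").startswith(SYNTHETIC_PREFIX):
        raise RuntimeError("load test must target a dedicated synthetic tenant")
    if config.get("notification_adapter") != "sink":
        raise RuntimeError("external side effects are not routed to a sink")
```

Calling assert_safe_to_run in the run's setup turns a skipped check into a failed start rather than a flood of real notifications.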

Cadence matters as much as safety. A full ramp-to-break run is expensive and disruptive, so run it on a schedule — nightly or weekly — and keep a short, cheap version in the release pipeline: a three-minute constant-arrival-rate run at 70% of the last known knee, gated on p99 and error-rate thresholds so the pipeline fails automatically on a regression. Store each run’s saturation rate as a time series; a 15% drop between consecutive runs is a far more actionable signal than any absolute number, because it points at a specific change set. The result is a capacity figure that ages with the code rather than a one-off measurement that becomes folklore.
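Flagging a drop between consecutive runs takes only a few lines once the saturation rates are stored. A minimal sketch, assuming the history is an ordered list of (label, saturation_rps) pairs; the build labels and rates below are invented for illustration.

```python
from typing import List, Sequence, Tuple


def regressions(history: Sequence[Tuple[str, float]],
                drop_threshold: float = 0.15) -> List[Tuple[str, float]]:
    """Return (label, fractional drop) for runs that fell sharply from the previous one."""
    flagged = []
    for (_, prev_rate), (label, rate) in zip(history, history[1:]):
        drop = (prev_rate - rate) / prev_rate
        if drop >= drop_threshold:
            flagged.append((label, round(drop, 3)))
    return flagged


history = [("build-101", 620), ("build-102", 615), ("build-103", 520)]
flagged = regressions(history)
```

Here only the third run is flagged, which narrows the search to the changes that landed between the second and third builds.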

Choosing Between k6 and Locust

k6 scripts are written in JavaScript and executed by a JavaScript runtime embedded in a single Go binary, which produces very high request rates from one machine with low CPU overhead. That makes it ideal when you need tens of thousands of requests per second and want first-class arrival-rate executors and percentile thresholds as pass/fail gates; benchmarking webhook throughput with k6 walks that path end to end. Locust scripts are Python, which makes it trivial to reuse your existing signing code, generate complex payloads, and model stateful sequences; it scales horizontally across worker processes when a single node runs out of headroom. For signed-payload webhook tests where you must reproduce the provider’s exact HMAC scheme, Locust’s Python ergonomics often win; for raw ceiling-finding, k6 is leaner. Whichever you pick, the load generator must send a valid signature so the request exercises the real verification path; testing against an endpoint with auth disabled measures a system you will never run.

The choice is decided by what the run has to prove, not by language preference — and both branches end at a signed request.

Measuring p95, p99, and the Breaking Point

Averages lie. A receiver can show a 60 ms mean while one request in fifty takes two seconds because a connection waited behind a saturated pool: 49 requests at about 20 ms plus one at 2,000 ms average out to roughly 60 ms. Providers retry on timeout, so tail latency directly drives duplicate deliveries and retry storms; the receiver must therefore acknowledge each delivery quickly and defer the real work, the trade-off weighed in synchronous callbacks versus async webhooks. Track p95 and p99 response time, the full error-rate breakdown by status code, and, critically, the queue depth and worker lag behind the endpoint, not just the HTTP response. A receiver that returns 200 in 10 ms while its queue grows unbounded has not passed; it has merely deferred its failure. Define explicit thresholds in the test so the run fails automatically when p99 exceeds your budget or error rate breaches 1%.

Encode all four as thresholds so the run fails itself; a human reading a percentile table will always forgive a queue that is quietly growing.
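When the load tool cannot see the queue, the same gate can be applied after the run from exported data. The sketch below uses a nearest-rank percentile and treats any net growth in queue depth as a failure; both are simplifying assumptions, and the function names and budgets are hypothetical.

```python
import math
from typing import List, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile; pct is a whole number such as 95 or 99."""
    ordered = sorted(samples)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


def failed_thresholds(latencies_ms: Sequence[float], non_2xx: int,
                      queue_depth_start: int, queue_depth_end: int,
                      p95_budget_ms: float, p99_budget_ms: float,
                      max_error_rate: float = 0.01) -> List[str]:
    """Return the names of every threshold the run breached (empty means pass)."""
    failures = []
    if percentile(latencies_ms, 95) > p95_budget_ms:
        failures.append("p95")
    if percentile(latencies_ms, 99) > p99_budget_ms:
        failures.append("p99")
    if non_2xx / len(latencies_ms) > max_error_rate:
        failures.append("error_rate")
    if queue_depth_end > queue_depth_start:
        failures.append("queue_growth")
    return failures


# One request in fifty stalls for two seconds while the queue quietly grows.
latencies = [20.0] * 98 + [2000.0] * 2
failed = failed_thresholds(latencies, non_2xx=0, queue_depth_start=10,
                           queue_depth_end=5000, p95_budget_ms=500,
                           p99_budget_ms=1000)
```

In this example p95 and the error rate both look healthy, and the run still fails on p99 and queue growth.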

Failure Mode Analysis

Failure mode	Impact	Mitigation
Closed-model VUs mask collapse	Reported throughput is unachievable under real bursts	Use open-model arrival-rate executors (k6 ramping-arrival-rate, Locust constant_throughput)
Fast 2xx, unbounded queue growth	Endpoint “passes” while events back up and age out	Assert on queue depth and worker lag, not only HTTP latency
Load generator is the bottleneck	Plateau is the test rig’s limit, not the endpoint’s	Distribute load across nodes; monitor generator CPU and socket exhaustion
Signature verification skipped in test	Measured path differs from production hot path	Sign every synthetic request with the real HMAC scheme
Single fixed payload	Caches and dedup hide real per-event cost	Randomize event IDs and payload bodies per request

Runnable Implementation Example

The following Python Locust file signs each synthetic delivery with the provider’s HMAC scheme and paces every user with constant_throughput, so the aggregate arrival rate stays constant as the endpoint slows, provided responses still return within each user’s pacing interval.

import hashlib
import hmac
import json
import os
import time
import uuid
from locust import HttpUser, task, constant_throughput

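SECRET = os.environ["WEBHOOK_SECRET"].encode()
# Aggregate target rate for the whole run, in requests per second.
TARGET_RPS = float(os.environ.get("TARGET_RPS", "50"))
# Must match the -u value passed to locust, because constant_throughput is per user.
USERS = int(os.environ.get("USERS", "200"))


def sign(body: bytes, timestamp: str) -> str:
    """Reproduce the provider's signing scheme: HMAC-SHA256 over "<timestamp>.<body>"."""
    message = f"{timestamp}.".encode() + body
    digest = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    return f"t={timestamp},v1={digest}"


class WebhookSender(HttpUser):
    # constant_throughput paces each user, so split the aggregate target across
    # the pool. The rate holds only while responses return faster than each
    # user's pacing interval (USERS / TARGET_RPS seconds).
    wait_time = constant_throughput(TARGET_RPS / USERS)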

    @task
    def deliver(self):
        # Unique id per request defeats consumer-side caching/dedup.
        payload = {
            "id": str(uuid.uuid4()),
            "type": "order.created.v1",
            "data": {"amount": 4200, "currency": "USD"},
        }
        body = json.dumps(payload).encode()
        ts = str(int(time.time()))
        headers = {
            "Content-Type": "application/json",
            "X-Webhook-Signature": sign(body, ts),
        }
        with self.client.post(
            "/webhooks/orders",
            data=body,
            headers=headers,
            name="POST /webhooks/orders",
            catch_response=True,
        ) as resp:
            # Treat slow 2xx as a failure to surface tail latency in the report.
            if resp.elapsed.total_seconds() > 1.0:
                resp.failure("over 1s budget")
            elif resp.status_code >= 300:
                resp.failure(f"status {resp.status_code}")

Run it with locust -f load.py --headless -u 200 -r 20 --run-time 10m --host https://staging.example.com, then watch the percentile table and, separately, your queue-depth dashboard.

Debugging Checklist

Confirm the load generator itself is not CPU- or socket-bound before trusting any plateau.
Verify every synthetic request carries a valid signature and unique event ID.
Correlate the HTTP p99 spike with queue depth and worker lag at the same timestamp.
Re-run the breaking-point ramp three times; the saturation rate should be stable within ~10%.
Check for ephemeral-port and connection-pool exhaustion (look for EADDRNOTAVAIL or keep-alive churn) on both ends.
Ensure the staging database and downstream services are sized like production, not scaled down.

Frequently Asked Questions

How long does a soak have to run before its result means anything?

Thirty to sixty minutes is the practical minimum, because the defects a soak exists to find — file-descriptor leaks, connection-pool churn, unbounded in-memory buffers, slow queue drain — accumulate over tens of minutes rather than seconds. Throw away the first one to two minutes as warm-up, then require the last third of the run to show flat memory and flat queue depth. If either is still trending upward when the run ends, the correct conclusion is that the run was too short, not that the endpoint passed.

Can I trust a ceiling measured on a staging environment smaller than production?

Only as a relative signal, never as an absolute capacity number. Throughput does not scale linearly with a scaled-down environment, because the binding constraint is usually a single shared resource — database connections, a connection pool, a downstream quota — whose limit is not divided by your scale factor. A quarter-sized staging tier can produce a ceiling that is anywhere from one tenth to nearly all of production's, so use it to detect regressions between runs and size real capacity on production-shaped infrastructure.

What p99 budget should I actually set for a receiver?

Derive it from the sender's client timeout rather than from taste. Most providers abandon a delivery somewhere between 5 and 10 seconds and then retry it, so any response time approaching a third of that budget starts producing duplicate deliveries under load. For a receiver whose job is to verify a signature and enqueue, a p99 under 1 second is a comfortable and achievable target, and treating anything slower as a failed threshold keeps retry amplification out of your traffic model.

Why did throughput go down when I added more virtual users?

That is the signature of running past the knee into contention: more concurrent requests mean more competition for connection-pool slots, database locks and CPU, and the extra context switching costs more than the extra parallelism buys. Real throughput regression under increased offered load is a strong indicator that some resource is being held for the duration of a request rather than released early. Find the resource that is held longest — usually a pooled connection or a row lock — before adding capacity, because more replicas contending for the same lock makes the curve worse.

Does signature verification itself need its own load test?

The hashing does not — an HMAC over a two-kilobyte body costs microseconds and will never be your bottleneck. What does deserve a dedicated measurement is everything around it: buffering the raw body, parsing JSON, and above all fetching the signing secret. If the secret is retrieved from a secrets manager or database per request rather than from a cached value, that single call can dominate the entire request, so run one test with the cache warm and one with it disabled and compare.

How do I keep a load run from flooding the dead-letter queue?

Expect it to fill and plan the cleanup, because a run pushed past saturation is supposed to produce failures. Point the synthetic tenant at its own dead-letter destination so its entries are separable, cap the retry attempts for synthetic events lower than production, and truncate that destination as part of the run's teardown. If synthetic failures land in the shared dead-letter store, the next person triaging real growth there will waste hours on your traffic.

Should the same test double as a CI gate and as a capacity study?

No, and trying to make it do both is why so many pipelines have a flaky load stage. A CI gate needs to be short, cheap, deterministic and run at a fixed rate well below saturation so it only fails on real regressions; a capacity study needs long ramps, repeated runs and a controlled environment. Keep two scenario files that share the same payload and signing code, and let only the short one block a release.

Benchmarking webhook throughput with k6 — turning the ramp into a defensible capacity number.
Simulating webhook traffic spikes — modeling thundering-herd bursts in detail.
Inspecting and replaying webhook deliveries — capture and re-send real traffic for repeatable load.
Webhook mocking and sandbox environments — a stand-in provider to load-test against when the real one has quotas.
Webhook Testing & Local Development — the broader testing discipline.