Simulating Webhook Traffic Spikes

A traffic spike is the scenario that breaks most webhook receivers: a provider that has been quiet suddenly replays a backlog or fans out a large event batch, opening many concurrent connections within a second or two. This page extends the guide to load testing webhook endpoints with a focused recipe for reproducing that thundering herd using k6, and it pairs naturally with consumer-driven contract tests for webhooks so the payloads you blast are also schema-valid. The goal is not a vanity throughput number; it is to observe exactly how your endpoint degrades the instant arrival rate outruns processing rate.

The spike profile holds a flat baseline, jumps near-instantly to roughly ten times the rate for a short window, then drops back.

Prerequisites

k6 installed locally (brew install k6, or the Docker image grafana/k6).
A staging webhook endpoint that verifies signatures and enqueues work — never spike production.
The shared signing secret exported as an environment variable, plus knowledge of the provider’s exact HMAC scheme.
A metrics view of the receiver’s queue depth and worker lag, not just HTTP status, so you can see backlog form.
A defined latency budget (for example, p99 under 1 s) and an acceptable error rate (for example, under 1%).

Sizing the Spike: Multiplier, Hold Time, and Recovery Window

The three numbers in a spike scenario — how high, how long, and how long you watch afterwards — should come from the provider’s actual behaviour rather than from a round figure that felt dramatic. The most useful source is an outage: when a provider stops delivering for fifteen minutes and your normal rate is 20 events per second, it has 18,000 events queued, and it will hand them back as fast as its own concurrency allows. That is the burst you are modelling. A 10× multiplier is a sane starting default when you have no historical data, but the honest approach is a ladder — 3×, 10×, 30× — because the interesting result is not whether the endpoint survives one arbitrary number, it is where between those rungs the behaviour changes.
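The outage arithmetic and the ladder above can be sketched in a few lines. This is an illustrative helper, not part of the k6 script; the function names outageBacklog and ladder are assumptions introduced here.

```javascript
// Events a provider accumulates while it cannot deliver to you.
function outageBacklog(eventsPerSecond, outageMinutes) {
  return eventsPerSecond * outageMinutes * 60;
}

// Spike target rates for a multiplier ladder on top of the baseline rate.
function ladder(baseRate, multipliers = [3, 10, 30]) {
  return multipliers.map((m) => ({ multiplier: m, targetRate: baseRate * m }));
}

const backlog = outageBacklog(20, 15); // 20 events/s, 15-minute outage
const rungs = ladder(20);

console.log(backlog);                        // 18000
console.log(rungs.map((r) => r.targetRate)); // [ 60, 200, 600 ]
```

Running one test per rung, rather than one test at a single multiplier, is what lets you locate the rate at which behaviour changes.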

Hold time is the number people get wrong most often, because a short hold only tests buffers. Your receiver has time constants measured in tens of seconds: an autoscaler typically needs 30–90 seconds to notice load and bring a warm instance into rotation, a circuit breaker evaluates over a rolling window, and a queue’s visibility timeout only expires after its configured delay. A 30-second hold tests whether the existing capacity absorbs the burst; a five-minute hold tests whether the system can respond to one. Run both, and label the results differently, because they answer different questions and a system can easily pass the first while failing the second.

The recovery window follows from arithmetic you can do before the run. With a baseline of 20 events per second, a spike to 200 for 30 seconds, and a worker pool that genuinely drains 60 per second, the burst delivers 6,000 events while the workers complete 1,800, leaving a backlog of 4,200. Afterwards, arrival falls back to 20 per second against a 60 per second drain, so spare capacity is 40 per second and the backlog needs 105 seconds to clear — three and a half times the length of the burst that caused it. If your final scenario stage is 30 seconds long, the run ends while the system is still deep in trouble and the summary looks fine. Make the recovery stage at least twice the computed drain time.

Compute the drain time before writing the stages, then make the final stage outlast it — the shape of the tail is the actual finding.
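The drain arithmetic from this section can be written down once and reused to size the stages. The sketch below assumes the queue is empty when the spike begins and ignores the one-second ramps; drainPlan and spikeStages are hypothetical helper names, and the 30-second baseline stage is carried over from the example scenario.

```javascript
// Backlog left at the end of the hold, and the time needed to clear it.
function drainPlan({ base, spike, holdSeconds, drainPerSecond }) {
  const backlog = Math.max(0, (spike - drainPerSecond) * holdSeconds);
  const spare = drainPerSecond - base; // capacity left once arrivals fall back
  const drainSeconds = spare > 0 ? backlog / spare : Infinity;
  return { backlog, drainSeconds, recoverySeconds: Math.ceil(drainSeconds * 2) };
}

// Stages in the shape the ramping-arrival-rate executor expects,
// with a recovery stage twice as long as the computed drain time.
function spikeStages({ base, spike, holdSeconds, drainPerSecond }) {
  const { recoverySeconds } = drainPlan({ base, spike, holdSeconds, drainPerSecond });
  if (!Number.isFinite(recoverySeconds)) {
    throw new Error("baseline meets or exceeds the drain rate; the backlog never clears");
  }
  return [
    { target: base, duration: "30s" },               // baseline
    { target: spike, duration: "1s" },               // near-instant jump
    { target: spike, duration: `${holdSeconds}s` },  // hold
    { target: base, duration: "1s" },                // drop back
    { target: base, duration: `${recoverySeconds}s` }, // observe the full drain
  ];
}

const plan = drainPlan({ base: 20, spike: 200, holdSeconds: 30, drainPerSecond: 60 });
console.log(plan); // { backlog: 4200, drainSeconds: 105, recoverySeconds: 210 }
```

For the worked numbers above, the recovery stage comes out at 210 seconds, seven times the length of the burst.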

Step-by-Step Implementation

1. Establish the baseline rate

First measure the steady rate your endpoint sustains comfortably. Run a short constant-arrival-rate test and find the highest rate that keeps p99 within budget and the queue flat — the stepped procedure in benchmarking webhook throughput with k6 produces exactly this number. Call that BASE. Your spike will jump to a multiple of it (start with 10×).

2. Write the spike scenario

k6’s ramping-arrival-rate executor lets you express the jump as a very short stage: the script below ramps over one second, and a zero-duration stage makes the jump instantaneous. preAllocatedVUs must be large enough to sustain the spike rate even while responses are slow; under-allocating VUs silently caps the spike.

The scheduler asks for a rate; the virtual-user pool is what actually has to supply it, and the shortfall lands in dropped_iterations rather than on the wire.

import http from "k6/http";
import crypto from "k6/crypto";
import { check } from "k6";

const SECRET = __ENV.WEBHOOK_SECRET;
const HOST = __ENV.HOST; // e.g. https://staging.example.com
const BASE = Number(__ENV.BASE || 20);

export const options = {
  scenarios: {
    spike: {
      executor: "ramping-arrival-rate",
      startRate: BASE,
      timeUnit: "1s",
      preAllocatedVUs: 600, // headroom so the spike is not VU-limited
      maxVUs: 1500,
      stages: [
        { target: BASE, duration: "30s" },      // baseline
        { target: BASE * 10, duration: "1s" },  // near-instant jump
        { target: BASE * 10, duration: "30s" }, // hold the herd
        { target: BASE, duration: "1s" },       // drop back
        { target: BASE, duration: "30s" },      // observe recovery; lengthen to outlast the computed drain time
      ],
    },
  },
  thresholds: {
    http_req_duration: ["p(99)<1000"], // fail run if p99 reaches 1 s
    http_req_failed: ["rate<0.01"],    // fail run if 1% or more requests fail (by default, status outside 200-399)
  },
};
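A rough way to choose preAllocatedVUs is Little's law: the number of requests in flight equals the arrival rate multiplied by how long each request takes. The helper below is a rule-of-thumb sketch, not something k6 provides; the names and the 1.5× headroom factor are assumptions.

```javascript
// Little's law: requests in flight = arrival rate x time each request takes.
function vusNeeded(ratePerSecond, worstCaseResponseSeconds, headroom = 1.5) {
  return Math.ceil(ratePerSecond * worstCaseResponseSeconds * headroom);
}

// The slowest average response a given VU pool can carry at a given rate.
function slowestSustainableSeconds(vus, ratePerSecond) {
  return vus / ratePerSecond;
}

console.log(vusNeeded(200, 2));                   // 600
console.log(slowestSustainableSeconds(600, 200)); // 3
```

Read the other way round, the 600 pre-allocated VUs in the scenario sustain 200 requests per second only while responses average under three seconds; beyond that k6 has to draw on maxVUs, and past that limit iterations are dropped.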

3. Sign each generated payload

Reproduce the provider’s signing scheme inside the default function so every request exercises the real verification path. Make the event ID unique per request so consumer-side deduplication does not absorb the load.

Every hop here is production code; a spike that skips signing or reuses one event id measures a much cheaper system than the one you operate.

function signedHeaders(body) {
  const ts = Math.floor(Date.now() / 1000).toString();
  const mac = crypto.hmac("sha256", SECRET, `${ts}.${body}`, "hex");
  return {
    "Content-Type": "application/json",
    "X-Webhook-Signature": `t=${ts},v1=${mac}`,
  };
}

export default function () {
  const body = JSON.stringify({
    id: `evt_${Date.now()}_${__VU}_${__ITER}`,
    type: "order.created.v1",
    data: { amount: 4200, currency: "USD" },
  });
  const res = http.post(`${HOST}/webhooks/orders`, body, {
    headers: signedHeaders(body),
  });
  check(res, { "is 2xx": (r) => r.status >= 200 && r.status < 300 });
}
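Before running the spike, it can help to confirm offline that the header the script produces would verify. The following is a minimal Node.js sketch of a receiver-side check, assuming the same t=<timestamp>,v1=<hex HMAC> header and the same signed string (timestamp, a dot, then the raw body) used above. The function name verifySignature and the 300-second tolerance are illustrative assumptions; your provider's real scheme may differ.

```javascript
const crypto = require("node:crypto");

// Recompute the HMAC over `${t}.${rawBody}` and compare in constant time.
// toleranceSeconds (an assumed default) rejects stale or future-dated timestamps.
function verifySignature(secret, header, rawBody,
  nowSeconds = Math.floor(Date.now() / 1000), toleranceSeconds = 300) {
  const fields = Object.fromEntries(
    header.split(",").map((part) => part.trim().split("=", 2))
  );
  const ts = Number(fields.t);
  if (!Number.isInteger(ts) || typeof fields.v1 !== "string") return false;
  if (Math.abs(nowSeconds - ts) > toleranceSeconds) return false;

  const expected = crypto
    .createHmac("sha256", secret)
    .update(`${fields.t}.${rawBody}`)
    .digest();
  const given = Buffer.from(fields.v1, "hex");
  // timingSafeEqual throws on unequal lengths, so check the length first.
  return given.length === expected.length && crypto.timingSafeEqual(given, expected);
}

// Sign the way the k6 script does, then verify.
const body = JSON.stringify({ id: "evt_1", type: "order.created.v1" });
const ts = Math.floor(Date.now() / 1000);
const mac = crypto.createHmac("sha256", "s3cret").update(`${ts}.${body}`).digest("hex");
console.log(verifySignature("s3cret", `t=${ts},v1=${mac}`, body)); // true
```

The check runs against the raw request body, so any re-serialisation between the wire and the verifier changes the bytes and fails the comparison.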

4. Run the spike

WEBHOOK_SECRET=$SECRET HOST=https://staging.example.com BASE=20 \
  k6 run spike.js

Verification and Testing

A spike test is only meaningful if you confirm two things at the spike timestamp. First, the k6 summary should report http_req_duration p99 and http_req_failed rate — if either threshold tripped, k6 exits non-zero, which makes the test usable as a CI gate. Second, correlate that moment against your receiver’s queue depth: a healthy endpoint shows the queue rising during the hold and draining smoothly afterward. Assert recovery explicitly by checking that, in the final baseline stage, the queue returns to near zero. A quick log assertion confirms no events were dropped:

# Accepted count must equal the number of requests k6 sent (http_reqs in the summary).
grep -c '"accepted webhook"' receiver.log
# Drop and overflow markers must not appear at all; -E keeps the alternation portable.
grep -c -E '"dropped"|"queue full"' receiver.log   # must be 0
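The recovery assertion described above can also be automated once queue-depth samples are exported from your metrics store. This is a sketch under assumptions: the sample shape ({ t, depth }, with t in seconds since the run started), the function name assessRecovery, and the nearZero threshold of 10 are all illustrative.

```javascript
// samples: [{ t: secondsSinceStart, depth: queueDepth }, ...]
// spikeEnd: the second at which the hold stage finished.
function assessRecovery(samples, { spikeEnd, nearZero = 10 }) {
  const after = samples
    .filter((s) => s.t >= spikeEnd)
    .sort((a, b) => a.t - b.t);
  const peak = samples.reduce((max, s) => Math.max(max, s.depth), 0);

  // Find the last sample still above the threshold; recovery begins after it.
  let lastHigh = -1;
  after.forEach((s, i) => {
    if (s.depth > nearZero) lastHigh = i;
  });
  const recovered = after.length > 0 && lastHigh < after.length - 1;
  return { peak, recovered, drainedAt: recovered ? after[lastHigh + 1].t : null };
}

const samples = [
  { t: 0, depth: 0 }, { t: 30, depth: 0 }, { t: 45, depth: 2100 },
  { t: 60, depth: 4200 }, { t: 90, depth: 3000 }, { t: 120, depth: 1800 },
  { t: 150, depth: 600 }, { t: 165, depth: 0 }, { t: 180, depth: 0 },
];
console.log(assessRecovery(samples, { spikeEnd: 60 }));
// { peak: 4200, recovered: true, drainedAt: 165 }
```

A run whose samples stop while the queue is still above the threshold reports recovered as false, which is the case a short final stage hides.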

Any non-2xx the run surfaces should then be taken apart individually using the triage in debugging failed webhook deliveries — a spike that produces 401s is a signing bug in your script, not a capacity finding.

Deciding Whether a Spike Run Is Valid

Before arguing about whether the endpoint passed, decide whether the run is admissible evidence at all. Four signals settle it, and reading them in this order saves an afternoon of debating a graph that was never measuring the receiver.

A failure from a valid run is the result worth having: a failure you can trust tells you exactly how much capacity to add, while an invalid run tells you nothing at all.

Two subtler invalidators deserve their own mention. The first is metric resolution: a 30-second spike observed through a 30-second scrape interval is averaged into invisibility, and the queue-depth graph you are relying on may show a gentle bump where the reality was a cliff. Drop the scrape interval to 5–10 seconds for the duration of the run, or record the receiver’s own counters at higher resolution and reconcile afterwards. The second is the connection storm at the instant of the jump: k6 opens a socket per virtual user, so going from 20 to 600 active users in one second means hundreds of simultaneous TLS handshakes arriving at a load balancer whose accept queue may be shorter than that. If you see connection resets clustered exactly at the transition and nowhere else, you have found a listen-backlog limit, which is a real production finding but a different one from “the handler is too slow”.
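The resolution effect is easy to reproduce on paper. The sketch below averages a per-second offered rate over fixed windows, roughly as a rate computed over a scrape interval would; it models arrival rate rather than queue depth, and the spike is deliberately placed so that it straddles two 30-second windows. All names are illustrative.

```javascript
// Per-second offered rate: 20/s baseline with a 200/s spike from t=15 to t=45.
function offeredRate(t) {
  return t >= 15 && t < 45 ? 200 : 20;
}

// Average the per-second rate over consecutive fixed-size windows.
function windowAverages(totalSeconds, windowSeconds) {
  const averages = [];
  for (let start = 0; start < totalSeconds; start += windowSeconds) {
    let sum = 0;
    for (let t = start; t < start + windowSeconds; t++) sum += offeredRate(t);
    averages.push(sum / windowSeconds);
  }
  return averages;
}

console.log(Math.max(...windowAverages(60, 30))); // 110
console.log(Math.max(...windowAverages(60, 5)));  // 200
```

At 30-second resolution the tenfold spike reads as 110 per second, a little over five times the baseline; at 5-second resolution the full 200 is visible.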

Finally, compare the baseline stage before the spike with the baseline stage after it. Identical offered rates should produce identical latency; if the post-spike baseline is measurably worse, the receiver did not actually recover, and something — a saturated connection pool, a growing retry queue, a leaked worker — is still carrying the burst. That comparison costs nothing and catches the class of problem that a single peak number never will.
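A minimal sketch of that comparison, assuming you have already exported per-request latencies (for example from k6's JSON output) and split them by timestamp into the pre-spike and post-spike baseline stages. The nearest-rank percentile, the function names, and the 20% tolerance are assumptions chosen for illustration.

```javascript
// Nearest-rank percentile over a sorted copy of the samples.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Flags the post-spike baseline if its percentile exceeds the pre-spike one
// by more than the allowed tolerance.
function baselineRegressed(beforeMs, afterMs, { p = 99, tolerance = 0.2 } = {}) {
  const before = percentile(beforeMs, p);
  const after = percentile(afterMs, p);
  return { before, after, regressed: after > before * (1 + tolerance) };
}

const preSpike = [80, 90, 100, 110, 120];
const postSpike = [85, 95, 105, 115, 400];
console.log(baselineRegressed(preSpike, postSpike));
// { before: 120, after: 400, regressed: true }
```

With real runs the two stages should contain enough samples for the chosen percentile to be stable; five values are used here only to keep the example readable.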

Failure Modes and Gotchas

VU starvation caps the spike. If preAllocatedVUs is too low, k6 cannot issue the requested rate once responses slow, and you will under-test. Watch the dropped_iterations metric — any nonzero value means the generator, not the endpoint, was the limit. Raise preAllocatedVUs/maxVUs.
Connection limits on the generator host. A single machine can exhaust ephemeral ports or file descriptors during a 10× burst, producing EADDRNOTAVAIL that masquerades as endpoint errors. Raise ulimit -n, enable keep-alive, or run k6 distributed.
Fast 2xx hiding backlog. An endpoint that returns 200 immediately and enqueues can pass HTTP thresholds while its worker pool falls hours behind. Always assert on queue drain in the recovery stage, and consider routing overflow to a dead-letter queue so spikes degrade gracefully instead of dropping events.
Identical payloads. A static body lets caches and idempotency stores short-circuit processing, understating per-event cost. Always vary the event ID and meaningful fields per request.

A passing spike test is one where the receiver completes the whole cycle; a run that ends in Draining has only postponed the incident.

Frequently Asked Questions

Why is the second spike in a run always worse than the first?

Because the system has not returned to its starting state. Residual backlog, a circuit breaker sitting in half-open, a connection pool that churned, and caches evicted during the first burst all carry over into the second. Either allow a full drain plus a couple of minutes of quiet between spikes, or run each spike as its own scenario — otherwise you are measuring the recovery from spike one, not the response to spike two.

Should I use the provider's own sandbox to generate the burst instead of k6?

Use it once to confirm your synthetic request is byte-for-byte plausible — same headers, same signature format, same content type — and then generate load yourself. Provider sandboxes rate-limit aggressively, cannot be told to emit an exact arrival rate, and give you no way to reproduce a run tomorrow. Determinism and repeatability matter more here than authenticity, provided the authenticity check happened once.

The queue drained cleanly but the resulting data is inconsistent. Is that a load problem?

It is a concurrency problem that only load makes visible. Under a burst, many workers process events for the same entity simultaneously, so updates that arrived in order get applied out of order. If your handler is not idempotent and order-aware per key, a spike test is exactly the thing that will expose it — which is a valuable finding, but it needs a partitioning or locking fix rather than more capacity.

How should the test treat 429 responses from my own rate limiter?

As designed behaviour, counted separately from server errors. Split your thresholds so that 5xx responses fail the run while 429s are tracked as a shedding rate, and assert that every 429 carries a Retry-After header the sender can actually use. A receiver that sheds politely at a known rate is in far better shape than one that accepts everything and quietly grows an unbounded backlog.
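The split can be expressed as a small classifier whose output feeds separate counters. This is a plain JavaScript sketch with hypothetical names; inside a k6 script you would call something like it from the default function and record the result in custom metrics, which is left out here.

```javascript
// Sort a response into the bucket the thresholds should treat it as.
function classify(status, headers = {}) {
  // Header names are case-insensitive, so normalise before the lookup.
  const lower = Object.fromEntries(
    Object.entries(headers).map(([name, value]) => [name.toLowerCase(), value])
  );
  if (status >= 200 && status < 300) return "accepted";
  if (status === 429) return lower["retry-after"] ? "shed" : "shed_without_retry_after";
  if (status >= 500) return "server_error";
  return "other";
}

// Tally a batch of responses; 5xx or a 429 lacking Retry-After fails the run.
function summarize(responses) {
  const counts = {
    accepted: 0, shed: 0, shed_without_retry_after: 0, server_error: 0, other: 0,
  };
  for (const r of responses) counts[classify(r.status, r.headers)] += 1;
  const total = responses.length || 1;
  return {
    counts,
    sheddingRate: (counts.shed + counts.shed_without_retry_after) / total,
    pass: counts.server_error === 0 && counts.shed_without_retry_after === 0,
  };
}

const result = summarize([
  { status: 200, headers: {} },
  { status: 429, headers: { "Retry-After": "5" } },
]);
console.log(result.sheddingRate, result.pass); // 0.5 true
```

Failing on any 5xx is stricter than a rate threshold; relax it to a rate if your error budget allows a small number of server errors.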

Does the spike need production-sized data behind it?

Yes for anything database-bound, because an empty table is a completely different machine. Indexes fit in memory, the planner picks different strategies, and lock contention that dominates production simply does not occur. Seed the staging store to production-like row counts and cardinality per tenant before the run, or the ceiling you measure will be several times optimistic.

Can a spike test run in the deployment pipeline?

A small one can, and it is worth doing: a fixed 3× spike at a fixed baseline, held for 30 seconds with thresholds attached, takes about two minutes and catches regressions in the enqueue path. Keep the large multipliers for scheduled runs against a properly sized environment, since those need cleanup and a human reading the queue graphs afterwards.

Benchmarking webhook throughput with k6 — establishing the baseline rate the spike multiplies.
Consumer-driven contract tests for webhooks — keep spike payloads schema-valid.
Debugging failed webhook deliveries — diagnosing the failures a spike exposes.
Load testing webhook endpoints — the parent guide on capacity testing.