Debugging Failed Webhook Deliveries
A delivery marked “failed” in a provider’s dashboard tells you almost nothing on its own — it could be a signature mismatch, a handler timeout, or a 5xx from a dependency three layers down. This page extends inspecting and replaying webhook deliveries with a disciplined triage that turns that vague signal into a root cause, and it pairs naturally with testing webhooks locally with ngrok and tunnels so you can reproduce the exact failing request against a debugger on your laptop. The method is always the same: read the captured delivery, classify the failure, recompute what should have happened, and replay until it reproduces.
Prerequisites
- A delivery log that stores the raw body bytes, all headers, the response code, and a correlation ID (see the capture patterns in the parent guide).
- The shared signing secret and knowledge of the provider’s HMAC scheme.
- The ability to replay a stored delivery idempotently against a local or staging handler.
curland a Python REPL for ad-hoc signature recomputation.- For local reproduction, a tunnel so the provider (or your replay tool) can reach your machine.
Step-by-Step Implementation
1. Pull the failed delivery
Query the log for the specific failure rather than eyeballing a stream. Anchor on the response code and a time window.
psql -c "SELECT correlation_id, response_code, received_at
FROM webhook_deliveries
WHERE response_code >= 400
ORDER BY received_at DESC LIMIT 20;"
2. Classify the failure
The response code is your first branch. A 401/403 is almost always a signature problem. A handler that logs 408/499 or a provider that reports “timeout” points at slow processing. A 500/502/503 means your handler accepted the request but threw or a dependency was down. Classify before you change anything — fixing the wrong branch wastes the incident window.
3. Recompute the signature on the raw bytes
For a suspected signature mismatch, recompute the HMAC over the exact stored bytes and compare to the header the provider sent. A difference between recomputed-from-raw and recomputed-from-reparsed proves the body was mutated before verification — the single most common signing bug.
import hmac, hashlib, json, psycopg2
conn = psycopg2.connect("dbname=webhooks")
cur = conn.cursor()
cur.execute("SELECT raw_body, headers FROM webhook_deliveries WHERE correlation_id=%s",
("abc123def456",))
raw, headers = cur.fetchone()
raw = bytes(raw)
secret = b"whsec_..."
ts = json.loads(headers)["X-Webhook-Timestamp"]
expected = hmac.new(secret, f"{ts}.".encode() + raw, hashlib.sha256).hexdigest()
received = json.loads(headers)["X-Webhook-Signature"].split("v1=")[-1]
print("from raw bytes:", hmac.compare_digest(expected, received)) # True => signing OK
# If False, recompute after re-serializing to confirm a body-mutation bug:
reparsed = json.dumps(json.loads(raw)).encode()
print("from reparsed :",
hmac.compare_digest(hmac.new(secret, f"{ts}.".encode()+reparsed,
hashlib.sha256).hexdigest(), received))
4. Reproduce locally with a replay
Start your handler locally behind a tunnel, then replay the captured raw delivery into it so you can attach a debugger to the exact failing input. Re-sign with a fresh timestamp so your own replay-attack prevention window does not reject it.
TS=$(date +%s)
SIG=$(python -c "import hmac,hashlib,sys; \
print(hmac.new(b'whsec_...', f'$TS.'.encode()+open('body.bin','rb').read(), \
hashlib.sha256).hexdigest())")
curl -i -X POST http://localhost:8000/webhooks/orders \
-H "Content-Type: application/json" \
-H "X-Webhook-Timestamp: $TS" \
-H "X-Webhook-Signature: t=$TS,v1=$SIG" \
--data-binary @body.bin
5. Fix and re-verify against the corpus
Apply the fix, then replay the whole captured corpus of recent deliveries — not just the one — to confirm you resolved the class of failure without regressing others.
Verification and Testing
Prove the fix with an assertion, not a glance. After replaying the corpus, the handler log must show zero failures for the previously failing event type:
# Replay corpus, then assert no failures remain.
python replay_corpus.py --since "1 hour ago"
test "$(grep -c '"webhook_failed"' handler.log)" -eq 0 && echo "PASS" || echo "FAIL"
For the signature case specifically, the unit test should feed the stored raw bytes through your verification function and assert it now returns true. Persisting these reproductions as fixtures means the next deploy runs them automatically — turning a one-off debug into a permanent regression guard.
Failure Modes and Gotchas
- Body mutated before verification. A middleware that parses and re-serializes JSON (or strips a trailing newline) invalidates the signature. Always verify against the untouched raw bytes captured at ingress.
- Timeout from synchronous work. If the handler does the real processing before responding, slow dependencies cause provider timeouts and retries. Acknowledge fast and defer via the synchronous versus async webhook split, then debug the worker separately.
- 5xx that is really a duplicate. A retried delivery hitting a non-idempotent handler can 500 on a unique-constraint violation. Confirm the event ID against your idempotency store before assuming a code bug.
- Stale-timestamp rejection on replay. Re-sending the original headers trips the replay window. Re-sign with a current timestamp for trusted internal replays.
Related
- Simulating webhook traffic spikes — reproducing load-induced timeouts at scale.
- Testing webhooks locally with ngrok and tunnels — exposing a local handler for replay.
- Inspecting and replaying webhook deliveries — the parent guide on capture and replay tooling.