Debugging Failed Webhook Deliveries

A delivery marked “failed” in a provider’s dashboard tells you almost nothing on its own — it could be a signature mismatch, a handler timeout, or a 5xx from a dependency three layers down. This page extends inspecting and replaying webhook deliveries with a disciplined triage that turns that vague signal into a root cause, and it pairs naturally with testing webhooks locally with ngrok and tunnels so you can reproduce the exact failing request against a debugger on your laptop. The method is always the same: read the captured delivery, classify the failure, recompute what should have happened, and replay until it reproduces.

[Figure: failed-delivery classification decision tree. A failed delivery read from the log branches by symptom into signature mismatch (check raw body), timeout (ack fast, defer work), or 5xx error (trace dependency); every branch ends in a replay to reproduce.]
Each failed delivery is classified by symptom into signature mismatch, timeout, or 5xx, then reproduced by replay before a fix is applied.

Prerequisites

Step-by-Step Implementation

1. Pull the failed delivery

Query the log for the specific failure rather than eyeballing a stream. Anchor on the response code and a time window.

psql -c "SELECT correlation_id, response_code, received_at
         FROM webhook_deliveries
         WHERE response_code >= 400
           AND received_at > now() - interval '1 hour'
         ORDER BY received_at DESC LIMIT 20;"

2. Classify the failure

The response code is your first branch. A 401/403 is almost always a signature problem. A handler that logs 408/499, or a provider that reports “timeout”, points at slow processing. A 500/502/503 means your handler accepted the request but then threw an exception or hit a dependency that was down. Classify before you change anything: fixing the wrong branch wastes the incident window.
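The branching rule above can be sketched as a small function. This is a minimal illustration, not a provider API: `classify_failure`, its branch names, and the `provider_reason` argument are hypothetical, and the code ranges simply mirror the ones listed in the paragraph.

```python
def classify_failure(response_code, provider_reason=""):
    """Map one failed delivery to a triage branch (hypothetical helper).

    response_code may be None when the provider never received a response.
    provider_reason is whatever free-text reason the provider dashboard shows.
    """
    reason = (provider_reason or "").lower()
    if response_code in (401, 403):
        return "signature_mismatch"   # next step: recompute the HMAC on raw bytes
    if response_code in (408, 499) or "timeout" in reason or "timed out" in reason:
        return "timeout"              # next step: ack fast, defer work
    if response_code is not None and 500 <= response_code <= 599:
        return "server_error"         # next step: trace the failing dependency
    return "unclassified"             # anything else needs a manual look


if __name__ == "__main__":
    for code, reason in [(401, ""), (None, "Timeout"), (503, ""), (404, "")]:
        print(code, reason or "-", "=>", classify_failure(code, reason))
```

Running the classifier over the rows returned by the query in step 1 gives a per-branch count, which shows at a glance whether an incident is one class of failure or several.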

3. Recompute the signature on the raw bytes

For a suspected signature mismatch, recompute the HMAC over the exact stored bytes and compare it to the header the provider sent. Then repeat the computation over the body after a JSON parse-and-reserialize round trip. If the raw-byte signature matches but the re-serialized one does not, any handler that verifies after parsing is checking mutated bytes, which is the single most common signing bug.

import hmac, hashlib, json, psycopg2

conn = psycopg2.connect("dbname=webhooks")
cur = conn.cursor()
cur.execute("SELECT raw_body, headers FROM webhook_deliveries WHERE correlation_id=%s",
            ("abc123def456",))
raw, headers = cur.fetchone()
raw = bytes(raw)                  # bytea arrives as a memoryview
if isinstance(headers, str):      # text column; json/jsonb columns arrive as a dict
    headers = json.loads(headers)
secret = b"whsec_..."

ts = headers["X-Webhook-Timestamp"]
received = headers["X-Webhook-Signature"].split("v1=")[-1]

def sign(body):
    # HMAC-SHA256 over "<timestamp>." + body, hex-encoded
    return hmac.new(secret, f"{ts}.".encode() + body, hashlib.sha256).hexdigest()

print("from raw bytes:", hmac.compare_digest(sign(raw), received))   # True => signing OK
# Sign the re-serialized JSON too; False here with True above confirms a body-mutation bug:
reparsed = json.dumps(json.loads(raw)).encode()
print("from reparsed :", hmac.compare_digest(sign(reparsed), received))

4. Reproduce locally with a replay

Start your handler locally behind a tunnel, then replay the captured raw delivery into it so you can attach a debugger to the exact failing input. Re-sign with a fresh timestamp so your own replay-attack prevention window does not reject it.

TS=$(date +%s)
SIG=$(python -c "import hmac,hashlib,sys; \
  print(hmac.new(b'whsec_...', f'$TS.'.encode()+open('body.bin','rb').read(), \
  hashlib.sha256).hexdigest())")
curl -i -X POST http://localhost:8000/webhooks/orders \
  -H "Content-Type: application/json" \
  -H "X-Webhook-Timestamp: $TS" \
  -H "X-Webhook-Signature: t=$TS,v1=$SIG" \
  --data-binary @body.bin
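The commands above read the payload from body.bin. One way to produce that file without altering a single byte is sketched below. It assumes the `webhook_deliveries` table and `raw_body` column used in step 3 and any DB-API cursor (for example one from psycopg2); `export_raw_body` is a hypothetical helper name.

```python
from pathlib import Path


def export_raw_body(cur, correlation_id, path="body.bin"):
    """Write the stored raw bytes of one delivery to disk, byte for byte.

    cur is an open DB-API cursor; table and column names are assumed
    to match the query used earlier on this page.
    """
    cur.execute(
        "SELECT raw_body FROM webhook_deliveries WHERE correlation_id = %s",
        (correlation_id,),
    )
    row = cur.fetchone()
    if row is None:
        raise LookupError(f"no delivery with correlation_id {correlation_id!r}")
    raw = bytes(row[0])            # bytea columns may arrive as a memoryview
    Path(path).write_bytes(raw)    # binary write: no newline or encoding changes
    return len(raw)
```

Writing in binary mode matters for the same reason `--data-binary` does: a text-mode write or a copy-paste through an editor can change line endings or trailing whitespace, and the replayed signature would then no longer match the original bytes.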

5. Fix and re-verify against the corpus

Apply the fix, then replay the whole captured corpus of recent deliveries — not just the one — to confirm you resolved the class of failure without regressing others.
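A corpus replay can be sketched as a loop that re-signs each stored body with a fresh timestamp and records every delivery that still fails. This is an illustration under assumptions, not the contents of any existing script: `replay_corpus`, `sign`, and the injected `send` callable are hypothetical names, and the header layout copies the curl example in step 4.

```python
import hashlib
import hmac
import time


def sign(secret, ts, raw):
    """HMAC-SHA256 over "<ts>." + raw bytes, hex-encoded (scheme from step 3)."""
    return hmac.new(secret, f"{ts}.".encode() + raw, hashlib.sha256).hexdigest()


def replay_corpus(deliveries, secret, send, now=None):
    """Replay every stored delivery and return the ones that still fail.

    deliveries: iterable of (correlation_id, raw_body_bytes)
    send:       callable(headers, body) -> HTTP status code; injected so the
                loop can be exercised without a network
    now:        optional fixed Unix timestamp, useful for deterministic tests
    """
    failures = []
    for correlation_id, raw in deliveries:
        ts = int(time.time() if now is None else now)
        headers = {
            "Content-Type": "application/json",
            "X-Webhook-Timestamp": str(ts),
            "X-Webhook-Signature": f"t={ts},v1={sign(secret, ts, raw)}",
        }
        status = send(headers, raw)
        if status >= 400:
            failures.append((correlation_id, status))
    return failures
```

Injecting `send` keeps the loop independent of any HTTP client; in practice it would wrap a POST to the local handler and return the response status. An empty return value means the whole corpus passed.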

Verification and Testing

Prove the fix with an assertion, not a glance. After replaying the corpus, the handler log must show zero failures for the previously failing event type:

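# Replay corpus, then assert no failures remain.
# grep counts the whole file, so start from a fresh handler.log or stale entries will fail the check.
python replay_corpus.py --since "1 hour ago"
test "$(grep -c '"webhook_failed"' handler.log)" -eq 0 && echo "PASS" || echo "FAIL"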

For the signature case specifically, the unit test should feed the stored raw bytes through your verification function and assert it now returns true. Persisting these reproductions as fixtures means the next deploy runs them automatically — turning a one-off debug into a permanent regression guard.
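A fixture-based regression test might look like the sketch below. The `verify_signature` function stands in for your own verification code, and the secret, timestamp, and body are made-up fixture values; a real fixture would use the captured raw bytes and the captured signature header rather than computing the header inside the test.

```python
import hashlib
import hmac
import json


def verify_signature(secret, timestamp, raw_body, signature_header):
    """Stand-in verifier: HMAC-SHA256 over "<timestamp>." + raw bytes."""
    expected = hmac.new(
        secret, f"{timestamp}.".encode() + raw_body, hashlib.sha256
    ).hexdigest()
    received = signature_header.split("v1=")[-1]
    return hmac.compare_digest(expected, received)


def test_stored_delivery_verifies():
    secret = b"test-secret"                    # fixture value, not a real secret
    timestamp = "1700000000"
    raw_body = b'{"id": 1,  "total": 10.0}'    # double space kept on purpose
    # Placeholder for the captured header; computed here only to keep the sketch self-contained.
    header = "t=1700000000,v1=" + hmac.new(
        secret, b"1700000000." + raw_body, hashlib.sha256
    ).hexdigest()

    # The stored raw bytes must verify.
    assert verify_signature(secret, timestamp, raw_body, header)

    # A parse-and-reserialize round trip changes the bytes, so it must not.
    reparsed = json.dumps(json.loads(raw_body)).encode()
    assert reparsed != raw_body
    assert not verify_signature(secret, timestamp, reparsed, header)


if __name__ == "__main__":
    test_stored_delivery_verifies()
    print("PASS")
```

The second assertion guards against the fix being undone later: if someone reintroduces parsing before verification, the fixture with unusual whitespace fails immediately.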

Failure Modes and Gotchas