Instrumenting Webhooks with OpenTelemetry for End-to-End Tracing
When a webhook delivery is slow or fails intermittently, the most reliable way to find the culprit is to follow a single event across the producer and consumer as one trace. This guide builds exactly that, extending the broader patterns in Webhook Observability & Monitoring. We will wrap dispatch and delivery in OpenTelemetry spans, propagate the W3C Trace Context traceparent header from producer to consumer, and attach the span attributes that make the resulting trace actionable. Once spans exist, they become the substrate for the targets in defining SLOs for webhook delivery and the signals routed by alerting on webhook delivery failures.
Prerequisites
- Python 3.10+ with opentelemetry-sdk, opentelemetry-exporter-otlp, opentelemetry-instrumentation-requests, and a web framework (FastAPI shown here).
- A running OpenTelemetry collector or any OTLP-compatible backend (Jaeger, Tempo, Honeycomb) reachable from both services.
- An existing webhook dispatcher that sends HTTP POSTs, ideally backed by an outbox so you can record event creation time.
- Signature verification already in place on the consumer per HMAC signature verification — instrument around it, not instead of it.
Step 1: Configure the Tracer and Exporter
Initialize a tracer provider with an OTLP exporter and, critically, set the global propagator to W3C Trace Context so traceparent is the wire format on both ends.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.propagate import set_global_textmap
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
resource = Resource.create({"service.name": "webhook-dispatcher"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4317")))
trace.set_tracer_provider(provider)
# Ensure traceparent (W3C) is the propagation format on both producer and consumer.
set_global_textmap(TraceContextTextMapPropagator())
tracer = trace.get_tracer("webhook.dispatch")
Step 2: Open a Dispatch Span and Inject Context
Wrap each delivery attempt in a span. Use inject to write the active context into the outgoing headers; never hand-format traceparent yourself.
import requests
from opentelemetry.propagate import inject
def deliver(event, endpoint, attempt):
    with tracer.start_as_current_span("webhook.deliver") as span:
        span.set_attribute("webhook.endpoint_id", endpoint["id"])
        span.set_attribute("webhook.event_type", event["type"])
        span.set_attribute("webhook.attempt", attempt)
        span.set_attribute("webhook.payload_bytes", len(event["body"]))
        headers = {"Content-Type": "application/json"}
        inject(headers)  # writes traceparent into headers from the active span
        resp = requests.post(endpoint["url"], data=event["body"], headers=headers, timeout=10)
        span.set_attribute("http.response.status_code", resp.status_code)
        if resp.status_code >= 300:
            span.set_status(trace.Status(trace.StatusCode.ERROR, f"status {resp.status_code}"))
        return resp.status_code
Step 3: Set Span Attributes That Make Traces Actionable
The attributes above — endpoint_id, event_type, attempt, payload_bytes, and http.response.status_code — are what let you filter traces to “attempt > 1 deliveries to endpoint X that returned 5xx.” Add a span event for each retry decision so the backoff schedule is visible inline. Avoid putting the full payload or any secret on the span; record a payload hash instead.
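The retry event and payload hash described above could be attached along these lines. This is a minimal sketch, not a prescribed schema: the helper names, the webhook.payload_sha256 attribute, and the webhook.retry_scheduled event name are all hypothetical. It relies only on the span's set_attribute and add_event methods.

```python
import hashlib


def payload_fingerprint(body: bytes) -> str:
    """Return a SHA-256 hex digest that identifies the payload without exposing it."""
    return hashlib.sha256(body).hexdigest()


def annotate_retry(span, body: bytes, attempt: int, next_delay_s: float, reason: str) -> None:
    """Attach a payload hash and a retry-decision event to an OpenTelemetry span.

    The attribute and event names below are illustrative assumptions.
    """
    # Record a hash instead of the payload itself so no secrets land on the span.
    span.set_attribute("webhook.payload_sha256", payload_fingerprint(body))
    # One event per retry decision keeps the backoff schedule visible inline.
    span.add_event(
        "webhook.retry_scheduled",
        attributes={
            "webhook.attempt": attempt,
            "webhook.next_delay_s": next_delay_s,
            "webhook.retry_reason": reason,
        },
    )
```

Calling annotate_retry(span, event["body"], attempt, delay, "status 503") inside the webhook.deliver span would place the retry decision next to the failing attempt in the trace view, assuming event["body"] is bytes.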
Step 4: Extract Context and Start a Consumer Span
On the consumer, extract the context from request headers before starting your handler span. This is the join that makes the consumer span a child of the producer span.
from fastapi import FastAPI, Request
from opentelemetry import trace
from opentelemetry.propagate import extract
app = FastAPI()
tracer = trace.get_tracer("webhook.consume")
@app.post("/webhooks")
async def receive(request: Request):
    ctx = extract(dict(request.headers))  # reads traceparent into a context
    with tracer.start_as_current_span("webhook.handle", context=ctx) as span:
        body = await request.body()
        span.set_attribute("webhook.payload_bytes", len(body))
        # verify_signature(...) then process; span auto-closes on exit
    return {"status": "ok"}
Step 5: Record Errors and Close Spans
Set the span status to error on any non-2xx outcome or exception and call record_exception so the stack trace rides on the span. Because the spans use context managers they close automatically, but never swallow exceptions before recording them — an unrecorded error is an invisible failure.
Verification and Testing
Run both services against a local collector and fire one event, then assert the trace joined correctly. A focused integration test extracts the context the producer would send and confirms the trace ID matches:
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.sdk.trace import TracerProvider

def test_traceparent_round_trips():
    # Use a real SDK provider: without one, the no-op tracer injects no header.
    tracer = TracerProvider().get_tracer("test")
    with tracer.start_as_current_span("producer") as producer:
        headers = {}
        inject(headers)
        assert "traceparent" in headers
        producer_trace_id = producer.get_span_context().trace_id
    ctx = extract(headers)
    # The extracted span context carries the producer's trace id.
    span_ctx = trace.get_current_span(ctx).get_span_context()
    assert span_ctx.trace_id == producer_trace_id
You can also verify on the wire with curl and inspect that your handler logs the same trace ID it received:
curl -X POST http://localhost:8000/webhooks \
-H 'Content-Type: application/json' \
-H 'traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01' \
-d '{"type":"payment.succeeded"}'
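To compare the header value against what your handler logs, it can help to split the traceparent into its fields. The parser below is a reading aid for tests and log checks only, since production code should keep using inject and extract. It is a simplified sketch that accepts only the version 00 layout, and the names TraceParent and parse_traceparent are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT_V00 = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Parse a version-00 traceparent header; return None if it is malformed."""
    match = _TRACEPARENT_V00.fullmatch(value.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero trace or parent ids are invalid per W3C Trace Context.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # The lowest bit of the flags byte is the "sampled" flag.
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

For the curl request above, the parsed trace_id is 0af7651916cd43dd8448eb211c80319c, which is the value the consumer span should report as its trace ID.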
Failure Modes and Gotchas
- Forgetting to set the W3C propagator. OpenTelemetry’s default propagator may differ across SDK versions; if producer and consumer disagree, the header is written in one format and silently ignored in the other. Call set_global_textmap(TraceContextTextMapPropagator()) on both.
- A proxy or CDN stripping the header. Some edges drop unknown headers. Confirm traceparent survives the full path; if it is stripped, allowlist it at the proxy.
- Spans never exported because the process exits first. With BatchSpanProcessor, short-lived dispatch workers can exit before the batch flushes. Call provider.shutdown() on graceful exit, or use a span processor with a short schedule delay for low-volume workers.
- Re-delivered events looking like duplicate traces. A retried delivery should be a new span on the original trace. Combine this with idempotency in webhooks so duplicate handling is observable rather than silent.