Webhook Observability & Monitoring: Tracing, Metrics, and Structured Logs

Observability for webhook delivery is the discipline of making an inherently asynchronous, cross-organization pipeline legible enough to debug under load — and it is a core part of Webhook Architecture Fundamentals & Design Patterns. A webhook crosses a producer’s dispatcher, the public internet, a consumer’s edge, and a queue or worker before any business logic runs. Each hop can drop, delay, duplicate, or reorder an event, yet the producer often sees only an opaque HTTP status code and the consumer often sees only a payload with no lineage. Without deliberate instrumentation you cannot answer the questions that matter during an incident: which events were dispatched, which were acknowledged, how long delivery took end to end, and where in the chain failures concentrate. This guide assumes you already operate a webhook system at meaningful volume and want to wire three signals — distributed traces, delivery metrics, and structured logs — into a coherent telemetry layer that supports both real-time alerting and forensic replay.

The three signals are complementary, not interchangeable. Metrics answer “how healthy is the system right now” cheaply, but only while label cardinality stays bounded; traces answer “what happened to this one event” with full causal ordering; structured logs answer “exactly what did each component decide and why” with rich per-event context. A mature setup emits all three from the same instrumentation points, correlated by a shared trace identifier, so an alert on a metric links directly to the trace and the log lines for the offending deliveries.

The producer–consumer boundary is also an ownership boundary, and that is what makes webhook telemetry harder than tracing a request inside one service. The producer’s view ends at the HTTP response: it knows the status code, the byte count, and the wall-clock duration of the request, and nothing at all about whether the consumer durably enqueued the event or dropped it on the floor immediately after returning 200. The consumer’s view begins at that same instant and cannot see how long the event waited in the producer’s outbox, how many attempts preceded this one, or whether an earlier attempt already succeeded and this delivery is a duplicate. Every technique below exists to push a small, cheap piece of information across that boundary — a trace context header, an attempt counter, an event ID — so each side can reason about the half of the pipeline it does not own. When you cannot change the other side, which is the normal case for third-party integrations, the fallback is to reconstruct the missing half from your own signals and label the uncertainty explicitly rather than quietly assuming the invisible half is healthy.

Dispatch and ingest spans share a propagated trace context and emit metrics, traces, and logs to a common backend.

Distributed Tracing Across the Producer–Consumer Boundary

The hardest part of webhook observability is that a single logical event is processed by two systems that do not share a process, a host, or even an owner. Distributed tracing solves this by propagating a W3C Trace Context traceparent header from the producer’s dispatch span into the HTTP request, so the consumer can resume the same trace when it receives the payload. The producer opens a span when it pulls an event off its outbox, records the target URL and attempt number as span attributes, injects the trace context into the outgoing headers, and closes the span when the consumer’s HTTP response arrives. The consumer extracts that context, starts a child span around verification and business logic, and links any queued or retried work back to the original trace.

This producer-to-consumer propagation is what turns a wall of disconnected log lines into a single causal timeline. When delivery is slow, the trace tells you whether the latency lives in the producer’s queue, in network and TLS, or in the consumer’s handler. The mechanics — SDK setup, span attributes, and header injection and extraction — are covered in depth in instrumenting webhooks with OpenTelemetry. Pair tracing with idempotency in webhooks so that retried deliveries show up as additional spans on the same trace rather than as silent duplicate processing, and with message ordering guarantees so reordered events are visible as out-of-sequence spans rather than data corruption discovered days later.

One trace per event, one span per attempt — until the retry budget outgrows it

An event that finally lands on its fourth attempt should ideally read as one trace with four child spans, not as four unrelated traces. The clean way to get that is to create the trace when the event is created: open a root span as the outbox row is written, serialize the resulting traceparent into a column on that row, and have each dispatch attempt restore that context and open a child span carrying an attempt attribute. Every attempt then sits under one root, and the gap between spans is literally the backoff delay, which makes an over-aggressive or stalled retry schedule visible at a glance.
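A minimal sketch of that storage pattern, using only the standard library. In a real dispatcher the tracing SDK's propagator would mint and restore the context; the helper names and the outbox column names here are hypothetical, chosen only to show that every attempt reuses the stored trace ID and gets a fresh span ID.

```python
import secrets

def new_traceparent(sampled: bool = True) -> str:
    """Mint a root context as a W3C traceparent: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def attempt_traceparent(stored: str) -> str:
    """Header for one dispatch attempt: same trace ID, fresh span ID."""
    version, trace_id, _root_span_id, flags = stored.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

# Written once, alongside the outbox row (hypothetical column names).
outbox_row = {"event_id": "evt_9f2c1b4a", "traceparent": new_traceparent()}

# Each attempt restores the stored context instead of starting a new trace.
for attempt in (1, 2, 3, 4):
    print(attempt, attempt_traceparent(outbox_row["traceparent"]))
```

All four printed headers share one trace ID, which is what lets the backend draw the attempts under a single root.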

That model breaks down at exactly the point where retries become interesting. Tracing backends assemble a trace from spans that arrive within a bounded window — Tempo and Jaeger both flush and index in blocks measured in minutes, not hours — and a root span held open across a 26-hour retry budget will be split across many blocks. The observable symptom is a trace view that shows attempts 1 and 2 but silently omits attempts 3 and 4, or a trace that appears complete in the search results and empty when opened. The practical cutoff is the backend’s trace assembly window: if your total backoff schedule fits inside a few minutes, keep one long trace; if it stretches to hours, emit each attempt as its own short trace, add a span link back to the originating trace ID, and stamp event_id on every span so the join can be done by attribute query when the link is not enough. Losing the visual tree is a real cost, but it is smaller than the cost of a trace view that lies about what happened.
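The cutoff can be checked mechanically before choosing a trace model. The sketch below assumes a capped exponential backoff and a hypothetical 300-second assembly window; both the schedule parameters and the window are illustrative values, not figures from any particular backend.

```python
def total_backoff_seconds(base: float, factor: float, max_attempts: int, cap: float) -> float:
    """Sum of the waits between attempts for a capped exponential schedule."""
    return sum(min(base * factor ** i, cap) for i in range(max_attempts - 1))

def trace_model(schedule_seconds: float, assembly_window_seconds: float) -> str:
    """One long trace if the whole schedule fits the window, else a trace per attempt."""
    if schedule_seconds <= assembly_window_seconds:
        return "single_trace"
    return "trace_per_attempt_with_span_links"

ASSEMBLY_WINDOW_S = 300  # assumption: replace with your backend's real window

short = total_backoff_seconds(base=5, factor=2, max_attempts=4, cap=60)
long_ = total_backoff_seconds(base=60, factor=4, max_attempts=8, cap=21_600)
print(short, trace_model(short, ASSEMBLY_WINDOW_S))
print(long_, trace_model(long_, ASSEMBLY_WINDOW_S))
```

The first schedule waits 35 seconds in total and fits one trace; the second stretches past 17 hours and should be split into per-attempt traces joined by span links and event_id.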

Trusting inbound trace context from a party you do not control

On the consumer side, an inbound traceparent is attacker-controlled input, and treating it as ordinary infrastructure metadata is a mistake in three separate ways. First, anyone who can guess or observe an internal trace ID can inject spans into that trace and pollute an investigation. Second, the sampled flag is a free lever on your observability bill: a caller who sets the trace-flags field to 01 on every request forces a naive ParentBased sampler to keep 100% of that traffic, and a few million requests can evict the traces you actually needed from a retention-limited backend. Third, tracestate is a free-form key–value list that, unless you cap it, will happily carry a kilobyte of junk into every span you store.

The defensive default for a public ingest endpoint is: validate the header format strictly, drop tracestate from untrusted peers, and make your own sampling decision instead of honouring the remote flag. Keep the correlation by recording the remote trace ID as a plain span attribute plus a span link, rather than adopting it as the trace ID of your own trace. For a partner you do control (your own dispatcher calling your own regional consumer), adopting the context directly is correct and gives you the single unified trace that makes cross-boundary latency legible. The distinction is not technical sophistication; it is whether the peer can be held to a contract.
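A sketch of the strict validation step for an untrusted peer, assuming header names have already been lower-cased. The function name and the returned attribute keys are hypothetical. Accepting only version 00 is a deliberately strict choice for untrusted input, and the function never reads the sampled flag or tracestate.

```python
import re
from typing import Optional

# version 00, 32-hex trace ID, 16-hex parent ID, 2-hex flags; lower-case only.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}")

def remote_trace_reference(headers: dict) -> Optional[dict]:
    """Return the caller's IDs for use as a span link, or None if malformed.

    The flags field and any tracestate header are ignored on purpose:
    the sampling decision is made locally, not by the caller.
    """
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid in W3C Trace Context
    return {"remote.trace_id": trace_id, "remote.span_id": span_id}

inbound = {
    "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
    "tracestate": "vendor=anything-the-caller-likes",
}
print(remote_trace_reference(inbound))
```

The returned values are meant to be attached to a locally created span as attributes and a link, so the local trace keeps its own ID.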

Sampling Strategy and Telemetry Cost Control

Full-fidelity tracing of a busy webhook pipeline is affordable only by accident. Work the arithmetic before you turn the exporter on: 50 million deliveries a month at an average of 1.4 attempts each is 70 million dispatch spans, and if the consumer side is also instrumented that is roughly 140 million spans a month. At 300–500 bytes per span after compression that is 42–70 GB of trace ingest a month, and many managed backends bill for that ingest by the gigabyte. Nobody signs off on that line item to investigate a handful of incidents a quarter.

The naive fix — head-based sampling at 1% — is worse than the cost it saves. Failures are rare by construction: at a 99.9% success ratio, only 140,000 of those 140 million spans represent a failed delivery, and a uniform 1% sample keeps about 1,400 of them, scattered arbitrarily. When an integrator asks why event evt_9f2c was never delivered, the answer is a 99% chance of “we did not keep that trace”. Head sampling optimizes for the traces you will never look at.
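The same arithmetic as a few lines of integer math, so the inputs can be swapped for your own volumes. The variable names are illustrative; the figures are the ones used in the text.

```python
deliveries_per_month = 50_000_000
attempts_x10 = 14  # 1.4 attempts per delivery, kept as an integer to avoid float drift

dispatch_spans = deliveries_per_month * attempts_x10 // 10  # producer side only
total_spans = dispatch_spans * 2                            # producer + consumer
failed_spans = total_spans // 1000                          # 0.1% at a 99.9% success ratio
kept_by_head_sampling = failed_spans // 100                 # uniform 1% sample

print(f"{total_spans:,} spans, {failed_spans:,} failed, "
      f"{kept_by_head_sampling:,} failed spans survive a 1% head sample")
```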

Tail-based sampling inverts the priority. The collector buffers the spans of a trace, waits a fixed decision window, and then applies policies: keep every trace containing an error status, every trace where attempt exceeds 1, every trace whose duration exceeds the p99 threshold, and a 1–2% probabilistic slice of everything else as a healthy baseline. On the numbers above, with the baseline at 1%, that yields roughly 140,000 failed-delivery spans plus about 1.4 million baseline spans, plus whatever the retry and latency policies catch: a low single-digit percentage of raw volume, with complete coverage of everything you would ever investigate.
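The policy order can be sketched as a single decision function applied once a trace's decision window has elapsed. This is an illustration of the logic only, not a collector configuration; the TraceSummary type, the reason strings, and the injectable rng parameter are assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class TraceSummary:
    has_error: bool     # any span in the trace ended with an error status
    max_attempt: int    # highest attempt attribute seen on any span
    duration_s: float   # wall-clock duration of the buffered trace

def keep_reason(
    trace: TraceSummary,
    *,
    p99_threshold_s: float,
    baseline_ratio: float = 0.01,
    rng: Callable[[], float] = random.random,
) -> Optional[str]:
    """Return why the trace is kept, or None if it should be dropped."""
    if trace.has_error:
        return "error"
    if trace.max_attempt > 1:
        return "retried"
    if trace.duration_s > p99_threshold_s:
        return "slow"
    if rng() < baseline_ratio:
        return "baseline"  # small probabilistic slice of healthy traffic
    return None

print(keep_reason(TraceSummary(False, 3, 0.2), p99_threshold_s=2.0))
```

Returning the reason rather than a bare boolean makes it cheap to count how much volume each policy contributes.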

Tail sampling brings two operational constraints that surprise teams. The collector must receive all spans of a trace on the same replica, which means putting a trace-ID-aware load-balancing exporter in front of the sampling tier; without it, each replica sees a fragment, makes an independent decision, and you get half-traces. The symptom is distinctive: traces look complete while the collector runs a single replica and start fragmenting the day you scale it out for throughput. The second constraint is memory. The buffer is roughly span rate × decision wait × span size: 3,000 spans/s with a 20-second wait and 500-byte spans is about 30 MB per replica and unremarkable, while 30,000 spans/s with a 60-second wait is 900 MB and will OOM a default-sized pod. Set the decision wait to just longer than a single attempt’s timeout — 15–30 seconds is right for a 10-second HTTP timeout — and never try to make it long enough to cover a retry backoff, because that is what span links and event_id are for.
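Both constraints reduce to small functions. The sketch below estimates the buffer from the formula above and shows the idea behind trace-ID-aware routing with a plain hash-and-modulo; the function names are hypothetical, and a production router would more likely use consistent hashing so that scaling the tier does not reshuffle every trace.

```python
import hashlib

def tail_buffer_bytes(spans_per_s: int, decision_wait_s: int, span_bytes: int) -> int:
    """Approximate in-memory buffer: span rate x decision wait x span size."""
    return spans_per_s * decision_wait_s * span_bytes

def replica_for(trace_id: str, replicas: int) -> int:
    """Route every span of a trace to the same sampling replica."""
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % replicas

print(tail_buffer_bytes(3_000, 20, 500) / 1e6, "MB")
print(tail_buffer_bytes(30_000, 60, 500) / 1e6, "MB")
print(replica_for("0af7651916cd43dd8448eb211c80319c", replicas=4))
```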

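Of the strategies compared here (full-fidelity tracing, uniform head-based sampling, and tail-based sampling), tail sampling is the only one that keeps every failure without paying for every success; the extra collector memory is the price.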

Delivery Metrics: Success Rate, Latency, and Queue Depth

Metrics are the cheapest signal to alert on and the basis for any service level objective. Four families matter for webhooks. Delivery success rate is the ratio of acknowledged deliveries (terminal 2xx) to total dispatch attempts, sliced by endpoint and event type. Delivery latency is the end-to-end time from event creation to consumer acknowledgement, recorded as a histogram so you can read p50, p95, and p99 rather than a misleading average — the bucket layout and quantile maths behind those figures are covered in tracking webhook delivery latency percentiles. Retry depth counts events currently in backoff and the distribution of attempt numbers, which surfaces a failing endpoint before its events exhaust their budget. Dead-letter queue depth is the count of events that have given up retrying and landed in the dead-letter queue; a rising DLQ depth is the clearest single indicator that a consumer is durably broken.

Keep cardinality under control: label by endpoint identifier and event type, but never by raw event ID or full URL, or your metrics backend will buckle. These metrics feed directly into target setting, which is the subject of defining SLOs for webhook delivery: success ratio and latency become your service level indicators, and their targets define the error budget that drives release decisions.

Every label on a delivery counter must come from a bounded set; per-event detail belongs on traces and logs instead.

Doing the cardinality arithmetic before you ship the metric

“Bounded” is not a feeling; it is a multiplication you can do in advance. A platform with 5,000 registered endpoints and 40 event types that labels its delivery counter by endpoint, event type, and outcome produces 5,000 × 40 × 4 = 800,000 active series from one counter. Add a latency histogram with 12 buckets (in Prometheus exposition that is 12 bucket series plus _sum and _count, so 14 series per label set) labeled by endpoint and event type, and it contributes another 5,000 × 40 × 14 = 2.8 million series, for 3.6 million in total. Prometheus costs roughly 1–3 KB of resident memory per active series, so webhook delivery alone would claim roughly 4–11 GB before anything else in the system is scraped. That is how a metric that looked reasonable in staging takes down a shared monitoring cluster on the day a large customer onboards 2,000 endpoints.

The fix is to stop labeling every metric identically. Split responsibilities: put the high-cardinality dimension on the cheap metric and drop it from the expensive one. A counter labeled by endpoint and outcome is 20,000 series and is what you need for per-endpoint failure alerting. A histogram labeled only by endpoint tier and outcome — say four tiers by traffic class — is 4 × 4 × 14 = 224 series and is enough to compute the latency percentiles the SLO cares about. Per-event-type detail lives on traces and logs, where it costs nothing per unique value. As a rule of thumb, budget a hard ceiling of 100,000 series for the whole webhook subsystem and check the projection against it whenever you add a label; if the new label would break the ceiling, it belongs on a span attribute instead.
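The projection is easy to automate and run whenever a label is added. A minimal sketch, assuming each label's value count is known in advance; the function name and the dictionary layout are illustrative.

```python
from math import prod

SERIES_CEILING = 100_000  # budget for the whole webhook subsystem

def series_count(label_sizes: dict, series_per_label_set: int = 1) -> int:
    """Product of label value counts, times series emitted per label set
    (1 for a counter, buckets + _sum + _count for a histogram)."""
    return prod(label_sizes.values()) * series_per_label_set

counter_wide = series_count({"endpoint": 5_000, "event_type": 40, "outcome": 4})
counter_lean = series_count({"endpoint": 5_000, "outcome": 4})
histogram_lean = series_count({"tier": 4, "outcome": 4}, series_per_label_set=14)

for name, count in [("wide counter", counter_wide),
                    ("lean counter", counter_lean),
                    ("lean histogram", histogram_lean)]:
    verdict = "ok" if count <= SERIES_CEILING else "over budget"
    print(f"{name}: {count:,} series ({verdict})")
```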

The attempt label deserves special scrutiny, including in the example code further down this guide. With a maximum of eight attempts it multiplies every labeled series eightfold, and the analytical value of distinguishing attempt 6 from attempt 7 is close to zero. Collapse it to two values — first and retry — on the counter, and read the full attempt distribution from spans when you actually need it.

Producer-side and consumer-side numbers will not agree, and that gap is a signal

If both sides of an integration are instrumented, you have two success ratios, and they will differ. The producer counts what it observed; the consumer counts what it accepted. They diverge in specific, diagnosable ways. A response lost after the consumer committed — a connection reset between the consumer’s 200 and the producer’s socket — is recorded as a failure by the producer and a success by the consumer, and the producer will retry an event the consumer has already processed. An edge proxy or CDN that acknowledges the request before the application sees it produces the opposite skew: the producer records success for an event that never reached business logic, which is the single most damaging silent failure in webhook delivery because it looks perfect on every dashboard.

Reconcile deliberately rather than assuming. A daily job that compares the set of event IDs the producer marked acknowledged against the set the consumer recorded as processed will surface both skews within a day instead of a quarter. Typical healthy numbers look like a producer success ratio of 99.2% against a consumer-accepted ratio of 99.6%; that 0.4-point delta is deliveries the consumer committed but the producer never saw acknowledged, which come back as duplicates that the consumer’s idempotency layer absorbs, and it should be roughly stable. A delta that inverts, with the consumer processing fewer events than the producer believes it delivered, means acknowledgements are being issued by something that is not the handler, and it warrants an immediate look at the edge configuration.
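At its core the reconciliation job is two set differences. The sketch below assumes both sides can export the day's event IDs as sets; the function name and result keys are hypothetical.

```python
def reconcile(producer_acked: set, consumer_processed: set) -> dict:
    """Compare what the producer believes it delivered with what the consumer processed."""
    return {
        # Acknowledged by something, but the handler never saw it:
        # the edge-proxy skew, and the one that needs immediate attention.
        "acked_but_never_processed": producer_acked - consumer_processed,
        # Processed, but the producer never recorded the acknowledgement:
        # typically a response lost after commit, later retried as a duplicate.
        "processed_but_not_acked": consumer_processed - producer_acked,
    }

producer = {"evt_1", "evt_2", "evt_3", "evt_5"}
consumer = {"evt_1", "evt_2", "evt_4", "evt_5"}
report = reconcile(producer, consumer)
print(sorted(report["acked_but_never_processed"]))
print(sorted(report["processed_but_not_acked"]))
```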

Exemplars: the link from an aggregate spike to one real request

Metrics and traces are usually joined by hand: an operator sees a p99 spike, guesses a time range, and hunts for a slow trace. Exemplars remove the guesswork by attaching a sample trace ID directly to a histogram bucket observation, so a click on the spiking bucket in Grafana opens an actual trace that landed in it. The cost is small — one exemplar per series per scrape interval, stored in a separate ring buffer — and the operational payoff during an incident is disproportionate, because it removes the step where people give up. Exemplars require OpenMetrics exposition and an explicitly enabled exemplar storage flag on the Prometheus side, and they only work if the recorded trace was actually sampled: pair them with the tail-sampling policy above, which keeps high-latency traces by construction, or the exemplar will point at a trace that was thrown away.

Structured Logs and Trace Correlation

Logs carry the detail that metrics aggregate away and traces summarize. Emit one structured (JSON) log line per delivery decision — dispatched, acknowledged, retry-scheduled, dead-lettered — and stamp every line with the trace_id and span_id from the active span plus the event ID, endpoint ID, attempt number, and HTTP status. That correlation key is the join that lets an operator pivot from a latency alert to the exact deliveries that breached it. Never log raw payloads or signature secrets; log a payload hash and the verification outcome instead, consistent with the controls in HMAC signature verification.

The set of decisions worth logging is small and finite, because a delivery only ever moves through a handful of observable states. Model those states explicitly and the log stream becomes a replayable audit of every choice the dispatcher made.

Logging on state transitions rather than on every code path keeps the stream small enough to retain and complete enough to reconstruct a delivery.

The delivery log line schema

Fix the field set once and treat it as an interface. Every consumer of these logs — the support tool that answers “where is my event”, the reconciliation job, the incident dashboard — depends on the names staying stable, and renaming endpoint_id to destination_id after the fact breaks all of them silently.

Field	Example value	Why it earns its place
`trace_id`	`0af7651916cd43dd8448eb211c80319c`	The join key from a metric alert or a customer report to the full causal timeline
`event_id`	`evt_9f2c1b4a`	Survives when tracing sampled the trace away; the only stable identity across retries and replays
`endpoint_id`	`ep_7f2a`	Groups every delivery to one destination so a single broken integration is isolable
`attempt`	`3`	Distinguishes a first-attempt blip from an endpoint that has been failing for hours
`outcome`	`retry_scheduled`	The state transition itself; the whole line exists to record this
`http_status`	`503`	Separates consumer rejection from transport failure without reading the message text
`latency_ms`	`8421`	Per-attempt duration, so a slow success is as visible as an outright failure
`payload_sha256`	`9f2c1b…`	Proves which bytes were sent during a dispute without storing the bytes

Two of those fields carry a trap. latency_ms must record the attempt duration measured on the producer’s clock, never a difference between a producer timestamp and a consumer timestamp; clock skew of even a few hundred milliseconds between two organizations’ hosts produces negative latencies that quietly corrupt any percentile computed from the log stream. And payload_sha256 is only privacy-preserving when payloads carry enough entropy. A body like {"user_id": 42, "type": "plan.upgraded"} has a domain small enough to enumerate, so a plain SHA-256 is effectively reversible by anyone who can read your logs. Where payload shapes are that constrained, hash with a keyed HMAC using an internal secret rotated on the same schedule as your other keys, and accept that the hash then only supports equality checks inside your own systems.
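A sketch of a log-line builder that avoids both traps: the duration comes from a single monotonic clock on the producer, and the payload digest is a keyed HMAC. The key constant and function names are hypothetical, and the field keeps the name payload_sha256 only so the schema stays stable even though the value is an HMAC-SHA-256.

```python
import hashlib
import hmac
import json
import time

LOG_HASH_KEY = b"replace-with-a-rotated-internal-secret"  # hypothetical key

def payload_digest(payload: bytes, key: bytes = LOG_HASH_KEY) -> str:
    """Keyed digest: supports equality checks without being enumerable."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def delivery_log_line(*, trace_id, event_id, endpoint_id, attempt, outcome,
                      http_status, started_monotonic, payload) -> str:
    """One JSON line per state transition, using the fixed field set."""
    latency_ms = int((time.monotonic() - started_monotonic) * 1000)
    return json.dumps({
        "trace_id": trace_id,
        "event_id": event_id,
        "endpoint_id": endpoint_id,
        "attempt": attempt,
        "outcome": outcome,
        "http_status": http_status,
        "latency_ms": latency_ms,  # producer clock only, never a cross-host difference
        "payload_sha256": payload_digest(payload),
    })

body = b'{"user_id": 42, "type": "plan.upgraded"}'
started = time.monotonic()
print(delivery_log_line(
    trace_id="0af7651916cd43dd8448eb211c80319c", event_id="evt_9f2c1b4a",
    endpoint_id="ep_7f2a", attempt=3, outcome="retry_scheduled",
    http_status=503, started_monotonic=started, payload=body,
))
```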

Retention tiers and what they cost

Log volume for a busy pipeline is not small, and treating all of it as hot searchable data is how observability budgets get frozen. Four transitions per delivery at roughly 300 bytes a line is 1.2 KB per delivery; at 70 million attempts a month that is about 84 GB of raw JSON before indexing, and index overhead in a typical search-backed store adds 50–100% on top. Three tiers keep that sane. Hot storage of 7 days covers essentially every live incident and is what your search cluster is sized for. Warm storage of 30 days matches the SLO window and is where reconciliation and post-incident review happen; a columnar or compressed object-store tier is fine here because queries are analytical rather than interactive. Cold retention of 400 days, written as compressed objects keyed by event ID, exists purely to answer contractual questions about whether a specific event was sent, and does not need to be searchable at all.

Resist the temptation to sample the log stream to fit the budget. Sampling breaks the one guarantee that makes these logs worth keeping — that any given event ID can be accounted for — and it fails precisely during the incidents that generate the most volume. Reduce volume structurally instead: drop the dispatched line for attempts that succeed immediately and log only the terminal acked line for them, which removes roughly half the volume in a healthy system while leaving every failing delivery fully traced. That trade is safe because a first-attempt success has no diagnostic content; the moment an attempt fails, log every transition for that event from then on.
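One way to implement that rule is to withhold the first dispatched line until the first attempt resolves. The class below is a sketch under that assumption; the class name and outcome strings are illustrative, and holding the line in memory means a crash between dispatch and resolution loses it, which a real dispatcher would need to weigh.

```python
TERMINAL = {"acked", "dead_lettered"}

class TransitionLog:
    """Log only the terminal line for clean first-attempt successes;
    log every transition once an event has failed at least once."""

    def __init__(self, emit):
        self._emit = emit
        self._held = {}        # event_id -> withheld first "dispatched" line
        self._verbose = set()  # events that have left the happy path

    def record(self, event_id: str, attempt: int, outcome: str) -> None:
        line = {"event_id": event_id, "attempt": attempt, "outcome": outcome}
        if event_id not in self._verbose:
            if outcome == "dispatched" and attempt == 1:
                self._held[event_id] = line  # wait to see how attempt 1 ends
                return
            held = self._held.pop(event_id, None)
            if not (outcome == "acked" and attempt == 1):
                self._verbose.add(event_id)
                if held is not None:
                    self._emit(held)         # failure path: restore the full record
        self._emit(line)
        if outcome in TERMINAL:
            self._verbose.discard(event_id)

lines = []
log = TransitionLog(lines.append)
log.record("evt_ok", 1, "dispatched")
log.record("evt_ok", 1, "acked")
log.record("evt_bad", 1, "dispatched")
log.record("evt_bad", 1, "retry_scheduled")
log.record("evt_bad", 2, "dispatched")
log.record("evt_bad", 2, "acked")
print([(l["event_id"], l["attempt"], l["outcome"]) for l in lines])
```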

Reading the Dashboard in the First Five Minutes of an Incident

Dashboards fail in a specific way during incidents: they show thirty panels, none of which answers “is this getting worse, and for whom”. Design the top row for exactly four questions and put everything else below the fold. The four are the delivery success ratio over a one-hour window, the dead-letter queue depth, the age of the oldest un-dispatched event, and the current attempt rate. The third is the one most teams omit and the one that most often identifies the problem.

Oldest-event age is not derivable from queue depth, and the two failure signatures are opposite. A dispatcher that has stalled while intake continues shows rising depth and rising age — obvious on any dashboard. A dispatcher that has stalled while intake also stalled, which is what an upstream outage or a poisoned leader election looks like, shows flat depth and rising age. Teams staring at a flat depth graph routinely conclude the pipeline is idle rather than dead. Emit oldest-event age as a gauge computed from the minimum created_at of undelivered outbox rows, refresh it every scrape, and alert on it independently of depth.
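A sketch of the gauge computation against an in-memory SQLite table. The outbox schema (created_at and delivered_at as epoch seconds) is an assumption made for the example; the point is that depth and age come from separate queries and can move independently.

```python
import sqlite3
import time
from typing import Optional

def oldest_undelivered_age_s(conn: sqlite3.Connection, now: Optional[float] = None) -> float:
    """Age in seconds of the oldest undelivered outbox row, or 0.0 if none."""
    now = time.time() if now is None else now
    (oldest,) = conn.execute(
        "SELECT MIN(created_at) FROM outbox WHERE delivered_at IS NULL"
    ).fetchone()
    return 0.0 if oldest is None else max(0.0, now - oldest)

def undelivered_depth(conn: sqlite3.Connection) -> int:
    (depth,) = conn.execute(
        "SELECT COUNT(*) FROM outbox WHERE delivered_at IS NULL"
    ).fetchone()
    return depth

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outbox (event_id TEXT, created_at REAL, delivered_at REAL)")
conn.executemany("INSERT INTO outbox VALUES (?, ?, ?)", [
    ("evt_a", 1000.0, 1001.0),
    ("evt_b", 1010.0, None),
    ("evt_c", 1020.0, None),
])

# Intake and dispatch both stalled: depth stays flat while age keeps climbing.
for now in (1030.0, 1330.0):
    print(undelivered_depth(conn), oldest_undelivered_age_s(conn, now))
```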

Below the top row, one panel earns its place: a table of the ten endpoints with the highest failure count in the last fifteen minutes, with the failure count, the modal HTTP status, and the current circuit breaker state per row. This is what turns “delivery is degraded” into “delivery is degraded for these three endpoints, all returning 502, all owned by one customer’s shared load balancer” — a distinction that decides whether you page your own team or contact an integrator. Everything that cannot change what you do next should be deleted; a panel nobody has ever acted on is negative value, because it costs attention during the exact minutes attention is scarcest.
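The table's contents can be derived from the failure log lines of the last fifteen minutes. A minimal sketch, assuming failures arrive as (endpoint_id, http_status) pairs and that circuit breaker state is available as a lookup; the function and key names are hypothetical.

```python
from collections import Counter, defaultdict

def top_failing_endpoints(failures, breaker_state, limit=10):
    """Rank endpoints by failure count and report the modal status for each."""
    per_endpoint = defaultdict(Counter)
    for endpoint_id, http_status in failures:
        per_endpoint[endpoint_id][http_status] += 1
    ranked = sorted(per_endpoint.items(),
                    key=lambda item: sum(item[1].values()), reverse=True)
    return [
        {
            "endpoint_id": endpoint_id,
            "failures": sum(statuses.values()),
            "modal_status": statuses.most_common(1)[0][0],
            "breaker": breaker_state.get(endpoint_id, "unknown"),
        }
        for endpoint_id, statuses in ranked[:limit]
    ]

recent = ([("ep_1", 502)] * 5) + ([("ep_2", 502)] * 3) + [("ep_2", 503), ("ep_3", 500)]
for row in top_failing_endpoints(recent, {"ep_1": "open", "ep_2": "half_open"}):
    print(row)
```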

CI/CD and Operational Integration

Instrumentation must be tested like any other code path. Add a stage to your pipeline that boots the dispatcher against a local OpenTelemetry collector, fires a synthetic event, and asserts that a trace with the expected span names and attributes was exported. Treat missing or malformed spans as a build failure, because broken instrumentation is invisible until the incident when you need it. Version your metric names and labels alongside the schema; renaming a metric silently orphans every dashboard and alert that referenced it. Bake SLO definitions into code (recording rules and alert rules in source control) so that thresholds are reviewed, not adjusted by hand in a UI during a page.

Rolling out an instrumentation change without an outage

Telemetry code runs on the hot path, which makes an instrumentation deploy a production change rather than a documentation change. Sequence it accordingly. Ship the collector first: a collector that already understands the new attributes will accept old and new payloads, whereas an SDK emitting to a collector that rejects the schema drops spans and reports nothing about it. Then roll the SDK change to a single dispatcher replica and check three numbers before proceeding: spans exported per delivery (it should be exactly the count you designed; a doubled count usually means a span is being opened twice, for example by a duplicated context manager), the SDK’s dropped-span counter, and the dispatcher’s own p99 dispatch latency. Only then continue to the fleet.

The failure that actually hurts is exporter backpressure leaking into the request path. SimpleSpanProcessor exports synchronously and will add the collector’s round-trip time to every delivery; under a collector slowdown that turns a 40 ms dispatch into a 400 ms dispatch and collapses throughput fleet-wide. BatchSpanProcessor with a bounded queue is the only safe default — 2,048 queued spans and a 5-second schedule delay are sane starting values — and it fails by silently discarding spans when the queue fills. That silence is the point: shedding telemetry to protect delivery is correct behaviour. It is also why the dropped-span counter must be scraped and alerted on, or your first sign of a saturated pipeline will be traces that mysteriously stop containing consumer spans.
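The shedding behaviour is easy to see in miniature. The class below is an illustration of a bounded, non-blocking queue with a drop counter; it is not the SDK's BatchSpanProcessor implementation, and the class and method names are hypothetical.

```python
import threading
from collections import deque

class BoundedSpanQueue:
    """Never blocks the caller: when full, the span is dropped and counted."""

    def __init__(self, max_size: int = 2048):
        self._items = deque()
        self._max_size = max_size
        self._lock = threading.Lock()
        self.dropped = 0  # export this as a metric and alert on its rate

    def offer(self, span) -> bool:
        with self._lock:
            if len(self._items) >= self._max_size:
                self.dropped += 1
                return False
            self._items.append(span)
            return True

    def drain(self, max_batch: int = 512) -> list:
        """Called by the background exporter on its schedule."""
        with self._lock:
            count = min(max_batch, len(self._items))
            return [self._items.popleft() for _ in range(count)]

queue = BoundedSpanQueue(max_size=3)
accepted = [queue.offer(f"span-{i}") for i in range(5)]
print(accepted, "dropped:", queue.dropped)
```

The dispatch path only ever pays for a lock and an append, so a slow collector shows up in the dropped counter rather than in delivery latency.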

Metric renames need a deprecation window that matches your longest recording rule, not your release cadence. A 30-day SLO recording rule reads 30 days of history, so deleting the old series the moment the new one ships blanks the SLO panel and stops the burn-rate alerts from evaluating. An alert that evaluates to no data does not fire, which means the rename silently disables paging. Dual-emit both names for one full SLO window, cut the recording rules over to the new name, confirm over a full window that the two agree, and only then delete. Rollback is symmetrical: because the old series never stopped, reverting the deploy needs no dashboard changes at all.

Finally, run a synthetic canary. One event per minute, dispatched to an endpoint you own that always returns 200, gives you a continuous end-to-end signal that is independent of customer traffic. It is the only thing that distinguishes “no failures” from “no traffic” at three in the morning, and it costs 1,440 deliveries a day.

Failure Modes in Webhook Telemetry

Failure Mode	Impact	Mitigation
Trace context dropped at the boundary	Producer and consumer spans never join; deliveries appear as orphaned half-traces	Inject `traceparent` on dispatch and extract it on ingest; assert propagation in CI
High-cardinality metric labels	Metrics backend OOMs or drops series; dashboards go blank during incidents	Label only by endpoint and event type; move event-level detail to traces and logs
Sampling drops the failing trace	The one slow or errored delivery is not in the sampled set when you investigate	Use tail-based sampling that keeps all error and high-latency traces
Logs without trace correlation	Cannot pivot from a metric alert to the offending deliveries	Stamp every log line with `trace_id`, `span_id`, event ID, and endpoint ID
Latency measured from dispatch, not creation	End-to-end SLO looks healthy while events age in the outbox	Record latency from event creation time, not from first dispatch attempt
Synchronous span export on the hot path	Dispatch p99 tracks collector latency; throughput collapses under a collector slowdown	Use `BatchSpanProcessor` with a bounded queue and alert on the dropped-span counter
Metric renamed without a dual-emit window	30-day SLO panels blank and burn-rate alerts stop evaluating, so nothing pages	Emit old and new names together for one full SLO window before deleting the old
Clock skew between producer and consumer hosts	Negative or implausible end-to-end latencies corrupt every percentile	Compute all durations from producer-side clocks; treat consumer timestamps as advisory
Edge proxy acknowledging before the handler runs	Delivery looks 100% successful while events are never processed	Reconcile producer acknowledgements against consumer processed-event IDs daily
Untrusted inbound trace context adopted verbatim	External callers force-sample your backend or inject spans into internal traces	Re-sample locally and attach the remote trace ID as a span link, not as your trace ID

Annotated Example: A Delivery Metrics Recorder

The snippet below records two of those metric families, delivery outcomes and end-to-end latency, around a single dispatch attempt. It is deliberately transport-agnostic so it can wrap any HTTP client.

import time
from opentelemetry import metrics

meter = metrics.get_meter("webhook.delivery")

# Counter for outcomes; histogram for end-to-end latency in seconds.
deliveries = meter.create_counter("webhook.deliveries", unit="1")
latency = meter.create_histogram("webhook.delivery.latency", unit="s")

def record_delivery(send_fn, *, endpoint_id, event_type, created_at, attempt):
    """Wrap a send function and emit success/latency metrics.

    send_fn() must return an HTTP status code (int) or raise on transport error.
    `created_at` is the event creation time (epoch seconds), NOT first-dispatch
    time, so latency reflects true end-to-end age.
    """
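    labels = {"endpoint": endpoint_id, "event_type": event_type}
    # Defaults keep `finally` safe if send_fn() raises a BaseException that the
    # `except Exception` clause does not catch (e.g. KeyboardInterrupt).
    outcome, status = "transport_error", 0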
    try:
        status = send_fn()
        outcome = "acked" if 200 <= status < 300 else "rejected"
    except Exception:
        outcome = "transport_error"
        status = 0
    finally:
        # Latency is measured from event creation to terminal outcome.
        latency.record(time.time() - created_at, labels)
        deliveries.add(1, {**labels, "outcome": outcome, "attempt": str(attempt)})
    return status

Three details in that code are load-bearing. The finally block guarantees a latency observation and a counter increment even when the transport raises, which is what prevents the most common instrumentation bug: a pipeline whose success ratio looks perfect because failures never reach the recording call. The outcome value distinguishes rejected (the consumer answered with a non-2xx, so the fault is likely theirs) from transport_error (no answer at all, so the fault is likely network or DNS), and collapsing those two into a single failed bucket destroys the first useful branch of any investigation. And created_at is deliberately the event’s creation time rather than the attempt’s start time, so the histogram measures what the integrator experiences.

One line should be changed before you run this at scale: "attempt": str(attempt) multiplies the series count by your maximum attempt number for no analytical gain. Replace it with "attempt": "first" if attempt == 1 else "retry" and read the full distribution from spans. If you want an unbounded-cardinality view of attempts without paying for it in the metrics store, add a span attribute in the same function and let the tracing backend carry it.

Frequently Asked Questions

Should a public webhook endpoint trust the traceparent header a caller sends?

Not blindly. The header is unauthenticated input, so a caller can set the sampled flag on every request and force your backend to retain traffic you never wanted, or reuse an internal trace id to inject spans into an unrelated investigation.

Validate the format, discard tracestate from peers you have no contract with, and make your own sampling decision locally. Record the caller's trace id as a span attribute and a span link so correlation survives without handing an outsider control of your trace graph.

How do I keep tracing affordable at tens of millions of deliveries per month?

Never solve it with a uniform head-based sample, because failures are rare and a 1% sample throws away 99% of the only traces anyone will ever open. Use a tail-based policy in the collector that keeps every errored trace, every trace with more than one attempt, and every trace above your p99 latency threshold, plus a small probabilistic slice of healthy traffic as a baseline.

Budget the collector memory as span rate multiplied by decision wait multiplied by span size, and keep the decision wait just above one attempt timeout rather than trying to cover a whole retry schedule.

Which clock should end-to-end delivery latency be measured with?

The producer's, from event creation to terminal acknowledgement. Subtracting a producer timestamp from a consumer timestamp mixes two independently drifting clocks, and a few hundred milliseconds of skew between organizations produces negative durations that silently poison every percentile derived from them.

If you need the consumer-side processing time as well, have the consumer report its own duration as a value in the response body or a span attribute rather than as a timestamp to be differenced.

Why would customers report missing events while my dashboards show everything green?

The classic cause is something upstream of your handler issuing the acknowledgement — an edge proxy, a CDN rule, or a load balancer health path answering 2xx before the application is reached. The producer records a successful delivery, the dead-letter queue stays empty, and nothing is ever retried.

Catch it by reconciling the set of event ids the producer marked acknowledged against the set the consumer actually processed on a daily schedule; a consumer processing fewer events than the producer delivered is the signature.

Is the attempt number safe to use as a metric label?

Only if you collapse it. A raw attempt label multiplies every series by your maximum attempt count, which on a counter already labeled by endpoint and event type is often a several-hundred-thousand-series increase for almost no analytical value.

Use two values, first and retry, on the metric, and keep the exact attempt number as a span attribute where unique values are free.

What is the smallest instrumentation that is still worth having?

A delivery counter labeled by endpoint and outcome, a latency histogram measured from event creation, a gauge for the age of the oldest undelivered event, and one structured log line per state transition carrying the event id. That set supports an SLO, a working alert, and a per-event support answer.

Tracing is the next thing to add, not the first, because without a stable event id in the logs you cannot answer the question customers actually ask.

How long do webhook delivery logs need to be kept?

Split it by purpose rather than picking one number. Seven days of hot searchable logs covers live incident work, thirty days of cheaper warm storage matches a typical rolling SLO window and supports reconciliation, and a compressed archive keyed by event id answers contractual "did you send it" questions for as long as your agreements require, often around 400 days.

Sampling the log stream to save money is the wrong lever, because it removes the guarantee that any individual event can be accounted for.