Per-Endpoint Circuit Breaker State Machines Stored in Redis

A single global circuit breaker is the wrong granularity for a webhook dispatcher that fans out to thousands of independent destinations. When one customer’s endpoint goes dark, a global breaker trips and starves every healthy destination of deliveries. This guide, part of Circuit Breaker Patterns, shows how to keep breaker state per destination and shared across every dispatch worker by storing it in Redis, so a failing endpoint is isolated without affecting the rest of your delivery fleet. We pair this with exponential backoff algorithms for retry timing and with backpressure on webhook consumers to drain queues safely when many breakers open at once.

The hard part is concurrency. In-process breakers work when one process owns all traffic to an endpoint, but a horizontally scaled dispatcher has dozens of workers hitting the same destination. Each worker would maintain its own counters, so the breaker only “sees” a fraction of the failures and trips late, or trips inconsistently. Centralizing state in Redis gives every worker one authoritative view per endpoint.

Every worker reads and writes one shared breaker key per destination, so failures to Endpoint A open only A's breaker while B and C keep delivering.

Prerequisites

Python 3.10+ with redis-py (sync or redis.asyncio) and httpx.
A Redis instance reachable by every dispatch worker (a single primary is fine; use Redis Cluster only if you hash-tag keys per endpoint).
A stable destination identifier for each endpoint. Use a hash of the normalized URL plus tenant ID, never the raw URL, so query strings and trailing slashes don’t fragment state.
Familiarity with the Closed/Open/Half-Open state machine described in the Circuit Breaker Patterns overview.

Step 1: Model the per-endpoint state key

Give each destination its own small Redis hash. The key embeds the endpoint ID so breakers never collide:

import hashlib

def endpoint_id(tenant_id: str, url: str) -> str:
    # Normalize so equivalent URLs map to one breaker.
    norm = url.split("?", 1)[0].rstrip("/").lower()
    digest = hashlib.sha256(f"{tenant_id}|{norm}".encode()).hexdigest()[:16]
    return digest

def breaker_key(ep_id: str) -> str:
    return f"cb:ep:{ep_id}"

Store the breaker as a hash with state, fail_count, opened_at, and window_start fields. Hashes let the Lua script mutate several fields in one round trip, and a TTL on the key garbage-collects breakers for endpoints that go quiet. Every field earns its place — drop window_start and the failure count never rolls, drop opened_at and Half-Open can never be reached.

Four fields and a TTL are the entire breaker: state answers "may I send", the other three answer "should that state change now".

Step 2: Make state transitions atomic with Lua

The correctness of a shared breaker hinges on atomicity. If two workers both read fail_count = 4, both increment, and both write 5, you lose a failure and trip late. Redis runs one script at a time against the keyspace, so moving the whole read-decide-write cycle inside Lua turns two concurrent reports into two strictly ordered ones, and the second worker sees the threshold breach the first one created.

Two workers reporting the same instant of failure still advance the counter twice, because the decision happens inside Redis rather than in either worker.

Push the entire read-decide-write cycle into one script and register it once per worker process:

import redis

TRANSITION_LUA = """
local key = KEYS[1]
local now = tonumber(ARGV[1])
local outcome = ARGV[2]          -- "success" | "failure" | "check"
local threshold = tonumber(ARGV[3])
local window = tonumber(ARGV[4])
local reset_timeout = tonumber(ARGV[5])

local state = redis.call('HGET', key, 'state') or 'CLOSED'
local fails = tonumber(redis.call('HGET', key, 'fail_count') or '0')
local opened_at = tonumber(redis.call('HGET', key, 'opened_at') or '0')
local win_start = tonumber(redis.call('HGET', key, 'window_start') or now)

-- Roll the failure window.
if now - win_start > window then
  fails = 0
  win_start = now
end

if outcome == 'check' then
  if state == 'OPEN' and (now - opened_at) >= reset_timeout then
    state = 'HALF_OPEN'
    redis.call('HSET', key, 'state', state)
  end
elseif outcome == 'success' then
  if state == 'HALF_OPEN' or state == 'CLOSED' then
    state = 'CLOSED'; fails = 0; win_start = now
  end
elseif outcome == 'failure' then
  fails = fails + 1
  if state == 'HALF_OPEN' or fails >= threshold then
    state = 'OPEN'; opened_at = now
  end
end

redis.call('HSET', key, 'state', state, 'fail_count', fails,
           'opened_at', opened_at, 'window_start', win_start)
redis.call('EXPIRE', key, window + reset_timeout + 60)
return state
"""

class RedisBreaker:
    def __init__(self, client: redis.Redis, *, threshold=5, window=60, reset_timeout=30):
        self.r = client
        self.threshold = threshold
        self.window = window
        self.reset_timeout = reset_timeout
        self._script = client.register_script(TRANSITION_LUA)

    def _run(self, ep_id: str, outcome: str) -> str:
        import time
        result = self._script(
            keys=[breaker_key(ep_id)],
            args=[int(time.time()), outcome, self.threshold,
                  self.window, self.reset_timeout],
        )
        return result.decode() if isinstance(result, bytes) else result

The script is the single source of truth for transitions. Workers never compute state in Python; they only report outcomes. The threshold, window and reset_timeout defaults above are starting points rather than answers — derive them from your measured per-endpoint traffic using the method in tuning circuit breaker thresholds for webhooks, because a threshold of five is meaningless for an endpoint that receives four deliveries an hour.

Step 3: Gate every dispatch through the shared breaker

Wrap the actual HTTP POST so each attempt first asks the breaker whether to proceed, then reports the result:

import httpx

class Dispatcher:
    def __init__(self, breaker: RedisBreaker, http: httpx.Client):
        self.breaker = breaker
        self.http = http

    def deliver(self, tenant_id: str, url: str, payload: dict, idem_key: str):
        ep = endpoint_id(tenant_id, url)
        state = self.breaker._run(ep, "check")
        if state == "OPEN":
            return {"status": "skipped", "reason": "breaker_open", "endpoint": ep}

        try:
            resp = self.http.post(
                url, json=payload,
                headers={"X-Idempotency-Key": idem_key},
                timeout=5.0,
            )
            resp.raise_for_status()
            self.breaker._run(ep, "success")
            return {"status": "delivered"}
        except (httpx.HTTPStatusError, httpx.RequestError):
            self.breaker._run(ep, "failure")
            return {"status": "failed", "reason": "downstream_error"}

Because the state lives in Redis, a worker that just booted immediately inherits the breaker’s current state. Scale-out, restarts, and deploys never reset the breaker.

Step 4: Limit Half-Open probes across the whole fleet

When an Open breaker ages into Half-Open, you do not want every worker probing the recovering endpoint simultaneously — that recreates the thundering herd the breaker exists to prevent. Hand out a single probe token with an atomic SET NX:

def acquire_probe(client: redis.Redis, ep_id: str, ttl: int = 5) -> bool:
    # Only one worker wins the token; others fail fast until it expires.
    return bool(client.set(f"cb:probe:{ep_id}", "1", nx=True, ex=ttl))

Call acquire_probe before sending when the state is HALF_OPEN; workers that lose the race treat the endpoint as still open. Stagger the probe itself with exponential backoff and jitter so probe timing differs from normal traffic.

Verification and testing

Drive the breaker with a fake endpoint and assert the shared state flips exactly once. Two simulated workers should converge:

import fakeredis

def test_shared_breaker_trips_once():
    client = fakeredis.FakeStrictRedis()
    b = RedisBreaker(client, threshold=3, window=60, reset_timeout=30)
    ep = endpoint_id("tenant-1", "https://hooks.example.com/in")

    # Worker A and Worker B both report failures against the same key.
    for _ in range(2):
        assert b._run(ep, "failure") == "CLOSED"
    assert b._run(ep, "failure") == "OPEN"   # third failure trips it
    assert b._run(ep, "check") == "OPEN"     # other workers see OPEN immediately

For a live check, watch the key while you take a test endpoint offline:

# Should show state flip to OPEN after the failure threshold.
redis-cli HGETALL cb:ep:<endpoint_id>

# Confirm only one probe token exists during Half-Open.
redis-cli --scan --pattern 'cb:probe:*'

A passing system shows exactly one cb:ep:* key per failing destination and at most one cb:probe:* token per endpoint at any moment.

Failure modes and gotchas

Redis becomes a hard dependency. If Redis is unreachable, decide deliberately: fail open (allow dispatch) for best-effort delivery, or fail closed for strict protection. Wrap _run in a timeout and a fallback so a Redis blip doesn’t stall every worker. Never let a Redis error silently behave like “Closed.”
Key fragmentation from unstable IDs. Building the endpoint ID from the raw URL means ?retry=1 or a trailing slash spawns a second breaker that never trips. Always normalize (strip query, lowercase, strip trailing slash) before hashing, as in Step 1.
Clock skew across workers. The Lua script uses now passed in by the caller; if workers’ clocks disagree, reset_timeout math drifts. Pass redis.call('TIME') from inside the script instead of a client timestamp to use Redis as the single clock.
Probe storms on recovery. Skipping the SET NX token lets every Half-Open worker probe at once and re-trip a barely-recovered endpoint. The token plus jittered timing is mandatory, not optional.
Unbounded key growth. Without the EXPIRE in the script, breakers for one-off or decommissioned endpoints accumulate forever. The TTL of window + reset_timeout + buffer keeps the keyspace proportional to active destinations.

Frequently Asked Questions

What happens to deliveries that are already in flight when the breaker opens?

Nothing cancels them. The check runs before the POST, so a request already on the wire finishes and reports its outcome normally, which means a handful of extra failures land against an already-open breaker. That is an acceptable cost: each one is bounded by the HTTP timeout, and the reported failures simply refresh opened_at. Cancelling mid-flight at the client level is possible but rarely worth the complexity.

Should two tenants that post to the same host share one breaker?

No, and the endpoint_id in Step 1 deliberately prevents it by hashing the tenant ID alongside the normalized URL. Two tenants on one host have different credentials, quotas and payload shapes, so one tenant's 500s are weak evidence about the other's health. If the shared host itself is the bottleneck, add a second coarser breaker keyed on hostname and require both to allow before dispatching.

How much latency does the Redis round trip add per delivery?

Two round trips per attempt, one for the check and one for the outcome, which is well under a millisecond against a co-located instance and invisible next to a five second HTTP timeout. Pipeline the check with any other per-attempt lookups the worker already makes. If it ever does show up in a profile, cache only the OPEN verdict in process for a second or two, never a CLOSED one, because a stale CLOSED is exactly the failure the shared breaker exists to prevent.

What does the breaker do during a Redis failover?

Script calls fail for the few seconds the failover takes, and any state written since the last replication to the new primary is gone. Give _run a short timeout and a deliberate fallback so workers do not block, then accept that breakers rebuild themselves from live traffic within one window. The realistic worst case is that an open breaker briefly allows traffic through again before the counters refill.

What should happen to an event whose dispatch is skipped because the breaker is open?

A skip is a routing decision, not a terminal outcome, so the event must go back on the schedule rather than being discarded or marked delivered. Requeue it with a delay at least as long as the remaining reset_timeout, and decide explicitly whether the skip consumes an attempt from the retry budget. Skips that silently burn budget turn a thirty second outage into a wave of dead letters.

Should the probe token be deleted after a successful probe?

Let it expire instead. Deleting it opens a race where a second worker acquires the token and probes before the transition script has recorded the close, which is the herd the token was meant to stop. If your probes routinely take longer than the token TTL, raise the TTL above the HTTP timeout rather than releasing the key early.