Per-Endpoint Circuit Breaker State Machines Stored in Redis
A single global circuit breaker is the wrong granularity for a webhook dispatcher that fans out to thousands of independent destinations. When one customer’s endpoint goes dark, a global breaker trips and starves every healthy destination of deliveries. This guide, part of Circuit Breaker Patterns, shows how to keep breaker state per destination and shared across every dispatch worker by storing it in Redis, so a failing endpoint is isolated without affecting the rest of your delivery fleet. We pair this with exponential backoff algorithms for retry timing and with backpressure on webhook consumers to drain queues safely when many breakers open at once.
The hard part is concurrency. In-process breakers work when one process owns all traffic to an endpoint, but a horizontally scaled dispatcher has dozens of workers hitting the same destination. Each worker would maintain its own counters, so the breaker only “sees” a fraction of the failures and trips late, or trips inconsistently. Centralizing state in Redis gives every worker one authoritative view per endpoint.
Prerequisites
- Python 3.10+ with
redis-py(sync orredis.asyncio) andhttpx. - A Redis instance reachable by every dispatch worker (a single primary is fine; use Redis Cluster only if you hash-tag keys per endpoint).
- A stable destination identifier for each endpoint. Use a hash of the normalized URL plus tenant ID, never the raw URL, so query strings and trailing slashes don’t fragment state.
- Familiarity with the Closed/Open/Half-Open state machine described in the Circuit Breaker Patterns overview.
Step 1: Model the per-endpoint state key
Give each destination its own small Redis hash. The key embeds the endpoint ID so breakers never collide:
import hashlib
def endpoint_id(tenant_id: str, url: str) -> str:
# Normalize so equivalent URLs map to one breaker.
norm = url.split("?", 1)[0].rstrip("/").lower()
digest = hashlib.sha256(f"{tenant_id}|{norm}".encode()).hexdigest()[:16]
return digest
def breaker_key(ep_id: str) -> str:
return f"cb:ep:{ep_id}"
Store the breaker as a hash with state, fail_count, opened_at, and window_start fields. Hashes let the Lua script mutate several fields in one round trip, and a TTL on the key garbage-collects breakers for endpoints that go quiet.
Step 2: Make state transitions atomic with Lua
The correctness of a shared breaker hinges on atomicity. If two workers both read fail_count = 4, both increment, and both write 5, you lose a failure and trip late. Push the entire read-decide-write cycle into a single Lua script so Redis executes it atomically:
import redis
TRANSITION_LUA = """
local key = KEYS[1]
local now = tonumber(ARGV[1])
local outcome = ARGV[2] -- "success" | "failure" | "check"
local threshold = tonumber(ARGV[3])
local window = tonumber(ARGV[4])
local reset_timeout = tonumber(ARGV[5])
local state = redis.call('HGET', key, 'state') or 'CLOSED'
local fails = tonumber(redis.call('HGET', key, 'fail_count') or '0')
local opened_at = tonumber(redis.call('HGET', key, 'opened_at') or '0')
local win_start = tonumber(redis.call('HGET', key, 'window_start') or now)
-- Roll the failure window.
if now - win_start > window then
fails = 0
win_start = now
end
if outcome == 'check' then
if state == 'OPEN' and (now - opened_at) >= reset_timeout then
state = 'HALF_OPEN'
redis.call('HSET', key, 'state', state)
end
elseif outcome == 'success' then
if state == 'HALF_OPEN' or state == 'CLOSED' then
state = 'CLOSED'; fails = 0; win_start = now
end
elseif outcome == 'failure' then
fails = fails + 1
if state == 'HALF_OPEN' or fails >= threshold then
state = 'OPEN'; opened_at = now
end
end
redis.call('HSET', key, 'state', state, 'fail_count', fails,
'opened_at', opened_at, 'window_start', win_start)
redis.call('EXPIRE', key, window + reset_timeout + 60)
return state
"""
class RedisBreaker:
def __init__(self, client: redis.Redis, *, threshold=5, window=60, reset_timeout=30):
self.r = client
self.threshold = threshold
self.window = window
self.reset_timeout = reset_timeout
self._script = client.register_script(TRANSITION_LUA)
def _run(self, ep_id: str, outcome: str) -> str:
import time
result = self._script(
keys=[breaker_key(ep_id)],
args=[int(time.time()), outcome, self.threshold,
self.window, self.reset_timeout],
)
return result.decode() if isinstance(result, bytes) else result
The script is the single source of truth for transitions. Workers never compute state in Python; they only report outcomes.
Step 3: Gate every dispatch through the shared breaker
Wrap the actual HTTP POST so each attempt first asks the breaker whether to proceed, then reports the result:
import httpx
class Dispatcher:
def __init__(self, breaker: RedisBreaker, http: httpx.Client):
self.breaker = breaker
self.http = http
def deliver(self, tenant_id: str, url: str, payload: dict, idem_key: str):
ep = endpoint_id(tenant_id, url)
state = self.breaker._run(ep, "check")
if state == "OPEN":
return {"status": "skipped", "reason": "breaker_open", "endpoint": ep}
try:
resp = self.http.post(
url, json=payload,
headers={"X-Idempotency-Key": idem_key},
timeout=5.0,
)
resp.raise_for_status()
self.breaker._run(ep, "success")
return {"status": "delivered"}
except (httpx.HTTPStatusError, httpx.RequestError):
self.breaker._run(ep, "failure")
return {"status": "failed", "reason": "downstream_error"}
Because the state lives in Redis, a worker that just booted immediately inherits the breaker’s current state. Scale-out, restarts, and deploys never reset the breaker.
Step 4: Limit Half-Open probes across the whole fleet
When an Open breaker ages into Half-Open, you do not want every worker probing the recovering endpoint simultaneously — that recreates the thundering herd the breaker exists to prevent. Hand out a single probe token with an atomic SET NX:
def acquire_probe(client: redis.Redis, ep_id: str, ttl: int = 5) -> bool:
# Only one worker wins the token; others fail fast until it expires.
return bool(client.set(f"cb:probe:{ep_id}", "1", nx=True, ex=ttl))
Call acquire_probe before sending when the state is HALF_OPEN; workers that lose the race treat the endpoint as still open. Stagger the probe itself with exponential backoff and jitter so probe timing differs from normal traffic.
Verification
Drive the breaker with a fake endpoint and assert the shared state flips exactly once. Two simulated workers should converge:
import fakeredis
def test_shared_breaker_trips_once():
client = fakeredis.FakeStrictRedis()
b = RedisBreaker(client, threshold=3, window=60, reset_timeout=30)
ep = endpoint_id("tenant-1", "https://hooks.example.com/in")
# Worker A and Worker B both report failures against the same key.
for _ in range(2):
assert b._run(ep, "failure") == "CLOSED"
assert b._run(ep, "failure") == "OPEN" # third failure trips it
assert b._run(ep, "check") == "OPEN" # other workers see OPEN immediately
For a live check, watch the key while you take a test endpoint offline:
# Should show state flip to OPEN after the failure threshold.
redis-cli HGETALL cb:ep:<endpoint_id>
# Confirm only one probe token exists during Half-Open.
redis-cli --scan --pattern 'cb:probe:*'
A passing system shows exactly one cb:ep:* key per failing destination and at most one cb:probe:* token per endpoint at any moment.
Failure modes and gotchas
- Redis becomes a hard dependency. If Redis is unreachable, decide deliberately: fail open (allow dispatch) for best-effort delivery, or fail closed for strict protection. Wrap
_runin a timeout and a fallback so a Redis blip doesn’t stall every worker. Never let a Redis error silently behave like “Closed.” - Key fragmentation from unstable IDs. Building the endpoint ID from the raw URL means
?retry=1or a trailing slash spawns a second breaker that never trips. Always normalize (strip query, lowercase, strip trailing slash) before hashing, as in Step 1. - Clock skew across workers. The Lua script uses
nowpassed in by the caller; if workers’ clocks disagree,reset_timeoutmath drifts. Passredis.call('TIME')from inside the script instead of a client timestamp to use Redis as the single clock. - Probe storms on recovery. Skipping the
SET NXtoken lets every Half-Open worker probe at once and re-trip a barely-recovered endpoint. The token plus jittered timing is mandatory, not optional. - Unbounded key growth. Without the
EXPIREin the script, breakers for one-off or decommissioned endpoints accumulate forever. The TTL ofwindow + reset_timeout + bufferkeeps the keyspace proportional to active destinations.