Per-Endpoint Circuit Breaker State Machines Stored in Redis

A single global circuit breaker is the wrong granularity for a webhook dispatcher that fans out to thousands of independent destinations. When one customer’s endpoint goes dark, a global breaker trips and starves every healthy destination of deliveries. This guide, part of Circuit Breaker Patterns, shows how to keep breaker state per destination and shared across every dispatch worker by storing it in Redis, so a failing endpoint is isolated without affecting the rest of your delivery fleet. We pair this with exponential backoff algorithms for retry timing and with backpressure on webhook consumers to drain queues safely when many breakers open at once.

The hard part is concurrency. In-process breakers work when one process owns all traffic to an endpoint, but a horizontally scaled dispatcher has dozens of workers hitting the same destination. Each worker would maintain its own counters, so the breaker only “sees” a fraction of the failures and trips late, or trips inconsistently. Centralizing state in Redis gives every worker one authoritative view per endpoint.
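To put a rough number on "trips late", the sketch below counts how many fleet-wide failures occur before any breaker opens. It assumes failures are spread round-robin across workers and ignores the failure window; both are simplifications for illustration, and `failures_until_trip` is a hypothetical helper, not part of the dispatcher.

```python
from itertools import cycle


def failures_until_trip(workers: int, threshold: int, shared: bool) -> int:
    """Count fleet-wide failures before any breaker reaches its threshold.

    Assumption: failures land on workers in strict round-robin order.
    """
    # One counter for the whole fleet when shared, else one per worker.
    counters = [0] * (1 if shared else workers)
    total = 0
    for worker in cycle(range(workers)):
        slot = 0 if shared else worker
        counters[slot] += 1
        total += 1
        if counters[slot] >= threshold:
            return total


print(failures_until_trip(workers=12, threshold=5, shared=False))  # 49
print(failures_until_trip(workers=12, threshold=5, shared=True))   # 5
```

With 12 workers and a threshold of 5, per-process counters let 49 failures through before the first worker trips, while a shared counter trips on the fifth.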

[Figure: Shared per-endpoint breaker state. Worker 1, Worker 2, and Worker 3 share Redis keys cb:ep:A = OPEN, cb:ep:B = CLOSED, and cb:ep:C = HALF (half-open), one per endpoint A, B, and C; calls to Endpoint A fail fast.]
Every worker reads and writes one shared breaker key per destination, so failures to Endpoint A open only A's breaker while B and C keep delivering.

Prerequisites

Step 1: Model the per-endpoint state key

Give each destination its own small Redis hash. The key embeds the endpoint ID so breakers never collide:

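import hashlib
from urllib.parse import urlsplit

def endpoint_id(tenant_id: str, url: str) -> str:
    # Normalize so equivalent URLs map to one breaker. urlsplit already
    # lowercases the scheme and hostname; the path keeps its case because
    # paths can be case-sensitive. Query, fragment, and trailing slash go.
    parts = urlsplit(url)
    port = f":{parts.port}" if parts.port else ""
    norm = f"{parts.scheme}://{parts.hostname or ''}{port}{parts.path.rstrip('/')}"
    digest = hashlib.sha256(f"{tenant_id}|{norm}".encode()).hexdigest()[:16]
    return digest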

def breaker_key(ep_id: str) -> str:
    return f"cb:ep:{ep_id}"

Store the breaker as a hash with state, fail_count, opened_at, and window_start fields. Hashes let the Lua script mutate several fields in one round trip, and a TTL on the key garbage-collects breakers for endpoints that go quiet.
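As a rough illustration of that layout, the sketch below decodes an HGETALL reply into a typed snapshot. `BreakerSnapshot` and `parse_breaker` are hypothetical names introduced here, and the sample values are made up; the field names are the four listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerSnapshot:
    state: str
    fail_count: int
    opened_at: int
    window_start: int


def parse_breaker(raw: dict) -> BreakerSnapshot:
    """Decode an HGETALL reply; a missing key reads as a fresh CLOSED breaker."""

    def field(name: str, default: str) -> str:
        # redis-py returns bytes keys and values unless decode_responses=True.
        value = raw.get(name.encode(), raw.get(name, default))
        return value.decode() if isinstance(value, bytes) else str(value)

    return BreakerSnapshot(
        state=field("state", "CLOSED"),
        fail_count=int(field("fail_count", "0")),
        opened_at=int(field("opened_at", "0")),
        window_start=int(field("window_start", "0")),
    )


# Illustrative reply for a tripped breaker (values are invented).
raw = {
    b"state": b"OPEN",
    b"fail_count": b"5",
    b"opened_at": b"1700000000",
    b"window_start": b"1699999970",
}
print(parse_breaker(raw))
```

A helper like this is only for dashboards and debugging; it reads state but should never be used to decide transitions.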

Step 2: Make state transitions atomic with Lua

The correctness of a shared breaker hinges on atomicity. If two workers both read fail_count = 4, both increment, and both write 5, you lose a failure and trip late. Push the entire read-decide-write cycle into a single Lua script so Redis executes it atomically:

import redis

TRANSITION_LUA = """
local key = KEYS[1]
local now = tonumber(ARGV[1])
local outcome = ARGV[2]          -- "success" | "failure" | "check"
local threshold = tonumber(ARGV[3])
local window = tonumber(ARGV[4])
local reset_timeout = tonumber(ARGV[5])

local state = redis.call('HGET', key, 'state') or 'CLOSED'
local fails = tonumber(redis.call('HGET', key, 'fail_count') or '0')
local opened_at = tonumber(redis.call('HGET', key, 'opened_at') or '0')
local win_start = tonumber(redis.call('HGET', key, 'window_start') or now)

-- Roll the failure window.
if now - win_start > window then
  fails = 0
  win_start = now
end

if outcome == 'check' then
  if state == 'OPEN' and (now - opened_at) >= reset_timeout then
    state = 'HALF_OPEN'
    redis.call('HSET', key, 'state', state)
  end
elseif outcome == 'success' then
  if state == 'HALF_OPEN' or state == 'CLOSED' then
    state = 'CLOSED'; fails = 0; win_start = now
  end
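elseif outcome == 'failure' then
  fails = fails + 1
  -- Trip from CLOSED at the threshold, or from HALF_OPEN on a failed probe.
  -- Failures from requests already in flight when the breaker opened must
  -- not push opened_at forward, or the reset timeout keeps restarting.
  if state ~= 'OPEN' and (state == 'HALF_OPEN' or fails >= threshold) then
    state = 'OPEN'; opened_at = now
  end
end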

redis.call('HSET', key, 'state', state, 'fail_count', fails,
           'opened_at', opened_at, 'window_start', win_start)
redis.call('EXPIRE', key, window + reset_timeout + 60)
return state
"""

class RedisBreaker:
    def __init__(self, client: redis.Redis, *, threshold=5, window=60, reset_timeout=30):
        self.r = client
        self.threshold = threshold
        self.window = window
        self.reset_timeout = reset_timeout
        self._script = client.register_script(TRANSITION_LUA)

    def _run(self, ep_id: str, outcome: str) -> str:
        import time
        result = self._script(
            keys=[breaker_key(ep_id)],
            args=[int(time.time()), outcome, self.threshold,
                  self.window, self.reset_timeout],
        )
        return result.decode() if isinstance(result, bytes) else result

The script is the single source of truth for transitions. Workers never compute state in Python; they only report outcomes.
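The lost update that the script prevents can be reproduced without Redis. The sketch below is a deterministic stand-in, not the real interleaving: `racy_report` and `atomic_report` are hypothetical helpers that model a plain dict as the shared store.

```python
def racy_report(store: dict, reports: int) -> int:
    """Every worker reads first, then every worker writes: a read-modify-write race."""
    reads = [store["fail_count"] for _ in range(reports)]  # all workers read the same value
    for seen in reads:
        store["fail_count"] = seen + 1  # each write overwrites the previous one
    return store["fail_count"]


def atomic_report(store: dict, reports: int) -> int:
    """Each read-modify-write finishes before the next starts, as inside one Lua script."""
    for _ in range(reports):
        store["fail_count"] = store["fail_count"] + 1
    return store["fail_count"]


print(racy_report({"fail_count": 4}, reports=2))    # 5: one failure was lost
print(atomic_report({"fail_count": 4}, reports=2))  # 6
```

With a threshold of 5 the racy version happens to trip anyway; with a threshold of 6 it would stay closed after six real failures.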

Step 3: Gate every dispatch through the shared breaker

Wrap the actual HTTP POST so each attempt first asks the breaker whether to proceed, then reports the result:

import httpx

class Dispatcher:
    def __init__(self, breaker: RedisBreaker, http: httpx.Client):
        self.breaker = breaker
        self.http = http

    def deliver(self, tenant_id: str, url: str, payload: dict, idem_key: str):
        ep = endpoint_id(tenant_id, url)
        state = self.breaker._run(ep, "check")
        if state == "OPEN":
            return {"status": "skipped", "reason": "breaker_open", "endpoint": ep}

        try:
            resp = self.http.post(
                url, json=payload,
                headers={"X-Idempotency-Key": idem_key},
                timeout=5.0,
            )
            resp.raise_for_status()
            self.breaker._run(ep, "success")
            return {"status": "delivered"}
        except (httpx.HTTPStatusError, httpx.RequestError):
            self.breaker._run(ep, "failure")
            return {"status": "failed", "reason": "downstream_error"}

Because the state lives in Redis, a worker that just booted immediately inherits the breaker’s current state. Scale-out, restarts, and deploys never reset the breaker.

Step 4: Limit Half-Open probes across the whole fleet

When an Open breaker ages into Half-Open, you do not want every worker probing the recovering endpoint simultaneously, because that recreates the thundering herd the breaker exists to prevent. The gate in Step 3 only skips on OPEN, so once the state reads HALF_OPEN every worker would send. Hand out a single probe token with an atomic SET NX:

def acquire_probe(client: redis.Redis, ep_id: str, ttl: int = 5) -> bool:
    # Only one worker wins the token; others fail fast until it expires.
    return bool(client.set(f"cb:probe:{ep_id}", "1", nx=True, ex=ttl))

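Call acquire_probe before sending when the state is HALF_OPEN; workers that lose the race treat the endpoint as still open. Keep the token TTL at least as long as the request timeout (both are 5 seconds here) so a second probe cannot start while the first is still in flight. Stagger the probe itself with exponential backoff and jitter so that probes do not fire on the same schedule as normal traffic.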
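One way to express that gate is a small decision function that takes the breaker state and a callable for the token attempt. This is a sketch: `should_send` and `OneShotToken` are hypothetical names, and `OneShotToken` only imitates the first-caller-wins behavior of SET NX so the example runs without Redis.

```python
from typing import Callable


def should_send(state: str, try_acquire_probe: Callable[[], bool]) -> bool:
    """Decide whether this worker may send, given the shared breaker state."""
    if state == "OPEN":
        return False                # fail fast
    if state == "HALF_OPEN":
        return try_acquire_probe()  # only the token holder probes
    return True                     # CLOSED: normal traffic


class OneShotToken:
    """In-memory stand-in for SET NX: the first caller wins, later callers lose."""

    def __init__(self) -> None:
        self.taken = False

    def __call__(self) -> bool:
        if self.taken:
            return False
        self.taken = True
        return True


token = OneShotToken()
print([should_send("HALF_OPEN", token) for _ in range(3)])  # [True, False, False]
```

In the dispatcher this would presumably be called as should_send(state, lambda: acquire_probe(self.breaker.r, ep)), with a False result handled the same way as the existing breaker_open skip.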

Verification

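Drive the breaker with a fake endpoint and assert the shared state flips exactly once. Two simulated workers should converge on the same state. Note that fakeredis only executes Lua scripts when its optional Lua support is installed (the fakeredis[lua] extra), which register_script relies on here: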

import fakeredis

def test_shared_breaker_trips_once():
    client = fakeredis.FakeStrictRedis()
    # Two breaker instances on one client stand in for two workers.
    worker_a = RedisBreaker(client, threshold=3, window=60, reset_timeout=30)
    worker_b = RedisBreaker(client, threshold=3, window=60, reset_timeout=30)
    ep = endpoint_id("tenant-1", "https://hooks.example.com/in")

    # Worker A and Worker B both report failures against the same key.
    assert worker_a._run(ep, "failure") == "CLOSED"
    assert worker_b._run(ep, "failure") == "CLOSED"
    assert worker_a._run(ep, "failure") == "OPEN"   # third failure trips it
    assert worker_b._run(ep, "check") == "OPEN"     # the other worker sees OPEN immediately

For a live check, watch the key while you take a test endpoint offline:

# Should show state flip to OPEN after the failure threshold.
redis-cli HGETALL cb:ep:<endpoint_id>

# Confirm only one probe token exists during Half-Open.
redis-cli --scan --pattern 'cb:probe:*'

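A passing system shows exactly one cb:ep:* key per destination the dispatcher has recently touched (healthy endpoints included, because every check rewrites the hash and refreshes its TTL) and at most one cb:probe:* token per endpoint at any moment.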

Failure modes and gotchas