
Building a Go Reliability Lab: Retry, Circuit Breakers, and Failure by Design

March 13, 2026


Most backend teams know they need retry logic and circuit breakers.

Few have a controlled environment to test what happens when those patterns interact under real failure conditions. So I built one.

go-reliability-lab is a laboratory for reproducing production reliability problems — not a product. An order-processing pipeline with a payment service that fails 30% of the time, async workers, a real database, and the patterns that separate graceful degradation from cascading failure.

12 commits. 2 days. 43 tests across 7 packages. The build log below covers what I built, the gotchas I hit, and why each decision matters.

For the language-agnostic version of the patterns themselves, see 12 Commits to a Reliable Backend.


System Architecture

Every component is real except the payment service — the one deliberate simulation, intentionally unreliable so every failure path gets exercised.


The Build: 12 Commits in 5 Phases

Phase 1: Foundation (Commits 1–3)

Commit 1 was documentation only — README, ARCHITECTURE.md, DEVELOPMENT_ROADMAP.md. Architecture decided before any code. Commits 2–3 added the HTTP server with chi, graceful shutdown, and structured logging with zap.

The key design choice: workers run on a separate context from the HTTP server. When the server stops accepting requests, workers finish their in-flight jobs instead of dropping them.

Phase 2: Domain & Persistence (Commits 4–5)

Order model, status lifecycle (pending → processing → completed/failed), Postgres repository with pgx v5. Falls back to in-memory storage if no database is configured — same interface, zero code changes to swap.

The architectural pattern that matters here:

The domain defines what it needs. Infrastructure provides it. No import cycles, no leaky abstractions.

Phase 3: Async Processing (Commits 6–8)

Worker pool — 5 goroutines consuming from a 100-job buffered channel. Order creation persists to Postgres and enqueues a background job. Workers drive the order through processing → payment → status update.

Commit 8 adds the payment simulator: configurable failure probability, 100–500ms random latency, context-aware cancellation. This is the core of the lab — an intentionally unreliable dependency that exercises every failure path.

Phase 4: Reliability Patterns (Commits 9–10)

Two commits. Two patterns. Most of the learning.


The Gotchas Worth Knowing

These are the implementation details that documentation doesn't foreground — the kind of thing that causes subtle bugs in production.

Retry: The Off-by-One Nobody Notices

Using cenkalti/backoff v4, the retry configuration looks straightforward:

// WithMaxRetries takes retries (not attempts) — subtract 1.
// attempts = 4 → WithMaxRetries(b, 3) → 1 initial call + 3 retries.
var policy backoff.BackOff = backoff.WithMaxRetries(b, uint64(attempts-1))

WithMaxRetries(b, 3) gives you 4 attempts (1 initial + 3 retries). The parameter says "retries" but engineers think in "attempts." Off-by-one here is silent — your system just retries one extra time, burning backoff budget and increasing tail latency with no visible error.

Circuit Breaker: Two Error Types, Not One

Using sony/gobreaker, most teams only check for one rejection:

func IsOpen(err error) bool {
    return errors.Is(err, gobreaker.ErrOpenState) ||
        errors.Is(err, gobreaker.ErrTooManyRequests)
}

ErrOpenState — breaker is open, all calls blocked. Expected.

ErrTooManyRequests — breaker is half-open, probing with a single request. All other concurrent requests get this error. If you only check ErrOpenState, half-open rejections surface as unhandled application errors — 500s to users during the exact window when the system is trying to recover.

The Composition: Where It All Connects

The retry wraps the circuit breaker. The circuit breaker wraps the payment call. When the circuit opens mid-retry, Permanent() stops the retry loop immediately.

chargeErr := reliability.Do(ctx, retryCfg,
    func(err error, wait time.Duration) {
        logger.Warn("retrying payment",
            zap.Error(err), zap.Duration("wait", wait))
        observability.RetryAttemptsTotal.
            WithLabelValues("retried").Inc()
    },
    func() error {
        cbErr := paymentCB.Execute(func() error {
            return paymentSim.Charge(ctx, job.OrderID)
        })
        if cbErr != nil && reliability.IsOpen(cbErr) {
            observability.RetryAttemptsTotal.
                WithLabelValues("permanent_failure").Inc()
            return reliability.Permanent(cbErr)
        }
        return cbErr
    })

Without Permanent(), the retry fires three more times against an open breaker — guaranteed rejections, up to 2 seconds of wasted backoff. This is the contract between the two patterns: "this failure is not transient, stop immediately."

Also note: the RetryNotify callback is where metrics are incremented. Observability and reliability are wired together at the same callsite — not in separate middleware. You can't deploy retries without the corresponding visibility.


Observability: What Gets Measured

The lab tracks seven Prometheus metrics:

HTTP request count + latency by route: traffic shape and endpoint health
Worker jobs by status: processing throughput and failure rate
Queue depth: backpressure — is work piling up?
Payment failures: downstream service health
Retry attempts by result: how often retries fire, and how often they give up
Circuit breaker state changes: when the system decides a dependency is down

One gotcha with chi and Prometheus: the route label must come from the route pattern, not the raw path:

chi.RouteContext(r.Context()).RoutePattern()

Otherwise every /orders/abc-123 gets its own label value — unbounded cardinality, unbounded memory growth, broken dashboards.


How I Built It: Spec-First, AI-Assisted

Each commit was one AI-assisted session. I owned architecture and roadmap upfront; AI executed the items.

6 of 12 commits involved third-party libraries with ambiguous docs. For those, I used GitHits to surface real usage patterns from public Go repositories — actual code from projects that had already solved the same problems, not just documentation.

Standard library work (goroutines, channels, context, interfaces) needed no external context. AI handles well-documented patterns fluently. It's the library edge cases where external context makes the difference.

The result: 43 tests across 7 packages. Full coverage of the reliability layer. Every failure mode reproducible on demand.

The discipline of one session per commit, with a spec before each session, is what kept the codebase coherent. The tooling is secondary.


The Takeaway

Run the system. POST a few orders. Watch the logs.

The payment simulator fails ~30% of charges. Retries fire with increasing delays. After 5 consecutive failures, the circuit opens — and Permanent() short-circuits the retry loop. The circuit enters half-open, probes one request, and either recovers or re-opens.

/metrics shows it all in real time. No mystery. No silent failures.

Reliability isn't about preventing failures — it's about making them visible, bounded, and recoverable.

The full source is on GitHub. The best time to study these patterns is before your first outage.

For the language-agnostic version of these patterns, see 12 Commits to a Reliable Backend.