Building a Go Reliability Lab
Most backend teams know they need retry logic and circuit breakers.
Few have a controlled environment to test what happens when those patterns interact under real failure conditions. So I built one.
go-reliability-lab is a laboratory for reproducing production reliability problems — not a product. An order-processing pipeline with a payment service that fails 30% of the time, async workers, a real database, and the patterns that separate graceful degradation from cascading failure.
12 commits. 2 days. 43 tests across 7 packages. The build log below covers what I built, the gotchas I hit, and why each decision matters.
For the language-agnostic version of the patterns themselves, see 12 Commits to a Reliable Backend.
System Architecture
Every component except the payment service is real. That one service is simulated, and intentionally unreliable, so every failure path gets exercised.
The Build: 12 Commits in 5 Phases
Phase 1: Foundation (Commits 1–3)
Commit 1 was documentation only — README, ARCHITECTURE.md, DEVELOPMENT_ROADMAP.md. Architecture decided before any code. Commits 2–3 added the HTTP server with chi, graceful shutdown, and structured logging with zap.
The key design choice: workers run on a separate context from the HTTP server. When the server stops accepting requests, workers finish their in-flight jobs instead of dropping them.
Phase 2: Domain & Persistence (Commits 4–5)
Order model, status lifecycle (pending → processing → completed/failed), Postgres repository with pgx v5. Falls back to in-memory storage if no database is configured — same interface, zero code changes to swap.
The architectural pattern that matters here:
The domain defines what it needs. Infrastructure provides it. No import cycles, no leaky abstractions.
Phase 3: Async Processing (Commits 6–8)
Worker pool — 5 goroutines consuming from a 100-job buffered channel. Order creation persists to Postgres and enqueues a background job. Workers drive the order through processing → payment → status update.
Commit 8 adds the payment simulator: configurable failure probability, 100–500ms random latency, context-aware cancellation. This is the core of the lab — an intentionally unreliable dependency that exercises every failure path.
Phase 4: Reliability Patterns (Commits 9–10)
Two commits. Two patterns. Most of the learning.
The Gotchas Worth Knowing
These are the implementation details that documentation doesn't foreground — the kind of thing that causes subtle bugs in production.
Retry: The Off-by-One Nobody Notices
Using cenkalti/backoff v4, the retry configuration looks straightforward:
// WithMaxRetries takes retries (not attempts) — subtract 1.
var policy backoff.BackOff = backoff.WithMaxRetries(b, uint64(attempts-1))
WithMaxRetries(b, 3) gives you 4 attempts (1 initial + 3 retries). The parameter says "retries" but engineers think in "attempts." Off-by-one here is silent — your system just retries one extra time, burning backoff budget and increasing tail latency with no visible error.
Circuit Breaker: Two Error Types, Not One
Using sony/gobreaker, most teams only check for one rejection:
func IsOpen(err error) bool {
	return errors.Is(err, gobreaker.ErrOpenState) ||
		errors.Is(err, gobreaker.ErrTooManyRequests)
}
ErrOpenState — breaker is open, all calls blocked. Expected.
ErrTooManyRequests — breaker is half-open, probing with a single request. Every other concurrent request gets this error. If you only check ErrOpenState, half-open rejections surface as unhandled application errors — 500s to users during the exact window when the system is trying to recover.
The Composition: Where It All Connects
The retry wraps the circuit breaker. The circuit breaker wraps the payment call. When the circuit opens mid-retry, Permanent() stops the retry loop immediately.
chargeErr := reliability.Do(ctx, retryCfg,
	func(err error, wait time.Duration) {
		logger.Warn("retrying payment",
			zap.Error(err), zap.Duration("wait", wait))
		observability.RetryAttemptsTotal.
			WithLabelValues("retried").Inc()
	},
	func() error {
		cbErr := paymentCB.Execute(func() error {
			return paymentSim.Charge(ctx, job.OrderID)
		})
		if cbErr != nil && reliability.IsOpen(cbErr) {
			observability.RetryAttemptsTotal.
				WithLabelValues("permanent_failure").Inc()
			return reliability.Permanent(cbErr)
		}
		return cbErr
	})
Without Permanent(), the retry fires three more times against an open breaker — guaranteed rejections, up to 2 seconds of wasted backoff. This is the contract between the two patterns: "this failure is not transient, stop immediately."
Also note: the RetryNotify callback is where metrics are incremented. Observability and reliability are wired together at the same callsite — not in separate middleware. You can't deploy retries without the corresponding visibility.
Observability: What Gets Measured
The lab tracks seven Prometheus metrics:
| Metric | What It Reveals |
|---|---|
| HTTP request count + latency by route | Traffic shape and endpoint health |
| Worker jobs by status | Processing throughput and failure rate |
| Queue depth | Backpressure — is work piling up? |
| Payment failures | Downstream service health |
| Retry attempts by result | How often retries fire, and how often they give up |
| Circuit breaker state changes | When the system decides a dependency is down |
One gotcha with chi and Prometheus: the route label must come from the route pattern, not the raw URL path:
chi.RouteContext(r.Context()).RoutePattern()
Without it, every /orders/abc-123 gets its own label value — unbounded cardinality, unbounded memory growth, broken dashboards.
How I Built It: Spec-First, AI-Assisted
Each commit was one AI-assisted session. I owned architecture and roadmap upfront; AI executed the items.
6 of 12 commits involved third-party libraries with ambiguous docs. For those, I used GitHits to surface real usage patterns from public Go repositories — actual code from projects that had already solved the same problems, not just documentation.
Standard library work (goroutines, channels, context, interfaces) needed no external context. AI handles well-documented patterns fluently. It's the library edge cases where external context makes the difference.
The result: 43 tests across 7 packages. Full coverage of the reliability layer. Every failure mode reproducible on demand.
The discipline of one session per commit, with a spec before each session, is what kept the codebase coherent. The tooling is secondary.
The Takeaway
Run the system. POST a few orders. Watch the logs.
The payment simulator fails ~30% of charges. Retries fire with increasing delays. After 5 consecutive failures, the circuit opens — and Permanent() short-circuits the retry loop. The circuit enters half-open, probes one request, and either recovers or re-opens.
/metrics shows it all in real time. No mystery. No silent failures.
Reliability isn't about preventing failures — it's about making them visible, bounded, and recoverable.
The full source is on GitHub. The best time to study these patterns is before your first outage.
For the language-agnostic version of these patterns, see 12 Commits to a Reliable Backend.