12 Commits to a Reliable Backend: The Patterns Most Teams Skip
In a previous post, I wrote about what actually breaks when a SaaS gets its first real users. The short version: it's not traffic. It's retries without backoff, background jobs that fail silently, and observability gaps that hide problems until they compound.
That post described the symptoms. This one is about the fix.
Over the past few days, I built go-reliability-lab — a controlled backend environment for reproducing and studying those exact failure modes.
An order-processing pipeline with a payment service that fails 30% of the time (configurable), async workers, a real database, and the reliability patterns that make the difference between a system that degrades gracefully and one that falls apart under pressure.
12 commits. 2 days. 49 tests across 7 packages — all with a spec-first, AI-assisted workflow. The patterns are validated with real code and real tests, not just theory.
Not a product — a laboratory. And the patterns it surfaces apply to any language or stack.
This post is the founder-first lens: business impact first, language-agnostic patterns, and pseudocode-level concepts you can apply in any stack. For the same project through a Go implementation walkthrough, see Building a Go Reliability Lab.
What Happens Without These Patterns
Picture any backend that calls an external service — payment processing, email delivery, third-party APIs.
No retry logic? A single transient failure becomes permanent. The order stays in limbo. Revenue is lost.
Naive retries? Your system hammers a failing service with repeated requests. Tail latency spikes. Downstream gets pushed further toward failure.
Circuit breaker without coordination? The retry loop keeps firing against an open breaker — guaranteed rejections, burning your entire backoff budget while the user waits.
Observability with raw paths as labels? Every unique order ID creates a new time series. Prometheus memory grows unbounded. Dashboards break when you need them most.
Each pattern solves a real problem. Implemented in isolation, they create new ones.
The Three Patterns That Matter
The reliability lab implements retry with exponential backoff, circuit breakers, and structured observability. The individual patterns are well-documented in the repository. What's less documented is how they interact — and where the subtle mistakes live.
1. The Retry ↔ Circuit Breaker Contract
Retry and circuit breaker need explicit coordination:
The circuit breaker wraps the payment call. The retry wraps the circuit breaker.
The critical piece is PERMANENT_FAILURE — when the circuit opens mid-retry-cycle, it tells the retry loop: this is not a transient failure, don't retry.
Without it, retries keep firing against an open breaker. Each one is a guaranteed rejection. You burn the entire backoff budget waiting for something that cannot succeed.
In business terms: users wait longer, the system does more useless work, and the failure is no more recoverable than it was on the first rejection.
Every system that combines retry and circuit breaker needs this escape hatch. Most implementations I've seen in production don't have it.
2. The Half-Open Trap
Circuit breakers have three states. Most teams only handle two of them.
In half-open, the breaker allows exactly one probe request. All others get rejected with ErrTooManyRequests — a different error type than ErrOpenState.
If your code only checks for the "open" error, half-open rejections surface as unexpected application errors — unhandled, unlogged, potentially returning 500s to users.
That makes the recovery window look like fresh instability at exactly the moment the dependency is trying to come back.
This bug only appears under sustained failure conditions. Exactly when you can least afford surprise errors.
3. The Metric Label Cardinality Trap
The natural instinct is to use the request path as a metric label:
```
# What you expect:
http_requests_total{route="/orders/{id}"} 150

# What actually happens:
http_requests_total{route="/orders/abc-123"} 1
http_requests_total{route="/orders/def-456"} 1
# ... (thousands of unique series)
```
Every unique ID creates a new time series. Prometheus memory grows unbounded. Dashboards become unusable.
Translated: monitoring gets more expensive while incident visibility gets worse.
The fix: use the route pattern (template with placeholders), not the resolved path. Most routers expose this, but it's rarely the default. I only found the correct approach by studying how other production systems wire their routers to Prometheus.
What Observability Actually Looks Like
The lab tracks seven Prometheus metrics covering the full reliability picture:
| Metric | What It Reveals |
|---|---|
| HTTP request count by route | Traffic shape and endpoint health |
| HTTP request latency by route | User-facing responsiveness and tail latency |
| Worker jobs by status | Processing throughput and failure rate |
| Queue depth | Backpressure — is work piling up? |
| Payment failures | Downstream service health |
| Retry attempts by result | How often retries fire, and how often they give up |
| Circuit breaker state changes | When the system decides a dependency is down |
The retry counter distinguishes "retried and succeeded" from "gave up (permanent failure)." The circuit breaker counter tracks state transitions. These are the signals that tell you whether your reliability patterns are actually working.
Key design choice: observability is wired into the reliability patterns at the same callsite — not in separate middleware. You can't accidentally deploy retries without the corresponding visibility.
How I Built It: Spec-First, AI-Assisted
Built in 2 days across 12 commits. Each commit was a single focused session with a defined scope.
Architecture Before Code
Commit 1 was documentation only: README, ARCHITECTURE.md, DEVELOPMENT_ROADMAP.md, AI_WORKFLOW.md. No runtime code.
Component boundaries, data flow, and failure modes — decided before a single function was written. This is the constraint that makes AI-assisted development work.
Without a clear spec, AI generates plausible code that doesn't compose.
The work was constrained by four core documents:
| Document | Role in the Build |
|---|---|
| ARCHITECTURE.md | System design: boundaries, dependency direction, request lifecycle, worker lifecycle, observability, and graceful shutdown |
| DEVELOPMENT_ROADMAP.md | Execution plan: a 12-commit sequence where each step introduces one major concept while keeping the repo runnable |
| AI_WORKFLOW.md | Operating contract for AI-assisted sessions: explicit scope, read-first discipline, and build/test verification before a change is considered done |
| ENGINEERING_CHECKLIST.md | Capability target: the reliability, observability, shutdown, and failure-simulation behaviors the system is supposed to demonstrate |
That is a better operating model than asking an AI to "build something production-grade" and hoping taste fills the gaps.
One Session, One Commit
Each commit followed the same loop:
- Spec — define scope, interfaces, tests needed
- Execute — AI implements within defined boundaries
- Verify — tests pass, code compiles, formatting clean
- Close — commit and move on
I owned the architecture and the roadmap. AI executed the roadmap items, never the other way around.
DEVELOPMENT_ROADMAP.md defined the 12-commit sequence and enforced a simple rule: each commit should introduce one major architectural concept while leaving the repo runnable. AI_WORKFLOW.md added the operating discipline around that plan: explicit scope, read-first behavior, and build/test verification before a session could be called complete.
The moment you let an AI agent make architectural decisions mid-session, you get local optimizations that break system-wide invariants.
Where AI Needed Help
Six of the 12 commits involved third-party libraries with genuinely ambiguous documentation: retry semantics (retries vs. attempts, a classic off-by-one), circuit breaker half-open behavior, metrics label resolution, and profiler mounting.
In those cases, the important step was not speed. It was verification. I checked behavior against real-world implementations and tests instead of trusting the first plausible answer.
That is the broader lesson:
AI can accelerate delivery, but reliability still comes from clear specs, external validation, and disciplined verification.
The Result
49 tests across 7 packages. Full coverage of the reliability layer. Every failure mode reproducible on demand.
Two days. One engineer. Spec-driven, AI-assisted, verified at every step.
The tooling accelerates execution. But architecture, scope control, and verification discipline are what make the output production-grade. Without those, speed just means you produce bad code faster.
The Takeaway
Reliability isn't about preventing failures. It's about making them visible, bounded, and recoverable.
A retry without a circuit breaker is a slow failure. A circuit breaker without observability is a silent one. Neither is acceptable in production.
If your backend handles payments, processes async jobs, or calls any external service, these patterns aren't optional. They're structural.
And with a spec-first approach and the right tools, you can validate all of them in a weekend — not a quarter.
The best time to understand these patterns is before your first outage — not during it.
If you want the implementation detail layer with concrete Go code and commit-by-commit decisions, see Building a Go Reliability Lab.