Building a Go Reliability Lab
Most backend teams know they need retry logic and circuit breakers.
Few have a controlled environment to test what happens when those patterns interact under real failure conditions. So I built one.
go-reliability-lab is a laboratory for reproducing production reliability problems — not a product. An order-processing pipeline with a payment service that fails 30% of the time, async workers, a real database, and the patterns that separate graceful degradation from cascading failure.
This post is the middle path: founder-relevant tradeoffs plus concrete Go implementation details. It is the technical companion to 12 Commits to a Reliable Backend, which explains the same project in founder-first, language-agnostic terms.
12 commits. 2 days. 49 tests across 7 packages. The build log below covers what I built, the gotchas I hit, and why each decision matters.
This repo was specified on paper before it was assembled in code. ARCHITECTURE.md, DEVELOPMENT_ROADMAP.md, AI_WORKFLOW.md, and ENGINEERING_CHECKLIST.md were part of the build, not documentation after the fact. They defined the system boundaries, the execution order, the rules for AI-assisted sessions, and the production capabilities the lab had to demonstrate.
System Architecture
Every component is real. The payment service is the only simulation — intentionally unreliable so every failure path gets exercised.
The fuller version lives in ARCHITECTURE.md. It spells out the request lifecycle, worker lifecycle, observability architecture, and graceful shutdown model that make this a lab instead of just a demo app.
The Build: 12 Commits in 5 Phases
The five-phase shape of this post comes directly from DEVELOPMENT_ROADMAP.md. It laid out the build in advance and enforced a simple discipline: each commit introduces one major architectural concept while leaving the repository in a working state.
Phase 1: Foundation (Commits 1–3)
Commit 1 was documentation only — README, ARCHITECTURE.md, DEVELOPMENT_ROADMAP.md, AI_WORKFLOW.md. Architecture decided before any code. Commits 2–3 added the HTTP server with chi, graceful shutdown, and structured logging with zap.
The key design choice: workers run on a separate context from the HTTP server. When the server stops accepting requests, workers finish their in-flight jobs instead of dropping them.
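A minimal sketch of that split, standard library only (all names here are illustrative, not the repo's exact wiring):

```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

func main() {
	// Context 1: tied to SIGINT/SIGTERM; when it fires, the HTTP listener shuts down.
	serverCtx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	// Context 2: owned by the workers; it is not cancelled when the server stops.
	workerCtx, cancelWorkers := context.WithCancel(context.Background())
	defer cancelWorkers()

	jobs := make(chan string, 100)
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for {
			select {
			case job, ok := <-jobs:
				if !ok {
					return // queue closed and drained: worker exits cleanly
				}
				_ = job // process the job here
			case <-workerCtx.Done():
				return // hard stop, only used if draining takes too long
			}
		}
	}()

	srv := &http.Server{Addr: ":8080"}
	go func() { _ = srv.ListenAndServe() }()

	<-serverCtx.Done() // shutdown signal received

	// Stop accepting new requests first...
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	_ = srv.Shutdown(shutdownCtx)

	// ...then close the queue so in-flight and queued jobs finish before exit.
	close(jobs)
	wg.Wait()
}
```

The important property: cancelling the server context never touches the worker context, so workers stop only after their queue is drained.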
Phase 2: Domain & Persistence (Commits 4–5)
Order model, status lifecycle (pending → processing → completed/failed), Postgres repository with pgx v5. Falls back to in-memory storage if no database is configured — same interface, zero code changes to swap.
The architectural pattern that matters here:
The domain defines what it needs. Infrastructure provides it. No import cycles, no leaky abstractions.
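Roughly, the shape looks like this (a single-file sketch with hypothetical names; in the repo the pieces live in separate packages, and the imports are context, errors, and sync):

```go
// The domain package owns both the model and the interface it needs.
type Order struct {
	ID     string
	Status string // pending → processing → completed/failed
}

// OrderRepository is declared next to the domain model, not next to the database code.
type OrderRepository interface {
	Save(ctx context.Context, o Order) error
	Get(ctx context.Context, id string) (Order, error)
}

// memoryRepo is the in-memory fallback; a Postgres repository built on pgx
// implements the same interface, so swapping storage never touches the domain.
type memoryRepo struct {
	mu     sync.RWMutex
	orders map[string]Order
}

func (r *memoryRepo) Save(ctx context.Context, o Order) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.orders[o.ID] = o
	return nil
}

func (r *memoryRepo) Get(ctx context.Context, id string) (Order, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	o, ok := r.orders[id]
	if !ok {
		return Order{}, errors.New("order not found")
	}
	return o, nil
}
```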
Phase 3: Async Processing (Commits 6–8)
Commit 6 introduced a bounded worker pool and job queue.
Commit 7 connected order creation to that background pipeline.
Commit 8 added the payment simulator: the intentionally unreliable dependency that gave the pipeline something real to process.
The behavior that matters here:
This is the moment the system stopped doing everything in the request path.
The API stayed fast.
The heavy work moved into the background.
That split made the later reliability work possible. Retries, circuit breakers, and queue metrics only mattered once there was real asynchronous work to protect and observe.
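A bounded pool in miniature, assuming a channel-backed queue (illustrative only; the lab's pool adds per-job context, logging, and metrics):

```go
// Job carries the minimum a worker needs; the lab's job type has more fields.
type Job struct{ OrderID string }

// Pool is a bounded worker pool: a fixed number of goroutines draining one channel.
type Pool struct {
	queue chan Job
	wg    sync.WaitGroup
}

func NewPool(workers, queueSize int, process func(context.Context, Job) error) *Pool {
	p := &Pool{queue: make(chan Job, queueSize)}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for job := range p.queue {
				_ = process(context.Background(), job)
			}
		}()
	}
	return p
}

// Enqueue never blocks: a full queue signals backpressure instead of growing without bound.
func (p *Pool) Enqueue(j Job) bool {
	select {
	case p.queue <- j:
		return true
	default:
		return false
	}
}

// Stop closes the queue and waits for in-flight jobs to finish.
func (p *Pool) Stop() {
	close(p.queue)
	p.wg.Wait()
}
```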
Phase 4: Reliability Patterns (Commits 9–10)
This is where the lab stopped being a failure simulator and became a reliability system.
Commit 9 added retry with exponential backoff.
Commit 10 added a circuit breaker around the payment dependency.
The behavior that matters here:
Retries help when the dependency is flaky.
Circuit breakers help when the dependency is unhealthy.
Together they changed the system from "keep trying until it hurts" to "recover when possible, stop quickly when not."
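For reference, this is one way to configure such a breaker with sony/gobreaker. The thresholds are assumptions chosen to match the behavior described later in this post (open after 5 consecutive failures, single half-open probe), not values copied from the repo:

```go
paymentCB := gobreaker.NewCircuitBreaker(gobreaker.Settings{
	Name:        "payment",
	MaxRequests: 1,                // half-open: allow one probe at a time
	Timeout:     10 * time.Second, // how long the breaker stays open before probing
	ReadyToTrip: func(c gobreaker.Counts) bool {
		return c.ConsecutiveFailures >= 5
	},
	OnStateChange: func(name string, from, to gobreaker.State) {
		// Natural hook for a "circuit breaker state changes" metric.
	},
})
```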
Phase 5: Observability (Commits 11–12)
This is where the lab stopped being merely resilient and became inspectable.
Commit 11 added /metrics.
Commit 12 added /debug/pprof.
The operational loop looked like this:
Metrics tell you that something is going wrong.
Profiling helps you find where and why.
That is what made the earlier reliability work operational. Retries, circuit breakers, and worker backlogs only become trustworthy once you can see them in real time and debug them under load.
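Wiring both endpoints onto a chi router takes only a few lines. A sketch using promhttp and chi's middleware.Profiler helper (the lab's router carries more middleware than this):

```go
r := chi.NewRouter()
r.Handle("/metrics", promhttp.Handler())  // Prometheus scrape endpoint
r.Mount("/debug", middleware.Profiler())  // mounts net/http/pprof handlers under /debug/pprof/
```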
The Gotchas Worth Knowing
These are the implementation details that documentation doesn't foreground — the kind of thing that causes subtle bugs in production.
Retry: The Off-by-One Nobody Notices
Using cenkalti/backoff v4, the retry configuration looks straightforward:
// WithMaxRetries takes retries (not attempts) — subtract 1.
b := backoff.NewExponentialBackOff()
var policy backoff.BackOff = backoff.WithMaxRetries(b, attempts-1)
WithMaxRetries(b, 3) gives you 4 attempts (1 initial + 3 retries). The parameter says "retries" but engineers think in "attempts." Off-by-one here is silent — your system just retries one extra time, burning backoff budget and increasing tail latency with no visible error.
Circuit Breaker: Two Error Types, Not One
Using sony/gobreaker, most teams check for only one rejection error. The lab's helper checks both:
func IsOpen(err error) bool {
	return errors.Is(err, gobreaker.ErrOpenState) ||
		errors.Is(err, gobreaker.ErrTooManyRequests)
}
ErrOpenState — breaker is open, all calls blocked. Expected.
ErrTooManyRequests — breaker is half-open and probing with a single request. All other concurrent requests get this error. If you only check ErrOpenState, half-open rejections surface as unhandled application errors — 500s to users during the exact window when the system is trying to recover.
The Composition: Where It All Connects
The retry wraps the circuit breaker. The circuit breaker wraps the payment call. When the circuit opens mid-retry, Permanent() stops the retry loop immediately.
chargeErr := reliability.Do(ctx, retryCfg,
	// Notify callback: runs before each retry wait, so logging and the
	// retry metric are wired at the same callsite as the retry itself.
	func(err error, wait time.Duration) {
		logger.Warn("retrying payment",
			zap.Error(err), zap.Duration("wait", wait))
		observability.RetryAttemptsTotal.
			WithLabelValues("retried").Inc()
	},
	// Operation: the circuit breaker wraps the actual payment call.
	func() error {
		cbErr := paymentCB.Execute(func() error {
			return paymentSim.Charge(ctx, job.OrderID)
		})
		// An open breaker is not a transient failure: stop the retry loop now.
		if cbErr != nil && reliability.IsOpen(cbErr) {
			observability.RetryAttemptsTotal.
				WithLabelValues("permanent_failure").Inc()
			return reliability.Permanent(cbErr)
		}
		return cbErr
	})
Without Permanent(), the retry fires three more times against an open breaker — guaranteed rejections, up to 2 seconds of wasted backoff. This is the contract between the two patterns: "this failure is not transient, stop immediately."
Also note: the RetryNotify callback is where metrics are incremented. Observability and reliability are wired together at the same callsite — not in separate middleware. You can't deploy retries without the corresponding visibility.
Observability: What Gets Measured
The lab tracks seven Prometheus metrics:
| Metric | What It Reveals |
|---|---|
| HTTP request count by route | Traffic shape and endpoint health |
| HTTP request latency by route | User-facing responsiveness and endpoint health |
| Worker jobs by status | Processing throughput and failure rate |
| Queue depth | Backpressure — is work piling up? |
| Payment failures | Downstream service health |
| Retry attempts by result | How often retries fire, and how often they give up |
| Circuit breaker state changes | When the system decides a dependency is down |
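For illustration, here are two of these declared with promauto; the metric and label names are assumptions, not necessarily the repo's exact names:

```go
var (
	// Queue depth: a gauge the worker pool updates on enqueue and dequeue.
	QueueDepth = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "worker_queue_depth",
		Help: "Jobs waiting in the worker queue.",
	})
	// Retry attempts by result ("retried", "permanent_failure", ...).
	RetryAttemptsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "retry_attempts_total",
		Help: "Retry attempts by outcome.",
	}, []string{"result"})
)
```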
One gotcha with chi and Prometheus: the route label must come from the matched pattern, via
chi.RouteContext(r.Context()).RoutePattern()
not from the raw URL path. Label with the raw path and every /orders/abc-123 gets its own label value — unbounded cardinality, unbounded memory growth, broken dashboards.
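A sketch of what that looks like as middleware (histogram and variable names are illustrative). The route pattern is only complete after the router has matched, so it is read after the handler runs:

```go
var httpRequestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name: "http_request_duration_seconds",
	Help: "Request latency by method and route pattern.",
}, []string{"method", "route"})

func metricsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)

		// Read the pattern after the handler so chi has finished matching.
		route := chi.RouteContext(r.Context()).RoutePattern()
		if route == "" {
			route = "unmatched"
		}
		httpRequestDuration.WithLabelValues(r.Method, route).
			Observe(time.Since(start).Seconds())
	})
}
```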
How I Built It: Spec-First, AI-Assisted
Each commit was one AI-assisted session. I owned architecture and roadmap upfront; AI executed the items. DEVELOPMENT_ROADMAP.md supplied the execution plan, and AI_WORKFLOW.md treated that plan as a binding contract: one commit at a time, explicit scope, existing code read first, and format/build/test gates before any session could be considered complete.
ARCHITECTURE.md defined the modular-monolith boundaries, dependency direction, request lifecycle, worker lifecycle, observability-first design, and graceful shutdown model before implementation.
ENGINEERING_CHECKLIST.md turned "production-style" from a vague aspiration into a concrete capability list: worker pools, graceful shutdown, retry with backoff, circuit breakers, metrics, structured logs, and failure simulation.
6 of 12 commits involved third-party libraries with ambiguous docs. For those, I used GitHits to surface real usage patterns from public Go repositories — actual code from projects that had already solved the same problems, not just documentation.
Standard library work (goroutines, channels, context, interfaces) needed no external context. AI handles well-documented patterns fluently. It's the library edge cases where external context makes the difference.
The result: 49 tests across 7 packages. Full coverage of the reliability layer. Every failure mode reproducible on demand.
The discipline of one session per commit, with a spec before each session, is what kept the codebase coherent. The tooling mattered, but the written constraints mattered more.
The Takeaway
Run the system. POST a few orders. Watch the logs.
The payment simulator fails ~30% of charges. Retries fire with increasing delays. After 5 consecutive failures, the circuit opens — and Permanent() short-circuits the retry loop. The circuit enters half-open, probes one request, and either recovers or re-opens.
/metrics shows it all in real time. No mystery. No silent failures.
Reliability isn't about preventing failures — it's about making them visible, bounded, and recoverable.
The full source is on GitHub. The best time to study these patterns is before your first outage.
If you want the founder-first version (business framing, language-agnostic patterns, minimal Go detail), see 12 Commits to a Reliable Backend.