Building a Go Reliability Lab: Retry, Circuit Breakers, and Failure by Design

March 13, 2026

Most backend engineers know they need retry logic and circuit breakers.

Few have a place to test what happens when those patterns interact under real failure conditions.

So I built one.

go-reliability-lab is a controlled environment for reproducing and studying production reliability problems. Not a product — a laboratory.

It simulates an order-processing backend: HTTP API → Order Service → Postgres → Async Worker Queue → Payment Service (simulated) → Order Status Update.

12 commits. 2 days. 43 tests across 7 packages.

This article is the build log — what I built, what surprised me, and the patterns worth knowing.


The Stack

  • chi — HTTP router
  • zap — structured logging
  • cenkalti/backoff — retry with exponential backoff
  • sony/gobreaker — circuit breaker
  • prometheus/client_golang — metrics
  • net/http/pprof — profiling
  • pgx — Postgres driver

Standard Go. No frameworks. No magic.


Phase 1: Foundation

The first three commits establish the skeleton.

Commit 1 is documentation only — README, ARCHITECTURE.md, DEVELOPMENT_ROADMAP.md. No runtime code. The architecture was decided before any code was written.

Commit 2 adds the HTTP server with chi: /healthz, /version, graceful shutdown via signal.NotifyContext. The shutdown sequence is where the first interesting decision appears:

// Signal context cancelled — HTTP server stops accepting new requests
serveErr := serve(ctx, server, logger)

// Workers drain jobs already in the buffer
dispatcher.Stop()   // close the job channel
dispatcher.Wait()   // block until all 5 workers exit
workerCancel()      // cancel the worker background context

The worker context is a separate context.Background() derivative — not the signal context. This means workers finish their current jobs after the HTTP server shuts down. You don't lose in-flight work.

Commit 3 adds structured logging with zap: a RequestLogger middleware that captures request ID, method, path, status code, and latency for every request.


Phase 2: Domain and Persistence

Commit 4 defines the order domain: Order struct, Status enum (pending / processing / completed / failed), Service with CreateOrder / GetOrder / UpdateOrder, sentinel errors, UUID generation via crypto/rand.

Commit 5 adds the Postgres repository with pgx v5, Docker Compose for a local postgres:16-alpine, and schema auto-creation on startup. Falls back to an in-memory repository if DATABASE_URL is unset. No code change to swap between them — the repository interface handles it.

The architectural decision that matters here is where the dependency interface lives.

The orders package defines its own ProcessingQueue interface. It doesn't import workers. At the composition root in main.go, an adapter bridges them:

type dispatcherQueue struct{ d *workers.Dispatcher }

func (dq *dispatcherQueue) Submit(ctx context.Context, orderID string) error {
    return dq.d.Submit(ctx, workers.Job{OrderID: orderID})
}

ordersSvc.SetQueue(&dispatcherQueue{dispatcher})

The domain layer describes what it needs. The infrastructure layer provides it. No import cycle, no leaky abstraction. This is the pattern that lets you swap the worker queue implementation without touching the domain.


Phase 3: Async Processing

Commit 6 introduces the worker pool: 5 goroutines consuming from a 100-job buffered channel. Each worker loops on two cases:

for {
    select {
    case job, ok := <-d.jobs:
        if !ok { return } // channel closed → graceful exit
        d.handler(ctx, job)
    case <-ctx.Done():    // hard cancel
        return
    }
}

The channel close (dispatcher.Stop()) is the graceful signal — workers finish any job in-flight, then exit. ctx.Done() is the escape hatch if you need to force them out.

Commit 7 wires order creation into the pipeline. Creating an order persists to Postgres and enqueues a background job. Workers pick it up and drive the order through processing → payment → completed/failed.

Commit 8 adds the payment simulator. 30% failure probability. 100–500ms random latency. Context-aware — it selects on both time.After and ctx.Done(), so it respects cancellation. Uses math/rand/v2 (Go 1.22+).

This simulator is the core of the lab. It gives you a service that fails unpredictably, which is the only way to study how your reliability patterns behave under real conditions.


Phase 4: Reliability Patterns

Two commits. Two patterns. Most of the learning.

Retry with Exponential Backoff

Commit 9. Using cenkalti/backoff v4.

Config: 3 attempts, 200ms initial interval, 2s maximum interval. Straightforward — until you read the API carefully.

// WithMaxRetries takes the number of retries (not attempts), so subtract 1.
var policy backoff.BackOff = backoff.WithMaxRetries(b, attempts-1)

WithMaxRetries(b, 3) gives you 4 attempts — 1 initial + 3 retries. The parameter name says "retries" but engineers think in attempts. Off-by-one here is silent: your system just retries one extra time, burning backoff budget and increasing tail latency without anyone noticing.

The RetryNotify callback fires on each retry with the error and the wait duration. That's where you log and increment metrics. You see the backoff intervals as they happen.

Circuit Breaker

Commit 10. Using sony/gobreaker.

Config: 5 consecutive failures trips the breaker for 30 seconds. The docs explain the open/closed states. What they don't foreground:

func IsOpen(err error) bool {
    return errors.Is(err, gobreaker.ErrOpenState) ||
        errors.Is(err, gobreaker.ErrTooManyRequests)
}

There are two rejection error types.

ErrOpenState — the breaker is open, all calls blocked. Expected.

ErrTooManyRequests — the breaker is in half-open state, probing one request to see if the downstream recovered. If you only check ErrOpenState, these half-open rejections surface as unexpected application errors.
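For reference, a configuration matching the description above might look like this, assuming gobreaker v1's Settings API; the name and callback body are illustrative, not the lab's exact code.

```go
// Trip after 5 consecutive failures; stay open for 30 seconds.
cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
	Name:    "payment",
	Timeout: 30 * time.Second, // how long the breaker stays open
	ReadyToTrip: func(counts gobreaker.Counts) bool {
		return counts.ConsecutiveFailures >= 5
	},
	OnStateChange: func(name string, from, to gobreaker.State) {
		// a natural place to increment circuit_breaker_state_changes_total{from, to}
	},
})
```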

The Composition That Matters

These two patterns need explicit coordination. Here's the actual job handler from main.go:

chargeErr := reliability.Do(ctx, retryCfg,
    func(err error, wait time.Duration) {
        logger.Warn("retrying payment", zap.Error(err), zap.Duration("wait", wait))
        observability.RetryAttemptsTotal.WithLabelValues("retried").Inc()
    },
    func() error {
        cbErr := paymentCB.Execute(func() error {
            return paymentSim.Charge(ctx, job.OrderID)
        })
        if cbErr != nil && reliability.IsOpen(cbErr) {
            observability.RetryAttemptsTotal.WithLabelValues("permanent_failure").Inc()
            return reliability.Permanent(cbErr)
        }
        return cbErr
    })

The circuit breaker wraps the payment call. The retry wraps the circuit breaker. When the circuit opens mid-retry-cycle, Permanent() stops the loop immediately.

Without this, the retry keeps firing its remaining attempts against an open breaker. Each attempt is a guaranteed rejection. You burn the entire backoff budget — up to 2 seconds of delays — waiting for something that cannot succeed.

Permanent() is the contract between the two patterns. It means: "this is not a transient failure, don't retry." Every combination of retry and circuit breaker needs this escape hatch. Most implementations I've seen in production don't have it.

Also worth noting: the RetryNotify callback is where metrics are incremented. The observability and the reliability pattern are wired together at the same callsite — not in separate middleware. That's intentional.


Phase 5: Observability

Commit 11 adds seven Prometheus metrics:

  • http_requests_total{method, route, status_code} — HTTP counter
  • http_request_duration_seconds{method, route, status_code} — latency histogram
  • worker_jobs_total{status} — submitted / processed counter by status
  • worker_queue_depth — current queue backlog gauge
  • payment_failures_total — payment charge failures
  • retry_attempts_total{result} — retried / permanent_failure
  • circuit_breaker_state_changes_total{from, to} — state transitions

The {route} label deserves attention:

pattern := chi.RouteContext(r.Context()).RoutePattern()

Without this, every /orders/abc-123 gets its own label value. Label cardinality explodes; Prometheus memory grows unbounded. RoutePattern() collapses all of them to /orders/{id}. This is not obvious in the chi docs — it required looking at how other projects use chi with Prometheus.

Commit 12 mounts pprof on the chi router. One gotcha: chi doesn't share the default net/http mux. Without an explicit /{name} catch-all route pointing to pprof.Index, named profiles like /debug/pprof/heap return 404. The standard library handles this transparently on DefaultServeMux — chi doesn't.


How I Built It

Each commit was one AI-assisted development session. I owned the architecture and roadmap upfront; AI executed the roadmap items. Never the other way around.

For third-party library integration I used GitHits to surface real usage patterns from public Go repos. Of the 12 commits, 6 involved library APIs where the documentation left genuine ambiguity: chi graceful shutdown, zap middleware shape, cenkalti/backoff semantics, gobreaker half-open state, chi's RouteContext resolution, and pprof mounting on custom routers. Standard library work — goroutines, channels, context, interfaces — needed none of it.

Every commit was verified with go test ./..., go build, and gofmt before closing the session.

The discipline of one session per commit, with a defined spec before each session started, is what kept the codebase coherent across the two days. The tooling is secondary.


What the Lab Proves

Run the system. POST a few orders. Watch the logs.

The payment simulator fails roughly 3 in 10 charges. Retries fire with increasing delays. After 5 consecutive failures, the circuit breaker opens — and the RetryNotify callback stops getting called, because Permanent() short-circuits the loop. The circuit enters half-open state, probes one request, and either recovers or re-opens.

/metrics shows queue depth, retry counts, and state change counters in real time. pprof shows goroutine count under load.

No mystery. No silent failures. Every failure mode is visible, labeled, and recoverable.


The Insight

Reliability isn't about preventing failures.

It's about making failures visible, bounded, and recoverable.

A retry without a circuit breaker is a slow failure. A circuit breaker without observability is a silent one. Neither is acceptable in production.

The lab exists to prove that — with real code, real failure modes, and real metrics.

If your backend handles payments, processes async jobs, or calls any external service, these patterns aren't optional.

They're structural.

And the best time to understand them is before your first outage — not during it.