Advanced Distributed Systems Patterns

Introduction

In distributed systems, failures are not exceptional — they're expected. Networks partition, services go down, databases become slow. The patterns here are the defensive toolkit that keeps your system running when individual components fail. In ad-tech, where latency budgets are measured in milliseconds, these patterns are the difference between serving an ad and serving nothing.

Why This Matters

Distributed systems interviews are fundamentally about failure handling. Can you design a system that degrades gracefully? Do you know when to retry vs. fail fast? Can you prevent one slow dependency from taking down everything? These patterns are daily production concerns.


Circuit Breaker

The circuit breaker prevents repeated calls to a failing service, giving it time to recover.

State Machine

stateDiagram-v2
    [*] --> Closed
    Closed --> Open: failure threshold exceeded
    Open --> HalfOpen: timeout expires
    HalfOpen --> Closed: probe succeeds
    HalfOpen --> Open: probe fails
State Behavior

  • Closed: requests pass through; failures are counted.
  • Open: requests are rejected immediately; a reset timer runs.
  • Half-Open: a limited number of probe requests are allowed. Success → Closed. Failure → Open.

Implementation

import (
    "errors"
    "sync"
    "time"
)

type State int

const (
    StateClosed State = iota
    StateOpen
    StateHalfOpen
)

type CircuitBreaker struct {
    mu              sync.Mutex
    state           State
    failures        int
    successes       int
    halfOpenCalls   int // probes admitted in the current half-open period
    maxFailures     int
    resetTimeout    time.Duration
    halfOpenMax     int
    lastFailureTime time.Time
}

func NewCircuitBreaker(maxFailures int, resetTimeout time.Duration) *CircuitBreaker {
    return &CircuitBreaker{
        maxFailures:  maxFailures,
        resetTimeout: resetTimeout,
        halfOpenMax:  1,
    }
}

func (cb *CircuitBreaker) Execute(fn func() error) error {
    cb.mu.Lock()

    if cb.state == StateOpen {
        if time.Since(cb.lastFailureTime) > cb.resetTimeout {
            cb.state = StateHalfOpen
            cb.successes = 0
            cb.halfOpenCalls = 0
        } else {
            cb.mu.Unlock()
            return ErrCircuitOpen
        }
    }

    if cb.state == StateHalfOpen {
        // Admit at most halfOpenMax probes; reject the rest until a verdict.
        if cb.halfOpenCalls >= cb.halfOpenMax {
            cb.mu.Unlock()
            return ErrCircuitOpen
        }
        cb.halfOpenCalls++
    }

    cb.mu.Unlock()

    err := fn()

    cb.mu.Lock()
    defer cb.mu.Unlock()

    if err != nil {
        cb.failures++
        cb.lastFailureTime = time.Now()
        if cb.state == StateHalfOpen || cb.failures >= cb.maxFailures {
            cb.state = StateOpen
        }
        return err
    }

    if cb.state == StateHalfOpen {
        cb.successes++
        if cb.successes >= cb.halfOpenMax {
            cb.state = StateClosed
            cb.failures = 0
        }
    } else {
        cb.failures = 0
    }
    return nil
}

var ErrCircuitOpen = errors.New("circuit breaker is open")

Using sony/gobreaker

import (
    "log/slog"
    "time"

    "github.com/sony/gobreaker"
)

cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:        "upstream-api",
    MaxRequests: 3,                    // half-open probe count
    Interval:    10 * time.Second,     // closed-state reset interval
    Timeout:     30 * time.Second,     // open → half-open timeout
    ReadyToTrip: func(counts gobreaker.Counts) bool {
        return counts.ConsecutiveFailures > 5
    },
    OnStateChange: func(name string, from, to gobreaker.State) {
        slog.Warn("circuit breaker state change",
            "name", name, "from", from, "to", to)
    },
})

result, err := cb.Execute(func() (any, error) {
    return httpClient.Get("https://api.example.com/data")
})

Retry with Exponential Backoff and Jitter

Why Jitter Matters

Without jitter, clients that failed at the same moment retry at the same moment (thundering herd), hammering the recovering service in synchronized waves. Jitter spreads the retries randomly across the backoff window.

graph LR
    R1["Retry 1: 100ms ± jitter"] --> R2["Retry 2: 200ms ± jitter"]
    R2 --> R3["Retry 3: 400ms ± jitter"]
    R3 --> R4["Retry 4: 800ms ± jitter"]
    R4 --> FAIL["Give up"]

Implementation

type RetryConfig struct {
    MaxAttempts int
    BaseDelay   time.Duration
    MaxDelay    time.Duration
    Retryable   func(error) bool
}

func Retry(ctx context.Context, cfg RetryConfig, fn func(ctx context.Context) error) error {
    var lastErr error

    for attempt := 0; attempt < cfg.MaxAttempts; attempt++ {
        lastErr = fn(ctx)
        if lastErr == nil {
            return nil
        }

        if cfg.Retryable != nil && !cfg.Retryable(lastErr) {
            return lastErr // non-retryable error
        }

        if attempt == cfg.MaxAttempts-1 {
            break
        }

        // Full jitter: sleep a uniformly random duration in [0, backoff].
        backoff := cfg.BaseDelay * time.Duration(1<<uint(attempt))
        if backoff > cfg.MaxDelay {
            backoff = cfg.MaxDelay
        }
        sleep := time.Duration(rand.Int63n(int64(backoff) + 1)) // +1: Int63n panics on 0

        select {
        case <-time.After(sleep):
        case <-ctx.Done():
            return ctx.Err()
        }
    }

    return fmt.Errorf("after %d attempts: %w", cfg.MaxAttempts, lastErr)
}

// Usage
err := Retry(ctx, RetryConfig{
    MaxAttempts: 5,
    BaseDelay:   100 * time.Millisecond,
    MaxDelay:    5 * time.Second,
    Retryable: func(err error) bool {
        // Only retry transient errors
        var netErr net.Error
        return errors.As(err, &netErr) && netErr.Timeout()
    },
}, func(ctx context.Context) error {
    return callUpstreamAPI(ctx)
})

Interview Tip

"I always use exponential backoff with jitter for retries. Without jitter, correlated retries create a thundering herd. I also make retryability explicit — 4xx errors shouldn't be retried, but 503s and timeouts should. And I always respect context cancellation between retries."
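The status-code rule in this tip can be made explicit in the `Retryable` hook. A minimal sketch — the helper name `retryableStatus` and the exact status choices are ours, not from any library:

```go
package main

import (
	"fmt"
	"net/http"
)

// retryableStatus encodes the rule above: retry 429 and 5xx (except 501),
// never other 4xx, since those indicate a caller bug that retrying can't fix.
func retryableStatus(code int) bool {
	switch {
	case code == http.StatusTooManyRequests: // 429: back off, then retry
		return true
	case code == http.StatusNotImplemented: // 501 won't get better on retry
		return false
	case code >= 500: // 500, 502, 503, 504: likely transient
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(retryableStatus(503), retryableStatus(400)) // true false
}
```

Plugged into the RetryConfig above, this replaces the timeout-only check when the error carries an HTTP status.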


Rate Limiting

Token Bucket (golang.org/x/time/rate)

import "golang.org/x/time/rate"

// 100 requests/second, burst of 10
limiter := rate.NewLimiter(100, 10)

func handleRequest(w http.ResponseWriter, r *http.Request) {
    if !limiter.Allow() {
        http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
        return
    }
    // process request
}

// Or block until allowed (respects context)
func processItem(ctx context.Context, item Item) error {
    if err := limiter.Wait(ctx); err != nil {
        return err // context cancelled while waiting
    }
    return doWork(item)
}

// Reserve for advanced control
r := limiter.Reserve()
if !r.OK() {
    return errors.New("rate limit would never be satisfied")
}
time.Sleep(r.Delay()) // wait exactly the right amount

Per-Client Rate Limiting

type ClientLimiter struct {
    mu       sync.Mutex
    clients  map[string]*rate.Limiter
    rate     rate.Limit
    burst    int
}

func NewClientLimiter(r rate.Limit, burst int) *ClientLimiter {
    return &ClientLimiter{
        clients: make(map[string]*rate.Limiter),
        rate:    r,
        burst:   burst,
    }
}

// GetLimiter returns the limiter for clientID, creating one on first use.
// Note: the map grows without bound. In production, evict idle entries
// (e.g. track last-seen timestamps and sweep periodically).
func (cl *ClientLimiter) GetLimiter(clientID string) *rate.Limiter {
    cl.mu.Lock()
    defer cl.mu.Unlock()

    if l, ok := cl.clients[clientID]; ok {
        return l
    }
    l := rate.NewLimiter(cl.rate, cl.burst)
    cl.clients[clientID] = l
    return l
}

// HTTP middleware
func RateLimitMiddleware(cl *ClientLimiter) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            clientID := r.Header.Get("X-API-Key")
            if clientID == "" {
                clientID = r.RemoteAddr
            }

            if !cl.GetLimiter(clientID).Allow() {
                w.Header().Set("Retry-After", "1")
                http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
                return
            }

            next.ServeHTTP(w, r)
        })
    }
}

Sliding Window Log (for Precise Rate Limiting)

type SlidingWindowLimiter struct {
    mu       sync.Mutex
    window   time.Duration
    limit    int
    requests []time.Time
}

func NewSlidingWindowLimiter(window time.Duration, limit int) *SlidingWindowLimiter {
    return &SlidingWindowLimiter{
        window: window,
        limit:  limit,
    }
}

func (l *SlidingWindowLimiter) Allow() bool {
    l.mu.Lock()
    defer l.mu.Unlock()

    now := time.Now()
    cutoff := now.Add(-l.window)

    // Remove expired entries
    valid := l.requests[:0]
    for _, t := range l.requests {
        if t.After(cutoff) {
            valid = append(valid, t)
        }
    }
    l.requests = valid

    if len(l.requests) >= l.limit {
        return false
    }

    l.requests = append(l.requests, now)
    return true
}

Bulkhead Pattern

Isolate failures so one slow dependency doesn't consume all resources.

type Bulkhead struct {
    sem chan struct{}
}

func NewBulkhead(maxConcurrent int) *Bulkhead {
    return &Bulkhead{
        sem: make(chan struct{}, maxConcurrent),
    }
}

func (b *Bulkhead) Execute(ctx context.Context, fn func() error) error {
    select {
    case b.sem <- struct{}{}:
        defer func() { <-b.sem }()
        return fn()
    case <-ctx.Done():
        return fmt.Errorf("bulkhead: %w", ctx.Err())
    }
}

// Isolate different dependencies
var (
    dbBulkhead    = NewBulkhead(20) // max 20 concurrent DB calls
    apiBulkhead   = NewBulkhead(10) // max 10 concurrent API calls
    cacheBulkhead = NewBulkhead(50) // max 50 concurrent cache calls
)

func handleRequest(ctx context.Context) error {
    // DB calls are isolated from API calls
    var user *User
    err := dbBulkhead.Execute(ctx, func() error {
        var err error
        user, err = db.GetUser(ctx, "123")
        return err
    })
    if err != nil {
        return err
    }

    return apiBulkhead.Execute(ctx, func() error {
        return api.EnrichProfile(ctx, user)
    })
}
graph TD
    REQ[Incoming Request] --> ROUTE{Route}
    ROUTE -->|DB calls| DBB[DB Bulkhead<br>max=20]
    ROUTE -->|API calls| APIB[API Bulkhead<br>max=10]
    ROUTE -->|Cache calls| CB[Cache Bulkhead<br>max=50]
    DBB --> DB[(Database)]
    APIB --> API[External API]
    CB --> CACHE[Redis Cache]
    style DBB fill:#f9f,stroke:#333
    style APIB fill:#9ff,stroke:#333
    style CB fill:#ff9,stroke:#333

Timeout Propagation with Context

func handleBidRequest(w http.ResponseWriter, r *http.Request) {
    // Ad-tech: total budget is 100ms
    ctx, cancel := context.WithTimeout(r.Context(), 100*time.Millisecond)
    defer cancel()

    userID := r.URL.Query().Get("user_id")

    // Each step consumes part of the shared budget
    user, err := userService.GetUser(ctx, userID)
    if err != nil {
        if ctx.Err() != nil { // deadline exceeded or client disconnected
            http.Error(w, "timeout", http.StatusGatewayTimeout)
            return
        }
        http.Error(w, "user lookup failed", http.StatusInternalServerError)
        return
    }

    ads, err := adService.GetAds(ctx, user) // sees only the remaining time
    if err != nil {
        http.Error(w, "ad lookup failed", http.StatusInternalServerError)
        return
    }

    bid, err := bidService.ComputeBid(ctx, ads)
    if err != nil {
        http.Error(w, "bid computation failed", http.StatusInternalServerError)
        return
    }

    json.NewEncoder(w).Encode(bid)
}

// Downstream service respects the deadline
func (s *AdService) GetAds(ctx context.Context, user *User) ([]Ad, error) {
    // Check if there's enough time to even try
    if deadline, ok := ctx.Deadline(); ok {
        if time.Until(deadline) < 10*time.Millisecond {
            return nil, fmt.Errorf("insufficient time budget")
        }
    }

    // Context propagated to DB query — auto-cancels if deadline exceeded
    return s.repo.FindAdsForUser(ctx, user.ID)
}

Health Checks

Liveness, Readiness, and Startup Probes

type HealthChecker struct {
    mu     sync.RWMutex
    checks map[string]func(ctx context.Context) error
}

func NewHealthChecker() *HealthChecker {
    return &HealthChecker{checks: make(map[string]func(ctx context.Context) error)}
}

func (hc *HealthChecker) Register(name string, check func(ctx context.Context) error) {
    hc.mu.Lock()
    defer hc.mu.Unlock()
    hc.checks[name] = check
}

type HealthStatus struct {
    Status string            `json:"status"`
    Checks map[string]string `json:"checks"`
}

func (hc *HealthChecker) Check(ctx context.Context) HealthStatus {
    hc.mu.RLock()
    defer hc.mu.RUnlock()

    status := HealthStatus{
        Status: "healthy",
        Checks: make(map[string]string),
    }

    for name, check := range hc.checks {
        if err := check(ctx); err != nil {
            status.Status = "unhealthy"
            status.Checks[name] = err.Error()
        } else {
            status.Checks[name] = "ok"
        }
    }
    return status
}

// Setup
health := NewHealthChecker()

health.Register("database", func(ctx context.Context) error {
    return db.PingContext(ctx)
})

health.Register("redis", func(ctx context.Context) error {
    return redisClient.Ping(ctx).Err()
})

// Liveness: is the process alive?
mux.HandleFunc("GET /healthz", func(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("ok"))
})

// Readiness: can the process serve traffic?
mux.HandleFunc("GET /readyz", func(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
    defer cancel()

    status := health.Check(ctx)
    if status.Status != "healthy" {
        w.WriteHeader(http.StatusServiceUnavailable)
    }
    json.NewEncoder(w).Encode(status)
})

Distributed Tracing (OpenTelemetry)

import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("myservice")

func handleRequest(ctx context.Context, userID string) error {
    ctx, span := tracer.Start(ctx, "handleRequest",
        trace.WithAttributes(attribute.String("user.id", userID)),
    )
    defer span.End()

    user, err := getUser(ctx, userID)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return err
    }

    return processUser(ctx, user)
}

func getUser(ctx context.Context, id string) (*User, error) {
    ctx, span := tracer.Start(ctx, "getUser")
    defer span.End()

    // Trace propagated via context — shows as child span
    span.AddEvent("querying database")
    user, err := db.FindByID(ctx, id)
    if err != nil {
        span.RecordError(err)
        return nil, err
    }

    span.SetAttributes(attribute.String("user.name", user.Name))
    return user, nil
}
graph LR
    A["handleRequest (50ms)"] --> B["getUser (20ms)"]
    B --> C["DB query (15ms)"]
    A --> D["processUser (25ms)"]
    D --> E["External API (20ms)"]

Idempotency

Ensure operations produce the same result regardless of how many times they're called.

type IdempotencyStore interface {
    Check(ctx context.Context, key string) (result []byte, exists bool, err error)
    Store(ctx context.Context, key string, result []byte, ttl time.Duration) error
}

func IdempotencyMiddleware(store IdempotencyStore) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            key := r.Header.Get("Idempotency-Key")
            if key == "" {
                next.ServeHTTP(w, r)
                return
            }

            // Check for cached result. Note: this check-then-execute is not
            // atomic; two concurrent requests with the same key can both run.
            // A production store would use a lock or SET-NX style reservation.
            cached, exists, err := store.Check(r.Context(), key)
            if err == nil && exists {
                w.Header().Set("X-Idempotent-Replay", "true")
                w.Write(cached)
                return
            }

            // Capture the response. httptest.NewRecorder is fine for a sketch;
            // production code would use a purpose-built http.ResponseWriter
            // wrapper instead of importing net/http/httptest.
            rec := httptest.NewRecorder()
            next.ServeHTTP(rec, r)

            // Store for future replays (store errors ignored here for brevity)
            body := rec.Body.Bytes()
            store.Store(r.Context(), key, body, 24*time.Hour)

            // Copy recorded response to real writer
            for k, v := range rec.Header() {
                w.Header()[k] = v
            }
            w.WriteHeader(rec.Code)
            w.Write(body)
        })
    }
}

Graceful Degradation

type CacheWithFallback struct {
    primary   Cache
    secondary Cache // fallback when primary is down
    cb        *gobreaker.CircuitBreaker
}

func (c *CacheWithFallback) Get(ctx context.Context, key string) (string, error) {
    // Try primary with circuit breaker
    result, err := c.cb.Execute(func() (any, error) {
        return c.primary.Get(ctx, key)
    })
    if err == nil {
        return result.(string), nil
    }

    // Degrade to secondary
    slog.Warn("primary cache unavailable, using fallback", "error", err)
    return c.secondary.Get(ctx, key)
}

// Multi-level degradation
func (s *AdService) GetBid(ctx context.Context, req BidRequest) (*Bid, error) {
    // Level 1: full personalized bid
    bid, err := s.personalizedBid(ctx, req)
    if err == nil {
        return bid, nil
    }

    // Level 2: cached bid for this segment
    bid, err = s.segmentBid(ctx, req)
    if err == nil {
        slog.Warn("degraded to segment bid")
        return bid, nil
    }

    // Level 3: default bid
    slog.Warn("degraded to default bid")
    return s.defaultBid(req), nil
}

Quick Reference

Pattern               Problem It Solves                    Key Mechanism
Circuit Breaker       Repeated calls to a failing service  State machine: closed/open/half-open
Retry + Backoff       Transient failures                   Exponential delay + jitter
Rate Limiting         Overload prevention                  Token bucket, sliding window
Bulkhead              Failure isolation                    Bounded concurrency per dependency
Timeout Propagation   Cascading delays                     context.WithTimeout through call chain
Health Checks         Infrastructure awareness             Liveness + readiness endpoints
Distributed Tracing   Cross-service debugging              OpenTelemetry spans via context
Idempotency           Duplicate request safety             Idempotency key + cached response
Graceful Degradation  Partial availability                 Fallback chain with reduced functionality

Best Practices

  1. Set timeouts everywhere — every outbound call needs a deadline via context
  2. Circuit breakers on all external dependencies — prevent cascading failures
  3. Retry only idempotent, transient failures — don't retry 400s or non-idempotent POSTs without idempotency keys
  4. Always use jitter — prevents thundering herd on correlated retries
  5. Bulkhead critical dependencies — isolate database, cache, and external API call pools
  6. Health checks should be fast — liveness should be instant, readiness should check real dependencies
  7. Propagate trace context — pass context.Context through every function call and RPC
  8. Design for idempotency — use idempotency keys for non-idempotent operations in unreliable networks
  9. Monitor circuit breaker state — alert on state transitions (closed → open)
  10. Test failure modes — use fault injection to verify graceful degradation works

Common Pitfalls

Retrying Non-Idempotent Operations

Retrying a "create payment" without an idempotency key can charge the user twice. Only retry operations that are safe to repeat, or use idempotency keys.
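The safe-retry shape looks like this sketch: the key is minted once per logical operation, outside the retry loop, so every attempt carries the same Idempotency-Key. The `createPayment` helper and its endpoint are hypothetical; the header name follows the common convention used above.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
)

// newIdempotencyKey mints a random 128-bit key as 32 hex characters.
func newIdempotencyKey() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// createPayment retries on 5xx, but because the key is fixed BEFORE the loop,
// the server can deduplicate: at most one payment is ever created.
func createPayment(client *http.Client, url string, attempts int) error {
	key := newIdempotencyKey() // once per operation, NOT per attempt
	var lastErr error
	for i := 0; i < attempts; i++ {
		req, err := http.NewRequest(http.MethodPost, url, nil)
		if err != nil {
			return err
		}
		req.Header.Set("Idempotency-Key", key)
		resp, err := client.Do(req)
		if err != nil {
			lastErr = err
			continue
		}
		resp.Body.Close()
		if resp.StatusCode < 500 {
			return nil // success, or a 4xx that retrying can't fix
		}
		lastErr = fmt.Errorf("server error: %d", resp.StatusCode)
	}
	return lastErr
}

func main() {
	fmt.Println(len(newIdempotencyKey())) // 32
}
```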

Retry Without Backoff

Immediate retries to a struggling service make things worse. Always use exponential backoff with jitter. Without jitter, all clients hammer the recovering service simultaneously.

Circuit Breaker Too Sensitive

A circuit breaker that opens after 1 failure causes false positives. Tune thresholds based on normal error rates. Use consecutive failures or error percentage, not total failures.
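The error-percentage approach can be expressed as a small predicate (suitable as the body of gobreaker's ReadyToTrip). The threshold values here are illustrative, not recommendations; tune them against your service's baseline error rate.

```go
package main

import "fmt"

// shouldTrip trips on error rate rather than an absolute count, and requires
// a minimum sample size so that one failure out of two requests cannot open
// the breaker.
func shouldTrip(requests, failures uint32) bool {
	const minSamples = 20     // don't judge on tiny samples
	const maxErrorRate = 0.5  // trip above 50% errors
	if requests < minSamples {
		return false
	}
	return float64(failures)/float64(requests) > maxErrorRate
}

func main() {
	fmt.Println(shouldTrip(2, 1))   // false: too few samples
	fmt.Println(shouldTrip(40, 30)) // true: 75% error rate
}
```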

Health Check That's Too Deep

A readiness check that calls multiple downstream services creates a cascading dependency: if service C is down, services A and B both report unready. Limit readiness checks to direct dependencies only.

Ignoring Context Cancellation

When a context is cancelled (client disconnect, timeout), continuing work wastes resources. Check ctx.Err() at natural checkpoints and in loops.


Performance Considerations

  • golang.org/x/time/rate guards its token bucket with a mutex; an Allow() check is a handful of arithmetic operations under that lock, cheap enough for hot paths
  • Circuit breaker state checks are O(1) — negligible overhead
  • Retry delays should be bounded — set MaxDelay to prevent exponential explosion
  • Bulkhead channel operations are ~50ns — minimal overhead for the isolation benefit
  • OpenTelemetry spans have sampling — use probability sampling (e.g., 1%) in high-throughput production
  • Idempotency stores should use Redis with TTL — checking is one round-trip, negligible compared to the operation

Interview Tips

Interview Tip

"My standard resilience stack for any external dependency is: timeout (via context), circuit breaker (to fail fast when down), retry with backoff (for transient failures), and bulkhead (to isolate). These compose naturally: the retry wraps the circuit breaker, which wraps the actual call, all within a timeout context."

Interview Tip

"In ad-tech, the total bid response budget is typically 100ms. I propagate this deadline through context to every downstream call. Each service checks the remaining time before starting expensive work. If we can't compute a personalized bid in time, we degrade to a cached segment-level bid, then to a default bid — always serving something within the deadline."

Interview Tip

"Rate limiting strategy depends on what you're protecting. Token bucket (like x/time/rate) is great for smoothing bursts. Sliding window is better for strict per-window limits. For distributed rate limiting across multiple pods, I use Redis with Lua scripts for atomic check-and-increment."


Key Takeaways

  1. Circuit breaker prevents cascading failures by failing fast when a dependency is down
  2. Retry with exponential backoff + jitter handles transient failures without thundering herds
  3. Rate limiting protects services from overload — token bucket for burst tolerance, sliding window for strict limits
  4. Bulkhead isolates failures so one slow dependency doesn't consume all goroutines/connections
  5. Timeout propagation via context is Go's killer feature for distributed systems — use it everywhere
  6. Health checks should separate liveness (process alive?) from readiness (can serve traffic?)
  7. Distributed tracing with OpenTelemetry provides cross-service visibility via context propagation
  8. Idempotency is essential for safe retries in unreliable networks
  9. Graceful degradation ensures the system always returns something — even if reduced quality
  10. These patterns compose — timeout → circuit breaker → retry → bulkhead → actual call