Advanced Distributed Systems Patterns¶
Introduction¶
In distributed systems, failures are not exceptional — they're expected. Networks partition, services go down, databases become slow. The patterns here are the defensive toolkit that keeps your system running when individual components fail. In ad-tech, where latency budgets are measured in milliseconds, these patterns are the difference between serving an ad and serving nothing.
Why This Matters
Distributed systems interviews are fundamentally about failure handling. Can you design a system that degrades gracefully? Do you know when to retry vs. fail fast? Can you prevent one slow dependency from taking down everything? These patterns are daily production concerns.
Circuit Breaker¶
The circuit breaker prevents repeated calls to a failing service, giving it time to recover.
State Machine¶
stateDiagram-v2
[*] --> Closed
Closed --> Open: failure threshold exceeded
Open --> HalfOpen: timeout expires
HalfOpen --> Closed: probe succeeds
HalfOpen --> Open: probe fails
| State | Behavior |
|---|---|
| Closed | Requests pass through. Failures counted. |
| Open | Requests immediately rejected. Timer running. |
| Half-Open | One probe request allowed. Success → Closed. Failure → Open. |
Implementation¶
type State int

const (
    StateClosed State = iota
    StateOpen
    StateHalfOpen
)

type CircuitBreaker struct {
    mu              sync.Mutex
    state           State
    failures        int
    successes       int
    maxFailures     int
    resetTimeout    time.Duration
    halfOpenMax     int
    lastFailureTime time.Time
}

func NewCircuitBreaker(maxFailures int, resetTimeout time.Duration) *CircuitBreaker {
    return &CircuitBreaker{
        maxFailures:  maxFailures,
        resetTimeout: resetTimeout,
        halfOpenMax:  1,
    }
}

func (cb *CircuitBreaker) Execute(fn func() error) error {
    cb.mu.Lock()
    if cb.state == StateOpen {
        if time.Since(cb.lastFailureTime) > cb.resetTimeout {
            cb.state = StateHalfOpen
            cb.successes = 0
        } else {
            cb.mu.Unlock()
            return ErrCircuitOpen
        }
    }
    cb.mu.Unlock()

    // NOTE: this sketch does not gate in-flight requests, so several goroutines
    // can run fn concurrently while half-open. Production breakers (e.g.
    // gobreaker's MaxRequests) also limit concurrent probes in that state.
    err := fn()

    cb.mu.Lock()
    defer cb.mu.Unlock()
    if err != nil {
        cb.failures++
        cb.lastFailureTime = time.Now()
        if cb.state == StateHalfOpen || cb.failures >= cb.maxFailures {
            cb.state = StateOpen
        }
        return err
    }
    if cb.state == StateHalfOpen {
        cb.successes++
        if cb.successes >= cb.halfOpenMax {
            cb.state = StateClosed
            cb.failures = 0
        }
    } else {
        cb.failures = 0
    }
    return nil
}

var ErrCircuitOpen = errors.New("circuit breaker is open")
Using sony/gobreaker¶
import "github.com/sony/gobreaker"

cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:        "upstream-api",
    MaxRequests: 3,                // half-open probe count
    Interval:    10 * time.Second, // closed-state counts reset interval
    Timeout:     30 * time.Second, // open → half-open timeout
    ReadyToTrip: func(counts gobreaker.Counts) bool {
        return counts.ConsecutiveFailures > 5
    },
    OnStateChange: func(name string, from, to gobreaker.State) {
        slog.Warn("circuit breaker state change",
            "name", name, "from", from, "to", to)
    },
})

result, err := cb.Execute(func() (any, error) {
    return httpClient.Get("https://api.example.com/data")
})
Retry with Exponential Backoff and Jitter¶
Why Jitter Matters¶
Without jitter, all clients retry at the same time (thundering herd). Jitter spreads retries randomly.
graph LR
R1["Retry 1: 100ms ± jitter"] --> R2["Retry 2: 200ms ± jitter"]
R2 --> R3["Retry 3: 400ms ± jitter"]
R3 --> R4["Retry 4: 800ms ± jitter"]
R4 --> FAIL["Give up"]
Implementation¶
type RetryConfig struct {
    MaxAttempts int
    BaseDelay   time.Duration
    MaxDelay    time.Duration
    Retryable   func(error) bool
}

func Retry(ctx context.Context, cfg RetryConfig, fn func(ctx context.Context) error) error {
    var lastErr error
    for attempt := 0; attempt < cfg.MaxAttempts; attempt++ {
        lastErr = fn(ctx)
        if lastErr == nil {
            return nil
        }
        if cfg.Retryable != nil && !cfg.Retryable(lastErr) {
            return lastErr // non-retryable error
        }
        if attempt == cfg.MaxAttempts-1 {
            break
        }
        // Exponential backoff with full jitter: sleep a random duration in [0, backoff)
        backoff := cfg.BaseDelay * time.Duration(1<<uint(attempt))
        if backoff > cfg.MaxDelay {
            backoff = cfg.MaxDelay
        }
        jitter := time.Duration(rand.Int63n(int64(backoff)))
        select {
        case <-time.After(jitter):
        case <-ctx.Done():
            return ctx.Err()
        }
    }
    return fmt.Errorf("after %d attempts: %w", cfg.MaxAttempts, lastErr)
}

// Usage
err := Retry(ctx, RetryConfig{
    MaxAttempts: 5,
    BaseDelay:   100 * time.Millisecond,
    MaxDelay:    5 * time.Second,
    Retryable: func(err error) bool {
        // Only retry transient errors
        var netErr net.Error
        return errors.As(err, &netErr) && netErr.Timeout()
    },
}, func(ctx context.Context) error {
    return callUpstreamAPI(ctx)
})
Interview Tip
"I always use exponential backoff with jitter for retries. Without jitter, correlated retries create a thundering herd. I also make retryability explicit — 4xx errors shouldn't be retried, but 503s and timeouts should. And I always respect context cancellation between retries."
Rate Limiting¶
Token Bucket (golang.org/x/time/rate)¶
import "golang.org/x/time/rate"

// 100 requests/second, burst of 10
limiter := rate.NewLimiter(100, 10)

func handleRequest(w http.ResponseWriter, r *http.Request) {
    if !limiter.Allow() {
        http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
        return
    }
    // process request
}

// Or block until allowed (respects context)
func processItem(ctx context.Context, item Item) error {
    if err := limiter.Wait(ctx); err != nil {
        return err // context cancelled while waiting
    }
    return doWork(item)
}

// Reserve for advanced control
r := limiter.Reserve()
if !r.OK() {
    return errors.New("rate limit would never be satisfied")
}
time.Sleep(r.Delay()) // wait exactly the right amount
Per-Client Rate Limiting¶
type ClientLimiter struct {
    mu      sync.Mutex
    clients map[string]*rate.Limiter
    rate    rate.Limit
    burst   int
}

func NewClientLimiter(r rate.Limit, burst int) *ClientLimiter {
    return &ClientLimiter{
        clients: make(map[string]*rate.Limiter),
        rate:    r,
        burst:   burst,
    }
}

func (cl *ClientLimiter) GetLimiter(clientID string) *rate.Limiter {
    cl.mu.Lock()
    defer cl.mu.Unlock()
    if l, ok := cl.clients[clientID]; ok {
        return l
    }
    l := rate.NewLimiter(cl.rate, cl.burst)
    cl.clients[clientID] = l
    return l
}

// HTTP middleware
func RateLimitMiddleware(cl *ClientLimiter) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            clientID := r.Header.Get("X-API-Key")
            if clientID == "" {
                clientID = r.RemoteAddr
            }
            if !cl.GetLimiter(clientID).Allow() {
                w.Header().Set("Retry-After", "1")
                http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
                return
            }
            next.ServeHTTP(w, r)
        })
    }
}
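One caveat with ClientLimiter as written: the clients map grows without bound as new client IDs arrive. A minimal eviction sketch, tracking a last-seen timestamp per entry (the per-client rate.Limiter field is elided so the example stays stdlib-only; names here are illustrative):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry would pair a per-client *rate.Limiter with a last-seen timestamp;
// the limiter is elided here to keep the sketch self-contained.
type entry struct {
	lastSeen time.Time
}

type clientTable struct {
	mu      sync.Mutex
	clients map[string]*entry
}

// touch records activity for a client, creating its entry on first sight.
func (t *clientTable) touch(id string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	e, ok := t.clients[id]
	if !ok {
		e = &entry{}
		t.clients[id] = e
	}
	e.lastSeen = time.Now()
}

// evict removes clients idle longer than maxIdle and returns how many were
// dropped; call it periodically from a background goroutine (e.g. a time.Ticker).
func (t *clientTable) evict(maxIdle time.Duration) int {
	t.mu.Lock()
	defer t.mu.Unlock()
	n := 0
	for id, e := range t.clients {
		if time.Since(e.lastSeen) > maxIdle {
			delete(t.clients, id)
			n++
		}
	}
	return n
}

func main() {
	t := &clientTable{clients: make(map[string]*entry)}
	t.touch("stale")
	t.clients["stale"].lastSeen = time.Now().Add(-time.Hour) // simulate an idle client
	t.touch("fresh")
	fmt.Println(t.evict(10 * time.Minute)) // one idle client evicted
	fmt.Println(len(t.clients))            // only the fresh client remains
}
```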
Sliding Window (for Precise Rate Limiting)¶
type SlidingWindowLimiter struct {
    mu       sync.Mutex
    window   time.Duration
    limit    int
    requests []time.Time
}

func NewSlidingWindowLimiter(window time.Duration, limit int) *SlidingWindowLimiter {
    return &SlidingWindowLimiter{
        window: window,
        limit:  limit,
    }
}

func (l *SlidingWindowLimiter) Allow() bool {
    l.mu.Lock()
    defer l.mu.Unlock()
    now := time.Now()
    cutoff := now.Add(-l.window)
    // Remove expired entries (reusing the backing array)
    valid := l.requests[:0]
    for _, t := range l.requests {
        if t.After(cutoff) {
            valid = append(valid, t)
        }
    }
    l.requests = valid
    if len(l.requests) >= l.limit {
        return false
    }
    l.requests = append(l.requests, now)
    return true
}
Bulkhead Pattern¶
Isolate failures so one slow dependency doesn't consume all resources.
type Bulkhead struct {
    sem chan struct{}
}

func NewBulkhead(maxConcurrent int) *Bulkhead {
    return &Bulkhead{
        sem: make(chan struct{}, maxConcurrent),
    }
}

func (b *Bulkhead) Execute(ctx context.Context, fn func() error) error {
    select {
    case b.sem <- struct{}{}:
        defer func() { <-b.sem }()
        return fn()
    case <-ctx.Done():
        return fmt.Errorf("bulkhead: %w", ctx.Err())
    }
}

// Isolate different dependencies
var (
    dbBulkhead    = NewBulkhead(20) // max 20 concurrent DB calls
    apiBulkhead   = NewBulkhead(10) // max 10 concurrent API calls
    cacheBulkhead = NewBulkhead(50) // max 50 concurrent cache calls
)

func handleRequest(ctx context.Context) error {
    // DB calls are isolated from API calls
    var user *User
    err := dbBulkhead.Execute(ctx, func() error {
        var err error
        user, err = db.GetUser(ctx, "123")
        return err
    })
    if err != nil {
        return err
    }
    return apiBulkhead.Execute(ctx, func() error {
        return api.EnrichProfile(ctx, user)
    })
}
graph TD
REQ[Incoming Request] --> ROUTE{Route}
ROUTE -->|DB calls| DBB[DB Bulkhead<br>max=20]
ROUTE -->|API calls| APIB[API Bulkhead<br>max=10]
ROUTE -->|Cache calls| CB[Cache Bulkhead<br>max=50]
DBB --> DB[(Database)]
APIB --> API[External API]
CB --> CACHE[Redis Cache]
style DBB fill:#f9f,stroke:#333
style APIB fill:#9ff,stroke:#333
style CB fill:#ff9,stroke:#333
Timeout Propagation with Context¶
func handleBidRequest(w http.ResponseWriter, r *http.Request) {
    // Ad-tech: total budget is 100ms
    ctx, cancel := context.WithTimeout(r.Context(), 100*time.Millisecond)
    defer cancel()

    // Each step gets the remaining time
    user, err := userService.GetUser(ctx, userID) // uses ~20ms
    if err != nil {
        if ctx.Err() != nil { // deadline exceeded or client gone
            http.Error(w, "timeout", http.StatusGatewayTimeout)
            return
        }
        http.Error(w, "user lookup failed", http.StatusInternalServerError)
        return
    }
    ads, err := adService.GetAds(ctx, user) // uses ~30ms
    if err != nil {
        http.Error(w, "ad lookup failed", http.StatusBadGateway)
        return
    }
    bid, err := bidService.ComputeBid(ctx, ads) // ~50ms left for bidding
    if err != nil {
        http.Error(w, "bid computation failed", http.StatusBadGateway)
        return
    }
    json.NewEncoder(w).Encode(bid)
}

// Downstream service respects the deadline
func (s *AdService) GetAds(ctx context.Context, user *User) ([]Ad, error) {
    // Check if there's enough time to even try
    if deadline, ok := ctx.Deadline(); ok {
        if time.Until(deadline) < 10*time.Millisecond {
            return nil, fmt.Errorf("insufficient time budget")
        }
    }
    // Context propagated to DB query — auto-cancels if deadline exceeded
    return s.repo.FindAdsForUser(ctx, user.ID)
}
Health Checks¶
Liveness, Readiness, and Startup Probes¶
type HealthChecker struct {
    mu     sync.RWMutex
    checks map[string]func(ctx context.Context) error
}

func NewHealthChecker() *HealthChecker {
    return &HealthChecker{checks: make(map[string]func(ctx context.Context) error)}
}

func (hc *HealthChecker) Register(name string, check func(ctx context.Context) error) {
    hc.mu.Lock()
    defer hc.mu.Unlock()
    hc.checks[name] = check
}

type HealthStatus struct {
    Status string            `json:"status"`
    Checks map[string]string `json:"checks"`
}

func (hc *HealthChecker) Check(ctx context.Context) HealthStatus {
    hc.mu.RLock()
    defer hc.mu.RUnlock()
    status := HealthStatus{
        Status: "healthy",
        Checks: make(map[string]string),
    }
    for name, check := range hc.checks {
        if err := check(ctx); err != nil {
            status.Status = "unhealthy"
            status.Checks[name] = err.Error()
        } else {
            status.Checks[name] = "ok"
        }
    }
    return status
}

// Setup
health := NewHealthChecker()
health.Register("database", func(ctx context.Context) error {
    return db.PingContext(ctx)
})
health.Register("redis", func(ctx context.Context) error {
    return redisClient.Ping(ctx).Err()
})

// Liveness: is the process alive?
mux.HandleFunc("GET /healthz", func(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("ok"))
})

// Readiness: can the process serve traffic?
mux.HandleFunc("GET /readyz", func(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
    defer cancel()
    status := health.Check(ctx)
    if status.Status != "healthy" {
        w.WriteHeader(http.StatusServiceUnavailable)
    }
    json.NewEncoder(w).Encode(status)
})
Distributed Tracing (OpenTelemetry)¶
import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes" // needed for codes.Error below
    "go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("myservice")

func handleRequest(ctx context.Context, userID string) error {
    ctx, span := tracer.Start(ctx, "handleRequest",
        trace.WithAttributes(attribute.String("user.id", userID)),
    )
    defer span.End()

    user, err := getUser(ctx, userID)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return err
    }
    return processUser(ctx, user)
}

func getUser(ctx context.Context, id string) (*User, error) {
    ctx, span := tracer.Start(ctx, "getUser")
    defer span.End()
    // Trace propagated via context — shows as child span
    span.AddEvent("querying database")
    user, err := db.FindByID(ctx, id)
    if err != nil {
        span.RecordError(err)
        return nil, err
    }
    span.SetAttributes(attribute.String("user.name", user.Name))
    return user, nil
}
graph LR
A["handleRequest (50ms)"] --> B["getUser (20ms)"]
B --> C["DB query (15ms)"]
A --> D["processUser (25ms)"]
D --> E["External API (20ms)"]
Idempotency¶
Ensure operations produce the same result regardless of how many times they're called.
type IdempotencyStore interface {
    Check(ctx context.Context, key string) (result []byte, exists bool, err error)
    Store(ctx context.Context, key string, result []byte, ttl time.Duration) error
}

func IdempotencyMiddleware(store IdempotencyStore) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            key := r.Header.Get("Idempotency-Key")
            if key == "" {
                next.ServeHTTP(w, r)
                return
            }
            // Check for cached result
            cached, exists, err := store.Check(r.Context(), key)
            if err == nil && exists {
                w.Header().Set("X-Idempotent-Replay", "true")
                w.Write(cached)
                return
            }
            // Capture the response (httptest.NewRecorder keeps the sketch short;
            // production code would use a purpose-built ResponseWriter wrapper)
            rec := httptest.NewRecorder()
            next.ServeHTTP(rec, r)
            // Store for future replays. Note: two concurrent requests with the
            // same key can both execute; a production store would lock per key.
            body := rec.Body.Bytes()
            store.Store(r.Context(), key, body, 24*time.Hour)
            // Copy recorded response to real writer
            for k, v := range rec.Header() {
                w.Header()[k] = v
            }
            w.WriteHeader(rec.Code)
            w.Write(body)
        })
    }
}
Graceful Degradation¶
type CacheWithFallback struct {
    primary   Cache
    secondary Cache // fallback when primary is down
    cb        *gobreaker.CircuitBreaker
}

func (c *CacheWithFallback) Get(ctx context.Context, key string) (string, error) {
    // Try primary with circuit breaker
    result, err := c.cb.Execute(func() (any, error) {
        return c.primary.Get(ctx, key)
    })
    if err == nil {
        return result.(string), nil
    }
    // Degrade to secondary
    slog.Warn("primary cache unavailable, using fallback", "error", err)
    return c.secondary.Get(ctx, key)
}

// Multi-level degradation
func (s *AdService) GetBid(ctx context.Context, req BidRequest) (*Bid, error) {
    // Level 1: full personalized bid
    bid, err := s.personalizedBid(ctx, req)
    if err == nil {
        return bid, nil
    }
    // Level 2: cached bid for this segment
    bid, err = s.segmentBid(ctx, req)
    if err == nil {
        slog.Warn("degraded to segment bid")
        return bid, nil
    }
    // Level 3: default bid
    slog.Warn("degraded to default bid")
    return s.defaultBid(req), nil
}
Quick Reference¶
| Pattern | Problem It Solves | Key Mechanism |
|---|---|---|
| Circuit Breaker | Repeated calls to failing service | State machine: closed/open/half-open |
| Retry + Backoff | Transient failures | Exponential delay + jitter |
| Rate Limiting | Overload prevention | Token bucket, sliding window |
| Bulkhead | Failure isolation | Bounded concurrency per dependency |
| Timeout Propagation | Cascading delays | context.WithTimeout through call chain |
| Health Checks | Infrastructure awareness | Liveness + readiness endpoints |
| Distributed Tracing | Cross-service debugging | OpenTelemetry spans via context |
| Idempotency | Duplicate request safety | Idempotency key + cached response |
| Graceful Degradation | Partial availability | Fallback chain with reduced functionality |
Best Practices¶
- Set timeouts everywhere — every outbound call needs a deadline via context
- Circuit breakers on all external dependencies — prevent cascading failures
- Retry only idempotent, transient failures — don't retry 400s or non-idempotent POSTs without idempotency keys
- Always use jitter — prevents thundering herd on correlated retries
- Bulkhead critical dependencies — isolate database, cache, and external API call pools
- Health checks should be fast — liveness should be instant, readiness should check real dependencies
- Propagate trace context — pass context.Context through every function call and RPC
- Design for idempotency — use idempotency keys for non-idempotent operations in unreliable networks
- Monitor circuit breaker state — alert on state transitions (closed → open)
- Test failure modes — use fault injection to verify graceful degradation works
Common Pitfalls¶
Retrying Non-Idempotent Operations
Retrying a "create payment" without an idempotency key can charge the user twice. Only retry operations that are safe to repeat, or use idempotency keys.
Retry Without Backoff
Immediate retries to a struggling service make things worse. Always use exponential backoff with jitter. Without jitter, all clients hammer the recovering service simultaneously.
Circuit Breaker Too Sensitive
A circuit breaker that opens after 1 failure causes false positives. Tune thresholds based on normal error rates. Use consecutive failures or error percentage, not total failures.
Health Check That's Too Deep
A readiness check that calls multiple downstream services creates a cascading dependency. If service C is down, services A and B both report unready. Keep readiness checks limited to direct dependencies.
Ignoring Context Cancellation
When a context is cancelled (client disconnect, timeout), continuing work wastes resources. Check ctx.Err() at natural checkpoints and in loops.
Performance Considerations¶
- golang.org/x/time/rate guards its token bucket with a mutex — an Allow() check is a handful of arithmetic operations, fast enough for hot paths
- Circuit breaker state checks are O(1) — negligible overhead
- Retry delays should be bounded — set MaxDelay to prevent exponential explosion
- Bulkhead channel operations are ~50ns — minimal overhead for the isolation benefit
- OpenTelemetry spans have sampling — use probability sampling (e.g., 1%) in high-throughput production
- Idempotency stores should use Redis with TTL — checking is one round-trip, negligible compared to the operation
Interview Tips¶
Interview Tip
"My standard resilience stack for any external dependency is: timeout (via context), circuit breaker (to fail fast when down), retry with backoff (for transient failures), and bulkhead (to isolate). These compose naturally: the retry wraps the circuit breaker, which wraps the actual call, all within a timeout context."
Interview Tip
"In ad-tech, the total bid response budget is typically 100ms. I propagate this deadline through context to every downstream call. Each service checks the remaining time before starting expensive work. If we can't compute a personalized bid in time, we degrade to a cached segment-level bid, then to a default bid — always serving something within the deadline."
Interview Tip
"Rate limiting strategy depends on what you're protecting. Token bucket (like x/time/rate) is great for smoothing bursts. Sliding window is better for strict per-window limits. For distributed rate limiting across multiple pods, I use Redis with Lua scripts for atomic check-and-increment."
Key Takeaways¶
- Circuit breaker prevents cascading failures by failing fast when a dependency is down
- Retry with exponential backoff + jitter handles transient failures without thundering herds
- Rate limiting protects services from overload — token bucket for burst tolerance, sliding window for strict limits
- Bulkhead isolates failures so one slow dependency doesn't consume all goroutines/connections
- Timeout propagation via context is Go's killer feature for distributed systems — use it everywhere
- Health checks should separate liveness (process alive?) from readiness (can serve traffic?)
- Distributed tracing with OpenTelemetry provides cross-service visibility via context propagation
- Idempotency is essential for safe retries in unreliable networks
- Graceful degradation ensures the system always returns something — even if reduced quality
- These patterns compose — timeout → circuit breaker → retry → bulkhead → actual call