Observability and Resilience SDK¶
Repository: go-observability-sdk/ (sibling directory)
What This Project Demonstrates¶
Production systems need more than business logic. This SDK provides the building blocks that make services observable, resilient, and production-ready -- exactly the kind of platform engineering work a Tech Lead drives.
| Skill Area | How It's Demonstrated |
|---|---|
| Library / API design | Clean public APIs with functional options pattern |
| Interface design | Small, composable interfaces throughout |
| Resilience patterns | Circuit breaker, retry with backoff, rate limiting |
| Observability | Structured logging, Prometheus metrics, health checks |
| Testing discipline | Comprehensive tests for every component |
| Go idioms | Functional options, interface adapters, generics |
Components¶
```mermaid
graph TD
    App["Your Application"] --> CB["Circuit Breaker\ncircuitbreaker/"]
    App --> RL["Rate Limiter\nratelimiter/"]
    App --> RT["Retry\nretry/"]
    App --> HC["Health Checks\nhealthcheck/"]
    App --> LG["Logger\nlogger/"]
    App --> MT["Metrics\nmetrics/"]
    CB -.->|"state changes"| LG
    RL -.->|"rate exceeded"| MT
    HC -.->|"HTTP handler"| HealthEndpoint["/health"]
    MT -.->|"HTTP handler"| MetricsEndpoint["/metrics"]
```
Component Overview¶
| Component | What It Does | Key Pattern |
|---|---|---|
| Circuit Breaker | Prevents cascading failures by failing fast when a dependency is down | State machine (Closed → Open → Half-Open) |
| Rate Limiter | Controls request throughput using token bucket algorithm | Allow() non-blocking + Wait(ctx) blocking |
| Retry | Retries failed operations with exponential backoff and jitter | Configurable via functional options |
| Health Check | Aggregates dependency health into a single endpoint | Registry pattern with concurrent checks |
| Logger | Structured logging with context propagation | log/slog wrapper with middleware |
| Metrics | Prometheus metrics helpers for HTTP services | Counter + histogram + gauge middleware |
Project Structure¶
```text
go-observability-sdk/
├── circuitbreaker/
│   ├── circuitbreaker.go       # Three-state circuit breaker
│   └── circuitbreaker_test.go  # State transition tests
├── ratelimiter/
│   ├── ratelimiter.go          # Token bucket rate limiter
│   └── ratelimiter_test.go     # Rate enforcement tests
├── retry/
│   ├── retry.go                # Retry with exponential backoff
│   └── retry_test.go           # Backoff timing tests
├── healthcheck/
│   ├── healthcheck.go          # Health check registry + HTTP handler
│   └── healthcheck_test.go     # Status aggregation tests
├── logger/
│   ├── logger.go               # slog wrapper + HTTP middleware
│   └── logger_test.go          # Format and context tests
├── metrics/
│   ├── metrics.go              # Prometheus HTTP metrics helpers
│   └── metrics_test.go         # Middleware tests
├── examples/
│   └── httpserver/
│       └── main.go             # Complete example using all components
├── go.mod
├── Makefile
└── README.md
```
Key Design Decisions¶
Why Functional Options Pattern?¶
Every component uses functional options for configuration:
```go
cb := circuitbreaker.New(
	circuitbreaker.WithFailureThreshold(5),
	circuitbreaker.WithTimeout(30*time.Second),
	circuitbreaker.WithOnStateChange(func(from, to State) {
		log.Info("circuit breaker state change", "from", from, "to", to)
	}),
)
```
This pattern provides:

- Sensible defaults (zero-config works)
- Backward-compatible API evolution (add options without breaking existing callers)
- Self-documenting configuration
- Compile-time type safety
Why Separate Packages?¶
Each component is an independent package:
- Users import only what they need (`go-observability-sdk/retry`)
- No coupling between components
- Each package has its own tests and documentation
- Follows the Go principle: "a little copying is better than a little dependency"
Why Token Bucket for Rate Limiting?¶
Token bucket is the standard algorithm for rate limiting in distributed systems:

- Allows bursts up to a configurable limit
- Smooth refill rate prevents thundering herd
- Simple to reason about and configure
- Used by AWS, Google Cloud, and most API gateways
Why Injectable Clock in Circuit Breaker?¶
The circuit breaker accepts a `nowFunc` option for injecting time. This makes tests deterministic -- no `time.Sleep` needed, no flaky timing tests.
Component Deep Dives¶
Circuit Breaker State Machine¶
```mermaid
stateDiagram-v2
    Closed --> Open: Failures >= Threshold
    Open --> HalfOpen: After Timeout
    HalfOpen --> Closed: Success >= SuccessThreshold
    HalfOpen --> Open: Any Failure
```
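The diagram maps directly onto a small state machine. A sketch of the success- and failure-side transitions (field names and thresholds assumed; the timed Open → Half-Open transition is elided):

```go
package main

import "fmt"

type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

type breaker struct {
	state            State
	failures         int
	successes        int
	failureThreshold int
	successThreshold int
}

// onFailure: Closed -> Open once the threshold is reached,
// HalfOpen -> Open on any failure while probing.
func (b *breaker) onFailure() {
	switch b.state {
	case Closed:
		b.failures++
		if b.failures >= b.failureThreshold {
			b.state = Open
		}
	case HalfOpen:
		b.state = Open
	}
}

// onSuccess: HalfOpen -> Closed after enough consecutive successes.
func (b *breaker) onSuccess() {
	if b.state == HalfOpen {
		b.successes++
		if b.successes >= b.successThreshold {
			b.state, b.failures, b.successes = Closed, 0, 0
		}
	}
}

func main() {
	b := &breaker{failureThreshold: 3, successThreshold: 2}
	b.onFailure()
	b.onFailure()
	b.onFailure()
	fmt.Println(b.state == Open) // true: failure threshold reached

	b.state = HalfOpen // in the real breaker this happens after the timeout
	b.onSuccess()
	b.onSuccess()
	fmt.Println(b.state == Closed) // true: probe requests succeeded
}
```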
Retry Backoff Strategy¶
```text
Attempt 1: immediate
Attempt 2: 100ms + jitter (0-50ms)
Attempt 3: 200ms + jitter (0-100ms)
Attempt 4: 400ms + jitter (0-200ms)
Attempt 5: 800ms + jitter (0-400ms)
... capped at MaxDelay
```
Jitter prevents thundering herd when multiple clients retry simultaneously.
Health Check Aggregation¶
| Individual Results | Overall Status | HTTP Code |
|---|---|---|
| All pass | `healthy` | 200 |
| Non-critical fail | `degraded` | 200 |
| Any critical fail | `unhealthy` | 503 |
Go Concepts Showcased¶
| Concept | Where It's Used |
|---|---|
| Functional options | Every component's configuration |
| Interfaces | `Check` interface, `http.Handler` adapter pattern |
| `sync.Mutex` | Thread-safe state in circuit breaker and rate limiter |
| `context.Context` | Cancellation in retry and rate limiter `Wait` |
| `log/slog` | Structured logging wrapper with context propagation |
| HTTP middleware | Logger, metrics, and rate limiter as `func(http.Handler) http.Handler` |
| `time.Ticker` / `time.Timer` | Rate limiting token refill, circuit breaker timeout |
| Custom error types | `ErrCircuitOpen`, `RetryableError` |
| Test helpers | Injectable clocks, assertion helpers |
| `http.HandlerFunc` adapter | Health check `CheckFunc` mirrors this pattern |
How to Talk About This in an Interview¶
Interview Talking Points
- **Start with why:** "Production services need resilience primitives -- circuit breakers, retries, rate limiting. Instead of ad-hoc implementations scattered across services, I built a shared SDK with consistent API design."
- **API design philosophy:** "Every component uses functional options for zero-config defaults with extensibility. The API is backward-compatible -- adding new options never breaks existing callers."
- **Testing approach:** "Injectable clocks make tests deterministic. No `time.Sleep` calls, no flaky CI. Each component has comprehensive tests including concurrent access."
- **How components compose:** "The example server shows all components working together -- rate limiter as middleware, circuit breaker wrapping upstream calls, health checks aggregating dependency status, Prometheus metrics recording everything."
- **Library vs framework:** "This is a library, not a framework. Each package is independently importable. Services adopt components incrementally."
Running the Project¶
```sh
cd go-observability-sdk

# Install dependencies
go mod tidy

# Run all tests
make test

# Run with race detector
go test -race ./...

# Run the example server
make example

# Then visit:
# http://localhost:8080/         (Hello endpoint with circuit breaker)
# http://localhost:8080/health   (Health check status)
# http://localhost:8080/metrics  (Prometheus metrics)
```
Example: All Components Together¶
The `examples/httpserver/main.go` demonstrates a production-like HTTP server:
```go
func main() {
	// Logger
	log := logger.New(logger.WithFormat("json"), logger.WithLevel(slog.LevelInfo))

	// Circuit breaker for upstream service
	cb := circuitbreaker.New(
		circuitbreaker.WithFailureThreshold(3),
		circuitbreaker.WithTimeout(10*time.Second),
	)

	// Rate limiter: 100 requests/second, burst of 20
	rl := ratelimiter.New(100, 20)

	// Health checks
	health := healthcheck.NewRegistry()
	health.Register("upstream", true, healthcheck.CheckFunc(func(ctx context.Context) error {
		if cb.State() == circuitbreaker.StateOpen {
			return errors.New("circuit breaker open")
		}
		return nil
	}))

	// Prometheus metrics
	httpMetrics := metrics.NewHTTPMetrics("myservice")

	// Wire up the handler chain
	mux := http.NewServeMux()
	mux.HandleFunc("/", handleRequest(cb))
	mux.Handle("/health", health.Handler())
	mux.Handle("/metrics", promhttp.Handler())

	// Middleware stack: logging → metrics → rate limiting → handler
	handler := log.HTTPMiddleware(
		httpMetrics.Middleware(
			rateLimitMiddleware(rl, mux)))

	// Start with graceful shutdown...
}
```