Advanced Profiling and Performance Optimization¶
Introduction¶
Performance optimization in Go follows one rule: measure first, optimize second. Go ships with world-class profiling tools built into the standard library — runtime/pprof, net/http/pprof, benchmarks, escape analysis, and execution tracing. These tools let you identify exactly where CPU time is spent, where memory is allocated, and where goroutines are blocked.
This is especially critical in ad-tech, financial systems, and infrastructure code where microseconds matter. The methodology is always the same: profile under realistic load, identify the bottleneck, fix it, benchmark to confirm the improvement, repeat.
Syntax & Usage¶
Adding pprof to an HTTP Server¶
The simplest way to profile a running service — import the package for its side effects:
package main

import (
    "log"
    "net/http"

    _ "net/http/pprof" // registers /debug/pprof/* handlers
)

func main() {
    // Your application routes
    http.HandleFunc("/api/data", handleData)

    // pprof is now available at /debug/pprof/
    log.Println("listening on :8080")
    log.Fatal(http.ListenAndServe(":8080", nil))
}
Disable pprof in production or restrict access
The pprof endpoints expose internal details. Either run them on a separate port, gate them behind authentication, or remove the import in production builds using build tags.
For custom muxes (e.g., gorilla/mux or chi):
import "net/http/pprof"

func registerPprof(mux *http.ServeMux) {
    mux.HandleFunc("/debug/pprof/", pprof.Index)
    mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
    mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
    mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
    mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
}
Collecting and Reading Profiles¶
# CPU profile (30-second sample by default)
go tool pprof http://localhost:8080/debug/pprof/profile?seconds=30
# Memory (heap) profile
go tool pprof http://localhost:8080/debug/pprof/heap
# Goroutine profile (find leaks)
go tool pprof http://localhost:8080/debug/pprof/goroutine
# Block profile (contention)
go tool pprof http://localhost:8080/debug/pprof/block
# Mutex profile (lock contention)
go tool pprof http://localhost:8080/debug/pprof/mutex
Inside the interactive pprof shell:
(pprof) top 20 # top 20 functions by CPU/memory
(pprof) list handleData # source-annotated view of a function
(pprof) web # open call graph in browser (requires Graphviz)
(pprof) svg > out.svg # export as SVG
(pprof) peek regexp # show callers/callees matching a regex
Profile Types at a Glance¶
| Profile | What It Measures | When to Use |
|---|---|---|
| `profile` (CPU) | Time spent in functions | High CPU usage, slow endpoints |
| `heap` | Current memory allocations | High memory usage, OOM |
| `allocs` | Total allocations over lifetime | Allocation-heavy hot paths |
| `goroutine` | Active goroutines and their stacks | Goroutine leaks |
| `block` | Time goroutines spend blocked | Channel/lock contention |
| `mutex` | Lock contention time | Mutex bottlenecks |
| `threadcreate` | OS threads created | Unexpected thread proliferation |
| `trace` | Execution events over time | Latency analysis, scheduler behavior |
Programmatic Profiling¶
For non-HTTP programs (CLI tools, batch jobs):
package main

import (
    "os"
    "runtime"
    "runtime/pprof"
)

func main() {
    // CPU profile
    cpuFile, _ := os.Create("cpu.prof")
    defer cpuFile.Close()
    pprof.StartCPUProfile(cpuFile)
    defer pprof.StopCPUProfile()

    doWork()

    // Memory profile — run a GC first so the snapshot reflects live data
    memFile, _ := os.Create("mem.prof")
    defer memFile.Close()
    runtime.GC()
    pprof.WriteHeapProfile(memFile)
}
Benchmark Tests¶
Go's testing package has built-in benchmarking support:
func BenchmarkParseJSON(b *testing.B) {
    data := []byte(`{"name":"Alice","age":30}`)
    for b.Loop() {
        var user User
        json.Unmarshal(data, &user)
    }
}

func BenchmarkParseJSONParallel(b *testing.B) {
    data := []byte(`{"name":"Alice","age":30}`)
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            var user User
            json.Unmarshal(data, &user)
        }
    })
}
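Benchmarks normally run under go test, but they can also be invoked programmatically with testing.Benchmark, which is handy for quick one-off comparisons inside a plain binary; a sketch:

```go
package main

import (
    "encoding/json"
    "fmt"
    "testing"
)

func main() {
    data := []byte(`{"name":"Alice","age":30}`)

    // testing.Benchmark runs the function the same way `go test -bench`
    // would, growing b.N until timing stabilizes, and returns the result.
    r := testing.Benchmark(func(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            var user map[string]any
            json.Unmarshal(data, &user)
        }
    })
    fmt.Println("iterations:", r.N)
    fmt.Println("allocs/op:", r.AllocsPerOp())
}
```

The returned testing.BenchmarkResult exposes NsPerOp, AllocsPerOp, and AllocedBytesPerOp, so a binary can fail CI when a hot path regresses past a threshold.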
# Run benchmarks
go test -bench=. -benchmem ./...
# Run specific benchmark
go test -bench=BenchmarkParseJSON -benchmem -count=5
# Compare benchmarks (install benchstat)
go test -bench=. -benchmem -count=10 > old.txt
# ... make optimization ...
go test -bench=. -benchmem -count=10 > new.txt
benchstat old.txt new.txt
Sample output:

BenchmarkParseJSON-8   1246824   962.3 ns/op   336 B/op   7 allocs/op

Reading the output: 1,246,824 iterations, 962.3 nanoseconds per operation, 336 bytes allocated per operation, and 7 heap allocations per operation.
b.ReportAllocs and Custom Metrics¶
func BenchmarkEncoding(b *testing.B) {
    b.ReportAllocs() // equivalent to -benchmem but per-benchmark
    data := generateLargePayload()
    b.ResetTimer() // exclude setup from timing
    for b.Loop() {
        encode(data)
    }
    b.ReportMetric(float64(len(data)), "bytes/op")
    b.SetBytes(int64(len(data))) // enables MB/s throughput reporting
}
Escape Analysis¶
The compiler decides whether variables live on the stack (cheap) or heap (requires GC). Inspect these decisions:
go build -gcflags='-m' ./... # basic escape analysis
go build -gcflags='-m -m' ./... # detailed escape decisions
go build -gcflags='-m -l' ./... # disable inlining to see pure escape behavior
func stackAlloc() int {
    x := 42 // stays on stack — cheap
    return x
}

func heapAlloc() *int {
    x := 42   // escapes to heap — x must survive after return
    return &x // compiler reports: "moved to heap: x"
}

func interfaceBoxing(x int) any {
    return x // escapes: boxing an int into an interface allocates
}
Common causes of heap escape:
| Pattern | Why It Escapes |
|---|---|
| Returning a pointer to a local | Variable must outlive the function |
| Assigning to an interface | Interface boxing allocates |
| Closure captures by reference | Variable shared across scopes |
| Slice grows beyond capacity | Reallocation moves to heap |
| `fmt.Println(x)` | Arguments boxed into `[]any` |
Reducing Allocations¶
sync.Pool for Object Reuse¶
var bufPool = sync.Pool{
    New: func() any { return new(bytes.Buffer) },
}

func processRequest(data []byte) string {
    buf := bufPool.Get().(*bytes.Buffer)
    defer func() {
        buf.Reset()
        bufPool.Put(buf)
    }()
    buf.Write(data)
    return buf.String()
}
Pre-allocation¶
// BAD: grows dynamically — multiple allocations
func collectBad(n int) []int {
    var result []int
    for i := range n {
        result = append(result, i*2)
    }
    return result
}

// GOOD: single allocation
func collectGood(n int) []int {
    result := make([]int, 0, n)
    for i := range n {
        result = append(result, i*2)
    }
    return result
}
Avoiding Interface Boxing¶
// BAD: fmt.Sprintf boxes arguments into []any
msg := fmt.Sprintf("user %d: %s", id, name)
// GOOD: strconv + string concatenation — zero allocations beyond the result
msg := "user " + strconv.Itoa(id) + ": " + name
// GOOD: strings.Builder for complex strings
var b strings.Builder
b.Grow(64)
b.WriteString("user ")
b.WriteString(strconv.Itoa(id))
b.WriteString(": ")
b.WriteString(name)
msg := b.String()
Inlining¶
The compiler inlines small functions to eliminate call overhead. Check which functions get inlined:

go build -gcflags='-m' ./... 2>&1 | grep 'can inline'

Functions that are typically not inlined: those containing defer, recover, type switches, or for loops (in some compiler versions), or those exceeding the inlining budget. Keep hot-path functions small and simple to encourage inlining.
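A minimal sketch of the distinction (function names are illustrative); building it with go build -gcflags='-m' would typically report "can inline add" but not the defer-carrying function:

```go
package main

import "fmt"

// Small leaf function, well inside the inlining budget; -gcflags='-m'
// typically reports "can inline add".
func add(a, b int) int { return a + b }

// The defer statement usually disqualifies a function from inlining.
func withDefer(a, b int) (sum int) {
    defer func() { sum = a + b }() // runs after return, fills named result
    return 0
}

func main() {
    fmt.Println(add(2, 3))       // 5
    fmt.Println(withDefer(2, 3)) // 5
}
```

Both functions behave identically to a caller; only the call overhead differs, which is why the distinction matters mostly on hot paths.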
Execution Tracing¶
go tool trace visualizes goroutine scheduling, GC events, and syscalls over time:
func TestWithTrace(t *testing.T) {
    f, _ := os.Create("trace.out")
    defer f.Close()
    trace.Start(f)
    defer trace.Stop()
    // ... code under test ...
}
The trace view shows:

- Goroutine lifecycles and scheduling
- GC pauses (STW events)
- Syscall blocking
- Network I/O wait times
- Processor utilization
Quick Reference¶
| Tool / Flag | Purpose | Command |
|---|---|---|
| `net/http/pprof` | HTTP profiling endpoints | `import _ "net/http/pprof"` |
| `go tool pprof` | Analyze profiles interactively | `go tool pprof cpu.prof` |
| `-bench` | Run benchmark tests | `go test -bench=. -benchmem` |
| `-gcflags='-m'` | Escape analysis output | `go build -gcflags='-m' ./...` |
| `-race` | Race condition detection | `go test -race ./...` |
| `go tool trace` | Execution trace visualization | `go tool trace trace.out` |
| `benchstat` | Compare benchmark results | `benchstat old.txt new.txt` |
| `b.ReportAllocs()` | Report allocations per benchmark | In benchmark function body |
| `b.ResetTimer()` | Exclude setup from timing | After setup, before hot loop |
Best Practices¶
- Profile before optimizing — never guess where the bottleneck is. A 10-minute profiling session prevents weeks of optimizing the wrong code.
- Use realistic workloads — profile with production-like data volumes and access patterns. Synthetic benchmarks can mislead.
- Benchmark with `-count=10` and `benchstat` — single benchmark runs are noisy. Run multiple times and use `benchstat` for statistical comparison.
- Track allocations, not just CPU — in GC'd languages, allocation rate often matters more than raw CPU. Use `b.ReportAllocs()` and `-benchmem` religiously.
- Optimize the hot path — if 95% of time is in one function, a 2x improvement there beats a 100x improvement in a function that takes 0.1% of time.
- Measure GC impact — set `GODEBUG=gctrace=1` to see GC pause times in production. High allocation rates = frequent GC = latency spikes.
- Use `sync.Pool` for high-throughput paths — HTTP handlers, encoders, and parsers benefit from pooling buffers and temporary objects.
Common Pitfalls¶
Optimizing without profiling

Profile first. Developers' intuition about performance is wrong more often than right. The profiler doesn't lie.

Benchmark setup included in timing

Expensive setup (loading fixtures, generating payloads) inflates results if it runs inside the measured window. Call b.ResetTimer() after setup, or use b.Loop(), which only times the loop body.

Ignoring compiler optimizations in benchmarks

If a benchmark's result is never used, the compiler may eliminate the work entirely and you end up measuring an empty loop. Keep results alive: b.Loop() (Go 1.24+) prevents the compiler from optimizing away the loop body; on older versions, assign the result to a package-level sink variable.

Premature use of sync.Pool
sync.Pool adds complexity and can hurt performance if the pooled objects are small or short-lived. Pool large buffers (>1KB) on hot paths. For small allocations, the allocator and GC are already fast enough. Profile first to confirm allocation pressure.
Performance Considerations¶
- Stack vs heap: Stack allocation is essentially free (just a pointer bump). Heap allocation involves the GC. Keep objects on the stack by avoiding pointer escapes, interface boxing, and closures that capture variables.
- GC tuning: `GOGC` controls GC frequency (default 100 = trigger GC when heap doubles). Raising `GOGC` reduces GC frequency at the cost of memory. `GOMEMLIMIT` (Go 1.19+) sets a soft memory limit — better for containerized workloads.
- String operations: String concatenation with `+` in a loop allocates on every iteration. Use `strings.Builder` (pre-`Grow` it if you know the size). For formatting, `strconv` functions are faster than `fmt.Sprintf`.
- Map pre-sizing: `make(map[K]V, hint)` avoids rehashing. If you know the map will hold 10,000 entries, pre-size it.
- Slice pre-sizing: `make([]T, 0, capacity)` avoids multiple grow-and-copy cycles. This is one of the simplest and most effective optimizations.
- Binary size: Use `go build -ldflags='-s -w'` to strip debug info and symbol tables, reducing binary size by ~30%.
Interview Tips¶
Interview Tip
"How would you find a memory leak in a Go service?" Expose net/http/pprof, take a heap profile (/debug/pprof/heap), and look at inuse_objects for growing allocations. Take two profiles minutes apart and compare with go tool pprof -diff_base. Also check the goroutine profile — goroutine leaks are the most common source of memory leaks in Go.
Interview Tip
"What's escape analysis and why does it matter?" The compiler's escape analysis decides whether a variable is allocated on the stack (free) or the heap (costs GC time). Returning a pointer to a local variable forces heap allocation. Boxing a value into an interface forces heap allocation. You can see these decisions with go build -gcflags='-m'. In hot paths, restructuring code to keep values on the stack can eliminate millions of allocations per second.
Interview Tip
"How do you approach optimizing a slow Go service?" Systematic methodology: (1) Define the performance target. (2) Profile CPU with pprof — top and list to find hot functions. (3) Profile memory — allocs profile for allocation rate. (4) Write benchmarks for the hot path. (5) Optimize (pre-allocate, pool, reduce copies, avoid interface boxing). (6) Benchmark again to confirm improvement. (7) Repeat until target is met.
Interview Tip
"What is GOGC and how would you tune it?" GOGC sets the GC target percentage — default 100 means trigger GC when heap grows to 2x the live data. GOGC=200 means 3x — fewer GC cycles but more memory. GOGC=off disables GC entirely (dangerous). Go 1.19 added GOMEMLIMIT which is usually better for containers — it sets a soft memory cap and adjusts GC frequency automatically to stay under it.
Key Takeaways¶
- Profile first: Use `net/http/pprof` for services and `runtime/pprof` for CLI tools. Never optimize without data.
- Benchmark with `-benchmem` and `-count`: Allocations per operation often matter more than raw speed.
- Escape analysis (`-gcflags='-m'`) reveals what allocates — keeping values on the stack is the cheapest optimization.
- Reduce allocations via `sync.Pool`, pre-allocation (`make([]T, 0, n)`), and avoiding `fmt.Sprintf` in hot paths.
- `go tool trace` shows goroutine scheduling, GC pauses, and syscall blocking — essential for latency analysis.
- Use `benchstat` for statistically sound before/after comparisons — never trust a single benchmark run.
- Performance optimization is iterative: measure → identify → fix → verify → repeat.