Strings, Runes & UTF-8 Beginner

Introduction

Go strings are read-only slices of bytes, not arrays of characters. Every Go source file is UTF-8, and the language has first-class support for Unicode through the rune type. Understanding the byte-vs-rune distinction is essential -- it's a top interview question and a source of subtle production bugs when handling multi-byte characters.


Strings Are Immutable Byte Slices

A string in Go is a read-only sequence of bytes. It has no built-in notion of "characters" and may hold arbitrary bytes; string literals are UTF-8 only because Go source files are.

s := "Hello, 世界"
fmt.Println(len(s))    // 13 (bytes, NOT characters)
fmt.Println(s[0])      // 72 (byte value of 'H')
// s[0] = 'h'          // compile error: strings are immutable

Critical Interview Point

len(s) returns the number of bytes, not the number of characters (runes). "世界" is 6 bytes (3 per character) but only 2 runes.
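
A minimal sketch of the byte/rune gap, using only the standard library. The helper name byteAndRuneCount is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCount is a hypothetical helper: it returns the byte length
// and the rune (code point) count of s.
func byteAndRuneCount(s string) (bytes, runes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	for _, s := range []string{"Hello", "世界", "Hello, 世界"} {
		b, r := byteAndRuneCount(s)
		fmt.Printf("%q: %d bytes, %d runes\n", s, b, r)
	}
	// "Hello": 5 bytes, 5 runes
	// "世界": 6 bytes, 2 runes
	// "Hello, 世界": 13 bytes, 9 runes
}
```

The two counts only agree for pure ASCII text, which is why code that looks correct in English-only tests can break on real input.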


Runes: Unicode Code Points

A rune is an alias for int32 and represents a single Unicode code point.

var r rune = '世'
fmt.Printf("Type: %T, Value: %d, Char: %c\n", r, r, r)
// Type: int32, Value: 19990, Char: 世

Byte vs Rune Iteration

s := "café"

// Byte iteration -- iterates raw bytes
for i := 0; i < len(s); i++ {
    fmt.Printf("byte[%d] = %x\n", i, s[i])
}
// byte[0]=63 byte[1]=61 byte[2]=66 byte[3]=c3 byte[4]=a9
// 'é' splits into 2 bytes (c3, a9)

// Rune iteration -- range decodes UTF-8 automatically
for i, r := range s {
    fmt.Printf("rune[%d] = %c (U+%04X)\n", i, r, r)
}
// rune[0] = c (U+0063)
// rune[1] = a (U+0061)
// rune[2] = f (U+0066)
// rune[3] = é (U+00E9)  -- the index is a byte offset; a rune after 'é' would get index 5, not 4

Interview Tip

range over a string iterates by rune, decoding UTF-8 on each step. The index jumps by the byte width of each rune, not by 1. A standard for i := 0; i < len(s); i++ iterates by byte.
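
The index jump is easier to see with runes of different widths. The sketch below collects the offsets that range reports; runeOffsets is a hypothetical helper name, not a standard library function.

```go
package main

import "fmt"

// runeOffsets is a hypothetical helper: it returns the byte offset at which
// each rune of s starts, exactly as reported by range.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "a\u00e9世b" // 1-byte, 2-byte, 3-byte, and 1-byte runes
	fmt.Println(runeOffsets(s)) // [0 1 3 6]
	fmt.Println(len(s))         // 7
}
```

The offsets skip 2, 4, and 5 because those bytes sit in the middle of a multi-byte rune.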


String Conversions

s := "Hello, 世界"

// String ↔ byte slice
b := []byte(s)           // copies bytes out
s2 := string(b)          // copies bytes back

// String ↔ rune slice
r := []rune(s)           // decodes UTF-8 into code points
fmt.Println(len(r))      // 9 (rune count, the "character" count)
s3 := string(r)          // re-encodes to UTF-8

// Single rune/byte to string
fmt.Println(string(rune(65))) // "A" -- convert via rune; go vet flags plain int-to-string conversions
fmt.Println(string(r[7]))     // "世" (r[8] is "界")

Conversion Cost

[]byte(s) and []rune(s) both allocate and copy. In hot paths, avoid repeated conversions -- work with bytes directly or convert once.

Raw Strings (Backtick Literals)

Raw strings preserve everything literally -- no escape sequences, can span multiple lines.

path := `C:\Users\docs\file.txt`   // no need to escape backslashes
query := `SELECT *
FROM users
WHERE active = true`               // multi-line without \n
regex := `\d{3}-\d{4}`             // regex without double-escaping
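
As a small illustration of the regex case, the sketch below compiles a raw-string pattern with the standard regexp package. The names phonePattern and isShortPhone are invented for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// phonePattern uses a raw string so the regex backslashes need no escaping.
// The interpreted-string equivalent would be "^\\d{3}-\\d{4}$".
var phonePattern = regexp.MustCompile(`^\d{3}-\d{4}$`)

// isShortPhone is a hypothetical helper used only for this illustration.
func isShortPhone(s string) bool {
	return phonePattern.MatchString(s)
}

func main() {
	fmt.Println(isShortPhone("555-1234")) // true
	fmt.Println(isShortPhone("5551234"))  // false
	fmt.Println(len(`\n`), len("\n"))     // 2 1
}
```

The last line shows the difference directly: in a raw string, \n is two bytes (a backslash and an n), while in an interpreted string it is a single newline byte.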

String Concatenation

| Method | Use Case | Allocations |
| --- | --- | --- |
| + operator | Small, known concatenations (2-3 strings) | New string each time |
| fmt.Sprintf | Formatted output with mixed types | Moderate overhead |
| strings.Builder | Building strings in loops | Amortized O(1) appends |
| strings.Join | Joining a slice of strings | Single allocation |

// BAD: O(n²) in a loop -- each + allocates a new string
var s string
for _, word := range words {
    s += word + " "  // quadratic allocation
}

// GOOD: strings.Builder -- the standard approach for loops
var b strings.Builder
b.Grow(estimatedSize) // optional pre-allocation
for _, word := range words {
    b.WriteString(word)
    b.WriteByte(' ')
}
result := b.String()

// GOOD: strings.Join for slices (no trailing space, unlike the loop above)
joined := strings.Join(words, " ")
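
For reference, here is one way the Builder pattern might look as a complete program. joinWithBuilder is a hypothetical helper written for this sketch; in real code strings.Join already covers this exact case.

```go
package main

import (
	"fmt"
	"strings"
)

// joinWithBuilder is a hypothetical helper: it joins words with single
// spaces using strings.Builder, pre-sizing the buffer from the input.
func joinWithBuilder(words []string) string {
	size := 0
	for _, w := range words {
		size += len(w) + 1 // word plus one separator byte
	}
	var b strings.Builder
	b.Grow(size)
	for i, w := range words {
		if i > 0 {
			b.WriteByte(' ')
		}
		b.WriteString(w)
	}
	return b.String()
}

func main() {
	words := []string{"go", "is", "fun"}
	fmt.Println(joinWithBuilder(words))                             // go is fun
	fmt.Println(joinWithBuilder(words) == strings.Join(words, " ")) // true
}
```

Writing the separator before every word except the first avoids the trailing space that the simpler loop leaves behind.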

Key strings Package Functions

import "strings"

strings.Contains("seafood", "foo")       // true
strings.HasPrefix("Hello", "He")         // true
strings.HasSuffix("Hello", "lo")         // true
strings.Index("chicken", "ken")          // 4
strings.Count("cheese", "e")             // 3

strings.ToUpper("hello")                 // "HELLO"
strings.ToLower("HELLO")                 // "hello"
strings.TrimSpace("  hi  ")             // "hi"
strings.Trim("***hi***", "*")           // "hi"

strings.Split("a,b,c", ",")             // ["a", "b", "c"]
strings.Join([]string{"a","b","c"}, ",") // "a,b,c"

strings.Replace("oink oink", "oink", "moo", 1)  // "moo oink"
strings.ReplaceAll("oink oink", "oink", "moo")   // "moo moo"

strings.NewReader("hello")               // io.Reader from string

Key unicode/utf8 Package Functions

import "unicode/utf8"

s := "Hello, 世界"
utf8.RuneCountInString(s)          // 9 -- true character count
utf8.ValidString(s)                // true
utf8.RuneLen('世')                  // 3 -- bytes needed for this rune

b := []byte("世")
r, size := utf8.DecodeRune(b)     // r='世', size=3
utf8.Valid(b)                      // true
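
These functions can be combined into a manual decoding loop, which is roughly the work range does for you. The sketch below uses utf8.DecodeRuneInString; decodeAll is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeAll is a hypothetical helper: it walks s one rune at a time with
// utf8.DecodeRuneInString and collects the decoded code points.
func decodeAll(s string) []rune {
	var out []rune
	for len(s) > 0 {
		r, size := utf8.DecodeRuneInString(s)
		out = append(out, r)
		s = s[size:] // advance by the encoded width of the rune
	}
	return out
}

func main() {
	fmt.Printf("%q\n", decodeAll("Go世界"))   // ['G' 'o' '世' '界']
	fmt.Printf("%U\n", decodeAll("a\xffb")) // [U+0061 U+FFFD U+0062]
}
```

An invalid byte decodes to utf8.RuneError (U+FFFD) with a width of 1, so the loop always makes progress even on malformed input.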

Quick Reference

| Concept | Type | Notes |
| --- | --- | --- |
| string | Immutable []byte | UTF-8 encoded, zero value is "" |
| byte | uint8 | Single byte, used for ASCII or raw data |
| rune | int32 | Unicode code point |
| len(s) | Byte count | Not character count |
| utf8.RuneCountInString(s) | Rune count | True character count |
| s[i] | byte | Index access returns a byte |
| range s | (index, rune) | Decodes UTF-8, index = byte offset |
| `raw` | Raw string literal | No escapes processed |
| []byte(s) | Conversion | Allocates + copies |
| []rune(s) | Conversion | Allocates + copies, decodes UTF-8 |

Best Practices

  1. Use strings.Builder for any string construction in loops
  2. Use range when you need to process runes (characters), never manual byte indexing for Unicode text
  3. Use utf8.RuneCountInString when you need the true character count
  4. Use raw string literals for regex patterns, file paths, and SQL queries
  5. Prefer strings.EqualFold over ToLower/ToUpper for case-insensitive comparison -- it avoids allocation
  6. Pre-allocate Builder with Grow() when you know the approximate final size
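
Practice 5 can be sketched as follows. sameIgnoringCase is a hypothetical wrapper added only to make the example testable; strings.EqualFold is the real standard library call.

```go
package main

import (
	"fmt"
	"strings"
)

// sameIgnoringCase is a hypothetical wrapper around strings.EqualFold,
// which compares under simple Unicode case folding without building
// lowered or uppered copies of its arguments.
func sameIgnoringCase(a, b string) bool {
	return strings.EqualFold(a, b)
}

func main() {
	fmt.Println(sameIgnoringCase("Go", "GO"))          // true
	fmt.Println(sameIgnoringCase("Straße", "STRASSE")) // false
	// Same answer for ASCII input, but ToLower may allocate new strings.
	fmt.Println(strings.ToLower("Go") == strings.ToLower("GO")) // true
}
```

The second comparison is false because simple case folding maps one code point to one code point, so ß is never expanded to "ss".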

Common Pitfalls

Slicing multi-byte strings

Slicing a string by byte index can cut a multi-byte rune in half, producing invalid UTF-8:

s := "café"
fmt.Println(s[:4])  // "caf\xc3" -- broken! 'é' is 2 bytes
fmt.Println(s[:5])  // "café" -- correct byte boundary

Convert to []rune first, or locate rune boundaries with range, if you need character-based slicing.
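
One possible way to cut on a rune boundary without allocating a []rune is to walk the string with range and slice at the reported byte offset. firstNRunes is a hypothetical helper name.

```go
package main

import "fmt"

// firstNRunes is a hypothetical helper: it returns the prefix of s holding
// at most n runes, always cutting on a rune boundary. It walks the string
// with range instead of allocating a []rune.
func firstNRunes(s string, n int) string {
	if n <= 0 {
		return ""
	}
	count := 0
	for i := range s {
		if count == n {
			return s[:i] // i is the byte offset where rune n+1 starts
		}
		count++
	}
	return s // s has n runes or fewer
}

func main() {
	fmt.Println(firstNRunes("caf\u00e9 au lait", 4)) // café
	fmt.Println(firstNRunes("世界", 1))                // 世
	fmt.Println(firstNRunes("go", 10))               // go
}
```

This still counts code points, not user-perceived characters, so a base letter followed by a combining mark counts as two.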

Comparing strings with special characters

Some Unicode characters have multiple representations. Use golang.org/x/text/unicode/norm for normalization when comparing user input.
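
The problem can be shown with the standard library alone. The sketch below compares two spellings of "é"; the variable names are invented for this example, and the fix itself (for instance norm.NFC.String from golang.org/x/text/unicode/norm) is left out so the program needs no external module.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Two spellings of "é": one precomposed code point, and "e" followed by a
// combining acute accent. They render the same but are different strings.
var (
	composed   = "\u00e9"
	decomposed = "e\u0301"
)

func main() {
	fmt.Println(composed == decomposed)               // false
	fmt.Println(len(composed), len(decomposed))       // 2 3
	fmt.Println(utf8.RuneCountInString(composed))     // 1
	fmt.Println(utf8.RuneCountInString(decomposed))   // 2
}
```

Normalizing both sides to the same form before comparing would make the two spellings equal.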

Modifying strings through byte slice

Converting to []byte, modifying, and converting back works but creates copies at each step. The original string is never modified.
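
A short sketch of that round trip, with the two copies marked in comments. upperFirstASCII is a hypothetical helper and handles only ASCII lowercase letters.

```go
package main

import "fmt"

// upperFirstASCII is a hypothetical helper: it returns a copy of s with the
// first byte upper-cased when that byte is an ASCII lowercase letter.
func upperFirstASCII(s string) string {
	if s == "" || s[0] < 'a' || s[0] > 'z' {
		return s
	}
	b := []byte(s)    // first copy: b does not alias s
	b[0] -= 'a' - 'A' // mutate the copy
	return string(b)  // second copy, back into an immutable string
}

func main() {
	orig := "hello"
	fmt.Println(upperFirstASCII(orig)) // Hello
	fmt.Println(orig)                  // hello (unchanged)
}
```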


Performance Considerations

| Operation | Cost | Notes |
| --- | --- | --- |
| len(s) | O(1) | Stored in string header |
| s[i] | O(1) | Direct byte access |
| s + t | O(n+m) | Allocates new string |
| []byte(s) | O(n) | Copy + allocation |
| []rune(s) | O(n) | Decode + allocation |
| strings.Builder | Amortized O(1) per byte written | Buffer grows geometrically when full |
| utf8.RuneCountInString(s) | O(n) | Must scan entire string |
| range s (rune) | O(n) total | Decodes on the fly, no allocation |

Compiler Optimizations

Go's compiler optimizes certain patterns: a string(b) conversion of a []byte used directly as a map index, as in m[string(b)], or directly in a comparison typically avoids the allocation and copy.
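
The map lookup pattern might look like this in practice. lookup is a hypothetical helper; whether the copy is actually elided depends on the compiler version, so treat the comment as an assumption to verify with a benchmark rather than a guarantee.

```go
package main

import "fmt"

// lookup is a hypothetical helper: it looks up a []byte key in a
// string-keyed map. The conversion appears directly in the index
// expression, which is the form the compiler can handle without copying
// (assumed behavior of the standard gc toolchain).
func lookup(m map[string]int, key []byte) (int, bool) {
	v, ok := m[string(key)]
	return v, ok
}

func main() {
	counts := map[string]int{"go": 3, "rust": 1}
	fmt.Println(lookup(counts, []byte("go")))  // 3 true
	fmt.Println(lookup(counts, []byte("zig"))) // 0 false
}
```

Assigning string(key) to a variable first and indexing with that variable is a different pattern and would generally pay for the copy.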


Interview Tips

Interview Tip

When asked "What is the output of len("日本語")?", the answer is 9 (3 runes × 3 bytes each), not 3. Always clarify whether a question is asking about bytes or runes.

Interview Tip

Know why string is immutable: it allows safe sharing across goroutines without locks, enables string interning optimizations, and makes strings usable as map keys.

Interview Tip

If asked to reverse a Unicode string, convert to []rune first:

func reverseString(s string) string {
    runes := []rune(s)
    for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
        runes[i], runes[j] = runes[j], runes[i]
    }
    return string(runes)
}
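
Reversing bytes directly would corrupt multi-byte characters. Reversing runes still separates combining marks from their base characters, so mention that limitation if the interviewer cares about user-perceived characters.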


Key Takeaways

  • Strings are immutable byte slices, not character arrays
  • rune is int32 -- a Unicode code point; byte is uint8
  • len() counts bytes; use utf8.RuneCountInString() for runes
  • range over a string yields (byte index, rune) pairs
  • Use strings.Builder for efficient concatenation in loops
  • Raw strings (backticks) are ideal for regex, paths, and multi-line text
  • String conversions ([]byte, []rune) allocate and copy, apart from a few compiler-optimized patterns such as m[string(b)]