Why Goroutines Scale: Stack Growth, Compiler Tricks, and Context Switching
Thread Overhead in C++ and Java
In languages like C++ and Java, threads are the primary unit of concurrency, but they cost significant CPU time in context switching and a relatively large amount of memory at creation. A single thread typically reserves ~1 MiB of stack space, so spawning 100,000 threads would require roughly 100 GiB of RAM, which is not feasible for most software projects.
To maintain concurrency, the OS scheduler uses time-slicing to give each thread a fair share of CPU cycles. Every time one thread is paused and another resumed, the kernel must perform a context switch, which is quite expensive:
- The current thread’s state is saved in its TCB (Thread Control Block).
- The next thread’s state is loaded from its TCB.
- Context switches destroy cache locality, causing frequent L1/L2 cache misses.
When you have thousands of threads, the CPU spends more time switching context than actually executing code.
How Goroutines Optimize This
Goroutines are “lightweight threads” managed entirely in user space by the Go runtime, rather than by the OS kernel.
Memory Efficiency
- A standard OS thread reserves a fixed 1 MiB stack.
- A goroutine starts with a stack of only 2 KiB.

The Math: 2 KiB is roughly 0.2 % of 1 MiB.
The Impact: Instead of capping out at thousands of threads, you can easily spawn millions of goroutines on a standard laptop without running out of RAM.
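To see the difference in practice, here is a rough sketch (not a precise benchmark) that parks a batch of goroutines and reports the stack memory they consume, using the standard runtime.MemStats counters. Exact numbers vary by Go version and platform:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// stackInUse reports the bytes of stack memory currently in use by
// goroutines, as tracked by the Go runtime.
func stackInUse() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.StackInuse
}

func main() {
	const n = 100_000 // bump toward millions if you have the RAM

	before := stackInUse()

	var wg sync.WaitGroup
	done := make(chan struct{})
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			<-done // park so the goroutine's stack stays allocated
		}()
	}

	after := stackInUse()
	fmt.Printf("%d goroutines -> ~%d KiB of stack each\n",
		n, (after-before)/n/1024)

	close(done)
	wg.Wait()
}
```

On a typical 64-bit build this prints roughly 2 KiB per goroutine (plus a little runtime bookkeeping), matching the starting stack size described above.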
The “Infinite” Stack
Unlike OS threads, which have a fixed stack size determined at creation, goroutine stacks are dynamic:
- A goroutine starts with a 2 KiB stack.
- If it runs out of space, the runtime allocates a new, larger stack (usually double the current size), copies the old contents over, and continues there.

- OS thread limit: Fixed (≈1–8 MiB). Hitting this limit causes a crash.
- Goroutine limit: Dynamic (up to ~1 GiB on 64‑bit systems).
Thus, for all practical purposes, goroutine recursion depth is limited only by available memory and the runtime’s stack cap, while OS threads are limited by their initial reservation.
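As a quick illustration (a sketch, not a stress test), the recursion below builds up tens of MiB of stack frames, enough to overflow a fixed 1 MiB thread stack many times over, yet it runs fine in Go:

```go
package main

import "fmt"

// deep recurses one million frames. At a few dozen bytes per frame
// this needs tens of MiB of stack, far beyond a fixed 1 MiB thread
// stack but well under Go's ~1 GiB default cap, so the runtime
// simply keeps doubling the goroutine stack as needed.
func deep(n int) int {
	if n == 0 {
		return 0
	}
	return 1 + deep(n-1) // no tail-call optimization: every frame stays live
}

func main() {
	fmt.Println(deep(1_000_000)) // prints 1000000
}
```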
Faster Context Switches

Both OS threads and goroutines must save their state when paused, but the cost differs dramatically:
| | OS Thread Switch | Goroutine Switch |
|---|---|---|
| Typical latency | ~1–2 µs | ~200 ns (≈10× faster) |
| What is saved? | All CPU registers (including heavy FP/AVX registers) → TCB | Only 3 registers (PC, SP, DX) → a small Go struct (g) |
| Where it happens | Kernel mode (trap) | User space (runtime) |
| Cache impact | Pollutes caches, loses locality | Caches stay hot, locality preserved |
Because goroutine switches stay in user space and touch only a handful of registers, the overhead is negligible.
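A crude way to feel this number (a sketch, not a rigorous microbenchmark) is a channel ping-pong between two goroutines. Each round trip forces the scheduler to switch goroutines twice, so the per-round time approximates two goroutine switches plus channel overhead:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Pin everything to one OS thread so the round trip measures
	// goroutine switches rather than cross-core wakeups.
	runtime.GOMAXPROCS(1)

	const rounds = 1_000_000
	ping := make(chan struct{})
	pong := make(chan struct{})

	go func() {
		for range ping {
			pong <- struct{}{}
		}
	}()

	start := time.Now()
	for i := 0; i < rounds; i++ {
		ping <- struct{}{} // hand off to the other goroutine
		<-pong             // ...and get scheduled back
	}
	elapsed := time.Since(start)

	close(ping)
	fmt.Printf("%v per round trip (two switches + channel ops)\n",
		elapsed/rounds)
}
```

On typical hardware this lands in the hundreds of nanoseconds per round trip, consistent with the table above.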
How Goroutine Stack Allocation Works
The Go compiler inserts a function prologue at the start of nearly every function (tiny leaf functions marked NOSPLIT skip it). The prologue performs a check:
- Compare the current stack pointer (SP) with a limit called the stack guard.
- If insufficient space remains, branch to runtime.morestack.
- runtime.morestack allocates a larger stack (usually 2× the current size).
- The runtime copies the existing stack contents to the new allocation and adjusts all pointers into the stack so they point at the new addresses.
- Execution resumes on the larger stack.
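You can actually watch the copy happen. This sketch (a demonstration, not a runtime API) prints the address of a stack-allocated local at increasing recursion depths; each time the runtime grows the stack and copies it to a new allocation, the printed address jumps:

```go
package main

import (
	"fmt"
	"unsafe"
)

// grow recurses deep enough to trigger several stack growths. local
// stays on the goroutine stack because its address is only passed on
// as a uintptr (a plain integer), which escape analysis does not
// treat as a retained pointer.
func grow(depth int) {
	var local byte
	if depth%2000 == 0 {
		fmt.Printf("depth %6d: local at %#x\n",
			depth, uintptr(unsafe.Pointer(&local)))
	}
	if depth < 10000 {
		grow(depth + 1)
	}
}

func main() {
	grow(0) // the printed address jumps each time the stack is copied
}
```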
Example
```go
package main

import "fmt"

func main() {
	fmt.Println("Hello Ayush")
}
```
Building with go build -gcflags=-S dumps the generated assembly for main.main:
```
main.main STEXT size=83 args=0x0 locals=0x40 funcid=0x0 align=0x0
0x0000 00000 (/Users/ayushanand/concurrency/main.go:7) TEXT main.main(SB), ABIInternal, $64-0
0x0000 00000 (/Users/ayushanand/concurrency/main.go:7) CMPQ SP, 16(R14) // compare SP with stack guard
0x0004 00004 (/Users/ayushanand/concurrency/main.go:7) PCDATA $0, $-2
0x0004 00004 (/Users/ayushanand/concurrency/main.go:7) JLS 76 // jump to morestack if SP <= stack guard
...
0x002d 00045 (/usr/local/Cellar/go/1.25.4/libexec/src/fmt/print.go:314) LEAQ go:itab.*os.File,io.Writer(SB), AX
0x0034 00052 (/usr/local/Cellar/go/1.25.4/libexec/src/fmt/print.go:314) LEAQ main..autotmp_8+40(SP), CX
0x0039 00057 (/usr/local/Cellar/go/1.25.4/libexec/src/fmt/print.go:314) MOVL $1, DI
0x003e 00062 (/usr/local/Cellar/go/1.25.4/libexec/src/fmt/print.go:314) MOVQ DI, SI
0x0041 00065 (/usr/local/Cellar/go/1.25.4/libexec/src/fmt/print.go:314) PCDATA $1, $0
0x0041 00065 (/usr/local/Cellar/go/1.25.4/libexec/src/fmt/print.go:314) CALL fmt.Fprintln(SB)
0x0046 00070 (/Users/ayushanand/concurrency/main.go:9) ADDQ $56, SP
0x004a 00074 (/Users/ayushanand/concurrency/main.go:9) POPQ BP
0x004b 00075 (/Users/ayushanand/concurrency/main.go:9) RET
0x004c 00076 (/Users/ayushanand/concurrency/main.go:9) NOP
0x004c 00076 (/Users/ayushanand/concurrency/main.go:7) PCDATA $1, $-1
0x004c 00076 (/Users/ayushanand/concurrency/main.go:7) PCDATA $0, $-2
0x004c 00076 (/Users/ayushanand/concurrency/main.go:7) CALL runtime.morestack_noctxt(SB)
0x0051 00081 (/Users/ayushanand/concurrency/main.go:7) PCDATA $0, $-1
0x0051 00081 (/Users/ayushanand/concurrency/main.go:7) JMP 0
```
The stack check is visible right at the top: CMPQ SP, 16(R14) compares the stack pointer against the stack guard stored in the current goroutine’s g struct (which the amd64 ABI keeps in R14), and JLS 76 branches to the runtime.morestack_noctxt call at offset 76 when there is not enough room. Once the stack has grown, the final JMP 0 restarts the function from the top on the larger stack.
Ending Notes
Goroutines aren’t just “threads but smaller.” They represent a fundamental rethink of how we manage concurrency. By moving the stack management from the OS kernel to the Go runtime, we gain:
- Massive Scalability: From thousands of OS threads to millions of goroutines.
- Dynamic Memory: Pay for what you use (≈2 KiB), not what you might use (≈1 MiB).
- Low Latency: Context switches that are ~10× faster.
Next time you type go func(), remember: there is a tiny 2 KiB stack and a smart compiler working in the background to make it “infinite.”