Why Goroutines Scale: Stack Growth, Compiler Tricks, and Context Switching
Thread Overhead in C++ and Java
In languages like C++ and Java, threads are the primary unit of concurrency, but they cost significant CPU time in context switching and a relatively large amount of memory at creation. A single thread typically reserves ~1 MiB of stack space, so spawning 100,000 threads would require roughly 100 GiB of RAM, which is not feasible for most software projects.
To maintain concurrency, the OS scheduler uses time-slicing to give each thread a fair share of CPU cycles. Every time one thread is paused and another resumed, the kernel must perform a context switch, which is quite expensive:
- The current thread’s state is saved in its TCB (Thread Control Block).
- The next thread’s state is loaded from its TCB.
- Context switches destroy cache locality, causing frequent L1/L2 cache misses.
When you have thousands of threads, the CPU spends more time switching context than actually executing code.
How Goroutines Optimize This
Goroutines are “lightweight threads” managed entirely in user space by the Go runtime, rather than by the OS kernel.
Memory Efficiency
- A standard OS thread reserves a fixed 1 MiB stack.
- A goroutine starts with a stack of only 2 KiB.

The Math: 2 KiB is roughly 0.2 % of 1 MiB.
The Impact: Instead of capping out at thousands of threads, you can easily spawn millions of goroutines on a standard laptop without running out of RAM.
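To see the difference in practice, here is a rough sketch (not a precise benchmark) that parks a batch of goroutines and reports the stack memory they consume, using the standard runtime.MemStats counters. Exact numbers vary by Go version and platform:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// stackInUse reports the bytes of stack memory currently in use by
// goroutines, as tracked by the Go runtime.
func stackInUse() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.StackInuse
}

func main() {
	const n = 100_000 // bump toward millions if you have the RAM

	before := stackInUse()

	var wg sync.WaitGroup
	done := make(chan struct{})
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			<-done // park so the goroutine's stack stays allocated
		}()
	}

	after := stackInUse()
	fmt.Printf("%d goroutines -> ~%d KiB of stack each\n",
		n, (after-before)/n/1024)

	close(done)
	wg.Wait()
}
```

On a typical 64-bit build this prints roughly 2 KiB per goroutine (plus a little runtime bookkeeping), matching the starting stack size described above.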
The “Infinite” Stack
Unlike OS threads, which have a fixed stack size determined at creation, goroutine stacks are dynamic:
- A goroutine starts with a 2 KiB stack.
- If it runs out of space, the runtime allocates a new, larger stack (usually double the current size), copies the old contents over, and continues there.

- OS thread limit: Fixed (≈1–8 MiB). Hitting this limit causes a crash.
- Goroutine limit: Dynamic (up to ~1 GiB on 64‑bit systems).
Thus, for all practical purposes, goroutine recursion depth is limited only by available memory and the runtime’s stack cap, while OS threads are limited by their initial reservation.
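As a quick illustration (a sketch, not a stress test), the recursion below builds up tens of MiB of stack frames, enough to overflow a fixed 1 MiB thread stack many times over, yet it runs fine in Go:

```go
package main

import "fmt"

// deep recurses one million frames. At a few dozen bytes per frame
// this needs tens of MiB of stack, far beyond a fixed 1 MiB thread
// stack but well under Go's ~1 GiB default cap, so the runtime
// simply keeps doubling the goroutine stack as needed.
func deep(n int) int {
	if n == 0 {
		return 0
	}
	return 1 + deep(n-1) // no tail-call optimization: every frame stays live
}

func main() {
	fmt.Println(deep(1_000_000)) // prints 1000000
}
```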
Faster Context Switches

Both OS threads and goroutines must save their state when paused, but the cost differs dramatically:
| | OS Thread Switch | Goroutine Switch |
|---|---|---|
| Typical latency | ~1–2 µs | ~200 ns (≈10× faster) |
| What is saved? | All CPU registers (including heavy FP/AVX registers) → TCB | Only 3 registers (PC, SP, DX) → a small Go struct (g) |
| Where it happens | Kernel mode (trap) | User space (runtime) |
| Cache impact | Pollutes caches, loses locality | Caches stay hot, locality preserved |
Because goroutine switches stay in user space and touch only a handful of registers, the overhead is negligible.
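A crude way to feel this number (a sketch, not a rigorous microbenchmark) is a channel ping-pong between two goroutines. Each round trip forces the scheduler to switch goroutines twice, so the per-round time approximates two goroutine switches plus channel overhead:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Pin everything to one OS thread so the round trip measures
	// goroutine switches rather than cross-core wakeups.
	runtime.GOMAXPROCS(1)

	const rounds = 1_000_000
	ping := make(chan struct{})
	pong := make(chan struct{})

	go func() {
		for range ping {
			pong <- struct{}{}
		}
	}()

	start := time.Now()
	for i := 0; i < rounds; i++ {
		ping <- struct{}{} // hand off to the other goroutine
		<-pong             // ...and get scheduled back
	}
	elapsed := time.Since(start)

	close(ping)
	fmt.Printf("%v per round trip (two switches + channel ops)\n",
		elapsed/rounds)
}
```

On typical hardware this lands in the hundreds of nanoseconds per round trip, consistent with the table above.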
How Goroutine Stack Allocation Works
The Go compiler inserts a function prologue at the start of nearly every function (tiny leaf functions marked NOSPLIT skip it). The prologue performs a check:
- Compare the current stack pointer (SP) with a limit called the stack guard.
- If insufficient space remains, branch to runtime.morestack.
- runtime.morestack allocates a larger stack (usually 2× the current size).
- The runtime copies the existing stack contents to the new allocation and adjusts all pointers into the stack so they point at the new addresses.
- Execution resumes on the larger stack.
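You can actually watch the copy happen. This sketch (a demonstration, not a runtime API) prints the address of a stack-allocated local at increasing recursion depths; each time the runtime grows the stack and copies it to a new allocation, the printed address jumps:

```go
package main

import (
	"fmt"
	"unsafe"
)

// grow recurses deep enough to trigger several stack growths. local
// stays on the goroutine stack because its address is only passed on
// as a uintptr (a plain integer), which escape analysis does not
// treat as a retained pointer.
func grow(depth int) {
	var local byte
	if depth%2000 == 0 {
		fmt.Printf("depth %6d: local at %#x\n",
			depth, uintptr(unsafe.Pointer(&local)))
	}
	if depth < 10000 {
		grow(depth + 1)
	}
}

func main() {
	grow(0) // the printed address jumps each time the stack is copied
}
```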
Example
```go
package main

import "fmt"

func main() {
	fmt.Println("Hello Ayush")
}
```
Building with go build -gcflags=-S dumps the generated assembly for main.main:
```
main.main STEXT size=83 args=0x0 locals=0x40 funcid=0x0 align=0x0
0x0000 00000 (/Users/ayushanand/concurrency/main.go:7) TEXT main.main(SB), ABIInternal, $64-0
0x0000 00000 (/Users/ayushanand/concurrency/main.go:7) CMPQ SP, 16(R14) // compare SP with stack guard
0x0004 00004 (/Users/ayushanand/concurrency/main.go:7) PCDATA $0, $-2
0x0004 00004 (/Users/ayushanand/concurrency/main.go:7) JLS 76 // jump to morestack if SP <= stack guard
...
0x002d 00045 (/usr/local/Cellar/go/1.25.4/libexec/src/fmt/print.go:314) LEAQ go:itab.*os.File,io.Writer(SB), AX
0x0034 00052 (/usr/local/Cellar/go/1.25.4/libexec/src/fmt/print.go:314) LEAQ main..autotmp_8+40(SP), CX
0x0039 00057 (/usr/local/Cellar/go/1.25.4/libexec/src/fmt/print.go:314) MOVL $1, DI
0x003e 00062 (/usr/local/Cellar/go/1.25.4/libexec/src/fmt/print.go:314) MOVQ DI, SI
0x0041 00065 (/usr/local/Cellar/go/1.25.4/libexec/src/fmt/print.go:314) PCDATA $1, $0
0x0041 00065 (/usr/local/Cellar/go/1.25.4/libexec/src/fmt/print.go:314) CALL fmt.Fprintln(SB)
0x0046 00070 (/Users/ayushanand/concurrency/main.go:9) ADDQ $56, SP
0x004a 00074 (/Users/ayushanand/concurrency/main.go:9) POPQ BP
0x004b 00075 (/Users/ayushanand/concurrency/main.go:9) RET
0x004c 00076 (/Users/ayushanand/concurrency/main.go:9) NOP
0x004c 00076 (/Users/ayushanand/concurrency/main.go:7) PCDATA $1, $-1
0x004c 00076 (/Users/ayushanand/concurrency/main.go:7) PCDATA $0, $-2
0x004c 00076 (/Users/ayushanand/concurrency/main.go:7) CALL runtime.morestack_noctxt(SB)
0x0051 00081 (/Users/ayushanand/concurrency/main.go:7) PCDATA $0, $-1
0x0051 00081 (/Users/ayushanand/concurrency/main.go:7) JMP 0
```
The stack check is visible right at the top: CMPQ SP, 16(R14) compares the stack pointer against the stack guard stored in the current goroutine’s g struct (which the amd64 ABI keeps in R14), and JLS 76 branches to the runtime.morestack_noctxt call at offset 76 when there is not enough room. Once the stack has grown, the final JMP 0 restarts the function from the top on the larger stack.
Ending Notes
Goroutines aren’t just “threads but smaller.” They represent a fundamental rethink of how we manage concurrency. By moving the stack management from the OS kernel to the Go runtime, we gain:
- Massive Scalability: From thousands of OS threads to millions of goroutines.
- Dynamic Memory: Pay for what you use (≈2 KiB), not what you might use (≈1 MiB).
- Low Latency: Context switches that are ~10× faster.
Next time you type go func(), remember: there is a tiny 2 KiB stack and a smart compiler working in the background to make it “infinite.”