What Every Programmer Should Know About Memory Part 3

Published: January 2, 2026 at 03:35 AM EST
5 min read
Source: Dev.to

1. UMA vs. NUMA: The Death of Equality

To understand why modern servers behave the way they do, we need to look at the evolution of memory architectures.

[Figure: UMA vs NUMA Architecture]

1.1 UMA (Uniform Memory Access)

The Old Way: In the days of SMP (Symmetric Multi‑Processing) we had a single memory controller and a single system bus. All CPUs were attached to that bus.

What it means: “Uniform” means the cost to access RAM is the same for every core. Accessing address 0x0 takes, say, 100 ns for Core 0 and 100 ns for Core 1.

Why it failed: The shared bus became a bottleneck. As we added more cores (2, 4, 8, …) they all fought for the same bandwidth—like 64 cars trying to use a single‑lane highway.

1.2 NUMA (Non‑Uniform Memory Access)

The New Way: To solve the bottleneck, hardware architects split the memory up.

What it means: Instead of one giant bank of RAM, each processor socket gets a dedicated chunk of RAM. A Processor + its Local RAM is called a NUMA node.

How it works: Nodes are connected by a high‑speed interconnect (Intel UPI, AMD Infinity Fabric, etc.). If CPU 0 needs data that lives in CPU 1’s memory, it asks CPU 1 to fetch the data and ship it over the interconnect.

This solves the bandwidth problem (multiple highways!) but introduces a new problem: physics.

2. The Cost of Remote Access

Now that memory is physically distributed, distance matters.

[Figure: NUMA Local vs Remote Access]

If a CPU on Node 0 needs data located in Node 0’s RAM, the path is short and fast.
If a CPU on Node 0 needs data located in Node 1’s RAM, the request must travel over the interconnect, wait for Node 1’s memory controller, and ship the data back.

2.1 The Latency Penalty

We often express this cost as a latency factor:

Access type    Relative latency
Local          1.0× (baseline)
Remote         1.5×–2.0× slower

Every cache miss that hits remote memory can be twice as expensive as a local miss. In high‑performance computing (HPC) or low‑latency trading this is disastrous.
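
One way to see the penalty directly is to run the same memory-bound workload twice under numactl, once pinned to local memory and once forced onto remote memory, and compare the runtimes. The benchmark binary ./latency_bench below is hypothetical; the numactl options are the standard ones covered in section 4.3.

# Local: execute on node 0, allocate on node 0
numactl --cpunodebind=0 --membind=0 ./latency_bench

# Remote: execute on node 0, but force all allocations onto node 1
numactl --cpunodebind=0 --membind=1 ./latency_bench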

2.2 Bandwidth Saturation: The Clogged Pipe

It’s not just latency; it’s also capacity. The interconnect between sockets has limited bandwidth.

If you write a program where all threads on all 64 cores aggressively read from Node 0’s memory, you create a traffic jam. Local cores on Node 0 may get data fine, but remote cores on other nodes will stall as they fight for space on the interconnect.

3. OS Policies: The “First Touch” Trap

So how does the OS decide where to place your memory? If you malloc(1 GB), does it go to Node 0 or Node 1?

Linux uses a policy called First‑Touch Allocation.

3.1 How Linux Allocates Memory

  1. malloc(1 GB) returns a virtual address range; no physical RAM is assigned yet.
  2. The physical page is allocated only when the process writes to that page for the first time (a page‑fault).
  3. At that moment the kernel looks at which CPU caused the fault.
  4. The page is placed in the NUMA node that is local to that CPU.

Thus the first thread that touches a page determines its home node.
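
One way to watch first touch happen is to ask the kernel which node backs a page right after writing to it. The sketch below is illustrative; it uses get_mempolicy() with MPOL_F_NODE | MPOL_F_ADDR from <numaif.h> (part of libnuma, link with -lnuma), and the buffer size is arbitrary.

#include <numaif.h>   /* get_mempolicy, MPOL_F_NODE, MPOL_F_ADDR */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t size = 64UL * 1024 * 1024;   /* 64 MiB, illustrative */
    char *buf = malloc(size);
    if (buf == NULL) return EXIT_FAILURE;

    buf[0] = 1;   /* first touch: the page gets its home node now */

    int node = -1;
    /* Ask the kernel which NUMA node backs the page containing buf[0]. */
    if (get_mempolicy(&node, NULL, 0, buf, MPOL_F_NODE | MPOL_F_ADDR) == 0)
        printf("first-touched page lives on node %d\n", node);

    free(buf);
    return EXIT_SUCCESS;
}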

3.2 The Trap: Main‑Thread Initialization

If the main thread performs all the first writes, every page ends up on the node where the main thread runs. On a multi‑socket system this can concentrate memory on a single node, causing remote accesses for worker threads on other sockets.

The Scenario

  • The Main Thread (running on Node 0) allocates a huge array and memsets it.
  • All pages are allocated on Node 0.
  • 64 worker threads are spawned across Nodes 0‑3 to process the data.

[Figure: First Touch Trap]

The Result

  • Threads on Node 0 enjoy local access.
  • Threads on Nodes 1‑3 suffer remote accesses, saturating the interconnect.
  • Scaling stalls or even degrades as more cores are added.

The Fix

Parallel Initialization – let each worker thread initialize the portion of data it will later process. This ensures the pages are allocated on the node where the thread runs, eliminating the remote‑access penalty.
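
A minimal sketch of this pattern with OpenMP (an assumption; any threading model works as long as each thread writes the slice it will later process). Compile with -fopenmp and pin threads (e.g., OMP_PROC_BIND=true) so the initializing thread and the processing thread sit on the same node.

#include <stdlib.h>

#define N (1UL << 27)   /* element count, illustrative (1 GiB of doubles) */

int main(void) {
    double *data = malloc(N * sizeof *data);
    if (data == NULL) return EXIT_FAILURE;

    /* Parallel first touch: each thread faults in the pages of its own chunk,
       so those pages are allocated on that thread's local node. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < N; i++)
        data[i] = 0.0;

    /* Processing with the same static schedule touches mostly local pages. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < N; i++)
        data[i] = data[i] * 2.0 + 1.0;

    free(data);
    return EXIT_SUCCESS;
}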

3.3 The “Spillover” Behavior (Zone Reclaim)

When a node’s local memory is exhausted, the kernel either spills new allocations over to a remote node (the default behavior) or, if vm.zone_reclaim_mode is enabled, first tries to reclaim pages on the local node.

  • This creates unpredictable latency spikes.
  • An application may run fast for a while, then slow down dramatically once the local node fills and allocations “spill over” to another node.
  • Monitoring the numa_miss counters (via the numastat tool or /sys/devices/system/node/node*/numastat) is a reliable way to detect this condition; see the commands below.
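
A quick check, assuming the numactl package (which also ships the numastat tool) is installed:

$ numastat                                      # per-node numa_hit, numa_miss, numa_foreign, ...
$ cat /sys/devices/system/node/node0/numastat   # raw counters for node 0

A steadily rising numa_miss value on a node means that node is satisfying allocations that were originally intended for another (full) node, i.e. the spillover described above.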

4. Tools of the Trade

4.1 Analyzing with lscpu

$ lscpu

lscpu prints the CPU topology, including the number of NUMA nodes and which CPUs belong to each node.
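
On a two-socket machine, the NUMA-related lines look roughly like this (illustrative excerpt; the values are chosen to match the numactl example below):

NUMA node(s):        2
NUMA node0 CPU(s):   0-15
NUMA node1 CPU(s):   16-31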

4.2 The Distance Matrix (numactl)

$ numactl --hardware

Typical output:

available: 2 nodes (0-1)
node 0 cpus: 0-15
node 1 cpus: 16-31
node 0 size: 128 GB
node 1 size: 128 GB
node 0 free: 124 GB
node 1 free: 126 GB
node distances:
node   0   1
  0:  10  20
  1:  20  10

The distance values are relative: 10 is the local-access baseline, and larger numbers mean proportionally higher latency, so 20 nominally indicates about twice the local cost.

4.3 Controlling Policy with numactl

Run a program with an explicit memory policy:

# Bind the process to node 0 and allocate memory only from node 0
numactl --cpunodebind=0 --membind=0 ./my_program

# Interleave memory across nodes (good for large, uniformly accessed data)
numactl --interleave=all ./my_program

4.4 Programming with libnuma
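
The snippet below is a minimal sketch of node-aware allocation with libnuma; everything after the availability check (the 1 GiB size, the target node, the cleanup) is illustrative. Link against the library, e.g. gcc numa_example.c -o numa_example -lnuma.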

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    if (numa_available() == -1) {
        fprintf(stderr, "NUMA not supported on this system.\n");
        return EXIT_FAILURE;
    }

    /* Allocate 1 GiB on node 0 (size and node are illustrative). */
    size_t size = 1UL << 30;
    char *buf = numa_alloc_onnode(size, 0);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed.\n");
        return EXIT_FAILURE;
    }

    memset(buf, 0, size);   /* touch the pages so they are faulted in locally */
    numa_free(buf, size);   /* memory from numa_alloc_* must be freed with numa_free */
    return EXIT_SUCCESS;
}

5. Conclusion

Understanding the distinction between UMA and NUMA, the latency and bandwidth costs of remote memory accesses, and the OS’s first‑touch allocation policy is essential for writing scalable software on modern multi‑socket servers. By using the right tools (lscpu, numactl, libnuma) and adopting parallel initialization patterns, developers can avoid hidden performance pitfalls and fully exploit the hardware’s capabilities.
