Huge Binaries
Source: Hacker News
A problem I experienced while pursuing my PhD and submitting academic articles was that I had built solutions to problems that required dramatic scale to be effective and worthwhile.
Reviewers of my submissions often claimed such problems did not exist. I had observed them first‑hand during my time in industry (e.g., at Google), but I couldn’t cite them!
One problem that is only present in these mega‑codebases is massive binaries.
What’s the largest binary (ELF file) you’ve ever seen? I have observed binaries beyond 25 GiB, including debug symbols. How is this possible?
These companies prefer to statically build their services to speed up startup and simplify deployment. Statically including all code in some of the world’s largest codebases is a recipe for massive binaries.
Similar to the sound barrier, there is a point at which code size becomes problematic and we must rethink how we link and build code. For x86_64, that point is the 2 GiB “Relocation Barrier.”
Why 2 GiB? 🤔
Let’s take a look at how position‑independent code is put together.
A simple example
/* simple-relocation.c */
extern void far_function();
int main(void) {
far_function();
return 0;
}
Compile the file:
gcc -c simple-relocation.c -o simple-relocation.o
Inspect the object file with objdump:
objdump -dr simple-relocation.o
0000000000000000 <main>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: e8 00 00 00 00 call e <main+0xe>
a: R_X86_64_PLT32 far_function-0x4
e: b8 00 00 00 00 mov $0x0,%eax
13: 5d pop %rbp
14: c3 ret
The e8 byte is the CALL opcode (it takes a 32‑bit signed relative offset).
Right now the offset is 0 (four bytes of 0). objdump also tells us that a relocation is required to fix up this code when we finalize it.
Note
The -0x4 is needed because the displacement is measured from the address of the next instruction, which lies 4 bytes past the start of the operand being patched.
We can view the relocation entry with readelf:
readelf -r simple-relocation.o -d
Relocation section '.rela.text' at offset 0x170 contains 1 entry:
Offset Info Type Sym. Value Sym. Name + Addend
00000000000a 000400000004 R_X86_64_PLT32 0000000000000000 far_function - 4
The entry tells the linker that the 4‑byte operand at offset 0x0a (the start of the CALL’s immediate) must be patched with the PC‑relative displacement to far_function, not with its absolute address.
Adding the callee
/* far-function.c */
void far_function(void) {
}
Compile it:
gcc -c far-function.c -o far-function.o
Link the two object files:
gcc simple-relocation.o far-function.o -o simple-relocation
Inspect the final executable:
objdump -dr simple-relocation
0000000000401106 <main>:
401106: 55 push %rbp
401107: 48 89 e5 mov %rsp,%rbp
40110a: b8 00 00 00 00 mov $0x0,%eax
40110f: e8 07 00 00 00 call 40111b <far_function>
401114: b8 00 00 00 00 mov $0x0,%eax
401119: 5d pop %rbp
40111a: c3 ret
000000000040111b <far_function>:
40111b: 55 push %rbp
40111c: 48 89 e5 mov %rsp,%rbp
40111f: 90 nop
401120: 5d pop %rbp
401121: c3 ret
The linker has calculated the relative offset (0x07) and patched the CALL instruction correctly.
The 2 GiB Barrier
The CALL opcode (e8) uses a 32‑bit signed displacement, which limits the jump range to ±2 GiB (2³¹ bytes).
Thus a callsite can only reach code that lies within a 2 GiB window forward or backward. This limit is known as the 2 GiB Relocation Barrier.
What happens when the target is farther than 2 GiB?
We can force the linker to place far_function far away using a linker script.
/* overflow.lds */
SECTIONS
{
/* 1. Standard low‑address sections */
. = 0x400000;
.text : {
simple-relocation.o(.text.*)
}
.rodata : { *(.rodata .rodata.*) }
.data : { *(.data .data.*) }
.bss : { *(.bss .bss.*) }
/* 2. Move the location counter far away */
. = 0x120000000; /* ≈4.5 GiB */
.text.far : {
far-function.o(.text*)
}
}
Now link with LLVM’s lld (its error messages are a bit clearer):
gcc simple-relocation.o far-function.o -T overflow.lds \
-o simple-relocation-overflow -fuse-ld=lld
Output:
ld.lld: error: <internal>:(.eh_frame+0x6c):
relocation R_X86_64_PC32 out of range:
5364513724 is not in [-2147483648, 2147483647]; references section '.text'
ld.lld: error: simple-relocation.o:(function main: .text+0xa):
relocation R_X86_64_PLT32 out of range:
5364514572 is not in [-2147483648, 2147483647]; references 'far_function'
>>> referenced by simple-relocation.c
>>> defined in far-function.o
The linker reports a relocation overflow because the required displacement does not fit into a signed 32‑bit field.
Dealing with the Barrier
When we hit this problem we have several options, which fall under the broader topic of code models. The appropriate solution depends on whether we are accessing:
- Data (static variables, constants)
- Code (functions, jump targets)
A great, in‑depth discussion of these techniques can be found in the blog post “Relocation overflow and code models” by @maskray, the author of lld.
TL;DR
- The x86‑64 direct CALL/JMP instructions use a 32‑bit signed relative offset, limiting direct jumps to ±2 GiB.
- Massive static binaries can easily exceed this limit, causing relocation overflow errors at link time.
- Solutions involve switching code models (e.g., from small to medium or large), indirect jumps, trampolines, or dynamic linking so that every call target stays reachable.
Understanding and working around the 2 GiB relocation barrier is essential when dealing with the mega‑codebases that produce binaries tens of gigabytes in size.
The simplest solution, however, is to use -mcmodel=large, which replaces each relative CALL with an indirect CALL through a register that holds the full 64‑bit absolute address.
# Link the original objects (this still fails with a relocation overflow)
gcc simple-relocation.o far-function.o -T overflow.lds -o simple-relocation-overflow
# Compile with the large code model
gcc -c simple-relocation.c -o simple-relocation.o -mcmodel=large -fno-asynchronous-unwind-tables
# Link again
gcc simple-relocation.o far-function.o -T overflow.lds -o simple-relocation-overflow
# Run
./simple-relocation-overflow
Note
I needed to add -fno-asynchronous-unwind-tables to disable some additional data that might cause overflow for the purpose of this demonstration.
What does the disassembly look like now?
objdump -dr simple-relocation-overflow
0000000120000000 <far_function>:
120000000: 55 push %rbp
120000001: 48 89 e5 mov %rsp,%rbp
120000004: 90 nop
120000005: 5d pop %rbp
120000006: c3 ret
00000000004000e6 <main>:
4000e6: 55 push %rbp
4000e7: 48 89 e5 mov %rsp,%rbp
4000ea: b8 00 00 00 00 mov $0x0,%eax
4000ef: 48 ba 00 00 00 20 01 movabs $0x120000000,%rdx
4000f6: 00 00 00
4000f9: ff d2 call *%rdx
4000fb: b8 00 00 00 00 mov $0x0,%eax
400100: 5d pop %rbp
400101: c3 ret
There is no longer a single CALL instruction; it has become a MOVABS followed by an indirect CALL 😲. This grows the call sequence from 5 bytes (1‑byte opcode + 4‑byte relative offset) to a whopping 12 bytes (2‑byte MOVABS prefix and opcode + 8‑byte absolute address + 2‑byte indirect CALL).
Notable downsides
- Instruction bloat – We’ve gone from 5 bytes per call to 12. In a binary with millions of call sites, this can add up quickly.
- Register pressure – We’ve burned a general‑purpose register (%rdx) to perform the jump.
Caution
I had a lot of trouble building a benchmark that demonstrated a lower IPC (instructions‑per‑cycle) for the large code model, so you’ll just have to take my word for it. 🤷
We would like to keep our small code model. What other strategies can we pursue?
More to come in subsequent writings.