Relocation overflow and code models

Posted by MaskRay on 2023-08-03 18:32:10

When linking an oversized executable, it is possible to encounter errors such as relocation truncated to fit: R_X86_64_PC32 against `.text' (GNU ld) or relocation R_X86_64_PC32 out of range (ld.lld). These diagnostics are a result of the relocation overflow check, a feature in the linker.

% gcc -fuse-ld=bfd @response.txt
...
a.o: in function `_start':
(.text+0x0): relocation truncated to fit: R_X86_64_PC32 against `.text'
% gcc -fuse-ld=lld @response.txt
ld.lld: error: a.o:(.text+0x0): relocation R_X86_64_PC32 out of range: -2147483649 is not in [-2147483648, 2147483647]; references section '.text'

This article explains why such issues can occur and provides insights on how to mitigate them.

Static linking

In this section, we will deviate slightly from the main topic to discuss static linking. An executable that includes all of its dependencies can run without relying on external shared objects. This eliminates the potential risks associated with updating dependencies separately.

Certain users prefer static linking or mostly static linking for the sake of deployment convenience and performance aspects:

Link-time optimization is more effective when all dependencies are known. Providing shared object information during executable optimization is possible, but it may not be a worthwhile engineering effort.
Profiling techniques are more efficient when dealing with a single executable.
The traditional ELF dynamic linking approach incurs overhead to support symbol interposition.
Dynamic linking involves PLT and GOT, which can introduce additional overhead. Static linking eliminates the overhead.
Loading libraries in the dynamic loader has a time complexity of O(|libs|^2*|libname|). The existing implementations are designed to handle tens of shared objects, rather than a thousand or more.

Furthermore, the current lack of techniques to partition an executable into a few larger shared objects, as opposed to numerous smaller shared objects, exacerbates the overhead issue.

In scenarios where the distributed program contains a significant amount of code (related: software bloat), employing full or mostly static linking can result in very large executable files. Consequently, certain relocations may be close to the distance limit, and even a minor disruption (e.g. adding a function or introducing a dependency) can trigger relocation overflow linker errors.

Relocation overflow

We will use the following C program to illustrate the concepts.

int var0; // known non-preemptible if -fno-pic or -fpie
extern int var1; // possibly-preemptible
int callee();
int caller() { return callee() + var0 + var1; }

The generated x86-64 assembly may appear as follows, with comments indicating the relocation types associated with each instruction:

# gcc -S -O1 -fpie -mno-direct-extern-access -masm=intel a.c
.globl caller
caller:
  call callee@PLT                          # R_X86_64_PLT32
  add  eax, DWORD PTR [rip + var0]         # R_X86_64_PC32; load from &var0
  mov  rdx, QWORD PTR var1@GOTPCREL[rip]   # R_X86_64_REX_GOTPCRELX; rdx = .got[n] = &var1
  add  eax, DWORD PTR [rdx]                # load from &var1

.bss
.globl var0
var0: .long 0

Note that I specify -mno-direct-extern-access for some assembly output. The option makes the compiler prefer GOT-generating code sequences for possibly-preemptible data symbols. See -fdirect-access-external-data in Copy relocations, canonical PLT entries and protected visibility.

All of R_X86_64_PLT32, R_X86_64_PC32, and R_X86_64_REX_GOTPCRELX have a value range of [-2**31,2**31). If the referenced symbol is too far away from the relocated location, we may get a relocation overflow.
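As a rough illustration of the range check, here is a minimal Python sketch. It assumes the psABI formula S + A - P for R_X86_64_PC32 (S is the symbol address, A the addend, P the address of the relocated location). The function names are made up for this sketch and do not correspond to GNU ld or ld.lld internals.

```python
# Sketch of a linker-style overflow check for a 32-bit PC-relative relocation.
# Assumption: the stored value is S + A - P and must fit in a signed 32-bit field.
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def pc32_value(S: int, A: int, P: int) -> int:
    """Value a linker would store for R_X86_64_PC32."""
    return S + A - P


def check_pc32(S: int, A: int, P: int) -> int:
    """Return the relocation value, or raise if it does not fit in 32 bits."""
    value = pc32_value(S, A, P)
    if not (INT32_MIN <= value <= INT32_MAX):
        raise OverflowError(
            f"relocation R_X86_64_PC32 out of range: {value} is not in "
            f"[{INT32_MIN}, {INT32_MAX}]"
        )
    return value
```

With S = 0, A = -4, and P = 2147483645, the value is -2147483649, one below the lower bound, which matches the number in the ld.lld diagnostic shown earlier.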

In practice, relocation overflows due to code referencing code are not common. The more frequent occurrences of overflows involve the following categories (where we use .text to represent code sections, .rodata for read-only data, and so on):

.text <-> .rodata
.text <-> .eh_frame: .eh_frame has 32-bit offsets. 64-bit code offsets are possible, but I don't know if an implementation exists.
.text <-> .data/.bss
.rodata <-> .data/.bss

In many programs, .text <-> .data/.bss relocations have the most stringent constraints. Overflows due to .text <-> .rodata relocations are possible but rare (although I have encountered such issues in the past).

.rodata <-> .data/.bss overflows are generally infrequent. However, caution must be exercised when emitting metadata: use .quad label-. instead of .long label-., since the latter is a 32-bit PC-relative offset that can overflow. Such issues can be easily addressed on the compiler side.

On x86-64, linkers optimize some GOT-indirect instructions (R_X86_64_REX_GOTPCRELX; e.g. movq var@GOTPCREL(%rip), %rax) to PC-relative instructions. The distance between a code section and .got is usually smaller than the distance between a code section and .data/.bss. ld.lld's one-pass relocation scanning scheme has a limitation: if it decides to suppress a GOT entry and it turns out that optimizing the instruction will lead to relocation overflow, the decision cannot be reverted. It should be easy to work around the issue with -Wl,--no-relax.

The current section layout of ld.lld is as follows:

.rodata
.text
.data
.bss

One notable distinction from GNU ld is that .rodata precedes .text. This ordering decreases the distance between .text and .data/.bss, thereby alleviating relocation overflow pressure for references from .text to .data/.bss.
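To make the effect of the ordering concrete, the following Python sketch compares the worst-case span from the start of .text to the end of .bss under two orderings. The section sizes are invented for illustration, and alignment padding and all other sections are ignored, so treat it as back-of-the-envelope arithmetic rather than a model of either linker.

```python
MiB = 2**20

# Hypothetical section sizes for an oversized executable (not measured data).
SIZES = {".rodata": 600 * MiB, ".text": 1200 * MiB, ".data": 100 * MiB, ".bss": 400 * MiB}

TEXT_FIRST = [".text", ".rodata", ".data", ".bss"]    # .text precedes .rodata
RODATA_FIRST = [".rodata", ".text", ".data", ".bss"]  # the ld.lld order shown above


def text_to_bss_span(layout, sizes):
    """Bytes from the first byte of .text to the end of .bss.

    Assumes sections are laid out contiguously in the given order and that
    .text precedes .bss.
    """
    addr = 0
    start, end = {}, {}
    for name in layout:
        start[name] = addr
        addr += sizes[name]
        end[name] = addr
    return end[".bss"] - start[".text"]


def fits_pc32(span):
    """True if a 32-bit PC-relative reference can cover the whole span."""
    return span < 2**31


span_text_first = text_to_bss_span(TEXT_FIRST, SIZES)      # 2300 MiB
span_rodata_first = text_to_bss_span(RODATA_FIRST, SIZES)  # 1700 MiB
```

With these made-up sizes, placing .rodata between .text and .data pushes the span past 2GiB, while placing .rodata first keeps it within range.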

For handling .eh_frame, I suggest compiling with the -fno-asynchronous-unwind-tables option. .eh_frame is used by runtime support for C++ exceptions. Many programs don't utilize exceptions, making .eh_frame non-mandatory. For profiling purposes, the limitations of the .eh_frame format have become apparent, and it is not the most suitable unwinding format.

x86-64 code models

The x86-64 psABI defines multiple code models. Here is a summary:

Small: symbols are required to be located within the range [0, 2**31 - 2**24). Use 32-bit PC-relative or absolute addressing.
Kernel: similar to the small code model, but symbols are located in the high end of the address space.
Medium: keep using 32-bit offsets for code and GOT, but split data sections into 2 parts: regular and large. Large data can be more than 2GiB away.
Large: all of code, GOT, and data can be more than 2GiB away.

x86-64 medium code model

The medium code model maintains the assumption that code and the GOT are within the ±2GiB range from the program counter, while allowing data to be located outside of that range. Data that resides outside the range is placed in large data sections such as .lrodata, .ldata, and .lbss, as well as .ldata's variants like .ldata.rel, .ldata.rel.local, .ldata.rel.ro, and .ldata.rel.ro.local.

These sections may have the SHF_X86_64_LARGE flag.

-mlarge-data-threshold decides whether a data section should be treated as large.
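The decision can be sketched as follows in Python. This assumes the rule documented for GCC, that objects larger than the threshold go to large data sections, with a default threshold of 65535. Corner cases such as objects of unknown size are not modeled. The function name and parameters are hypothetical, and the flag values are the standard ELF constants.

```python
# ELF section flag constants.
SHF_WRITE = 0x1
SHF_ALLOC = 0x2
SHF_X86_64_LARGE = 0x10000000

DEFAULT_LARGE_DATA_THRESHOLD = 65535  # assumption: GCC's documented default


def data_section_for(size, writable, zero_init,
                     threshold=DEFAULT_LARGE_DATA_THRESHOLD):
    """Pick a section name and flags for a data object under -mcmodel=medium.

    Simplified sketch: an object is "large" when size > threshold.
    """
    large = size > threshold
    if not writable:
        base = ".rodata"
    elif zero_init:
        base = ".bss"
    else:
        base = ".data"
    # .rodata -> .lrodata, .data -> .ldata, .bss -> .lbss
    name = ".l" + base[1:] if large else base
    flags = SHF_ALLOC
    if writable:
        flags |= SHF_WRITE
    if large:
        flags |= SHF_X86_64_LARGE
    return name, flags
```

For example, the 4-byte var0 lands in .bss with the default threshold, and in .lbss with -mlarge-data-threshold=3, which is why the assembly listings in this section use that option.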

Accessing code and GOT-indirect data has the same code sequence as the small code model.

To access data without GOT indirection (usually a known non-preemptible symbol, e.g. var0), GCC obtains the address of the GOT base symbol _GLOBAL_OFFSET_TABLE_, then adds the offset from _GLOBAL_OFFSET_TABLE_ to the symbol.

# gcc -S -O1 -fpie -mcmodel=medium -mlarge-data-threshold=3 -masm=intel a.c
call    callee@PLT                        # R_X86_64_PLT32
lea     rdx, _GLOBAL_OFFSET_TABLE_[rip]   # rdx = &_GLOBAL_OFFSET_TABLE_
movabs  rcx, OFFSET FLAT:var0@GOTOFF      # R_X86_64_GOTOFF64; rcx = &var0 - &_GLOBAL_OFFSET_TABLE_
add     eax, DWORD PTR [rcx+rdx]          # load from &var0
mov     rdx, QWORD PTR var1@GOTPCREL[rip] # R_X86_64_REX_GOTPCRELX; rdx = .got[n] = &var1
add     eax, DWORD PTR [rdx]              # load from &var1

The relocation type R_X86_64_GOTOFF64, and another type that we will see below, R_X86_64_PLTOFF64, have confusing names. If you haven't encountered these relocation types before, don't think too hard about how they are related to the "GOT" concept.
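In both cases the GOT base merely serves as an anchor address. A small Python sketch of the arithmetic, using the psABI formulas R_X86_64_GOTOFF64 = S + A - GOT and R_X86_64_PLTOFF64 = L + A - GOT (L is the address of the function or its PLT entry), may help. The addresses below are invented, and 64-bit wraparound is modeled with a mask.

```python
MASK64 = 2**64 - 1


def gotoff64(S, A, GOT):
    """64-bit offset of a symbol from _GLOBAL_OFFSET_TABLE_ (no GOT entry involved)."""
    return (S + A - GOT) & MASK64


def pltoff64(L, A, GOT):
    """64-bit offset of a function (or its PLT entry) from _GLOBAL_OFFSET_TABLE_."""
    return (L + A - GOT) & MASK64


def add64(base, offset):
    """What `add`/`[rcx+rdx]` computes at run time: base + offset modulo 2**64."""
    return (base + offset) & MASK64


# Hypothetical addresses: var0 is far above the GOT, callee is below it.
GOT_BASE = 0x0000_0000_0040_3000
VAR0 = 0x0000_0002_0000_1000
CALLEE = 0x0000_0000_0040_1000

var0_off = gotoff64(VAR0, 0, GOT_BASE)
callee_off = pltoff64(CALLEE, 0, GOT_BASE)  # wraps around: callee < GOT_BASE
```

Because the field is 64 bits wide, the offset always fits, and adding it back to the GOT base recovers the symbol address even when the symbol is below the GOT.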

For position-dependent code, accessing data without GOT indirection is simplified as we can just use absolute addressing.

# gcc -S -O1 -fno-pic -mcmodel=medium -mno-direct-extern-access -mlarge-data-threshold=3 -masm=intel a.c
call    callee                            # R_X86_64_PLT32
mov     edx, eax
movabs  eax, DWORD PTR [var0]             # R_X86_64_64; load from &var0
add     eax, edx
mov     rdx, QWORD PTR var1@GOTPCREL[rip] # R_X86_64_REX_GOTPCRELX; rdx = .got[n] = &var1
add     eax, DWORD PTR [rdx]              # load from &var1

Note that we use -mno-direct-extern-access in the above snippet, otherwise GCC's x86-64 port doesn't use GOT indirection for var1. This will lead to a copy relocation if var1 is defined in a shared object.

x86-64 large code model

In the large code model, we no longer assume that GOT is within the ±2GiB range from the program counter, so lea rdx, _GLOBAL_OFFSET_TABLE_[rip] cannot be used. An extra movabs instruction is needed to obtain the address of _GLOBAL_OFFSET_TABLE_.

Similarly, for a function call, we no longer assume that the address of the function or its PLT entry is within the ±2GiB range from the program counter, so call callee cannot be used.

Actually, call callee can still be used if we implement range extension thunks in the linker. Unfortunately, GCC/GNU ld did not pursue this direction.

# gcc -S -O1 -fpie -mcmodel=large -masm=intel a.c
.L2:
lea     r15, .L2[rip]                     # r15 = &.L2
movabs  r11, OFFSET FLAT:_GLOBAL_OFFSET_TABLE_-.L2  # R_X86_64_GOTPC64; r11 = &_GLOBAL_OFFSET_TABLE_ - &.L2
add     r15, r11                          # r15 = &_GLOBAL_OFFSET_TABLE_
mov     eax, 0
movabs  rdx, OFFSET FLAT:callee@PLTOFF    # R_X86_64_PLTOFF64; rdx = (the address of callee or its PLT) - &_GLOBAL_OFFSET_TABLE_
add     rdx, r15                          # rdx = the address of callee or its PLT
call    rdx                               # indirectly call callee
movabs  rdx, OFFSET FLAT:var0@GOTOFF      # R_X86_64_GOTOFF64; rdx = &var0 - &_GLOBAL_OFFSET_TABLE_
add     eax, DWORD PTR [rdx+r15]          # load from &var0
movabs  rdx, OFFSET FLAT:var1@GOT         # R_X86_64_GOT64; rdx = &.got[n] - _GLOBAL_OFFSET_TABLE_
mov     rdx, QWORD PTR [r15+rdx]          # rdx = .got[n] = &var1
add     eax, DWORD PTR [rdx]              # load from &var1

For position-dependent code, GCC uses absolute addressing and does not rely on _GLOBAL_OFFSET_TABLE_. The code sequence for accessing data symbols is the same as that used in the position-dependent medium code model.

# gcc -S -O1 -fno-pic -mcmodel=large -masm=intel a.c
movabs  rdx, OFFSET FLAT:callee           # R_X86_64_64; obtain the address of callee; canonical PLT entry if defined in a DSO
call    rdx
mov     edx, eax
movabs  eax, DWORD PTR [var0]             # R_X86_64_64; load from &var0
add     eax, edx
movabs  rdx, QWORD PTR [var1]             # R_X86_64_64; load from &var1; copy relocation if var1 is defined in a DSO
add     eax, edx

x86-64 linker requirement

Linkers are expected to recognize these large data sections and place them in appropriate locations. GNU ld uses the following section layout in its internal linker scripts:

.text
.rodata   # if -z separate-code, MAXPAGESIZE alignment
RELRO     # DATA_SEGMENT_ALIGN
.data     # DATA_SEGMENT_RELRO_END
.bss
.lbss
.lrodata  # MAXPAGESIZE alignment
.ldata    # MAXPAGESIZE alignment

GNU ld places .lbss, .lrodata, .ldata after .bss and inserts 2 MAXPAGESIZE alignments for .lrodata and .ldata.

.lbss is placed immediately after .bss, creating a single BSS, i.e. a read-write PT_LOAD program header with p_filesz<p_memsz. The file image in the segment is smaller than the memory image. When the dynamic loader creates the memory image for the PT_LOAD segment, it will set the byte range [p_filesz,p_memsz) to zeros. However, there is a missing optimization that the [p_filesz,p_memsz) portion occupies zero bytes in the object file, preventing overlap between .lrodata and BSS in the file image.
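The zero-filling behavior can be sketched in a few lines of Python. This is a simplified model of what a loader does for one PT_LOAD segment. It ignores page alignment, protections, and mmap, and the function name is made up.

```python
def load_segment(file_bytes: bytes, p_offset: int, p_filesz: int, p_memsz: int) -> bytearray:
    """Build the memory image of a PT_LOAD segment.

    The first p_filesz bytes come from the file; [p_filesz, p_memsz) is zero
    (this is where .bss and .lbss live).
    """
    if p_filesz > p_memsz:
        raise ValueError("p_filesz must not exceed p_memsz")
    image = bytearray(p_memsz)  # bytearray(n) is zero-initialized
    image[:p_filesz] = file_bytes[p_offset:p_offset + p_filesz]
    return image


# Toy "file": 3 header bytes followed by 4 bytes of initialized data.
toy_file = b"HDR" + b"data"
image = load_segment(toy_file, p_offset=3, p_filesz=4, p_memsz=10)
```

Here the segment occupies 4 bytes in the file but 10 bytes in memory, and the 6 trailing bytes are zeros.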

ld.lld from 17 onwards uses the following section layout:

.lrodata
.rodata
.text     # if --ro-segment, MAXPAGESIZE alignment
RELRO     # MAXPAGESIZE alignment
.data     # MAXPAGESIZE alignment
.bss
.ldata    # MAXPAGESIZE alignment
.lbss

In both section layouts, .rodata, .text, .data, and .bss are not interspersed with large data sections.

We have mentioned that when using -mcmodel=medium, GCC generates both regular and large data sections. In practice, programs often include a mix of object files built with small and medium/large code models. The small code model components may come from prebuilt object files (e.g. libc). The large data sections do not exert relocation pressure on sections in object files built with -mcmodel=small.

However, GCC only generates regular data sections with -mcmodel=large. -mlarge-data-threshold is ignored. As a result, the data sections built with -mcmodel=large may exert relocation pressure on sections in object files with -mcmodel=small.

I propose that we make -mcmodel=large respect -mlarge-data-threshold and generate large data sections as well. I posted a GCC patch and Uros Bizjak asked me to discuss the issue with the x86-64 psABI group. So I created "Large data sections for the large code model".

The psABI mentions .ltext, .lgot, and .lplt, but GNU ld doesn't do anything with these sections and GCC doesn't generate .ltext sections. I don't think that .ltext is necessary. We just need to implement range extension thunks in the rare case that the ±2GiB range of call callee becomes a problem.

I implemented https://sourceware.org/PR30592 so with binutils 2.42 you can use objcopy --set-section-flags .lrodata=alloc,readonly,large --set-section-flags .ldata=alloc,large a.o b.o to set the SHF_X86_64_LARGE flag for large data sections. I filed a feature request "objcopy --set-section-flags: support toggling a flag".

AArch64 code models

Likewise, the psABI defines multiple code models. The small code model allows for a maximum text segment size of 2GiB and a maximum combined span of text and data segments of 4GiB. For small position-independent code (pic), there is an additional restriction on the size of the Global Offset Table (GOT), which must be smaller than 32KiB. The maximum combined span of text and data segments is larger than that of x86-64.

Linked object file sizes for AArch64 and x86-64 are comparable, but AArch64 linked object files are more resistant to relocation overflows.

For data references from code, x86-64 uses R_X86_64_REX_GOTPCRELX/R_X86_64_PC32 relocations, which have a smaller range [-2**31,2**31). In contrast, AArch64 employs R_AARCH64_ADR_PREL_PG_HI21 relocations, which have a doubled range of [-2**32,2**32). This larger range makes it unlikely for AArch64 to encounter relocation overflow issues before the binary becomes excessively oversized for x86-64.

bl      callee               // R_AARCH64_CALL26; [-2**27, 2**27)
adrp    x8, var0             // R_AARCH64_ADR_PREL_PG_HI21; [-2**32,2**32)
ldr     w8, [x8, :lo12:var0] // R_AARCH64_LDST32_ABS_LO12_NC
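The doubled range comes from adrp working in units of 4KiB pages with a 21-bit signed immediate. A Python sketch of the check follows, assuming the AArch64 ELF formula Page(S+A) - Page(P) for R_AARCH64_ADR_PREL_PG_HI21. The helper names are invented for this sketch.

```python
def page(addr: int) -> int:
    """Clear the low 12 bits, i.e. round down to a 4KiB page boundary."""
    return addr & ~0xFFF


def adr_prel_pg_hi21(S: int, A: int, P: int) -> int:
    """Return the signed page count encoded into adrp, or raise on overflow."""
    delta = page(S + A) - page(P)
    if not (-2**32 <= delta < 2**32):
        raise OverflowError(f"R_AARCH64_ADR_PREL_PG_HI21 out of range: {delta}")
    return delta >> 12  # fits in a 21-bit signed immediate


def fits_x86_pc32(S: int, A: int, P: int) -> bool:
    """For comparison: the x86-64 32-bit PC-relative range."""
    return -2**31 <= S + A - P < 2**31


GiB = 2**30
far_symbol = 3 * GiB  # hypothetical symbol 3GiB away from the instruction at address 0
pages = adr_prel_pg_hi21(far_symbol, 0, 0)
```

A symbol 3GiB away is reachable with adrp but not with a 32-bit PC-relative x86-64 reference. The low 12 bits are supplied separately by the :lo12: operand of the following ldr.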

For function calls, the shorter range of R_AARCH64_CALL26 doesn't matter. The linker will generate range extension thunks if callee is not directly reachable.

  b       __AArch64AbsLongThunk_callee
...
__AArch64AbsLongThunk_callee:
  ldr     x16, .+8
  br      x16
  .xword  callee
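The decision the linker makes can be sketched as follows in Python. This only models the range test for a single call site. Real linkers also have to decide where to place thunks and may iterate because inserting thunks moves code. The function names and addresses are hypothetical.

```python
CALL26_MIN = -2**27  # bl/b reach: [-2**27, 2**27) bytes, i.e. ±128MiB
CALL26_MAX = 2**27


def call26_in_range(S: int, A: int, P: int) -> bool:
    """True if a direct branch at P can reach S + A."""
    return CALL26_MIN <= S + A - P < CALL26_MAX


def branch_target(callee: int, call_site: int, thunk: int) -> int:
    """Pick the direct target if reachable, otherwise a nearby thunk.

    The thunk itself must be within range of the call site; it then loads the
    64-bit callee address and branches to it (as in the listing above).
    """
    if call26_in_range(callee, 0, call_site):
        return callee
    if not call26_in_range(thunk, 0, call_site):
        raise OverflowError("thunk is not reachable from the call site")
    return thunk


MiB = 2**20
near = branch_target(callee=64 * MiB, call_site=0, thunk=1 * MiB)
far = branch_target(callee=512 * MiB, call_site=0, thunk=1 * MiB)
```

In this example the callee 64MiB away is called directly, while the callee 512MiB away is reached through the thunk placed 1MiB from the call site.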

GCC and Clang implement neither -mcmodel=large for PIC nor -mcmodel=medium. This makes sense as we haven't identified a use case for the unimplemented models yet.

Nevertheless, let's see the code sequence for the position-dependent large code model.

// aarch64-linux-gnu-gcc -S -O1 -mcmodel=large -fno-pic a.c
bl      callee
adrp    x1, .LC0
ldr     x1, [x1, #:lo12:.LC0]
ldr     w1, [x1]
add     w0, w0, w1
adrp    x1, .LC1
ldr     x1, [x1, #:lo12:.LC1]
ldr     w1, [x1]
add     w0, w0, w1
...
        .align  3
.LC0:
        .xword  .LANCHOR0
        .align  3
.LC1:
        .xword  var1
        .global var0

bl callee is still used since we can use linker-synthesized range extension thunks. Accessing data symbols requires indirection through a 64-bit absolute address stored in a literal pool entry (.LC0 and .LC1 above).

Power architecture code models

Object file sizes are usually much larger than x86-64's.

GCC defaults to the medium code model, which is like x86-64 and AArch64's small code model. Data symbols and TOC/GOT entries are assumed to be within the [-0x80008000, 0x7fff8000) range from the TOC.

# powerpc64le-linux-gnu-gcc -S -O1 -fpie -mcmodel=medium -mcpu=power10 a.c
bl callee@notoc            # R_PPC64_REL24_NOTOC
pld 9,var1@got@pcrel       # R_PPC64_GOT_PCREL34; r9 = .got[n] = &var1
lwz 10,0(9)                # load from &var1
plwz 8,.LANCHOR0@pcrel     # R_PPC64_PCREL34; load from &var0
add 9,10,8
add 9,9,3
extsw 3,9

In the large code model, GCC simply uses GOT-indirect addressing to access data symbols, including the non-preemptible ones. This maintains the assumption that the GOT entries are within the [-0x80008000, 0x7fff8000) range from the TOC, so the large code model is more limited than x86-64's.

addis 9,2,.LC0@toc@ha      # R_PPC64_TOC16_HA; [-0x80008000, 0x7fff8000)
ld 9,.LC0@toc@l(9)         # R_PPC64_TOC16_LO_DS
lwz 8,0(9)
addis 9,2,.LC1@toc@ha      # R_PPC64_TOC16_HA; [-0x80008000, 0x7fff8000)
ld 9,.LC1@toc@l(9)         # R_PPC64_TOC16_LO_DS
lwz 10,0(9)
add 9,10,8
add 9,9,3
extsw 3,9
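The asymmetric range [-0x80008000, 0x7fff8000) follows from how addis and the 16-bit displacement combine. The Python sketch below assumes the usual @ha/@l split, where the low half is sign-extended and the high half is adjusted to compensate. It ignores the extra alignment requirement of the DS-form ld, and the helper names are made up.

```python
def sign_extend16(x: int) -> int:
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def split_toc_offset(offset: int):
    """Split a TOC-relative offset into (ha, lo) with offset == (ha << 16) + lo.

    ha must fit in the signed 16-bit immediate of addis, which is what limits
    the offset to [-0x80008000, 0x7fff8000).
    """
    ha = (offset + 0x8000) >> 16  # "high adjusted": rounds to compensate for signed lo
    lo = sign_extend16(offset)
    if not (-0x8000 <= ha <= 0x7FFF):
        raise OverflowError(f"TOC offset {offset:#x} out of range")
    assert (ha << 16) + lo == offset
    return ha, lo
```

For instance, split_toc_offset(0x18000) yields ha = 2 and lo = -0x8000, and any offset at or above 0x7fff8000 is rejected because ha would no longer fit in 16 bits.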

Mitigation

There are several strategies to mitigate relocation overflow issues.

Make the program smaller by reducing code and data size.
Partition the large monolithic executable into the main executable and a few shared objects.
Switch to the medium code model.
Use compiler options such as -Os, -Oz and link-time optimization that focus on decreasing the code size.
For compiler instrumentations (e.g. -fsanitize=address, -fprofile-generate), move some data to large data sections.
Use linker script commands INSERT BEFORE and INSERT AFTER to reorder output sections.

Debug information

For large executables, it is possible to encounter DWARF32 limitations (e.g. relocation R_X86_64_32 out of range). I will address this topic in another article.