This article describes target-specific details about Power ISA in ELF linkers. Initially there was IBM POWER. The 1991 Apple–IBM–Motorola alliance created PowerPC. In 2006, the architecture was rebranded as Power ISA. According to the ISA manual, "In 2006, Freescale and IBM collaborated on the creation of the Power ISA Version 2.03, which represented the reunification of the architecture by combining Book E content with the more general purpose PowerPC Version 2.02."
The terms "PowerPC" and "powerpc" remain popular in numerous places, including the powerpc-*-*-* and powerpc64-*-*-* in official target triple names. The abbreviation "PPC" ("ppc") is used in numerous places as well. For simplicity, I will refer to the 32-bit architecture as "PPC32" and the 64-bit architecture as "PPC64".
We will see how the lack of PC-relative addressing before Power10 has caused great complexity in the ABI and linkers.
The 32-bit ELF ABI receives little attention from maintainers and remains relevant only among some enthusiasts. In 2019, I spent one week studying the PPC32 ABI and added the PPC32 port to ld.lld.
For a 64-bit object file, the presence of a section .opd is a good indicator for ELFv1. e_flags being 2 is a good indicator for ELFv2. e_flags being 0 is either an ELFv1 object file, or an object file not using any feature affected by the differences.
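The heuristics above can be sketched in a few lines of Python. This is only an illustration, not how any particular linker does it: `read_e_flags` and `classify_ppc64_abi` are hypothetical helpers, the sketch assumes the ABI version sits in the low two bits of e_flags (mask 3, where 1 means ELFv1 and 2 means ELFv2), and it assumes the caller has already checked whether a .opd section exists.

```python
import struct

EF_PPC64_ABI = 3  # assumption: low two bits of e_flags hold the ABI version


def read_e_flags(header: bytes) -> int:
    """Return e_flags from the first 64 bytes of an ELF64 file."""
    if header[:4] != b"\x7fELF" or header[4] != 2:  # EI_CLASS == ELFCLASS64
        raise ValueError("not an ELF64 file")
    endian = "<" if header[5] == 1 else ">"  # EI_DATA: 1 = LSB, 2 = MSB
    # e_flags sits at byte offset 48 in the ELF64 header.
    return struct.unpack_from(endian + "I", header, 48)[0]


def classify_ppc64_abi(e_flags: int, has_opd: bool) -> str:
    """Apply the heuristics from the text to one object file."""
    abi = e_flags & EF_PPC64_ABI
    if abi == 2:
        return "ELFv2"
    if abi == 1 or has_opd:
        return "ELFv1"
    return "unspecified (ELFv1, or no ABI-dependent feature used)"
```

For example, `classify_ppc64_abi(0, has_opd=True)` reports ELFv1 even though e_flags alone is inconclusive.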
A new ABI for little-endian PowerPC64 Design & Implementation (2014) describes the motivation for introducing ELFv2.
On PPC32, _GLOBAL_OFFSET_TABLE_ is defined at the start of the section .got. The section has 3 reserved entries. _GLOBAL_OFFSET_TABLE_[0] stores the link-time address of _DYNAMIC, which is used by glibc sysdeps/powerpc/powerpc32/dl-machine.h. _GLOBAL_OFFSET_TABLE_[1] and _GLOBAL_OFFSET_TABLE_[2] are for lazy binding PLT (_dl_runtime_resolve and link map in glibc).
.plt is like .got.plt for other architectures. .plt[n] holds the address of a PLT entry (somewhere in .glink).
Like x86-32, PPC32 lacks memory load with PC-relative addressing. As a poor man's replacement, PPC32 sets up r30 to hold a GOT base for position-independent code (PIC). The GOT base is different for small PIC and large PIC.
For -fpic and -fpie (small PIC), r30 refers to _GLOBAL_OFFSET_TABLE_ in the component.
For -fPIC and -fPIE (large PIC), r30 refers to .got2 for the current translation unit. This has implications for PLT-generating relocations as we will see below.
.section ".got2","aw"
The component may have multiple translation units and each has a different .got2. In the output file, .got2 in one file may have an arbitrary offset relative to the output .got2.
On PPC64, .got has 1 reserved entry: the link-time address of the symbol .TOC. This symbol is defined at the start of the section .got plus 0x8000.
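One way to read the 0x8000 bias: loads such as ld take a signed 16-bit displacement, so a base in the middle of a 64KiB window reaches the whole window with a single instruction. The sketch below checks which addresses are reachable from the TOC base with one load. The helper name and the addresses are made up for illustration.

```python
TOC_BIAS = 0x8000  # .TOC. = start of .got + 0x8000, per the text


def reachable_with_one_load(got_start: int, entry_addr: int) -> bool:
    """True if entry_addr is within a signed 16-bit displacement of .TOC."""
    toc_base = got_start + TOC_BIAS
    disp = entry_addr - toc_base
    return -0x8000 <= disp <= 0x7FFF


got = 0x10020000  # hypothetical address of .got
print(reachable_with_one_load(got, got))            # first slot, disp = -0x8000
print(reachable_with_one_load(got, got + 0xFFF8))   # last 8-byte slot in the window
print(reachable_with_one_load(got, got + 0x10000))  # just past the window
```

Without the bias, only the upper half of the displacement range would be usable and a single load could address 32KiB instead of 64KiB.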
.plt is like .got.plt for other architectures. .plt has the type SHT_NOBITS and an alignment of 4.
Before Power10, PPC64 uses .toc instead of .got to hold the addresses of global variables and address-taken functions. This is different from most architectures.
extern int var0, var1;
The above C program compiles to the following assembly:
foo:
.Lfunc_begin0:
.Lfunc_gep0:
addis 2, 12, .TOC.-.Lfunc_gep0@ha
addi 2, 2, .TOC.-.Lfunc_gep0@l
.Lfunc_lep0:
.localentry foo, .Lfunc_lep0-.Lfunc_gep0
addis 3, 2, .LC0@toc@ha
addis 4, 2, .LC1@toc@ha
ld 3, .LC0@toc@l(3)
ld 4, .LC1@toc@l(4)
lwz 3, 0(3)
lwz 4, 0(4)
add 3, 4, 3
extsw 3, 3
blr
.section .toc,"aw",@progbits
.LC0:
.tc var0[TC],var0
.LC1:
.tc var1[TC],var1
foo has a global entry foo/.Lfunc_gep0 and a local entry .Lfunc_lep0. After the local entry, r2 holds the address of the TOC base of the current component.
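The @ha/@l pair used by the global entry splits a 32-bit delta so that an addis followed by an addi (or a load with displacement) reconstructs it. The sketch below models that arithmetic. `ha`, `lo` and `addis_addi` are illustrative names rather than any tool's API, and the addresses in the example are invented.

```python
def lo(x: int) -> int:
    """@l: low 16 bits, which addi/ld/lwz treat as a signed value."""
    v = x & 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def ha(x: int) -> int:
    """@ha: high 16 bits, adjusted to compensate for the sign of @l."""
    v = ((x + 0x8000) >> 16) & 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def addis_addi(base: int, delta: int) -> int:
    """Model `addis rT, base, delta@ha; addi rT, rT, delta@l`."""
    return base + (ha(delta) << 16) + lo(delta)


# r12 holds the global entry address; the two instructions yield .TOC.
gep = 0x10000400  # hypothetical address of .Lfunc_gep0
toc = 0x10028000  # hypothetical address of .TOC.
assert addis_addi(gep, toc - gep) == toc
```

The `+ 0x8000` in `ha` matters whenever the low half has its top bit set: the low part then contributes a negative amount, so the high part must be one larger.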
If foo and a caller of foo are in the same component, the caller may branch directly to the local entry, skipping a few instructions starting at the global entry (usually 2). Otherwise, the caller needs to branch to the global entry so that foo will update r2 itself. This update requires that r12 points to the function entry address. We will see that maintaining r2 and r12 causes a lot of trouble in sections diving into call stubs.
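The distance between the two entry points is recorded in the symbol's st_other field. A minimal decoder might look like the following, assuming the ELFv2 encoding in which the three high bits of st_other hold the .localentry field. `local_entry_offset` is a hypothetical name.

```python
STO_PPC64_LOCAL_BIT = 5
STO_PPC64_LOCAL_MASK = 0xE0  # three high bits of st_other


def local_entry_offset(st_other: int) -> int:
    """Bytes from the global entry to the local entry of a function."""
    field = (st_other & STO_PPC64_LOCAL_MASK) >> STO_PPC64_LOCAL_BIT
    if field < 2:
        return 0  # 0 or 1: the local and global entries coincide
    if field == 7:
        raise ValueError("reserved .localentry encoding")
    return 1 << field  # 2..6 -> 4, 8, 16, 32, 64 bytes


# foo above has a two-instruction global entry prologue: field 3, offset 8.
assert local_entry_offset(3 << STO_PPC64_LOCAL_BIT) == 8
```

A linker resolving a same-component call can add this offset to the symbol value to skip the r2 setup.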
Another difference is the explicit mention of .toc. This scheme gives the compiler control within the translation unit. With the traditional GOT scheme, input files do not mention .got. The compiler does not control how the linker will lay out .got. Well, I disagree with the presumed advantage of .toc: the compiler does not know the global information, and the translation unit local layout may not be ideal. A linker is better placed to do such link-time optimization.
A .tc directive is a fancy way to produce a relocation of type R_PPC64_ADDR64. If the linker decides to create a TOC entry, the entry will be a link-time constant (-no-pie) or be associated with a dynamic relocation (-pie or -shared).
See All about Global Offset Table#GOT optimization.
Power Architecture® 32-bit Application Binary Interface Supplement 1.0 - Linux® & Embedded specifies two PLT ABIs: BSS-PLT and Secure-PLT.
BSS-PLT is the older method, which is now obsolete. While .plt on other architectures is created by the linker, BSS-PLT lets ld.so generate the PLT entries. This has the advantage that the section can be made SHT_NOBITS and therefore not occupy file size. However, the downside is the security concern of writable and executable memory pages. Even worse, as an implementation issue, GNU ld places .plt in the text segment, making the whole text segment writable and executable. This renders -z relro -z now ineffective.
In the newer Secure-PLT ABI, .plt holds the table of function addresses. .plt is like .got.plt for other architectures.
The linker synthesizes .glink, which is like .plt for other architectures. Unlike most architectures, .glink has a footer rather than a header. Each PLT entry is either b footer or a nop falling through to the footer. In ld.lld, we only use b footer for simplicity. See https://reviews.llvm.org/D75394 for PPC32GlinkSection in ld.lld.
000102b4 <.glink>:
For non-PIC code, a possibly preemptible branch uses the relocation type R_PPC_REL24.
bl foo # R_PPC_REL24
bl foo # R_PPC_REL24
If the call target is preemptible, the linker creates a non-PIC call stub and redirects the caller's branch instruction to the call stub. The non-PIC call stub will use absolute addressing to load .plt[n] into r11 (call-clobbered) and branch there. This behavior is different from most other architectures where the caller can branch directly to the PLT entry.
bl 00000000.plt_call32.f
bl 00000000.plt_call32.f
...
00000000.plt_call32.f:
lis 11, .plt[n]@ha
lwz 11, .plt[n]@l(11)
mtctr 11
bctr
For PIC code, a branch to a possibly preemptible target uses R_PPC_PLTREL24 as the PLT-generating relocation type. The addend encodes r30 set up by the caller. Yes, this is unusual.
For -fpic and -fpie, the addend is 0.
For -fPIC and -fPIE, the addend is 0x8000. Linking this relocatable object file in -r mode may increase the addend.
When calling a function, if the target is preemptible, the linker creates a PIC call stub and redirects the caller's branch instruction to the call stub. GNU ld names small PIC call stubs *.plt_pic32.* and large PIC call stubs *.got2.plt_pic32.*. ld.lld follows the naming convention.
A call stub knows the value of r30 (GOT base) set up by the caller. The distance from .plt[n] to r30 is a constant. The call stub computes the address of .plt[n], loads the entry, and branches there.
00000000.plt_pic32.f:
## If the GOT offset is beyond 64KiB
addis 11, 30, .plt[n]-_GLOBAL_OFFSET_TABLE_@ha
lwz 11, .plt[n]-_GLOBAL_OFFSET_TABLE_@l(11)
mtctr 11
bctr
## If the GOT offset is within 64KiB
# lwz 11, .plt[n]-_GLOBAL_OFFSET_TABLE_(30)
# mtctr 11
# bctr
# nop
00000000.got2.plt_pic32.f:
## .got2 refers to the copy belonging to the current translation unit.
## Different translation units have to use different stubs.
addis 11, 30, .plt[n]-(.got2+0x8000)@ha
lwz 11, .plt[n]-(.got2+0x8000)@l(11)
mtctr 11
bctr
## The case when the GOT offset is within 64KiB is similar to plt_pic32.f.
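The choice between the one-load and the addis+load form can be sketched as follows. This is an illustrative model rather than GNU ld's or ld.lld's code: `pic_call_stub` and `got_base` are hypothetical helpers, the addresses are invented, and the stub is returned as text instead of encoded instructions.

```python
def ha(x: int) -> int:
    """High-adjusted 16 bits (@ha), as a signed value."""
    v = ((x + 0x8000) >> 16) & 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def lo(x: int) -> int:
    """Low 16 bits (@l), sign-extended as lwz would interpret them."""
    v = x & 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def got_base(small_pic: bool, got_symbol: int, got2_of_file: int) -> int:
    """r30 as set up by the caller: _GLOBAL_OFFSET_TABLE_ or .got2+0x8000."""
    return got_symbol if small_pic else got2_of_file + 0x8000


def pic_call_stub(plt_slot: int, r30: int) -> list:
    """Return the stub body that loads .plt[n] relative to r30."""
    off = plt_slot - r30
    if ha(off) == 0:  # the offset fits a signed 16-bit displacement
        return [f"lwz 11, {lo(off)}(30)", "mtctr 11", "bctr", "nop"]
    # addis leaves the high part in r11, so the load is based on r11.
    return [f"addis 11, 30, {ha(off)}", f"lwz 11, {lo(off)}(11)",
            "mtctr 11", "bctr"]


print(pic_call_stub(0x20010, got_base(True, 0x20000, 0)))
print(pic_call_stub(0x40010, got_base(True, 0x20000, 0)))
```

Because r30 differs per translation unit under large PIC, the same PLT slot yields a different offset, and therefore a different stub, for each .got2 copy.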
While we have a working solution, if we revisit the scheme, we will find that setting up r30 is extremely expensive. A trivial tail call example (void foo() { bar(); }) needs numerous instructions:
<foo>:
.glink is like .plt for other architectures and has a header of 60 bytes. Each PLT entry consists of one instruction b .plt. The PLT header subtracts the address of the first PLT entry from r12 to compute the PLT index.
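The index arithmetic performed by the header amounts to a subtraction and a division by the entry size. A small sketch, assuming the entries immediately follow the 60-byte header and that r12 holds the address of the entry that was entered (both are assumptions of this illustration, as are the helper names):

```python
GLINK_HEADER_SIZE = 60  # bytes, per the text
GLINK_ENTRY_SIZE = 4    # each entry is a single branch instruction


def glink_entry_addr(glink: int, index: int) -> int:
    """Address of the index-th entry, assuming entries follow the header."""
    return glink + GLINK_HEADER_SIZE + GLINK_ENTRY_SIZE * index


def plt_index_from_r12(glink: int, r12: int) -> int:
    """Recover the PLT index the way the header does: (r12 - first) / 4."""
    first = glink + GLINK_HEADER_SIZE
    return (r12 - first) // GLINK_ENTRY_SIZE


glink = 0x10010000  # hypothetical address of .glink
assert plt_index_from_r12(glink, glink_entry_addr(glink, 5)) == 5
```

Keeping every entry to one instruction is what makes the index recoverable from the entry address alone.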
An unconditional branch instruction b/bl may produce a relocation of either R_PPC64_REL24 or R_PPC64_REL24_NOTOC. R_PPC64_REL24 indicates that the caller uses TOC. R_PPC64_REL24_NOTOC indicates that the caller does not use TOC or preserve r2.
A conditional branch instruction may produce a relocation of type R_PPC64_REL14.
All of R_PPC64_REL14, R_PPC64_REL24, and R_PPC64_REL24_NOTOC are PLT-generating relocation types. If a PLT entry is needed, the linker will create a traditional or PC-relative PLT call stub, and redirect the caller's branch instruction to the call stub. This behavior is different from most other architectures where the caller can branch directly to the PLT entry. The inefficiency comes from maintaining r2 and r12 for TOC.
There is no R_PPC64_REL14_NOTOC. R_PPC64_REL14 used by conditional branches is generally not used for function calls.
Below I will describe call stubs for TOC/NOTOC interop and for range extension in detail.
Both PPC32 and PPC64 use a variant of TLS Variant I: the static TLS blocks are placed above the thread pointer. The thread pointer points to the end of the thread control block.
The linker performs TLS optimization.
See All about thread-local storage.
R_PPC64_TLSGD or R_PPC64_TLSLD is required to mark bl __tls_get_addr for General Dynamic/Local Dynamic code sequences.
addis r3, r2, x@got@tlsgd@ha # R_PPC64_GOT_TLSGD16_HA
However, there are two deviations from the above:
A direct call to __tls_get_addr. This is essential to implement rtld in glibc/musl/FreeBSD.
bl __tls_get_addr
This is only used in a -shared link, and thus not subject to the GD/LD to IE/LE relaxation issue below.
Missing R_PPC64_TLSGD/R_PPC64_TLSLD for compiler generated TLS references.
According to Stefan Pintille, "In the early days of the transition from the ELFv1 ABI that is used for big endian PowerPC Linux distributions to the ELFv2 ABI that is used for little endian PowerPC Linux distributions, there was some ambiguity in the specification of the relocations for TLS. The GNU linker has implemented support for correct handling of calls to __tls_get_addr with a missing relocation. Unfortunately, we didn't notice that the IBM XL compiler did not handle TLS according to the updated ABI until we tried linking XL compiled libraries with LLD."
It is unfortunate, but in short ld.lld needs to work around the old IBM XL compiler issue. Otherwise, if the object file is linked in -no-pie or -pie mode, the result will be incorrect because the 4 instructions are partially rewritten (the latter 2 are not changed).
A caller using TOC marks its function calls with relocation type R_PPC64_REL24.
The caller expects that r2 does not change while the callee may alter r2. To address the issue, the compiler and the linker collaborate to preserve r2.
For a call target which may resolve to a different translation unit (e.g. non-definition declaration, hidden visibility definition), the compiler inserts a NOP after the branch instruction. A call target guaranteed to resolve to the current translation unit (e.g. internal linkage) does not need a NOP since r2 will not change.
caller:
bl foo
nop # may become `ld 2, 24(1)`
bl nonpreemptible
blr
Note: An external linkage hidden visibility call target needs a NOP as well in case the callee clobbers r2 if it does not maintain the TOC pointer.
If the callee is preemptible, the caller and the callee may be in different components.
The linker creates a PLT call stub to save r2 in the caller stack frame, and patches the nop to ld 2, 24(1) to restore r2.
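The patch itself is a one-word rewrite. The sketch below encodes ld 2, 24(1) from its DS-form fields and replaces a nop that follows a bl in a byte buffer. `encode_ld` and `patch_toc_restore` are hypothetical helpers, and the bl word in the example is just a placeholder branch encoding.

```python
import struct

NOP = 0x60000000  # ori 0, 0, 0


def encode_ld(rt: int, disp: int, ra: int) -> int:
    """Encode `ld rt, disp(ra)` (DS-form, primary opcode 58)."""
    assert disp % 4 == 0 and -0x8000 <= disp <= 0x7FFC
    return (58 << 26) | (rt << 21) | (ra << 16) | (disp & 0xFFFC)


LD_R2_24_R1 = encode_ld(2, 24, 1)  # 0xE8410018


def patch_toc_restore(code: bytearray, bl_offset: int,
                      little_endian: bool = True) -> bool:
    """Rewrite the nop after the bl at bl_offset; False if there is none."""
    fmt = "<I" if little_endian else ">I"
    (insn,) = struct.unpack_from(fmt, code, bl_offset + 4)
    if insn != NOP:
        return False  # the compiler reserved no slot here
    struct.pack_into(fmt, code, bl_offset + 4, LD_R2_24_R1)
    return True


code = bytearray(struct.pack("<II", 0x48000001, NOP))  # bl <placeholder>; nop
assert patch_toc_restore(code, 0)
```

If the instruction after the bl is not a nop, the sketch refuses to patch, which mirrors why the compiler must reserve the slot up front.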
<caller>:
localentry=1
A non-TOC callee may or may not preserve r2. Its .localentry value may be 0 or 1, where 1 indicates that r2 may be clobbered.
Similar to the preemptible callee case, the linker creates a call stub to save r2, and patches the nop to ld 2, 24(1) to restore r2.
<caller>:
If the call stub cannot reach the call target with a single b instruction, the linker will try computing the target address with addis+addi.
<__toc_save_far>:
std 2, 24(1) # save r2
addis 12, 2, ...
addi 12, 12, ...
mtctr 12
bctr # jump to the callee
If addis+addi cannot reach the call target, the linker will store the target address in a .branch_lt entry and perform an indirect branch.
<__toc_save_farther>:
std 2, 24(1) # save r2
addis 12, 2, ...
ld 12, ...(12) # load .branch_lt[n]
mtctr 12
bctr # jump to the callee
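The three-way choice described above can be modeled as a pair of range checks. This is a simplified sketch, not a linker's actual logic: `choose_toc_save_stub` is a hypothetical helper, `branch_addr` stands for the address of the b instruction inside the stub, and the addis+addi range is taken to be whatever a high-adjusted 16-bit part plus a signed 16-bit low part can express relative to r2.

```python
def fits_signed(value: int, bits: int) -> bool:
    """True if value is representable as a signed integer of `bits` bits."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def choose_toc_save_stub(branch_addr: int, target: int, toc_base: int) -> str:
    """Pick the cheapest stub form that can reach `target`."""
    # b/bl encode a 24-bit word offset, i.e. a signed 26-bit byte offset.
    if fits_signed(target - branch_addr, 26):
        return "b"
    # addis+addi relative to r2: (ha << 16) + lo with both halves signed.
    off = target - toc_base
    if -0x80008000 <= off <= 0x7FFF7FFF:
        return "addis+addi"
    return "branch_lt"  # load the full 64-bit address from .branch_lt[n]


toc = 0x10028000  # hypothetical TOC base
print(choose_toc_save_stub(0x10000000, 0x10001000, toc))      # b
print(choose_toc_save_stub(0x10000000, 0x20000000, toc))      # addis+addi
print(choose_toc_save_stub(0x10000000, 0x7000_0000_0000, toc))  # branch_lt
```

Each step down the list costs more instructions or a data slot, which is why the cheaper forms are tried first.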
A caller not using TOC marks its function calls with the relocation type R_PPC64_REL24_NOTOC.
caller:
Here is a test about a non-TOC caller and a TOC callee. In a0 and a1, the callee foo is non-preemptible while in a2, foo is preemptible.
echo 'int x = 42; void foo(); int main() { foo(); }' > a.c
Invoke bmake to run the test.
The callee may or may not use TOC. If the callee uses TOC and has a .localentry value larger than 1, its global entry point requires that r12 is set to the function entry address by the caller.
The linker creates a PC-relative PLT call stub to set r12 in case the callee needs r12.
<caller>:
If we don't use Power10 pld (--power10-stubs=no), we will need more instructions:
<__plt_pcrel_foo>:
mflr 12 # save lr
bcl 20, 31, .+4
mflr 11 # r11 = current location
mtlr 12 # restore lr
addis 12, 11, offset@ha
ld 12, offset@l(12) # load .plt[n]
mtctr 12
bctr # jump to the PLT entry
A non-preemptible callee may or may not use TOC.
<caller>:
If we don't use Power10 paddi (--power10-stubs=no), we will need more instructions.
<__gep_setup_foo>:
mflr 12
bcl 20, 31, .+4
mflr 11
mtlr 12
addis 12, 11, offset@ha
addi 12, 12, offset@l
mtctr 12
bctr
Non-preemptible IFUNCs are placed in .glink on PPC64. If there is a non-GOT non-PLT relocation, for pointer equality, we change the type of the symbol from STT_IFUNC to STT_FUNC and bind it to the .glink entry.
On PPC64 ELFv2, every bl instruction in .glink is associated with a .plt entry relocated by R_PPC64_JUMP_SLOT. An IPLT does not have an associated R_PPC64_JUMP_SLOT, so we cannot use bl in .iplt. Instead, we create a regular TOC call stub.
A non-preemptible ifunc implementation may not save the TOC pointer, so if another DSO defines an ifunc resolver which resolves to this implementation, calling that ifunc will not set the TOC pointer correctly. This is the restriction described by https://sourceware.org/glibc/wiki/GNU_IFUNC (though on many architectures it works in practice):
Requirement (a): Resolver must be defined in the same translation unit as the implementations.
See https://reviews.llvm.org/D71509.
On PPC32, an unconditional branch instruction b/bl has a range of ±32MiB and may use 3 relocation types: R_PPC_LOCAL24PC, R_PPC_REL24, and R_PPC_PLTREL24. If the target is not reachable from the instruction location, a range extension thunk will be used. R_PPC_LOCAL24PC is a useless relocation. All occurrences can be replaced with R_PPC_REL24.
On PPC64, an unconditional branch instruction b/bl has a range of ±32MiB and may use R_PPC64_REL24 or R_PPC64_REL24_NOTOC. The aforementioned call stubs for TOC/NOTOC interop have handled many long branches. The cases which haven't been handled are:
ld.lld only has an implementation for the first case. After linking, a caller may look like:
<caller>:
The branch target of a thunk may be a PLT entry.
GPR Save and Restore Functions defines some special functions which may be referenced by GCC produced assembly (LLVM does not reference them).
With GCC -Os, when the number of call-saved registers exceeds a certain threshold, GCC generates _savegpr[01]_{14..31} and _restgpr[01]_{14..31} calls and expects the linker to define them. See https://sourceware.org/pipermail/binutils/2002-February/017444.html and https://sourceware.org/pipermail/binutils/2004-August/036765.html.
This is weird because libgcc.a would be the natural place. However, the linker generation approach has the advantage that the linker can generate multiple copies to avoid long branch thunks. I don't consider the advantage significant enough to complicate ld.lld's trunk implementation, so I take a simple approach.
_savegpr0_{14..31}
are used