    Linker notes on Power ISA

    Posted by MaskRay on 2023-03-06 05:55:09

    This article describes target-specific details about Power ISA in ELF linkers. Initially there was IBM POWER. The 1991 Apple–IBM–Motorola alliance created PowerPC. In 2006, the architecture was rebranded as Power ISA. According to the ISA manual, "In 2006, Freescale and IBM collaborated on the creation of the Power ISA Version 2.03, which represented the reunification of the architecture by combining Book E content with the more general purpose PowerPC Version 2.02."

    The terms "PowerPC" and "powerpc" remain popular in numerous places, including the powerpc-*-*-* and powerpc64-*-*-* in official target triple names. The abbreviation "PPC" ("ppc") is used in numerous places as well. For simplicity, I will refer to the 32-bit architecture as "PPC32" and the 64-bit architecture as "PPC64".

    We will see how the lack of PC-relative addressing before Power10 has caused great complexity in the ABI and linkers.

    ABI documents

    • Power Architecture™ 32-bit Application Binary Interface Supplement 1.0 - Linux® & Embedded, revised in 2011.
    • 64-bit PowerPC ELF Application Binary Interface Supplement 1.9. This is commonly referred to as ELFv1 and is obsolete. Some 64-bit targets still use this ABI.
    • 64-Bit ELF V2 ABI Specification: Power Architecture

    The 32-bit ELF ABI receives little attention from maintainers and remains relevant mainly among enthusiasts. In 2019, I spent one week studying the PPC32 ABI and added the PPC32 port to ld.lld.

    For a 64-bit object file, the presence of a section .opd is a good indicator for ELFv1. e_flags being 2 is a good indicator for ELFv2. e_flags being 0 is either an ELFv1 object file, or an object file not using any feature affected by the differences.
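    The heuristic above can be written down as a small classifier. This is a minimal sketch, not how any linker implements it: the function name guess_ppc64_abi and its two inputs are hypothetical, and it assumes the ABI version occupies the low two bits of e_flags, with the value 1 marking ELFv1 explicitly.

```python
EF_PPC64_ABI = 3  # assumed mask: ABI version lives in the low two bits of e_flags

def guess_ppc64_abi(e_flags: int, has_opd: bool) -> str:
    """Guess the ABI of a 64-bit PowerPC object from e_flags and .opd presence."""
    version = e_flags & EF_PPC64_ABI
    if version == 2:
        return "ELFv2"
    if version == 1 or has_opd:
        return "ELFv1"
    # Version 0 and no .opd: ELFv1, or an object unaffected by the differences.
    return "ELFv1 or ABI-agnostic"
```

    The third outcome is deliberately vague: with e_flags being 0 and no .opd, the object file alone does not say which ABI it targets.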

    A new ABI for little-endian PowerPC64 Design & Implementation (2014) describes the motivation for introducing ELFv2.

    Global Offset Table

    PPC32 GOT

    On PPC32, _GLOBAL_OFFSET_TABLE_ is defined at the start of the section .got. .got has 3 reserved entries. _GLOBAL_OFFSET_TABLE_[0] stores the link-time address of _DYNAMIC, which is used by glibc sysdeps/powerpc/powerpc32/dl-machine.h. _GLOBAL_OFFSET_TABLE_[1] and _GLOBAL_OFFSET_TABLE_[2] are for lazy binding PLT (_dl_runtime_resolve and link map in glibc).

    .plt is like .got.plt for other architectures. .plt[n] holds the address of a PLT entry (somewhere in .glink).

    Like x86-32, PPC32 lacks memory load with PC-relative addressing. As a poor man's replacement, PPC32 sets up r30 to hold a GOT base for position-independent code (PIC). The GOT base is different for small PIC and large PIC.

    • For -fpic and -fpie, r30 refers to _GLOBAL_OFFSET_TABLE_ in the component.
    • For -fPIC and -fPIE, r30 refers to .got2 for the current translation unit. This has implications for PLT-generating relocations as we will see below.
    .section        ".got2","aw"
    .align 2
    .LCTOC1 = .+32768
    .LC0:
    .long var

    ...
    bcl 20,31,.L2
    .L2:
    mflr 30 # r30 = lr
    addis 30,30,.LCTOC1-.L2@ha
    addi 30,30,.LCTOC1-.L2@l # finish setting up the GOT base
    lwz 9,.LC0-.LCTOC1(30) # load the address of var relative to the GOT base
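
    The @ha/@l pair used by addis/addi above splits a 32-bit value into two 16-bit halves. Because addi sign-extends its immediate, the high half has to be rounded up whenever the low half is negative. The following is a minimal sketch of that arithmetic; the helper names ha, lo and rebuild are mine, not assembler or linker APIs.

```python
def lo(value: int) -> int:
    """@l: low 16 bits, interpreted as a signed 16-bit immediate."""
    low = value & 0xFFFF
    return low - 0x10000 if low >= 0x8000 else low

def ha(value: int) -> int:
    """@ha: high 16 bits, adjusted so that (ha << 16) + lo == value."""
    return ((value + 0x8000) >> 16) & 0xFFFF

def rebuild(value: int) -> int:
    """Emulate addis (ha) followed by addi (lo) on a 32-bit register."""
    return ((ha(value) << 16) + lo(value)) & 0xFFFFFFFF
```

    For example, a displacement of 0x27d8c splits into ha = 2 and lo = 32140, the pair of immediates that appears in the disassembly of foo later in this article.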

    The component may have multiple translation units and each has a different .got2. In the output file, .got2 in one file may have an arbitrary offset relative to the output .got2.

    PPC64 GOT

    On PPC64, .got has 1 reserved entry: the link-time address of .TOC.. .TOC. is defined at the start of the section .got plus 0x8000.
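
    The 0x8000 bias is what lets a single signed 16-bit displacement from r2 cover a full 64KiB window starting at .got. A minimal sketch of the arithmetic, with hypothetical helper names:

```python
TOC_BIAS = 0x8000  # .TOC. = start of .got + 0x8000, as stated above

def toc_window(got_start: int) -> tuple:
    """Return (low, high): addresses reachable from r2 with one signed 16-bit offset."""
    toc_base = got_start + TOC_BIAS
    return toc_base - 0x8000, toc_base + 0x7FFF

def toc_offset(got_start: int, address: int) -> int:
    """Offset of address relative to .TOC.; negative for the first 32 KiB of .got."""
    return address - (got_start + TOC_BIAS)
```

    Without the bias, half of the displacement range would point below .got and be wasted.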

    .plt is like .got.plt for other architectures. .plt has the type SHT_NOBITS and an alignment of 4.

    PPC64 ELFv2 Table of Contents (TOC)

    Before Power10, PPC64 uses .toc instead of .got to hold the addresses of global variables and address-taken functions. This is different from most architectures.

    extern int var0, var1;
    int foo() { return var0 + var1; }

    The above C program compiles to the following assembly:

    foo:
    .Lfunc_begin0:
    .Lfunc_gep0:
    addis 2, 12, .TOC.-.Lfunc_gep0@ha
    addi 2, 2, .TOC.-.Lfunc_gep0@l
    .Lfunc_lep0:
    .localentry foo, .Lfunc_lep0-.Lfunc_gep0

    addis 3, 2, .LC0@toc@ha
    addis 4, 2, .LC1@toc@ha
    ld 3, .LC0@toc@l(3)
    ld 4, .LC1@toc@l(4)
    lwz 3, 0(3)
    lwz 4, 0(4)
    add 3, 4, 3
    extsw 3, 3
    blr

    .section .toc,"aw",@progbits
    .LC0:
    .tc var0[TC],var0
    .LC1:
    .tc var1[TC],var1

    foo has a global entry foo/.Lfunc_gep0 and a local entry .Lfunc_lep0. After the local entry, r2 holds the address of the TOC base of the current component.

    If foo and a caller of foo are in the same component, the caller may branch directly to the local entry, skipping a few instructions starting at the global entry (usually 2). Otherwise, the caller needs to branch to the global entry so that foo will update r2 itself. This update requires that r12 points to the function entry address. We will see that maintaining r2 and r12 causes a lot of trouble in the sections diving into call stubs.
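
    The distance between the two entry points is recorded in the symbol's st_other field by .localentry. The following is a minimal sketch of the decoding as I understand the ELFv2 encoding (three most significant bits of st_other; 0 and 1 mean the entry points coincide, 2 to 6 are the base-2 logarithm of the byte offset, 7 is reserved). The function name is hypothetical.

```python
def gep_to_lep_offset(st_other: int) -> int:
    """Byte offset from the global entry point to the local entry point."""
    field = (st_other >> 5) & 7   # three most significant bits of st_other
    if field < 2:
        return 0                  # 0 and 1: both entry points coincide
    if field == 7:
        raise ValueError("reserved .localentry encoding")
    return 1 << field             # 2..6: log2 of the offset in bytes
```

    For foo above, the global entry is two instructions (8 bytes) before the local entry, which corresponds to a field value of 3.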

    Another difference is the explicit mention of .toc. This scheme gives the compiler control within the translation unit. With the traditional GOT scheme, input files do not mention .got. The compiler does not control how the linker will lay out .got. Well, I disagree with the presumed advantage of .toc: the compiler does not know the global information, and the translation unit local layout may not be ideal. A linker is better placed to do such link-time optimization.

    A .tc directive is a fancy way to produce a relocation of type R_PPC64_ADDR64. If the linker decides to create a TOC entry, the entry will be a link-time constant (-no-pie) or be associated with a dynamic relocation (-pie or -shared).

    TOC-indirect to TOC-relative optimization

    See All about Global Offset Table#GOT optimization.

    Procedure Linkage Table

    PPC32 PLT

    Power Architecture® 32-bit Application Binary Interface Supplement 1.0 - Linux® & Embedded specifies two PLT ABIs: BSS-PLT and Secure-PLT.

    BSS-PLT is the older method, which is now obsolete. While .plt on other architectures is created by the linker, BSS-PLT lets ld.so generate the PLT entries. This has the advantage that the section can be made SHT_NOBITS and therefore not occupy file size. However, the downside is the security concern of writable and executable memory pages. Even worse, as an implementation issue, GNU ld places .plt in the text segment, making the whole text segment writable and executable. This renders -z relro -z now ineffective.

    In the newer Secure-PLT ABI, .plt holds the table of function addresses. .plt is like .got.plt for other architectures.

    The linker synthesizes .glink, which is like .plt for other architectures. Unlike most architectures, .glink has a footer rather than a header. Each PLT entry is either b footer or a nop falling through to the footer. In ld.lld, we only use b footer for simplicity. See https://reviews.llvm.org/D75394 for PPC32GlinkSection in ld.lld.

    000102b4 <.glink>:
    b 0x102c0 <.glink+0xc>
    b 0x102c0 <.glink+0xc>
    b 0x102c0 <.glink+0xc>
    addis 11, 11, 0 # start of the resolver
    mflr 0
    bcl 20, 31, 0x102cc <.glink+0x18>
    addi 11, 11, 24
    mflr 12
    mtlr 0
    sub 11, 11, 12
    addis 12, 12, 1
    lwz 0, 184(12)
    lwz 12, 188(12)
    mtctr 0
    add 0, 11, 11
    add 11, 0, 11
    bctr
    nop
    nop
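
    The register arithmetic in the resolver is easier to follow when traced step by step. Below is a minimal sketch that replays the r11 computation from the listing above in Python, using the addresses shown there. The interpretation of the result as a byte offset into .rela.plt (12 bytes per Elf32_Rela) is my reading rather than something the listing states.

```python
GLINK = 0x102B4            # address of .glink in the listing
AFTER_BCL = GLINK + 0x18   # address that the bcl places in lr

def resolver_r11(entry_addr: int) -> int:
    """Follow r11 through the resolver for a call that entered at entry_addr."""
    r11 = entry_addr
    r11 += 0 << 16         # addis 11, 11, 0
    r11 += 24              # addi 11, 11, 24   (AFTER_BCL - GLINK)
    r12 = AFTER_BCL        # mflr 12
    r11 -= r12             # sub 11, 11, 12    -> entry_addr - GLINK
    r0 = r11 + r11         # add 0, 11, 11
    r11 = r0 + r11         # add 11, 0, 11     -> 3 * (entry_addr - GLINK)
    return r11
```

    Since each entry is 4 bytes, entry n yields 12 * n.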

    For non-PIC code, a possibly preemptible branch uses the relocation type R_PPC_REL24.

    bl foo  # R_PPC_REL24
    bl foo # R_PPC_REL24

    If the call target is preemptible, the linker creates a non-PIC call stub and redirects the caller's branch instruction to the call stub. The non-PIC call stub will use absolute addressing to load .plt[n] into r11 (call-clobbered) and branch there. This behavior is different from most other architectures where the caller can branch directly to the PLT entry.

      bl 00000000.plt_call32.f
    bl 00000000.plt_call32.f
    ...

    00000000.plt_call32.f:
    lis 11, .plt[n]@ha
    lwz 11, .plt[n]@l(11)
    mtctr 11
    bctr

    For PIC code, a branch to a possibly preemptible target uses R_PPC_PLTREL24 as the PLT-generating relocation type. The addend encodes r30 set up by the caller. Yes, this is unusual.

    • For -fpic and -fpie, the addend is 0.
    • For -fPIC and -fPIE, the addend is 0x8000. Linking this relocatable object file in -r mode may increase the addend.

    When calling a function, if the target is preemptible, the linker creates a PIC call stub and redirects the caller's branch instruction to the call stub. GNU ld names small PIC call stubs as *.plt_pic32.* and large PIC call stubs as *.got2.plt_pic32.*. ld.lld follows the naming convention.

    A call stub knows the value of r30 (GOT base) set up by the caller. The distance from .plt[n] to r30 is a constant. The call stub computes the address of .plt[n], loads the entry, and branches there.

    00000000.plt_pic32.f:
    ## If the GOT offset is beyond 64KiB
    addis 11, 30, (.plt[n]-_GLOBAL_OFFSET_TABLE_)@ha
    lwz 11, (.plt[n]-_GLOBAL_OFFSET_TABLE_)@l(11)
    mtctr 11
    bctr

    ## If the GOT offset is within 64KiB
    # lwz 11, .plt[n]-_GLOBAL_OFFSET_TABLE_(30)
    # mtctr 11
    # bctr
    # nop

    00000000.got2.plt_pic32.f:
    ## .got2 refers to the copy belonging to the current translation unit.
    ## Different translation units have to use different stubs.
    addis 11, 30, (.plt[n]-(.got2+0x8000))@ha
    lwz 11, (.plt[n]-(.got2+0x8000))@l(11)
    mtctr 11
    bctr

    ## The case when the GOT offset is within 64KiB is similar to plt_pic32.f.
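
    The choice between the short and long stub forms depends only on the constant distance between .plt[n] and the caller's r30. The following is a minimal sketch that emits the stub as text; pic_stub is a hypothetical helper, and got_base stands for whatever value r30 holds in the caller (_GLOBAL_OFFSET_TABLE_ for small PIC, .got2+0x8000 of the caller's translation unit for large PIC).

```python
def pic_stub(plt_entry: int, got_base: int) -> list:
    """Return the instructions of a PPC32 PIC call stub as text.

    got_base is the value r30 holds in the caller.
    """
    offset = plt_entry - got_base
    if -0x8000 <= offset <= 0x7FFF:
        # One signed 16-bit displacement is enough; pad with nop to 4 instructions.
        return [f"lwz 11, {offset}(30)", "mtctr 11", "bctr", "nop"]
    high = ((offset + 0x8000) >> 16) & 0xFFFF      # @ha
    low = ((offset & 0xFFFF) ^ 0x8000) - 0x8000    # @l, sign-extended
    # After addis, r11 holds r30 + (high << 16), so the load is based on r11.
    return [f"addis 11, 30, {high}", f"lwz 11, {low}(11)", "mtctr 11", "bctr"]
```

    Because got_base differs per translation unit for large PIC, two callers of the same function in different translation units get different stubs.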

    While we have a working solution, if we revisit the scheme, we will find that setting up r30 is extremely expensive. A trivial tail call example (void foo() { bar(); }) needs numerous instructions:

    <foo>:
    stwu 1, -16(1) # allocate stack
    mflr 0
    bcl 20, 31, 0x1bc # set lr to PC
    stw 30, 8(1) # save r30 which is used as the GOT base
    mflr 30
    addis 30, 30, 2 # high 16 bits of the GOT base (.got2+0x8000)
    stw 0, 20(1) # save lr (copied to r0)
    addi 30, 30, 32140 # low 16 bits of the GOT base (.got2+0x8000)
    bl 0x1f0
    lwz 0, 20(1)
    lwz 30, 8(1)
    addi 1, 1, 16
    mtlr 0
    blr

    PPC64 ELFv2 PLT

    .glink is like .plt for other architectures and has a header of 60 bytes. Each PLT entry consists of one instruction b .plt. The PLT header subtracts the address of the first PLT entry from r12 to compute the PLT index.
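
    The index computation done by the header can be sketched as follows. This is a minimal sketch with hypothetical names; it assumes that r12 holds the address of the .glink entry that was branched to, which is what an unresolved .plt[n] points at.

```python
GLINK_HEADER_SIZE = 60   # bytes, per the text above
PLT_ENTRY_SIZE = 4       # one branch instruction per entry

def plt_index(r12: int, glink_start: int) -> int:
    """Recover the PLT index from r12, the address of the entry that was taken."""
    first_entry = glink_start + GLINK_HEADER_SIZE
    delta = r12 - first_entry
    if delta < 0 or delta % PLT_ENTRY_SIZE:
        raise ValueError("r12 does not point at a .glink entry")
    return delta // PLT_ENTRY_SIZE
```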

    An unconditional branch instruction b/bl may produce a relocation of either R_PPC64_REL24 or R_PPC64_REL24_NOTOC. R_PPC64_REL24 indicates that the caller uses TOC. R_PPC64_REL24_NOTOC indicates that the caller does not use TOC or preserve r2.

    A conditional branch instruction may produce a relocation of type R_PPC64_REL14.

    All of R_PPC64_REL14, R_PPC64_REL24, and R_PPC64_REL24_NOTOC are PLT-generating relocation types. If a PLT entry is needed, the linker will create a traditional or PC-relative PLT call stub, and redirect the caller's branch instruction to the call stub. This behavior is different from most other architectures where the caller can branch directly to the PLT entry. The inefficiency comes from maintaining r2 and r12 for TOC.

    There is no R_PPC64_REL14_NOTOC. R_PPC64_REL14 used by conditional branches is generally not used for function calls.

    Below I will describe call stubs for TOC/NOTOC interop and for range extension in detail.

    Thread Local Storage

    Both PPC32 and PPC64 use a variant of TLS Variant I: the static TLS blocks are placed above the thread pointer. The thread pointer points to the end of the thread control block.

    The linker performs TLS optimization.

    See All about thread-local storage.

    Workaround for old IBM XL compilers

    R_PPC64_TLSGD or R_PPC64_TLSLD is required to mark bl __tls_get_addr for General Dynamic/Local Dynamic code sequences.

    addis r3, r2, x@got@tlsgd@ha # R_PPC64_GOT_TLSGD16_HA
    addi r3, r3, x@got@tlsgd@l # R_PPC64_GOT_TLSGD16_LO
    bl __tls_get_addr(x@tlsgd) # R_PPC64_TLSGD followed by R_PPC64_REL24
    nop

    However, there are two deviations from the above:

    1. Direct call to __tls_get_addr. This is essential to implement rtld in glibc/musl/FreeBSD.
    bl __tls_get_addr
    nop

    This is only used in a -shared link, and thus not subject to the GD/LD to IE/LE relaxation issue below.

    2. Missing R_PPC64_TLSGD/R_PPC64_TLSLD for compiler generated TLS references

    According to Stefan Pintilie, "In the early days of the transition from the ELFv1 ABI that is used for big endian PowerPC Linux distributions to the ELFv2 ABI that is used for little endian PowerPC Linux distributions, there was some ambiguity in the specification of the relocations for TLS. The GNU linker has implemented support for correct handling of calls to __tls_get_addr with a missing relocation. Unfortunately, we didn't notice that the IBM XL compiler did not handle TLS according to the updated ABI until we tried linking XL compiled libraries with LLD."

    It is unfortunate but in short ld.lld needs to work around the old IBM XL compiler issue. Otherwise, if the object file is linked in -no-pie or -pie mode, the result will be incorrect because the 4 instructions are partially rewritten (the latter 2 are not changed).
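
    One plausible shape for such a workaround is a per-object check: if GD/LD GOT relocations appear but no marker relocation does, skip TLS optimization for that object. The following is a minimal sketch of that idea under my own assumptions, not a description of ld.lld's actual code; the function name and the representation of relocations as a list of type names are hypothetical.

```python
GD_LD_GOT_RELOCS = {
    "R_PPC64_GOT_TLSGD16_HA", "R_PPC64_GOT_TLSGD16_LO",
    "R_PPC64_GOT_TLSLD16_HA", "R_PPC64_GOT_TLSLD16_LO",
}
MARKERS = {"R_PPC64_TLSGD", "R_PPC64_TLSLD"}

def must_disable_tls_optimization(reloc_types) -> bool:
    """True if GD/LD GOT relocations appear without any marker relocation."""
    seen = set(reloc_types)
    return bool(seen & GD_LD_GOT_RELOCS) and not (seen & MARKERS)
```

    Leaving the General Dynamic sequence untouched costs a call to __tls_get_addr at run time but keeps all 4 instructions consistent.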

    PPC64 ELFv2 TOC caller

    A caller using TOC marks its function calls with relocation type R_PPC64_REL24.

    The caller expects that r2 does not change while the callee may alter r2. To address the issue, the compiler and the linker collaborate to preserve r2.

    For a call target which may resolve to a different translation unit (e.g. non-definition declaration, hidden visibility definition), the compiler inserts a NOP after the branch instruction. A call target guaranteed to resolve to the current translation unit (e.g. internal linkage) does not need a NOP since r2 will not change.

    caller:
    bl foo
    nop # may become `ld 2, 24(1)`
    bl nonpreemptible
    blr

    Note: A call target with external linkage and hidden visibility needs a NOP as well, because the callee may be defined in another translation unit that does not maintain the TOC pointer and clobbers r2.

    TOC caller and preemptible callee

    If the callee is preemptible, the caller and the callee may be in different components.

    • If the callee uses TOC, it may change r2 to the TOC base of its component.
    • If the callee uses PC-relative addressing, it may treat r2 as caller-saved and clobber r2.

    The linker creates a PLT call stub to save r2 in the caller stack frame, and patches the nop to ld 2, 24(1) to restore r2.

    <caller>:
    bl __plt_foo
    ld 2, 24(1) # restore r2
    bl nonpreemptible
    blr

    <__plt_foo>:
    std 2, 24(1) # save r2
    addis 12, 2, ...
    ld 12, ...(12) # load .plt[n]
    mtctr 12
    bctr # jump to the PLT entry
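
    The nop patching step can be illustrated on raw instruction words. This is a minimal sketch assuming little-endian code bytes; restore_toc_after_call is a hypothetical helper and the two constants are the standard encodings of nop (ori 0, 0, 0) and ld 2, 24(1).

```python
import struct

NOP = 0x60000000           # ori 0, 0, 0
LD_R2_24_R1 = 0xE8410018   # ld 2, 24(1)

def restore_toc_after_call(code: bytearray, bl_offset: int) -> bool:
    """Patch the nop following the bl at bl_offset; little-endian code assumed."""
    slot = bl_offset + 4
    if slot + 4 > len(code):
        return False
    (insn,) = struct.unpack_from("<I", code, slot)
    if insn != NOP:
        return False       # no nop to patch: this call site cannot restore r2
    struct.pack_into("<I", code, slot, LD_R2_24_R1)
    return True
```

    A False return corresponds to a call site where the compiler did not leave a nop, so a stub that changes r2 cannot be used safely there.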

    TOC caller and non-preemptible callee with localentry=1

    A non-TOC callee may or may not preserve r2. Its .localentry value may be 0 or 1, where 1 indicates that r2 may be clobbered.

    Similar to the preemptible callee case, the linker creates a call stub to save r2, and patches the nop to ld 2, 24(1) to restore r2.

    <caller>:
    bl __toc_save_foo
    ld 2, 24(1) # restore r2
    bl nonpreemptible
    blr

    <__toc_save_foo>:
    std 2, 24(1) # save r2
    b foo # jump to the callee

    If the call stub cannot reach the call target with a single b instruction, the linker will try computing the target address with addis+addi.

    <__toc_save_far>:
    std 2, 24(1) # save r2
    addis 12, 2, ...
    addi 12, 12, ...
    mtctr 12
    bctr # jump to the callee

    If addis+addi cannot reach the call target, the linker will store the target address in a .branch_lt entry and perform an indirect branch.

    <__toc_save_farther>:
    std 2, 24(1) # save r2
    addis 12, 2, ...
    ld 12, ...(12) # load .branch_lt[n]
    mtctr 12
    bctr # jump to the callee
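
    The three r2-saving stub shapes above form a fallback chain ordered by reach. The following is a minimal sketch of the selection logic under stated assumptions: the function names are hypothetical, the b is assumed to sit 4 bytes into the stub (after the std), and the exact upper limit of addis+addi (which is slightly below 2GiB because of @ha rounding) is ignored.

```python
def fits_signed(value: int, bits: int) -> bool:
    """True if value is representable as a signed integer of the given width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))

def choose_toc_save_stub(stub_addr: int, target: int, toc_base: int) -> str:
    """Pick the cheapest r2-saving stub shape that can reach target."""
    # The b follows the 4-byte std; its 24-bit field is scaled by 4 (26 bits).
    if fits_signed(target - (stub_addr + 4), 26):
        return "std; b"
    # addis+addi add a roughly 32-bit signed displacement to r2 (the TOC base).
    if fits_signed(target - toc_base, 32):
        return "std; addis; addi; mtctr; bctr"
    return "std; addis; ld .branch_lt[n]; mtctr; bctr"
```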

    PPC64 ELFv2 non-TOC caller

    A caller not using TOC marks its function calls with the relocation type R_PPC64_REL24_NOTOC.

    caller:
    bl foo@notoc
    blr

    Here is a test about a non-TOC caller and a TOC callee. In a0 and a1, the callee foo is non-preemptible while in a2, foo is preemptible.

    echo 'int x = 42; void foo(); int main() { foo(); }' > a.c
    printf '#include <stdio.h>\nextern int x; void foo() { printf("%%d\\n", x); }' > b.c
    sed 's/^ /\t/' > Makefile <<'eof'
    .MAKE.MODE := meta curDirOk=true
    CC := /tmp/Rel/bin/clang --target=powerpc64le-linux-gnu
    LDFLAGS := -fuse-ld=lld -Wl,--dynamic-linker=/usr/powerpc64le-linux-gnu/lib64/ld64.so.2,-rpath=/usr/powerpc64le-linux-gnu/lib -Wl,--no-power10-stubs

    run: a0 a1 a2
     qemu-ppc64le-static -cpu power10 ./a0
     qemu-ppc64le-static -cpu power10 ./a1
     qemu-ppc64le-static -cpu power10 ./a2

    a0: a.o b.o
     ${LINK.c} $> -o $@

    a1: a.o b.o
     ${LINK.c} -r $> -o $@.ro
     ${LINK.c} $@.ro -o $@

    a2: a.o b.so
     ${LINK.c} a.o ./b.so -o $@

    a.o: a.c
     ${CC} -mcpu=power10 -c $>

    b.so: b.o
     ${LINK.c} -shared $> -o $@
    eof

    Invoke bmake to run the test.

    Non-TOC caller and preemptible callee

    The callee may or may not use TOC. If the callee uses TOC and has a .localentry value larger than 1, its global entry point requires that r12 is set to the function entry address by the caller.

    The linker creates a PC-relative PLT call stub to set r12 in case the callee needs r12.

    <caller>:
    bl __plt_pcrel_foo
    blr

    <__plt_pcrel_foo>:
    pld 12, .plt[n]@pcrel # load .plt[n]
    mtctr 12
    bctr # jump to the PLT entry

    If we don't use Power10 pld (--power10-stubs=no), we will need more instructions:

    <__plt_pcrel_foo>:
    mflr 12 # save lr
    bcl 20, 31, .+4
    mflr 11 # r11 = current location
    mtlr 12 # restore lr
    addis 12, 11, offset@ha
    ld 12, offset@l(12) # load .plt[n]
    mtctr 12
    bctr # jump to the PLT entry

    Non-TOC caller and non-preemptible TOC callee

    A non-preemptible callee may or may not use TOC.

    • If the callee doesn't use TOC, the branch instruction can point to the target directly.
    • If the callee uses TOC, the linker creates a call stub that sets r12 to the callee's global entry address, similar to the preemptible callee case.
    <caller>:
    bl __gep_setup_foo
    blr

    <__gep_setup_foo>:
    paddi 12, 0, foo@pcrel # compute target address
    mtctr 12
    bctr # jump to target

    If we don't use Power10 paddi (--power10-stubs=no), we will need more instructions.

    <__gep_setup_foo>:
    mflr 12
    bcl 20, 31, .+4
    mflr 11
    mtlr 12
    addis 12, 11, offset@ha
    addi 12, 12, offset@l
    mtctr 12
    bctr
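
    The non-TOC caller cases from the last two sections can be summarized as a small decision function. This is only a restatement of the text as a sketch; notoc_call_plan and its string results are hypothetical and do not correspond to linker data structures.

```python
def notoc_call_plan(preemptible: bool, callee_uses_toc: bool) -> str:
    """Describe how a `bl foo@notoc` is resolved, per the cases above."""
    if preemptible:
        # The callee is unknown at link time, so r12 is always set up.
        return "PC-relative PLT call stub (loads .plt[n] into r12)"
    if callee_uses_toc:
        # The global entry derives r2 from r12, so r12 must hold its address.
        return "r12 setup stub (computes the callee address into r12)"
    return "direct branch"
```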

    IPLT code sequence for non-preemptible IFUNC

    Non-preemptible IFUNC are placed in .glink on PPC64. If there is a non-GOT non-PLT relocation, for pointer equality, we change the type of the symbol from STT_IFUNC to STT_FUNC and bind it to the .glink entry.

    On PPC64 ELFv2, every bl instruction in .glink is associated with a .plt entry relocated by R_PPC64_JUMP_SLOT. An IPLT does not have an associated R_PPC64_JUMP_SLOT, so we cannot use bl in .iplt. Instead, we create a regular TOC call stub.

    A non-preemptible ifunc implementation may not save the TOC pointer, so if another DSO defines an ifunc resolver which resolves to this implementation, calling that ifunc will not set the TOC pointer correctly. This is the restriction described by https://sourceware.org/glibc/wiki/GNU_IFUNC (though on many architectures it works in practice):

    Requirement (a): Resolver must be defined in the same translation unit as the implementations.

    See https://reviews.llvm.org/D71509.

    Range extension thunks

    On PPC32, an unconditional branch instruction b/bl has a range of +-32MiB and may use 3 relocation types: R_PPC_LOCAL24PC, R_PPC_REL24, and R_PPC_PLTREL24. If the target is not reachable from the instruction location, a range extension thunk will be used. R_PPC_LOCAL24PC is a useless relocation. All occurrences can be replaced with R_PPC_REL24.

    On PPC64, an unconditional branch instruction b/bl has a range of +-32MiB and may use R_PPC64_REL24 or R_PPC64_REL24_NOTOC. The aforementioned call stubs for TOC/NOTOC interop have handled many long branches. The cases which haven't been handled are:

    • TOC caller and non-preemptible TOC callee
    • non-TOC caller and non-preemptible non-TOC callee

    ld.lld only has an implementation for the first case. After linking, a caller may look like:

    <caller>:
    bl __long_branch_nonpreemptible
    blr

    <__long_branch_nonpreemptible>:
    addis 12, 2, offset@ha
    ld 12, offset@l(12) # load .branch_lt[n]
    mtctr 12
    bctr # jump to the target

    The branch target of a thunk may be a PLT entry.
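
    The +-32MiB figure follows from the 24-bit word-offset field of b/bl, and the same arithmetic gives +-32KiB for the 14-bit field of conditional branches. Here is a minimal sketch of the reachability check that decides whether a thunk is needed; the function names are hypothetical.

```python
def in_branch_range(src: int, dst: int, field_bits: int = 24) -> bool:
    """True if a branch at src reaches dst; the field counts 4-byte words."""
    offset = dst - src
    if offset % 4:
        return False
    limit = 1 << (field_bits + 2 - 1)   # 24-bit field -> 2**25 bytes = 32 MiB
    return -limit <= offset < limit

def needs_thunk(src: int, dst: int) -> bool:
    """True if a b/bl at src cannot reach dst directly."""
    return not in_branch_range(src, dst)
```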

    GPR save and restore functions

    GPR Save and Restore Functions defines some special functions which may be referenced by GCC produced assembly (LLVM does not reference them).

    With GCC -Os, when the number of call-saved registers exceeds a certain threshold, GCC generates _savegpr[01]_{14..31} and _restgpr[01]_{14..31} calls and expects the linker to define them. See https://sourceware.org/pipermail/binutils/2002-February/017444.html and https://sourceware.org/pipermail/binutils/2004-August/036765.html.

    This is weird because libgcc.a would be the natural place. However, the linker generation approach has the advantage that the linker can generate multiple copies to avoid long branch thunks. I don't consider the advantage significant enough to complicate ld.lld's trunk implementation, so I take a simple approach.

    • Check whether _savegpr0_{14..31} are used
    • If yes, define needed symbols and add an InputSection with the code sequence.
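
    To give an idea of what such a code sequence looks like, here is a minimal sketch that prints the _savegpr0_ family as assembly text. The layout (register rN stored at -8*(32-N) from r1, followed by a shared tail that stores r0 at 16(1) and returns) is my reading of the ABI's GPR save functions; the generator function itself is hypothetical.

```python
def savegpr0_sequence(first: int = 14) -> list:
    """Assembly text for _savegpr0_<first>; later entry points are suffixes of it."""
    if not 14 <= first <= 31:
        raise ValueError("register must be in 14..31")
    lines = []
    for reg in range(first, 32):
        lines.append(f"_savegpr0_{reg}:")
        lines.append(f"  std {reg}, {-8 * (32 - reg)}(1)")
    # Tail shared by every entry point: store lr (already in r0) and return.
    lines += ["  std 0, 16(1)", "  blr"]
    return lines
```

    Because every entry point falls through into the next, only the lowest-numbered symbol that is actually referenced determines how much of the sequence has to be emitted.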

