
    Linker notes on Power ISA

    MaskRay, posted on 2023-02-27 08:56:13

    UNDER CONSTRUCTION

    This article describes target-specific things about Power ISA in ELF linkers. The architecture was originally named "PowerPC". In 2006 the architecture was rebranded as "Power ISA". The ISA manual says: "In 2006, Freescale and IBM collaborated on the creation of the Power ISA Version 2.03, which represented the reunification of the architecture by combining Book E content with the more general purpose PowerPC Version 2.02."

    The terms "PowerPC" and "powerpc" remain popular in numerous places, including the powerpc-*-*-* and powerpc64-*-*-* in official target triple names. The abbreviation "PPC" ("ppc") is used in numerous places as well. For simplicity, I will refer to the 32-bit architecture as "PPC32" and the 64-bit architecture as "PPC64".

    ABI documents

    • Power Architecture™ 32-bit Application Binary Interface Supplement 1.0 - Linux® & Embedded, revised in 2011.
    • 64-bit PowerPC ELF Application Binary Interface Supplement 1.9. This is commonly referred to as ELFv1 and is obsolete. Some targets still use this ABI.
    • 64-Bit ELF V2 ABI Specification: Power Architecture

    The 32-bit ELF ABI receives little attention from maintainers and remains relevant only among some enthusiasts. In 2019, I spent one week studying the PPC32 ABI and added the PPC32 port to ld.lld.

    For a 64-bit object file, the presence of a section .opd is a good indicator for ELFv1. e_flags being 2 is a good indicator for ELFv2. e_flags being 0 is either an ELFv1 object file, or an object file not using any feature affected by the differences.
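
    The heuristic above can be sketched as a small classifier. This is only an illustration: the function name and its parameters are hypothetical, and treating the low two bits of e_flags as the ABI field, with value 1 meaning an explicit ELFv1, comes from the ELFv2 ABI document rather than from the text above.

```python
# Sketch only: classify a 64-bit PowerPC ELF object from its header flags.
# classify_ppc64_abi and its parameters are hypothetical names.

EF_PPC64_ABI_MASK = 3  # assumption: low two bits of e_flags hold the ABI version


def classify_ppc64_abi(e_flags: int, has_opd: bool) -> str:
    abi = e_flags & EF_PPC64_ABI_MASK
    if abi == 2:
        return "ELFv2"
    if abi == 1 or (abi == 0 and has_opd):
        # An explicit ELFv1 marking, or the function descriptor section .opd.
        return "ELFv1"
    if abi == 0:
        # Either ELFv1, or nothing ABI-specific is used by this object.
        return "ELFv1 or ABI-agnostic"
    return "unknown"
```

    For instance, classify_ppc64_abi(0, has_opd=True) reports "ELFv1", while an object with e_flags 0 and no .opd stays ambiguous.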

    A new ABI for little-endian PowerPC64 Design & Implementation (2014) describes the motivation for introducing ELFv2.

    Global Offset Table

    PPC32 GOT

    On PPC32, _GLOBAL_OFFSET_TABLE_ is defined at the start of the section .got. .got has 3 reserved entries. _GLOBAL_OFFSET_TABLE_[0] stores the link-time address of _DYNAMIC, which is used by glibc sysdeps/powerpc/powerpc32/dl-machine.h. _GLOBAL_OFFSET_TABLE_[1] and _GLOBAL_OFFSET_TABLE_[2] are for lazy binding PLT (_dl_runtime_resolve and link map).

    .plt is like .got.plt for other architectures. .plt[n] holds the address of a PLT entry (somewhere in .glink).

    Like x86-32, PPC32 lacks memory loads with PC-relative addressing. As a poor man's replacement, PPC32 sets up r30 to hold a GOT base for PIC code. The GOT base is different for small PIC and large PIC.

    • For -fpic and -fpie, r30 refers to _GLOBAL_OFFSET_TABLE_ in the component.
    • For -fPIC and -fPIE, r30 refers to .got2 for the current translation unit. This has implications for PLT-generating relocations as we will see below.

    .section        ".got2","aw"
    .align 2
    .LCTOC1 = .+32768
    .LC0:
    .long var

    ...
    bcl 20,31,.L2
    .L2:
    mflr 30 # r30 = lr
    addis 30,30,.LCTOC1-.L2@ha
    addi 30,30,.LCTOC1-.L2@l # finish setting up the GOT base
    lwz 9,.LC0-.LCTOC1(30) # load the address of var relative to the GOT base
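
    The @ha/@l pair in the addis/addi sequence above splits a 32-bit delta into two 16-bit immediates. A minimal sketch of that arithmetic follows; the helper names are made up for illustration, and the model assumes the usual rule that both instructions sign-extend their immediate.

```python
# Sketch of the @ha/@l split used by addis/addi pairs (helper names are hypothetical).


def lo(value: int) -> int:
    """@l: low 16 bits, read as a signed displacement."""
    low = value & 0xFFFF
    return low - 0x10000 if low >= 0x8000 else low


def ha(value: int) -> int:
    """@ha: high 16 bits, adjusted so that a negative lo() is compensated."""
    return ((value + 0x8000) >> 16) & 0xFFFF


def addis_addi(base: int, delta: int) -> int:
    """Model `addis rT,rA,delta@ha` followed by `addi rT,rT,delta@l` on 32 bits."""
    high = ha(delta)
    if high >= 0x8000:  # addis sign-extends its 16-bit immediate
        high -= 0x10000
    return (base + (high << 16) + lo(delta)) & 0xFFFFFFFF
```

    The +0x8000 in ha() is what makes the pair add up: whenever the low half is negative as a signed number, the high half is bumped by one to compensate.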

    The component may have multiple translation units and each has a different .got2. In the output file, .got2 in one file may have an arbitrary offset relative to the output .got2.

    PPC64 GOT

    On PPC64, .got has 1 reserved entry: the link-time address of .TOC.. .TOC. is defined at the start of the section .got plus 0x8000.
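
    The text does not say why .TOC. sits 0x8000 past the start of .got. A plausible reading is that a signed 16-bit displacement then covers the first 64KiB of the section instead of only half of it. The sketch below illustrates that reading; all names in it are hypothetical.

```python
# Sketch: why a base biased by 0x8000 helps signed 16-bit displacements.
# toc_base and reachable_with_d16 are hypothetical helper names.

TOC_BIAS = 0x8000


def toc_base(got_start: int) -> int:
    """.TOC. as described above: start of .got plus 0x8000."""
    return got_start + TOC_BIAS


def reachable_with_d16(base: int, addr: int) -> bool:
    """True if addr - base fits a signed 16-bit displacement field."""
    return -0x8000 <= addr - base <= 0x7FFF
```

    With the biased base, every byte in [got_start, got_start + 0xFFFF] is reachable. With got_start itself as the base, only the first 32KiB would be.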

    PPC64 ELFv2 Table of Contents (TOC)

    Different from most architectures, PPC64 uses .toc instead of .got to hold the addresses of global variables and address-taken functions.

    extern int var0, var1;
    int foo() { return var0 + var1; }

    addis 3, 2, .LC0@toc@ha
    addis 4, 2, .LC1@toc@ha
    ld 3, .LC0@toc@l(3)
    ld 4, .LC1@toc@l(4)
    lwz 3, 0(3)
    lwz 4, 0(4)
    add 3, 4, 3
    extsw 3, 3
    blr

    .section .toc,"aw",@progbits
    .LC0:
    .tc var0[TC],var0
    .LC1:
    .tc var1[TC],var1

    With .got, relocatable object files do not reference .got directly. The TOC scheme, in contrast, may be thought of as a compiler-managed GOT: .toc is explicit in relocatable object files. A .tc directive is a fancy way to produce a R_PPC64_ADDR64 relocation. If the linker decides to create a TOC entry, the entry will be a link-time constant (-no-pie) or be associated with a dynamic relocation (-pie or -shared).

    The TOC layout is under control of the compiler and presumably the compiler can leverage better information to optimize the layout for locality. Well, I disagree with this point. The compiler does not know the global information. A linker is better placed to do such link-time optimization.

    .plt is like .got.plt for other architectures. .plt has the type SHT_NOBITS and an alignment of 4.

    TOC-indirect to TOC-relative optimization

    See All about Global Offset Table#GOT optimization.

    Procedure Linkage Table

    PPC32 PLT

    Power Architecture® 32-bit Application Binary Interface Supplement 1.0 - Linux® & Embedded specifies two PLT ABIs: BSS-PLT and Secure-PLT.

    BSS-PLT is the older one. While .plt on other architectures is created by the linker, BSS-PLT lets ld.so generate the PLT entries. This has the advantage that the section can be made SHT_NOBITS and therefore does not occupy file size. The downside is the security concern of writable and executable memory pages. Even worse, as an implementation issue, GNU ld places .plt in the text segment and therefore the whole text segment is writable and executable. -z relro -z now has no effect.

    In the newer Secure-PLT ABI, .plt holds the table of function addresses. .plt is like .got.plt for other architectures.

    The linker synthesizes .glink, which is like .plt for other architectures. Unlike most architectures, .glink has a footer rather than a header. Each PLT entry is either b footer or a nop falling through to the footer. In ld.lld, we only use b footer for simplicity. See https://reviews.llvm.org/D75394 for PPC32GlinkSection in ld.lld.

    000102b4 <.glink>:
    b 0x102c0 <.glink+0xc>
    b 0x102c0 <.glink+0xc>
    b 0x102c0 <.glink+0xc>
    addis 11, 11, 0 # start of the resolver
    mflr 0
    bcl 20, 31, 0x102cc <.glink+0x18>
    addi 11, 11, 24
    mflr 12
    mtlr 0
    sub 11, 11, 12
    addis 12, 12, 1
    lwz 0, 184(12)
    lwz 12, 188(12)
    mtctr 0
    add 0, 11, 11
    add 11, 0, 11
    bctr
    nop
    nop
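
    The r11 arithmetic in the resolver above can be followed step by step. The sketch below mirrors it under two assumptions that the text does not state: that r11 holds the address of the .glink entry on arrival (the call stub loads .plt[n] into r11), and that the final value is meant as a byte offset into .rela.plt, where each Elf32_Rela takes 12 bytes. The function name is hypothetical.

```python
# Sketch: what the resolver listing does to r11 (assumptions in the lead-in).

ELF32_RELA_SIZE = 12   # r_offset, r_info, r_addend: 3 words of 4 bytes
GLINK_ENTRY_SIZE = 4   # each entry is a single `b footer`


def resolver_r11(glink_start: int, entry_addr: int) -> int:
    """Return the value left in r11 just before `bctr`."""
    r11 = entry_addr - glink_start  # addis/addi/sub: byte offset of the entry in .glink
    r0 = r11 + r11                  # add 0, 11, 11
    r11 = r0 + r11                  # add 11, 0, 11: three times the offset
    return r11
```

    Since entries are 4 bytes apart, tripling the offset yields 12 * n for entry n, which matches ELF32_RELA_SIZE * n.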

    For non-PIC code, a possibly preemptible branch uses the relocation type R_PPC_REL24.

    bl foo  # R_PPC_REL24
    bl foo # R_PPC_REL24

    If the call target is preemptible, the linker creates a non-PIC call stub and redirects the caller's branch instruction to the call stub. The non-PIC call stub uses absolute addressing to load .plt[n] into r11 (call-clobbered) and branches there. This is different from most other architectures, where the caller can branch directly to the PLT entry.

    bl 00000000.plt_call32.f
    bl 00000000.plt_call32.f
    ...

    00000000.plt_call32.f:
    lis 11, .plt[n]@ha
    lwz 11, .plt[n]@l(11)
    mtctr 11
    bctr
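
    As an illustration of the four-instruction stub above, the sketch below assembles it into bytes for a given .plt[n] address. The function name is hypothetical, big-endian output is an assumption (the usual case for PPC32), and the opcode constants are the standard encodings of lis/lwz/mtctr/bctr with r11.

```python
# Sketch: encode the non-PIC call stub shown above (names are hypothetical).
import struct


def ha(value: int) -> int:
    """@ha: high 16 bits, adjusted for the sign of the low half."""
    return ((value + 0x8000) >> 16) & 0xFFFF


def non_pic_call_stub(plt_entry_addr: int) -> bytes:
    words = [
        0x3D600000 | ha(plt_entry_addr),        # lis 11, .plt[n]@ha
        0x816B0000 | (plt_entry_addr & 0xFFFF),  # lwz 11, .plt[n]@l(11)
        0x7D6903A6,                              # mtctr 11
        0x4E800420,                              # bctr
    ]
    return struct.pack(">4I", *words)  # assumption: big-endian target
```

    The stub is 16 bytes and embeds an absolute address, which is why it only suits non-PIC code.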

    For PIC code, a branch to a possibly preemptible target uses R_PPC_PLTREL24 as the PLT-generating relocation type. The addend encodes r30 set up by the caller. Yes, this is unusual.

    • For -fpic and -fpie, the addend is 0.
    • For -fPIC and -fPIE, the addend is 0x8000. Linking this relocatable object file in -r mode may increase the addend.

    If the call target is preemptible, the linker creates a PIC call stub and redirects the caller's branch instruction to the call stub. GNU ld names a small PIC call stub *.plt_pic32.* and a large PIC call stub *.got2.plt_pic32.*.

    The call stub knows the value of r30 (GOT base) set up by the caller. The distance from .plt[n] to r30 is a constant. The call stub computes the address of .plt[n], loads the entry, and branches there.

    00000000.plt_pic32.f:
    ## If the GOT offset is beyond 64KiB
    addis 11, 30, (.plt[n]-_GLOBAL_OFFSET_TABLE_)@ha
    lwz 11, (.plt[n]-_GLOBAL_OFFSET_TABLE_)@l(11)
    mtctr 11
    bctr

    ## If the GOT offset is within 64KiB
    # lwz 11, .plt[n]-_GLOBAL_OFFSET_TABLE_(30)
    # mtctr 11
    # bctr
    # nop

    00000000.got2.plt_pic32.f:
    ## .got2 refers to the copy belonging to the current translation unit.
    ## Different translation units have to use different stubs.
    addis 11, 30, (.plt[n]-(.got2+0x8000))@ha
    lwz 11, (.plt[n]-(.got2+0x8000))@l(11)
    mtctr 11
    bctr

    ## The case when the GOT offset is within 64KiB is similar to plt_pic32.f.
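
    The choice between the two stub forms, and the role of the R_PPC_PLTREL24 addend, can be sketched as follows. The function and parameter names are hypothetical. The sketch assumes that a zero addend means r30 is _GLOBAL_OFFSET_TABLE_, that a non-zero addend means r30 is the translation unit's .got2 plus that addend, and that the short form applies exactly when the @ha part of the offset is zero.

```python
# Sketch: pick the PIC call stub form from the r30-relative offset of .plt[n].
# pic_stub and its parameters are hypothetical names.


def ha(value: int) -> int:
    """@ha: high 16 bits, adjusted for the sign of the low half."""
    return ((value + 0x8000) >> 16) & 0xFFFF


def pic_stub(plt_entry: int, got: int, got2: int, addend: int):
    """Return (form, offset) for a call with the given R_PPC_PLTREL24 addend.

    got  : address of _GLOBAL_OFFSET_TABLE_
    got2 : address of the calling translation unit's .got2 in the output
    """
    r30 = got if addend == 0 else got2 + addend
    offset = plt_entry - r30
    form = "lwz" if ha(offset) == 0 else "addis+lwz"
    return form, offset
```

    Two translation units with different .got2 addresses get different offsets for the same .plt[n], which is why they cannot share a large PIC stub.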

    Setting up r30 is extremely expensive. A function tail calling another one requires this many instructions:

    <foo>:
    stwu 1, -16(1) # allocate stack
    mflr 0
    bcl 20, 31, 0x1bc # set lr to PC
    stw 30, 8(1) # save r30 which is used as the GOT base
    mflr 30
    addis 30, 30, 2 # high 16 bits of the GOT base (.got2+0x8000)
    stw 0, 20(1) # save lr (copied to r0)
    addi 30, 30, 32140 # low 16 bits of the GOT base (.got2+0x8000)
    bl 0x1f0
    lwz 0, 20(1)
    lwz 30, 8(1)
    addi 1, 1, 16
    mtlr 0
    blr

    PPC64 ELFv2 PLT

    .plt is like .got.plt for other architectures. .plt[n] holds the address of a PLT entry (somewhere in .glink).

    .glink is like .plt for other architectures. .glink has a header of 60 bytes. Each PLT entry consists of one instruction b .plt. The PLT header subtracts the address of the first PLT entry from r12 to compute the PLT index.
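
    The index computation described above amounts to one subtraction and one division. The sketch below assumes that r12 holds the address of the .glink entry that was branched to and that entries are 4 bytes each (one instruction). The function name is hypothetical.

```python
# Sketch: recover the PLT index from r12, as the .glink header is described to do.

GLINK_HEADER_SIZE = 60  # bytes, per the text above
GLINK_ENTRY_SIZE = 4    # one instruction per entry


def plt_index(glink_start: int, r12: int) -> int:
    first_entry = glink_start + GLINK_HEADER_SIZE
    offset = r12 - first_entry
    if offset < 0 or offset % GLINK_ENTRY_SIZE:
        raise ValueError("r12 does not point at a .glink entry")
    return offset // GLINK_ENTRY_SIZE
```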

    An unconditional branch instruction b/bl may use either R_PPC64_REL24 or R_PPC64_REL24_NOTOC. R_PPC64_REL24 indicates that the caller uses TOC. R_PPC64_REL24_NOTOC indicates that the caller does not use TOC or preserve r2.

    If a PLT entry is needed, the linker creates a traditional or PC-relative PLT call stub and redirects the caller's branch instruction to the call stub. This is different from most other architectures, where such an indirection is unneeded.

    Thread Local Storage

    Both PPC32 and PPC64 use TLS Variant I: the static TLS blocks are placed above the thread pointer. The thread pointer points to the end of the thread control block.

    The linker performs TLS optimization.

    See All about thread-local storage.

    Workaround for old IBM XL compilers

    R_PPC64_TLSGD or R_PPC64_TLSLD is required to mark bl __tls_get_addr for General Dynamic/Local Dynamic code sequences.

    addis r3, r2, x@got@tlsgd@ha # R_PPC64_GOT_TLSGD16_HA
    addi r3, r3, x@got@tlsgd@l # R_PPC64_GOT_TLSGD16_LO
    bl __tls_get_addr(x@tlsgd) # R_PPC64_TLSGD followed by R_PPC64_REL24
    nop

    However, there are two deviations from the above:

    1. Direct call to __tls_get_addr. This is essential to implement rtld in glibc/musl/FreeBSD.

    bl __tls_get_addr
    nop

    This is only used in a -shared link, and thus not subject to the GD/LD to IE/LE relaxation issue below.

    2. Missing R_PPC64_TLSGD/R_PPC64_TLSLD for compiler-generated TLS references

    According to Stefan Pintilie, "In the early days of the transition from the ELFv1 ABI that is used for big endian PowerPC Linux distributions to the ELFv2 ABI that is used for little endian PowerPC Linux distributions, there was some ambiguity in the specification of the relocations for TLS. The GNU linker has implemented support for correct handling of calls to __tls_get_addr with a missing relocation. Unfortunately, we didn't notice that the IBM XL compiler did not handle TLS according to the updated ABI until we tried linking XL compiled libraries with LLD."

    It is unfortunate, but in short ld.lld needs to work around the old IBM XL compiler issue. Otherwise, if the object file is linked in -no-pie or -pie mode, the result will be incorrect because the 4 instructions are partially rewritten (the latter 2 are not changed).
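
    The text does not spell out how the workaround operates. One conceivable approach, sketched here as an assumption rather than as a description of ld.lld, is to scan an object file's relocation types and refuse GD/LD to IE/LE relaxation for that file when it uses the GOT-based GD/LD relocations but carries no R_PPC64_TLSGD/R_PPC64_TLSLD marker. The function name is hypothetical.

```python
# Sketch of a per-file check (an assumed approach, not ld.lld's actual code).

GD_LD_GOT_RELOCS = {
    f"R_PPC64_GOT_{model}16{suffix}"
    for model in ("TLSGD", "TLSLD")
    for suffix in ("", "_LO", "_HI", "_HA")
}
MARKERS = {"R_PPC64_TLSGD", "R_PPC64_TLSLD"}


def may_relax_tls(reloc_types) -> bool:
    """False if GD/LD sequences are present without any marker relocation."""
    types = set(reloc_types)
    uses_gd_ld = bool(types & GD_LD_GOT_RELOCS)
    has_marker = bool(types & MARKERS)
    return has_marker or not uses_gd_ld
```

    Leaving such a file unrelaxed keeps all 4 instructions of each sequence intact, at the cost of a slower TLS access.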

    Range extension thunks

    On PPC32, an unconditional branch instruction b/bl has a range of +-32MiB and may use 3 relocation types: R_PPC_LOCAL24PC, R_PPC_REL24, and R_PPC_PLTREL24. If the target is not reachable from the instruction location, a range extension thunk will be used. R_PPC_LOCAL24PC is a useless relocation. All occurrences can be replaced with R_PPC_REL24.
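
    The reachability test behind this decision is a signed range check on the branch displacement. A minimal sketch, with a hypothetical function name, assuming the displacement is measured from the address of the branch instruction itself:

```python
# Sketch: does a PPC32 b/bl at `pc` need a range extension thunk to reach `target`?

BRANCH_RANGE = 1 << 25  # 32MiB: 24-bit field, scaled by 4, signed


def needs_thunk(pc: int, target: int) -> bool:
    displacement = target - pc
    return not (-BRANCH_RANGE <= displacement < BRANCH_RANGE)
```

    The range is asymmetric by one instruction: -32MiB is encodable, +32MiB is not.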

    Interop between PC-relative and TOC functions

    TODO

    TODO --power10-stubs/--no-power10-stubs


