    Toolchain notes on z/Architecture

    Posted by MaskRay on 2024-02-12 02:22:39

    This article collects some notes on z/Architecture with a focus on the ELF ABI and ELF linkers. An lld/ELF patch sparked my motivation to write this post.

    z/Architecture is a mainframe computer architecture supporting 24-bit, 31-bit, and 64-bit addressing modes. It is the latest generation in a lineage stretching back to 1964 with IBM System/360 (32-bit general purpose registers and 24-bit addressing). This lineage includes System/370 (1970), System/370 Extended Architecture (1983), Enterprise Systems Architecture/370 (1988), and Enterprise Systems Architecture/390 (1990). For a deeper dive into the design choices behind z/Architecture's extension from ESA/390, you can refer to "Development and attributes of z/Architecture."

    Linux on IBM Z is a 64-bit operating system on z/Architecture, related to an older effort porting Linux to ESA/390. As the Wikipedia page clarifies:

    Historically the Linux kernel architecture designations were "s390" and "s390x" to distinguish between the 32-bit and 64-bit Linux on IBM Z kernels respectively, but "s390" now also refers generally to the one Linux on IBM Z kernel architecture.

    Documents

    • z/Architecture Principles of Operation: This is the instruction set manual with an unusual name inherited from IBM System/360 Principles of Operation.
    • Assembler Language Programming for IBM System z: This book is more readable than Principles of Operation.
    • z/Architecture Reference Summary: A concise reference of instructions.
    • zSeries ELF Application Binary Interface Supplement (v1.0.2), 2002: This ABI document has been superseded by s390x-abi.
    • https://github.com/IBM/s390x-abi: The latest version of the psABI (processor supplement to the System V ABI) resides here. While the absence of updates between 2002 and 2021 might seem odd, rest assured the documentation is actively maintained.

    Instruction notes

    Each instruction has a length of two, four or six bytes, and must be located at a 2-byte boundary. Six-byte instructions have been available since S/360.
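The length of an instruction can be read from the two most significant bits of its first opcode byte. A minimal Python sketch of that rule follows; the function name `insn_length` is my own and does not come from any toolchain.

```python
def insn_length(first_byte: int) -> int:
    """Return the instruction length in bytes, given the first opcode byte."""
    top2 = (first_byte >> 6) & 0b11
    if top2 == 0b00:
        return 2
    if top2 == 0b11:
        return 6
    return 4  # 0b01 and 0b10 both mean a four-byte instruction


# First opcode bytes of a few instructions that appear later in this article.
EXAMPLES = {
    "br": 0x07,    # BCR
    "ear": 0xB2,
    "larl": 0xC0,
    "lg": 0xE3,
}
LENGTHS = {name: insn_length(byte) for name, byte in EXAMPLES.items()}
```

A disassembler can therefore step through code without a full opcode table: it only needs the first byte of each instruction to find the next one.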

    There are 16 64-bit general purpose registers. r14 is used as the link register while r15 is the stack pointer. In s390x-abi, registers r6 to r13, and r15 are designated as non-volatile (not clobbered by a function call). Registers r2 to r6 are used for integer arguments.

    • r6 being non-volatile while also used for argument passing seems uncommon compared to other architectures.
    • Only 5 registers are used for integer argument storage, which is inadequate. It is unclear why r0 and r1 are not used.

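    There is no general PC-relative addressing mode for memory operands; only dedicated relative instructions such as LARL (Load Address Relative Long) take a PC-relative operand. Fortunately, only one instruction is needed to load _GLOBAL_OFFSET_TABLE_ (see "Global Offset Table" below) into a register (usually r12).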

    larl    %r12, _GLOBAL_OFFSET_TABLE_ # r12 = _GLOBAL_OFFSET_TABLE_
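LARL encodes its operand as a signed 32-bit count of halfwords relative to the address of the instruction itself, which gives it a reach of roughly ±4 GiB. The sketch below computes that field. `larl_immediate` is a hypothetical helper name, and the addresses in the usage lines are made up.

```python
def larl_immediate(pc: int, target: int) -> int:
    """Compute the signed 32-bit halfword count that LARL stores for `target`.

    `pc` is the address of the LARL instruction itself.
    """
    delta = target - pc
    if delta % 2:
        raise ValueError("target must be at an even distance from the instruction")
    halfwords = delta // 2
    if not -(1 << 31) <= halfwords < (1 << 31):
        raise ValueError("target is out of range for a 32-bit halfword offset")
    return halfwords


# Hypothetical addresses: code at 0x1000, _GLOBAL_OFFSET_TABLE_ at 0x3000.
FORWARD = larl_immediate(0x1000, 0x3000)
BACKWARD = larl_immediate(0x3000, 0x1000)
```

Because the field counts halfwords, an odd target address cannot be expressed, which is one reason symbols referenced this way need at least 2-byte alignment.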

    Global Offset Table

    The .got section has 3 reserved entries. The linker defines _GLOBAL_OFFSET_TABLE_ at the start of .got. _GLOBAL_OFFSET_TABLE_[0] stores the link-time address of _DYNAMIC, which is used by glibc. _GLOBAL_OFFSET_TABLE_[1] and _GLOBAL_OFFSET_TABLE_[2] are for lazy binding PLT (_dl_runtime_resolve and link map in glibc).

    The assembler modifier @GOTENT is an alias for @GOT.

    Compilers generate an LGRL (Load Relative Long) instruction to load the GOT entry of a symbol. When the symbol is non-preemptible and not an ifunc, the GOT indirection can be optimized to LARL (Load Address Relative Long). This is similar to x86-64's GOTPCRELX optimization.

    lgrl %r1, var@GOT            # R_390_GOTENT(var)

    =>

    larl %r1, var
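At the byte level the rewrite above only touches the two opcode bytes: LGRL is `0xC4`, R1, `0x8`, and LARL is `0xC0`, R1, `0x0`. The following is a minimal sketch of that opcode rewrite, with encodings worked out by hand from the instruction formats. The helper name is hypothetical. A real linker also has to re-resolve the 32-bit operand against the symbol itself rather than its GOT entry and check alignment and range, which this sketch ignores.

```python
def relax_lgrl_to_larl(insn: bytes) -> bytes:
    """Rewrite a 6-byte LGRL into LARL, keeping the register and operand field."""
    if len(insn) != 6:
        raise ValueError("expected a 6-byte instruction")
    opcode = (insn[0] << 8) | insn[1]
    if opcode & 0xFF0F != 0xC408:      # LGRL: 0xC4, R1, 0x8
        raise ValueError("not an LGRL instruction")
    r1 = opcode & 0x00F0               # keep the destination register nibble
    new_opcode = 0xC000 | r1           # LARL: 0xC0, R1, 0x0
    return bytes([new_opcode >> 8, new_opcode & 0xFF]) + insn[2:]


# lgrl %r1, <operand not yet resolved>  ->  larl %r1, <operand not yet resolved>
RELAXED = relax_lgrl_to_larl(bytes.fromhex("c41800000000"))
```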

    Procedure Linkage Table

    At 32 bytes per entry, PLT entries are notably larger than on other architectures. Only the first 14 bytes (encompassing three instructions) are strictly necessary for eager binding.

    larl %r1, .got.plt[n]
    lg %r1, 0(%r1)
    br %r1
    basr %r1, %r0
    lgf %r1, 12(%r1)
    jg .plt[0]
    .long relocation offset
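The 32-byte and 14-byte figures can be checked by adding up the instruction sizes. The short sketch below does so. The sizes are my own reading of the instruction formats (`jg` being a 6-byte relative-long branch), not numbers taken from the ABI document.

```python
# (text, size in bytes) for each item in the PLT entry shown above.
PLT_ENTRY = [
    ("larl %r1, .got.plt[n]", 6),
    ("lg %r1, 0(%r1)", 6),
    ("br %r1", 2),
    ("basr %r1, %r0", 2),
    ("lgf %r1, 12(%r1)", 6),
    ("jg .plt[0]", 6),
    (".long relocation offset", 4),
]


def offsets(entry):
    """Return {text: byte offset within the PLT entry}."""
    result, pos = {}, 0
    for text, size in entry:
        result[text] = pos
        pos += size
    return result


OFF = offsets(PLT_ENTRY)
TOTAL = sum(size for _, size in PLT_ENTRY)
EAGER = sum(size for _, size in PLT_ENTRY[:3])
# basr leaves the address of the following instruction (the lgf) in r1,
# so 12(%r1) refers to this offset within the entry:
LGF_TARGET = OFF["lgf %r1, 12(%r1)"] + 12
```

The last computation shows why the displacement is 12: it lands exactly on the trailing `.long`, so the lazy-binding path loads the relocation offset before jumping to `.plt[0]`.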

    Relocations

    There are 5 absolute relocation types: R_390_{8,16,20,32,64}. They can be used as data relocations (.byte, .short, etc) as well as code relocations.

    • R_390_8 is used by instruction formats with an 8-bit immediate operand (e.g. SI).
    • R_390_16 is used by instruction formats with a 16-bit immediate operand (e.g. RI).
    • R_390_20 is used by instruction formats with a 20-bit displacement (e.g. RSY, RXY).
    • R_390_32 is used by instruction formats with a 32-bit immediate operand (e.g. RIL).
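R_390_20 is the odd one out because the 20-bit signed displacement is not stored contiguously: the instruction holds a 12-bit low part (DL) and an 8-bit high part (DH). Here is a sketch of the range check and split a linker would perform. `split_disp20` is a hypothetical name, and placing the two fields into the instruction bytes is left out.

```python
def split_disp20(value: int):
    """Split a signed 20-bit displacement into its (DL, DH) fields."""
    if not -(1 << 19) <= value < (1 << 19):
        raise ValueError("displacement does not fit in 20 signed bits")
    raw = value & 0xFFFFF   # two's complement representation in 20 bits
    dl = raw & 0xFFF        # low 12 bits
    dh = raw >> 12          # high 8 bits, which carry the sign
    return dl, dh


POSITIVE = split_disp20(0x12345)
NEGATIVE = split_disp20(-1)
```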

    R_390_GOTPLT* relocations seem unused.

    Thread Local Storage

    Refer to "All about thread-local storage" for TLS. On s390x, TLS Variant II is employed, with the glibc implementation completed in 2003. Overall, this design exhibits lower efficiency compared to other architectures. I believe the low efficiency is a self-inflicted problem instead of an architectural limitation.

    First, let's look at thread pointer accessing.

    • s390: 32-bit thread pointer stored in 32-bit access register a0.
    • s390x: 64-bit thread pointer split across a0 and a1, both still 32-bit.

    This necessitates three instructions (14 bytes) to retrieve the full thread pointer, while 64-bit access registers would simplify this:

    ear     %r0, %a0             # r0 = hi(r0) | a0
    sllg %r1, %r0, 32 # r1 = r0<<32
    ear %r1, %a1 # r1 = hi(r1) | a1 = a0<<32 | a1
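The three instructions can be modelled on plain integers to confirm that stale bits in r0 and r1 do not leak into the result. This is a sketch only: the helper names mirror the mnemonics, and the register values in the usage lines are made up.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def ear(reg: int, access_reg: int) -> int:
    """EAR: replace the low 32 bits of `reg` with the access register."""
    return (reg & (MASK64 ^ MASK32)) | (access_reg & MASK32)


def sllg(src: int, amount: int) -> int:
    """SLLG: 64-bit logical left shift of `src`."""
    return (src << amount) & MASK64


def thread_pointer(a0: int, a1: int, r0: int = 0, r1: int = 0) -> int:
    r0 = ear(r0, a0)     # ear  %r0, %a0
    r1 = sllg(r0, 32)    # sllg %r1, %r0, 32
    r1 = ear(r1, a1)     # ear  %r1, %a1
    return r1


# Hypothetical access register contents, with garbage left in r0 and r1.
TP = thread_pointer(0x000003FF, 0xF7D8E740, r0=0xDEADBEEF00000000, r1=0x123)
```

The shift discards whatever was in the high half of r0, so the sequence is correct regardless of the registers' previous contents.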

    General dynamic TLS model

    In the general dynamic TLS model, a key difference compared to other architectures is the use of __tls_get_offset instead of __tls_get_addr. The process involves several steps, illustrated by the provided assembly code:

    ear     %r0, %a0
    sllg %r1, %r0, 32
    ear %r1, %a1 # r1 = TP
    larl %r12, _GLOBAL_OFFSET_TABLE_ # r12 = _GLOBAL_OFFSET_TABLE_

    lgrl %r2, .LCPI0_0 # r2 = *(.LCPI0_0) = an offset into .got
    brasl %r14, __tls_get_offset@PLT:tls_gdcall:a # r2 = __tls_get_offset(r2) = dtv[m]+a@DTPOFF - TP
    lgf %r2, 0(%r2,%r1) # r2 = *(r2+r1) = *(dtv[m]+a@DTPOFF) = a

    .section .data.rel.ro,"aw",@progbits
    .LCPI0_0:
    .quad a@TLSGD # R_390_TLS_GD64; linker resolves this to an offset into .got

    • Retrieving the thread pointer and _GLOBAL_OFFSET_TABLE_: Four instructions are required but can be shared by subsequent TLS accesses. This step can be reordered.
    • Obtaining the GOT offset: The offset (a@TLSGD) is stored in the .data.rel.ro section. The offset refers to two GOT entries (a tls_index structure), relocated by dynamic relocations R_390_TLS_DTPMOD and R_390_TLS_DTPOFF. The dynamic loader will set the values to (m, a@DTPOFF), the module ID and an offset of the symbol relative to the dynamic TLS block.
    • Finding the offset relative to the current dynamic TLS block (DTPOFF): __tls_get_offset(r2) returns dtv[m] + a@DTPOFF - TP. __tls_get_addr on other architectures just returns dtv[m] + a@DTPOFF.
    • Adding the thread pointer to get the symbol address in the currentthread
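The arithmetic in these steps can be traced with a small model. This is a sketch of the net effect only, not of glibc's implementation. Memory is a dictionary keyed by address, all addresses are invented for illustration, and 64-bit wraparound is not modelled (Python integers simply go negative).

```python
def tls_get_offset(got_offset, got_base, memory, dtv, tp):
    """Model of the result: dtv[m] + DTPOFF - TP for the tls_index in the GOT."""
    ti = got_base + got_offset     # address of the tls_index pair
    module = memory[ti]            # entry relocated by R_390_TLS_DTPMOD
    dtpoff = memory[ti + 8]        # entry relocated by R_390_TLS_DTPOFF
    return dtv[module] + dtpoff - tp


# Hypothetical process state.
GOT_BASE = 0x20000
TP = 0x7F0000001000
DTV = {1: 0x7F0000000800, 2: 0x7F0000400000}
MEMORY = {GOT_BASE + 0x30: 2, GOT_BASE + 0x38: 0x10}  # tls_index {m=2, a@DTPOFF=0x10}

r1 = TP                                              # ear / sllg / ear
r2 = 0x30                                            # lgrl %r2, .LCPI0_0
r2 = tls_get_offset(r2, GOT_BASE, MEMORY, DTV, TP)   # brasl %r14, __tls_get_offset@PLT
ADDRESS_OF_A = r2 + r1                               # effective address used by lgf
```

The subtraction of TP inside the call and the addition of TP afterwards cancel out, which is the extra work the article criticizes below.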

    In glibc, __tls_get_offset is defined as:

    // unsigned long __tls_get_offset(unsigned long offset);

    __tls_get_offset:
    la %r2,0(%r2,%r12)
    jg __tls_get_addr

    While this general dynamic approach works, it is the least efficient implementation of general dynamic TLS among the architectures I have analyzed. Here is why:

    • Inefficient tls_index argument (similar to AArch32): This requires an extra lookup in .data.rel.ro.
    • Redundant argument: __tls_get_offset takes the GOT offset instead of the direct GOT entry address.
    • Indirect return value: Instead of returning the final TLS symbol address directly, __tls_get_offset only provides an offset, requiring an extra instruction for addition with the TP.

    The motivation behind this design might be related to reducing the number of instructions rewritten during TLS optimizations. However, it clearly comes at the cost of performance.

    The general-dynamic code sequence can be optimized to initial-exec or local-exec.

    // general-dynamic to initial-exec
    lgrl %r2, .LCPI0_0 # r2 = *(.LCPI0_0) = &.got[n]-_GLOBAL_OFFSET_TABLE_
    lg %r2, 0(%r2,%r12) # r2 = TP offset (replaces the brasl)
    lgf %r2, 0(%r2,%r1) # r2 = *(r2+r1) = TLS value in the current thread

    .section .data.rel.ro,"aw",@progbits
    .LCPI0_0:
    .quad &.got[n]-_GLOBAL_OFFSET_TABLE_ # .got[n], relocated by R_390_TLS_TPOFF, holds the TP offset

    // general-dynamic to local-exec
    lgrl %r2, .LCPI0_0 # r2 = *(.LCPI0_0) = TP offset
    brcl 0, . # nop (replaces the brasl)
    lgf %r2, 0(%r2,%r1) # r2 = *(r2+r1) = TLS value in the current thread

    .section .data.rel.ro,"aw",@progbits
    .LCPI0_0:
    .quad a@NTPOFF

    In both cases, the linker only needs to patch one instruction (the brasl call) plus the literal pool entry, instead of four instructions for PPC64.
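Since the call and both replacements are 6 bytes long, the patch is a plain in-place overwrite. The sketch below shows the idea. The byte values are my own hand encodings of `lg %r2, 0(%r2,%r12)` and `brcl 0, .` from the instruction formats, and `patch_tls_gdcall` is a hypothetical name rather than a function from any linker.

```python
GD_TO_IE = bytes.fromhex("e322c0000004")  # lg %r2, 0(%r2,%r12)
GD_TO_LE = bytes.fromhex("c00400000000")  # brcl 0, .  (6-byte nop)


def patch_tls_gdcall(code: bytearray, call_offset: int, to_local_exec: bool) -> None:
    """Overwrite the 6-byte brasl at `call_offset` in place."""
    is_brasl = code[call_offset] == 0xC0 and code[call_offset + 1] & 0x0F == 0x5
    if not is_brasl:
        raise ValueError("expected a brasl instruction")
    replacement = GD_TO_LE if to_local_exec else GD_TO_IE
    code[call_offset:call_offset + 6] = replacement


# brasl %r14, <target not yet resolved>, preceded by a 2-byte filler instruction.
IE_CODE = bytearray.fromhex("0707" "c0e500000000")
patch_tls_gdcall(IE_CODE, 2, to_local_exec=False)

LE_CODE = bytearray.fromhex("0707" "c0e500000000")
patch_tls_gdcall(LE_CODE, 2, to_local_exec=True)
```

The surrounding `lgrl` and `lgf` instructions stay untouched; only the value stored at `.LCPI0_0` changes meaning between the two variants.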

    Local dynamic TLS model

    The process involves several steps, illustrated by the provided assembly code:

    lgrl    %r2,.LC0             # r2 = *(.LC0) = GOT offset of a tls_index object holding {module_ID, 0}
    brasl %r14,__tls_get_offset@PLT:tls_ldcall:a # r2 = __tls_get_offset(r2) = dtv[m]-TP

    ear %r3, %a0
    sllg %r4, %r3, 32
    ear %r4, %a1 # r4 = TP
    la %r2,0(%r2,%r4) # r2 = r2+r4 = dtv[m]

    lgrl %r1, .LC1 # r1 = a@DTPOFF
    lgf %r1,0(%r1,%r2) # r1 = *(a@DTPOFF + dtv[m]) = a

    lgrl %r1, .LC2 # r1 = b@DTPOFF
    lgf %r1,0(%r1,%r2) # r1 = *(b@DTPOFF + dtv[m]) = b

    .section .data.rel.ro,"aw"
    .align 8
    .LC0: .quad a@TLSLDM # R_390_TLS_LDM64(a); linker resolves this to a GOT offset of tls_index{m, 0}
    .LC1: .quad a@DTPOFF # R_390_TLS_LDO64(a); linker resolves this to a's offset relative to dtv[m]
    .LC2: .quad b@DTPOFF # R_390_TLS_LDO64(b); linker resolves this to b's offset relative to dtv[m]

    • Retrieving the thread pointer and _GLOBAL_OFFSET_TABLE_
    • Obtaining the GOT offset: The offset (a@TLSLDM) is stored in the .data.rel.ro section. The offset refers to two GOT entries (a tls_index structure): the module ID and a zero. The module ID entry is relocated by a dynamic relocation R_390_TLS_DTPMOD.
    • Finding the dynamic TLS block address: __tls_get_offset(r2) returns dtv[m] - TP. It is not dtv[m] + XXX - TP because the second GOT entry is zero.
    • Adding DTPOFF to get the symbol address in the current thread

    The first three steps can be shared among TLS symbols.
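The sharing can be made concrete with a tiny model: one call yields the block address, and each symbol then costs only an add of its DTPOFF. This is a sketch with invented addresses and offsets; `local_dynamic_addresses` is a hypothetical name.

```python
def local_dynamic_addresses(block_minus_tp, tp, dtpoffs):
    """Return {symbol: address}, reusing one module-block lookup for all symbols."""
    block = block_minus_tp + tp    # la %r2, 0(%r2,%r4)  => dtv[m]
    return {name: block + off for name, off in dtpoffs.items()}


# Hypothetical process state.
TP = 0x7F0000001000
DTV_M = 0x7F0000400000             # dtv[m], this module's TLS block
RETURNED = DTV_M - TP              # what the single __tls_get_offset call yields
ADDRS = local_dynamic_addresses(RETURNED, TP, {"a": 0x0, "b": 0x4})
```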

    The local-dynamic code sequence can be optimized to local-exec.

    lgrl    %r2,.LC0             # r2 = 0
    brcl 0, . # nop

    ear %r3, %a0
    sllg %r4, %r3, 32
    ear %r4, %a1 # r4 = TP
    la %r2,0(%r2,%r4) # r2 = r2+r4 = TP

    lgrl %r1, .LC1 # r1 = a@NTPOFF
    lgf %r1,0(%r1,%r2) # r1 = *(a@NTPOFF + TP) = a

    lgrl %r1, .LC2 # r1 = b@NTPOFF
    lgf %r1,0(%r1,%r2) # r1 = *(b@NTPOFF + TP) = b

    .section .data.rel.ro,"aw"
    .align 8
    .LC0: .quad 0
    .LC1: .quad a@NTPOFF # a's TP offset
    .LC2: .quad b@NTPOFF # b's TP offset

    Initial Exec TLS model

    lgrl %r1, a@INDNTPOFF # R_390_TLS_IEENT(a); linker resolves this to a GOT entry holding the TP offset
    lgf %r1, 0(%r1,%r7) # r1 = *(a@NTPOFF + TP) = a, assuming r7 already holds TP

    Unfortunately, initial-exec cannot be optimized to local-exec. PPC32 has a similar initial-exec TLS code sequence and it allows TLS optimization by defining a marker relocation.

    Local Exec TLS model

    The code sequence loads the TP offset indirectly in a manner similar to AArch32.

    lgrl    %r1, .LC0            # r1 = a@NTPOFF
    lgf %r1, 0(%r1,%r7) # r1 = *(a@NTPOFF + TP) = a

    .section .data.rel.ro,"aw"
    .LC0: .quad a@NTPOFF # R_390_TLS_LE64; linker resolves this to the TP offset, a negative integer

    The indirection is unfortunate. The lgfi (Load Immediate) instruction loads a 32-bit signed integer into a 64-bit register, so it could be used instead whenever the TP offset fits in 32 bits.
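To see why a 32-bit immediate is normally enough, here is a rough sketch of how a Variant II TP offset for the executable's TLS block can be computed and range-checked. The formula assumes the TLS segment starts at an aligned address and sits directly below the thread pointer. The function names and example sizes are my own, not taken from any linker.

```python
def align_up(value: int, align: int) -> int:
    """Round `value` up to a multiple of `align` (a power of two)."""
    return (value + align - 1) & -align


def ntpoff(sym_offset: int, tls_memsz: int, tls_align: int) -> int:
    """TP offset of a symbol in the executable's TLS block (Variant II sketch)."""
    return sym_offset - align_up(tls_memsz, tls_align)


def fits_lgfi(value: int) -> bool:
    """LGFI takes a signed 32-bit immediate."""
    return -(1 << 31) <= value < (1 << 31)


# Hypothetical TLS segment: 12 bytes of data, 8-byte alignment, symbol at offset 4.
A_NTPOFF = ntpoff(4, 12, 8)
USABLE = fits_lgfi(A_NTPOFF)
```

Only an executable with more than 2 GiB of static TLS would fall outside the immediate's range, so the literal pool load buys very little in practice.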

    Distributions

    • https://wiki.debian.org/SupportedArchitectures
    • https://alt.fedoraproject.org/alt/
    • https://wiki.gentoo.org/wiki/Project:S390

