lld 17 ELF changes

Posted by MaskRay on 2023-07-31 04:01:32

LLVM 17 will be released. As usual, I maintain lld/ELF and have added some notes to https://github.com/llvm/llvm-project/blob/release/17.x/lld/docs/ReleaseNotes.rst. Here I will elaborate on some changes.

When --threads= is not specified, the concurrency is now capped at 16. A large --threads= value can harm performance, especially with some system malloc implementations like glibc's. (D147493)
--remap-inputs= and --remap-inputs-file= are added to remap input files. (D148859)
--lto= is now available to support clang -funified-lto. (D123805)
--lto-CGO[0-3] is now available to control CodeGenOpt::Level independently of the LTO optimization level. (D141970)
--check-dynamic-relocations= is now correct for 32-bit targets when the addend is larger than 0x80000000. (D149347)
--print-memory-usage has been implemented for memoryregions. (D150644)
SHF_MERGE, --icf=, and --build-id=fast have switched to 64-bit xxh3. (D154813)
Quoted output section names can now be used in linker scripts. (#60496, https://github.com/llvm/llvm-project/issues/60496)
MEMORY can now be used without a SECTIONS command. (D145132)
REVERSE can now be used in input section descriptions to reverse the order of input sections. (D145381)
Program header assignment can now be used within OVERLAY. This functionality was accidentally lost in 2020. (D150445)
Operators ^ and ^= can now be used in linker scripts.
LoongArch is now supported.
DT_AARCH64_MEMTAG_* dynamic tags are now supported. (D143769)
AArch32 port now supports BE-8 and BE-32 modes for big-endian. (D140201) (D140202) (D150870)
R_ARM_THM_ALU_ABS_G* relocations are now supported. (D153407)
.ARM.exidx sections may start at non-zero output section offset. (D148033)
Arm Cortex-M Security Extensions is now implemented. (D139092)
BTI landing pads are now added to PLT entries accessed by range extension thunks or relative vtables. (D148704) (D153264)
AArch64 short range thunk has been implemented to mitigate the performance loss of a long range thunk. (D148701)
R_AVR_8_LO8/R_AVR_8_HI8/R_AVR_8_HLO8/R_AVR_LO8_LDI_GS/R_AVR_HI8_LDI_GS have been implemented. (D147100) (D147364)
--no-power10-stubs now works for PowerPC64.
DT_PPC64_OPT is now supported. (D150631)
PT_RISCV_ATTRIBUTES is added to include the SHT_RISCV_ATTRIBUTES section. (D152065)
R_RISCV_PLT32 is added to support C++ relative vtables. (D143115)
RISC-V global pointer relaxation has been implemented. Specify --relax-gp to enable the linker relaxation. (D143673)
The symbol value of foo is correctly handled when --wrap=foo and RISC-V linker relaxation are used. (D151768)
x86-64 large data sections are now placed away from code sections to alleviate relocation overflow pressure. (D150510)

When using glibc malloc on a machine with a large std::thread::hardware_concurrency (say, more than 16), parallel relocation scanning can be considerably slower without the --threads=16 throttling.
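The capping rule can be sketched in a few lines. This is only an illustration of the behavior described above, not lld's actual code: the function names are hypothetical, and the real linker determines the available hardware threads in its own way.

```python
import os

# Cap described in the release notes for the case where --threads= is absent.
DEFAULT_THREAD_CAP = 16


def default_thread_count(hardware_threads, cap=DEFAULT_THREAD_CAP):
    """Return the concurrency to use when --threads= was not given."""
    # Guard against a platform that reports 0 or None hardware threads.
    available = max(1, hardware_threads or 1)
    return min(available, cap)


def effective_thread_count(requested, hardware_threads):
    """Honor an explicit --threads=N; only the default is capped."""
    if requested is not None:
        return requested
    return default_thread_count(hardware_threads)


if __name__ == "__main__":
    # On a 64-thread machine this prints 16; on an 8-thread machine, 8.
    print(effective_thread_count(None, os.cpu_count()))
```

An explicit --threads=32 on a large machine is still accepted in this sketch; the cap only changes what happens when the option is omitted.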

I usually try to get extensions accepted by the binutils community, unless they are too specific to LLVM internals (e.g. --lto-*). The feature request for --remap-inputs= and --remap-inputs-file= was a success story: it has been implemented in GNU ld 2.41.
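To make the remapping idea concrete, here is a rough model of how a list of from-glob=to-file rules could be applied to input paths. It is a simplified sketch, not the linker's implementation: the helper names and file paths are hypothetical, the rule-file format (one from-glob=to-file per line, with # comments) is my assumption here, and Python's fnmatch dialect is not guaranteed to match the linker's glob dialect exactly.

```python
from fnmatch import fnmatchcase


def parse_remap_lines(lines):
    """Parse 'from-glob=to-file' lines, skipping blanks and '#' comments."""
    rules = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        pattern, sep, target = line.partition("=")
        if not sep or not pattern or not target:
            raise ValueError(f"expected from-glob=to-file, got: {line!r}")
        rules.append((pattern, target))
    return rules


def remap_input(path, rules):
    """Return the replacement for path, or path itself if no rule matches."""
    for pattern, target in rules:
        # First matching rule wins in this simplified model.
        if fnmatchcase(path, pattern):
            return target
    return path


# Hypothetical rules: swap one object file and redirect a set of archives.
rules = parse_remap_lines([
    "# replace a.o with an instrumented copy",
    "a.o=/tmp/instrumented/a.o",
    "lib/*.a=/tmp/stubs/empty.a",
])
```

With these rules, remap_input("a.o", rules) yields the instrumented copy, while an unrelated input such as b.o passes through unchanged.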

The PT_RISCV_ATTRIBUTES output is still not quite right. I also question its usefulness. Unfortunately, at this stage, it's difficult to get rid of it.

This cycle has a surprising number of new features, and I have spent lots of spare time reviewing them to ensure that they are robust and properly tested. Most of this work is completely unrelated to my day job.

There are quite a few AArch32 changes from Arm engineers, primarily about big-endian support and Cortex-M Security Extensions.

I was firm that the RISC-V global pointer relaxation needs to be opt-in. I had a GNU ld --relax-gp patch last year and utilized this opportunity (the ld.lld feature proposal) to move GNU ld --relax-gp forward. In GNU ld the relaxation is unfortunately opt-out, but having an option is a step forward.
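For readers unfamiliar with the relaxation itself, the core eligibility test is a range check: a gp-relative access uses a signed 12-bit immediate, so the symbol must lie within -2048..2047 bytes of the global pointer. The sketch below shows only that check, with hypothetical function names and addresses; a real linker applies further conditions that are not modeled here.

```python
def fits_simm12(value):
    """True if value fits in a signed 12-bit immediate."""
    return -2048 <= value <= 2047


def can_use_gp_relative(symbol_addr, gp_addr):
    """Range check only: is the symbol reachable from gp with one immediate?"""
    return fits_simm12(symbol_addr - gp_addr)


# Hypothetical addresses for illustration.
gp = 0x12800
near = 0x12000   # gp - 2048: just inside the window
far = 0x13000    # gp + 2048: just outside the window
```

Here can_use_gp_relative(near, gp) holds while can_use_gp_relative(far, gp) does not, which is why only data placed close to the global pointer benefits.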

Speed

Unlike previous versions, there is just a minor performance improvement compared with lld 15.0.0. I added a simplified version of 64-bit xxh3 into the LLVMSupport library and utilized it in lld.

Linking a -DCMAKE_BUILD_TYPE=Debug build of clang 16:

% hyperfine --warmup 2 --min-runs 25 "numactl -C 20-27 "{/tmp/out/custom-16/bin/ld.lld,/tmp/out/custom-17/bin/ld.lld}" @response.txt --threads=8"
Benchmark 1: numactl -C 20-27 /tmp/out/custom-16/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      3.159 s ±  0.035 s    [User: 7.089 s, System: 3.076 s]
  Range (min … max):    3.095 s …  3.250 s    25 runs

Benchmark 2: numactl -C 20-27 /tmp/out/custom-17/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      3.131 s ±  0.027 s    [User: 6.851 s, System: 3.101 s]
  Range (min … max):    3.080 s …  3.198 s    25 runs

Summary
  'numactl -C 20-27 /tmp/out/custom-17/bin/ld.lld @response.txt --threads=8' ran
    1.01 ± 0.01 times faster than 'numactl -C 20-27 /tmp/out/custom-16/bin/ld.lld @response.txt --threads=8'

The effect on the total link time is small. However, if I measure the proportion of the total link time spent in the hash function, I can see that it has been reduced to nearly one third. On some workloads and some machines the effect may be larger.
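As a quick check of the hyperfine summary, the ratio can be recomputed from the two reported means. The snippet uses only the numbers printed above; the function name is just for illustration.

```python
def speedup(old_mean_s, new_mean_s):
    """Ratio of the old mean time to the new mean time."""
    return old_mean_s / new_mean_s


# Mean wall-clock times reported by hyperfine above, in seconds.
old_mean = 3.159  # custom-16 ld.lld
new_mean = 3.131  # custom-17 ld.lld

ratio = speedup(old_mean, new_mean)
saved_ms = (old_mean - new_mean) * 1000

print(f"{ratio:.2f}x faster, {saved_ms:.0f} ms saved per link")
```

The difference of roughly 28 ms is about the size of the reported standard deviations (35 ms and 27 ms), which is consistent with calling the improvement minor.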