LLVM 17 will be released. As usual, I maintain lld/ELF and have added some notes to https://github.com/llvm/llvm-project/blob/release/17.x/lld/docs/ReleaseNotes.rst. Here I will elaborate on some changes.
* When `--threads=` is not specified, the concurrency is now capped to 16. A large `--threads=` can harm performance, especially with some system malloc implementations like glibc's. (D147493)
* `--remap-inputs=` and `--remap-inputs-file=` are added to remap input files. (D148859)
* `--lto=` is now available to support `clang -funified-lto`. (D123805)
* `--lto-CGO[0-3]` is now available to control `CodeGenOpt::Level` independent of the LTO optimization level. (D141970)
* `--check-dynamic-relocations=` is now correct for 32-bit targets when the addend is larger than 0x80000000. (D149347)
* `--print-memory-usage` has been implemented for memory regions. (D150644)
* `SHF_MERGE`, `--icf=`, and `--build-id=fast` have switched to 64-bit xxh3. (D154813)
* Quoted output section names can now be used in linker scripts. ([#60496](https://github.com/llvm/llvm-project/issues/60496))
* `MEMORY` can now be used without a `SECTIONS` command (see the sketch below). (D145132)
* `REVERSE` can now be used in input section descriptions to reverse the order of input sections. (D145381)
* Program header assignment can now be used within `OVERLAY`. This functionality was accidentally lost in 2020. (D150445)
* `^` and `^=` can now be used in linker scripts.
* `DT_AARCH64_MEMTAG_*` dynamic tags are now supported. (D143769)
* `R_ARM_THM_ALU_ABS_G*` relocations are now supported. (D153407)
* `.ARM.exidx` sections may start at a non-zero output section offset. (D148033)
* `R_AVR_8_LO8/R_AVR_8_HI8/R_AVR_8_HLO8/R_AVR_LO8_LDI_GS/R_AVR_HI8_LDI_GS` have been implemented. (D147100) (D147364)
* `--no-power10-stubs` now works for PowerPC64.
* `DT_PPC64_OPT` is now supported. (D150631)
* `PT_RISCV_ATTRIBUTES` is added to include the `SHT_RISCV_ATTRIBUTES` section. (D152065)
* `R_RISCV_PLT32` is added to support C++ relative vtables. (D143115)
* RISC-V global pointer relaxation has been implemented. Specify `--relax-gp` to enable the linker relaxation. (D143673)
* `foo` is correctly handled when `--wrap=foo` and RISC-V linker relaxation are used. (D151768)

When using glibc malloc with a larger `std::thread::hardware_concurrency` (say, more than 16), parallel relocation scanning can be considerably slower without the `--threads=16` throttling.
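To make two of the linker script items above concrete (`MEMORY` without a `SECTIONS` command, and `--print-memory-usage` for memory regions), here is a minimal sketch; the region names, addresses, and the input object `a.o` are made-up placeholders:

```sh
# Made-up linker script: memory regions only, no SECTIONS command
# (accepted since lld 17; output sections are assigned to regions
# according to the region attributes).
cat > mem.lds <<'EOF'
MEMORY {
  rom (rx)  : ORIGIN = 0x08000000, LENGTH = 256K
  ram (rwx) : ORIGIN = 0x20000000, LENGTH = 64K
}
EOF
# --print-memory-usage reports how much of each region is used.
ld.lld -T mem.lds --print-memory-usage a.o -o a.out
```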
I usually try to get extensions, unless they are too specific to LLVM internals (e.g. `--lto-*`), accepted by the binutils community. The feature request for `--remap-inputs=` and `--remap-inputs-file=` was a success story, implemented by GNU ld 2.41.
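As a rough usage sketch (every file name below is a placeholder), `--remap-inputs=` takes a `from-glob=to-file` pattern, and `--remap-inputs-file=` reads one such pattern per line:

```sh
# Redirect any reference to liblegacy.a to a replacement archive.
ld.lld main.o --remap-inputs='*/liblegacy.a=/opt/new/liblegacy.a' -o a.out

# Equivalently, keep the patterns in a file, one per line, and pass it with:
#   ld.lld main.o --remap-inputs-file=remap.txt -o a.out
```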
`PT_RISCV_ATTRIBUTES` output is still not quite right. I also question its usefulness. Unfortunately, at this stage, it's difficult to get rid of it.
This cycle has a surprising number of new features, and I have spent lots of spare time reviewing them to ensure that they are robust and properly tested. Most of this is completely unrelated to my day job.
There are quite a few AArch32 changes from Arm engineers, primarily about big-endian support and Cortex-M Security Extensions.
I was firm that the RISC-V global pointer relaxation needed to be opt-in. I had a GNU ld `--relax-gp` patch last year and utilized this opportunity (the ld.lld feature proposal) to move GNU ld's `--relax-gp` forward. There it is unfortunately opt-out, but having an option at all is a step forward.
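A small sketch of the difference in defaults (object file names are placeholders; the GNU ld opt-out spelling `--no-relax-gp` is my assumption):

```sh
# ld.lld 17: global pointer relaxation is opt-in.
ld.lld --relax-gp a.o b.o -o prog
# GNU ld 2.41 enables the relaxation by default; pass --no-relax-gp to turn it off.
ld --no-relax-gp a.o b.o -o prog
```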
Unlike previous versions, there is just a minor performance improvement compared with lld 16.0.0. I added a simplified version of 64-bit xxh3 to the LLVM Support library and utilized it in lld.
Linking a `-DCMAKE_BUILD_TYPE=Debug` build of clang 16:
```
% hyperfine --warmup 2 --min-runs 25 "numactl -C 20-27 "{/tmp/out/custom-16/bin/ld.lld,/tmp/out/custom-17/bin/ld.lld}" @response.txt --threads=8"
Benchmark 1: numactl -C 20-27 /tmp/out/custom-16/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      3.159 s ±  0.035 s    [User: 7.089 s, System: 3.076 s]
  Range (min … max):    3.095 s …  3.250 s    25 runs

Benchmark 2: numactl -C 20-27 /tmp/out/custom-17/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      3.131 s ±  0.027 s    [User: 6.851 s, System: 3.101 s]
  Range (min … max):    3.080 s …  3.198 s    25 runs

Summary
  'numactl -C 20-27 /tmp/out/custom-17/bin/ld.lld @response.txt --threads=8' ran
    1.01 ± 0.01 times faster than 'numactl -C 20-27 /tmp/out/custom-16/bin/ld.lld @response.txt --threads=8'
```
The influence on the total link time is small. However, when I measure the proportion of the total link time spent in the hash function, that proportion has dropped to roughly one third of what it was. On some workloads and some machines the effect may be larger.