llvm-project 14 will be released soon. I added some lld/ELF notes to https://github.com/llvm/llvm-project/blob/release/14.x/lld/docs/ReleaseNotes.rst. Here I will elaborate on some changes.
* `--export-dynamic-symbol-list` has been added. (D107317) When I added `--export-dynamic-symbol` to GNU ld, H.J. Lu asked me to add this option. I asked myself whether it was necessary but then realized it may help deprecate `--dynamic-list` in the long term. `--dynamic-list` is confusing: it has different semantics for executables and shared objects, and the symbolic intention for shared objects isn't clear. (A usage sketch of some new options follows this list.)
* `--why-extract` has been added to query why archive members/lazy object files are extracted. (D109572) This information was long missing from ld.lld's `-Map` output. I picked a separate option because I realized that this need is often orthogonal to the input section to output section map.
* If `-Map` is specified, `--cref` will be printed to the specified file. (D114663) A linker's stdout output is often interleaved with different information, so being able to redirect a piece of information to a file is useful. I think it would be nice if GNU ld had `--cref=<file>` instead of reusing `-Map`.
* `-z bti-report` and `-z cet-report` are now supported. (D113901)
* `--lto-pgo-warn-mismatch` has been added. (D104431)
* Archives without an index (symbol table) are now supported and work with `--warn-backrefs`. One may build such an archive with `llvm-ar rcS [--thin]` to save space. (D117284) In 15.0.0, the archive symbol table will be entirely ignored. Archives and --start-lib has more context.
* Local symbol names are no longer deduplicated at the default optimization level `-O1`. This results in a larger `.strtab` (usually less than 1%) but a faster link time. Use optimization level `-O2` to restore the deduplication. In 15.0.0, the `-O2` deduplication is dropped to help parallel `.symtab` write.
* `--compress-debug-sections=zlib` now runs in parallel. `{clang,gcc} -gz` link actions are significantly faster. (D117853) Compressed debug sections#linkers has more context.
* Writing `.rela.dyn` and `SHF_MERGE|SHF_STRINGS` sections (e.g. `.debug_str`) now runs in parallel.
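Here is a quick sketch of some of the new options; the file names (`a.o`, `main.o`, `libx.a`) and the symbol patterns are placeholders of mine, not from the release notes:

```sh
# Export matching defined symbols to .dynsym. The file reuses the
# --dynamic-list syntax, including glob patterns.
cat > dyn.list <<'EOF'
{ foo; bar*; };
EOF
clang -shared a.o -fuse-ld=lld -Wl,--export-dynamic-symbol-list=dyn.list -o b.so

# Report why each archive member/lazy object file was extracted.
ld.lld main.o libx.a --why-extract=why.txt -o a.out

# With -Map, the --cref table now goes into the map file, not stdout.
ld.lld main.o libx.a -Map=a.map --cref -o a.out
```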
Linker script changes:

* An output section moved by an `INSERT` command now gets appropriate flags. (D118529) A minimal sketch follows.
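The sketch below is my own illustration of `INSERT`; the section and file names are made up:

```sh
# Place a new output section after .data without replacing the built-in
# section layout. Per the note above, the moved output section now gets
# appropriate flags.
cat > extra.lds <<'EOF'
SECTIONS {
  .init.data : { *(.init.data) }
} INSERT AFTER .data;
EOF
ld.lld main.o -T extra.lds -o a.out
```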
Architecture specific changes:

* The AArch64 port now supports adrp+ldr (D112063) and adrp+add (D117614) optimizations. `--no-relax` can suppress the optimization. This is one case where ld.lld got an optimization (for the linked output) earlier than GNU ld.
* The x86-32 port now supports TLSDESC (`-mtls-dialect=gnu2`). (D112582)
* The x86-64 port now handles non-RAX/non-adjacent `R_X86_64_GOTPC32_TLSDESC` and `R_X86_64_TLSDESC_CALL` (`-mtls-dialect=gnu2`). (D114416)
* Mixed TLSDESC and TLS GD, i.e. mixing objects compiled with and without `-mtls-dialect=gnu2` referencing the same TLS variable, is now supported. (D114416) (A build sketch follows this list.)
* `--no-relax` now suppresses `R_X86_64_GOTPCRELX` and `R_X86_64_REX_GOTPCRELX` GOT optimization. (D113615)
* `R_X86_64_PLTOFF64` is now supported. (D112386)
* `R_AARCH64_NONE`, `R_PPC_NONE`, and `R_PPC64_NONE` in input REL relocation sections are now supported.
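To exercise the TLSDESC code paths, one can build with GCC, which implements `-mtls-dialect=gnu2` for x86; the source file here is mine:

```sh
cat > tls.c <<'EOF'
__thread int counter;
int next(void) { return ++counter; }
EOF
# -mtls-dialect=gnu2 emits TLSDESC sequences, which on x86-64 use
# R_X86_64_GOTPC32_TLSDESC and R_X86_64_TLSDESC_CALL relocations.
gcc -fpic -mtls-dialect=gnu2 -c tls.c
llvm-readelf -r tls.o | grep TLSDESC
ld.lld -shared tls.o -o tls.so
```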
Breaking changes

* `e_entry` no longer falls back to the address of `.text` if the entry symbol does not exist. Instead, a value of 0 will be written. (D110014)
* `--lto-pseudo-probe-for-profiling` has been removed. In LTO, the compiler enables this feature automatically. (D110209)
* Use of `--[no-]define-common`, `-d`, `-dc`, and `-dp` will now get a warning. They will be removed or ignored in 15.0.0. (llvm-project#53660, https://github.com/llvm/llvm-project/issues/53660)
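One can observe the `e_entry` change like this (the input file name is assumed; lld still warns about the missing entry symbol):

```sh
# a.o defines no _start, and no --entry/-e is passed.
ld.lld a.o -o a.out
# lld 13 fell back to the .text address; lld 14 writes 0.
llvm-readelf -h a.out | grep 'Entry point'
```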
I use a `-DCMAKE_BUILD_TYPE=Release -DCMAKE_EXE_LINKER_FLAGS=-Wl,--push-state,$HOME/Dev/mimalloc/out/release/libmimalloc.a,--pop-state -DLLVM_ENABLE_PROJECTS='clang;lld' -DLLVM_TARGETS_TO_BUILD=X86` `-fno-pic -no-pie` build. (Compared with glibc malloc, linking against libmimalloc.a is 1.12x as fast.) The host compiler is a close-to-main clang. Both input and output are in tmpfs.
I have made dozens of changes scattered across the lld/ELF codebase to improve performance; the benchmarks below show the cumulative effect. (A spelled-out configuration sketch follows.)
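For concreteness, this is roughly how such a configuration could be spelled out; the generator, directory layout, and the way `-fno-pic`/`-no-pie` are injected are my assumptions:

```sh
# -DCMAKE_{C,CXX}_FLAGS relies on shell brace expansion (bash/zsh).
cmake -G Ninja -S llvm -B out/release \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_{C,CXX}_FLAGS=-fno-pic \
  -DCMAKE_EXE_LINKER_FLAGS="-no-pie -Wl,--push-state,$HOME/Dev/mimalloc/out/release/libmimalloc.a,--pop-state" \
  -DLLVM_ENABLE_PROJECTS='clang;lld' \
  -DLLVM_TARGETS_TO_BUILD=X86
```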
Linking a `-DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON` build of clang:

```
% hyperfine --warmup 2 --min-runs 16 "numactl -C 20-27 "/tmp/llvm-{13,14}/out/release/bin/ld.lld" @response.txt --threads=8"
Benchmark 1: numactl -C 20-27 /tmp/llvm-13/out/release/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      1.133 s ±  0.007 s    [User: 1.277 s, System: 0.436 s]
  Range (min … max):    1.119 s …  1.142 s    16 runs

Benchmark 2: numactl -C 20-27 /tmp/llvm-14/out/release/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      1.003 s ±  0.012 s    [User: 1.286 s, System: 0.439 s]
  Range (min … max):    0.988 s …  1.025 s    16 runs

Summary
  'numactl -C 20-27 /tmp/llvm-14/out/release/bin/ld.lld @response.txt --threads=8' ran
    1.13 ± 0.01 times faster than 'numactl -C 20-27 /tmp/llvm-13/out/release/bin/ld.lld @response.txt --threads=8'
```

Much of the speedup is credited to the omitted `.strtab` deduplication. (`--threads=1` => 1.16x, `--threads=2` => 1.17x)
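To observe the `.strtab` growth from the omitted deduplication, one can compare section sizes; the output file names and the `llvm-readelf` step are mine:

```sh
# -O1 is the default in lld 14 (no local symbol name deduplication);
# -O2 restores the deduplication.
/tmp/llvm-14/out/release/bin/ld.lld @response.txt -o clang.O1
/tmp/llvm-14/out/release/bin/ld.lld @response.txt -O2 -o clang.O2
llvm-readelf -S clang.O1 clang.O2 | grep -w .strtab
```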
Linking a `-DCMAKE_BUILD_TYPE=Debug` build of clang:

```
% hyperfine --warmup 2 --min-runs 16 "numactl -C 20-27 "/tmp/llvm-{13,14}/out/release/bin/ld.lld" @response.txt --threads=8"
Benchmark 1: numactl -C 20-27 /tmp/llvm-13/out/release/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      5.237 s ±  0.032 s    [User: 8.976 s, System: 1.831 s]
  Range (min … max):    5.194 s …  5.288 s    16 runs

Benchmark 2: numactl -C 20-27 /tmp/llvm-14/out/release/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      4.480 s ±  0.024 s    [User: 8.674 s, System: 1.756 s]
  Range (min … max):    4.442 s …  4.522 s    16 runs

Summary
  'numactl -C 20-27 /tmp/llvm-14/out/release/bin/ld.lld @response.txt --threads=8' ran
    1.17 ± 0.01 times faster than 'numactl -C 20-27 /tmp/llvm-13/out/release/bin/ld.lld @response.txt --threads=8'
```

(`--threads=1` => 1.05x, `--threads=2` => 1.09x)
Linking a `-DCMAKE_BUILD_TYPE=RelWithDebInfo` build of clang:

```
% hyperfine --warmup 2 --min-runs 16 "numactl -C 20-27 "/tmp/llvm-{13,14}/out/release/bin/ld.lld" @response.txt --threads=8"
Benchmark 1: numactl -C 20-27 /tmp/llvm-13/out/release/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      1.520 s ±  0.017 s    [User: 3.797 s, System: 1.210 s]
  Range (min … max):    1.479 s …  1.545 s    16 runs

Benchmark 2: numactl -C 20-27 /tmp/llvm-14/out/release/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      1.101 s ±  0.012 s    [User: 3.679 s, System: 1.244 s]
  Range (min … max):    1.084 s …  1.125 s    16 runs

Summary
  'numactl -C 20-27 /tmp/llvm-14/out/release/bin/ld.lld @response.txt --threads=8' ran
    1.38 ± 0.02 times faster than 'numactl -C 20-27 /tmp/llvm-13/out/release/bin/ld.lld @response.txt --threads=8'
```

(`--threads=2` => 1.16x) 0.13s of the saving is credited to the parallel write of `.debug_str`. 0.08s is credited to not using `posix_fallocate`. More is credited to the optimized computation and sorting of `.rela.dyn`.
Linking a default build of chrome:

```
% hyperfine --warmup 2 --min-runs 16 "numactl -C 20-27 "/tmp/llvm-{13,14}/out/release/bin/ld.lld" @response.txt --threads=8"
Benchmark 1: numactl -C 20-27 /tmp/llvm-13/out/release/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      8.017 s ±  0.042 s    [User: 7.440 s, System: 3.238 s]
  Range (min … max):    7.946 s …  8.089 s    16 runs

Benchmark 2: numactl -C 20-27 /tmp/llvm-14/out/release/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      6.857 s ±  0.052 s    [User: 6.921 s, System: 3.006 s]
  Range (min … max):    6.796 s …  6.982 s    16 runs

Summary
  'numactl -C 20-27 /tmp/llvm-14/out/release/bin/ld.lld @response.txt --threads=8' ran
    1.17 ± 0.01 times faster than 'numactl -C 20-27 /tmp/llvm-13/out/release/bin/ld.lld @response.txt --threads=8'
```
I have made some changes decreasing `sizeof(SymbolUnion)` and `sizeof(InputSection)`. There is a 1~2% peak memory decrease for some programs with several malloc implementations. (A measurement sketch follows.)
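A simple way to measure the peak memory of a link, assuming GNU time is installed (`command` bypasses the shell's `time` keyword):

```sh
# -v prints "Maximum resident set size (kbytes)" among other statistics.
command time -v ld.lld @response.txt --threads=8 -o /tmp/clang 2>&1 |
  grep 'Maximum resident'
```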
ThinLTO applications will see a larger reduction. lld uses file-backed mmap to read input files. For ThinLTO indexing, the page buffers are nearly unused after symbol resolution. I have changed lld to call `madvise(MADV_DONTNEED)` to overlap the page buffer memory with the memory allocated by the LTO library (mostly ThinLTO import and export lists): https://reviews.llvm.org/D116367. This change led to a 16% reduction when linking a large executable.
I have made another change to the `--start-lib` code path to cache the symbol interning result, which led to a 0.6% reduction: https://reviews.llvm.org/D116390.
I have audited commits by others. Almost all have nearly no size difference or slightly increase code size. I have made some patches which improve flexibility and increase code size, but also a dozen which decrease code size. In a `-DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++` build, the total size of the lldELF object files can be computed with:

```
% stat -c %s llvm-13/out/release/lib/../tools/lld/ELF/CMakeFiles/lldELF.dir/**/*.o | awk '{s+=$1}END{print s}'
```
mold 1.1 was just released. People who wonder about lld's performance can check out Why isn't ld.lld faster?
If one module consists of N features, the time complexity of incrementally updating the module may be more than O(N), because the N features may interact and produce a superlinear number of edges (though possibly still fewer than O(N^2)). This rule applies to introducing multi-threading to ld.lld's symbol processing.