    Clang's -O0 output: branch displacement and size increase

    Posted by MaskRay on 2024-04-27 01:33:23

    tl;dr Clang 19 will remove the -mrelax-all default at -O0, significantly decreasing the text section size for x86.

    Span-dependent instructions

    In assembly languages, some instructions with an immediate operand can be encoded in two (or more) forms with different sizes. On x86-64, a JMP/JCC (jump) can be encoded either in 2 bytes with an 8-bit relative offset, or in 5 bytes (JMP) or 6 bytes (JCC) with a 32-bit relative offset. The short form is preferred because it takes less space. However, when the target of the jump is too far away, the long form must be used.

    ja foo    # jump near if above, 0f 87 <rel32>
    ja foo    # jump short if above, 77 <rel8>
    .nops 126
    foo: ret
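The size choice for the two `ja` instructions above can be checked with a little arithmetic. The following is a minimal C sketch, not code taken from any assembler; the function names `fits_rel8`, `jcc_size`, and `first_ja_disp_if_all_short` are made up for illustration. It assumes the displacement is measured from the end of the jump instruction, which is how x86 relative jumps are encoded.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a displacement fits the signed 8-bit field of the short form. */
bool fits_rel8(int64_t disp) {
    return disp >= INT8_MIN && disp <= INT8_MAX;
}

/* Encoded size in bytes of an x86-64 JCC for a given displacement
   (for JA: 77 <rel8> is 2 bytes, 0f 87 <rel32> is 6 bytes). */
int jcc_size(int64_t disp) {
    return fits_rel8(disp) ? 2 : 6;
}

/* Layout of the snippet above: ja, ja, 126 bytes of NOPs, then foo.
   Hypothetical helper: displacement of the first ja if both jumps
   were tentatively encoded in the short form. */
int first_ja_disp_if_all_short(void) {
    int first_end = 2;        /* the first ja occupies bytes [0, 2) */
    int foo = 2 + 2 + 126;    /* foo follows both jumps and the NOPs */
    return foo - first_end;   /* distance from the end of the first ja */
}
```

With both jumps tentatively short, the first `ja` would need a displacement of 128, one more than rel8 can hold, so it has to use the 6-byte form. The second `ja` is 126 bytes away from `foo` and stays at 2 bytes.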

    A 1978 paper by Thomas G. Szymanski ("Assembling Code for Machines with Span-Dependent Instructions") used the term "span-dependent instructions" to refer to such instructions. Assemblers grapple with the challenge of choosing the optimal size for these instructions, often referred to as the "branch displacement problem" since branches are the most common type. A good resource for understanding Szymanski's work is Assembling Span-Dependent Instructions.

    Start small and grow

    Popular assemblers still used today tend to favor a "start small and grow" approach, typically requiring one more pass than Szymanski's "start big and shrink" method. This approach often results in smaller code and can handle additional complexities like alignment directives.
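A "start small and grow" pass can be sketched as a fixed-point loop. This is a simplified illustration in C, not the algorithm any particular assembler implements: the `item` struct, the `relax_grow` function, and the 64-item limit are assumptions made for the example, and only the 2-byte and 6-byte JCC forms are modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One item of a toy section: either a fixed-size blob or a jump. */
struct item {
    bool is_jump;
    int size;       /* current encoded size in bytes */
    size_t target;  /* for jumps: index of the item jumped to (<= n) */
};

/* Every jump starts in the 2-byte form and is widened to 6 bytes when its
   displacement does not fit in rel8. A jump never shrinks again, so the
   loop terminates. Returns the number of layout passes, or -1 if the
   section exceeds the toy limit. */
int relax_grow(struct item *items, size_t n) {
    int64_t offset[64];   /* offset[i] is the start address of item i */
    int passes = 0;
    bool changed = true;

    if (n >= 64) return -1;
    for (size_t i = 0; i < n; i++)
        if (items[i].is_jump) items[i].size = 2;   /* start small */

    while (changed) {
        changed = false;
        passes++;
        offset[0] = 0;
        for (size_t i = 0; i < n; i++)
            offset[i + 1] = offset[i] + items[i].size;

        for (size_t i = 0; i < n; i++) {
            if (!items[i].is_jump || items[i].size == 6) continue;
            /* Displacement is relative to the end of the jump. */
            int64_t disp = offset[items[i].target] - offset[i + 1];
            if (disp < INT8_MIN || disp > INT8_MAX) {
                items[i].size = 6;   /* grow */
                changed = true;
            }
        }
    }
    return passes;
}
```

For a section holding two jumps to the same label, a 126-byte blob, and a 1-byte `ret` at that label, the first pass widens only the first jump and the second pass confirms that nothing else changes.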

    In LLVM, the MC library (Machine Code) is responsible for assembly, disassembly, and object file formats. Within MC, "assembler relaxation" deals with span-dependent instructions. This is distinct from linker relaxation.

    Eli Bendersky provides a detailed explanation in a 2013 blog post and highlights an interesting behavior:

    For example, when compiling with -O0, the LLVM assembler simply relaxes all jumps it encounters on first sight. This allows it to put all instructions immediately into data fragments, which ensures there's much fewer fragments overall, so the assembly process is faster and consumes less memory.

    When -O0 is enabled and the integrated assembler is used (the common default), clangDriver passes the -mrelax-all flag to the LLVM MC library. This sets the MCRelaxAll flag in MCTargetOptions, instructing the assembler to potentially start with the long form (near) for JMP and JCC instructions on the X86 target only. Other instructions like ADD/SUB/CMP and non-x86 architectures remain unaffected.

    -mrelax-all tradeoff

    Here is an example:

    void bar();
    void foo(int a) {
      // -mrelax-all: near jump (6 bytes)
      // -mno-relax-all or -fno-integrated-as: short jump (2 bytes)
      if (a) bar();
    }

    The assembly (clang -S) looks like:

    foo:                                    # @foo
    # %bb.0:                                # %entry
            pushq   %rbp
            movq    %rsp, %rbp
            subq    $16, %rsp
            movl    %edi, -4(%rbp)
            cmpl    $0, -4(%rbp)
            je      .LBB0_2
    # %bb.1:                                # %if.then
            movb    $0, %al
            callq   bar@PLT
    .LBB0_2:                                # %if.end
            addq    $16, %rsp
            popq    %rbp
            retq

    The JE instruction assembles to either a "jump short" (8-bit relative offset) or a "jump near" (32-bit relative offset).

    # -mrelax-all
    MCSection
      MCDataFragment: empty
      MCAlignFragment: alignment=4
      MCDataFragment: instructions including JE where JE indicates a "jump near"

    # -mno-relax-all
    MCSection
      MCDataFragment: empty
      MCAlignFragment: alignment=4
      MCDataFragment: instructions before JE (push; mov; sub; mov; cmp)
      MCRelaxableFragment: JE, indicating a "jump short"
      MCDataFragment: instructions after JE (mov; call; add; pop; ret)
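The fragment split in the listing above can be mimicked with a toy counter. This is a rough model in C, not how the MC library is written: the function `count_instruction_fragments` and the one-character-per-instruction encoding are assumptions for illustration, and the model treats only jumps as relaxable.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model: 'j' is a JMP/JCC, any other character is a fixed-size
   instruction. Counts the fragments that hold instructions; the leading
   empty data fragment and the align fragment are not counted. */
size_t count_instruction_fragments(const char *insns, bool relax_all) {
    size_t fragments = 0;
    bool in_data = false;   /* is a data fragment currently open? */

    for (const char *p = insns; *p; p++) {
        if (*p == 'j' && !relax_all) {
            fragments++;      /* the jump gets its own relaxable fragment */
            in_data = false;  /* later instructions need a new data fragment */
        } else if (!in_data) {
            fragments++;      /* open a new data fragment */
            in_data = true;
        }
        /* otherwise the instruction is appended to the open data fragment */
    }
    return fragments;
}
```

For `foo`, written as `"iiiiijiiiii"` (five instructions, JE, five instructions), the model yields one instruction fragment with `relax_all` set and three without it, matching the two listings.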

    The impact of -mrelax-all on text section size is significant, especially when there are many branch instructions. In an x86-64 release build of lld, -mrelax-all increased the .text section size by 7.9%. This translates to a 5.4% increase in VM size and a 4.6% increase in the overall file size.

    Dean Michael Berris proposed to remove the -mrelax-all default for -O0 in 2016, but it stalled. -mrelax-all caused undesired interaction issues with RISC-V's conditional branch transforms, leading Craig Topper to remove -mrelax-all at -O0 for RISC-V.

    While -mrelax-all might have offered slight compile time benefits in the past, the gains are negligible today. Benchmarking using stage 2 builds of Clang showed no measurable difference between -mrelax-all and -mno-relax-all. On llvm-compile-time-tracker running the llvm-test-suite/CTMark benchmark, removing -mrelax-all actually increased compile time slightly, by 0.62%, while the text section size decreased by 4.44%.

    A difference for assembly at different optimisation levels would be quite surprising. GCC/GNU assembler don't exhibit similar expansion of JMP/JCC instructions even at -O0.

    These arguments strengthen the case for removing -mrelax-all as the default for -O0. My patch has landed and will be included in the next major release, LLVM 19.1.

    Understanding the compile time difference

    I have studied a notoriously huge file, llvm/lib/Target/X86/X86ISelLowering.cpp.

    Fragment count: A significant difference exists in the number of assembler fragments generated:

    • -mrelax-all: 89633
    • -mno-relax-all: 143852

    With -mrelax-all, the number of MCRelaxableFragments is substantially reduced (to zero when building Clang). This reduction likely contributes to the compile time difference.

    Fixed-point iteration: -mrelax-all ensures the fixed-point iteration algorithm (almost always) converges in a single iteration. In contrast, with -mno-relax-all, around 6% of sections require additional iterations. However, this difference is likely not the primary factor affecting compile time. The listing maps the number of iterations to the number of sections that needed that many:

    // -mrelax-all
    1: 13919
    2: 1

    // -mno-relax-all
    1: 13103
    2: 793
    3: 23
    4: 1
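The "around 6%" figure follows directly from the histogram above. As a quick check, here is a small C helper; the name `multi_iteration_percent` and the array layout are assumptions for the example, with `counts[i]` holding the number of sections that converged in `i + 1` iterations.

```c
#include <stddef.h>

/* Percentage of sections that needed more than one iteration.
   counts[i] is the number of sections that converged in i + 1 iterations. */
double multi_iteration_percent(const unsigned *counts, size_t n) {
    unsigned total = 0, extra = 0;
    for (size_t i = 0; i < n; i++) {
        total += counts[i];
        if (i > 0) extra += counts[i];   /* anything beyond one iteration */
    }
    return total ? 100.0 * extra / total : 0.0;
}
```

With the -mno-relax-all counts {13103, 793, 23, 1}, 817 of 13920 sections need extra iterations, which is roughly 5.9%. With the -mrelax-all counts {13919, 1} it is a single section out of 13920.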

    Why didn't people complain about the code size increase?

    Because people generally care less about -O0 code size. -O0 is often used alongside -g for debugging purposes. The total file size increase caused by -mrelax-all might seem less significant in comparison.

    In addition, not all projects can be successfully built with -O0 optimization. This is typically due to issues like very large programs or mandatory inlining behavior.

    For a discussion on size reduction ideas in ELF relocatable files, please check out my blog post about LightELF.


