
    Assemblers

    Posted by MaskRay on 2023-05-09 01:02:25

    UNDER CONSTRUCTION

    This article describes popular assemblers and their architecture-specific differences.

    Assemblers

    GCC generates assembly code and invokes GNU Assembler (also known as "gas"), which is part of GNU Binutils, to convert the assembly code into machine code. The GCC driver is also capable of accepting assembly input files. Due to GCC's widespread usage, GNU Assembler is arguably the most popular assembler.

    Within the LLVM project, the LLVM integrated assembler is a library that is linked by Clang, llvm-mc, and lld (for LTO purposes) to generate machine code. It supports a wide range of GNU Assembler syntax and can be used as a drop-in replacement for GNU Assembler.

    On the Windows platform, the Microsoft Macro Assembler (MASM) is widely used.

    For the x86 architecture, NASM is another popular assembler.

    Architectures

    x86

    There are two main branches of syntax: Intel syntax and AT&T syntax. AT&T syntax is derived from PDP-11 and exhibits several key differences:

    • The operand list is reversed compared to Intel syntax.
    • The four-part generic addressing mode is written as displacement(base,index,scale) instead of [base+index*scale+disp] in Intel syntax.
    • Immediate values are prefixed with $, while registers are prefixed with %.
    • The mnemonics have a suffix indicating the operand size, e.g. b for 1 byte, w for 2 bytes (Word), l for 4 bytes (Long, i.e. Dword), and q for 8 bytes (Qword).

    Although the sigils add some complexity to the language, they do provide a distinct advantage: symbol references can be parsed without ambiguity. Many x86 instructions take an operand that can be a register or a memory location. With sigils, parsing becomes unambiguous, as demonstrated by examples such as subl var, %eax and subl $1, %eax.

    % gcc -S a.c
    % cat a.s
    ...
    movl var(%rip), %eax
    addl $3, %eax
    % gcc -S -masm=intel a.c
    % cat a.s
    ...
    .intel_syntax noprefix
    ...
    mov eax, DWORD PTR var[rip]
    add eax, 3
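
    To make the addressing-mode difference from the list above concrete, here is a minimal Python sketch that rewrites an AT&T memory operand of the form displacement(base,index,scale) into the bracketed Intel form. The function name att_mem_to_intel and the regular expression are illustrative assumptions, not part of any assembler. The sketch handles only plain operands (no segment overrides or symbol modifiers) and emits the bracketed [base+index*scale+disp] spelling rather than the var[rip] spelling that GCC prints above.

```python
import re

# AT&T four-part memory operand: displacement(base,index,scale).
# Every part is optional; this pattern is a simplification for illustration.
ATT_MEM = re.compile(
    r"^(?P<disp>[^()]*)"
    r"\((?P<base>%\w+)?(?:,(?P<index>%\w+)(?:,(?P<scale>[1248]))?)?\)$"
)


def att_mem_to_intel(operand):
    """Rewrite e.g. '8(%rax,%rcx,4)' as '[rax+rcx*4+8]' (hypothetical helper)."""
    m = ATT_MEM.match(operand.replace(" ", ""))
    if m is None:
        raise ValueError(f"not an AT&T memory operand: {operand!r}")

    parts = []
    if m["base"]:
        parts.append(m["base"].lstrip("%"))  # drop the register sigil
    if m["index"]:
        scale = m["scale"] or "1"            # an omitted scale means 1
        parts.append(f"{m['index'].lstrip('%')}*{scale}")

    body = "+".join(parts)
    disp = m["disp"]
    if disp:
        # A leading minus sign already acts as the operator.
        if body and not disp.startswith("-"):
            body += "+"
        body += disp
    return f"[{body}]"


print(att_mem_to_intel("8(%rax,%rcx,4)"))  # [rax+rcx*4+8]
print(att_mem_to_intel("var(%rip)"))       # [rip+var]
```

    Anything without parentheses, such as %eax or $1, is rejected with a ValueError. That works only because the sigils already tell registers and immediates apart from memory references.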

    Intel syntax is generally concise, except for the verbose size directives (e.g., DWORD PTR). It is widely utilized in the Windows environment and within the reverse engineering community.

    However, Intel syntax has a flaw related to ambiguity, as it prevents the use of variable names that collide with registers (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53929).

    % cat ambiguous.c
    int *rip, rax;
    int foo() { return rip[rax]; }
    % gcc -S -masm=intel ambiguous.c -o -
    ...
    mov rax, QWORD PTR rip[rip]
    mov edx, DWORD PTR rax[rip]
    % gcc -c -masm=intel ambiguous.c
    /tmp/ccEOMwm6.s: Assembler messages:
    /tmp/ccEOMwm6.s:28: Error: invalid use of register
    /tmp/ccEOMwm6.s:29: Error: invalid use of register
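
    One way to catch this kind of collision before the assembler does is to compare global symbol names against register names. The Python sketch below is only an illustration: the register set is deliberately partial, the helper name colliding_symbols is hypothetical, and the case-insensitive comparison is a conservative assumption rather than a statement about how any particular assembler matches names.

```python
# Partial, illustrative register list; a real assembler knows many more names.
GPR64 = ["rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp", "rsp"]

REGISTERS = set(GPR64)
REGISTERS |= {"e" + name[1:] for name in GPR64}  # eax, ebx, ...
REGISTERS |= {f"r{n}" for n in range(8, 16)}     # r8 .. r15
REGISTERS |= {"rip", "eip"}


def colliding_symbols(symbols):
    """Return the symbols that would be read as registers without sigils."""
    return sorted(s for s in symbols if s.lower() in REGISTERS)


# The globals from ambiguous.c, plus two harmless names.
print(colliding_symbols(["rip", "rax", "foo", "var"]))  # ['rax', 'rip']
```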

    I believe it would have been beneficial if the designers had added sigils to Intel syntax to disambiguate symbol references from registers. The absence of AT&T-style line noise makes Intel syntax code much more readable. Unfortunately, Intel syntax is less popular in software code because GCC defaults to AT&T syntax. (Please, really, make -masm=intel the default for x86.)

    Using as -msyntax=intel -mnaked-reg allows parsing the input in Intel syntax without a register prefix. This is similar to including a .intel_syntax noprefix directive in the input.

    With llvm-mc -x86-asm-syntax=intel, the input can be parsed in Intel syntax. Using -output-asm-variant=1 will print instructions in Intel syntax.

    MIPS

    Modifiers are used to describe different access types of a symbol. As a bonus, this prevents symbol references from being mistaken for register names. However, the function call-like syntax can appear verbose.

    lui     a0, %tprel_hi(tls)
    add a0, a0, tp, %tprel_add(tls)
    lw a0, %tprel_lo(tls)(a0)
    lui a1, %hi(var)
    lw a2, %lo(var)(a1)

    Power ISA

    Power ISA assembly may seem unusual, as general-purpose registers are not written with an r prefix. Whether an integer denotes a register or an immediate value depends on its position as an operand in the instruction. I find that this slightly hurts readability.

    Similar to x86, postfix modifiers are used to describe different access kinds of a symbol.

    addis 5, 13, tls@tprel@ha
    lwz 5, tls@tprel@l(5)
    addis 3, 2, var@toc@ha
    addi 3, 3, var@toc@l

    AArch64

    Prefix modifiers are used to describe various access types of a symbol. Personally, this is the modifier syntax that I prefer the most.

    add     x8, x8, :tprel_hi12:tls
    add x8, x8, :tprel_lo12_nc:tls
    adrp x8, fp
    ldr x8, [x8, :lo12:fp]

    RISC-V

    The modifier syntax is copied from MIPS.

    The documentation is available at https://github.com/riscv-non-isa/riscv-asm-manual/blob/master/riscv-asm.md.

    Inline assembly

    Certain compilers allow the inclusion of assembly code within a high-level language.

    The most widely used implementations are GCC's Basic Asm and Extended Asm. On Windows, MSVC supports inline assembly for x86-32 but not for x86-64 or Arm.

    Clang supports both GCC and MSVC inline assembly. Clang's MSVC inline assembly can be used with x86-64.

    Some compilers provide additional variants of inline assembly. Here are some relevant links:

    • Free Pascal https://wiki.freepascal.org/Asm
    • D https://dlang.org/spec/iasm.html
    • Jai https://jai.community/t/inline-assembly/139
    • Nim https://nim-lang.github.io/Nim/manual.html#statements-and-expressions-assembler-statement

    Notes on GNU Assembler

    .file and .loc directives are used to create .debug_line.

    .cfi* directives are used to create .eh_frame or .debug_frame.

    GNU Assembler implements "INDEFINITE REPEAT BLOCK DIRECTIVES: .IRP AND .IRPC" from MACRO-11. Unfortunately, there is no directive equivalent to for (int i = 0; i < 20; i++). .irpc i,0123456789 gives just 10 iterations, and writing out all the integers with .irp is tedious and error-prone.

    .rept 3
    ret
    .endr

    .irpc i,012
    movq $\i, %rax
    .endr

    .irp i,%rax,%rbx,%rcx
    movq \i, %rax
    .endr
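
    Since writing the integer list for .irp by hand is the tedious part, one workaround is to generate the block with a small script. The Python sketch below is one possible approach, not something GNU Assembler provides. The helper name irp_counted_loop is hypothetical, and the two-space indentation of the body is purely cosmetic.

```python
def irp_counted_loop(var, count, body):
    """Emit an .irp block that iterates var over 0 .. count-1.

    body is a list of assembly lines that refer to the loop variable as \\var.
    """
    values = ",".join(str(i) for i in range(count))
    lines = [f".irp {var},{values}"]
    lines += [f"  {line}" for line in body]
    lines.append(".endr")
    return "\n".join(lines)


# Counterpart of: for (int i = 0; i < 20; i++)
print(irp_counted_loop("i", 20, ["movq $\\i, %rax"]))
```

    The generated text can be pasted into the test file or written out by a build step. The assembler still sees an ordinary .irp block, so no preprocessor is involved when it runs.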

    .if, .ifdef, and .ifndef directives allow us to write conditional code in assembly tests without using a C preprocessor. I often use .ifdef to combine positive tests and negative tests in one file.

    # RUN: llvm-mc %s | FileCheck %s
    # RUN: not llvm-mc --defsym ERR=1 %s -o /dev/null 2>&1 | FileCheck %s --check-prefix=ERR

    # CHECK: ...
    ## positive tests

    .ifdef ERR
    # ERR: ...
    ## negative tests
    .endif

    GNU Assembler has supported .incbin since 2001-07 (hey, C/C++ #embed). The review thread mentioned that .incbin had already been supported by some other assemblers.

    Notes on LLVM integrated assembler

    In general, inline assembly is parsed by LLVMMCParser for validation and formatting purposes. Parsing is disabled by default for certain targets, and it can be disabled explicitly with the -fno-integrated-as option.

    Let's focus on ELF platforms for the following description, assuming our goal is to create a relocatable object file. The input file can be either LLVM IR (intermediate code; the initial input file may be in C/C++) or assembly language.

    If the input is LLVM IR, LLVM creates an MCObjectStreamer object with new MCELFStreamer or a target-registered factory (e.g., AArch64ELFStreamer). The streamer constructor creates an MCAssembler object. For an assembly input file, LLVM additionally creates an MCAsmParser object and an MCTargetAsmParser object.

    MSVC inline assembly

    TODO

    __asm blocks are parsed for Windows target triples. This extension is available on other targets by specifying -fasm-blocks or the broad -fms-extensions. An __asm statement is represented as a clang::MSAsmStmt object. clang::Parser::ParseMicrosoftAsmStatement parses the inline assembly string and calls llvm::AsmParser::parseMSInlineAsm. It is worth noting that the string may be modified during this process. For a clang::MSAsmStmt object, LLVM IR is generated through clang::CodeGen::CodeGenFunction::EmitAsmStmt.


