UNDER CONSTRUCTION
This article provides a description of popular assemblers and their architecture-specific differences.
GCC generates assembly code and invokes GNU Assembler (also known as "gas"), which is part of GNU Binutils, to convert the assembly code into machine code. The GCC driver is also capable of accepting assembly input files. Due to GCC's widespread usage, GNU Assembler is arguably the most popular assembler.
Within the LLVM project, the LLVM integrated assembler is a library that is linked by Clang, llvm-mc, and lld (for LTO purposes) to generate machine code. It supports a wide range of GNU Assembler syntax and can be used as a drop-in replacement for GNU Assembler.
On the Windows platform, the Microsoft Macro Assembler (MASM) is widely used.
For the x86 architecture, NASM is another popular assembler.
There are two main branches of syntax: Intel syntax and AT&T syntax. AT&T syntax is derived from PDP-11 and exhibits several key differences:

- Memory operands are written as `displacement(base,index,scale)` instead of `[base+index*scale+disp]` in Intel syntax.
- Immediate operands are prefixed with `$`, while registers are prefixed with `%`.
- Instruction mnemonics carry an operand-size suffix: `b` for 1 byte, `w` for 2 bytes (Word), `l` for 4 bytes (Long, i.e. Dword), and `q` for 8 bytes (Qword).

Although the sigils add some complexity to the language, they do provide a distinct advantage: symbol references can be parsed without ambiguity. Many x86 instructions take an operand that can be a register or a memory location. With sigils, parsing becomes unambiguous, as demonstrated by examples such as `subl var, %eax` and `subl $1, %eax`.

    % gcc -S a.c

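To make the operand differences described above concrete, here is a minimal Python sketch that rewrites a single AT&T operand in Intel spelling. The helper name `att_to_intel` and the regular expression are my own illustration, not part of any assembler. It deliberately ignores operand order, size suffixes, and segment overrides, and it assumes that a bare symbol should be shown as a bracketed memory reference.

```python
import re

# Hypothetical helper, for illustration only.
# Matches an AT&T memory operand: displacement(base,index,scale).
# Every component except the parentheses is optional.
ATT_MEM = re.compile(
    r"^(?P<disp>[^()]*)"
    r"\(\s*(?P<base>%\w+)?\s*"
    r"(?:,\s*(?P<index>%\w+)\s*(?:,\s*(?P<scale>[1248]))?)?"
    r"\s*\)$"
)

def att_to_intel(operand: str) -> str:
    """Rewrite one AT&T operand in Intel spelling (a rough sketch)."""
    operand = operand.strip()
    if operand.startswith("$") or operand.startswith("%"):
        # '$' marks an immediate and '%' marks a register; Intel syntax has no sigil.
        return operand[1:]
    m = ATT_MEM.match(operand)
    if m is None:
        # No sigil and no parentheses: a symbol, i.e. a memory reference.
        return f"[{operand}]"
    parts = []
    if m["base"]:
        parts.append(m["base"][1:])
    if m["index"]:
        parts.append(m["index"][1:] + "*" + (m["scale"] or "1"))
    expr = "+".join(parts)
    disp = m["disp"].strip()
    if disp:
        # A negative displacement already carries its own sign.
        sep = "" if disp.startswith("-") or not expr else "+"
        expr += sep + disp
    return f"[{expr}]"

for op in ["$1", "%eax", "var", "8(%rbx,%rcx,4)", "-8(%rbp)"]:
    print(f"{op:>16}  ->  {att_to_intel(op)}")
```

Note that the function never has to guess whether `var` is a register: the absence of `%` settles it, which is exactly the advantage of sigils.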
Intel syntax is generally concise, except for the verbose size directives (e.g., `DWORD PTR`). It is widely utilized in the Windows environment and within the reverse engineering community.
However, Intel syntax has a flaw related to ambiguity, as it prevents the use of variable names that collide with registers (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53929).

    % cat ambiguous.c
    int *rip, rax;
    int foo() { return rip[rax]; }
    % gcc -S -masm=intel ambiguous.c -o -
    ...
            mov     rax, QWORD PTR rip[rip]
            mov     edx, DWORD PTR rax[rip]
    % gcc -c -masm=intel ambiguous.c
    /tmp/ccEOMwm6.s: Assembler messages:
    /tmp/ccEOMwm6.s:28: Error: invalid use of register
    /tmp/ccEOMwm6.s:29: Error: invalid use of register

I believe it would be beneficial if the designers added sigils to Intel syntax to disambiguate symbol references from registers. The absence of AT&T-style line noise makes Intel syntax code much more readable. Unfortunately, Intel syntax is less popular in software code due to GCC defaulting to AT&T syntax ("Please, really, make `-masm=intel` the default for x86").
Using `as -msyntax=intel -mnaked-reg` allows parsing the input in Intel syntax without a register prefix. This is similar to including a `.intel_syntax noprefix` directive in the input.
With `llvm-mc -x86-asm-syntax=intel`, the input can be parsed in Intel syntax. Using `-output-asm-variant=1` will print instructions in Intel syntax.
Modifiers are utilized to describe different access types of a symbol. This serves as a bonus as it prevents symbol references from being mistaken as register names. However, the function call-like syntax can appear verbose.

    lui a0, %tprel_hi(tls)

Power ISA assembly may seem unusual, as general-purpose registers are written without an `r` prefix. Whether an integer denotes a register or an immediate value depends on its position as an operand in an instruction. I find that this difference slightly affects readability.
Similar to x86, postfix modifiers are used to describe different access kinds of a symbol.

    addis 5, 13, tls@tprel@ha

Prefix modifiers are used to describe various access types of a symbol. Personally, this is the modifier syntax that I prefer the most.

    add x8, x8, :tprel_hi12:tls

The modifier syntax is copied from MIPS.
The documentation is available on https://github.com/riscv-non-isa/riscv-asm-manual/blob/master/riscv-asm.md.
Certain compilers allow the inclusion of assembly code within a high-level language.
The most widely used implementation is GCC Basic Asm and Extended Asm. On Windows, MSVC supports inline assembly for x86-32 but not for x86-64 or Arm.
Clang supports both GCC and MSVC inline assembly. Clang's MSVC inline assembly can be utilized with x86-64.
Some compilers provide additional variants of inline assembly. Here are some relevant links:
`.file` and `.loc` directives are used to create `.debug_line`.
`.cfi*` directives are used to create `.eh_frame` or `.debug_frame`.
GNU Assembler implements "INDEFINITE REPEAT BLOCK DIRECTIVES: .IRP AND .IRPC" from MACRO-11. Unfortunately there is no directive for `for (int i = 0; i < 20; i++)`. `.irpc i,0123456789` just gives 10 iterations and writing all integers using `.irp` is tedious and error-prone.

    .rept 3
    ret
    .endr

    .irpc i,012
    movq $\i, %rax
    .endr

    .irp i,%rax,%rbx,%rcx
    movq \i, %rax
    .endr

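Since writing the integer list for `.irp` by hand is the tedious part, one workaround is to generate it. The following is a minimal Python sketch, assuming a hypothetical helper `irp_block` of my own; it only prints text that could be pasted into, or piped to, an assembly file.

```python
def irp_block(var: str, count: int, body: str) -> str:
    """Return an .irp block that iterates `var` over 0 .. count-1.

    Hypothetical helper for illustration; `body` is emitted verbatim
    and should refer to the loop variable as \\var.
    """
    values = ",".join(str(i) for i in range(count))
    return f".irp {var},{values}\n{body}\n.endr\n"

# Emulate `for (int i = 0; i < 20; i++)` with one generated directive.
print(irp_block("i", 20, "movq $\\i, %rax"), end="")
```

The generated block uses the same `.irp` form as the hand-written example above, only with the value list filled in mechanically.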
`.if`, `.ifdef`, and `.ifndef` directives allow us to write conditional code in assembly tests without using a C preprocessor. I often use `.ifdef` to combine positive tests and negative tests in one file.

    # RUN: llvm-mc %s | FileCheck %s

GNU Assembler has supported `.incbin` since 2001-07 (hey, C/C++ `#embed`). The review thread mentioned that `.incbin` had been supported by some other assemblers.
In general, inline assembly is parsed by LLVMMCParser for validation and formatting purposes. Parsing is disabled by default for certain targets, and it can be explicitly disabled with the `-fno-integrated-as` option.
Let's focus on ELF platforms for the following description, assuming our goal is to create a relocatable object file. The input file can be either LLVM IR (intermediate code; the initial input file may be in C/C++) or assembly language.
If the input is LLVM IR, LLVM creates a `MCObjectStreamer` object with `new MCELFStreamer` or a target-registered factory (e.g., `AArch64ELFStreamer`). The streamer constructor creates a `MCAssembler` object. For an assembly input file, LLVM additionally creates a `MCAsmParser` object and a `MCTargetAsmParser` object.
TODO
`__asm` blocks are parsed for Windows target triples. This extension is available on other targets by specifying `-fasm-blocks` or the broad `-fms-extensions`. An `__asm` statement is represented as a `clang::MSAsmStmt` object. `clang::Parser::ParseMicrosoftAsmStatement` parses the inline assembly string and calls `llvm::AsmParser::parseMSInlineAsm`. It is worth noting that the string may be modified during this process. For a `clang::MSAsmStmt` object, LLVM IR is generated through `clang::CodeGen::CodeGenFunction::EmitAsmStmt`.