A language tour
No compiler between you and the machine. Just registers, opcodes, and the raw will to compute.
01 - The Machine's Native Tongue
Every program you have ever written, whether in Python, Go, Rust, or JavaScript, is ultimately executed as a stream of machine instructions, and assembly is how those instructions are written down. Assembly is not an abstraction over the CPU. It is the CPU's own language, transliterated into something humans can read. When you write assembly, nothing is hidden from you. Nothing can be.
"There is nothing more honest than assembly. It does exactly what you tell it. The problem is that it also does exactly what you tell it."
section .data
    msg db  "Hello, World!", 10    ; 10 = newline (ASCII LF)
    len equ $ - msg                ; $ = current address, so len = 14

section .text
    global _start

_start:
    mov  rax, 1        ; syscall number: write(2)
    mov  rdi, 1        ; arg 1: fd = 1 (stdout)
    mov  rsi, msg      ; arg 2: buf = pointer to message
    mov  rdx, len      ; arg 3: count = 14 bytes
    syscall            ; trap into the kernel

    mov  rax, 60       ; syscall number: exit(2)
    xor  rdi, rdi      ; arg 1: status = 0 (xor r,r is the idiomatic zero)
    syscall
There is no runtime, no standard library, no main(). This program speaks directly to the Linux kernel through the syscall instruction, a dedicated trap into kernel mode that asks the OS to perform a privileged action on your behalf.
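For comparison, the same request can be made from C. This is a minimal sketch assuming a POSIX system: write() from <unistd.h> is the C library's thin wrapper around the write system call used above, and say_hello is a hypothetical helper name that does not appear in the original program.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: sends the greeting to stdout with one write(2) call.
   Returns the byte count reported by the kernel, or -1 on error. */
long say_hello(void) {
    static const char msg[] = "Hello, World!\n";  /* 13 characters + newline */
    size_t len = strlen(msg);                      /* 14, same value as `len equ $ - msg` */
    return (long)write(STDOUT_FILENO, msg, len);   /* fd 1 = stdout, as in `mov rdi, 1` */
}
```

The three arguments line up with rdi, rsi, and rdx in the assembly; the wrapper's job is to place them in those registers, load the syscall number, and execute the trap.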
02 - Registers
Main memory holds gigabytes. Cache holds megabytes. The sixteen general-purpose registers hold 64 bits each, 128 bytes in total, and they are read and written at the speed of the processor itself, with no trip to memory. The art of assembly is moving the right data into the right register at the right moment. A hot loop that keeps its working values in registers can run many times faster than one that spills them to memory.
; x86-64 general-purpose registers (System V ABI roles):
;   rax           : return value, accumulator
;   rdi rsi rdx   : function arguments 1, 2, 3
;   rcx r8 r9     : function arguments 4, 5, 6
;   rbp rsp       : frame pointer, stack pointer
;   rbx r12-r15   : callee-saved (you must preserve these)

; max(int a, int b): two approaches to the same problem

max_branch:                 ; edi = a, esi = b
    cmp   edi, esi          ; compare a and b (sets flags)
    jge   .a_wins           ; jump if a >= b (signed)
    mov   eax, esi          ; b is larger: return it
    ret
.a_wins:
    mov   eax, edi
    ret

max_branchless:             ; same contract, no branch
    mov   eax, edi          ; assume a is the answer (mov leaves flags alone)
    cmp   edi, esi          ; compare a and b: cmovl needs flags to read
    cmovl eax, esi          ; overwrite with b if a < b
    ret                     ; cmovl = conditional move if less
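cmovl (conditional move if less) is a data-flow instruction: it selects a value without a jump. Like a conditional jump, it reads the flags left by an earlier cmp. Branch predictors handle regular patterns well but struggle when the outcome depends on unpredictable data; cmov removes the branch, so there is nothing to mispredict.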
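The select that cmovl performs can be modeled in C. The sketch below is an illustration rather than the author's code: max_branch mirrors the jump-based version, and max_select (a hypothetical name) turns the comparison into an all-ones or all-zeros mask and uses it to pick a value with no if statement. Whether a compiler emits cmov for either function is the compiler's decision.

```c
#include <stdint.h>

/* Branching form: mirrors cmp / jge / mov / ret. */
int32_t max_branch(int32_t a, int32_t b) {
    if (a >= b) return a;
    return b;
}

/* Select form: pure data flow, no if statement.
   mask is 0xFFFFFFFF when a < b, otherwise 0. */
int32_t max_select(int32_t a, int32_t b) {
    uint32_t mask = (uint32_t)0 - (uint32_t)(a < b);
    uint32_t r = ((uint32_t)b & mask) | ((uint32_t)a & ~mask);
    return (int32_t)r;  /* assumes two's complement, as on x86-64 */
}
```

Both functions return the same value for every input; the difference is only in whether control flow or data flow makes the choice.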
03 - The Stack
When you call a function in any language, the CPU executes a precise ceremony: it pushes the return address onto the stack, adjusts the stack pointer, and jumps. The callee saves the registers it needs, does its work, restores them in reverse order, and returns. High-level languages make this invisible. In assembly, you perform it yourself, and understand it forever.
; long factorial(int n): edi = n, returns in rax
factorial:
    push   rbp             ; ─╮ standard prologue:
    mov    rbp, rsp        ; ─╯ establish a stack frame
    cmp    edi, 1
    jle    .base_case      ; n <= 1: return 1
    movsxd rdi, edi        ; widen n to 64 bits (the upper half of rdi is undefined for an int)
    push   rdi             ; save n: the call below will clobber rdi
    sub    rsp, 8          ; keep rsp 16-byte aligned at the call (System V ABI)
    dec    edi             ; edi = n - 1
    call   factorial       ; rax = factorial(n - 1)
    add    rsp, 8          ; drop the alignment padding
    pop    rdi             ; restore n from the stack
    imul   rax, rdi        ; rax = n * factorial(n - 1)
    pop    rbp
    ret
.base_case:
    mov    rax, 1          ; return 1
    pop    rbp
    ret
The push rdi before the recursive call and pop rdi after it is the calling convention made visible: rdi is caller-saved, meaning if you need it after a call, you, the caller, are responsible for preserving it.
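Written in C, the same function hides the ceremony. The following is a sketch of equivalent source, with comments mapping each line to the instructions in the listing; where n actually lives across the call (a callee-saved register or a stack slot) is left to the compiler.

```c
/* Same contract as the assembly: long factorial(int n). */
long factorial(int n) {
    if (n <= 1)                     /* cmp edi, 1 / jle .base_case */
        return 1;                   /* mov rax, 1 */
    /* n is needed after the call returns, so it must live somewhere the
       callee cannot clobber. The assembly does this by hand with
       push rdi / pop rdi. */
    long rest = factorial(n - 1);   /* dec edi / call factorial */
    return (long)n * rest;          /* imul rax, rdi */
}
```

For example, factorial(5) returns 120: each of the four recursive calls keeps its own saved copy of n on the stack until the multiplication on the way back out.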
04 - Flags & Branches
Every conditional statement in every language becomes some form of compare-and-jump. The CPU maintains a FLAGS register, a collection of single-bit indicators set as a side effect of arithmetic and comparison instructions. cmp subtracts two values and discards the result, but the flags remain. Then a conditional jump reads those flags and either leaps or falls through.
; FLAGS register bits (set by cmp, sub, add, and others):
;   ZF : Zero Flag   (result was zero)
;   SF : Sign Flag   (result was negative)
;   CF : Carry Flag  (unsigned overflow)
;   OF : Overflow    (signed overflow)
;
;   jl  = jump if less      (SF ≠ OF)
;   jge = jump if ≥         (SF = OF)
;   jz  = jump if zero      (ZF = 1)
;   jne = jump if not equal (ZF = 0)

; int clamp(int val, int lo, int hi)   edi=val, esi=lo, edx=hi
clamp:
    mov  eax, edi
    cmp  eax, esi          ; val - lo (sets flags, discards result)
    jge  .check_hi         ; val >= lo? proceed to upper bound check
    mov  eax, esi          ; val < lo: return lo
    ret
.check_hi:
    cmp  eax, edx          ; val - hi
    jle  .done             ; val <= hi: within range, return val
    mov  eax, edx          ; val > hi: return hi
.done:
    ret
cmp a, b is identical to sub a, b except the result is thrown away. The flags it sets are the only thing that matters. This is the foundation of every conditional expression you have ever written.
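To see why jl tests SF ≠ OF rather than SF alone, it helps to compute the flags by hand. The sketch below is a simplified C model of a 32-bit cmp, written for illustration only: the Flags struct and the names cmp32, jl_taken, and jge_taken are hypothetical, and only the four flags listed above are modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the four flags discussed above. */
typedef struct { bool zf, sf, cf, of; } Flags;

/* Model of `cmp a, b`: compute a - b, keep only the flags. */
Flags cmp32(int32_t a, int32_t b) {
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t r = ua - ub;                        /* wraps modulo 2^32, like the hardware */
    Flags f;
    f.zf = (r == 0);                             /* result was zero */
    f.sf = (r >> 31) != 0;                       /* top bit of the result */
    f.cf = ua < ub;                              /* unsigned borrow */
    f.of = (((ua ^ ub) & (ua ^ r)) >> 31) != 0;  /* operand signs differ and the
                                                    result's sign differs from a's */
    return f;
}

bool jl_taken(Flags f)  { return f.sf != f.of; } /* jump if less (signed) */
bool jge_taken(Flags f) { return f.sf == f.of; } /* jump if greater or equal (signed) */
```

Comparing INT32_MIN with 1 shows the point: the subtraction wraps to a positive result, so SF is 0, but OF is 1, and SF ≠ OF still reports "less".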
05 - The Art of the Idiom
Assembly is full of idioms: patterns that exploit the hardware's specific quirks to do more with less. They look cryptic until you understand the CPU they were written for. Then they look inevitable. A skilled assembly programmer's code is full of these, and reading them is like reading someone who knows exactly how the machine breathes.
; ── Zero a register ──
    xor  rax, rax            ; 3 bytes. Shorter than mov rax, 0 (7 bytes).
                             ; Recognized by the CPU as a zeroing idiom: no dependency on old rax.

; ── Test a register for zero ──
    test eax, eax            ; eax & eax: sets ZF, SF, PF. Discards result.
    jz   .is_zero            ; Saves one byte vs. cmp eax, 0 + je.

; ── Cheap multiply via LEA ──
    lea  rax, [rdi + rdi*2]  ; rax = rdi * 3 (no imul instruction)
    lea  rax, [rdi + rdi*8]  ; rax = rdi * 9 (shift + add in one instruction)

; ── Branchless absolute value ──
    mov  ecx, eax
    sar  ecx, 31             ; arithmetic shift: ecx = 0x00000000 or 0xFFFFFFFF
    xor  eax, ecx            ; flip all bits if negative
    sub  eax, ecx            ; sub -1 = add 1: two's complement round-trip
                             ; result: |eax|, no branch, no mispredict

; ── Load effective address for pointer arithmetic ──
    lea  rcx, [rdi + rcx*8]  ; rcx = rdi + (rcx * 8): array element address
                             ; the scale factor can be 1, 2, 4, or 8
lea (Load Effective Address) was designed to compute memory addresses, but compilers routinely borrow it for arithmetic: a single instruction can add two registers, scale one of them by 1, 2, 4, or 8, add a constant, and write the result to a third register, all without touching the FLAGS register. Knowing this is the difference between reading assembly and understanding it.
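The arithmetic behind these idioms can be checked in C. This is a sketch for illustration: times3, times9, and abs_branchless are hypothetical names, and the mask is built with a comparison rather than a shift because right-shifting a negative signed value is implementation-defined in C.

```c
#include <stdint.h>

/* lea rax, [rdi + rdi*2]: base plus index scaled by 2. */
int64_t times3(int64_t x) { return x + x * 2; }

/* lea rax, [rdi + rdi*8]: base plus index scaled by 8. */
int64_t times9(int64_t x) { return x + x * 8; }

/* mov ecx, eax / sar ecx, 31 / xor eax, ecx / sub eax, ecx */
int32_t abs_branchless(int32_t x) {
    /* Same bits that sar leaves in ecx: 0xFFFFFFFF if x < 0, else 0. */
    uint32_t mask = (uint32_t)0 - (uint32_t)(x < 0);
    /* XOR with all ones flips every bit; subtracting 0xFFFFFFFF adds 1. */
    uint32_t r = ((uint32_t)x ^ mask) - mask;
    return (int32_t)r;  /* assumes two's complement, as on x86-64 */
}
```

As with the assembly version, the most negative value has no positive counterpart in 32 bits, so abs_branchless(INT32_MIN) comes back unchanged.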
06 - The Whole Picture
Security researchers, malware analysts, and CTF players read assembly every day: it's the only language available when you don't have the source.
Microcontrollers with 2 KB of RAM and no OS are still often programmed in assembly. When every byte counts, there is no room for a compiler's opinion.
Interrupt handlers, context switches, and the first instructions after power-on are hand-written assembly. Linux still has thousands of lines of it.
Game engines, video codecs, and cryptography libraries drop into hand-optimised assembly for inner loops where every nanosecond is a design decision.
Every native-code compiler ends in a backend that emits assembly or machine code. Inspecting the output of gcc -O2 or rustc is one of the most instructive things a programmer can do.
Once you can read assembly, you understand what your code actually does, not what you imagined it does. That clarity changes how you write in every other language.