May 7, 2019 Binary Analysis and Disassembly Petr Ročkai and Lukáš Korenčik Binary Analysis and Disassembly 2/211 May 7, 2019 Organisation • theory: 30-50 minutes every week • remaining time: coding and discussions • there will be 6 bi-weekly assignments ∘ together, they will form a project ∘ each assignment is a milestone Binary Analysis and Disassembly 3/211 May 7, 2019 Grading • you need 7 points to pass the subject • each assignment is worth 1 point • showing up 10 times is worth 1 point • up to 2 points for writing code reviews • up to 2 points for meeting deadlines Binary Analysis and Disassembly 4/211 May 7, 2019 Deadlines and Feedback • the deadline for each assignment is 14 days • beating the deadline gives you 1/3 of a point ∘ the solution must be of sufficient quality • feedback will be given on the off weeks ∘ i.e. 7 days after the deadline Binary Analysis and Disassembly 5/211 May 7, 2019 Programming Language • C or C++ is up to you • you can use up to C11 and up to C++17 • only the standard library and POSIX • no boost, no libelf or BFD Binary Analysis and Disassembly 6/211 May 7, 2019 Repositories • make a repository for your homeworks ∘ git, hg or whatever works for you ∘ make it public and email me the URL • write a simple Makefile (inc. dependencies) ∘ you will only have a few source files ∘ cmake is acceptable but discouraged Binary Analysis and Disassembly 7/211 May 7, 2019 Assignment Submission • tag your repository with hw1, etc. ∘ use hw1.1 etc. for resubmissions • tag dates are what counts for deadlines • we will not look at master head ∘ you can break stuff freely there Binary Analysis and Disassembly 8/211 May 7, 2019 Semester Plan (part 1) date 1. introduction & preliminaries 19.2. 2. instruction sets 26.2. 3. static control flow 5.3. 4. dynamic control flow 12.3. 5. executable files, ELF 19.3. 6. dynamic linking 26.3. Binary Analysis and Disassembly 9/211 May 7, 2019 Semester Plan (part 2) date 7. debug info 2.4. 8. DWARF 9.4. 9. function calls and frames 16.4. 10. advanced instructions 23.4. 11. debugger basics 30.4. 12. decompilation basics 7.5. Binary Analysis and Disassembly 10/211 May 7, 2019 Assignment Schedule given due 1. decoding instructions 26.2. 12.3. 2. basic blocks & branching 12.3. 26.3. 3. making sense of ELF 26.3. 9.4. 4. parsing symbol tables 9.4. 23.4. 5. reconstructing functions 23.4. 7.5. 6. a complete disassembler 7.5. 21.5. Binary Analysis and Disassembly 11/211 May 7, 2019 Part 1: Preliminaries Binary Analysis and Disassembly 12/211 May 7, 2019 Machine Code • consists of individual instructions • encoded in a tightly-packed binary format ∘ may be fixed or variable length • stored program architecture ∘ instructions live in addressable memory Binary Analysis and Disassembly 13/211 May 7, 2019 Assembly • symbolic language one level above machine code ∘ abstracts away from numeric addresses ∘ replaces them with symbolic labels • instructions are encoded in a text format • designed for humans (but rarely used nowadays) Binary Analysis and Disassembly 14/211 May 7, 2019 C and Compilers • another layer of abstraction over assembly • abstracts away the specifics of hardware architecture ∘ registers, stack management, opcodes ∘ provides structured control flow • still a low-level language, mostly OS-level programs Binary Analysis and Disassembly 15/211 May 7, 2019 Compiled High-Level Languages • another abstraction rung above C ∘ algebraic or class-based type systems ∘ abstract data structures ∘ extensive standard libraries ∘ late dispatch, lexical closures, ... • e.g. C++, Rust, ML, Haskell, (Java) Binary Analysis and Disassembly 16/211 May 7, 2019 Interpreters • typically the highest rung of the abstraction tower ∘ dynamic types, garbage collectors ∘ powerful, high-level libraries or APIs • often realized as JIT compilers / virtual machines ∘ usually implemented in C or C++ • e.g. JavaScript, Python, Ruby, bash, R Binary Analysis and Disassembly 17/211 May 7, 2019 Disassembly • going from machine code to assembly ∘ decode instruction ∘ recover control flow structure • print the program in human-readable format • re-assembling should give identical machine code Binary Analysis and Disassembly 18/211 May 7, 2019 Decompilation • attempt to reconstruct high-level code ∘ recovery of structured control flow (if, while) ∘ identification of local variables ∘ recovery of addresses • decompile → compile is not idempotent Binary Analysis and Disassembly 19/211 May 7, 2019 Exercise 1.1: objdump • read the objdump manpage • try objdump -x on some binaries ∘ /usr/bin/gzip ∘ your own test program (hello world style) ∘ try -static, -fPIC &c. ∘ try both the .o file and the executable • also try objdump -x --disassemble Binary Analysis and Disassembly 20/211 May 7, 2019 Exercise 1.2: gdb • compile your test program with -g • gdb [--args] ./a.out • start • stepi, disassemble, print $rax • break • for more user friendliness: layout Binary Analysis and Disassembly 21/211 May 7, 2019 Exercise 1.3: reading binary data $ printf "\x03\x12\x01\x00\x00\x00" > file.bin read the above data into the following structure struct __attribute__((packed)) d { short x; int y; } expected result: x = 4611, y = 1 Binary Analysis and Disassembly 22/211 May 7, 2019 Part 2: Instruction Sets Binary Analysis and Disassembly 23/211 May 7, 2019 Instruction Anatomy • opcode: what to do • operands ∘ immediate values (part of instruction) ∘ register references ∘ memory references (immediate or via register) • modifiers (e.g. lock) Binary Analysis and Disassembly 24/211 May 7, 2019 Opcode Classes • control flow • integer arithmetic • bit operations • memory access • floating point arithmetic • special instructions Binary Analysis and Disassembly 25/211 May 7, 2019 Register Classes • GPR: General Purpose Register ∘ hold a single word: integers, addresses • SIMD (vector) and/or floating point registers • pointers: stack, instruction, frame (base) • machine control registers Binary Analysis and Disassembly 26/211 May 7, 2019 Control Flow • conditional & unconditional jumps ∘ direct (fixed address) ∘ indirect unconditional (computed address) ∘ conditional on results of arithmetic • subroutine calls and returns ∘ use the stack for return addresses Binary Analysis and Disassembly 27/211 May 7, 2019 Arithmetic • addition, subtraction • signed+unsigned division, multiplication • integer comparison (signed/unsigned) • standard instructions up to word size (64b) ∘ 128b operations are available too Binary Analysis and Disassembly 28/211 May 7, 2019 Bit Operations • bitwise and, or, xor, negate • shifts and rotations • bit field packing/unpacking • bit counting, endianity conversion Binary Analysis and Disassembly 29/211 May 7, 2019 Memory Access • load from and store into memory • various address computation modes ∘ part of the access instruction ∘ special-purpose arithmetic (lea) ∘ general-purpose arithmetic Binary Analysis and Disassembly 30/211 May 7, 2019 Addressing Modes • scalars: base register + offset ∘ especially useful for stack variables ∘ also globals (relative to program counter) • arrays: base register + immediate * index register • ‘far’ addressing for segmented memory (obsolete) Binary Analysis and Disassembly 31/211 May 7, 2019 Floating Point Arithmetic • separate instruction set • separate registers (distinct from GPRs) • variable precision (usually 32b, 64b, 80b) • governed by IEEE 754 Binary Analysis and Disassembly 32/211 May 7, 2019 Specials: Synchronisation • atomic memory access ∘ read-modify-write (add, sub, xor, ...) ∘ compare + exchange • memory fences / barriers • on amd64 encoded using the lock prefix Binary Analysis and Disassembly 33/211 May 7, 2019 Specials: Vector Instructions • SIMD: single instruction (opcode), multiple data • integer and floating-point arithmetic • 4-8 values packed in 128b or 256b register • speeds up number crunching considerably • on top of usual superscalar execution Binary Analysis and Disassembly 34/211 May 7, 2019 Specials: User Mode • checksums (e.g. crc32) • symmetric crypto (aes-ni) • random numbers (rdrand, rdseed) • processor capabilities (cpuid) • timers (rdtsc) Binary Analysis and Disassembly 35/211 May 7, 2019 Specials: Privileged Mode • CPU management opcodes and registers • interrupt handling • system calls • cache control • debugging and monitoring • virtualisation Binary Analysis and Disassembly 36/211 May 7, 2019 Exercise 2.1 • learn a bit more about assembly • use gcc -S to produce examples ∘ you can also try -fverbose-asm • write a recursive factorial (in C) ∘ use gdb instruction stepping ∘ try an iterative version too Binary Analysis and Disassembly 37/211 May 7, 2019 Exercise 2.2 • write a simple assembly program • borrow the prologue/epilogue from gcc • sum an arithmetic/geometric sequence ∘ use formulas first (just arithmetic) ∘ try using a summing loop Binary Analysis and Disassembly 38/211 May 7, 2019 Instruction Encoding • how to encode opcodes and operands into bytes • fixed-length or variable-length • fixed: e.g. VLIW (very long instruction word) ∘ often employs instruction combining ∘ variant: fixed opcodes, trailing immediate operands Binary Analysis and Disassembly 39/211 May 7, 2019 Variable-Length Coding • saves space compared to fixed-width coding • often much harder to decode • usually decoded from left to right • first byte affects what second byte means, &c. • already-decoded prefix tells you whether to continue Binary Analysis and Disassembly 40/211 May 7, 2019 Encoding on AMD64 • programmer’s manual in study materials • variable length (even opcodes) • very long history of extensions • different meaning in different CPU modes • not a very clean encoding of a messy instruction set Binary Analysis and Disassembly 41/211 May 7, 2019 Assignment 1 • write an instruction decoder for amd64 • have make decode build the binary • invocation ./decode 74 1a ∘ prints: je 0x1a(%rip) • we will only decode a small subset of instructions • print unknown instruction if that is the case Binary Analysis and Disassembly 42/211 May 7, 2019 Assignment 1: Required • branching: jmp, je, jne, jb ∘ operands: rel8off, rel32off • stack: push, pop (64b only) • calls: near call (rel32off) and ret • mov in 64b mem/reg versions (details later) • a few arithmetic and bitwise ops, nop, int3 Binary Analysis and Disassembly 43/211 May 7, 2019 Assignment 1: Arithmetic & Bitwise • xor eax imm32 and xor rax imm32 • add eax imm32 and add rax imm32 • mul with 2 64b registers (rax – rdx) • cmp eax imm32 and cmp rax imm32 • cmp with 2 64b registers (rax – rdx) Binary Analysis and Disassembly 44/211 May 7, 2019 Assignment 1: mov • only the 89 and 8B opcodes ∘ with 2 64b registers (rax – rdx) ∘ from memory to a 64b register ∘ from a 64b register to memory • memory: address in rax or rbx ∘ rip and rbp + 32b displacement Binary Analysis and Disassembly 45/211 May 7, 2019 Assignment 1: Not Required • anything in the VEX maps • memory operands other than ∘ mov with address in rax or rbx ∘ mov with rip and rbp + disp32 • prefixes other than the REX range Binary Analysis and Disassembly 46/211 May 7, 2019 Assignment 1: Hints • most 64b instructions need a REX prefix (0x40-0x4F) • exceptions: call, ret, jmp, branching ∘ some of the push/pop (those of ‘named’ GPRs) • look for complete decoded examples in objdump Binary Analysis and Disassembly 47/211 May 7, 2019 Part 3: Static Control Flow Binary Analysis and Disassembly 48/211 May 7, 2019 Control Flow • answers the question ‘what to do next?’ • normally, instructions run in a sequence • just like statements in C • how about conditionals and loops? Binary Analysis and Disassembly 49/211 May 7, 2019 Structured Control Flow • used in high-level languages • if statements or expressions • structured loops: while, for • not available in machine code Binary Analysis and Disassembly 50/211 May 7, 2019 Goto • also known as unstructured control flow • goto jumps from one place to another ∘ the destination is called a label ∘ the jump is unconditional (always taken) • if + goto → any loop Binary Analysis and Disassembly 51/211 May 7, 2019 Goto: Example int f( int x ) { int i = x; loop: x = x * --i; if ( i > 1 ) goto loop; return ~x; } Binary Analysis and Disassembly 52/211 May 7, 2019 Machine Code • goto is basically a jump instruction • there are no labels in machine code • assembler computes label offsets • there is also a conditional jump instruction ∘ perform the goto only if a condition holds Binary Analysis and Disassembly 53/211 May 7, 2019 Simplified if • in C, if can guard arbitrary statements • what if it could only guard exactly 1 goto? ∘ and there is no else either • we can still do everything if ( x ) { foo(); bar(); } else baz(); Binary Analysis and Disassembly 54/211 May 7, 2019 Reinventing if if_begin: if ( !x ) goto if_false; foo(); bar(); goto if_end; if_false: baz(); if_end: Binary Analysis and Disassembly 55/211 May 7, 2019 Conditional Jump • recall if ( x > 0 ) goto loop • this is basically 2 instructions • first is cmp, the second is jg • conditional goto is conditional jump • used to encode all control flow in machine code Binary Analysis and Disassembly 56/211 May 7, 2019 Basic Blocks • abstraction used by compilers • starts with a label • followed by a sequence of non-jump instructions ∘ no labels or jumps in the sequence • with a single jump/branch at the end Binary Analysis and Disassembly 57/211 May 7, 2019 Control Flow Graph • take instructions as nodes • control flow as edges • extremely useful for code analysis • using basic blocks makes the graph much smaller Binary Analysis and Disassembly 58/211 May 7, 2019 Exercise 3.1 • rewrite this program with conditional gotos while ( x < 1000 ) { x *= 5; if ( x % 7 == 0 ) break; x --; } Binary Analysis and Disassembly 59/211 May 7, 2019 int fib( int n ) /* exercise 3.2 */ { if ( n <= 2 ) return 1; else { int a = fib( n - 1 ); int b = fib( n - 2 ); return a + b; } } Binary Analysis and Disassembly 60/211 May 7, 2019 Exercise 3.3 • take the goto version of program from 3.2 • change it to only have one return statement • draw the control flow graph of both versions Binary Analysis and Disassembly 61/211 May 7, 2019 Exercise 3.4 • write an iterative version of fib • you can use the argument + 3 variables • change it into goto form • draw the control flow graph Binary Analysis and Disassembly 62/211 May 7, 2019 Exercise 3.5 • compile all above programs into object files • disassemble them using objdump • recover control flow from the assembly ∘ only add labels that are required ∘ identify basic blocks Binary Analysis and Disassembly 63/211 May 7, 2019 Exercise 3.6 • rewrite program from 3.4 into assembly by hand • only use registers for computation • start from an empty int fib( int ) skeleton • check that the program does the right thing Binary Analysis and Disassembly 64/211 May 7, 2019 Part 4: Dynamic Control Flow Binary Analysis and Disassembly 65/211 May 7, 2019 Last Time • direct conditional + unconditional jumps • basic blocks, control flow graph Today • direct & indirect function calls, returns • indirect jumps and jump tables Binary Analysis and Disassembly 66/211 May 7, 2019 Function Calls • call is usually static (fixed address) • but ret jumps to a dynamic address ∘ also known as return address • arguments are passed in registers or on stack • local variables are stored on the stack Binary Analysis and Disassembly 67/211 May 7, 2019 Call Frame • each function uses up a section of the stack ∘ known as a frame, holds automatic local variables ∘ though some of those might only live in registers • there’s also stuff in-between frames ∘ arguments, register spills, return address Binary Analysis and Disassembly 68/211 May 7, 2019 Indirect Jump • jump to a dynamic address (i.e. not constant) • often arises from switch statements (in C) • either computed or via a jump table • looks like jmp *%rax (if the address is in rax) Binary Analysis and Disassembly 69/211 May 7, 2019 Ex 4.1 • write a simple C function with a switch • use consecutive integer cases (i.e. 1, 2, 3, ...) • put different code in each branch (e.g. return N) • compile with gcc and clang with different -O ∘ compare the assembly output Binary Analysis and Disassembly 70/211 May 7, 2019 Detour: Graphviz • a simple but powerful tool to draw graphs • accepts plain-text input that looks like this digraph G { 1 [ shape=rectangle label="box" ] 2 [ shape=rectangle label="another\lbox\l" ] 1 -> 2 [ label="arrow" ] } Binary Analysis and Disassembly 71/211 May 7, 2019 Ex 4.2 • draw the CFG from 3.3 or 3.4 using dot • see https://graphviz.org for docs • to look at the result, use dot -Tx11 < cfg.dot ∘ dot -Tpdf > cfg.pdf also works • put instructions into the boxes Binary Analysis and Disassembly 72/211 May 7, 2019 Assignment 2 • extend your decoder to allow multiple instructions • print each instruction on a separate line • assume the code starts at address 0 • decompose the code into basic blocks • use graphviz dot to generate a CFG Binary Analysis and Disassembly 73/211 May 7, 2019 Assignment 2: Input • continue to allow ascii/hex bytes in argv[] • if no args given, read a raw binary from stdin • you can assume there are only known instructions • and the input will be at most 2KB (for now) Binary Analysis and Disassembly 74/211 May 7, 2019 Assignment 2: Output • generate ‘maximal’ basic blocks ∘ print #