Writing Efficient Code in C(++)
Petr Ročkai

Organisation
• theory: 20-30 minutes every week
• coding: all the remaining time
• passing the subject: collect 7 points
• most points come from assignments
• showing up 10 times gets you 1 point

Assignments
• one assignment every 2 weeks, 5 in total
• missing the deadline or failing is the same

Deadlines
1. 14 days (Wed by midnight), fetches 2 points
2. end of semester (17.12.), fetches 1.5 points
3. end of the exam period (12.2.), fetches 1 point

Assignments (cont'd)
• you can use git, mercurial or darcs
• put everything that you want me to see on master
• write a simple Makefile (no cmake, autotools, ...)
• each homework gets a target (make hw1 through hw5)
• use the same repo for in-seminar work (make ex1 ...)

Competitions
• we will hold 3 competitions in the seminar
• you'll have 40 minutes to do your best on a small problem
• the winner gets 1 point, second place gets .5 point
• all other working programs get .2 points
• we'll dissect the winning program together

Preliminary Plan
• 19.9. (today) computational complexity
• 26.9. microbenchmarking & statistics
• 3.10. cancelled
• 10.10. the memory hierarchy (hw01 due)
• 17.10. using callgrind
• 24.10. tuning for the compiler/optimiser (hw02 due)
• 31.10. competition 1
• 7.11. understanding the CPU (hw03 due)
• 14.11. exploiting parallelism
• 21.11. using perf + competition 2 (hw04 due)
• 28.11. Q&A, homework recap
• 5.12. semester recap + competition 3 (hw05 due)

Efficient Code
• computational complexity
• the memory hierarchy
• tuning for the compiler & optimiser
• understanding the CPU
• exploiting parallelism

Understanding Performance
• writing and evaluating benchmarks
• profiling with callgrind
• profiling with perf
• the law of diminishing returns
• premature optimisation is the root of all evil
• (but when is the right time?)

Tools
• on a POSIX operating system (preferably not in a VM)
• perf (Linux-only, sorry)
• callgrind (part of the valgrind suite)
• kcachegrind (for visualisation of callgrind logs)
• maybe gnuplot for plotting performance data

Compilers
• please stick to C++14 and C11 (or C99)
• the reference compiler will be clang 5.0.1
• you can use other compilers locally
• but your code has to build with clang 5

Part 1: Computational Complexity

Complexity and Efficiency
• this class is not about asymptotic behaviour
• you need to understand complexity to write good code
• performance and security implications
• what is your expected input size?
• complexity vs constants vs memory use

Quiz
• what's the worst-case complexity of:
  — a bubble sort? (standard) quick sort?
  — inserting an element into an RB tree?
  — inserting an element into a hash table?
  — inserting an element into a sorted vector?
  — inserting an element into a dynamic array?
• what are the amortised complexities?
• how about expected (average)?

Hash Tables
• often the most efficient data structure available
• poor theoretical worst-case complexity
  — what if the hash function is really bad?
• needs a fast hash function for efficiency
  — rules out secure (cryptographic) hashes
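To make the worst case concrete, here is a small sketch (not from the slides): a deliberately bad hash function sends every key into the same bucket, and std::unordered_map degrades from expected constant time to linear time per insert. The bad_hash name and the element count are illustrative only.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <unordered_map>

    struct bad_hash
    {
        std::size_t operator()( int ) const { return 0; } // every key collides
    };

    template< typename hash >
    static double fill( int n )
    {
        std::unordered_map< int, int, hash > m;
        auto start = std::chrono::steady_clock::now();
        for ( int i = 0; i < n; ++i )
            m[ i ] = i;
        std::chrono::duration< double > d = std::chrono::steady_clock::now() - start;
        return d.count();
    }

    int main()
    {
        int n = 20000;
        std::printf( "default hash: %.3f s\n", fill< std::hash< int > >( n ) );
        std::printf( "constant hash: %.3f s\n", fill< bad_hash >( n ) );
    }

With the constant hash, every insert has to scan one long collision chain, so the second run is quadratic overall; this is the scenario behind the CVEs on the next slide.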
Worst-Case Complexity Matters
• CVE-2011-4815, 4838, 4885, 2012-0880, ...
• apps can become unusable with too many items
• use a better algorithm if you can (or must)
• but: simplicity of code is worth a lot, too
• also take memory complexity and constants into account

Constants Matter
• n ops if each takes 1 second
• n log n ops if each takes .1 second
• n² ops if each takes .01 second

Picking the Right Approach
• where are the crossover points?
• what is my typical input size?
• is it worth picking an approach dynamically?
• what happens in pathological cases?

Exercise 1
• set up your repository and a Makefile
• implement a bounded priority buffer
  — holds at most n items
  — holds at most one copy of a given item
  — forgets the smallest item if full
  — fetch/remove the largest item
  — API: insert, top and remove
• two versions: sorted array and a sorted list

Exercise 1 (cont'd)
• write a few unit tests
• write a benchmark that inserts (~10⁷) random values
• the benchmark can use clock(3) or time(1)
• compare the approaches for n = 5, 10, 10000
• what are the theoretical complexities?
• what are your expectations on performance?
• can you think of a better overall solution?

Part 2: Microbenchmarking & Statistics

Motivation
• there's a gap between high-level code and execution
• the gap has widened over time
  — higher-level languages & more abstraction
  — more powerful optimisation procedures
  — more complex machinery inside the CPU
  — complicated cache effects
• it is very hard to predict actual performance

Challenges
• performance is very deterministic in theory
• this is not the case in practice
  — time-sharing operating systems
  — cache content and/or swapping
  — power management, CPU frequency scaling
  — program nondeterminism; virtual machines
• both micro (unit) and system benchmarks are affected

Unit vs System Benchmarking
• a benchmark only gives you one number
• it is hard to find causes of poor performance
• unit benchmarks are like unit tests
  — easier to tie causes to effects
  — faster to run (minutes or hours vs hours or days)
  — easier to make parametric

Isolation vs Statistics
• there are many sources of measurement errors
• some are systematic, others are random (noise)
• noise is best fought with statistics
• but statistics can't fix systematic errors
• benchmark data is not normally distributed

Repeated Measurements
• you will need to do repeat measurements
• more repeats give you better precision
  — the noise will average out
  — execution time vs precision tradeoff
• the repeat runs form your input sample
  — this is what you feed into bootstrap

Bootstrap
• usual statistical tools are distribution-dependent
• benchmark data is distributed rather oddly
• idea: take many random re-samplings of the data
• take the 5th and 95th percentiles as a confidence interval
• this is a very robust (if stochastic) approach

Implementing Bootstrap
• inputs: a sample, an estimator and an iteration count
• outputs: a new sample
• in each iteration, create a random resample
  — add a random item from the original sample
  — we do not care about repeats
  — size of the resample should be the same as the original
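A minimal sketch of the resampling loop described above, assuming the estimator is any callable mapping a vector of measurements to a single number; the names resample and bootstrap are mine, not prescribed by the slides.

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <random>
    #include <vector>

    using sample = std::vector< double >;
    using estimator = std::function< double( const sample & ) >;

    // one resample: pick n items from the original, with repetition
    sample resample( const sample &orig, std::mt19937 &rng )
    {
        std::uniform_int_distribution< std::size_t > pick( 0, orig.size() - 1 );
        sample out;
        for ( std::size_t i = 0; i < orig.size(); ++i )
            out.push_back( orig[ pick( rng ) ] );
        return out;
    }

    // the bootstrap proper: apply the estimator to many resamples;
    // the result is itself a sample (of estimator values)
    sample bootstrap( const sample &orig, estimator est, int iterations )
    {
        std::mt19937 rng( std::random_device{}() );
        sample out;
        for ( int i = 0; i < iterations; ++i )
            out.push_back( est( resample( orig, rng ) ) );
        std::sort( out.begin(), out.end() );
        return out; // sorted, so percentiles are just indices
    }

The 5th and 95th percentiles of the returned (sorted) sample give the confidence interval mentioned above; the mean estimator is a one-line lambda over the input vector.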
Estimators
• most useful estimators are the mean (average)
• and various percentiles (e.g. the median)
• you can also estimate standard deviation
  — but keep in mind the original data is not normal

Output Distribution
• the output of bootstrap is another distribution
• you can expect this one to be normal
• it is the distribution of the estimator result
• you can compute the mean and σ of the bootstrap

Confidence Intervals
• assume your estimator is the mean (average)
• you get a normal distribution of averages
  — each of them is more or less likely correct
  — you can pick the average one as your estimate
  — and take a 2σ interval for the CI

Confidence Interval on Performance
• the above gives you a CI on average speed
• you may want a confidence interval on actual speed
• you can use a 5th or 95th percentile as the estimator

Precise Clocks
• available in POSIX via clock_gettime
  — the resulting time is in nanoseconds
• your best bet is CLOCK_MONOTONIC (maybe _RAW)
• you can ask clock_getres for the clock resolution

Homework 1: Benchmarking
• implement a simple benchmarking tool
• allow for repeat measurements
  — make the time limit and precision configurable
• you will use this tool for the rest of the semester
• the API is up to you

Reducing Systematic Errors
• you can use fork() to get fresh processes
  — the testcase might leak memory
  — other effects may cause systematic slowdowns
• consider the effect of cache content
  — hot vs cold cache benchmarking

Output
• for each unit benchmark, print a single line of output
• it should contain an average & a CI on the average
  — also a 90% CI on actual runtime
• also allow each measurement to be printed out separately
• exact format will be decided in the seminar
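A minimal sketch of the measurement core such a tool could be built around, using clock_gettime with CLOCK_MONOTONIC as suggested on the Precise Clocks slide; the function names and the fixed repeat count are my assumptions, the actual API of homework 1 is up to you.

    #include <time.h>
    #include <vector>

    // time a single call of the benchmarked function, in seconds
    template< typename fn >
    double time_once( fn &&f )
    {
        struct timespec a, b;
        clock_gettime( CLOCK_MONOTONIC, &a );
        f();
        clock_gettime( CLOCK_MONOTONIC, &b );
        return ( b.tv_sec - a.tv_sec ) + ( b.tv_nsec - a.tv_nsec ) * 1e-9;
    }

    // collect repeat measurements; the resulting vector is the sample
    // that goes into the bootstrap from part 2
    template< typename fn >
    std::vector< double > measure( fn &&f, int repeats )
    {
        std::vector< double > sample;
        for ( int i = 0; i < repeats; ++i )
            sample.push_back( time_once( f ) );
        return sample;
    }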
Part 3: The Memory Hierarchy

The Memory Hierarchy
• many levels of ever bigger, ever slower memories
• CPU registers: very few, very fast (no latency)
• L1 cache: small (100s of KiB), plenty fast (~4 cycles)
• L2 cache: still small, medium fast (~12 cycles)
• L3 cache: ~2-32 MiB, slow-ish (~36 cycles)
• L4 cache: (only some CPUs) ~100 MiB (~90 cycles)
• DRAM: many gigabytes, pretty slow (~200 cycles)
• NVMe: ~10k cycles
• SSD: ~20k cycles
• spinning rust: ~30M cycles
• RTT to US: ~450M cycles

Paging vs Caches
• page tables live in slow RAM
• address translations are very frequent
• and extremely timing-sensitive
• TLB → small, very fast address translation cache
• process switch → TLB flush
• but: tagged TLB, software-managed TLB
• typical size: 12-4k entries
• miss penalties up to 100 cycles

Additional Effects
• some caches are shared, some are core-private
• out-of-order execution to avoid waits
• automatic or manual (compiler-assisted) prefetch
• speculative memory access
• ties in with branch prediction

Some Tips
• use compact data structures (vector > list)
• think about locality of reference
• think about the size of your working set
• code size, not just speed, also matters

See Also
• cpumemory.pdf in study materials
  — somewhat advanced and somewhat long
  — also very useful (the title is not wrong)
  — don't forget to add 10 years
  — oprofile is now perf
• http://www.7-cpu.com (CPU latency data)

Exercise 2
• write benchmarks that measure cache effects

Some Ideas
• walk a random section of a long std::list
• measure time per item in relation to list size
• same but with a std::vector
• same but access randomly chosen elements (vector only)

Some Issues
• uniform_int_distribution has odd timing behaviour
• but we don't really care about uniformity
• you may need to fight the optimiser a bit
• especially make sure to avoid undefined behaviour
• indexing vs iteration have wildly different behaviour
• shuffling your code slightly can affect the results a lot

Homework 2: Matrix Multiplication
• implement a real-valued matrix data structure
• implement 2 matrix multiplication algorithms
  — natural order
  — cache-efficient order
• compare the implementations using benchmarks
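To illustrate what natural versus cache-efficient order can mean for homework 2 (a sketch under the assumption of row-major n by n matrices in flat arrays; your matrix class and chosen orders may differ), swapping the two inner loops turns the strided column walk over the right-hand matrix into sequential row accesses:

    // C = A * B for n×n matrices stored row-major in flat arrays

    // natural (ijk) order: b is walked down a column, one row stride per step,
    // which misses the cache badly for large n
    void mul_ijk( const double *a, const double *b, double *c, int n )
    {
        for ( int i = 0; i < n; ++i )
            for ( int j = 0; j < n; ++j )
            {
                double sum = 0;
                for ( int k = 0; k < n; ++k )
                    sum += a[ i * n + k ] * b[ k * n + j ];
                c[ i * n + j ] = sum;
            }
    }

    // ikj order: both b and c are now read/written sequentially,
    // so the working set of the inner loop is a handful of cache lines
    void mul_ikj( const double *a, const double *b, double *c, int n )
    {
        for ( int i = 0; i < n; ++i )
            for ( int j = 0; j < n; ++j )
                c[ i * n + j ] = 0;
        for ( int i = 0; i < n; ++i )
            for ( int k = 0; k < n; ++k )
            {
                double aik = a[ i * n + k ];
                for ( int j = 0; j < n; ++j )
                    c[ i * n + j ] += aik * b[ k * n + j ];
            }
    }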
Part 4: Profiling I, callgrind

Why profiling?
• it's not always obvious what the bottleneck is
• benchmarks don't work so well with complex systems
• performance is not quite composable
• the equivalent of printf debugging isn't too nice

Workflow
1. use a profiler to identify expensive code
  — the more time the program spent doing X,
  — the more sense it makes to optimise X
2. improve the affected section of code
  — re-run the profiler, compare the two profiles
  — if satisfied with the improvement, goto 1
  — else goto 2

What to Optimise
• imagine the program spends 50 % of its time doing X
  — optimise X to run in half the time
  — the overall runtime is reduced by 25 %
  — good return on investment
• law of diminishing returns
  — now only 33 % of the time is spent on X
  — cutting X in half again only gives 17 % of the total
  — and so on, until it makes no sense to optimise X

Flat vs Structured Profiles
• flat profiles are easier to obtain
• but also harder to use
  — just a list of functions and cost
  — the context & structure is missing
• call stack data is a lot harder to obtain
  — endows the profile with very rich structure
  — reflects the actual control flow

cachegrind
• part of the valgrind tool suite
• dynamic translation and instrumentation
• based on simulating CPU timings
  — instruction fetch and decode
  — somewhat abstract cost model
• can optionally simulate caches
• originally only flat profiles

callgrind
• records entire call stacks
• can reconstruct call graphs
• very useful for analysis of complex programs

kcachegrind
• graphical browser for callgrind data
• demo

Exercise 3
• there's a simple BFS implementation in study materials
• you can also use/compare your own BFS implementation
• don't forget to use -O2 -g or such when compiling
• generate a profile with cachegrind
• load it up into kcachegrind
• generate another, using callgrind this time & compare

Exercise 3 (cont'd)
• add cache simulation options &c.
• explore the knobs in kcachegrind
• experiment with the size of the generated graph
• optimise the BFS implementation based on profile data

Part 5: Tuning for the Compiler

Goals
• write high-level code
• with good performance

What We Need to Know
• which costs are easily eliminated by the compiler?
• how to make best use of the optimiser (with minimal cost)?

How Compilers Work
• read and process the source text
• generate low-level intermediate representation
• run IR-level optimisation passes
• generate native code for a given target

Intermediate Representation
• for C++ compilers typically a (partial) SSA
• reflects CPU design / instruction sets
• symbolic addresses (like assembly)
• explicit control and data flow

IR-Level Optimiser
• common sub-expression elimination
• loop-invariant code motion
• loop strength reduction
• loop unswitching
• sparse conditional constant propagation
• (regular) constant propagation
• dead code elimination

Common Sub-expression Elimination
• identify redundant (& side-effect free) computation
• compute the result only once & re-use the value
• not as powerful as equational reasoning

Loop-Invariant Code Motion
• identify code that is independent of the loop variable
• and also free of side effects
• hoist the code out of the loop
• basically a loop-enabled variant of CSE
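A small illustration of these two passes (my example, not from the slides): the optimiser can effectively rewrite the first function into something like the second, computing the invariant sub-expression and the loop bound once, outside the loop.

    #include <cstddef>
    #include <vector>

    // as written: x * scale + offset does not depend on i,
    // and v.size() is re-evaluated on every iteration
    double sum_scaled( const std::vector< double > &v, double x,
                       double scale, double offset )
    {
        double sum = 0;
        for ( std::size_t i = 0; i < v.size(); ++i )
            sum += v[ i ] + x * scale + offset;
        return sum;
    }

    // after loop-invariant code motion (conceptually): the redundant,
    // side-effect free computations are done once and re-used
    double sum_scaled_hoisted( const std::vector< double > &v, double x,
                               double scale, double offset )
    {
        double sum = 0;
        const double add = x * scale + offset; // hoisted invariant
        const std::size_t n = v.size();        // hoisted bound
        for ( std::size_t i = 0; i < n; ++i )
            sum += v[ i ] + add;
        return sum;
    }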
The Cost of Calls
• prevents CSE (due to possible side effects)
• prevents all kinds of constant propagation

Inlining
• removes the cost of calls
• improves all intra-procedural analyses
• inflates code size
• only possible if the IR-level definition is available
See also: link-time optimisation

The Cost of Abstraction: Encapsulation
• API or ABI level?
• API: cost quickly eliminated by the inliner
• ABI: not even LTO can fix this
• an ABI-compatible setter is a call instead of a single store

The Cost of Abstraction: Late Dispatch
• used for virtual methods in C++
• indirect calls (through a vtable)
• also applies to C-based approaches (gobject)
• prevents (naive) inlining
• compilers (try to) devirtualise calls
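A minimal sketch of the difference (the type and function names are mine): a call through a base-class reference goes through the vtable and, unless the compiler manages to devirtualise it, cannot be inlined, while a call on the concrete type usually inlines down to a single add.

    struct counter_iface
    {
        virtual void add( int ) = 0;        // late dispatch: indirect call
        virtual ~counter_iface() = default;
    };

    struct counter : counter_iface
    {
        long total = 0;
        void add( int x ) override { total += x; }
    };

    void via_vtable( counter_iface &c, int n )
    {
        for ( int i = 0; i < n; ++i )
            c.add( i );   // indirect call through the vtable each iteration
    }

    void direct( counter &c, int n )
    {
        for ( int i = 0; i < n; ++i )
            c.add( i );   // static type known: the call can be inlined away
    }

With separate compilation, as in Exercise 4 below, even the direct call becomes opaque to the optimiser, which is the same effect taken one step further.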
Exercise 4: Variant 1
• start with bfs.cpp from study materials
• make a version where edges() is in a separate C++ file
• you will need to use std::function
• try a compromise using a visitor pattern
• compare all three approaches using benchmarks

Exercise 4: Variant 2
• compare the cost of a direct and an indirect call
• write a foreach function that takes a function pointer
  — use separate compilation to prevent inlining
• compare to a loop with a direct call
  — the function to be called should be simple-ish

Homework 3: Sets of Integers
• implement a set of uint16_t using a bitvector
  — with insert, erase, union and intersection
• the same using a nibble-trie
  — a trie with out-degree 16 (4 bits)
  — should have a maximum depth of 4
  — implement insert and union
• compare the two implementations

Part 6: Understanding the CPU

The Simplest CPU
• in-order, one instruction per cycle
• sources of inefficiency
  — most circuitry is idle most of the time
  — not very good use of silicon
• but it is reasonably simple

Design Motivation
• silicon (die) area is expensive
• switching speed is limited
• heat dissipation is limited
• transistors cannot be arbitrarily shrunk
• "wires" are not free either

The Classic RISC Pipeline
• fetch - get instruction from memory
• decode - figure out what to do
• execute - do the thing
• memory - read/write to memory
• write back - store results in the register file

Instruction Fetch
• pull the instruction from cache, into the CPU
• the address of the instruction is stored in the PC
• traditionally does branch "prediction"
  — in simple RISC CPUs always predicts not taken
  — this is typically not a very good prediction
  — loops usually favour taken heavily

Instruction Decode
• not much actual decoding in RISC ISAs
• but it does register reads
• and also branch resolution
  — might need a big comparator circuit
  — depending on the ISA (what conditional branches exist)
  — updates the PC

Execute
• this is basically the ALU
  — ALU = arithmetic and logic unit
• computes bitwise and shift/rotate operations
• integer addition and subtraction
• integer multiplication and division (multi-cycle)

Memory
• dedicated memory instructions in RISC
  — load and store
  — pass through execute without effect
• can take a few cycles
• moves values between memory and registers

Write Back
• write data back into registers
• so that later instructions can use the results

Pipeline Problems
• data hazards (result required before written)
• control hazards (branch misprediction)
• different approaches possible
  — pipeline stalls (bubbles)
  — delayed branching
• structural hazards
  — multiple instructions try to use a single block
  — only relevant on more complex architectures
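Control hazards are easy to provoke from ordinary C++. The classic sorted-versus-unsorted experiment below (a sketch, not part of the course materials) makes the same branch either highly predictable or essentially random; timed with the homework 1 tool, the difference is usually large. Note that at -O2 the compiler may turn the branch into a conditional move, in which case the effect disappears.

    #include <algorithm>
    #include <cstdio>
    #include <random>
    #include <vector>

    // sum only the elements above 128; the 'if' is the interesting branch
    static long sum_big( const std::vector< int > &v )
    {
        long sum = 0;
        for ( int x : v )
            if ( x >= 128 )   // predictable if v is sorted, random otherwise
                sum += x;
        return sum;
    }

    int main()
    {
        std::vector< int > v( 1 << 22 );
        std::mt19937 rng( 0 );
        for ( int &x : v )
            x = rng() % 256;

        long a = sum_big( v );          // unsorted: mispredicted ~50 % of the time
        std::sort( v.begin(), v.end() );
        long b = sum_big( v );          // sorted: the branch is almost free
        std::printf( "%ld %ld\n", a, b );
    }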
Superscalar Architectures
• more parallelism than a scalar pipeline
• can retire more than one instruction per cycle
• extracted from a sequential instruction stream
• dynamically established data dependencies
• some units are replicated (e.g. 2 ALUs)

Out-of-order Execution
• tries to fill in pipeline stalls/bubbles
• same principle as super-scalar execution
  — extracts dependencies during execution
  — execute if all data is ready
  — even if not next in the program

Speculative Execution
• sometimes it's not yet clear what comes next
• let's decode, compute etc. something anyway
• fills in more bubbles in the pipeline
• but not always with actual useful work
• depends on the performance of branch prediction

Take-Away
• the CPU is very good at utilising circuitry
• it is somewhat hard to write "locally" inefficient code
• you should probably concentrate on non-local effects
  — non-local with respect to the instruction stream
  — like locality of reference
  — and organisation of data in memory in general
  — also higher-level algorithm structure

Exercise 6
• implement a brainfuck interpreter
• try to make it as fast as possible
• see wikipedia for some example programs

Homework 4
• implement sub-string search algorithms
• a naive one (with full restarts)
• one based on a failure table (KMP)
• one that uses a DFA
• write benchmarks, find cross-over points

Part 7: Exploiting Parallelism

Hardware vs Software
• hardware is naturally parallel
• software is naturally sequential
• something has to give
  — depends on the throughput you need
  — eventually your software needs to go parallel

Algorithms
• some algorithms are inherently sequential
  — typically for P-complete problems
  — for instance DFS post-order
• which algorithm do you really need though?
  — topological sort is much easier than post-order
• some tasks are trivially concurrent
  — think map-reduce

Task Granularity
• how big are the tasks you can run in parallel?
  — big tasks = little task-switching overhead
  — small tasks = easier to balance out
• how much data do they need to share?
  — shared memory vs message passing

Distributed Memory
• comparatively big sub-tasks
• not much data structure sharing (small results)
• scales extremely well (millions of cores)

Shared Memory
• small, tightly intertwined tasks
• sharing a lot of data
• scales quite poorly (hundreds of cores)

Caches vs Parallelism
• different CPUs are connected to different caches
• caches are normally transparent to the program
• what if multiple CPUs hold the same value in cache?
  — they could see different versions at the same time
  — need cache coherence protocols

Cache Coherence
• many different protocols exist
• a common one is MESI (4 cache line states)
  — modified, exclusive, shared, invalid
  — snoops on the bus to keep up to date
• cheap until two cores hit the same cache line
  — required for communication
  — also happens accidentally

Locality of Reference
• comes with a twist in shared memory
• compact data is still good, but
  — different cores may use different pieces of data
  — if they are too close, this becomes costly
  — also known as false sharing
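A small sketch of false sharing (the struct names and the 64-byte line size are assumptions): two counters in the same cache line force the coherence protocol to bounce the line between cores even though the threads never touch each other's data; padding each counter to its own line removes the accidental sharing.

    #include <atomic>
    #include <functional>
    #include <thread>

    struct counters_shared
    {
        std::atomic< long > a { 0 }, b { 0 };       // almost certainly one cache line
    };

    struct counters_padded
    {
        alignas( 64 ) std::atomic< long > a { 0 };  // 64 B is a common line size
        alignas( 64 ) std::atomic< long > b { 0 };  // b now lives in its own line
    };

    // two threads, each incrementing only its own counter
    template< typename C >
    void hammer( C &c, long iterations )
    {
        auto work = []( std::atomic< long > &x, long n )
        {
            for ( long i = 0; i < n; ++i )
                x.fetch_add( 1, std::memory_order_relaxed );
        };
        std::thread ta( work, std::ref( c.a ), iterations );
        std::thread tb( work, std::ref( c.b ), iterations );
        ta.join();
        tb.join();
    }

Each thread writes only its own counter, so there is no data race; relaxed atomics are used only to keep the optimiser from collapsing the loops. Timed with the homework 1 tool on a multi-core machine, the padded variant should run noticeably faster.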
Distribution of Work
• want to communicate as little as possible
• also want to distribute work evenly
• randomised, spread-out data often works well
  — think hash tables
• structures with a single active point are bad
  — think stacks, queues, counters &c.

Shared-Memory Parallelism in C++
• std::thread - create threads
• std::future - delayed (concurrent) values
• std::atomic - atomic (thread-safe) values
• std::mutex and std::lock_guard

Exercise 7
• implement shared-memory map-reduce
• make the number of threads a runtime parameter
• check how this scales (wall time vs number of cores)
• use this for summing up a (big) array of numbers
• can you improve on this by hand-rolling the summing loop?

Homework 5
• implement parallel matrix multiplication
• compare to your sequential versions
  — try with 2 and 4 threads in your benchmarks
• you can use std::thread or OpenMP
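To close, a minimal sketch of the map-reduce structure from Exercise 7 (the chunking scheme and names are mine): each thread sums its own contiguous chunk into a private slot and the partial results are reduced at the end. Wall time against the number of threads can then be measured with the homework 1 tool.

    #include <numeric>
    #include <thread>
    #include <vector>

    double parallel_sum( const std::vector< double > &data, unsigned nthreads )
    {
        std::vector< double > partial( nthreads, 0.0 );
        std::vector< std::thread > workers;
        size_t chunk = data.size() / nthreads;

        for ( unsigned t = 0; t < nthreads; ++t )
        {
            size_t begin = t * chunk;
            size_t end = ( t + 1 == nthreads ) ? data.size() : begin + chunk;
            workers.emplace_back( [&, t, begin, end]
            {
                // map: each thread reduces its own chunk, no sharing
                partial[ t ] = std::accumulate( data.begin() + begin,
                                                data.begin() + end, 0.0 );
            } );
        }
        for ( auto &w : workers )
            w.join();

        // reduce: combine the per-thread partial sums
        return std::accumulate( partial.begin(), partial.end(), 0.0 );
    }

Adjacent elements of partial share a cache line, but each thread stores its result only once, at the end, so this does not degenerate into the false-sharing problem described earlier.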