Writing Efficient Code in C++
Petr Ročkai
Organisation
• theory: —20 minutes every week
• coding: all the remaining time
• passing the subject: collect 10 points
• most points come from assignments
• attendance is optional (but worth .1 point/seminar)
Writing Efficient Code in C++
1/86
December 2, 2017
Assignments
• one assignment every 2 weeks, 5 in total
• you can get 3 points per assignment
• on my desk (in your repo) by 8am on Monday in 2 weeks
• you can be arbitrarily late, but will get 1 point less
Writing Efficient Code in C++
2/86
December 2, 2017
Assignments (cont'd)
• you can use git, mercurial or dares
• put everything that you want me to see on master
• write a simple Makefile (no cmake, autotools,...)
• each homework gets a target (make hwl through hw5)
• use the same repo for in-seminar work (make exl ...)
Writing Efficient Code in C++
3/86
December 2, 2017
Competition
• we will hold a few (3 or 4) competitions in the seminar
• you'll have 40 minutes to do your best on a small problem
• the winner gets 1 point, second place gets .5 point
• all other working programs get .2 points
• we'll dissect the winning program together
Writing Efficient Code in C++
4/86
December 2, 2017
Preliminary Plan		
2.10.	cancelled (conference, sorry!)	
9.10.	microbenchmarking & statistics	hwOl due
16.10.	the memory hierarchy	
23.10.	using callgrind	hw02 due
30.10.	tuning for the compiler/optimiser	
6.11.	competition 1	hw03 due
13.11.	understanding the CPU	
20.11.	exploiting parallelism	hw04 due
27.11.	using perf + competition 2	
4.12.	Q&A, homework recap	hw05 due
11.12.	semester recap + competition 3	
Writing Efficient Code in C++ 5/86		December 2, 2017
Part 1: Introduction & Tools
Writing Efficient Code in C++
6/86
December 2, 2017
Efficient Code
• computational complexity
• the memory hierarchy
• tuning for the compiler & optimiser
• understanding the CPU
• exploiting parallelism
Writing Efficient Code in C++
7/86
December 2, 2017
Understanding Performance
• writing and evaluating benchmarks
• profiling with callg rind
• profiling with pe rf
• the law of diminishing returns
• premature optimisation is the root of all evil
• (but when is the right time?)
Writing Efficient Code in C++
8/86
December 2, 2017
Tools
• C++14 & a matching compiler (clang 3.7+, g++ 5+)
• on a POSIX operating system (preferably not in a VM)
• perf (Linux-only sorry)
• callg rind (part of the valg rind suite)
• kcacheg rind (for visualisation of callg rind logs)
• gnu plot for plotting performance data
• brick-benchmark for micro-benchmarks in C++
Writing Efficient Code in C++
9/86
December 2, 2017
Part 2: Computational Complexity
Writing Efficient Code in C++
10/86
December 2, 2017
Complexity and Efficiency
• this class is not about asymptotic behaviour
• but: you need to understand complexity to write good code
• performance and security implications
• what is your expected input size?
• complexity vs constants vs memory use
Writing Efficient Code in C++
11/86
December 2, 2017
Quiz
• what's the asymptotic complexity of:
— a bubble sort? (standard) quick sort?
— inserting an element into a RB tree?
— inserting an element into a hash table?
— inserting an element into a sorted vector?
• what are the amortised complexities?
• how about expected (average)?
• what if the hash function is really bad?
Writing Efficient Code in C++
12/86
December 2, 2017
Worst-Case Complexity Matters
• CVE-2011-4815, 4838, 4885, 2012-0880, ...
• apps can become unusable with too many pictures/songs/...
• use a better algorithm if you can (or must)
• but: simplicity of code is worth a lot, too
• also take memory complexity and constants into account
Writing Efficient Code in C++
13/86
December 2, 2017
Constants Matter
• n ops if each takes 1 second
• n log n ops if each takes .1 second
• n2 ops if each takes .01 second
Picking the Right Approach
• where are the crossover points?
• what is my typical input size?
• is it worth picking an approach dynamically?
• what happens in pathological cases?
Writing Efficient Code in C++
14/86
December 2, 2017
Exercise 1
• set up your repository and a Makefile
• implement a bounded priority buffer
— holds at most n items
— holds at most one copy of a given item
— forgets the smallest item if full
— fetch/remove the largest item
— API: insert, top and remove
• two versions: sorted std: : vecto r vs std: : set
Writing Efficient Code in C++
15/86
December 2, 2017
Exercise 1 (cont'd)
• write a few unit tests
• write a benchmark that inserts (~ 107) random values
• the benchmark can use clock(3) or time(1)
• compare the approaches for n = 5, 10,10000
• what are the theoretical complexities?
• what are your expectations on performance?
• can you think of a better overall solution?
Writing Efficient Code in C++
16/86
December 2, 2017
Homework 1
• implement hash tables with insert and find
— with linked-list buckets [lpt]
— with linear probing and rehashing [lpt]
• compare with std: : set and std: : unordered_set [lpt]
• bonus: beat std: : unordered_set by >10 % [.5pt]
• stick to crude measurement methods: time(l)/clock(3)
• use a number of elements suitable for this measurement style
Writing Efficient Code in C++
17/86
December 2, 2017
Intermezzo 1: Assignment 1
Writing Efficient Code in C++
18/86
December 2, 2017
• please write a Makefile if you didn't (not CMakeLists . txt)
• please ensure optimisation are enabled (at least -02)
• make hwl should create a binary called hwl
— this also means you can't use hwl as a directory name
— try hwl. s rc or such instead
• $ (CC) is the C compiler, not C++
— use $ (CXX) or just C++
— same for CFLAGS vs CXXFLAGS
• if you want C++17, please use - std=c++lz
— - std=c++17 does not work on clang 4.0 or older
• none of the above will incur the -lpt penalty for being late
— but please do fix those issues ASAP
— next time these issues will no longer get an exception
Writing Efficient Code in C++
19/86
December 2, 2017
Part 3: Micro-Benchmarking and Statistics
Writing Efficient Code in C++
20/86
December 2, 2017
Motivation
• there's a gap between high-level code and actual execution
• the gap has widened over time
— higher-level languages & more abstraction
— more powerful optimisation procedures
— more complex machinery inside the CPU
— complicated cache effects
• it is very hard to predict actual performance
Writing Efficient Code in C++
21/86
December 2, 2017
Challenges
• performance is very deterministic in theory
• this is not the case in practice
— time-sharing operating systems
— cache content and/or swapping
— power management, CPU frequency scaling
— virtual machines
• both micro (unit) and system benchmarks are affected
Writing Efficient Code in C++
22/86
December 2, 2017
Unit vs System Benchmarking
• a benchmark only gives you one number
• it is hard to find causes of poor performance
• unit benchmarks are like unit tests
— easier to tie causes to effects
— faster to run (minutes or hours vs hours or days)
— easier to make parametric
Writing Efficient Code in C++
23/86
December 2, 2017
Isolation vs Statistics
• there are many sources of measurement errors
• some are systematic, others are random (noise)
• noise is best fought with statistics
• but statistics can't fix systematic errors
• benchmark data is not normally distributed
Writing Efficient Code in C++
24/86
December 2, 2017
Bootstrap
• usual statistical tools are distribution-dependent
• benchmark data is distributed rather oddly
• idea: take many random re-samplings of the data
• take the 5th and 95th percentiles as the confidence interval
• this is a very robust (if stochastic) approach
Writing Efficient Code in C++
25/86
December 2, 2017
Using brick-benchmark
• implements f o rk-based benchmark isolation
• uses bootstrapping to correctly quantify noise
• uses clock_gettime to get precise timings
• adaptive - if the CI is reasonably tight, stop iterating
• simple registration API
• example use - look for Self Test in brick-benchmark
Writing Efficient Code in C++
26/86
December 2, 2017
Exercise 2
• compare 3 stack implementations
— std: : vector, std: : deque and std: :list
— parametrise the benchmark by number of items inserted
— and by the maximum size of the stack
— randomise whether to remove or insert
— insert needs to have a higher probability
• same for queues, but only std: : deque and std: : list
• (optional) implement radix sort for integral types
— compare with std: : so rt on different sequence sizes
Writing Efficient Code in C++
27/86
December 2, 2017
Homework 2
• implement erase() for both your hash tables [lpt]
• write micro-benchmarks for your hash tables [lpt]
— also include std: : set and std: : unordered_set
— plot time-per-insert vs number of inserts for each
— come up with a benchmark for erase
• implement a generator of random graphs (of a given size)
• implement BFS using the best hash table and best queue [lpt]
Writing Efficient Code in C++
28/86
December 2, 2017
Part 4: The Memory Hierarchy
Writing Efficient Code in C++
29/86
December 2, 2017
• many levels of ever bigger, ever slower memories
• CPU registers: very few, very fast (no latency)
• LI cache: small (100s of KiB), plenty fast (—4 cycles)
• L2 cache: still small, medium fast (—12 cycles)
• L3 cache: —2-32 MiB, slow-ish (—36 cycles)
• L4 cache: (only some CPUs) -100 MiB (-90 cycles)
• DRAM: many gigabytes, pretty slow (—200 cycles)
• NVMe: -10k cycles
• SSD: —20k cycles
• spinning rust: —30M cycles
• RTT to US: -450M cycles
Writing Efficient Code in C++
30/86
December 2, 2017
Paging vs Caches
• page tables live in slow RAM
• address translations are very frequent
• and extremely timing-sensitive
• TLB small, very fast address translation cache
• process switch TLB flush
• but: Tagged TLB, software-managed TLB
• typical size: 12 - 4k entries
• miss penalties up to 100 cycles
Writing Efficient Code in C++
31/86
December 2, 2017
Additional Effects
• some caches are shared, some are core-private
• out of order execution to avoid waits
• automatic or manual (compiler-assisted) prefetch
• speculative memory access
• ties in with branch prediction
Writing Efficient Code in C++
32/86
December 2, 2017
Some Tips
• use compact data structures (vector > list)
• think about locality of reference
• think about the size of your working set
• code size, not just speed, also matters
Writing Efficient Code in C++
33/86
December 2, 2017
See Also
• cpumemo ry. pdf in study materials
— somewhat advanced and somewhat long
— also very useful (the title is not wrong)
— don't forget to add 10 years
— oprofile is now pe rf
• http: //www. 7-cpu . com CPU latency data
Writing Efficient Code in C++
34/86
December 2, 2017
Exercise 3
• write benchmarks that measure cache effects
Some Ideas
• walk a random section of a long std: : list
• measure time per item in relation to list size
• same but with a std: : vecto r
• same but access randomly chosen elements (vector only)
Writing Efficient Code in C++
35/86
December 2, 2017
Some Issues
• unif orm_int_dist ribution has odd timing behaviour
• but we don't really care about uniformity
• you may need to fight the optimiser a bit
• especially make sure to avoid undefined behaviour
• indexing vs iteration have wildly different behaviour
• shuffling your code slightly can affect the results a lot
Writing Efficient Code in C++
36/86
December 2, 2017
Part 5: Profiling I, callgrind
Writing Efficient Code in C++
37/86
December 2, 2017
Why profiling?
• it's not always obvious what is the bottleneck
• benchmarks don't work so well with complex systems
• performance is not quite composable
• the equivalent of printf debugging isn't too nice
Writing Efficient Code in C++
38/86
December 2, 2017
Workflow
1. use a profiler to identify expensive code
— the more time program spent doing X,
— the more sense it makes to optimise X
2. improve the affected section of code
— re-run the profiler, compare the two profiles
— if satisfied with the improvement, goto 1
— else goto 2
Writing Efficient Code in C++
39/86
December 2, 2017
What to Optimise
• imagine the program spends 50 % time doing X
— optimise X to run in half the time
— the overall runtime is reduced by 25 %
— good return on investment
• law of diminishing returns
— now only 33 % of time is spent on X
— cutting X in half again only gives 17 % of total
— and so on, until it makes no sense to optimise X
Writing Efficient Code in C++
40/86
December 2, 2017
Flat vs Structured Profiles
• flat profiles are easier to obtain
• but also harder to use
— just a list of functions and cost
— the context & structure is missing
• call stack data is a lot harder to obtain
— endows the profile with very rich structure
— reflects the actual control flow
Writing Efficient Code in C++
41/86
December 2, 2017
cachegrind
• part of the valg rind tool suite
• dynamic translation and instrumentation
• based on simulating CPU timings
— instruction fetch and decode
— somewhat abstract cost model
• can optionally simulate caches
• originally only flat profiles
Writing Efficient Code in C++
42/86
December 2, 2017
callgrind
• records entire call stacks
• can reconstruct call graphs
• very useful for analysis of complex programs
kcachegrind
• graphical browser for callgrind data
• demo
Writing Efficient Code in C++
43/86
December 2, 2017
Exercise 4
• there's a simple BFS implementation in study materials
• you can also use/compare your own BFS implementation
• don't forget to use -02 - g or such when compiling
• generate a profile with cachegrind
• load it up into kcacheg rind
• generate another, using callg rind this time & compare
Writing Efficient Code in C++
44/86
December 2, 2017
Exercise 4 (cont'd)
• add cache simulation options &c.
• explore the knobs in kcacheg rind
• experiment with the size of the generated graph
• optimise the BFS implementation based on profile data
Writing Efficient Code in C++
45/86
December 2, 2017
Homework 3
• implement a real-valued matrix data structure [lpt]
• implement 2 matrix multiplication algorithms [lpt]
— natural order
— cache-efficient order
• compare the implementations using benchmarks [lpt]
• the output should be again gnuplot sources on stdout
Writing Efficient Code in C++
46/86
December 2, 2017
Part 6: Tuning for the Compiler
Writing Efficient Code in C++
47/86
December 2, 2017
Goals
• write high-level code
• with good performance
What We Need to Know
• which costs are easily eliminated by the compiler?
• how to make best use of the optimiser (with minimal cost)?
Writing Efficient Code in C++
48/86
December 2, 2017
How Compilers Work
• read and process the source text
• generate low-level intermediate representation
• run IR-level optimisation passes
• generate native code for a given target
Writing Efficient Code in C++
49/86
December 2, 2017
Intermediate Representation
• for C++ compilers typically a (partial) SSA
• reflects CPU design / instruction sets
• symbolic addresses (like assembly)
• explicit control and data flow
Writing Efficient Code in C++
50/86
December 2, 2017
IR-Level Optimiser
• common sub-expression elimination
• loop-invariant code motion
• loop strength reduction
• loop unswitching
• sparse conditional constant propagation
• (regular) constant propagation
• dead code elimination
Writing Efficient Code in C++
51/86
December 2, 2017
Common Sub-expression Elimination
• identify redundant (& side-effect free) computation
• compute the result only once & re-use the value
• not as powerful as equational reasoning
Writing Efficient Code in C++
52/86
December 2, 2017
Loop-Invariant Code Motion
• identify code that is independent of the loop variable
• and also free of side effects
• hoist the code out of the loop
• basically a loop-enabled variant of CSE
Writing Efficient Code in C++
53/86
December 2, 2017
The Cost of Calls
• prevents CSE (due to possible side effects)
• prevents all kinds of constant propagation
Inlining
• removes the cost of calls
• improves all intra-procedural analyses
• inflates code size
• only possible if the IR-level definition is available
See also: link-time optimisation
Writing Efficient Code in C++
54/86
December 2, 2017
The Cost of Abstraction: Encapsulation
• API or ABI level?
• API: cost quickly eliminated by the inliner
• ABI: not even LTO can fix this
• ABI-compatible setter is a call instead of a single store
Writing Efficient Code in C++
55/86
December 2, 2017
The Cost of Abstraction: Late Dispatch
• used for virtual methods in C++
• indirect calls (through a vtable)
• also applies to C-based approaches (gobj ect)
• prevents (naive) inlining
• compilers (try to) devirtualise calls
Writing Efficient Code in C++
56/86
December 2, 2017
Exercise 5
• start with bf s . cpp from study materials
• make a version where edges () is in a separate C++ file
• you will need to use std: : function
• try a compromise using a visitor pattern
• compare all three approaches using benchmarks
Writing Efficient Code in C++
57/86
December 2, 2017
Intermezzo 2: Competition & Homework
Writing Efficient Code in C++
58/86
December 2, 2017
Competition
• download competitionl. tar.gz from study materials
• run make personalize I=xx where xx is your initials
• in xx. hpp, implement a set of cha r
— must support insert, erase and count
— operator & (for intersection of two sets)
— operator | (for union of two sets)
• make check to run unit tests
• make bench to run benchmarks
Writing Efficient Code in C++
59/86
December 2, 2017
Homework 4
• implement a set of uint 16_t using a bitvector [lpt]
— with insert, erase, union and intersection
• the same using a nibble-trie [lpt]
— a trie with out-degree 16 (4 bits)
— should have a maximum depth of 4
— implement insert and union
• compare the two implementations [lpt]
Writing Efficient Code in C++
60/86
December 2, 2017
Part 7: Understanding the CPU
Writing Efficient Code in C++
61/86
December 2, 2017
The Simplest CPU
• in-order, one instruction per cycle
• sources of inefficiency
— most circuitry is idle most of the time
— not very good use of silicon
• but it is reasonably simple
Writing Efficient Code in C++
62/86
December 2, 2017
Design Motivation
• silicon (die) area is expensive
• switching speed is limited
• heat dissipation is limited
• transistors cannot be arbitrarily shrunk
• "wires" are not free either
Writing Efficient Code in C++
63/86
December 2, 2017
The Classic RISC Pipeline
• fetch - get instruction from memory
• decode - figure out what to do
• execute - do the thing
• memory - read/write to memory
• write back - store results in the register file
Writing Efficient Code in C++
64/86
December 2, 2017
Instruction Fetch
• pull the instruction from cache, into the CPU
• the address of the instruction is stored in PC
• traditionally does branch "prediction"
— in simple RISC CPUs always predicts not taken
— this is typically not a very good prediction
— loops usually favour taken heavily
Writing Efficient Code in C++
65/86
December 2, 2017
Instruction Decode
• not much actual decoding in RISC ISAs
• but it does register reads
• and also branch resolution
— might need a big comparator circuit
— depending on ISA (what conditional branches exist)
— updates the PC
Writing Efficient Code in C++
66/86
December 2, 2017
Execute
• this is basically the ALU
— ALU = arithmetic and logic unit
• computes bitwise and shift/rotate operations
• integer addition and subtraction
• integer multiplication and division (multi-cycle)
Writing Efficient Code in C++
67/86
December 2, 2017
Memory
• dedicated memory instructions in RISC
— load and store
— pass through execute without effect
• can take a few cycles
• moves values between memory and registers
Write Back
• write data back into registers
• so that later instructions can use the results
Writing Efficient Code in C++
68/86
December 2, 2017
Pipeline Problems
• data hazards (result required before written)
• control hazards (branch misprediction)
• different approaches possible
— pipeline stalls (bubbles)
— delayed branching
• structural hazards
— multiple instructions try to use a single block
— only relevant on more complex architectures
Writing Efficient Code in C++
69/86
December 2, 2017
Superscalar Architectures
• more parallelism than a scalar pipeline
• can retire more than one instruction per cycle
• extracted from sequential instruction stream
• dynamically established data dependencies
• some units are replicated (e.g. 2 ALUs)
Writing Efficient Code in C++
70/86
December 2, 2017
Out-of-order execution
• tries to fill in pipeline stalls/bubbles
• same principle as super-scalar execution
— extracts dependencies during execution
— execute if all data ready
— even if not next in the program
Writing Efficient Code in C++
71/86
December 2, 2017
Speculative Execution
• sometimes it's not yet clear what comes next
• let's decode, compute etc. something anyway
• fills in more bubbles in the pipeline
• but not always with actual useful work
• depends on the performance of branch prediction
Writing Efficient Code in C++
72/86
December 2, 2017
Take-Away
• the CPU is very good at utilising circuitry
• it is somewhat hard to write "locally" inefficient code
• you should probably concentrate on non-local effects
— non-local with respect to instruction stream
— like locality of reference
— and organisation of data in memory in general
— also higher-level algorithm structure
Writing Efficient Code in C++
73/86
December 2, 2017
Exercise 6
• implement a brainfuck interpreter
• try to make it as fast as possible
• see wikipedia for some example programs
Bonus Homework
• write a brainfuck amd64 compiler
• 2 points for emitting symbolic assembly
• 1 extra for emitting binary code
Writing Efficient Code in C++
74/86
December 2, 2017
Part 8: Exploiting Parallelism
Writing Efficient Code in C++
75/86
December 2, 2017
Hardware vs Software
• hardware is naturally parallel
• software is naturally sequential
• something has to give
— depends on the throughput you need
— eventually your software needs to go parallel
Writing Efficient Code in C++
76/86
December 2, 2017
Algorithms
• some algorithms are inherently sequential
— typically for P-complete problems
— for instance DFS post-order
• which algorithm do you really need though?
— topological sort is much easier than post-order
• some tasks are trivially concurrent
— think map-reduce
Writing Efficient Code in C++
77/86
December 2, 2017
Task Granularity
• how big are the tasks you can run in parallel?
— big tasks = little task-switching overhead
— small tasks = easier to balance out
• how much data do they need to share?
— shared memory vs message passing
Writing Efficient Code in C++
78/86
December 2, 2017
Distributed Memory
• comparatively big sub-tasks
• not much data structure sharing (small results)
• scales extremely well (millions of cores)
Shared Memory
• small, tightly intertwined tasks
• sharing a lot of data
• scales quite poorly (hundreds of cores)
Writing Efficient Code in C++
79/86
December 2, 2017
Caches vs Parallelism
• different CPUs are connected to different caches
• caches are normally transparent to the program
• what if multiple CPUs hold the same value in cache
— they could see different versions at the same time
— need cache coherence protocols
Writing Efficient Code in C++
80/86
December 2, 2017
Cache Coherence
• many different protocols exist
• a common one is MESI (4 cache line states)
— modified, exclusive, shared, invalid
— snoops on the bus to keep up to date
• cheap until two cores hit the same cache line
— required for communication
— also happens accidentally
Writing Efficient Code in C++
81/86
December 2, 2017
Locality of Reference
• comes with a twist in shared memory
• compact data is still good, but
— different cores may use different pieces of data
— if they are too close, this becomes costly
— also known as false sharing
Writing Efficient Code in C++
82/86
December 2, 2017
Distribution of Work
• want to communicate as little as possible
• also want to distribute work evenly
• randomised, spread-out data often works well
— think hash tables
• structures with a single active point are bad
— think stacks, queues, counters &c.
Writing Efficient Code in C++
83/86
December 2, 2017
Shared-Memory Parallelism in C++
• std: : thread - create threads
• std: : future - delayed (concurrent) values
• std: : atomic - atomic (thread-safe) values
• std::mutexand std::lock_guard
Writing Efficient Code in C++
84/86
December 2, 2017
Exercise 7
• implement shared-memory map-reduce in C++
• make the number of threads a runtime parameter
• check how this scales (wall time vs number of cores)
• use this for summing up a (big) array of numbers
• can you improve on this by hand-rolling the summing loop?
Writing Efficient Code in C++
85/86
December 2, 2017
Homework 5
• implement parallel matrix multiplication [2pt]
• compare to your sequential versions [lpt]
— try with 2 and 4 threads in your benchmarks
Writing Efficient Code in C++
86/86
December 2, 2017