November 11, 2019
Writing Efficient Code in C(++)
Petr Ročkai
Writing Efficient Code in C(++) 2/66 November 11, 2019
Organisation
• theory: 20-30 minutes every week
• coding: all the remaining time
• passing the subject: collect 15 points
• most points come from assignments
• showing up 5 times gets you 1 point
Writing Efficient Code in C(++) 3/66 November 11, 2019
Assignments
• one assignment every 2 weeks, 6 in total
• missing the deadline is the same as failing it
• one assignment = 2 points
Bonuses (per assignment)
• add 1 point if you pass within 14 days
• else add 0.5 points if you pass by 20.12.
Writing Efficient Code in C(++) 4/66 November 11, 2019
Assignments (cont’d)
• details about submission next week
• you can use C or C++
• the code must be valid C11 or C++17
• lite tests run every midnight
• full tests and reviews are done at deadlines
Writing Efficient Code in C(++) 5/66 November 11, 2019
Competitions
• we will hold 3 competitions in the seminar
• do your best in 40 minutes on a small problem
• the winner gets 2 points, second place gets 1 point
• all other working programs get 0.5 points
• we’ll dissect the winning program together
Writing Efficient Code in C(++) 6/66 November 11, 2019
Exercises in Seminar
• one of you will be programming live
• i.e. what you do will be shown on the beamer
• you have to do this once to pass the subject
∘ it is okay to do it more than once though
∘ you will get 1 point for each instance
Writing Efficient Code in C(++) 7/66 November 11, 2019
Semester Plan (part 1)
date
1. computational complexity 16.9.
2. microbenchmarking & stats 23.9.
3. the memory hierarchy 30.9.
cancelled 7.10.
4. using callgrind 14.10.
5. tuning for the compiler 21.10.
state holiday 28.10.
Writing Efficient Code in C(++) 8/66 November 11, 2019
Semester Plan (part 2)
date
6. competition 1 4.11.
7. understanding the CPU 11.11.
8. exploiting parallelism 18.11.
9. competition 2 25.11.
10. using perf 2.12.
11. Q&A, homework recap 9.12.
12. competition 3 16.12.
Writing Efficient Code in C(++) 9/66 November 11, 2019
Assignment Schedule
given due
1. benchmarking tool 23.9. 7.10.
2. matrix multiplication 7.10. 21.10.
3. sets of integers 21.10. 4.11.
4. substring search 4.11. 18.11.
5. parallel computation 18.11. 2.12.
6. a hash table 2.12. 16.12.
Writing Efficient Code in C(++) 10/66 November 11, 2019
Efficient Code
• computational complexity
• the memory hierarchy
• tuning for the compiler & optimiser
• understanding the CPU
• exploiting parallelism
Writing Efficient Code in C(++) 11/66 November 11, 2019
Understanding Performance
• writing and evaluating benchmarks
• profiling with callgrind
• profiling with perf
• the law of diminishing returns
• premature optimisation is the root of all evil
• (but when is the right time?)
Writing Efficient Code in C(++) 12/66 November 11, 2019
Tools
• on a POSIX operating system (preferably not in a VM)
• perf (Linux-only, sorry)
• callgrind (part of the valgrind suite)
• kcachegrind (for visualisation of callgrind logs)
• maybe gnuplot for plotting performance data
Writing Efficient Code in C(++) 13/66 November 11, 2019
Compilers
• please stick to C++17 and C11 (or C99)
• the reference compiler will be clang 8.0.0
• you can use other compilers locally
• but your code has to build with clang 8
Writing Efficient Code in C(++) 14/66 November 11, 2019
Part 1: Computational Complexity
Writing Efficient Code in C(++) 15/66 November 11, 2019
Complexity and Efficiency
• this class is not about asymptotic behaviour
• you need to understand complexity to write good code
• performance and security implications
• what is your expected input size?
• complexity vs constants vs memory use
Writing Efficient Code in C(++) 16/66 November 11, 2019
Quiz
• what’s the worst-case complexity of:
∘ a bubble sort? (standard) quick sort?
∘ inserting an element into a RB tree?
∘ inserting an element into a hash table?
∘ inserting an element into a sorted array?
∘ appending an element to a dynamic array?
• what are the amortised complexities?
• how about expected (average)?
Writing Efficient Code in C(++) 17/66 November 11, 2019
Hash Tables
• often the most efficient data structure available
• poor theoretical worst-case complexity
∘ what if the hash function is really bad?
• needs a fast hash function for efficiency
∘ rules out secure (cryptographic) hashes
Writing Efficient Code in C(++) 18/66 November 11, 2019
Worst-Case Complexity Matters
• CVE-2011-4815, 4838, 4885, 2012-0880, ...
• apps can become unusable with too many items
• use a better algorithm if you can (or must)
• but: simplicity of code is worth a lot, too
• also take memory complexity and constants into account
Writing Efficient Code in C(++) 19/66 November 11, 2019
Constants Matter
• 𝑛 ops if each takes 1 second
• 𝑛 log 𝑛 ops if each takes .1 second
• 𝑛2
ops if each takes .01 second
Picking the Right Approach
• where are the crossover points?
• what is my typical input size?
• is it worth picking an approach dynamically?
• what happens in pathological cases?
Writing Efficient Code in C(++) 20/66 November 11, 2019
Exercises
• log into aisa
• run pb173eff update
• then cd ~/pb173eff/01
• and cat intro.txt
Writing Efficient Code in C(++) 21/66 November 11, 2019
Part 2: Microbenchmarking & Statistics
Writing Efficient Code in C(++) 22/66 November 11, 2019
Motivation
• there’s a gap between high-level code and execution
• the gap has widened over time
∘ higher-level languages & more abstraction
∘ more powerful optimisation procedures
∘ more complex machinery inside the CPU
∘ complicated cache effects
• it is very hard to predict actual performance
Writing Efficient Code in C(++) 23/66 November 11, 2019
Challenges
• performance is very deterministic in theory
• this is not the case in practice
∘ time-sharing operating systems
∘ cache content and/or swapping
∘ power management, CPU frequency scaling
∘ program nondeterminism; virtual machines
• both micro (unit) and system benchmarks are affected
Writing Efficient Code in C(++) 24/66 November 11, 2019
Unit vs System Benchmarking
• a benchmark only gives you one number
• it is hard to find causes of poor performance
• unit benchmarks are like unit tests
∘ easier to tie causes to effects
∘ faster to run (minutes or hours vs hours or days)
∘ easier to make parametric
Writing Efficient Code in C(++) 25/66 November 11, 2019
Isolation vs Statistics
• there are many sources of measurement errors
• some are systematic, others are random (noise)
• noise is best fought with statistics
• but statistics can’t fix systematic errors
• benchmark data is not normally distributed
Writing Efficient Code in C(++) 26/66 November 11, 2019
Repeated Measurements
• you will need to do repeat measurements
• more repeats give you better precision
∘ the noise will average out
∘ execution time vs precision tradeoff
• the repeat runs form your input sample
∘ this is what you feed into bootstrap
Writing Efficient Code in C(++) 27/66 November 11, 2019
Bootstrap
• usual statistical tools are distribution-dependent
• benchmark data is distributed rather oddly
• idea: take many random re-samplings of the data
• take 5th and 95th percentile as a confidence interval
• this is a very robust (if stochastic) approach
Writing Efficient Code in C(++) 28/66 November 11, 2019
Implementing Bootstrap
• inputs: a sample, an estimator and iteration count
• outputs: a new sample
• in each iteration, create a random resample
∘ add a random item from the original sample
∘ repeats are allowed (this is important)
∘ size of the resample = size of the original
Writing Efficient Code in C(++) 29/66 November 11, 2019
Estimators
• most useful estimators are the mean (average)
• and various percentiles (e.g. median)
• you can also estimate standard deviation
∘ but keep in mind the original data is not normal
Writing Efficient Code in C(++) 30/66 November 11, 2019
Output Distribution
• the output of bootstrap is another distribution
• you can expect this one to be normal
• it is the distribution of the estimator result
• you can compute the mean and 𝜎 of the bootstrap
Writing Efficient Code in C(++) 31/66 November 11, 2019
Part 3: The Memory Hierarchy
Writing Efficient Code in C(++) 32/66 November 11, 2019
• CPU registers: very few, very fast (no latency)
• L1 cache: small (100s of KiB), plenty fast (~4 cycles)
• L2 cache: still small, medium fast (~12 cycles)
• L3 cache: ~2-32 MiB, slow-ish (~36 cycles)
• L4 cache: (only some CPUs) ~100 MiB (~90 cycles)
• DRAM: many gigabytes, pretty slow (~200 cycles)
• NVMe: ~10k cycles
• SSD: ~20k cycles
• spinning rust: ~30M cycles
• RTT to US: ~450M cycles
Writing Efficient Code in C(++) 33/66 November 11, 2019
Paging vs Caches
• page tables live in slow RAM
• address translations are very frequent
• and extremely timing-sensitive
• TLB $→$ small, very fast address translation cache
• process switch → TLB flush
• but: Tagged TLB, software-managed TLB
• typical size: 12 - 4k entries
• miss penalties up to 100 cycles
Writing Efficient Code in C(++) 34/66 November 11, 2019
Additional Effects
• some caches are shared, some are core-private
• out of order execution to avoid waits
• automatic or manual (compiler-assisted) prefetch
• speculative memory access
• ties in with branch prediction
Writing Efficient Code in C(++) 35/66 November 11, 2019
Some Tips
• use compact data structures (vector beats list)
• think about locality of reference
• think about the size of your working set
• code size, not just speed, also matters
Writing Efficient Code in C(++) 36/66 November 11, 2019
See Also
• cpumemory.pdf in study materials
∘ somewhat advanced and somewhat long
∘ also very useful (the title is not wrong)
∘ don’t forget to add 10 years
∘ oprofile is now perf
• http://www.7-cpu.com CPU latency data
Writing Efficient Code in C(++) 37/66 November 11, 2019
Part 4: Profiling I, callgrind
Writing Efficient Code in C(++) 38/66 November 11, 2019
Why profiling?
• it’s not always obvious what is the bottleneck
• benchmarks don’t work so well with complex systems
• performance is not quite composable
• the equivalent of printf debugging isn’t too nice
Writing Efficient Code in C(++) 39/66 November 11, 2019
Workflow
1. use a profiler to identify expensive code
∘ the more time program spent doing X,
∘ the more sense it makes to optimise X
2. improve the affected section of code
∘ re-run the profiler, compare the two profiles
∘ if satisfied with the improvement, goto 1
∘ else goto 2
Writing Efficient Code in C(++) 40/66 November 11, 2019
What to Optimise
• imagine the program spends 50 % time doing X
∘ optimise X to run in half the time
∘ the overall runtime is reduced by 25 %
∘ good return on investment
• law of diminishing returns
∘ now only 33 % of time is spent on X
∘ cutting X in half again only gives 17 % of total
∘ and so on, until it makes no sense to optimise X
Writing Efficient Code in C(++) 41/66 November 11, 2019
Flat vs Structured Profiles
• flat profiles are easier to obtain
• but also harder to use
∘ just a list of functions and cost
∘ the context & structure is missing
• call stack data is a lot harder to obtain
∘ endows the profile with very rich structure
∘ reflects the actual control flow
Writing Efficient Code in C(++) 42/66 November 11, 2019
cachegrind
• part of the valgrind tool suite
• dynamic translation and instrumentation
• based on simulating CPU timings
∘ instruction fetch and decode
∘ somewhat abstract cost model
• can optionally simulate caches
• originally only flat profiles
Writing Efficient Code in C(++) 43/66 November 11, 2019
callgrind
• records entire call stacks
• can reconstruct call graphs
• very useful for analysis of complex programs
kcachegrind
• graphical browser for callgrind data
• demo
Writing Efficient Code in C(++) 44/66 November 11, 2019
Part 5: Tuning for the Compiler
Writing Efficient Code in C(++) 45/66 November 11, 2019
Goals
• write high-level code
• with good performance
What We Need to Know
• which costs are easily eliminated by the compiler?
• how to best use the optimiser (with minimal cost)?
Writing Efficient Code in C(++) 46/66 November 11, 2019
How Compilers Work
• read and process the source text
• generate low-level intermediate representation
• run IR-level optimisation passes
• generate native code for a given target
Writing Efficient Code in C(++) 47/66 November 11, 2019
Intermediate Representation
• for C++ compilers typically a (partial) SSA
• reflects CPU design / instruction sets
• symbolic addresses (like assembly)
• explicit control and data flow
Writing Efficient Code in C(++) 48/66 November 11, 2019
IR-Level Optimiser
• common sub-expression elimination
• loop-invariant code motion
• loop strength reduction
• loop unswitching
• sparse conditional constant propagation
• (regular) constant propagation
• dead code elimination
Writing Efficient Code in C(++) 49/66 November 11, 2019
Common Sub-expression Elimination
• identify redundant (& side-effect free) computation
• compute the result only once & re-use the value
• not as powerful as equational reasoning
Writing Efficient Code in C(++) 50/66 November 11, 2019
Loop-Invariant Code Motion
• identify code that is independent of the loop variable
• and also free of side effects
• hoist the code out of the loop
• basically a loop-enabled variant of CSE
Writing Efficient Code in C(++) 51/66 November 11, 2019
The Cost of Calls
• prevents CSE (due to possible side effects)
• prevents all kinds of constant propagation
Inlining
• removes the cost of calls
• improves all intra-procedural analyses
• inflates code size
• only possible if the IR-level definition is available
See also: link-time optimisation
Writing Efficient Code in C(++) 52/66 November 11, 2019
The Cost of Abstraction: Encapsulation
• API or ABI level?
• API: cost quickly eliminated by the inliner
• ABI: not even LTO can fix this
• ABI-compatible setter is a call instead of a single store
Writing Efficient Code in C(++) 53/66 November 11, 2019
The Cost of Abstraction: Late Dispatch
• used for virtual methods in C++
• indirect calls (through a vtable)
• also applies to C-based approaches (gobject)
• prevents (naive) inlining
• compilers (try to) devirtualise calls
Writing Efficient Code in C(++) 54/66 November 11, 2019
Part 6: Understanding the CPU
Writing Efficient Code in C(++) 55/66 November 11, 2019
The Simplest CPU
• in-order, one instruction per cycle
• sources of inefficiency
∘ most circuitry is idle most of the time
∘ not very good use of silicon
• but it is reasonably simple
Writing Efficient Code in C(++) 56/66 November 11, 2019
Design Motivation
• silicon (die) area is expensive
• switching speed is limited
• heat dissipation is limited
• transistors cannot be arbitrarily shrunk
• “wires” are not free either
Writing Efficient Code in C(++) 57/66 November 11, 2019
The Classic RISC Pipeline
• fetch – get instruction from memory
• decode – figure out what to do
• execute – do the thing
• memory – read/write to memory
• write back – store results in the register file
Writing Efficient Code in C(++) 58/66 November 11, 2019
Instruction Fetch
• pull the instruction from cache, into the CPU
• the address of the instruction is stored in PC
• traditionally does branch “prediction”
∘ in simple RISC CPUs always predicts not taken
∘ this is typically not a very good prediction
∘ loops usually favour taken heavily
Writing Efficient Code in C(++) 59/66 November 11, 2019
Instruction Decode
• not much actual decoding in RISC ISAs
• but it does register reads
• and also branch resolution
∘ might need a big comparator circuit
∘ depending on ISA (what conditional branches exist)
∘ updates the PC
Writing Efficient Code in C(++) 60/66 November 11, 2019
Execute
• this is basically the ALU
∘ ALU = arithmetic and logic unit
• computes bitwise and shift/rotate operations
• integer addition and subtraction
• integer multiplication and division (multi-cycle)
Writing Efficient Code in C(++) 61/66 November 11, 2019
Memory
• dedicated memory instructions in RISC
∘ load and store
∘ pass through execute without effect
• can take a few cycles
• moves values between memory and registers
Write Back
• write data back into registers
• so that later instructions can use the results
Writing Efficient Code in C(++) 62/66 November 11, 2019
Pipeline Problems
• data hazards (result required before written)
• control hazards (branch misprediction)
• different approaches possible
∘ pipeline stalls (bubbles)
∘ delayed branching
• structural hazards
∘ multiple instructions try to use a single block
∘ only relevant on more complex architectures
Writing Efficient Code in C(++) 63/66 November 11, 2019
Superscalar Architectures
• more parallelism than a scalar pipeline
• can retire more than one instruction per cycle
• extracted from sequential instruction stream
• dynamically established data dependencies
• some units are replicated (e.g. 2 ALUs)
Writing Efficient Code in C(++) 64/66 November 11, 2019
Out-of-order execution
• tries to fill in pipeline stalls/bubbles
• same principle as super-scalar execution
∘ extracts dependencies during execution
∘ execute if all data ready
∘ even if not next in the program
Writing Efficient Code in C(++) 65/66 November 11, 2019
Speculative Execution
• sometimes it’s not yet clear what comes next
• let’s decode, compute etc. something anyway
• fills in more bubbles in the pipeline
• but not always with actual useful work
• depends on the performance of branch prediction
Writing Efficient Code in C(++) 66/66 November 11, 2019
Take-Away
• the CPU is very good at utilising circuitry
• it is somewhat hard to write “locally” inefficient code
• you should probably concentrate on non-local effects
∘ non-local with respect to instruction stream
∘ like locality of reference
∘ and organisation of data in memory in general
∘ also higher-level algorithm structure