Hardware-aware Performance Tuning in C/C++
Jiří Filipovič
Sep. 2021

Focus of the Lecture

We will learn how to optimize C/C++ code to get more performance from contemporary processors:
- maximizing the benefit of the cache architecture
- writing code that takes advantage of compiler auto-vectorization
- using multiple cores efficiently

We will not cover all interesting topics:
- only basic optimizations from each category
- no language-specific optimizations (inlining, proper usage of virtual functions, etc.)
- no hardcore, assembly-level optimizations

Demonstration Example

We will demonstrate the optimization methods on a concrete example:
- I have tried to find a computational problem that is as simple as possible, yet still exposes plenty of opportunities for various optimization techniques.

The code is not very abstract or generic:
- in productivity-oriented programming, we want to hide how algorithms are performed, how data are stored, etc.
- when optimizing code, however, we have to focus on implementation details, so the code looks more "old school"
- in practice, usually only a very small fraction of the source code is performance-critical, so a different programming style for the optimized code is not a problem

Electrostatic Potential Map

An important problem from computational chemistry:
- we have a molecule defined by the positions and charges of its atoms
- the goal is to compute the potential on a 3D spatial grid around the molecule

At a given grid point i, the potential is

    V_i = \sum_j \frac{w_j}{4 \pi \varepsilon_0 r_{ij}}

where w_j is the charge of the j-th atom, r_{ij} is the Euclidean distance between atom j and grid point i, and \varepsilon_0 is the vacuum permittivity.

Electrostatic Potential Map

Initial implementation:
- suppose we know nothing about the HW and just know C++
- the algorithm needs to process the 3D grid so that, for each grid point, it sums the potential contributions of all atoms
- we iterate over atoms in the outer loop, as this allows us to precompute the positions of grid points and minimizes the number of accesses to the input/output arrays

Electrostatic Potential Map

void coulomb(const sAtom *atoms, const int nAtoms, const float gs,
             const int gSize, float *grid)
{
    for (int a = 0; a < nAtoms; a++) {
        sAtom myAtom = atoms[a];
        for (int x = 0; x < gSize; x++) {
            float dx2 = powf((float)x * gs - myAtom.x, 2.0f);
            for (int y = 0; y < gSize; y++) {
                float dy2 = powf((float)y * gs - myAtom.y, 2.0f);
                for (int z = 0; z < gSize; z++) {
                    float dz = (float)z * gs - myAtom.z;
                    float e = myAtom.w / sqrtf(dx2 + dy2 + dz*dz);
                    grid[z*gSize*gSize + y*gSize + x] += e;
                }
            }
        }
    }
}

Benchmarking

We will benchmark the codes on a pretty average desktop system:
- 4 cores
- AVX2 (256-bit vectors), no FMA

Try to guess how much the original code can be sped up :-).
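The slides do not show how the MEvals/s numbers are measured. The following is a minimal sketch of a timing harness, assuming that one "evaluation" means one atom/grid-point interaction and that sAtom is a plain struct of four floats, as the listing above implies; the grid spacing and the atom initialization are illustrative values, not taken from the lecture.

#include <chrono>
#include <cstdio>
#include <vector>

struct sAtom { float x, y, z, w; };   // layout implied by the listings (assumption)

void coulomb(const sAtom *atoms, const int nAtoms, const float gs,
             const int gSize, float *grid);   // the function defined above

int main() {
    const int gSize = 256;            // grid size used in the measurements
    const int nAtoms = 4096;          // atom count used in the measurements
    const float gs = 0.5f;            // grid spacing, illustrative value

    std::vector<sAtom> atoms(nAtoms);
    for (int a = 0; a < nAtoms; a++)  // deterministic dummy molecule
        atoms[a] = { (a % 97) * 1.3f, (a % 89) * 1.7f, (a % 83) * 1.1f, 1.0f };
    std::vector<float> grid((size_t)gSize * gSize * gSize, 0.0f);

    auto t0 = std::chrono::steady_clock::now();
    coulomb(atoms.data(), nAtoms, gs, gSize, grid.data());
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    // assuming one evaluation = one atom/grid-point pair
    double evals = (double)nAtoms * gSize * gSize * gSize;
    printf("%.1f MEvals/s\n", evals / seconds / 1e6);
    return 0;
}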
Cache Memories

Why do we have cache memories in modern processors?
- the main memory is too slow (in both latency and bandwidth) compared to the compute cores
- we can build a much faster, but also more expensive, memory
- a cache is a fast memory which temporarily keeps parts of larger and slower memories

Cache Implementation

How does it work?
- multiple levels (usually L1 and L2 private to a core, L3 shared)
- accessed by cache lines (64 bytes on Intel architectures)
- when data are accessed, they are stored in the cache and kept until the cache line is needed for other data
- limited associativity (each cache line may cache only defined parts of the main memory)
- parallel access to memory: cache lines must be somehow synchronized (broadcast, invalidation)

Optimization for Cache

Optimization for spatial locality:
- access consecutive elements
- align data to a multiple of the cache line size; otherwise only part of the transferred data is used

Optimization for temporal locality:
- when a data element needs to be accessed multiple times, perform the accesses within a short time
- otherwise it may be evicted from the cache due to its limited capacity or associativity

Avoid inefficient usage:
- conflict misses
- false sharing (see the sketch below)
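Alignment and false sharing are not illustrated in the slides themselves; below is a minimal sketch of both ideas, assuming the 64-byte cache line quoted above. The struct and variable names are illustrative, not from the lecture code.

#include <cstdio>
#include <cstdlib>

// Each per-thread accumulator gets its own cache line: alignas(64) forces
// 64-byte alignment and rounds the struct size up to 64 bytes, so accumulators
// of neighbouring threads never share a line (which would cause false sharing).
struct alignas(64) PaddedCounter {
    long value;
};

int main() {
    // Allocate a float array starting at a cache-line boundary (C++17
    // std::aligned_alloc); the requested size must be a multiple of the alignment.
    const std::size_t n = 1 << 20;
    float *data = static_cast<float*>(std::aligned_alloc(64, n * sizeof(float)));
    if (!data) return 1;
    for (std::size_t i = 0; i < n; i++)
        data[i] = 0.0f;

    printf("sizeof(PaddedCounter) = %zu bytes\n", sizeof(PaddedCounter)); // prints 64
    std::free(data);
    return 0;
}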
Electrostatic Potential Map

Recall the initial implementation:

void coulomb(const sAtom *atoms, const int nAtoms, const float gs,
             const int gSize, float *grid)
{
    for (int a = 0; a < nAtoms; a++) {
        sAtom myAtom = atoms[a];
        for (int x = 0; x < gSize; x++) {
            float dx2 = powf((float)x * gs - myAtom.x, 2.0f);
            for (int y = 0; y < gSize; y++) {
                float dy2 = powf((float)y * gs - myAtom.y, 2.0f);
                for (int z = 0; z < gSize; z++) {
                    float dz = (float)z * gs - myAtom.z;
                    float e = myAtom.w / sqrtf(dx2 + dy2 + dz*dz);
                    grid[z*gSize*gSize + y*gSize + x] += e;
                }
            }
        }
    }
}

Evaluation

We have compiled the code above with vectorization switched off (as we are interested in the effects of memory access only):
- 31.6 million atom evaluations per second (MEvals/s) using a 256 × 256 × 256 grid and 4096 atoms
- 164.7 MEvals/s after changing the grid size to 257 × 257 × 257

Interpretation:
- the strong dependence on the input size indicates problems with cache associativity
- even 164.7 MEvals/s is not a very good result, considering that only 8 floating-point operations are performed in the innermost loop

Spatial Locality

We are interested in the innermost loop:
- it defines the memory access pattern (i.e., which elements are accessed consecutively)
- the innermost loop runs over z, which creates large memory strides when accessing grid
- when the grid size is a power of two, the columns hit the same associativity sets

Optimization:
- we need to rearrange the loops so that the innermost loop iterates over x

Spatial Locality

void coulomb(const sAtom *atoms, const int nAtoms, const float gs,
             const int gSize, float *grid)
{
    for (int a = 0; a < nAtoms; a++) {
        sAtom myAtom = atoms[a];
        for (int z = 0; z < gSize; z++) {
            float dz2 = powf((float)z * gs - myAtom.z, 2.0f);
            for (int y = 0; y < gSize; y++) {
                float dy2 = powf((float)y * gs - myAtom.y, 2.0f);
                for (int x = 0; x < gSize; x++) {
                    float dx = (float)x * gs - myAtom.x;
                    float e = myAtom.w / sqrtf(dx*dx + dy2 + dz2);
                    grid[z*gSize*gSize + y*gSize + x] += e;
                }
            }
        }
    }
}

Evaluation

Performance measurement:
- 371.8 MEvals/s using a 256 × 256 × 256 grid and 4096 atoms (up from 31.6 and 164.7 MEvals/s)
- no sensitivity to changing the grid size (no cache associativity problem)
- much better spatial locality

Analysis of the cache access pattern:
- each atom is applied to the whole grid
- poor temporal locality (the grid is too large a structure to stay in cache)

Temporal Locality

The atoms array is much smaller than the grid:
- we can rearrange the loops to iterate over atoms in the innermost loop: z-y-x-a
- alternatively, we may apply the atom contributions per grid row, creating the iteration order z-y-a-x
- or tiling may be used

Memory tiling:
- we break some loop into nested loops, such that the outer loop iterates with a step s > 1 and those steps are traversed by an inner loop
- multiple loops may be tiled
- the code is more complex (it was not shown in the original slides; a rough sketch follows below)
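The following is a minimal sketch of what tiling could look like here (this code is not from the slides): the x loop of the ZYAX order is split into blocks of TILE grid points, and all atoms are applied to one block before moving to the next, so the block stays resident in cache. It reuses the names from the listings above; TILE and the function name are illustrative.

void coulomb_tiled(const sAtom *atoms, const int nAtoms, const float gs,
                   const int gSize, float *grid)
{
    const int TILE = 1024;                             // illustrative tile size
    for (int z = 0; z < gSize; z++) {
        for (int y = 0; y < gSize; y++) {
            for (int xx = 0; xx < gSize; xx += TILE) { // tiled loop over x
                const int xEnd = (xx + TILE < gSize) ? xx + TILE : gSize;
                for (int a = 0; a < nAtoms; a++) {     // reuse the tile for every atom
                    sAtom myAtom = atoms[a];
                    float dz2 = powf((float)z * gs - myAtom.z, 2.0f);
                    float dy2 = powf((float)y * gs - myAtom.y, 2.0f);
                    for (int x = xx; x < xEnd; x++) {  // work inside the tile
                        float dx = (float)x * gs - myAtom.x;
                        grid[z*gSize*gSize + y*gSize + x] +=
                            myAtom.w / sqrtf(dx*dx + dy2 + dz2);
                    }
                }
            }
        }
    }
}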
Evaluation

Note that auto-vectorization is switched off for all implementations in this comparison.

Implementation        Performance (MEvals/s)   Speedup
Naive (grid 257)      164.7                    n/a
Spatial loc.          371.8                    2.26×
ZYXA                  359.7                    2.18×
ZYAX                  382.2                    2.32×

Temporal locality brings only a minor improvement here, but this may change once the instructions are optimized or the code is parallelized.

Vector Instructions

Modern processors have complex logic for preparing instructions:
- arithmetic units are relatively cheap
- when an instruction is executed, it may process multiple data elements in parallel

Data-parallel programming:
- the same instruction is applied to multiple data elements (SIMD model)
- explicit usage: we need to generate vector instructions

Vector Instructions

Vector instructions:
- the same operation is applied to a short vector
- mainly arithmetic operations; they may be masked and may contain support for reductions, binning, etc.
- the vector length depends on the data type and instruction set, e.g. AVX2 works with 256-bit vectors, so 8 32-bit numbers or 4 64-bit numbers are processed in parallel

Vectorization in C/C++:
- explicit: inline assembly or intrinsics
- implicit: the compiler generates vector instructions automatically

Automatic Vectorization

Better portability:
- the code can be compiled for any vector instruction set

Supported in modern compilers:
- however, it is a difficult task, so the programmer often has to assist the compiler
- the Intel C++ compiler is strong in auto-vectorization

Automatic Vectorization

Current limitations:
- only innermost for loops are vectorized
- the number of iterations must be known when the loop is entered, or (preferably) at compile time
- memory access must be regular, ideally with unit stride (i.e., consecutive elements are accessed by vector instructions)
- a vector dependence usually prevents vectorization

Vector Dependence

Vector dependence:
- a for loop cannot be vectorized if there is a flow dependence between its iterations
- however, the compiler may wrongly assume a vector dependence (it must be conservative to generate correct code)
- #pragma ivdep (Intel) or #pragma GCC ivdep (gcc) instructs the compiler to ignore assumed vector dependences (with a true dependence, the compiler still does not vectorize)

Contiguous Memory Access

struct vec {
    float x, y;
};
vec v[n];

for (int i = 0; i < n; i++)
    v[i].x *= 2.0f;

The loop is vectorized; however, the access to v is strided. A typical optimization is transforming the array of structures (AoS) into a structure of arrays (SoA).

struct vec {
    float *x;
    float *y;
};
vec v;
// allocation ...

for (int i = 0; i < n; i++)
    v.x[i] *= 2.0f;

Electrostatic Potential Map

The naive implementation contains an assumed dependence (the compiler is not clever enough to rule it out), which needs to be fixed manually:

for (int a = 0; a < nAtoms; a++) {
    sAtom myAtom = atoms[a];
    for (int x = 0; x < gSize; x++) {
        float dx2 = powf((float)x * gs - myAtom.x, 2.0f);
        for (int y = 0; y < gSize; y++) {
            float dy2 = powf((float)y * gs - myAtom.y, 2.0f);
            #pragma ivdep
            for (int z = 0; z < gSize; z++) {
                float dz = (float)z * gs - myAtom.z;
                float e = myAtom.w / sqrtf(dx2 + dy2 + dz*dz);
                grid[z*gSize*gSize + y*gSize + x] += e;
            }
        }
    }
}

Electrostatic Potential Map

AZYX and ZYAX innermost loop:

for (int x = 0; x < gSize; x++) {
    float dx = (float)x * gs - myAtom.x;
    float e = myAtom.w / sqrtf(dx*dx + dy2 + dz2);
    grid[z*gSize*gSize + y*gSize + x] += e;
}

The loop is automatically vectorized without problems.

Electrostatic Potential Map

ZYXA implementation:

for (int z = 0; z < gSize; z++) {
    for (int y = 0; y < gSize; y++) {
        for (int x = 0; x < gSize; x++) {
            float e = 0.0f;
            for (int a = 0; a < nAtoms; a++) {
                sAtom myAtom = atoms[a];
                float dx = (float)x * gs - myAtom.x;
                float dy = (float)y * gs - myAtom.y;
                float dz = (float)z * gs - myAtom.z;
                e += myAtom.w / sqrtf(dx*dx + dy*dy + dz*dz);
            }
            grid[z*gSize*gSize + y*gSize + x] += e;
        }
    }
}

ZYXA

The innermost loop is difficult to vectorize:
- strided memory access to the atoms elements
- reduction

Two possible solutions:
- AoS-to-SoA optimization
- vectorization of the outer loop running over x

SoA

for (int z = 0; z < gSize; z++) {
    for (int y = 0; y < gSize; y++) {
        for (int x = 0; x < gSize; x++) {
            float e = 0.0f;
            for (int a = 0; a < nAtoms; a++) {
                float dx = (float)x * gs - atoms.x[a];
                float dy = (float)y * gs - atoms.y[a];
                float dz = (float)z * gs - atoms.z[a];
                e += atoms.w[a] / sqrtf(dx*dx + dy*dy + dz*dz);
            }
            grid[z*gSize*gSize + y*gSize + x] += e;
        }
    }
}
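The slides do not show the SoA data structure that the listing above indexes; a possible definition matching the field names used there (the exact struct name, types and allocation are an assumption) could be:

// Structure-of-arrays layout for the atoms: each field of the original sAtom
// becomes a separate contiguous array, so the vectorized atom loop reads
// unit-stride data.
struct sAtomsSoA {
    float *x;   // x coordinates
    float *y;   // y coordinates
    float *z;   // z coordinates
    float *w;   // charges
};

// allocation sketch
sAtomsSoA atoms;
atoms.x = (float *)malloc(nAtoms * sizeof(float));
atoms.y = (float *)malloc(nAtoms * sizeof(float));
atoms.z = (float *)malloc(nAtoms * sizeof(float));
atoms.w = (float *)malloc(nAtoms * sizeof(float));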
Outer-loop Vectorization

for (int z = 0; z < gSize; z++) {
    for (int y = 0; y < gSize; y++) {
        #pragma simd
        for (int x = 0; x < gSize; x++) {
            float e = 0.0f;
            for (int a = 0; a < nAtoms; a++) {
                sAtom myAtom = atoms[a];
                float dx = (float)x * gs - myAtom.x;
                float dy = (float)y * gs - myAtom.y;
                float dz = (float)z * gs - myAtom.z;
                e += myAtom.w / sqrtf(dx*dx + dy*dy + dz*dz);
            }
            grid[z*gSize*gSize + y*gSize + x] += e;
        }
    }
}

All Implementations

We will use the restrict qualifier:
- otherwise, the compiler may assume aliasing even between atoms and grid and give up on vectorization

Evaluation

Implementation          Performance (MEvals/s)   Speedup   Vect. speedup
Naive (grid 257)        164.7                    n/a       n/a
Naive vec. (grid 257)   330.6                    2.01×     2.01×
Spatial loc.            1838                     11.2×     4.94×
ZYXA outer              2189                     13.3×     6.09×
ZYXA SoA                2203                     13.4×     6.12×
ZYAX                    2197                     13.3×     5.75×

Parallelization in C/C++

Thread-level parallelism in C/C++:
- there are many possible ways to parallelize code: pthreads, Boost threads, TBB, etc.
- we will use OpenMP in our examples, as it is a broadly supported standard and requires only small changes to our code
- however, the optimization principles are general and can be used with any parallelization interface

OpenMP

OpenMP, a standard for shared-memory parallelism:
- uses pragmas to declare which parts of the code run in parallel
- very easy to use, but writing efficient code may be challenging (much like in other interfaces)
- implements the fork-join model
- implemented in all major C/C++ compilers

OpenMP

A parallel region of the code is declared by #pragma omp parallel:

// serial code
const int n = 100;
#pragma omp parallel
{
    // parallel code
    printf("Hello from thread %d\n", omp_get_thread_num());

    // parallel loop, iteration order is undefined
    #pragma omp for
    for (int i = 0; i < n; i++) {
        // the iteration space is distributed across all threads
        printf("%d ", i);
    }
}
// serial code

OpenMP

We can define private and shared variables:

#pragma omp parallel for private(a) shared(b)

- variables declared before the parallel block are shared by default
- the private clause creates a private copy for each thread

Thread synchronization (see the sketch below):
- we can define a critical section with #pragma omp critical
- or use lightweight atomic operations, which are restricted to simple scalar operations such as + - * /
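A minimal sketch contrasting the two synchronization forms just mentioned; the accumulation itself is an illustrative example, not taken from the lecture code.

#include <cstdio>

int main() {
    const int n = 1000;
    float sumAtomic = 0.0f;
    float sumCritical = 0.0f;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        float v = 0.5f * i;          // some per-iteration result (illustrative)

        #pragma omp atomic           // lightweight: a single scalar update
        sumAtomic += v;

        #pragma omp critical         // general: protects an arbitrary block, higher overhead
        {
            sumCritical += v;
        }
    }

    printf("%f %f\n", sumAtomic, sumCritical);
    return 0;
}

For this particular pattern, a reduction(+:sum) clause on the parallel for would usually be preferred over either form.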
Electrostatic Potential Map

Which loop can be parallelized?
- AZYX: the loop running over atoms would need synchronization, so we prefer to parallelize a loop running over z, y or x
- ZYXA: we can parallelize up to the three outermost loops
- ZYAX: we can parallelize up to the two outermost loops

Which loop should we parallelize?
- entering and leaving a parallel loop is synchronized
- we want to minimize the number of synchronizations, so we parallelize the loops performing more work
- to scale better, we may collapse n perfectly nested loops using #pragma omp for collapse(n)

ZYXA Example

#pragma omp parallel for
for (int z = 0; z < gSize; z++) {
    for (int y = 0; y < gSize; y++) {
        #pragma simd
        for (int x = 0; x < gSize; x++) {
            float e = 0.0f;
            for (int a = 0; a < nAtoms; a++) {
                sAtom myAtom = atoms[a];
                float dx = (float)x * gs - myAtom.x;
                float dy = (float)y * gs - myAtom.y;
                float dz = (float)z * gs - myAtom.z;
                e += myAtom.w / sqrtf(dx*dx + dy*dy + dz*dz);
            }
            grid[z*gSize*gSize + y*gSize + x] += e;
        }
    }
}

ZYXA Example

#pragma omp parallel for collapse(2)
for (int z = 0; z < gSize; z++) {
    for (int y = 0; y < gSize; y++) {
        #pragma simd
        for (int x = 0; x < gSize; x++) {
            float e = 0.0f;
            for (int a = 0; a < nAtoms; a++) {
                sAtom myAtom = atoms[a];
                float dx = (float)x * gs - myAtom.x;
                float dy = (float)y * gs - myAtom.y;
                float dz = (float)z * gs - myAtom.z;
                e += myAtom.w / sqrtf(dx*dx + dy*dy + dz*dz);
            }
            grid[z*gSize*gSize + y*gSize + x] += e;
        }
    }
}

Evaluation

Implementation      Performance (MEvals/s)   Speedup   Par. speedup
Naive (grid 257)    164.7                    n/a       n/a
Spatial loc.        2272                     13.8×     1.24×
ZYXA outer          7984                     48.5×     3.62×
ZYAX                8092                     49.1×     3.68×
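The restrict qualifier mentioned earlier is never shown in the listings; as a closing sketch, the fully optimized variant might be declared as follows, combining the OpenMP collapse, outer-loop vectorization and restrict. The __restrict spelling is the common compiler extension for C++; this exact combination is an assumption, not code from the slides.

void coulomb_zyxa(const sAtom * __restrict atoms, const int nAtoms,
                  const float gs, const int gSize, float * __restrict grid)
{
    #pragma omp parallel for collapse(2)
    for (int z = 0; z < gSize; z++) {
        for (int y = 0; y < gSize; y++) {
            #pragma simd
            for (int x = 0; x < gSize; x++) {
                float e = 0.0f;
                for (int a = 0; a < nAtoms; a++) {
                    sAtom myAtom = atoms[a];
                    float dx = (float)x * gs - myAtom.x;
                    float dy = (float)y * gs - myAtom.y;
                    float dz = (float)z * gs - myAtom.z;
                    e += myAtom.w / sqrtf(dx*dx + dy*dy + dz*dz);
                }
                grid[z*gSize*gSize + y*gSize + x] += e;
            }
        }
    }
}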