Introduction
Our focus
Autotuning
Introduction to autotuning, overview of our research
Jiˇr´ı Filipoviˇc et al.
Institute of Computer Science
Masaryk University
Sep. 2021
1 / 19
Introduction
Our focus
Program development workﬂow
Programmer’s questions
which algorithm to use?
how to implement the algorithm eﬃciently?
how to set-up a compiler?
2 / 19
Introduction
Our focus
Program development workﬂow
Compiler’s questions
how to map variables to registers?
which unrolling factor to use for a loop?
which functions should be inlined?
and many others...
3 / 19
Introduction
Our focus
Program development workﬂow
User’s questions
how many computing nodes and threads assign to the
program?
should accelerators be used?
how to mix MPI and OpenMP threads?
4 / 19
Introduction
Our focus
Program development workﬂow
User’s questions
how many computing nodes and threads assign to the
program?
should accelerators be used?
how to mix MPI and OpenMP threads?
A compiler works with heuristics, people usually too.
4 / 19
Introduction
Our focus
Tuning of the program
We can empirically tune these possibilities
use diﬀerent algorithm
change source code optimizations
use diﬀerent compiler ﬂags
execute in a diﬀerent number of threads
etc.
5 / 19
Introduction
Our focus
Tuning of the program
A tuning allows us to outperform heuristics – we just test what
works better.
however, we have to invest more time into development
there are non-linear (vertical) dependencies, so we cannot
perform tuning steps in isolation
the optimum usually depends on hardware and input
6 / 19
Introduction
Our focus
Autotuning
The tuning can be automated
then we talk about autotuning
Autotuning
in design time, we deﬁne the space of tuning parameters,
which can be changed
each tuning parameter deﬁnes some property of the tuned
application
a search method is used to traverse the space of tuning
parameters eﬃciently
performed according to some objective, usually performance
7 / 19
Introduction
Our focus
Taxonomy of Autotuning
Tuning scope
what properties of the application are changed by autotuner
e.g. compiler ﬂags, number of threads, source code
optimizations parameters
Tuning time
oﬀ-line autotuning (performed once, e.g. after SW
installation)
dynamic autotuning (performed at runtime)
Developer involvement
transparent, or requiring only minor assist from developer
(e.g., compiler ﬂags tuning)
low-level, requiring an expert programmer to identify tuning
opportunities (e.g. optimizations parameters tuning)
8 / 19
Introduction
Our focus
Our focus
Our focus
source code optimizations parameters
heterogeneous computing
We target several research questions:
building framework for dynamic autotuning, which can be
integrated into real-world applications
eﬃcient searching in autotuning spaces
scheduling of autotuning, integration into task-based systems
autotuning in higher-order languages
9 / 19
Introduction
Our focus
Autotuning framework
Kernel Tuning Toolkit (KTT)
the source code in CUDA or OpenCL is changed during a
tuning process
the programmer deﬁnes how tuning parameters inﬂuence the
code
very powerful (source code may control nearly everything)
implementation is diﬃcult
requires recompilation
runtime checks of correctness/precision
non-trivial expression of tuning parameters
we have no implicit assumptions about tuning space
oﬄine and dynamic autotuning
KTT is a core framework used in other autotuning research.
10 / 19
Introduction
Our focus
Performance of our benchmarks library
Benchmark 2080Ti 1070 750 K20 Vega56
BiCG 88.3% 84.7% 81.7% 50.4% 75.6%
Coulomb 3D 91.8% 91.4% 84.3% 43.2% 65.3%
GEMM 79.8% 80.6% 91.1% 51.3% 96.3%
GEMM batched 86.8% 81.4% 90.0% 49.6% 86.0%
Transpose 87.1% 80.2% 86.3% 64.2% 86.1%
N-body 89.7% 86.6% 87.7% 40.6% 82.2%
Reduction 68.7% 87.5% 89.4% 64.1% 71.6%
Table: Performance of benchmarks autotuned for various hardware
devices. The performance relative to the theoretical peak of devices.
11 / 19
Introduction
Our focus
Performance portability
GPU→GPU
Benchmark avg±stdev worst failed
BiCG 89.0%±12.3% 57% 1
Convolution 79.4%±14.9% 55% 3
Coulomb 3D 95.8%±6.5% 67% 0
GEMM 83.6%±16.4% 31% 0
GEMM batched 85.4%±17% 37% 0
Hotspot 80.3%±17.5% 46% 3
Transpose 85.0%±21.9% 8% 3
N-body 78.8%±24.2% 2% 3
Reduction 88.4%±24% 12% 3
Fourier 74.5%±30% 31% 0
Table: Relative performance of benchmarks ported across GPU
architectures without re-tuning.
12 / 19
Introduction
Our focus
Dynamic autotuning of Batched GEMM
0
50
100
150
200
250
0 50 100 150 200 250 300
kernel perf.
perf. with overhead
offline tuned
Figure: Batched GEMM on GeForce GTX 1070.
13 / 19
Introduction
Our focus
Tuning space searching
Tuning spaces are diﬃcult to search
non-linear dependencies, many dimensions, discrete space
random search often superior to mathematical optimization or
ML-based methods
Performance-counter based searcher
ML method learns how tuning parameters inﬂuence
performance counters
during tuning space search, expert system navigates search
towards softening observed bottlenecks (changes tuning
parameter values in order to modify performance counters)
can be used to navigate tuning search of known application at
unseen hardware, unseen input and arbitrary optimizations
We have implemented the ﬁrst version of proﬁle-based searcher.
14 / 19
Introduction
Our focus
Searching speed
2.5e+06
2.6e+06
2.7e+06
2.8e+06
2.9e+06
3e+06
3.1e+06
50 100 150 200 250
kerneltime(ns)
tuning time (s)
proﬁle
random
Figure: GEMM, 2048 × 2048 × 2048, GTX 2080, model from GTX 1070
15 / 19
Introduction
Our focus
Searching speed
	1.4x106
	1.6x106
	1.8x106
	2x106
	2.2x106
	2.4x106
	2.6x106
	2.8x106
	3x106
	0 	50 	100 	150 	200 	250 	300
kernel	time	(ns)
tuning	time	(s)
profile
random
Figure: Matrix transposition, 8192 × 8192, GTX 2080, model from GTX
1070
16 / 19
Introduction
Our focus
Searching speed
350000
400000
450000
500000
550000
600000
650000
700000
50 100 150 200 250
kerneltime(ns)
tuning time (s)
proﬁle
random
Figure: Convolution, 4096 × 4096, GTX 2080, model from GTX 1070
17 / 19
Introduction
Our focus
Autotuning in task-based system
When and where to run auto-tuning
tuning space search has overhead, so it does not always
improve speed
depends on how many times we execute tuned code
depends on how much performance we can get, and how long
time we need to invest into tuning
analysis of performance counters and historical data
becomes increasingly complicated when we have more
processors and accelerators
Autotuning in task-based systems
task-based system distribute data-dependent computing tasks
in hererogeneous node(s)
those tasks can be inherently auto-tuned
challenging in scheduling and SW engineering
We have prototype of KTT integrated into StarPU.
18 / 19
Introduction
Our focus
Autotuning in higher-order languages
Autotuning with KTT requires expert-programmers
CUDA/OpenCL programming is diﬃcult
identiﬁcation of relevant tuning parameters requires intimate
knowledge of hardware
Higher-order languages ease programming
often at the cost of code eﬃciency
but higher-order functions can be tuned according to HW,
input and lower-order function, such tuning is transparent for
the programmer
for ﬁne-tuning, we can also expose tuning parameters for
lower-order functions
We are currently experimenting with integration of KTT into
Thrust (skeletal programming).
19 / 19