Autotuning
Introduction to autotuning, overview of our research

Jiří Filipovič et al.
Institute of Computer Science, Masaryk University
2023

Program development workflow
Implementation questions
• which algorithm to use?
• how to implement the algorithm efficiently?
• how to set up the compiler?

Program development workflow
Compiler's questions
• how to map variables to registers?
• which unrolling factor to use for a loop?
• which functions should be inlined?
• and many others...

Program development workflow
Execution questions
• how many nodes and threads to assign to the program?
• should accelerators be used?
• how to mix MPI processes and OpenMP threads?
A compiler works with heuristics; people usually do too.

Tuning of the program
We can empirically tune those possibilities
• use a different algorithm
• change code optimizations
• use different compiler flags
• execute with a different number of threads, etc.

Tuning of the program
Tuning allows us to outperform heuristics: we simply test what works better.
• however, we have to invest more time into development
• there are vertical dependencies, so we cannot perform tuning steps in isolation
• the optimum usually depends on hardware and input

Autotuning
The tuning process can be automated
• then we talk about autotuning
Autotuning
• at design time, we define the space of tuning parameters which can be changed
• each tuning parameter defines some property of the tuned application
• a search method is used to traverse the space of tuning parameters efficiently
• the search is performed according to some objective, usually performance
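To make this concrete, a minimal offline autotuner can be sketched as an exhaustive search over a small parameter space, timing each configuration and keeping the best one. This is an illustrative sketch only: run_benchmark is a hypothetical stand-in for compiling and running the tuned code with the given parameters.

    #include <limits>
    #include <cstdio>

    // Stand-in for running and timing the tuned code; a real autotuner would
    // rebuild and execute the kernel for the given configuration.
    double run_benchmark(int unroll, int blockSize) {
        return 100.0 / (unroll * blockSize);   // dummy runtime model
    }

    int main() {
        const int unrollFactors[] = {1, 2, 4, 8};
        const int blockSizes[]    = {64, 128, 256, 512};

        double bestTime = std::numeric_limits<double>::max();
        int bestUnroll = 0, bestBlock = 0;

        // Exhaustive (brute-force) traversal of the tuning space;
        // practical autotuners replace this loop with a smarter search method.
        for (int u : unrollFactors)
            for (int b : blockSizes) {
                double t = run_benchmark(u, b);   // objective: runtime
                if (t < bestTime) {
                    bestTime = t;
                    bestUnroll = u;
                    bestBlock = b;
                }
            }

        std::printf("best: UNROLL=%d BLOCK=%d (%.3f ms)\n",
                    bestUnroll, bestBlock, bestTime);
        return 0;
    }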
Taxonomy of Autotuning
Tuning scope
• what properties of the application are changed by the autotuner
• e.g., compiler flags, number of threads, source code optimization parameters
Tuning time
• offline autotuning (performed once, e.g., after SW installation)
• dynamic autotuning (performed at runtime)
Developer involvement
• transparent, or requiring only minor developer assistance (e.g., compiler flags tuning)
• low-level, requiring an expert programmer to identify tuning opportunities (e.g., code optimization parameters tuning)

Our focus
We target autotuning of code optimization parameters
• the source code is changed during the tuning process
• the user defines how tuning parameters influence the code
• very powerful (source code may control nearly everything)
• implementation is difficult
  • requires recompilation
  • runtime checks of correctness/precision
  • non-trivial expression of tuning parameters
• we have no implicit assumptions about the tuning space
• heterogeneous computing (we are tuning OpenCL or CUDA code)
• offline and dynamic autotuning

Motivation Example
Let's solve a simple problem, vector addition
• we will use CUDA
• we want to optimize the code

Motivation Example

    __global__ void add(float* const a, float* b) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        b[i] += a[i];
    }

It should not be difficult to write different variants of the code...

Optimization

    __global__ void add(float4* const a, float4* b) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        b[i] += a[i];
    }

The kernel has to be executed with n/4 threads.

Optimization

    __global__ void add(float2* const a, float2* b) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        b[i] += a[i];
    }

The kernel has to be executed with n/2 threads.

Optimization

    __global__ void add(float* const a, float* b, const int n) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        for (; i < n; i += blockDim.x*gridDim.x)
            b[i] += a[i];
    }

The kernel has to be executed with n/m threads, where m can be anything.

What to Optimize?
A mixture of:
• thread-block size
• vector variables
• amount of serial work per thread
i.e., a 3D space (and this is a trivial example...)

Autotuning
Autotuning tools may explore code parameters automatically

    __global__ void add(VECTYPE* const a, VECTYPE* b, const int n) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
    #if SERIAL_WORK > 1
        for (; i < n; i += blockDim.x*gridDim.x)
    #endif
            b[i] += a[i];
    }

The code executing kernel add has to configure parallelism according to the values of the VECTYPE and SERIAL_WORK tuning parameters.
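For illustration, the corresponding host-side launch might derive its configuration from the tuning parameters as sketched below. This is only a sketch: BLOCK_SIZE and VECWIDTH (the number of float elements in VECTYPE) are assumed to be defined by the tuner together with VECTYPE and SERIAL_WORK, d_a and d_b are assumed device pointers, and n is assumed to be divisible by both factors.

    // Host-side sketch (hypothetical): derive the launch configuration from
    // the tuning parameters. VECWIDTH = elements per VECTYPE (1, 2 or 4);
    // BLOCK_SIZE and SERIAL_WORK are further tuning parameters.
    const int elements = n / VECWIDTH;              // work items after vectorization
    const int threads  = elements / SERIAL_WORK;    // each thread processes SERIAL_WORK items
    const int grid     = (threads + BLOCK_SIZE - 1) / BLOCK_SIZE;

    add<<<grid, BLOCK_SIZE>>>(reinterpret_cast<VECTYPE*>(d_a),
                              reinterpret_cast<VECTYPE*>(d_b),
                              elements);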
Kernel Tuning Toolkit
We have developed the Kernel Tuning Toolkit (KTT)
• a framework allowing to tune code parameters for OpenCL and CUDA
• allows both offline and dynamic tuning
• enables cross-kernel optimizations
• mature implementation, documented, with examples
• https://github.com/HiPerCoRe/KTT

Kernel Tuning Toolkit
Typical workflow, similar to CUDA/OpenCL
• initialize the tuner for a specified device
• create input/output of the kernel
• create the kernel
• create a tuning space for the kernel
• assign input/output to the kernel
• execute or tune the kernel
KTT creates a layer between an application and OpenCL/CUDA.

KTT Sample Code

    // Initialize tuner and kernel
    ktt::Tuner tuner(platformIndex, deviceIndex);
    const ktt::DimensionVector ndRangeDimensions(inputSize);
    const ktt::DimensionVector workGroupDimensions(128);
    ktt::KernelId foo = tuner.addKernelFromFile(kernelFile, "foo",
        ndRangeDimensions, workGroupDimensions);

    // Creation and assignment of kernel arguments
    ktt::ArgumentId a = tuner.addArgumentVector(srcA,
        ktt::ArgumentAccessType::ReadOnly);
    ktt::ArgumentId b = tuner.addArgumentVector(srcB,
        ktt::ArgumentAccessType::WriteOnly);
    tuner.setKernelArguments(foo, std::vector{a, b});

    // Addition of tuning variables
    tuner.addParameter(foo, "UNROLL", {1, 2, 4, 8});

    tuner.tuneKernel(foo);
    tuner.printResult(foo, "foo.csv", ktt::PrintFormat::CSV);

Kernel Tuning Toolkit
In practice, we usually need more functionality
• tuning parameters can affect the parallelism configuration (e.g., block and grid size in CUDA)
  • by pre-defined functions (e.g., multiplying a specified block/grid dimension)
  • by a lambda function provided by the programmer
• some combinations of tuning parameters can be discarded a priori
  • lambda functions constraining the tuning space (see the sketch below)
• KTT can check whether the tuned kernel runs successfully
  • automatic check of successful execution
  • the user can provide a reference kernel or reference class and a comparison function; KTT compares results automatically
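To illustrate the constraint mechanism, the fragment below prunes configurations where the unrolling factor exceeds the amount of serial work. This is a sketch only: the exact name and signature of the constraint-adding call differ between KTT versions, and the lambda is assumed to receive parameter values in the order of the listed names.

    // Two tuning parameters...
    tuner.addParameter(foo, "UNROLL", {1, 2, 4, 8});
    tuner.addParameter(foo, "SERIAL_WORK", {1, 2, 4, 8, 16});

    // ...and a constraint lambda discarding invalid combinations a priori:
    // keep only configurations where UNROLL does not exceed SERIAL_WORK.
    tuner.addConstraint(foo,
        [](const std::vector<size_t>& v) { return v[0] <= v[1]; },
        std::vector<std::string>{"UNROLL", "SERIAL_WORK"});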
Advanced features of KTT
Cross-kernel optimizations
• the user can add specific code for kernel execution into the launchComputation method
• the code may query tuning parameters
• the code may call multiple kernels
• allows tuning code parameters with wider influence, as the tuned kernels do not need to be functionally equivalent

Advanced features of KTT
Dynamic autotuning
• dynamic tuning performs autotuning during application runtime
• KTT can execute the best kernel known so far to perform the kernel's task
• or try a different combination of tuning parameters before the execution
• tuning is transparent for the application
• tuning can be queried at any time

Dynamic Tuning Sample

    // Main application loop
    while (application_run) {
        ...
        if (tuningRequired)
            tuner.tuneKernelByStep(foo, {b});
        else {
            ktt::ComputationResult best = tuner.getBestComputationResult(foo);
            tuner.runKernel(compositionId, best.getConfiguration(), {b});
        }
        ...
    }

Dynamic tuning
Dynamic autotuning is challenging
• when the kernel is executed, there must be no significant performance drop
  • automatic memory management has to move only necessary data
• KTT has to support asynchronous execution of
  • memory copy, host and device code execution
  • simultaneous execution of multiple kernels
Parallelism in KTT
• intra-manipulator: parallelism inside the launchComputation method
• global parallelism: asynchronous execution of multiple launchComputation instances
During autotuning, global parallelism is disabled.

KTT Architecture
[Figure: KTT architecture. The application passes tuning parameters, kernels and input/output buffers to KTT; KTT comprises the tuning space, a searcher, a code generator, a code variant executor, optional validation, and a history of kernel results.]

Benchmark set

    Benchmark      dimensions  configurations
    BiCG                   11           5,122
    Convolution            10           5,248
    Coulomb 3D              8           1,260
    GEMM                   15         241,600
    GEMM batched           11             424
    Hotspot                 6             480
    Transpose               9          10,752
    N-body                  8           9,408
    Reduction               5             175
    Fourier                 6             360

Table: A list of the benchmarks and the size and dimensionality (i.e., the number of tuning parameters) of their tuning spaces.

Test bed setup

    Device              Architecture     SP perf.   BW
    2x Xeon E5-2650     Sandy Bridge          512   102
    Xeon Phi 5110P      Knights Corner      2,022   320
    Tesla K20           Kepler              3,524   208
    GeForce GTX 750     Maxwell             1,044    80
    GeForce GTX 1070    Pascal              5,783   256
    Radeon RX Vega 56   GCN 5               8,286   410
    GeForce RTX 2080Ti  Turing             11,750   616

Table: Devices used in our benchmarks. Arithmetic performance (SP perf.) is measured in single-precision GFlop/s, memory bandwidth (BW) in GB/s.
Performance

    Benchmark     2080Ti   1070    750    K20  Vega56  E5-2650  5110P
    BiCG           88.3%  84.7%  81.7%  50.4%   75.6%    46.0%  6.45%
    Coulomb 3D     91.8%  91.4%  84.3%  43.2%   65.3%    74.2%  22.2%
    GEMM           79.8%  80.6%  91.1%  51.3%   96.3%    37.5%  19.7%
    GEMM batched   86.8%  81.4%  90.0%  49.6%   86.0%    27.7%  20.9%
    Transpose      87.1%  80.2%  86.3%  64.2%   86.1%    62.5%  10.0%
    N-body         89.7%  86.6%  87.7%  40.6%   82.2%    77.7%  29.9%
    Reduction      68.7%  87.5%  89.4%  64.1%   71.6%    33.9%  10.1%
    Hotspot        1.35x  1.94x  2.06x   1.4x   2.88x     1.2x  12.8x

Table: Performance of benchmarks autotuned for various hardware devices, relative to the theoretical peak of the devices (Hotspot is reported as speedup).

Performance portability

GPU -> GPU

    Benchmark      avg ± stdev     worst  failed
    BiCG           89.0% ± 12.3%     57%       1
    Convolution    79.4% ± 14.9%     55%       3
    Coulomb 3D     95.8% ± 6.5%      67%       0
    GEMM           83.6% ± 16.4%     31%       0
    GEMM batched   85.4% ± 17%       37%       0
    Hotspot        80.3% ± 17.5%     46%       3
    Transpose      85.0% ± 21.9%      8%       3
    N-body         78.8% ± 24.2%      2%       3
    Reduction      88.4% ± 24%       12%       3
    Fourier        74.5% ± 30%       31%       0

Table: Relative performance of benchmarks ported across GPU architectures without re-tuning.

Dynamic autotuning of Batched GEMM
Figure: Batched GEMM on GeForce GTX 1070.

Dynamic autotuning of Batched GEMM
Figure: Batched GEMM on Tesla K20.

[Figure: processing pipeline of the 3D Fourier reconstruction. CPU threads prepare batches of FT samples in multiple streams that are processed on the GPU; the pipeline iterates while there are samples to process.]

3D Fourier Reconstruction

             2080Ti   1070    750    680
    2080Ti     100%    99%    31%    49%
    1070        99%   100%    31%    50%
    750         43%    67%   100%    94%
    680         60%    72%    71%   100%

Table: Performance portability of 3D Fourier reconstruction with 128 x 128 samples.

3D Fourier Reconstruction

               128x128   91x91   64x64   50x50   32x32
    128x128       100%    100%     77%     70%     32%
    91x91         100%    100%     76%     68%     33%
    64x64          94%     94%    100%     91%     67%
    50x50          79%     78%     98%    100%     86%
    32x32          65%     67%     80%     92%    100%

Table: Performance portability on GeForce GTX 1070 for different sample sizes.

3D Fourier Reconstruction

             best runtime   tuning 50   tuning full
    2080Ti       1m40s       88% ± 3%       54%
    1070         5m49s       96% ± 2%       79%
    750         16m59s       92% ± 4%       72%
    680         15m12s       94% ± 2%       75%

Table: The relative performance of dynamically-tuned 3D Fourier reconstruction.
What do we use KTT for?
So we have developed a fancy autotuning framework...
• which is interesting work in itself, but we can also use it for something more...
In GPU-accelerated applications
• used during program development (exploration of possible optimizations)
• manually added into applications to enable dynamic tuning
• used in the cryo-electron microscopy suite Xmipp

What do we use KTT for?
Some more theoretical tasks (but still with clear practical usage)
• searching the tuning space
• tuning budget estimation
• interoperability with other tools

Searching tuning space
Why is searching tuning spaces important and difficult?
• important to speed up autotuning convergence
• discrete, many-dimensional, non-convex spaces are hard to optimize with mathematical optimization
• as the spaces change with hardware or input, it is also a hard task for machine learning (if an ML model relates tuning parameters to runtime, it becomes invalid)
Our method
• decomposing the relation between tuning parameters and runtime: ML is used to relate tuning parameters to performance counters, an expert system is used to steer the optimization method
• the ML model is independent of HW and input

Searching tuning space
Figure: Dependence between a tuning parameter and various properties of the Coulomb 3D kernel running with a large gridbox on GeForce GTX 750 and with a small gridbox on GeForce GTX 1070. The x-axis shows a tuning parameter changing thread coarsening. The y-axis shows normalized values of selected properties: kernel runtime, L2 cache read transactions, texture cache read transactions and 32-bit floating-point operations.

Searching tuning space
Main idea behind the searcher
• the relation between tuning parameters and performance counters measuring the amount of operations remains stable, so it can be captured by an ML model
• the relation between tuning parameters and performance counters measuring the stress of GPU subsystems depends on the GPU and input, so it can be observed during tuning and used to identify bottlenecks
• an expert system asks the ML model which tuning parameters to change to suppress the bottlenecks
• this mimics what programmers do: they profile the code to observe bottlenecks, and use their understanding of the code to introduce changes suppressing those bottlenecks
A simplified sketch of one search step is shown below.

Searching tuning space
Figure: Schematic view of the searcher workflow. Training (problem p, GPU x, input i): KTT autotuning samples the tuning space and model creation captures the relations between tuning parameters and performance counters. Searching (problem p, GPU y, input j): the KTT searcher performs a trial run, analyses the performance counters to identify bottlenecks, computes the desired performance-counter changes, scores configurations, and biases the search to select the next configuration. The boxes show program components, cylinders show data objects.
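The fragment below is a highly simplified, hypothetical sketch of one such search step, written only to make the decomposition concrete. It is not KTT's profile-based searcher; predict_counters, detect_bottleneck and score are illustrative stand-ins with dummy bodies.

    #include <vector>
    #include <algorithm>
    #include <cstddef>

    using Configuration = std::vector<size_t>;   // values of tuning parameters
    using Counters      = std::vector<double>;   // performance-counter values

    // Dummy stand-in for the ML model (HW- and input-independent):
    // predicts operation-related counters from tuning parameters.
    Counters predict_counters(const Configuration& c) {
        return { double(c[0]), double(c[1]) };
    }

    // Dummy stand-in for the expert system: maps observed counters of a
    // trial run to a bottleneck (0 = memory bound, 1 = compute bound).
    int detect_bottleneck(const Counters& observed) {
        return observed[0] > observed[1] ? 0 : 1;
    }

    // Prefer configurations predicted to put less stress on the bottleneck.
    double score(const Counters& predicted, int bottleneck) {
        return -predicted[bottleneck];
    }

    // One search step: bias the selection of the next configuration towards
    // candidates whose predicted counters suppress the observed bottleneck.
    Configuration next_configuration(const std::vector<Configuration>& candidates,
                                     const Counters& trialRunCounters) {
        const int bottleneck = detect_bottleneck(trialRunCounters);
        return *std::max_element(candidates.begin(), candidates.end(),
            [&](const Configuration& a, const Configuration& b) {
                return score(predict_counters(a), bottleneck)
                     < score(predict_counters(b), bottleneck);
            });
    }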
Searching tuning space
Figure: Convergence of the GEMM benchmark using KTT and Kernel Tuner. Left: convergence speed in time. Right: comparison of iterations (empirical tests).

Tuning budget estimation
• the problem: as autotuning itself requires computational resources, it is also a subject of optimization
• therefore, estimating when to stop autotuning is crucial, as it balances
  • the overhead of the tuning process (number of tuning steps x average time of the tuned kernel including re-compilation)
  • the expected improvement of the speed of the tuned kernel
• we believe it can be estimated from historical data and regression of the convergence of the tuning space search

Tuning budget estimation
Figure: Example of tuning space searcher convergence over tuning iterations.

Tuning budget estimation
Figure: Example of tuning cost over tuning iterations.

Tuning budget estimation
Figure: Example of total application runtime depending on the number of performed tuning steps, marking the stopping point.

What do we use KTT for?
Interoperability
• programming heterogeneous nodes is generally challenging: distribution of work among multiple accelerators and the CPU, data distribution
• we work on connecting KTT with StarPU
• StarPU implements task-based parallelism; it executes a DAG of data-dependent tasks on heterogeneous nodes
  • alternative implementations of tasks
  • StarPU schedules data movement and task execution across the node
• KTT makes the tasks tunable (a StarPU codelet sketch is shown below)
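For context, a StarPU task is described by a codelet that can carry alternative CPU and CUDA implementations. The sketch below uses the plain StarPU API with our own function names for the vector addition used earlier; the KTT integration itself is not shown.

    #include <starpu.h>

    // CPU implementation of the task.
    void add_cpu(void *buffers[], void *cl_arg)
    {
        float *a = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
        float *b = (float *)STARPU_VECTOR_GET_PTR(buffers[1]);
        unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
        for (unsigned i = 0; i < n; i++)
            b[i] += a[i];
    }

    // CUDA implementation of the task, defined in a separate CUDA source file
    // (not shown); this is where a KTT-tuned kernel could be plugged in as an
    // alternative implementation.
    void add_cuda(void *buffers[], void *cl_arg);

    // Codelet: StarPU picks one of the alternative implementations and
    // schedules data movement and execution across the heterogeneous node.
    struct starpu_codelet add_cl = {
        .cpu_funcs  = { add_cpu },
        .cuda_funcs = { add_cuda },
        .nbuffers   = 2,
        .modes      = { STARPU_R, STARPU_RW },
    };

    // Submitting one task of the DAG (handle_a, handle_b are registered data):
    // starpu_task_insert(&add_cl, STARPU_R, handle_a, STARPU_RW, handle_b, 0);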