Autotuning
Introduction to autotuning, overview of our research
Jiří Filipovič et al., Fall 2024

Introduction

Program development workflow

Implementation questions
• which algorithm to use?
• how to implement the algorithm efficiently?
• how to set up the compiler?

Compiler's questions
• how to map variables to registers?
• which unrolling factor to use for a loop?
• which functions should be inlined?
• and many others...

Execution questions
• how many nodes and threads to assign to the program?
• should accelerators be used?
• how to mix MPI processes and OpenMP threads?
A compiler works with heuristics; people usually do too.

Tuning of the program

We can empirically tune these possibilities:
• use a different algorithm
• change code optimizations
• use different compiler flags
• execute with a different number of threads, etc.

Tuning allows us to outperform heuristics: we simply test what works better.
• however, we have to invest more time into development
• there are vertical dependencies, so we cannot perform tuning steps in isolation
• the optimum usually depends on hardware and input

Autotuning

The tuning can be automated; then we talk about autotuning.
• at design time, we define a space of tuning parameters
• each tuning parameter defines some property of the tuned application
• during autotuning, a search method traverses the space and assigns values to the tuning parameters
• the search is driven by some objective, usually performance

Taxonomy of autotuning

Tuning scope
• which properties of the application are changed by the autotuner
• e.g., compiler flags, number of threads, source-code optimization parameters

Tuning time
• offline autotuning (performed once, e.g., after software installation)
• dynamic autotuning (performed at runtime)

Developer involvement
• transparent, or requiring only minor developer assistance (e.g., compiler-flag tuning)
• application-level, requiring an expert programmer to identify tuning opportunities (e.g., tuning of code optimization parameters)
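To make the definition above concrete, here is a minimal, generic sketch of offline autotuning (plain C++, not KTT code, and not taken from the slides): it enumerates a small tuning space, measures every configuration, and keeps the fastest one. The Configuration struct and the synthetic runAndMeasure function are illustrative stand-ins; a real autotuner would recompile and execute the tuned code and could use a smarter search method.

#include <iostream>
#include <limits>
#include <vector>

// One point in the tuning space: a value for each tuning parameter.
struct Configuration {
    int blockSize;   // tuning parameter 1
    int unroll;      // tuning parameter 2
};

// Stand-in for building, running and timing the tuned code with a given
// configuration; here a synthetic cost model keeps the sketch self-contained.
double runAndMeasure(const Configuration& c)
{
    return 100.0 / c.blockSize + 2.0 * c.unroll;
}

int main()
{
    // The tuning space: a cross product of tuning parameter values.
    std::vector<Configuration> space;
    for (int bs : {64, 128, 256, 512})
        for (int u : {1, 2, 4, 8})
            space.push_back({bs, u});

    // Exhaustive search; real autotuners replace this loop with a search
    // method and possibly a different objective than runtime.
    Configuration best{};
    double bestTime = std::numeric_limits<double>::infinity();
    for (const auto& c : space) {
        const double t = runAndMeasure(c);
        if (t < bestTime) { bestTime = t; best = c; }
    }

    std::cout << "best: blockSize=" << best.blockSize
              << ", unroll=" << best.unroll << "\n";
}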
Our focus

We target autotuning of code optimization parameters.
• the source code is changed during the tuning process
• the user defines how tuning parameters influence the code
• very powerful (the source code may control nearly everything)
• implementing an autotuning framework is difficult:
  • requires recompilation
  • runtime checks of correctness/precision
  • non-trivial expression of tuning parameters
• we make no implicit assumptions about the tuning space
• heterogeneous computing (we tune OpenCL or CUDA code)
• offline and dynamic autotuning

Motivation example

Let's solve a simple problem: vector addition.
• we will use CUDA
• we want to optimize the code

__global__ void add(float* const a, float* b)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    b[i] += a[i];
}

It should not be difficult to write different variants of the code...

Optimization

__global__ void add(float4* const a, float4* b)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    b[i] += a[i];
}

The kernel has to be executed with n/4 threads.

__global__ void add(float2* const a, float2* b)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    b[i] += a[i];
}

The kernel has to be executed with n/2 threads.

__global__ void add(float* const a, float* b, const int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    for (; i < n; i += blockDim.x*gridDim.x)
        b[i] += a[i];
}

The kernel has to be executed with n/m threads, where m can be anything.

What to optimize?

A mixture of:
• thread-block size
• vector variables
• serial work
i.e., a 3D space, and this is a trivial example...

Autotuning

Autotuning tools explore code parameters automatically.

__global__ void add(VECTYPE* const a, VECTYPE* b, const int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
#if SERIAL_WORK > 1
    for (; i < n; i += blockDim.x*gridDim.x)
#endif
        b[i] += a[i];
}

The host code executing kernel add has to configure the parallelism according to the values of the VECTYPE and SERIAL_WORK tuning parameters.
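As an illustration of that host-side responsibility, the following sketch (not from the slides) derives the launch configuration from the two tuning parameters: each thread handles VECTOR_WIDTH * SERIAL_WORK scalar elements. The names launchAdd, VECTOR_WIDTH and blockSize are illustrative, and the defaults below only make the sketch self-contained; in a tuned build these macros are injected by the autotuner.

// Illustrative host-side launch code for the combined kernel above.
#ifndef VECTYPE
#define VECTYPE float4
#define VECTOR_WIDTH 4   // number of scalar elements in VECTYPE
#endif
#ifndef SERIAL_WORK
#define SERIAL_WORK 2
#endif

__global__ void add(VECTYPE* const a, VECTYPE* b, const int n);  // kernel shown above

void launchAdd(VECTYPE* a, VECTYPE* b, int n, int blockSize)
{
    // n is the number of scalar elements; assume for brevity that it is
    // divisible by VECTOR_WIDTH * SERIAL_WORK * blockSize.
    int vecElems = n / VECTOR_WIDTH;          // elements of type VECTYPE
    int threads  = vecElems / SERIAL_WORK;    // each thread handles SERIAL_WORK of them
    int blocks   = threads / blockSize;
    add<<<blocks, blockSize>>>(a, b, vecElems);
}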
Kernel Tuning Toolkit

We have developed the Kernel Tuning Toolkit (KTT) [5]
• a framework for tuning code parameters of OpenCL and CUDA kernels
• allows both offline and dynamic tuning
• enables cross-kernel optimizations
• the tuning problem is described in C++, Python, or JSON input
• mature implementation, documented, with examples
• https://github.com/HiPerCoRe/KTT

Typical workflow in C++, similar to CUDA/OpenCL
• initialize the tuner for a specified device
• create the input/output of the kernel
• create a kernel
• create a tuning space for the kernel
• assign the input/output to the kernel
• execute or tune the kernel
KTT creates a layer between the application and OpenCL/CUDA.

KTT sample code

// Initialize tuner and kernel definition
ktt::Tuner tuner(platformIndex, deviceIndex);
const ktt::DimensionVector ndRangeDimensions(inputSize);
const ktt::DimensionVector workGroupDimensions(128);
ktt::KernelDefinitionId fooDef = tuner.AddKernelDefinitionFromFile("foo", kernelFile,
    ndRangeDimensions, workGroupDimensions);

// Create and assign kernel arguments
ktt::ArgumentId a = tuner.AddArgumentVector(srcA, ktt::ArgumentAccessType::ReadOnly);
ktt::ArgumentId b = tuner.AddArgumentVector(srcB, ktt::ArgumentAccessType::WriteOnly);
tuner.SetArguments(fooDef, {a, b});

// Create the kernel and its tuning space
ktt::KernelId foo = tuner.CreateSimpleKernel("foo", fooDef);
tuner.AddParameter(foo, "UNROLL", std::vector<uint64_t>{1, 2, 4, 8});

tuner.Tune(foo);
tuner.SaveResult(foo, "foo-output", ktt::PrintFormat::JSON);

Alternative KTT usage

Python
• similar to C++ (uses the C++ bindings)
• input/output passed as numpy arrays
• easier rapid experimenting

JSON
• a declarative way
• defines the input/output (generated, binary)
• defines the tuning space
• configures the tuning space
• can be loaded in C++/Python, or run by a provided miniapp
• allows interoperability between KTT and Kernel Tuner (and hopefully more autotuners in the future)

In practice, we usually need more functionality
• tuning parameters can affect the parallelism configuration (e.g., block and grid size in CUDA)
  • by pre-defined functions (e.g., multiply a specified block/grid dimension)
  • by a lambda function provided by the programmer
• some combinations of tuning parameters can be discarded a priori
  • lambda functions constraining the tuning space
• KTT can check whether the tuned kernel runs successfully
  • automatic check of successful execution
  • the user can provide a reference kernel or a reference class together with a comparison function, and KTT compares the results automatically
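To make the constraint and validation hooks above more concrete, the sketch below (not from the slides) shows the kind of C++ callables a programmer would hand to KTT. The function names reasonableWorkPerThread and resultsMatch, the parameter names, and the tolerance are illustrative; the exact registration calls and signatures should be taken from the KTT documentation.

#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical constraint predicate: discard configurations where the
// per-thread work (VECTOR_WIDTH * SERIAL_WORK) exceeds 8 elements.
// A constraint receives the values of the named tuning parameters.
bool reasonableWorkPerThread(const std::vector<uint64_t>& values)
{
    const uint64_t vectorWidth = values[0];  // value of "VECTOR_WIDTH"
    const uint64_t serialWork  = values[1];  // value of "SERIAL_WORK"
    return vectorWidth * serialWork <= 8;
}

// Hypothetical comparison function for output validation: the tuned kernel's
// output is compared element-wise against the reference result.
bool resultsMatch(const std::vector<float>& tuned, const std::vector<float>& reference)
{
    if (tuned.size() != reference.size())
        return false;
    for (size_t i = 0; i < tuned.size(); ++i)
        if (std::fabs(tuned[i] - reference[i]) > 1e-5f)
            return false;
    return true;
}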
Advanced features of KTT

Cross-kernel optimizations
• the user can define a kernel launcher: the code defining how the kernel(s) are executed (a minimal launcher sketch appears at the end of this part)
• the default launcher just executes a kernel
• a launcher can query tuning parameters
• it can call multiple kernels, or the same kernel multiple times
• it can execute host code and host-device memory transfers
• this allows tuning code parameters with a wider influence, as the tuned kernels do not need to be functionally equivalent

Dynamic autotuning [6]
• dynamic tuning performs autotuning during application runtime
• KTT can execute the best kernel known so far to perform the kernel's task
• or it can try a different combination of tuning parameters before the execution
• tuning is transparent to the application
• tuning can be queried at any time

Dynamic tuning sample

// Main application loop
while (application_run)
{
    ...
    if (tuningRequired)
        tuner.TuneIteration(foo, output);
    else
    {
        ktt::KernelConfiguration best = tuner.GetBestConfiguration(foo);
        tuner.Run(foo, best, output);
    }
    ...
}

Dynamic tuning is challenging
• when the kernel is executed, there must be no significant performance drop
• automatic memory management has to move only the necessary data
• KTT has to support asynchronous execution of
  • memory copies, host and device code execution
  • simultaneous execution of multiple kernels

Parallelism in KTT
• intra-launcher: parallelism inside a kernel launcher
• global parallelism: asynchronous execution of multiple launcher instances
During autotuning, global parallelism has to be disabled.

KTT architecture

Figure: KTT architecture. The application supplies tuning parameters, kernels, and input/output buffers; KTT's main component drives the searcher, the code generator, the code variant executor, optional validation, and the history of kernel results.
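Below is a hedged sketch of the kernel-launcher idea from the cross-kernel optimization part above: a user-provided callable that queries a tuning parameter of the currently tested configuration and decides how many times to run the kernel. The parameter name TWO_PASS, the helper getParameter, and the exact shape of the compute interface are assumptions made for illustration; the real KTT launcher API may differ.

// Hypothetical launcher: runs the kernel once or twice depending on the
// illustrative tuning parameter "TWO_PASS". A real launcher may also run
// other kernels, host code, or host-device memory transfers.
auto launcher = [fooDef](ktt::ComputeInterface& interface)
{
    const auto& config = interface.GetCurrentConfiguration();         // configuration being tested
    const uint64_t passes = getParameter(config, "TWO_PASS") ? 2 : 1; // hypothetical helper

    for (uint64_t p = 0; p < passes; ++p)
        interface.RunKernel(fooDef);
};

// The launcher would then be registered with the tuner for kernel foo,
// replacing the default launcher that executes the kernel exactly once.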
Evaluation

Benchmark set

Benchmark       Dimensions   Configurations
BiCG                11             5,122
Convolution         10             5,248
Coulomb 3D           8             1,260
GEMM                15           241,600
GEMM batched        11               424
Hotspot              6               480
Transpose            9            10,752
N-body               8             9,408
Reduction            5               175
Fourier              6               360

Table: A list of the benchmarks and the size and dimensionality (i.e., the number of tuning parameters) of their tuning spaces.

Test bed setup

Device               Architecture     SP perf. (GFlops)   BW (GB/s)
2x Xeon E5-2650      Sandy Bridge            512             102
Xeon Phi 5110P       Knights Corner        2,022             320
Tesla K20            Kepler                3,524             208
GeForce GTX 750      Maxwell               1,044              80
GeForce GTX 1070     Pascal                5,783             256
Radeon RX Vega 56    GCN 5                 8,286             410
GeForce RTX 2080Ti   Turing               11,750             616

Table: Devices used in our benchmarks. Arithmetic performance (SP perf.) is measured in single-precision GFlops, memory bandwidth (BW) in GB/s.

Performance

Benchmark      2080Ti   1070    750     K20     Vega56  E5-2650  5110P
BiCG           88.3%    84.7%   81.7%   50.4%   75.6%   46.0%    6.45%
Coulomb 3D     91.8%    91.4%   84.3%   43.2%   65.3%   74.2%    22.2%
GEMM           79.8%    80.6%   91.1%   51.3%   96.3%   37.5%    19.7%
GEMM batched   86.8%    81.4%   90.0%   49.6%   86.0%   27.7%    20.9%
Transpose      87.1%    80.2%   86.3%   64.2%   86.1%   62.5%    10.0%
N-body         89.7%    86.6%   87.7%   40.6%   82.2%   77.7%    29.9%
Reduction      68.7%    87.5%   89.4%   64.1%   71.6%   33.9%    10.1%
Hotspot        1.35x    1.94x   2.06x   1.4x    2.88x   1.2x     12.8x

Table: Performance of benchmarks autotuned for various hardware devices, relative to the theoretical peak of each device.

Performance portability (GPU -> GPU)

Benchmark      avg ± stdev       worst   failed
BiCG           89.0% ± 12.3%     57%     1
Convolution    79.4% ± 14.9%     55%     3
Coulomb 3D     95.8% ± 6.5%      67%     0
GEMM           83.6% ± 16.4%     31%     0
GEMM batched   85.4% ± 17%       37%     0
Hotspot        80.3% ± 17.5%     46%     3
Transpose      85.0% ± 21.9%      8%     3
N-body         78.8% ± 24.2%      2%     3
Reduction      88.4% ± 24%       12%     3
Fourier        74.5% ± 30%       31%     0

Table: Relative performance of benchmarks ported across GPU architectures without re-tuning.

Dynamic autotuning of batched GEMM

Figure: Batched GEMM on GeForce GTX 1070.
Figure: Batched GEMM on Tesla K20.

3D Fourier reconstruction

Figure: Dynamically tuned 3D Fourier reconstruction [8]: processes distribute batches of samples, a thread manager in each process hands samples to CPU threads that stream them to GPU kernels, each process updates its 3D regular grid, and the partial grids are reduced at process #0.
3D Fourier reconstruction: performance portability

          2080Ti   1070    750    680
2080Ti     100%     99%    31%    49%
1070        99%    100%    31%    50%
750         43%     67%   100%    94%
680         60%     72%    71%   100%

Table: Performance portability of 3D Fourier reconstruction with 128 x 128 samples.

          128x128   91x91   64x64   50x50   32x32
128x128     100%    100%     77%     70%     32%
91x91       100%    100%     76%     68%     33%
64x64        94%     94%    100%     91%     67%
50x50        79%     78%     98%    100%     86%
32x32        65%     67%     80%     92%    100%

Table: Performance portability on GeForce GTX 1070 for different sample sizes.

          best runtime   tuning 50   tuning full
2080Ti        1m40s      88% ± 3%        54%
1070          5m49s      96% ± 2%        79%
750          16m59s      92% ± 4%        72%
680          15m12s      94% ± 2%        75%

Table: The relative performance of dynamically tuned 3D Fourier reconstruction.

Dynamic autotuning of SpMV

SpMV is an important kernel in many applications
• it multiplies a sparse matrix with a dense vector
• used in solving systems of equations, graph processing, ...

It is challenging to compute efficiently
• optimization decisions depend strongly on the input structure (a plain CSR kernel is sketched below for illustration)

Multiple libraries are available
• cuSPARSE: closed-source library actively developed by NVIDIA
• CUSP: open-source library released by NVIDIA, slower compared to cuSPARSE

Our goal
• insert dynamic autotuning into CUSP for the DIA, ELL [2], COO and CSR [1] formats
• minimize the required changes in code that uses CUSP

Figure: SpMV benchmark for the CSR format (comparison with CUSP and cuSPARSE over a set of input matrices).
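For illustration (not from the slides), a plain CSR SpMV kernel with one thread per matrix row. Whether this scalar-per-row mapping, a warp-per-row mapping, or a different storage format performs best depends on the distribution of nonzeros per row, which is exactly why the optimization decisions are input-dependent.

// Minimal CSR SpMV sketch: y = A*x, one thread per matrix row.
// rowPtr has numRows+1 entries; colIdx/values hold the nonzeros.
__global__ void spmv_csr_scalar(const int* rowPtr, const int* colIdx,
                                const float* values, const float* x,
                                float* y, int numRows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < numRows)
    {
        float sum = 0.0f;
        // Rows with many nonzeros make this loop long and unbalanced,
        // which is why other mappings (warp per row, ELL, ...) may win.
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += values[j] * x[colIdx[j]];
        y[row] = sum;
    }
}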
Related Research

What do we use KTT for?

So we have developed a fancy autotuning framework...
• which is interesting work in itself, but we can do even more...

In GPU-accelerated applications
• used during program development (exploration of possible optimizations)
• manually added into applications to enable dynamic tuning
• used in the cryo-electron microscopy suite Xmipp

Some more theoretical tasks (but still with clear practical usage)
• searching the tuning space
• tuning budget estimation
• interoperability with other tools

Searching the tuning space

Why is searching tuning spaces important and difficult?
• it is important for speeding up autotuning convergence
• discrete, many-dimensional, non-convex spaces are hard to optimize with mathematical optimization
• as the spaces change with hardware or input, it is also a hard task for machine learning (an ML model relating tuning parameters to runtime becomes invalid when the hardware or input changes)

We proposed a novel method [3]
• it decomposes the relation between tuning parameters and runtime: an ML model relates tuning parameters to performance counters, and an expert system steers the optimization method
• the ML model is independent of hardware and input

Figure: Dependence between a tuning parameter and various properties of the Coulomb 3D kernel running with a large gridbox on GeForce GTX 750 and with a small gridbox on GeForce GTX 1070. The x-axis shows a tuning parameter changing thread coarsening. The y-axis shows normalized values of selected properties: kernel runtime, L2 cache read transactions, texture cache read transactions and 32-bit floating-point operations.

Main idea behind the searcher (a rough sketch of the scoring step follows after the figures)
• the relation between tuning parameters and performance counters measuring the amount of operations remains stable and can be captured by an ML model
• the relation between tuning parameters and performance counters measuring the stress of GPU subsystems depends on the GPU and input; it can be observed during tuning and used to identify bottlenecks
• an expert system asks the ML model which tuning parameters to change to suppress the bottlenecks
• this mimics what programmers do: they profile the code to observe bottlenecks and use their understanding of the code to introduce changes suppressing them

Figure: Schematic view of the searcher workflow. The boxes show program components, cylinders show data objects. In the training phase, KTT autotuning samples the tuning space and a model is created that captures the relations between tuning parameters and performance counters. In the searching phase, a trial run is profiled, its performance counters are analysed to identify bottlenecks, the expected counter changes are computed, and configurations are scored to bias the selection of the next configuration.

Figure: Convergence of the GEMM benchmark using KTT and Kernel Tuner. Left: convergence speed in time. Right: comparison of iterations (empirical tests).
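A rough sketch of the profile-guided scoring step described above, under assumed interfaces: the bottlenecked subsystem is identified from measured performance counters, a model (here an opaque function) predicts how each candidate configuration changes the counters feeding that bottleneck, and candidates expected to relieve it get a better score. All types and function names here (Configuration, Counters, identifyBottleneck, predictCounters) are illustrative, not the actual implementation from [3].

#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Configuration = std::map<std::string, uint64_t>;   // tuning parameter values
using Counters      = std::map<std::string, double>;     // performance counters

// Assumed to exist: identifies the most stressed subsystem from the measured
// counters of the last trial run (e.g., "global_memory", "fp32", ...).
std::string identifyBottleneck(const Counters& measured);

// Assumed ML model: predicts operation-related counters for a configuration.
Counters predictCounters(const Configuration& c);

// Score candidates: prefer configurations predicted to lower the amount of
// operations hitting the current bottleneck.
std::vector<double> scoreCandidates(const std::vector<Configuration>& candidates,
                                    const Counters& measured,
                                    const Configuration& current)
{
    const std::string bottleneck = identifyBottleneck(measured);
    const double currentLoad = predictCounters(current).at(bottleneck);

    std::vector<double> scores;
    for (const auto& c : candidates)
    {
        const double predictedLoad = predictCounters(c).at(bottleneck);
        // Higher score = larger expected reduction of the bottleneck load;
        // the searcher uses the scores to bias which configuration it tries next.
        scores.push_back(currentLoad - predictedLoad);
    }
    return scores;
}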
Tuning budget estimation
• the problem: autotuning itself requires computational resources, so it is also a subject of optimization
• estimating when to stop autotuning is therefore crucial: it balances the overhead of the tuning process (number of tuning steps x average time of the tuned kernel, including re-compilation) against the expected improvement in the speed of the tuned kernel
• we have shown that the stopping point can be estimated from historical data and a regression of the tuning search convergence [4] (a toy stopping-rule sketch follows the figures below)

Figure: Example of tuning space searcher convergence (x-axis: tuning iteration).
Figure: Example of tuning cost (x-axis: tuning iteration).
Figure: Example of total runtime depending on the performed tuning steps (x-axis: stopping point).
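A toy sketch of the balance described above, with all quantities assumed: tuning is worth continuing only while the predicted time saved over the remaining kernel executions exceeds the cost of further tuning steps (kernel run plus recompilation). The convergence prediction would in practice come from a regression over historical tuning data, as in [4]; here it is just a function parameter.

#include <functional>

// Decide whether to perform additional tuning steps.
// predictedBestAfter(s): predicted best kernel time (seconds) after s more
//   tuning steps; in [4] this comes from regression of historical convergence data.
// currentBest: best kernel time found so far (seconds).
// remainingExecutions: how many times the kernel will still run in production.
// stepCost: average cost of one tuning step (kernel run + recompilation), in seconds.
bool worthTuningMore(const std::function<double(int)>& predictedBestAfter,
                     double currentBest, double remainingExecutions,
                     int extraSteps, double stepCost)
{
    const double predictedBest  = predictedBestAfter(extraSteps);
    const double expectedSaving = (currentBest - predictedBest) * remainingExecutions;
    const double tuningOverhead = extraSteps * stepCost;
    return expectedSaving > tuningOverhead;
}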
KTT interoperability

KTT can be connected with different frameworks
• programming heterogeneous nodes is generally challenging: work must be distributed among multiple accelerators and CPUs, and data must be distributed accordingly
• StarPU implements task-based parallelism: it executes a DAG of data-dependent tasks on heterogeneous nodes, choosing among alternative implementations of the tasks
• StarPU schedules data movement and task execution across the node
• connecting KTT and StarPU makes the tasks tunable [7]
• tuning is transparent to the user
• it decouples the codes of domain experts and HPC experts

Future work

Still many interesting topics untouched
• autotuning for energy efficiency
• optimizing optimization spaces
• high-level programming of autotuned code
• non-trivial applications

Bibliography

[1] F. Bráblík. "Dynamic autotuning of SpMV kernel in CUSP library". MA thesis. Masaryk University, 2024.
[2] M. Demek. "Dynamic autotuning of SpMV kernel in CUSP library". MA thesis. Masaryk University, 2023.
[3] J. Filipovič et al. "Using hardware performance counters to speed up autotuning convergence on GPUs". In: Journal of Parallel and Distributed Computing 160 (2022), pp. 16-35. ISSN: 0743-7315. DOI: 10.1016/j.jpdc.2021.10.003.
[4] Jaroslav Oľha et al. "Estimating resource budgets to ensure autotuning efficiency". In: Available at SSRN 4661862 (2024).
[5] Filip Petrovič and Jiří Filipovič. "Kernel Tuning Toolkit". In: SoftwareX 22 (2023), p. 101385.
[6] F. Petrovič et al. "A benchmark set of highly-efficient CUDA and OpenCL kernels and its dynamic autotuning with Kernel Tuning Toolkit". In: Future Generation Computer Systems 108 (2020), pp. 161-177. DOI: 10.1016/j.future.2020.02.069.
[7] D. Střelák et al. "Umpalumpa: a framework for efficient execution of complex image processing workloads on heterogeneous nodes". In: Computing (2023), pp. 1-29.
[8] D. Střelák et al. "A GPU Acceleration of 3D Fourier Reconstruction in Cryo-EM". In: The International Journal of High Performance Computing Applications (2019). DOI: 10.1177/1094342019832958.