Autotuning
Introduction to autotuning, overview of our research
Jiří Filipovič et al., Fall 2024

Introduction

Program development workflow

Implementation questions
• which algorithm to use?
• how to implement the algorithm efficiently?
• how to set up the compiler?

Compiler's questions
• how to map variables to registers?
• which unrolling factor to use for a loop?
• which functions should be inlined?
• and many others...

Execution questions
• how many nodes and threads to assign to the program?
• should accelerators be used?
• how to mix MPI processes and OpenMP threads?
A compiler works with heuristics; people usually do too.

Tuning of the program

We can empirically tune these possibilities:
• use a different algorithm
• change code optimizations
• use different compiler flags
• execute with a different number of threads, etc.

Tuning allows us to outperform heuristics: we simply test what works better.
• however, we have to invest more time into development
• there are vertical dependencies, so we cannot perform tuning steps in isolation
• the optimum usually depends on hardware and input

Autotuning

The tuning can be automated; then we talk about autotuning.
• at design time, we define a space of tuning parameters
• each tuning parameter defines some property of the tuned application
• during autotuning, a search method traverses the space and assigns values to the tuning parameters
• the search is driven by some objective, usually performance

Taxonomy of autotuning

Tuning scope
• which properties of the application are changed by the autotuner
• e.g., compiler flags, number of threads, source-code optimization parameters

Tuning time
• offline autotuning (performed once, e.g., after software installation)
• dynamic autotuning (performed at runtime)

Developer involvement
• transparent, or requiring only minor developer assistance (e.g., compiler-flag tuning)
• application-level, requiring an expert programmer to identify tuning opportunities (e.g., tuning of code optimization parameters)
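To make the definition above concrete, here is a minimal, generic sketch of offline autotuning (plain C++, not KTT code, and not taken from the slides): it enumerates a small tuning space, measures every configuration, and keeps the fastest one. The Configuration struct and the synthetic runAndMeasure function are illustrative stand-ins; a real autotuner would recompile and execute the tuned code and could use a smarter search method.

#include <iostream>
#include <limits>
#include <vector>

// One point in the tuning space: a value for each tuning parameter.
struct Configuration {
    int blockSize;   // tuning parameter 1
    int unroll;      // tuning parameter 2
};

// Stand-in for building, running and timing the tuned code with a given
// configuration; here a synthetic cost model keeps the sketch self-contained.
double runAndMeasure(const Configuration& c)
{
    return 100.0 / c.blockSize + 2.0 * c.unroll;
}

int main()
{
    // The tuning space: a cross product of tuning parameter values.
    std::vector<Configuration> space;
    for (int bs : {64, 128, 256, 512})
        for (int u : {1, 2, 4, 8})
            space.push_back({bs, u});

    // Exhaustive search; real autotuners replace this loop with a search
    // method and possibly a different objective than runtime.
    Configuration best{};
    double bestTime = std::numeric_limits<double>::infinity();
    for (const auto& c : space) {
        const double t = runAndMeasure(c);
        if (t < bestTime) { bestTime = t; best = c; }
    }

    std::cout << "best: blockSize=" << best.blockSize
              << ", unroll=" << best.unroll << "\n";
}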
Our focus

We target autotuning of code optimization parameters.
• the source code is changed during the tuning process
• the user defines how tuning parameters influence the code
• very powerful (the source code may control nearly everything)
• implementing an autotuning framework is difficult:
  • requires recompilation
  • runtime checks of correctness/precision
  • non-trivial expression of tuning parameters
• we make no implicit assumptions about the tuning space
• heterogeneous computing (we tune OpenCL or CUDA code)
• offline and dynamic autotuning

Motivation example

Let's solve a simple problem: vector addition.
• we will use CUDA
• we want to optimize the code

__global__ void add(float* const a, float* b)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    b[i] += a[i];
}

It should not be difficult to write different variants of the code...

Optimization

__global__ void add(float4* const a, float4* b)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    b[i] += a[i];
}

The kernel has to be executed with n/4 threads.

__global__ void add(float2* const a, float2* b)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    b[i] += a[i];
}

The kernel has to be executed with n/2 threads.

__global__ void add(float* const a, float* b, const int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    for (; i < n; i += blockDim.x*gridDim.x)
        b[i] += a[i];
}

The kernel has to be executed with n/m threads, where m can be anything.

What to optimize?

A mixture of:
• thread-block size
• vector variables
• serial work
i.e., a 3D space, and this is a trivial example...

Autotuning

Autotuning tools explore code parameters automatically.

__global__ void add(VECTYPE* const a, VECTYPE* b, const int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
#if SERIAL_WORK > 1
    for (; i < n; i += blockDim.x*gridDim.x)
#endif
        b[i] += a[i];
}

The host code executing kernel add has to configure the parallelism according to the values of the VECTYPE and SERIAL_WORK tuning parameters.
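As an illustration of that host-side responsibility, the following sketch (not from the slides) derives the launch configuration from the two tuning parameters: each thread handles VECTOR_WIDTH * SERIAL_WORK scalar elements. The names launchAdd, VECTOR_WIDTH and blockSize are illustrative, and the defaults below only make the sketch self-contained; in a tuned build these macros are injected by the autotuner.

// Illustrative host-side launch code for the combined kernel above.
#ifndef VECTYPE
#define VECTYPE float4
#define VECTOR_WIDTH 4   // number of scalar elements in VECTYPE
#endif
#ifndef SERIAL_WORK
#define SERIAL_WORK 2
#endif

__global__ void add(VECTYPE* const a, VECTYPE* b, const int n);  // kernel shown above

void launchAdd(VECTYPE* a, VECTYPE* b, int n, int blockSize)
{
    // n is the number of scalar elements; assume for brevity that it is
    // divisible by VECTOR_WIDTH * SERIAL_WORK * blockSize.
    int vecElems = n / VECTOR_WIDTH;          // elements of type VECTYPE
    int threads  = vecElems / SERIAL_WORK;    // each thread handles SERIAL_WORK of them
    int blocks   = threads / blockSize;
    add<<<blocks, blockSize>>>(a, b, vecElems);
}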
Kernel Tuning Toolkit

We have developed the Kernel Tuning Toolkit (KTT) [5]
• a framework for tuning code parameters of OpenCL and CUDA kernels
• allows both offline and dynamic tuning
• enables cross-kernel optimizations
• the tuning problem is described in C++, Python, or JSON input
• mature implementation, documented, with examples
• https://github.com/HiPerCoRe/KTT

Typical workflow in C++, similar to CUDA/OpenCL
• initialize the tuner for a specified device
• create the input/output of the kernel
• create a kernel
• create a tuning space for the kernel
• assign the input/output to the kernel
• execute or tune the kernel
KTT creates a layer between the application and OpenCL/CUDA.

KTT sample code

// Initialize tuner and kernel definition
ktt::Tuner tuner(platformIndex, deviceIndex);
const ktt::DimensionVector ndRangeDimensions(inputSize);
const ktt::DimensionVector workGroupDimensions(128);
ktt::KernelDefinitionId fooDef = tuner.AddKernelDefinitionFromFile("foo", kernelFile,
    ndRangeDimensions, workGroupDimensions);

// Create and assign kernel arguments
ktt::ArgumentId a = tuner.AddArgumentVector(srcA, ktt::ArgumentAccessType::ReadOnly);
ktt::ArgumentId b = tuner.AddArgumentVector(srcB, ktt::ArgumentAccessType::WriteOnly);
tuner.SetArguments(fooDef, {a, b});

// Create the kernel and its tuning space
ktt::KernelId foo = tuner.CreateSimpleKernel("foo", fooDef);
tuner.AddParameter(foo, "UNROLL", std::vector<uint64_t>{1, 2, 4, 8});

tuner.Tune(foo);
tuner.SaveResult(foo, "foo-output", ktt::PrintFormat::JSON);

Alternative KTT usage

Python
• similar to C++ (uses the C++ bindings)
• input/output passed as numpy arrays
• easier rapid experimenting

JSON
• a declarative way
• defines the input/output (generated, binary)
• defines the tuning space
• configures the tuning space
• can be loaded in C++/Python, or run by a provided miniapp
• allows interoperability between KTT and Kernel Tuner (and hopefully more autotuners in the future)

In practice, we usually need more functionality
• tuning parameters can affect the parallelism configuration (e.g., block and grid size in CUDA)
  • by pre-defined functions (e.g., multiply a specified block/grid dimension)
  • by a lambda function provided by the programmer
• some combinations of tuning parameters can be discarded a priori
  • lambda functions constraining the tuning space
• KTT can check whether the tuned kernel runs successfully
  • automatic check of successful execution
  • the user can provide a reference kernel or a reference class together with a comparison function, and KTT compares the results automatically
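To make the constraint and validation hooks above more concrete, the sketch below (not from the slides) shows the kind of C++ callables a programmer would hand to KTT. The function names reasonableWorkPerThread and resultsMatch, the parameter names, and the tolerance are illustrative; the exact registration calls and signatures should be taken from the KTT documentation.

#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical constraint predicate: discard configurations where the
// per-thread work (VECTOR_WIDTH * SERIAL_WORK) exceeds 8 elements.
// A constraint receives the values of the named tuning parameters.
bool reasonableWorkPerThread(const std::vector<uint64_t>& values)
{
    const uint64_t vectorWidth = values[0];  // value of "VECTOR_WIDTH"
    const uint64_t serialWork  = values[1];  // value of "SERIAL_WORK"
    return vectorWidth * serialWork <= 8;
}

// Hypothetical comparison function for output validation: the tuned kernel's
// output is compared element-wise against the reference result.
bool resultsMatch(const std::vector<float>& tuned, const std::vector<float>& reference)
{
    if (tuned.size() != reference.size())
        return false;
    for (size_t i = 0; i < tuned.size(); ++i)
        if (std::fabs(tuned[i] - reference[i]) > 1e-5f)
            return false;
    return true;
}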
Advanced features of KTT

Cross-kernel optimizations
• the user can define a kernel launcher: the code defining how the kernel(s) are executed (a minimal launcher sketch appears at the end of this part)
• the default launcher just executes a kernel
• a launcher can query tuning parameters
• it can call multiple kernels, or the same kernel multiple times
• it can execute host code and host-device memory transfers
• this allows tuning code parameters with a wider influence, as the tuned kernels do not need to be functionally equivalent

Dynamic autotuning [6]
• dynamic tuning performs autotuning during application runtime
• KTT can execute the best kernel known so far to perform the kernel's task
• or it can try a different combination of tuning parameters before the execution
• tuning is transparent to the application
• tuning can be queried at any time

Dynamic tuning sample

// Main application loop
while (application_run)
{
    ...
    if (tuningRequired)
        tuner.TuneIteration(foo, output);
    else
    {
        ktt::KernelConfiguration best = tuner.GetBestConfiguration(foo);
        tuner.Run(foo, best, output);
    }
    ...
}

Dynamic tuning is challenging
• when the kernel is executed, there must be no significant performance drop
• automatic memory management has to move only the necessary data
• KTT has to support asynchronous execution of
  • memory copies, host and device code execution
  • simultaneous execution of multiple kernels

Parallelism in KTT
• intra-launcher: parallelism inside a kernel launcher
• global parallelism: asynchronous execution of multiple launcher instances
During autotuning, global parallelism has to be disabled.

KTT architecture

Figure: KTT architecture. The application supplies tuning parameters, kernels, and input/output buffers; KTT's main component drives the searcher, the code generator, the code variant executor, optional validation, and the history of kernel results.
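Below is a hedged sketch of the kernel-launcher idea from the cross-kernel optimization part above: a user-provided callable that queries a tuning parameter of the currently tested configuration and decides how many times to run the kernel. The parameter name TWO_PASS, the helper getParameter, and the exact shape of the compute interface are assumptions made for illustration; the real KTT launcher API may differ.

// Hypothetical launcher: runs the kernel once or twice depending on the
// illustrative tuning parameter "TWO_PASS". A real launcher may also run
// other kernels, host code, or host-device memory transfers.
auto launcher = [fooDef](ktt::ComputeInterface& interface)
{
    const auto& config = interface.GetCurrentConfiguration();         // configuration being tested
    const uint64_t passes = getParameter(config, "TWO_PASS") ? 2 : 1; // hypothetical helper

    for (uint64_t p = 0; p < passes; ++p)
        interface.RunKernel(fooDef);
};

// The launcher would then be registered with the tuner for kernel foo,
// replacing the default launcher that executes the kernel exactly once.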
Evaluation

Benchmark set

Benchmark       Dimensions   Configurations
BiCG                11             5,122
Convolution         10             5,248
Coulomb 3D           8             1,260
GEMM                15           241,600
GEMM batched        11               424
Hotspot              6               480
Transpose            9            10,752
N-body               8             9,408
Reduction            5               175
Fourier              6               360

Table: A list of the benchmarks and the size and dimensionality (i.e., the number of tuning parameters) of their tuning spaces.

Test bed setup

Device               Architecture     SP perf. (GFlops)   BW (GB/s)
2x Xeon E5-2650      Sandy Bridge            512             102
Xeon Phi 5110P       Knights Corner        2,022             320
Tesla K20            Kepler                3,524             208
GeForce GTX 750      Maxwell               1,044              80
GeForce GTX 1070     Pascal                5,783             256
Radeon RX Vega 56    GCN 5                 8,286             410
GeForce RTX 2080Ti   Turing               11,750             616

Table: Devices used in our benchmarks. Arithmetic performance (SP perf.) is measured in single-precision GFlops, memory bandwidth (BW) in GB/s.

Performance

Benchmark      2080Ti   1070    750     K20     Vega56  E5-2650  5110P
BiCG           88.3%    84.7%   81.7%   50.4%   75.6%   46.0%    6.45%
Coulomb 3D     91.8%    91.4%   84.3%   43.2%   65.3%   74.2%    22.2%
GEMM           79.8%    80.6%   91.1%   51.3%   96.3%   37.5%    19.7%
GEMM batched   86.8%    81.4%   90.0%   49.6%   86.0%   27.7%    20.9%
Transpose      87.1%    80.2%   86.3%   64.2%   86.1%   62.5%    10.0%
N-body         89.7%    86.6%   87.7%   40.6%   82.2%   77.7%    29.9%
Reduction      68.7%    87.5%   89.4%   64.1%   71.6%   33.9%    10.1%
Hotspot        1.35x    1.94x   2.06x   1.4x    2.88x   1.2x     12.8x

Table: Performance of benchmarks autotuned for various hardware devices, relative to the theoretical peak of each device.

Performance portability (GPU -> GPU)

Benchmark      avg ± stdev       worst   failed
BiCG           89.0% ± 12.3%     57%     1
Convolution    79.4% ± 14.9%     55%     3
Coulomb 3D     95.8% ± 6.5%      67%     0
GEMM           83.6% ± 16.4%     31%     0
GEMM batched   85.4% ± 17%       37%     0
Hotspot        80.3% ± 17.5%     46%     3
Transpose      85.0% ± 21.9%      8%     3
N-body         78.8% ± 24.2%      2%     3
Reduction      88.4% ± 24%       12%     3
Fourier        74.5% ± 30%       31%     0

Table: Relative performance of benchmarks ported across GPU architectures without re-tuning.

Dynamic autotuning of batched GEMM

Figure: Batched GEMM on GeForce GTX 1070.
Figure: Batched GEMM on Tesla K20.

3D Fourier reconstruction

Figure: Dynamically tuned 3D Fourier reconstruction [8]: processes distribute batches of samples, a thread manager in each process hands samples to CPU threads that stream them to GPU kernels, each process updates its 3D regular grid, and the partial grids are reduced at process #0.
3D Fourier reconstruction: performance portability

          2080Ti   1070    750    680
2080Ti     100%     99%    31%    49%
1070        99%    100%    31%    50%
750         43%     67%   100%    94%
680         60%     72%    71%   100%

Table: Performance portability of 3D Fourier reconstruction with 128 x 128 samples.

          128x128   91x91   64x64   50x50   32x32
128x128     100%    100%     77%     70%     32%
91x91       100%    100%     76%     68%     33%
64x64        94%     94%    100%     91%     67%
50x50        79%     78%     98%    100%     86%
32x32        65%     67%     80%     92%    100%

Table: Performance portability on GeForce GTX 1070 for different sample sizes.

          best runtime   tuning 50   tuning full
2080Ti        1m40s      88% ± 3%        54%
1070          5m49s      96% ± 2%        79%
750          16m59s      92% ± 4%        72%
680          15m12s      94% ± 2%        75%

Table: The relative performance of dynamically tuned 3D Fourier reconstruction.

Dynamic autotuning of SpMV

SpMV is an important kernel in many applications
• it multiplies a sparse matrix with a dense vector
• used in solving systems of equations, graph processing, ...

It is challenging to compute efficiently
• optimization decisions depend strongly on the input structure (a plain CSR kernel is sketched below for illustration)

Multiple libraries are available
• cuSPARSE: closed-source library actively developed by NVIDIA
• CUSP: open-source library released by NVIDIA, slower compared to cuSPARSE

Our goal
• insert dynamic autotuning into CUSP for the DIA, ELL [2], COO and CSR [1] formats
• minimize the required changes in code that uses CUSP

Figure: SpMV benchmark for the CSR format (comparison with CUSP and cuSPARSE over a set of input matrices).
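For illustration (not from the slides), a plain CSR SpMV kernel with one thread per matrix row. Whether this scalar-per-row mapping, a warp-per-row mapping, or a different storage format performs best depends on the distribution of nonzeros per row, which is exactly why the optimization decisions are input-dependent.

// Minimal CSR SpMV sketch: y = A*x, one thread per matrix row.
// rowPtr has numRows+1 entries; colIdx/values hold the nonzeros.
__global__ void spmv_csr_scalar(const int* rowPtr, const int* colIdx,
                                const float* values, const float* x,
                                float* y, int numRows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < numRows)
    {
        float sum = 0.0f;
        // Rows with many nonzeros make this loop long and unbalanced,
        // which is why other mappings (warp per row, ELL, ...) may win.
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += values[j] * x[colIdx[j]];
        y[row] = sum;
    }
}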
Related Research

What do we use KTT for?

So we have developed a fancy autotuning framework...
• which is interesting work in itself, but we can do even more...

In GPU-accelerated applications
• used during program development (exploration of possible optimizations)
• manually added into applications to enable dynamic tuning
• used in the cryo-electron microscopy suite Xmipp

Some more theoretical tasks (but still with clear practical usage)
• searching the tuning space
• tuning budget estimation
• interoperability with other tools

Searching the tuning space

Why is searching tuning spaces important and difficult?
• it is important for speeding up autotuning convergence
• discrete, many-dimensional, non-convex spaces are hard to optimize with mathematical optimization
• as the spaces change with hardware or input, it is also a hard task for machine learning (an ML model relating tuning parameters to runtime becomes invalid when the hardware or input changes)

We proposed a novel method [3]
• it decomposes the relation between tuning parameters and runtime: an ML model relates tuning parameters to performance counters, and an expert system steers the optimization method
• the ML model is independent of hardware and input

Figure: Dependence between a tuning parameter and various properties of the Coulomb 3D kernel running with a large gridbox on GeForce GTX 750 and with a small gridbox on GeForce GTX 1070. The x-axis shows a tuning parameter changing thread coarsening. The y-axis shows normalized values of selected properties: kernel runtime, L2 cache read transactions, texture cache read transactions and 32-bit floating-point operations.

Main idea behind the searcher (a rough sketch of the scoring step follows after the figures)
• the relation between tuning parameters and performance counters measuring the amount of operations remains stable and can be captured by an ML model
• the relation between tuning parameters and performance counters measuring the stress of GPU subsystems depends on the GPU and input; it can be observed during tuning and used to identify bottlenecks
• an expert system asks the ML model which tuning parameters to change to suppress the bottlenecks
• this mimics what programmers do: they profile the code to observe bottlenecks and use their understanding of the code to introduce changes suppressing them

Figure: Schematic view of the searcher workflow. The boxes show program components, cylinders show data objects. In the training phase, KTT autotuning samples the tuning space and a model is created that captures the relations between tuning parameters and performance counters. In the searching phase, a trial run is profiled, its performance counters are analysed to identify bottlenecks, the expected counter changes are computed, and configurations are scored to bias the selection of the next configuration.

Figure: Convergence of the GEMM benchmark using KTT and Kernel Tuner. Left: convergence speed in time. Right: comparison of iterations (empirical tests).
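A rough sketch of the profile-guided scoring step described above, under assumed interfaces: the bottlenecked subsystem is identified from measured performance counters, a model (here an opaque function) predicts how each candidate configuration changes the counters feeding that bottleneck, and candidates expected to relieve it get a better score. All types and function names here (Configuration, Counters, identifyBottleneck, predictCounters) are illustrative, not the actual implementation from [3].

#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Configuration = std::map<std::string, uint64_t>;   // tuning parameter values
using Counters      = std::map<std::string, double>;     // performance counters

// Assumed to exist: identifies the most stressed subsystem from the measured
// counters of the last trial run (e.g., "global_memory", "fp32", ...).
std::string identifyBottleneck(const Counters& measured);

// Assumed ML model: predicts operation-related counters for a configuration.
Counters predictCounters(const Configuration& c);

// Score candidates: prefer configurations predicted to lower the amount of
// operations hitting the current bottleneck.
std::vector<double> scoreCandidates(const std::vector<Configuration>& candidates,
                                    const Counters& measured,
                                    const Configuration& current)
{
    const std::string bottleneck = identifyBottleneck(measured);
    const double currentLoad = predictCounters(current).at(bottleneck);

    std::vector<double> scores;
    for (const auto& c : candidates)
    {
        const double predictedLoad = predictCounters(c).at(bottleneck);
        // Higher score = larger expected reduction of the bottleneck load;
        // the searcher uses the scores to bias which configuration it tries next.
        scores.push_back(currentLoad - predictedLoad);
    }
    return scores;
}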
Tuning budget estimation
• the problem: autotuning itself requires computational resources, so it is also a subject of optimization
• estimating when to stop autotuning is therefore crucial: it balances the overhead of the tuning process (number of tuning steps x average time of the tuned kernel, including re-compilation) against the expected improvement in the speed of the tuned kernel
• we have shown that the stopping point can be estimated from historical data and a regression of the tuning search convergence [4] (a toy stopping-rule sketch follows the figures below)

Figure: Example of tuning space searcher convergence (x-axis: tuning iteration).
Figure: Example of tuning cost (x-axis: tuning iteration).
Figure: Example of total runtime depending on the performed tuning steps (x-axis: stopping point).
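A toy sketch of the balance described above, with all quantities assumed: tuning is worth continuing only while the predicted time saved over the remaining kernel executions exceeds the cost of further tuning steps (kernel run plus recompilation). The convergence prediction would in practice come from a regression over historical tuning data, as in [4]; here it is just a function parameter.

#include <functional>

// Decide whether to perform additional tuning steps.
// predictedBestAfter(s): predicted best kernel time (seconds) after s more
//   tuning steps; in [4] this comes from regression of historical convergence data.
// currentBest: best kernel time found so far (seconds).
// remainingExecutions: how many times the kernel will still run in production.
// stepCost: average cost of one tuning step (kernel run + recompilation), in seconds.
bool worthTuningMore(const std::function<double(int)>& predictedBestAfter,
                     double currentBest, double remainingExecutions,
                     int extraSteps, double stepCost)
{
    const double predictedBest  = predictedBestAfter(extraSteps);
    const double expectedSaving = (currentBest - predictedBest) * remainingExecutions;
    const double tuningOverhead = extraSteps * stepCost;
    return expectedSaving > tuningOverhead;
}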
KTT interoperability

KTT can be connected with different frameworks
• programming heterogeneous nodes is generally challenging: work must be distributed among multiple accelerators and CPUs, and data must be distributed accordingly
• StarPU implements task-based parallelism: it executes a DAG of data-dependent tasks on heterogeneous nodes, choosing among alternative implementations of the tasks
• StarPU schedules data movement and task execution across the node
• connecting KTT and StarPU makes the tasks tunable [7]
• tuning is transparent to the user
• it decouples the codes of domain experts and HPC experts

Future work

Still many interesting topics untouched
• autotuning for energy efficiency
• optimizing optimization spaces
• high-level programming of autotuned code
• non-trivial applications

Bibliography

[1] F. Bráblík. "Dynamic autotuning of SpMV kernel in CUSP library". MA thesis. Masaryk University, 2024.
[2] M. Demek. "Dynamic autotuning of SpMV kernel in CUSP library". MA thesis. Masaryk University, 2023.
[3] J. Filipovič et al. "Using hardware performance counters to speed up autotuning convergence on GPUs". In: Journal of Parallel and Distributed Computing 160 (2022), pp. 16-35. ISSN: 0743-7315. DOI: 10.1016/j.jpdc.2021.10.003.
[4] Jaroslav Oľha et al. "Estimating resource budgets to ensure autotuning efficiency". In: Available at SSRN 4661862 (2024).
[5] Filip Petrovič and Jiří Filipovič. "Kernel Tuning Toolkit". In: SoftwareX 22 (2023), p. 101385.
[6] F. Petrovič et al. "A benchmark set of highly-efficient CUDA and OpenCL kernels and its dynamic autotuning with Kernel Tuning Toolkit". In: Future Generation Computer Systems 108 (2020), pp. 161-177. DOI: 10.1016/j.future.2020.02.069.
[7] D. Střelák et al. "Umpalumpa: a framework for efficient execution of complex image processing workloads on heterogeneous nodes". In: Computing (2023), pp. 1-29.
[8] D. Střelák et al. "A GPU Acceleration of 3D Fourier Reconstruction in Cryo-EM". In: The International Journal of High Performance Computing Applications (2019). DOI: 10.1177/1094342019832958.