Autotuning
Introduction to autotuning, overview of our research

Jiří Filipovič et al.
Institute of Computer Science, Masaryk University
2023

Program development workflow
Implementation questions
• which algorithm to use?
• how to implement the algorithm efficiently?
• how to set up the compiler?

Program development workflow
Compiler's questions
• how to map variables to registers?
• which unrolling factor to use for a loop?
• which functions should be inlined?
• and many others...

Program development workflow
Execution questions
• how many nodes and threads to assign to the program?
• should accelerators be used?
• how to mix MPI processes and OpenMP threads?
A compiler works with heuristics; people usually do too.

Tuning of the program
We can empirically tune those possibilities
• use a different algorithm
• change code optimizations
• use different compiler flags
• execute with a different number of threads, etc.

Tuning of the program
Tuning allows us to outperform heuristics: we simply test what works better.
• however, we have to invest more time into development
• there are vertical dependencies, so we cannot perform tuning steps in isolation
• the optimum usually depends on hardware and input

Autotuning
The tuning process can be automated
• then we talk about autotuning
Autotuning
• at design time, we define the space of tuning parameters which can be changed
• each tuning parameter defines some property of the tuned application
• a search method is used to traverse the space of tuning parameters efficiently
• the search is performed according to some objective, usually performance
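To make this concrete, a minimal offline autotuner can be sketched as an exhaustive search over a small parameter space, timing each configuration and keeping the best one. This is an illustrative sketch only: run_benchmark is a hypothetical stand-in for compiling and running the tuned code with the given parameters.

    #include <limits>
    #include <cstdio>

    // Stand-in for running and timing the tuned code; a real autotuner would
    // rebuild and execute the kernel for the given configuration.
    double run_benchmark(int unroll, int blockSize) {
        return 100.0 / (unroll * blockSize);   // dummy runtime model
    }

    int main() {
        const int unrollFactors[] = {1, 2, 4, 8};
        const int blockSizes[]    = {64, 128, 256, 512};

        double bestTime = std::numeric_limits<double>::max();
        int bestUnroll = 0, bestBlock = 0;

        // Exhaustive (brute-force) traversal of the tuning space;
        // practical autotuners replace this loop with a smarter search method.
        for (int u : unrollFactors)
            for (int b : blockSizes) {
                double t = run_benchmark(u, b);   // objective: runtime
                if (t < bestTime) {
                    bestTime = t;
                    bestUnroll = u;
                    bestBlock = b;
                }
            }

        std::printf("best: UNROLL=%d BLOCK=%d (%.3f ms)\n",
                    bestUnroll, bestBlock, bestTime);
        return 0;
    }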
Taxonomy of Autotuning
Tuning scope
• what properties of the application are changed by the autotuner
• e.g., compiler flags, number of threads, source code optimization parameters
Tuning time
• offline autotuning (performed once, e.g., after SW installation)
• dynamic autotuning (performed at runtime)
Developer involvement
• transparent, or requiring only minor developer assistance (e.g., compiler flags tuning)
• low-level, requiring an expert programmer to identify tuning opportunities (e.g., code optimization parameters tuning)

Our focus
We target autotuning of code optimization parameters
• the source code is changed during the tuning process
• the user defines how tuning parameters influence the code
• very powerful (source code may control nearly everything)
• implementation is difficult
  • requires recompilation
  • runtime checks of correctness/precision
  • non-trivial expression of tuning parameters
• we have no implicit assumptions about the tuning space
• heterogeneous computing (we are tuning OpenCL or CUDA code)
• offline and dynamic autotuning

Motivation Example
Let's solve a simple problem, vector addition
• we will use CUDA
• we want to optimize the code

Motivation Example

    __global__ void add(float* const a, float* b) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        b[i] += a[i];
    }

It should not be difficult to write different variants of the code...

Optimization

    __global__ void add(float4* const a, float4* b) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        b[i] += a[i];
    }

The kernel has to be executed with n/4 threads.

Optimization

    __global__ void add(float2* const a, float2* b) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        b[i] += a[i];
    }

The kernel has to be executed with n/2 threads.

Optimization

    __global__ void add(float* const a, float* b, const int n) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        for (; i < n; i += blockDim.x*gridDim.x)
            b[i] += a[i];
    }

The kernel has to be executed with n/m threads, where m can be anything.

What to Optimize?
A mixture of:
• thread-block size
• vector variables
• amount of serial work per thread
i.e., a 3D space (and this is a trivial example...)

Autotuning
Autotuning tools may explore code parameters automatically

    __global__ void add(VECTYPE* const a, VECTYPE* b, const int n) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
    #if SERIAL_WORK > 1
        for (; i < n; i += blockDim.x*gridDim.x)
    #endif
            b[i] += a[i];
    }

The code executing kernel add has to configure parallelism according to the values of the VECTYPE and SERIAL_WORK tuning parameters.
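For illustration, the corresponding host-side launch might derive its configuration from the tuning parameters as sketched below. This is only a sketch: BLOCK_SIZE and VECWIDTH (the number of float elements in VECTYPE) are assumed to be defined by the tuner together with VECTYPE and SERIAL_WORK, d_a and d_b are assumed device pointers, and n is assumed to be divisible by both factors.

    // Host-side sketch (hypothetical): derive the launch configuration from
    // the tuning parameters. VECWIDTH = elements per VECTYPE (1, 2 or 4);
    // BLOCK_SIZE and SERIAL_WORK are further tuning parameters.
    const int elements = n / VECWIDTH;              // work items after vectorization
    const int threads  = elements / SERIAL_WORK;    // each thread processes SERIAL_WORK items
    const int grid     = (threads + BLOCK_SIZE - 1) / BLOCK_SIZE;

    add<<<grid, BLOCK_SIZE>>>(reinterpret_cast<VECTYPE*>(d_a),
                              reinterpret_cast<VECTYPE*>(d_b),
                              elements);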
Kernel Tuning Toolkit
We have developed the Kernel Tuning Toolkit (KTT)
• a framework allowing to tune code parameters for OpenCL and CUDA
• allows both offline and dynamic tuning
• enables cross-kernel optimizations
• mature implementation, documented, with examples
• https://github.com/HiPerCoRe/KTT

Kernel Tuning Toolkit
Typical workflow, similar to CUDA/OpenCL
• initialize the tuner for a specified device
• create input/output of the kernel
• create the kernel
• create a tuning space for the kernel
• assign input/output to the kernel
• execute or tune the kernel
KTT creates a layer between an application and OpenCL/CUDA.

KTT Sample Code

    // Initialize tuner and kernel
    ktt::Tuner tuner(platformIndex, deviceIndex);
    const ktt::DimensionVector ndRangeDimensions(inputSize);
    const ktt::DimensionVector workGroupDimensions(128);
    ktt::KernelId foo = tuner.addKernelFromFile(kernelFile, "foo",
        ndRangeDimensions, workGroupDimensions);

    // Creation and assignment of kernel arguments
    ktt::ArgumentId a = tuner.addArgumentVector(srcA,
        ktt::ArgumentAccessType::ReadOnly);
    ktt::ArgumentId b = tuner.addArgumentVector(srcB,
        ktt::ArgumentAccessType::WriteOnly);
    tuner.setKernelArguments(foo, std::vector{a, b});

    // Addition of tuning variables
    tuner.addParameter(foo, "UNROLL", {1, 2, 4, 8});

    tuner.tuneKernel(foo);
    tuner.printResult(foo, "foo.csv", ktt::PrintFormat::CSV);

Kernel Tuning Toolkit
In practice, we usually need more functionality
• tuning parameters can affect the parallelism configuration (e.g., block and grid size in CUDA)
  • by pre-defined functions (e.g., multiplying a specified block/grid dimension)
  • by a lambda function provided by the programmer
• some combinations of tuning parameters can be discarded a priori
  • lambda functions constraining the tuning space (see the sketch below)
• KTT can check whether the tuned kernel runs successfully
  • automatic check of successful execution
  • the user can provide a reference kernel or reference class and a comparison function; KTT compares results automatically
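To illustrate the constraint mechanism, the fragment below prunes configurations where the unrolling factor exceeds the amount of serial work. This is a sketch only: the exact name and signature of the constraint-adding call differ between KTT versions, and the lambda is assumed to receive parameter values in the order of the listed names.

    // Two tuning parameters...
    tuner.addParameter(foo, "UNROLL", {1, 2, 4, 8});
    tuner.addParameter(foo, "SERIAL_WORK", {1, 2, 4, 8, 16});

    // ...and a constraint lambda discarding invalid combinations a priori:
    // keep only configurations where UNROLL does not exceed SERIAL_WORK.
    tuner.addConstraint(foo,
        [](const std::vector<size_t>& v) { return v[0] <= v[1]; },
        std::vector<std::string>{"UNROLL", "SERIAL_WORK"});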
Advanced features of KTT
Cross-kernel optimizations
• the user can add specific code for kernel execution into the launchComputation method
• the code may query tuning parameters
• the code may call multiple kernels
• allows tuning code parameters with wider influence, as the tuned kernels do not need to be functionally equivalent

Advanced features of KTT
Dynamic autotuning
• dynamic tuning performs autotuning during application runtime
• KTT can execute the best kernel known so far to perform the kernel's task
• or try a different combination of tuning parameters before the execution
• tuning is transparent for the application
• tuning can be queried at any time

Dynamic Tuning Sample

    // Main application loop
    while (application_run) {
        ...
        if (tuningRequired)
            tuner.tuneKernelByStep(foo, {b});
        else {
            ktt::ComputationResult best = tuner.getBestComputationResult(foo);
            tuner.runKernel(compositionId, best.getConfiguration(), {b});
        }
        ...
    }

Dynamic tuning
Dynamic autotuning is challenging
• when the kernel is executed, there must be no significant performance drop
  • automatic memory management has to move only necessary data
• KTT has to support asynchronous execution of
  • memory copy, host and device code execution
  • simultaneous execution of multiple kernels
Parallelism in KTT
• intra-manipulator: parallelism inside the launchComputation method
• global parallelism: asynchronous execution of multiple launchComputation instances
During autotuning, global parallelism is disabled.

KTT Architecture
[Figure: KTT architecture. The application passes tuning parameters, kernels and input/output buffers to KTT; KTT comprises the tuning space, a searcher, a code generator, a code variant executor, optional validation, and a history of kernel results.]

Benchmark set

    Benchmark      dimensions  configurations
    BiCG                   11           5,122
    Convolution            10           5,248
    Coulomb 3D              8           1,260
    GEMM                   15         241,600
    GEMM batched           11             424
    Hotspot                 6             480
    Transpose               9          10,752
    N-body                  8           9,408
    Reduction               5             175
    Fourier                 6             360

Table: A list of the benchmarks and the size and dimensionality (i.e., the number of tuning parameters) of their tuning spaces.

Test bed setup

    Device              Architecture     SP perf.   BW
    2x Xeon E5-2650     Sandy Bridge          512   102
    Xeon Phi 5110P      Knights Corner      2,022   320
    Tesla K20           Kepler              3,524   208
    GeForce GTX 750     Maxwell             1,044    80
    GeForce GTX 1070    Pascal              5,783   256
    Radeon RX Vega 56   GCN 5               8,286   410
    GeForce RTX 2080Ti  Turing             11,750   616

Table: Devices used in our benchmarks. Arithmetic performance (SP perf.) is measured in single-precision GFlop/s, memory bandwidth (BW) in GB/s.
Performance

    Benchmark     2080Ti   1070    750    K20  Vega56  E5-2650  5110P
    BiCG           88.3%  84.7%  81.7%  50.4%   75.6%    46.0%  6.45%
    Coulomb 3D     91.8%  91.4%  84.3%  43.2%   65.3%    74.2%  22.2%
    GEMM           79.8%  80.6%  91.1%  51.3%   96.3%    37.5%  19.7%
    GEMM batched   86.8%  81.4%  90.0%  49.6%   86.0%    27.7%  20.9%
    Transpose      87.1%  80.2%  86.3%  64.2%   86.1%    62.5%  10.0%
    N-body         89.7%  86.6%  87.7%  40.6%   82.2%    77.7%  29.9%
    Reduction      68.7%  87.5%  89.4%  64.1%   71.6%    33.9%  10.1%
    Hotspot        1.35x  1.94x  2.06x   1.4x   2.88x     1.2x  12.8x

Table: Performance of benchmarks autotuned for various hardware devices, relative to the theoretical peak of the devices (Hotspot is reported as speedup).

Performance portability

GPU -> GPU

    Benchmark      avg ± stdev     worst  failed
    BiCG           89.0% ± 12.3%     57%       1
    Convolution    79.4% ± 14.9%     55%       3
    Coulomb 3D     95.8% ± 6.5%      67%       0
    GEMM           83.6% ± 16.4%     31%       0
    GEMM batched   85.4% ± 17%       37%       0
    Hotspot        80.3% ± 17.5%     46%       3
    Transpose      85.0% ± 21.9%      8%       3
    N-body         78.8% ± 24.2%      2%       3
    Reduction      88.4% ± 24%       12%       3
    Fourier        74.5% ± 30%       31%       0

Table: Relative performance of benchmarks ported across GPU architectures without re-tuning.

Dynamic autotuning of Batched GEMM
Figure: Batched GEMM on GeForce GTX 1070.

Dynamic autotuning of Batched GEMM
Figure: Batched GEMM on Tesla K20.

[Figure: processing pipeline of the 3D Fourier reconstruction. CPU threads prepare batches of FT samples in multiple streams that are processed on the GPU; the pipeline iterates while there are samples to process.]

3D Fourier Reconstruction

             2080Ti   1070    750    680
    2080Ti     100%    99%    31%    49%
    1070        99%   100%    31%    50%
    750         43%    67%   100%    94%
    680         60%    72%    71%   100%

Table: Performance portability of 3D Fourier reconstruction with 128 x 128 samples.

3D Fourier Reconstruction

               128x128   91x91   64x64   50x50   32x32
    128x128       100%    100%     77%     70%     32%
    91x91         100%    100%     76%     68%     33%
    64x64          94%     94%    100%     91%     67%
    50x50          79%     78%     98%    100%     86%
    32x32          65%     67%     80%     92%    100%

Table: Performance portability on GeForce GTX 1070 for different sample sizes.

3D Fourier Reconstruction

             best runtime   tuning 50   tuning full
    2080Ti       1m40s       88% ± 3%       54%
    1070         5m49s       96% ± 2%       79%
    750         16m59s       92% ± 4%       72%
    680         15m12s       94% ± 2%       75%

Table: The relative performance of dynamically-tuned 3D Fourier reconstruction.
What do we use KTT for?
So we have developed a fancy autotuning framework...
• which is interesting work in itself, but we can also use it for something more...
In GPU-accelerated applications
• used during program development (exploration of possible optimizations)
• manually added into applications to enable dynamic tuning
• used in the cryo-electron microscopy suite Xmipp

What do we use KTT for?
Some more theoretical tasks (but still with clear practical usage)
• searching the tuning space
• tuning budget estimation
• interoperability with other tools

Searching tuning space
Why is searching tuning spaces important and difficult?
• important to speed up autotuning convergence
• discrete, many-dimensional, non-convex spaces are hard to optimize with mathematical optimization
• as the spaces change with hardware or input, it is also a hard task for machine learning (if an ML model relates tuning parameters to runtime, it becomes invalid)
Our method
• decomposing the relation between tuning parameters and runtime: ML is used to relate tuning parameters to performance counters, an expert system is used to steer the optimization method
• the ML model is independent of HW and input

Searching tuning space
Figure: Dependence between a tuning parameter and various properties of the Coulomb 3D kernel running with a large gridbox on GeForce GTX 750 and with a small gridbox on GeForce GTX 1070. The x-axis shows a tuning parameter changing thread coarsening. The y-axis shows normalized values of selected properties: kernel runtime, L2 cache read transactions, texture cache read transactions and 32-bit floating-point operations.

Searching tuning space
Main idea behind the searcher
• the relation between tuning parameters and performance counters measuring the amount of operations remains stable, so it can be captured by an ML model
• the relation between tuning parameters and performance counters measuring the stress of GPU subsystems depends on the GPU and input, so it can be observed during tuning and used to identify bottlenecks
• an expert system asks the ML model which tuning parameters to change to suppress the bottlenecks
• this mimics what programmers do: they profile the code to observe bottlenecks, and use their understanding of the code to introduce changes suppressing those bottlenecks
A simplified sketch of one search step is shown below.

Searching tuning space
Figure: Schematic view of the searcher workflow. Training (problem p, GPU x, input i): KTT autotuning samples the tuning space and model creation captures the relations between tuning parameters and performance counters. Searching (problem p, GPU y, input j): the KTT searcher performs a trial run, analyses the performance counters to identify bottlenecks, computes the desired performance-counter changes, scores configurations, and biases the search to select the next configuration. The boxes show program components, cylinders show data objects.
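The fragment below is a highly simplified, hypothetical sketch of one such search step, written only to make the decomposition concrete. It is not KTT's profile-based searcher; predict_counters, detect_bottleneck and score are illustrative stand-ins with dummy bodies.

    #include <vector>
    #include <algorithm>
    #include <cstddef>

    using Configuration = std::vector<size_t>;   // values of tuning parameters
    using Counters      = std::vector<double>;   // performance-counter values

    // Dummy stand-in for the ML model (HW- and input-independent):
    // predicts operation-related counters from tuning parameters.
    Counters predict_counters(const Configuration& c) {
        return { double(c[0]), double(c[1]) };
    }

    // Dummy stand-in for the expert system: maps observed counters of a
    // trial run to a bottleneck (0 = memory bound, 1 = compute bound).
    int detect_bottleneck(const Counters& observed) {
        return observed[0] > observed[1] ? 0 : 1;
    }

    // Prefer configurations predicted to put less stress on the bottleneck.
    double score(const Counters& predicted, int bottleneck) {
        return -predicted[bottleneck];
    }

    // One search step: bias the selection of the next configuration towards
    // candidates whose predicted counters suppress the observed bottleneck.
    Configuration next_configuration(const std::vector<Configuration>& candidates,
                                     const Counters& trialRunCounters) {
        const int bottleneck = detect_bottleneck(trialRunCounters);
        return *std::max_element(candidates.begin(), candidates.end(),
            [&](const Configuration& a, const Configuration& b) {
                return score(predict_counters(a), bottleneck)
                     < score(predict_counters(b), bottleneck);
            });
    }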
Searching tuning space
Figure: Convergence of the GEMM benchmark using KTT and Kernel Tuner. Left: convergence speed in time. Right: comparison of iterations (empirical tests).

Tuning budget estimation
• the problem: as autotuning itself requires computational resources, it is also a subject of optimization
• therefore, estimating when to stop autotuning is crucial, as it balances
  • the overhead of the tuning process (number of tuning steps x average time of the tuned kernel including re-compilation)
  • the expected improvement of the speed of the tuned kernel
• we believe it can be estimated from historical data and regression of the convergence of the tuning space search

Tuning budget estimation
Figure: Example of tuning space searcher convergence over tuning iterations.

Tuning budget estimation
Figure: Example of tuning cost over tuning iterations.

Tuning budget estimation
Figure: Example of total application runtime depending on the number of performed tuning steps, marking the stopping point.

What do we use KTT for?
Interoperability
• programming heterogeneous nodes is generally challenging: distribution of work among multiple accelerators and the CPU, data distribution
• we work on connecting KTT with StarPU
• StarPU implements task-based parallelism; it executes a DAG of data-dependent tasks on heterogeneous nodes
  • alternative implementations of tasks
  • StarPU schedules data movement and task execution across the node
• KTT makes the tasks tunable (a StarPU codelet sketch is shown below)
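For context, a StarPU task is described by a codelet that can carry alternative CPU and CUDA implementations. The sketch below uses the plain StarPU API with our own function names for the vector addition used earlier; the KTT integration itself is not shown.

    #include <starpu.h>

    // CPU implementation of the task.
    void add_cpu(void *buffers[], void *cl_arg)
    {
        float *a = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
        float *b = (float *)STARPU_VECTOR_GET_PTR(buffers[1]);
        unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
        for (unsigned i = 0; i < n; i++)
            b[i] += a[i];
    }

    // CUDA implementation of the task, defined in a separate CUDA source file
    // (not shown); this is where a KTT-tuned kernel could be plugged in as an
    // alternative implementation.
    void add_cuda(void *buffers[], void *cl_arg);

    // Codelet: StarPU picks one of the alternative implementations and
    // schedules data movement and execution across the heterogeneous node.
    struct starpu_codelet add_cl = {
        .cpu_funcs  = { add_cpu },
        .cuda_funcs = { add_cuda },
        .nbuffers   = 2,
        .modes      = { STARPU_R, STARPU_RW },
    };

    // Submitting one task of the DAG (handle_a, handle_b are registered data):
    // starpu_task_insert(&add_cl, STARPU_R, handle_a, STARPU_RW, handle_b, 0);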