Introduction Motivation Kernel Tuning Toolkit
Autotuning
Introduction to autotuning, overview of our research
Jiří Filipovič et al. Institute of Computer Science Masaryk University
25. října 2018
Jiří Filipovič et al.Institute of Computer Science Masaryk Univc Autotuning
Introduction Motivation Kernel Tuning Toolkit
Program development workflow
Implementation questions
► which algorithm to use?
► how to implement the algorithm efficiently?
► how to set-up a compiler? Compiler's questions
► how to map variables to registers?
► which unrolling factor to use for a loop?
► which functions should be inlined?
► and many others... Execution
► how many nodes and threads assign to the program?
► should accelerators be used?
► how to mix MPI and OpenMP threads?
A compiler works with heuristics, people usually too.
Jin Filipovic et al.Institute of Computer Science Masaryk Univ< Autotuning
Introduction Motivation Kernel Tuning Toolkit		
		
Tuning of the program		
We can empirically tune those possibilities
► use different algorithm
► change code optimizations
► use different compiler flags
► execute in a different number of threads
► etc.
A tuning allows us to outperform heuristics - we just test what works better.
► however, we have to invest more time into development
► there are vertical dependencies, so we cannot perform tuning steps in isolation
► the optimum usually depends on hardware and input
Jin Filipovic et al.Institute of Computer Science Masaryk Univc Autotuning
Introduction Motivation Kernel Tuning Toolkit		
Autotuning		
The tuning can be automated
► then we talk about autotuning Autotuning
► in design time, we define the space of tuning parameters, which can be changed
► during autotuning, a combination of tuning parameters is repeatedly selected and empirically evaluated
► a search method is used to traverse the space of tuning parameters efficiently
► performed according to some objective, usually performance, but may be also energy consumption, numerical precision of pareto-optimal combination of several objectives
Jin Filipovic et al.Institute of Computer Science Masaryk Univc Autotuning
Introduction Motivation Kernel Tuning Toolkit		
		
Taxonomy of Autotuning		
Tuning scope
► what properties of the application are changed by autotuner
► e.g. compiler flags, number of threads, parameters of the source code
Tuning time
► off-line autotuning (performed once, e.g. after SW installation)
► on-line autotuning (performed in runtime) Developer involvement
► transparent, or requiring only minor developer assist (e.g. compiler flags tuning)
► low-level, requiring the developer to identify tunning opportunities (e.g. code parameters tuning)
Jin Filipovic et al.Institute of Computer Science Masaryk Univc Autotuning
Our focus
Introduction Motivation Kernel Tuning Toolkit
We target autotuning of code parameters
► the source code is changed during a tuning process
► the user defines how tuning parameters influence the code
► very powerful (source code may control nearly everything)
► implementation is difficult
► requires recompilation
► runtime checks of correctness/precision
► non-trivial expression of tuning parameters
► we have no implicit assumptions about tuning space
► heterogeneous computing (we are tuning OpenCL or CUDA code)
► offline and online autotuning
Jin Filipovic et al.Institute of Computer Science Masaryk Univc Autotuning
Introduction Motivation Kernel Tuning Toolkit
Motivation Example
Let's solve a simple problem - vectors addition
► we will use CUDA
► we want to optimize the code
Jin Filipovic et al.Institute of Computer Science Masaryk Univc Autotuning
□
Motivatio
Introduction Motivation Kernel Tuning Toolkit
__global__  void  add(float *  const  a,   float*  b) { int   i  =  blockldx.x*blockDim.x + threadldx.x; b[i]   + =  a[i] ;
}
It should not be difficult to write different variants of the code...
Jin Filipovic et al.Institute of Computer Science Masaryk Univt Autotuning
Optimization
Introduction Motivation Kernel Tuning Toolkit
__global__  void  add (float4*  const  a,  float4* b) { int   i  =  blockldx.x*blockDim.x +  threadldx.x; b[i]   += a[i] ;
}
Kernel has to be executed with n/4 threads.
Jin Filipovic et al.Institute of Computer Science Masaryk Univc Autotuning
Optimization
Introduction Motivation Kernel Tuning Toolkit
__global__  void  add (float2*  const  a,  float2* b) { int   i  =  blockldx.x*blockDim.x + threadldx.x; b[i]   += a[i] ;
}
Kernel has to be executed with n/4 threads.
Jin Filipovic et al.Institute of Computer Science Masaryk Univc Autotuning
Optimization
Introduction Motivation Kernel Tuning Toolkit
__global__  void  add(float*  const  a,   float* b,   const   int  n) { int   i  =  blockldx.x*blockDim.x + threadldx.x; for (; i < n; i += blockDim.x*gridDim.x) b[i]   += a[i] ;
}
Kernel has to be executed with n/m threads, where m can be anything.
Jin Filipovic et al.Institute of Computer Science Masaryk Univc Autotuning
Introduction Motivation Kernel Tuning Toolkit
What to Optimize?
Mixture of:
► thread-block size
► vector variables
► serial work
i.e. 3D space - and this is trivial example...
Jin Filipovic et al.Institute of Computer Science Masaryk Univc Autotuning
Introduction Motivation Kernel Tuning Toolkit
Autotuning
Autotuning tools may explore code parameters automatically
__global__ void
add (VECTYPE*   const  a,  VECTYPE*  b,   const   int  n) {
int   i  =  blockldx.x*blockDim.x + threadldx.x;
#if SERIAL_WORK > 1
for   (;   i  < n;   i  +=  blockDim.x*gridDim.x)
#endif
b[i]   += a[i] ;
}
Jin Filipovic et al.Institute of Computer Science Masaryk Univc Autotuning
□      is ► •< s ► < :
Introduction Motivation Kernel Tuning Toolkit
Is autotuning worthwile?
OK, so there are multiple variants of a code, but does it make sense to autotune?
► yes, tuning parameters interact, some sort of automatic search make sense
And wouldn't be enough to use a simple script?
► let's consider 3D Fourier Reconstruction1 as an example
► the complex code in CUDA, brings an order of magnitude speedup over parallel CPU implementation
► we have identified 7 tuning parameters forming a tuning space of 430 configurations
► we have tuned it for different GPUs to see performance portability
^D. Strelak, C. O. S. Sorzano, J. M. Carazo, J. Filipovic. A GPU Acceleration of 3D Fourier Reconstruction in Cryo-EM, accepted in International Journal of High Performance Computing Application.
Jin Filipovic et al.Institute of Computer Science Masaryk Univc Autotuning
Introduction Motivation Kernel Tuning Toolkit
3D Fourier Reconstruction Portability
Tabulka : Performance portability of 3D Fourier Reconstruction
	P100	GTX1070	GTX750	GTX680
Tesla P100	100%	95%	44%	96%
GTX 1070	88%	100%	31%	50%
GTX 750	65%	67%	100%	94%
GTX 680	71%	72%	71%	100%
We can gain over 3x speedup when tuning for each GPU architecture.
Jin Filipovic et al.Institute of Computer Science Masaryk Univc Autotuning
Introduction Motivation Kernel Tuning Toolkit
3D Fourier Reconstruction Portability
Tabulka : Sensitivity on input images in 3D Fourier Reconstruction (GTX 1070)
	128x128	91x91	64x64	50x50	32x32
128x128	100%	100%	77%	70%	32%
91x91	100%	100%	76%	68%	33%
64x64	94%	94%	100%	91%	67%
50x50	79%	78%	98%	100%	86%
32x32	65%	67%	80%	92%	100%
We can gain over 3x speedup when tuning for specific input size.
Jin Filipovic et al.Institute of Computer Science Masaryk Univc Autotuning
Introduction Motivation Kernel Tuning Toolkit
Is autotuning worthwile?
It is impractical to re-tune implementation for each combination of HW and input manually.
► even offline tuning is not practical here, as we have too much combinations
► the best solution is to tune application when HW and input size is defined
Jin Filipovic et al.Institute of Computer Science Masaryk Univ< Autotuning
Introduction Motivation Kernel Tuning Toolkit
Kernel Tuning Toolkit
We have developed a Kernel Tuning Toolkit (KTT)
► a framework allowing to tune code parameters for OpenCL and CUDA
► allows both offline and online tuning
► enables cross-kernel optimizations
► mature implementation, documented, with examples
► https://github.com/Fillo7/KTT
Jin Filipovic et al.Institute of Computer Science Masaryk Univc Autotuning
Introduction Motivation Kernel Tuning Toolkit		
		
Kernel Tuning Toolkit		
Typical workflow similar to CUDA/OpenCL
► initialize the tuner for a specified device
► create input/output of the kernel
► create kernel
► create a tuning space for the kernel
► assign input/output to the kernel
► execute or tune the kernel
KTT creates a layer between an application and OpenCL/CUDA.
Jin Filipovic et al.Institute of Computer Science Masaryk Univc Autotuning
Introduction Motivation Kernel Tuning Toolkit
KTT Sample Code
// Initialize   tuner  and kernel
ktt::Tuner  tuner(platformlndex,   devicelndex); const  ktt::DimensionVector  ndRangeDimensions(inputSize) const  ktt::DimensionVector  workGroupDimensions(128); ktt::Kernelld  foo  =  tuner.addKernelFromFile(kernelFile, ndRangeDimensions,   workGroupDimensions);
// Creation  and  assign  of  kernel arguments ktt: :Argumentld  a =  tuner.addArgumentVector(srcA ,
ktt: :ArgumentAccessType : :Readonly); ktt::Argumentld b =  tuner.addArgumentVector(srcB,
ktt::ArgumentAccessType::WriteOnly); tuner.setKernelArguments(foo,
std: :vector<ktt: :Argumentld >{a,   b});
// Addition  of  tuning variables
tuner.addParameter(foo,   "UNROLL",   {1,   2,   4, 8});
tuner.tuneKernel(foo); tuner.printResuit(foo,   "foo.csv",   ktt::PrintFormat::CSV);
"foo" ,
Jiří Filipovič et al.Institute of Computer Science Masaryk Univc Autotuning
Introduction Motivation Kernel Tuning Toolkit
Unique features of KT1
Cross-kernel optimizations
► the user can add specific code for kernels execution
► the code may query tuning parameters
► the code may call multiple kernels
► allows tuning code parameters with wider influence, as tuned kernels do not need to be functionally equivalent
Jin Filipovic et al.Institute of Computer Science Masaryk Univc Autotuning
Introduction Motivation Kernel Tuning Toolkit		
Reduction		
new kernel
atomic
Jin Filipovic et al.Institute of Computer Science Masaryk Univc Autotuning
Introduction Motivation Kernel Tuning Toolkit
Unique features of KT1
Online autotuning
► KTT can be called to execute a kernel and retrieve results or try different combination of tuning parameters before the execution
► transparent for the application
► errors need to be handled explicitly
► tuning can be queried in any time
Jin Filipovic et al.Institute of Computer Science Masaryk Univc Autotuning
Introduction Motivation Kernel Tuning Toolkit
Online Tuning Sample
// Main application loop while(application_run) {
•   • •
if   (tuningModeOn)
tuner.tuneKernelByStep(foo, {b}); else {
ktt: :Computat ionResuit best = tuner->getBestComputationResult( tuner.runKernel(compositionld,   best.getConfiguration(),   {b});
}
•   • •
Jiří Filipovič et al.Institute of Computer Science Masaryk Univc Autotuning
Introduction Motivation Kernel Tuning Toolkit
3D Fourier Reconstruction
Online tuning must mimic rich functionality of OpenCL/CUDA API.
Process #0
distribute tasks
Process #1
(batches of samples)
Thread Manager
distribute samples
CPU			CPU	
Thread			Thread	
#1			#2	
-—1	0>		r\i	01
E	Q. £		E tu	Q. £
a	ro		QJ	
	t		tží	t
				
GPU			GPU	
Kernel			Kernel	
#1			#2	
CPU Thread #n
GPU Kernel #n
update 3D regular grid
Process #m
reduce partial grids I
Process #0
				Thread					
				Manager			<■---		
						distribute samples			
	ř				ř				
CPU			CPU					CPU	
Thread			Thread			•	• •	Thread	
#1				#2				#n	
i—i	a)				OJ			r-l	QJ
E ro	Q. E			E 03	o. £			E re	Q. E
CU				C1J				OJ	
	t			Í/5	t				l-
									
GPU			GPU					GPU	
Kernel			Kernel			•	• •	Kernel	
#1				#2				#n	
update 3D regular grid
Obrázek : Architecture of 3D Fourier Reconstruction
Jiří Filipovič et al.Institute of Computer Science Masaryk Univc Autotuning
Introduction Motivation Kernel Tuning Toolkit
x
b.
CD O
c
E
CD Q.
150       200 250 runtime (seconds)
300
350
400
Obrázek : Performance of online tuned 3D Fourier reconstruction
=
Jiří Filipovič et al.Institute of Computer Science Masaryk Univ< Autotuning