Code Optimizations
Jiří Filipovič
Fall 2022

Vector Reduction

Let v be a vector of size n. We want to compute x = v_1 + v_2 + ... + v_n.

C code (not very reasonable for floats):

  int x = 0;
  for (int i = 0; i < n; i++)
    x += v[i];

There is a flow dependency across iterations.
  • we cannot compute it completely in parallel
  • addition is (at least in theory :-)) associative
  • so we do not need to add the numbers in sequential order

Parallel Algorithm

The sequential algorithm performs seven steps:

  ((((((v1 + v2) + v3) + v4) + v5) + v6) + v7) + v8

Addition is associative... so let's reorder the brackets:

  ((v1 + v2) + (v3 + v4)) + ((v5 + v6) + (v7 + v8))

We can work in parallel now:
  • four additions in the first step
  • two additions in the second step
  • one addition in the third step
In summary, we perform n − 1 additions in log2 n parallel steps!

Parallel Algorithm

We have found a parallel algorithm
  • the same number of additions as the serial algorithm
  • logarithmic time (if we have enough cores)
We add results of previous additions
  • flow dependency across threads
  • we need a global barrier

Naive Approach

The simplest scheme of the algorithm:
  • for every even i, i < n, perform v[i] += v[i+1]
  • repeat with n /= 2 while n > 1
The performance is not ideal:
  • 2n numbers loaded from global memory
  • n numbers stored to global memory
  • log2 n kernel invocations
We have three memory accesses per arithmetic operation and considerable kernel invocation overhead.
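The naive scheme can be sketched as follows. This is an illustrative kernel, not code from the lecture: the kernel name, the stride argument and the host loop are assumptions, and boundary handling is simplified.

  // Hedged sketch of the naive scheme: one kernel launch per reduction step.
  // Each thread adds one pair; the stride between the two elements of a pair
  // doubles after every launch, so partial sums stay at indices 0, 2*stride, 4*stride, ...
  __global__ void reduce_naive(int *v, int n, int stride) {
      int pair = blockIdx.x * blockDim.x + threadIdx.x;   // which pair this thread handles
      int i = pair * 2 * stride;
      if (i + stride < n)
          v[i] += v[i + stride];
  }

  // Host side: log2(n) launches, touching global memory in every pass.
  //   for (int stride = 1; stride < n; stride *= 2) {
  //       int pairs = (n + 2*stride - 1) / (2*stride);
  //       reduce_naive<<<(pairs + 255)/256, 256>>>(d_v, n, stride);
  //   }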
Exploiting Data Locality

We can add more than pairs during a single kernel call:
  • each block bx loads m numbers into shared memory
  • it reduces them in shared memory (in log2 m steps)
  • it stores only one number, the partial sum v[bx·m] + ... + v[(bx+1)·m − 1]
This reduces both memory transfers and the number of kernel invocations:
  • number of loads: n + n/m + n/m^2 + ... + 1 ≈ n·m/(m−1)
  • approximately n + n/m numbers read, n/m numbers written
  • log_m n kernel invocations

Implementation 1

  __global__ void reduce1(int *v) {
    extern __shared__ int sv[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sv[tid] = v[i];
    __syncthreads();
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
      if (tid % (2*s) == 0)
        sv[tid] += sv[tid + s];
      __syncthreads();
    }
    if (tid == 0)
      v[blockIdx.x] = sv[0];
  }

Performance

Beware the modulo operation.
High degree of divergence:
  • during the first iteration, only half of the threads are working
  • during the second iteration, only a quarter of the threads are working
  • etc.
Performance on GTX 280: 3.77 GB/s (0.94 GElem/s).

Implementation 2

We modify the indexing:

  for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    int index = 2 * s * tid;
    if (index < blockDim.x)
      sv[index] += sv[index + s];
    __syncthreads();
  }

Performance: 8.33 GB/s (2.08 GElem/s).
The code is free of modulo operations and divergence, but it generates shared memory bank conflicts.

Implementation 3

So we try yet another indexing:

  for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s)
      sv[tid] += sv[tid + s];
    __syncthreads();
  }

No divergence and no bank conflicts. Performance: 16.34 GB/s (4.08 GElem/s).
However, half of the threads do not compute at all...

Implementation 4

We can add numbers already while loading them from global memory:

  unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
  sv[tid] = v[i] + v[i+blockDim.x];

Performance: 27.16 GB/s (6.79 GElem/s).
There is no problem with data access anymore, but the performance is still low, so we focus on instructions.
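All of the implementations above reduce one block's worth of data per block and write one partial sum per block, so the host launches the kernel repeatedly until a single value remains. A minimal host-side sketch, assuming the reduce1 kernel above and an input size that is a power of the block size (the names BLOCK, N and d_v are illustrative, and boundary handling is omitted):

  // Hedged host-side driver: each pass shrinks the problem from n elements
  // to one partial sum per block, written back to the front of d_v.
  const int BLOCK = 256;
  int n = N;                       // N assumed to be a power of BLOCK
  while (n > 1) {
      int blocks = (n + BLOCK - 1) / BLOCK;
      reduce1<<<blocks, BLOCK, BLOCK * sizeof(int)>>>(d_v);
      n = blocks;                  // the next pass reduces the per-block partial sums
  }
  // the final result is now in d_v[0]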
Implementation 5

The number of active threads decreases during the computation in shared memory:
  • in the last six iterations, only the last warp is active
  • the warp is synchronized implicitly on GPUs with c.c. < 7.0, so we do not need __syncthreads()
  • we need a volatile variable in this case
  • the condition if (tid < s) does not spare any computation
So we can unroll the last warp...

Implementation 5

  float mySum = sv[tid];   // this thread's partial sum from the loading phase
  for (unsigned int s = blockDim.x/2; s > 32; s >>= 1) {
    if (tid < s)
      sv[tid] = mySum = mySum + sv[tid + s];
    __syncthreads();
  }
  if (tid < 32) {
    volatile float *s = sv;
    s[tid] = mySum = mySum + s[tid + 32];
    s[tid] = mySum = mySum + s[tid + 16];
    s[tid] = mySum = mySum + s[tid + 8];
    s[tid] = mySum = mySum + s[tid + 4];
    s[tid] = mySum = mySum + s[tid + 2];
    s[tid] = mySum = mySum + s[tid + 1];
  }

We save time in all warps (the last warp avoids the loop overhead, the other warps exit the for loop earlier).
Performance: 37.68 GB/s (9.42 GElem/s).

Implementation 5

For c.c. 3.0 or greater, we can use warp shuffles instead:

  if (tid < 32) {
    mySum += sv[tid + 32];
    for (int offset = warpSize/2; offset > 0; offset /= 2)
      mySum += __shfl_down_sync(0xffffffff, mySum, offset);
  }

Thanks to the explicit mask in __shfl_down_sync, this variant is safe on all supported GPUs.

Implementation 6

Can we unroll the for loop? If we know the number of iterations, we can unroll it
  • the number of iterations depends on the block size
Can we implement it generically?
  • the algorithm uses blocks whose size is a power of two
  • the block size is bounded from above
  • if we know the block size at compile time, we can use a template

  template <unsigned int blockSize>
  __global__ void reduce6(int *v)

Implementation 6

Conditions using blockSize are evaluated at compile time:

  if (blockSize >= 512) {
    if (tid < 256) sv[tid] += sv[tid + 256];
    __syncthreads();
  }
  if (blockSize >= 256) {
    if (tid < 128) sv[tid] += sv[tid + 128];
    __syncthreads();
  }
  if (blockSize >= 128) {
    if (tid < 64) sv[tid] += sv[tid + 64];
    __syncthreads();
  }

Performance: 50.64 GB/s (12.66 GElem/s).
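Since blockSize must be known at compile time, the host selects the right template instantiation at launch time. A minimal dispatch sketch in the spirit of the CUDA SDK reduction sample; the variables threads, blocks, smem and d_v are illustrative, not from the lecture:

  // Hedged sketch: map the run-time block size to a compile-time template argument.
  switch (threads) {
    case 512: reduce6<512><<<blocks, 512, smem>>>(d_v); break;
    case 256: reduce6<256><<<blocks, 256, smem>>>(d_v); break;
    case 128: reduce6<128><<<blocks, 128, smem>>>(d_v); break;
    case  64: reduce6< 64><<<blocks,  64, smem>>>(d_v); break;
    // ...smaller powers of two analogously
  }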
Implementation 7

Can we implement a faster algorithm? Let's reconsider the complexity:
  • log n parallel steps
  • n − 1 additions
  • time complexity for p threads running in parallel (on p processors): O(n/p + log n)
Cost of the parallel computation:
  • defined as the number of processors multiplied by the time complexity
  • if we assign one thread to one data element, we get p = n
  • and the cost is O(n · log n)
  • which is not cost-efficient

Implementation 7

Decreasing the cost:
  • we use O(n / log n) threads
  • each thread performs O(log n) sequential steps
  • after that, it performs O(log n) parallel steps
  • the time complexity stays the same
  • the cost is O(n)
What does this mean in practice?
  • we reduce the overhead of the computation (e.g. integer arithmetics)
  • it is an advantage when we have many more threads than are needed to saturate the GPU

Implementation 7

We modify the loading into shared memory:

  unsigned int gridSize = blockSize*2*gridDim.x;
  sv[tid] = 0;
  while (i < n) {
    sv[tid] += v[i] + v[i+blockSize];
    i += gridSize;
  }
  __syncthreads();

Performance: 77.21 GB/s (19.3 GElem/s).
You can find these implementations in the CUDA SDK.

Cross-kernel Optimizations

The compiler optimizes each kernel separately, so it may miss some optimization opportunities.
  • kernel fusion: gluing code from several kernels into one kernel
  • kernel fission: splitting a kernel into several smaller kernels

Kernel fusion

Performance impact of kernel fusion:
  • reduces kernel execution overhead
  • may add more parallelism
  • allows more scalar code optimizations: common subexpression elimination, loop fusion, condition fusion
  • reduces global memory transfers if the kernels are flow-dependent or input-dependent (illustrated by the sketch after the fission slide)
Correctness:
  • no flow dependency between thread blocks
  • shared memory and register locality has to be maintained

Kernel fission

Kernel fission reduces resource consumption:
  • increases occupancy
  • may allow algorithms which would be inefficient when run in one kernel (e.g. if part of the algorithm uses a different amount of parallelism or a different amount of resources)
  • more complicated and divergent code may be separated (e.g. handling array boundaries)
Correctness:
  • much easier, we just need to transfer data between the new kernels
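To illustrate the global-memory saving from kernel fusion, here is a hedged, hypothetical example (the kernels scale_kernel, add_kernel and fused_scale_add are not from the lecture): two flow-dependent element-wise kernels force the intermediate array through global memory, while the fused kernel keeps the intermediate value in a register.

  // Hypothetical unfused version: tmp makes a round trip through global memory.
  __global__ void scale_kernel(const float *in, float *tmp, float a, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) tmp[i] = a * in[i];
  }
  __global__ void add_kernel(const float *tmp, const float *b, float *out, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = tmp[i] + b[i];
  }

  // Fused version: one launch, no intermediate array in global memory.
  __global__ void fused_scale_add(const float *in, const float *b, float *out,
                                  float a, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
          float t = a * in[i];   // stays in a register
          out[i] = t + b[i];
      }
  }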
Problem Choice

Before we start accelerating code, we should consider carefully whether it is meaningful. The accelerated code should be:
  • critical for application performance (profile... and profile on real data)
  • large enough (GPUs are usually not ideal for relatively simple but latency-critical applications)
  • parallelizable (problematic e.g. in the simulation of a small system evolving for a long time)
  • with a sufficient ratio of flops to memory transfers (consider the slow PCI-E bus)

Problem Choice

Do we optimize running time or power consumption?
  • accelerators are usually faster, but also have higher power consumption
  • how to deal with hybrid systems (e.g. CPU, GPU and Xeon Phi)?
  • this influences the decision what to buy as well as what to use (and which resources to leave in power-saving mode)

Algorithm Design

Parallelization:
  • we need to parallelize the computational problem
  • we should be aware of the target architecture even at this stage (consider, e.g., graph algorithms)
It is difficult to accelerate code on a GPU:
  • if threads within a warp access rather random addresses in memory
  • if threads within a warp diverge (by the nature of the algorithm)
  • if the parallelism is insufficient in certain parts of the computation

How to Write Bug-Free Code

Test whether API and kernel calls are successful
  • otherwise, errors appear later...
Memory allocation on the GPU occurs seldom, and allocated memory is not cleared
  • if you modify your code so that your kernel does not write any result, you get the result from its previous run
  • clear output arrays for debugging purposes
Be aware of out-of-bounds shared memory accesses
  • the kernel usually runs successfully, but one block interferes with another
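A common way to test API and kernel calls is a small checking macro. This is a generic sketch, not code from the lecture:

  #include <cstdio>
  #include <cstdlib>
  #include <cuda_runtime.h>

  // Hedged sketch of a standard error-checking pattern.
  #define CUDA_CHECK(call)                                              \
      do {                                                              \
          cudaError_t err_ = (call);                                    \
          if (err_ != cudaSuccess) {                                    \
              fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                      cudaGetErrorString(err_), __FILE__, __LINE__);    \
              exit(EXIT_FAILURE);                                       \
          }                                                             \
      } while (0)

  // Usage: wrap API calls directly; for kernels, check both launch and execution.
  //   CUDA_CHECK(cudaMalloc(&d_v, n * sizeof(int)));
  //   my_kernel<<<grid, block>>>(d_v);
  //   CUDA_CHECK(cudaGetLastError());        // launch errors
  //   CUDA_CHECK(cudaDeviceSynchronize());   // asynchronous execution errors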
Optimization

Start with the most important optimizations and continue with the less important ones (so that the effect of the less important optimizations is not hidden). In general, this order works well:
  • reduction/overlap of PCI-E transfers
  • global memory access (bandwidth, latency)
  • access to other types of memory
  • divergence
  • parallelism configuration (block size, amount of serial work per thread)
  • instruction optimization
It is a good idea to write your code configurable
  • block size, number of serial iterations per thread, loop unrolling factor, algorithm used...
  • use macros or templates to ease configuration of the optimizations

Interpretation of Algorithm Performance

Some optimizations may be hidden
  • e.g. optimizing instructions cannot help while the code is bound by a wrong global memory access pattern
  • this can be reduced by applying the more important optimizations earlier
  • use the profiler
The optimization space is not continuous
  • due to the restricted amount of GPU resources
  • e.g. improving the efficiency of scalar code by using one more register may decrease performance by restricting GPU occupancy
Performance is data-dependent
  • data size: partition camping, underutilized GPU
  • data content: sparse data with varying structure

What is the real speedup over CPU?

Comparison of theoretical peaks is the basic metric
  • however, the speedup can be lower
    • insufficient parallelism
    • inappropriate data structures, random access
    • PCI-E bottleneck (especially for multi-GPU algorithms)
  • however, the speedup can also be higher
    • frequent usage of SFUs
    • complicated vectorization on the CPU
    • insufficient scaling on SMP (cache interference, NUMA)
    • different scaling of CPU and GPU with growing problem size

Searching for Bottlenecks

The amount of arithmetic operations and memory transfers tells us what is expected to limit the algorithm
  • sometimes the bottleneck is not clear (overhead instructions, irregular memory access)
  • code profiling: suitable for identifying issues with instruction throughput or a bad memory access pattern, more difficult for identifying the source of latency problems
  • code modifications: more precise, but more laborious and not usable in all cases

Profiling

How close is the code to the hardware limits?
  • the profiler shows the overall utilization of particular GPU subsystems, such as cache, global memory, FP instructions etc.
Issue identification
  • the profiler detects some issues, such as shared memory bank conflicts or code divergence
We can inspect the code in detail
  • time spent on particular instructions / C for CUDA lines of code
  • we need to compile the code with the flag -lineinfo

Code Modifications

Global memory performance
  • we comment out the computation
  • but we need to somehow use the loaded data (to prevent the compiler from removing the loads)
  • we can check with the profiler that the same amount of data is transferred
Instruction performance
  • we comment out the data movement
  • but the resulting data needs to be stored (to prevent the compiler from removing the computation)
  • but we do not want to store the data...
  • so we move the code storing the data into a condition which is evaluated as false at run time (but not at compile time); see the sketch below
  • be aware of execution overhead in the case of fast kernels
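A hedged sketch of that trick (the kernel and the flag argument are illustrative, not from the lecture): the arithmetic is kept, the global loads are replaced by a value derived from the thread index, and the store is guarded by a flag the host always sets to zero, so the compiler cannot remove the computation.

  // Hedged sketch: measure instruction throughput without global memory traffic.
  __global__ void arithmetic_only(float *out, int flag) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      float r = (float)i;                  // stands in for data loaded from global memory
      for (int k = 0; k < 256; k++)        // stands in for the kernel's arithmetic
          r = r * 1.000001f + 0.5f;
      if (flag)                            // host always passes flag == 0, unknown at compile time
          out[i] = r;                      // never executed, but keeps r (and the loop) alive
  }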
Code Modifications

Be aware of occupancy changes
  • code modifications can release some resources
  • we can restrict occupancy again by allocating a dummy shared memory array (see the sketch at the end of this section)
Interpretation of the measured times
  • the original kernel execution time is close to the sum of the computation-only and memory-only kernel times: latency is the issue
  • the computation-only or the memory-only kernel time dominates and is close to the original kernel time: the performance is bound by computation or by memory, respectively
  • the computation-only and memory-only kernel times are both similar to the original kernel time: we need to optimize both

Code Modifications

Approximating the effect of an optimization
  • when we already know about some performance issue
  • when we want to know the effect of an optimization before we actually implement it
  • we can modify the code without preserving the original functionality, but preserving the amount of work and removing the performance issue
  • this cannot be done in all cases
  • it may show us whether we really address the performance issue
  • see the matrix transposition example
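For the occupancy note above, one possible way to restrict occupancy during such experiments is to reserve dynamic shared memory that the kernel never really uses. A hedged sketch; the kernel name, the flag trick and the sizes are illustrative, not from the lecture, and the grid is assumed to cover the array exactly:

  // Hedged sketch: dynamic shared memory requested at launch counts against the
  // SM's shared memory budget and thus limits the number of resident blocks.
  __global__ void kernel_under_test(const float *in, float *out, int flag) {
      extern __shared__ float dummy[];             // reserved but never really used
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      float r = in[i] * 2.0f;                      // stands in for the original kernel body
      if (flag) dummy[threadIdx.x] = r;            // flag == 0 at run time; keeps dummy referenced
      out[i] = r;
  }

  // Host side: the third launch parameter reserves e.g. 16 KB of shared memory per block.
  //   kernel_under_test<<<grid, block, 16*1024>>>(d_in, d_out, 0);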