Flood maps dRMSD Acceleration ooooooooo ooooooooooo Flood Maps, dRMSD Computation J-W i- ■ i " "V in Fihpovic Fall 2022 Jiří Filipovič Flood Maps, dRMSD Computation Flood maps dRMSD Acceleration •oooooooo ooooooooooo Introduction Artificial problem, was solved as a project PV197 in 2010 • various formulations of algorithm possible with very different performance The input of the problem is height map, source of water and elevation of the source. The output is the map of flooded area. • the water source has infinite amount of the water, the ground is waterproof • so, we need to found a continuous region containing the water source, where the ground is under the water source level • this is analogy to searching for a connected component in special form of a graph Jiří Filipovič Flood Maps, dRMSD Computation Flood maps dRMSD Acceleration O0OOOOOOO ooooooooooo Naive Algorithm Probably the simplest version to implement • a thread is created for each point in the map • each thread periodically checks its neighbourhood points, if there is a water and thread's point is under water source level, thread puts water into its point • algorithm iterates until no points are flooded Extremely inefficient 9 let's consider trivial example - the whole map will be flooded • in an iteration, we perform n2 operations, we need 2n iterations • complexity is 0(n3), but sequential algorithm has 0(n2) Jin Fihpovic Flood Maps, dRMSD Computation Flood maps OO0OOOOOO dRMSD Acceleration ooooooooooo Moving Line How to distribute more water within a single iteration? • we will create a "moving line"(each line point is processed by a thread) o the line is moved through a picture horizontály and vertically in both directions • if a thread hits a water source, it inserts water until the heigh of processed points is higher than the water source level • the line iterativelly passes the map until no water is written in all directions of line processing Jiří Filipovič Flood Maps, dRMSD Computation Flood maps dRMSD Acceleration OOO0OOOOO ooooooooooo Moving Line Efficiency • we perform n2 steps in one iteration, where we can flood at most n2 points • low performance if the height map is complicated (the water is spread by river basins) • we need at least 2 iterations, each consisting of 4 line movements (left, right, up, down), but often we need more • complexity is 0(i • r?2), where / is number of iterations The second stage of the project requires 250Mpix/s performance • if the line access memory coalesced (vertical line needs to use shared memory), we overcome the required performance Jin Fihpovic Flood Maps, dRMSD Computation Flood maps oooo«oooo dRMSD Acceleration ooooooooooo Tiled Access Issues with the moving line • we are processing already flooded points many times • no temporal locality in data access Jiří Filipovič Flood Maps, dRMSD Computation Flood maps oooo«oooo dRMSD Acceleration ooooooooooo Tiled Access Issues with the moving line • we are processing already flooded points many times • no temporal locality in data access We can process the tiles which fits into the shared memory • the tile is loaded into shared memory and the moving line is processing the tile iterativelly 9 we can do more work without accessing the global memory o if some tile do not have newly flooded neighbours in an iteration, it can be ommited in the iteration • can suffer from insufficient parallelism Jiří Filipovič Flood Maps, dRMSD Computation Flood maps ooooo«ooo Tiled Access dRMSD Acceleration ooooooooooo Better efficiency • much less replications of global memory accesses • replicated accesses mainly into fast shared memory Some tricks • blocks completely under/above vater source level handled by special code Jiří Filipovič Flood Maps, dRMSD Computation Flood maps OOOOOO0OO dRMSD Acceleration ooooooooooo Construction of Regions In nature, the water is moved sequentially • but do we need to respect it in our algorithm? Jiří Filipovič Flood Maps, dRMSD Computation Flood maps dRMSD Acceleration OOOOOO0OO ooooooooooo Construction of Regions In nature, the water is moved sequentially • but do we need to respect it in our algorithm? We can found continuous regions under the water source level in parallel • but we do not know, which are to be flooded • we need to connect those regions with the one containing water source So we have transformed sequential spread of the water to a problem of connecting regions • the number of steps is logarithmic with respect to map size Jiří Filipovič Flood Maps, dRMSD Computation Flood maps OOOOOOO0O dRMSD Acceleration ooooooooooo Evaluation What we get from students 9 various implementations, various algorithms (only main ideas presented here) Performance • 4 orders of magnitude difference between the fastest and the slowest implementation • mainly by using different algorithm • CUDA optimizations and various tricks creates high spread of performance (so there is no clear distinction of algorithms seen in performance) Jiří Filipovič Flood Maps, dRMSD Computation Flood maps dRMSD Acceleration oooooooo* ooooooooooo Evaluation 2500 2000 1500 > 1000 500 0 1-t-t-r i-r GTX480 GTX285 body I I TTTTTTTl 1111111 rtmtTtm^— 1 4 7 10 13 16 19 22 25 28 31 34 poradi J in Fihpovic Flood Maps, dRMSD Computation Flood maps ooooooooo RMSD dRMSD Acceleration •oooooooooo RMSD (root-mean-square deviation) is used in computational chemistry to compare different structures of a molecule • tell us how similar are two structures of a molecule (i.e., we have two structures with the same atoms, but different spatial organization) • used to cluster similar molecules (many outputs of a simulation, some of them very similar) • used to trace convergence of simulations Jiří Filipovič Flood Maps, dRMSD Computation Flood maps ooooooooo dRMSD Acceleration O0OOOOOOOOO dRMSD dRMSD is one possible variant of RMSD computation, it measures average difference between corresponding pairs of atoms of two molecules ^MSD = ± {d*-d$f where df? is the Euclidean distance between /-th and /-th atom in molecule A. GPU acceleration is part of publication: • Jiří Filipovič, Jan Plhák, David Střelák. Acceleration of dRMSD Calculation and Efficient Usage of GPU Caches. In Proceedings of IEEE International Conference on High Performance Computing &. Simulation. 2015. Jiří Filipovič Flood Maps, dRMSD Computation Flood maps dRMSD Acceleration OOOOOOOOO OO0OOOOOOOO Memory locality Flop-to-word ratio • an iteration of inner sum performs 21 flops and transfers 6 numbers o flop-to-word ratio 0.875 for single precision, so memory locality needs to be exploited Improving memory locality • we store n atoms in registers of threads within a thread block <* when new atom is loaded into shared memory, it can be used to compute n distances o compute intensity required to saturate GPU can be easily reached Jiří Filipovič Flood Maps, dRMSD Computation Flood maps ooooooooo dRMSD Acceleration OOO0OOOOOOO Parallelization ID parallelization n—l n • only outer sum is parallelized: ~~ ^u)2 i=ij=i+i o straightforward naive implementation (inner sum transformed into kernel code) quite good cache locality, so even naive algorithm works well • when shared memory is utilized • sliding window is not efficient • threads within the block can be synchronized to access the same atom in the same time Jiří Filipovič Flood Maps, dRMSD Computation Flood maps ooooooooo Parallelization dRMSD Acceleration OOOO0OOOOOO 2D parallelization both outer and inner sums are parallelized i=ij=i+i o better strong scaling 9 higher memory workload Jiří Filipovič Flood Maps, dRMSD Computation Flood maps ooooooooo Optimizations dRMSD Acceleration ooooo«ooooo When shared memory is utilized, the algorithm is instruction-bound. o we will optimize instructions Block size and loop unrolling factor balance • balance latency hiding and instruction efficiency • when properly set, we reach 95% occupancy of compute units (according to profiler at Fermi and Maxwell) Jiří Filipovič Flood Maps, dRMSD Computation Flood maps ooooooooo Optimizations dRMSD Acceleration OOOOOO0OOOO SFU instructions optimization • with -use_f astJiiath compiler flag, sqrtf (x) is transformed to 1/rsqrtf(x) • it may generate too high workload for SFU units (both reciprocal square root and reciprocal are computed on SFU) • can be manually optimized by using x * rsqrtf (x) • speed up computation on Fermi, no effect on Maxwell (it has 2x higher speed of SFUs relative to SPs) Jiří Filipovič Flood Maps, dRMSD Computation Flood maps dRMSD Acceleration OOOOOOOOO OOOOOOO0OOO Optimizations Overhead instructions • after loop unrolling and SFU optimizations, about 90% of instructions are FP32 • most of instructions in remaining 10% are shared memory loads (6 128-bit loads per 4 iterations) • to reduce load instructions, we can unroll outer loop Outer loop unrolling • unrolling factor u means thread process u iterations of outer sum and thus reduces number of load instructions to - u • u has to be small to keep registers consumption at reasonable level • we set u to 4 for Maxwell and 2 for Fermi • reduces strong scaling J in Fihpovic Flood Maps, dRMSD Computation Flood maps dRMSD Acceleration OOOOOOOOO OOOOOOOO0OO Performance GPU Utilization o we have tested the performance on GeForce GTX480 (Fermi) and GeForce GTX750 (Maxwell) • we are able to reach 95 % utilization of ALUs, where 95 % of executed instructions are floating point ones Comparison with CPU • dRMSD computation 62.7x faster than ClusCo computation on Core i7-3820 • 13.4x faster than clustering in ClusCo running on Core i7-3820 • performance boost not proportional to theoretical performance difference - the main reason is poor vectorization of ClusCo J in Fihpovic Flood Maps, dRMSD Computation Jin Fihpovic Flood Maps, dRMSD Computation Flood maps ooooooooo dRMSD Acceleration oooooooooo« Scaling with Maxwell 800 i-1-1-1-1-r 0 5000 10000 15000 20000 25000 30000 35000 40000 45000 50000 atoms Jiří Filipovič Flood Maps, dRMSD Computation