PauliusPaulius MicikevičiusMicikevičius | NVIDIA| NVIDIA Dec 14, 2009Dec 14, 2009 Single GPU 3D Finite DifferenceSingle GPU 3D Finite Difference 3D Finite Difference * 25-point stencil (8th order in space) * Isotropic: 5 distinct coefficients * For each output element we need: ­ 29 flops­ 29 flops ­ 25 input values * Some applications: ­ FD of the wave equation (oil & gas exploration) General Approach * Tile a 2D slice with 2D threadblocks ­ Slice is in the two fastest dimensions: x and y * Each thread iterates along the slowestˇ Each thread iterates along the slowest dimension (z) ­ Each thread is responsible for one element in every slice ­ Also helps data reuse Naive Implementation * One thread per output element * Fetch all data for every output element ­ Redundant: input is read 25 times ­ Required bandwidth = 25 reads, 1 write (26x) * Access Redundancy:ˇ Access Redundancy: ­ Metric for evaluating implementations ­ Ratio between the elements accessed and the elements processed * Appropriate for user-managed-cache architectures * Optimization: share data among threads ­ Use shared memory for data needed by many threads ­ Use registers for data not shared among threads Using Shared Memory: 2 Passes * 3DFD done with 2 passes: ­ 2D-pass (2DFD) ­ 1D-pass (1DFD and output of the 2D-pass) * SMEM is sufficient for 2D subdomains ­ Square tiles require the smallest halos­ Square tiles require the smallest halos ­ Up to 64x64 storage (56x56 subdomain) * 76.5% of storage is not halo * Redundancy (volume accesses): ­ Read/write for both passes ­ 16x16 subdomain tiles: 6.00 times ­ 32x32 subdomain tiles: 5.50 times ­ 56x56 subdomain tiles: 5.29 times Input Reuse within a 2x2 Threadblock Used only by thread (1,0) Used only by thread (0,1) Used only by thread (0,0) Used by at least 2 threads y x z * Store the xy-slice in SMEM * Each thread keeps its 8 z-elements in registers ­ 4 "infront", 4 "behind" Used only by thread (1,1) Used only by thread (1,0) Process at z = k y x z Stored in registers Value we "track" Stored in SMEM Value we "track" Process at z = k+1 Stored in registers y x z Value we "track" Stored in SMEM Newly-read value Value we "track" Process at z = k+2 y x z Stored in registers Value we "track" Stored in SMEM Newly-read value Value we "track" Process at z = k+3 y x z Stored in registers Value we "track" Stored in SMEM Newly-read value Value we "track" Inner Loop of the Stencil Kernel // ------- advance the slice (move the thread-front) ------------------- behind4 = behind3; behind3 = behind2; behind2 = behind1; behind1 = current; current = infront1; infront1 = infront2; infront2 = infront3; infront3 = infront4; infront4 = g_input[in_idx]; in_idx += stride; out_idx += stride; __syncthreads( ); // ------- update the data slice in smem ---------------------------------- if( threadIdx.y