GPU Hardware Performance II
Jiří Filipovič, fall 2018

Revision of Memory Access Rules

Global memory
• a warp should access data coalesced
• thread blocks should prevent partition camping

Shared memory
• threads in a warp should access different banks (or the same data)

All memories
• sufficient occupancy is needed to hide memory latencies

Matrix Transposition

From a theoretical perspective:
• a trivial problem
• a trivial parallelization
• trivially limited by memory throughput (no arithmetic operations are done)

__global__ void mtran(float *odata, float *idata, int n) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    odata[x*n + y] = idata[y*n + x];
}

Performance

When running the code on a GeForce GTX 280 with a large enough matrix (4000 x 4000), the throughput is 5.3 GB/s. Where is the problem?

Access to odata is interleaved (not coalesced). After a modification that copies instead of transposing,

    odata[y*n + x] = idata[y*n + x];

the throughput is 112.4 GB/s. If idata were accessed in an interleaved way too, the resulting throughput would be 2.7 GB/s.
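The throughput figures quoted here are effective bandwidths. Below is a minimal host-side sketch of how such a number can be obtained with CUDA events; the launch configuration and the convention of counting both the bytes read and the bytes written are assumptions, not details taken from the original measurement.

    #include <cstdio>

    // Hypothetical measurement sketch: times the mtran kernel above and reports
    // effective bandwidth, counting n*n floats read plus n*n floats written.
    void measure_transpose(float *d_odata, float *d_idata, int n) {
        dim3 block(16, 16);                      // assumed launch configuration
        dim3 grid(n / block.x, n / block.y);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        mtran<<<grid, block>>>(d_odata, d_idata, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

        double bytes = 2.0 * n * n * sizeof(float);   // read + write
        printf("effective bandwidth: %f GB/s\n", bytes / (ms * 1e6));

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }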
On Removing Interleaving

The matrix can be processed per tiles:
• we read a tile into shared memory row-wise
• we store its transposition into global memory row-wise
• thus both reading and writing are done without interleaving

What size of tiles should be used?
• let us consider square tiles for simplicity
• for aligned reading, the row size has to be a multiple of 16
• because of shared memory size limitations, we can consider tile sizes of 16 x 16, 32 x 32, and 48 x 48
• the best size can be determined experimentally

Tiled Transposition

__global__ void mtran_coalesced(float *odata, float *idata, int n) {
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = x + y*n;
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = x + y*n;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y+i][threadIdx.x] = idata[index_in + i*n];

    __syncthreads();

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[index_out + i*n] = tile[threadIdx.x][threadIdx.y+i];
}

Performance

The highest performance was measured for 32 x 32 tiles and 32 x 8 thread blocks: 75.1 GB/s.
• that is significantly better, but still less than simple copying
• the kernel is more complex and contains synchronization
• we need to figure out whether we have reached the maximum or there is still a problem somewhere
• if we only copy within the blocks (no transposition), we get 94.9 GB/s
• something is still sub-optimal
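For reference, a sketch of what such a block-wise copy benchmark might look like; the exact benchmark used in the lecture is not shown, so treat this as an assumption. The idea is that it moves data through shared memory exactly like the tiled transpose, but writes the tile back without transposing it.

    // Hypothetical copy benchmark: same data movement as mtran_coalesced,
    // but output indices equal input indices, so no transposition is done.
    __global__ void mcopy_tiled(float *odata, float *idata, int n) {
        __shared__ float tile[TILE_DIM][TILE_DIM];

        int x = blockIdx.x * TILE_DIM + threadIdx.x;
        int y = blockIdx.y * TILE_DIM + threadIdx.y;
        int index = x + y*n;

        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
            tile[threadIdx.y+i][threadIdx.x] = idata[index + i*n];

        __syncthreads();

        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
            odata[index + i*n] = tile[threadIdx.y+i][threadIdx.x];
    }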
Shared Memory

When reading from global memory, we write into shared memory row-wise:

    tile[threadIdx.y+i][threadIdx.x] = idata[index_in + i*n];

When writing to global memory, we read from shared memory column-wise:

    odata[index_out + i*n] = tile[threadIdx.x][threadIdx.y+i];

That is a read with a stride that is a multiple of 16, so the whole column lies in a single memory bank, creating a 16-way bank conflict.

A solution is padding:

    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

Performance

Now our implementation shows 93.4 GB/s.
• as good as simple copying
• it seems we cannot do much better for the given matrix
• beware of different input data sizes (partition camping)

[Figure: transpose throughput as a function of matrix size.]

Performance Drops

The performance drops for some sizes, and the behavior is regular:
• for matrices whose size is a multiple of 512, we only get 19 GB/s
• for other matrices whose size is a multiple of 256, we only get 35 GB/s
• for other matrices whose size is a multiple of 128, we only get 62 GB/s

One memory region (partition) is two tiles wide: 256 B per region / 4 B per float = 64 floats, i.e., two 32-float-wide tiles. If we analyze tile placement with respect to the matrix size, we learn that
• with a size that is a multiple of 512, the tiles in the same column all lie in the same region
• with a multiple of 256, each column lies in at most two regions
• with a multiple of 128, each column lies in at most four regions

We have discovered partition camping.
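As a worked illustration (assuming the GT200 layout of 8 partitions, each 256 B wide, so the partition pattern repeats every 8 x 256 B = 2048 B = 512 floats): for a 4096-wide matrix, one row occupies 4096 x 4 B = 16384 B = 8 x 2048 B, so every row starts in the same partition and a column of 32-float-wide tiles keeps hitting a single partition; for a 4000-wide matrix, one row occupies 16000 B, which is not a multiple of 2048 B, so successive rows rotate across partitions and the accesses spread out.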
How to Remove Partition Camping?

We can pad matrices and avoid the bad matrix sizes.
• working with such an implementation is more complicated (all kernels accessing the matrix have to implement padding, or we need to convert the matrix)
• it occupies more memory

We can change the mapping of thread block IDs onto matrix tiles:
• diagonal mapping ensures that concurrently running blocks access different regions

    int blockIdx_y = blockIdx.x;
    int blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;
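A sketch of how this remapping could be plugged into the padded tiled kernel; the lecture shows only the two remapping lines, so everything else below is an assumption that simply reuses mtran_coalesced with the block indices swapped for their diagonal counterparts.

    // Hypothetical diagonally mapped transpose: identical to the padded tiled
    // kernel, except tile coordinates are derived from remapped block indices.
    __global__ void mtran_diagonal(float *odata, float *idata, int n) {
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];

        // diagonal reordering of thread blocks (assumes a square grid)
        int blockIdx_y = blockIdx.x;
        int blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;

        int x = blockIdx_x * TILE_DIM + threadIdx.x;
        int y = blockIdx_y * TILE_DIM + threadIdx.y;
        int index_in = x + y*n;

        x = blockIdx_y * TILE_DIM + threadIdx.x;
        y = blockIdx_x * TILE_DIM + threadIdx.y;
        int index_out = x + y*n;

        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
            tile[threadIdx.y+i][threadIdx.x] = idata[index_in + i*n];

        __syncthreads();

        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
            odata[index_out + i*n] = tile[threadIdx.x][threadIdx.y+i];
    }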
Performance

The new implementation gives 80 GB/s.
• the performance no longer drops where we saw it previously
• it is slower for matrices whose size is not divisible by 128
• the algorithm is more complex
• we can use it only for the problematic data sizes

For a given problem, there may not be (and often there is not) an ideal algorithm for the whole range of input sizes. It is necessary to benchmark, as not all problems are easily revealed just by looking at the code.

[Figure: transpose throughput as a function of matrix size (500 to 4000), comparing the diagonal mapping with the GMEM + SMEM optimization.]

Performance Summary

All optimizations were aimed only at better accommodating HW properties.
• nevertheless, we got a 17.6x speedup
• when creating an algorithm, it is necessary to understand HW limitations
• otherwise we would not have to develop specifically for GPUs - developing a good sequential algorithm would have been just fine...

Optimization Effects

Beware of optimization effects:
• if we had taken 4096 x 4096 matrices instead of 4000 x 4000, the shared memory bank conflict removal would have brought only a marginal improvement
• after removing partition camping, the effect of memory bank conflicts becomes visible
• thus it makes sense to go from more general/substantial optimizations to the less general ones
• if some (provably correct) optimization does not result in a performance increase, we need to analyze what limits the algorithm's performance

Instructions Speed

Processing of Instructions

Processing of instructions on a multiprocessor (c.c. 1.x):
• there are 8 SP cores and 2 SFU cores
• if SP and SFU instruction processing is not overlapped, the multiprocessor can process up to 8 instructions per cycle
• one warp is thus done in 4 or more cycles
• some instructions are significantly slower
• knowledge of instruction processing times helps us design efficient code

Floating Point Operations

The GPU is designed as graphical HW:
• graphical operations mostly use floating point numbers, so they are implemented efficiently in GPUs
• newer GPUs (c.c. ≥ 1.3) can work in double precision, older ones in single precision only; the newest GPUs (c.c. ≥ 5.3) support half precision
• some arithmetic operations are used very frequently in graphics, so the GPU implements them in SFUs
• the HW implementation provides lower precision (not an issue for lots of applications)
• these fast versions are differentiated by the __ prefix

Arithmetic Operations

Floating point operations:
• addition and multiplication are very fast
• multiplication and addition may be combined into a single MAD instruction on c.c. 1.x (lower precision, throughput of 1 per cycle on an SP)
• __fadd_rn() and __fmul_rn() may be used to prevent the compiler from fusing into MAD
• MAD is replaced by FMAD on c.c. ≥ 2.0 (same speed, higher precision)
• 64-bit versions run at a fraction of single-precision speed: 1/8 (1.3), 1/2 (2.0), 1/12 (2.1), 1/24 (3.0), 1/3 (3.5), 1/32 (5.x), 1/2 (6.0), 1/32 (6.1, 6.2), 1/2 (7.0), 1/32 (7.5)
• division is relatively slow, reciprocal is faster

Transcendental functions:
• __sinf(x), __cosf(x), __expf(x)
• sinf(x), cosf(x), expf(x) are more precise but an order of magnitude slower
• other operations with different speed and precision trade-offs are implemented, see the CUDA manual

Integer operations:
• addition and multiplication are as fast as the floating point ones (fast only with 24-bit operands on c.c. 1.x)
• division and modulo are very slow, but if n is a power of 2, we can utilize
  • i/n is equivalent to i >> log2(n)
  • i%n is equivalent to i & (n - 1)
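A small illustrative snippet combining these points; the function and its parameters are hypothetical, but the intrinsics (__mul24, __sinf, __fdividef) and the shift/mask equivalences are standard CUDA.

    // Hypothetical device function illustrating the fast-arithmetic choices above.
    __device__ float fast_ops_demo(float x, float y, int i) {
        const int n = 16;                 // power of two, log2(n) = 4

        int q = i >> 4;                   // i / n without the slow integer division
        int r = i & (n - 1);              // i % n without the slow modulo

        float a = __mul24(q, r) * x;      // 24-bit integer multiply (fast on c.c. 1.x)
        float s = __sinf(a);              // fast, less precise SFU version of sinf
        return __fdividef(s, y);          // fast, less precise division
    }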
Loops

Small loops have significant overhead:
• jumps
• conditions
• control variable updates
• a significant part of the instructions may be pointer arithmetic
• low ILP

Loop unrolling is an option:
• it may partially be done by the compiler
• we can unroll manually or use #pragma unroll

Other Instructions

Other common instructions are performed at the basic speed (i.e., corresponding to the number of SPs):
• comparison
• bit operations
• memory access instructions (given the limitations discussed earlier and the memory latency/bandwidth); the offset may be a register value plus a constant for 32-bit addressing (higher overhead for 64-bit addressing)
• synchronization (unless we get blocked)

Beware of Shared Memory

If memory bank conflicts are avoided, shared memory is as fast as registers on c.c. 1.x. But beware:
• instructions can work with only one operand in shared memory
• if more than one operand of an instruction resides in shared memory, an explicit load/store is necessary
• MAD instructions run slower (c.c. 1.x):
  • a + s[i]       4 cycles per warp
  • a + a * s[i]   5 cycles per warp
  • a + b * s[i]   6 cycles per warp
• these details are not published by NVIDIA (they were revealed through measurements)
• interesting only for really performance-critical code

Newer GPUs have relatively slower shared memory (compared to register speed):
• Fermi, Maxwell, Pascal and Volta have lower bandwidth even if only one operand in shared memory is accessed
• Kepler uses only 1/2 of the available bandwidth for 32-bit accesses

C for CUDA Compilation

Device code can be compiled into PTX assembler and into binary files:
• PTX is an intermediate code, it does not correspond directly to GPU instructions
• it is easier to read
• it is harder to figure out what really happens on the GPU

Binary files may be disassembled using the cuobjdump tool:
• for GT200 and newer
• decuda for older GPUs (may not work completely reliably)
Revision of Matrix Multiplication

Naive Implementation

__global__ void mmul(float *A, float *B, float *C, int n) {
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    float tmp = 0;
    for (int k = 0; k < n; k++)
        tmp += A[y*n+k] * B[k*n+x];
    C[y*n + x] = tmp;
}

Recapitulation

Naive implementation:
• each thread computes one element of the resulting matrix
• memory-bound
• theoretical peak 66.8 GFlops
• performance depends on the thread arrangement: blocks 128 x 1 give 36.6 GFlops, blocks 1 x 128 give 3.9 GFlops

Now we understand the results:
• the theoretical maximum cannot be reached - we access GPU memory in at least 32-byte chunks, so reading from A is not efficient
• blocks 128 x 1 result in coalesced access into B, blocks 1 x 128 result in strided access

We have implemented a tiled algorithm:
• each thread block reads tiles of A and B into shared memory and exploits data locality (data are moved into shared memory once and read many times)
• theoretical peak 568 GFlops, we have reached 198 GFlops

We can improve the implementation...

Tiled Algorithm

__global__ void mmul(float *A, float *B, float *C, int n) {
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    __shared__ float As[BLOCK][BLOCK];
    __shared__ float Bs[BLOCK][BLOCK];

    float Csub = 0.0f;

    for (int b = 0; b < n/BLOCK; b++) {
        As[ty][tx] = A[(ty + by*BLOCK)*n + b*BLOCK + tx];
        Bs[ty][tx] = B[(ty + b*BLOCK)*n + bx*BLOCK + tx];
        __syncthreads();
        for (int k = 0; k < BLOCK; k++)
            Csub += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }
    C[(ty + by*BLOCK)*n + bx*BLOCK + tx] = Csub;
}

Implementation Pitfalls

    As[ty][tx] = A[(ty + by*BLOCK)*n + b*BLOCK + tx];
    Bs[ty][tx] = B[(ty + b*BLOCK)*n + bx*BLOCK + tx];
    C[(ty + by*BLOCK)*n + bx*BLOCK + tx] = Csub;

Global memory access is OK.

    Csub += As[ty][k] * Bs[k][tx];

Shared memory access is also OK:
• if the thread block x-size is a multiple of the warp size, the value As[ty][k] is broadcast
• the array Bs is read in contiguous lines, which is conflict-free

Theoretical Peak

Can we be more precise in the theoretical peak computation?
• we have used the theoretical peak of the GPU in MAD instructions (622 GFlops)
• now we know that MAD instructions with an operand in shared memory are 50% slower
• the more precise theoretical bound is therefore 415 GFlops
• our implementation is still far from that

Performance Pitfalls

What causes the performance degradation?
• overhead of kernel execution and thread creation
  • matters mainly for fast kernels, or kernels with few instructions per thread
  • threads can do more work in serial
• instruction overhead
  • pointer arithmetic, loops
  • can be reduced
• synchronization
  • may or may not be an issue
• load/store in the computation
  • two operands in SMEM per one MAD instruction

If we count the performance bound for one load per MAD with an operand in SMEM, we get 244 GFlops.
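As a hedged back-of-the-envelope reconstruction of where such bounds come from (assuming the c.c. 1.x warp timings listed earlier: 4 cycles for a register-only MAD, 6 cycles for a MAD with one shared-memory operand, and roughly 4 cycles for a shared-memory load): scaling the 622 GFlops MAD peak by 4/6 gives about 415 GFlops; charging one load to every two MADs gives 622 x 4/8 = 311 GFlops (a figure used below); charging one load to every MAD gives 622 x 4/10, roughly 250 GFlops, of the same order as the 244 GFlops bound quoted here.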
Searching for a Better Implementation

Can the number of load instructions be decreased?
• exploiting data locality in shared memory decreases global memory pressure
• exploiting data locality in registers decreases shared memory pressure
• how to do it? we reduce the number of threads and assign more work to each of them (a sketch follows below)

A thread block of size m x n will process a tile of size m x m, where m = n * k, k ∈ N.
• large m potentially increases synchronization overhead
• small m reduces shared memory locality
• small n reduces available parallelism
• we will find the values of m and n experimentally
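A sketch of how such a thread-coarsened kernel might look for the configuration discussed next (32 x 16 thread blocks computing 32 x 32 tiles, two C elements per thread, matching the Csub1/Csub2 variables in the disassembly below). The loading scheme and the macro names are assumptions; only the inner-loop statements appear in the lecture.

    #define TILE 32   // tile size (m = 32)
    #define TBY  16   // thread block y-size (n = 16), so each thread computes 2 elements

    __global__ void mmul_coarsened(float *A, float *B, float *C, int n) {
        int tx = threadIdx.x, ty = threadIdx.y;
        int bx = blockIdx.x,  by = blockIdx.y;

        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        // two C elements per thread: rows ty and ty+16 of the 32 x 32 tile
        float Csub1 = 0.0f, Csub2 = 0.0f;

        for (int b = 0; b < n/TILE; b++) {
            // each thread loads two elements of each input tile
            As[ty][tx]     = A[(by*TILE + ty)*n     + b*TILE + tx];
            As[ty+TBY][tx] = A[(by*TILE + ty+TBY)*n + b*TILE + tx];
            Bs[ty][tx]     = B[(b*TILE + ty)*n      + bx*TILE + tx];
            Bs[ty+TBY][tx] = B[(b*TILE + ty+TBY)*n  + bx*TILE + tx];
            __syncthreads();

            for (int k = 0; k < TILE; k++) {
                Csub1 += As[ty][k]     * Bs[k][tx];
                Csub2 += As[ty+TBY][k] * Bs[k][tx];
            }
            __syncthreads();
        }

        C[(by*TILE + ty)*n     + bx*TILE + tx] = Csub1;
        C[(by*TILE + ty+TBY)*n + bx*TILE + tx] = Csub2;
    }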
The best results were found for m = 32, n = 16 (32 x 16 thread blocks working with 32 x 32 tiles).
• one load per two MAD instructions results in a theoretical bound of 311 GFlops
• we measure 235.4 GFlops
• something is wrong

Code Disassembly

We focus on the inner loop

    Csub1 += As[ty][k]    * Bs[k][tx];
    Csub2 += As[ty+16][k] * Bs[k][tx];

which disassembles to

    mov.b32     $r0, s[$ofs4+0x0000]
    add.b32     $ofs4, $ofs2, 0x00000180
    mad.rn.f32  $r7, s[$ofs1+0x0008], $r0, $r7
    mad.rn.f32  $r8, s[$ofs3+0x0008], $r0, $r8

The compiler was able to use constant offsets only for As:
• the strided access into Bs generates one load and one integer add

Removing the ADD Instruction

We store the data into Bs transposed and modify the inner loop:

    Csub1 += As[ty][k]    * Bs[tx][k];
    Csub2 += As[ty+16][k] * Bs[tx][k];

After disassembling, we see there is no ADD instruction:

    mov.b32     $r0, s[$ofs4+0x0008]
    mad.rn.f32  $r6, s[$ofs3+0x0034], $r0, $r6
    mad.rn.f32  $r8, s[$ofs1+0x0008], $r0, $r8

New issue - shared memory bank conflicts
• solved by padding
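In the context of the coarsened kernel sketch above, the change might look as follows; this is a hypothetical fragment, not the lecture's code. Bs gets one padding column and is filled transposed, so reads of Bs[tx][k] across a half-warp hit distinct banks.

    __shared__ float Bs[TILE][TILE + 1];            // +1 column avoids bank conflicts

    // ... loads of As unchanged; Bs is stored transposed:
    Bs[tx][ty]     = B[(b*TILE + ty)*n     + bx*TILE + tx];
    Bs[tx][ty+TBY] = B[(b*TILE + ty+TBY)*n + bx*TILE + tx];
    __syncthreads();

    for (int k = 0; k < TILE; k++) {
        Csub1 += As[ty][k]     * Bs[tx][k];         // constant offset within a row of Bs
        Csub2 += As[ty+TBY][k] * Bs[tx][k];
    }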
The resulting speed is 276.2 GFlops.

Can We Reach Better Performance?

Our results are pretty close to the theoretical bound for one load per two MADs.
• to get better performance, the tiled algorithm has to be revised

The main issue is that we multiply two tiles stored in shared memory:
• load instructions have to be used together with the MAD instructions

Can we have only one block (tile) in shared memory?

New Tiled Algorithm

We will iteratively perform rank-1 updates of a tile of C, using a column of A and a row of B (a sketch follows below):
• columns of A are read from shared memory
• rows of B can be read one after another, so we can keep the current element in a register
• the tile of C can be stored in registers
• we work with only one operand in shared memory, so explicit loads are not needed
• the theoretical bound is now given by the speed of a MAD instruction with an operand in shared memory: 415 GFlops
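A sketch of one possible realization of this scheme; the lecture does not give the code, so the block configuration (64 threads per block, each computing one 16-element column of a 16 x 64 tile of C), the tile orientation, and all identifiers below are assumptions that merely follow the description above.

    #define TILE_K 16   // tile of A staged in shared memory is TILE_K x TILE_K
    #define TILE_N 64   // columns of the C tile, one per thread

    // assumed launch: dim3 grid(n/TILE_N, n/TILE_K), block(TILE_N); n divisible by both
    __global__ void mmul_rank1(float *A, float *B, float *C, int n) {
        __shared__ float As[TILE_K][TILE_K];    // 16 x 16 tile of A in shared memory
        float c[TILE_K];                        // one column of the C tile per thread

        int tx   = threadIdx.x;                 // 0..63
        int row0 = blockIdx.y * TILE_K;         // first row of the C tile
        int col  = blockIdx.x * TILE_N + tx;    // column of C computed by this thread

        for (int i = 0; i < TILE_K; i++) c[i] = 0.0f;

        for (int kb = 0; kb < n; kb += TILE_K) {
            // 64 threads cooperatively load the 16 x 16 tile of A (4 coalesced steps)
            for (int i = tx; i < TILE_K*TILE_K; i += blockDim.x)
                As[i / TILE_K][i % TILE_K] = A[(row0 + i / TILE_K)*n + kb + i % TILE_K];
            __syncthreads();

            // rank-1 updates: one row of B at a time, its element kept in a register
            for (int k = 0; k < TILE_K; k++) {
                float b = B[(kb + k)*n + col];
                for (int i = 0; i < TILE_K; i++)
                    c[i] += As[i][k] * b;       // one shared-memory operand per MAD
            }
            __syncthreads();
        }

        for (int i = 0; i < TILE_K; i++)
            C[(row0 + i)*n + col] = c[i];
    }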
The best-performing configuration:
• matrix A is read in 16 x 16 tiles stored in shared memory
• matrix B is read in 64 x 1 tiles stored in registers
• tiles of matrix C have size 64 x 16 and are stored in registers

The measured performance is 375 GFlops.

Summary

Implementation                         performance   rel. speedup   abs. speedup
Naive, blocks 1 x 128                  3.9 GFlops    -              -
Naive, blocks 128 x 1                  36.6 GFlops   9.4x           9.4x
Tiled algorithm                        198 GFlops    5.4x           51x
Thread blocks 32 x 16, tiles 32 x 32   235 GFlops    1.19x          60x
Removing the ADD instruction           276 GFlops    1.17x          71x
One block in shared memory             375 GFlops    1.36x          96x

• The most relevant optimizations are exploiting memory locality and basic memory access optimization.
• Finer optimizations are relatively challenging; they matter for really performance-critical codes.