Matrix Transposition Instructions Speed Revision of Matrix Multiplication oooooooooooooooooo ooooooooo ooooooooooooooo GPU Hardware Performance I J-W I- ■ I " "V in Fihpovic Fall 2022 Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication •ooooooooooooooooo ooooooooo ooooooooooooooo Recapitulation Global memory o warp should access data coalesced • thread blocks should prevent partition camping Shared memory • threads in warp should access different banks (or the same data) All memories • sufficient occupancy needed to hide memory latencies □ fiP Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication O0OOOOOOOOOOOOOOOO ooooooooo ooooooooooooooo Matrix Transposition From theoretical perspective: • a trivial problem • a trivial parallelization • trivially limited by the memory throughput (no arithmetic ops done) __global__ void mtran(float *odata, float* idata, int n){ int x = blockldx.x * blockDim.x + threadldx.x; int y = blockldx.y * blockDim.y + threadldx.y; odata[x*n + y] = idata[y*n + x]; Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication 00*000000000000000 ooooooooo ooooooooooooooo Performance When running the code on GeForce GTX 280 with large enough matrix 4000 x 4000, the throughput will be 5.3GB/s Where's the problem? □ s Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OO0OOOOOOOOOOOOOOO ooooooooo ooooooooooooooo Performance When running the code on GeForce GTX 280 with large enough matrix 4000 x 4000, the throughput will be 5.3GB/s Where's the problem? Access to odata is interleaved. After modification (copy instead of transpose matrices): odata[y*n + x] = idata[y*n + x]; the throughput is 112.4 GB/s. If idata is accessed in an interleaved way too, the resulting throughput would be 2.7 GB/s. Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOO0OOOOOOOOOOOOOO ooooooooo ooooooooooooooo On Removing Interleaving The matrix can be processed per tiles 9 we read the tile into the shared memory row-wise • we will store its transposition into the global memory row-wise • thus having both reading and writing without interleaving Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOO0OOOOOOOOOOOOOO ooooooooo ooooooooooooooo On Removing Interleaving The matrix can be processed per tiles 9 we read the tile into the shared memory row-wise • we will store its transposition into the global memory row-wise • thus having both reading and writing without interleaving What size of tiles should be used? o lets consider square tiles for simplicity • for aligned reading, the row size has to be multiple of 16 • we can consider tile sizes of 16 x 16, 32 x 32, and 48 x 48 because of shared memory size limitations • best size can be determined experimentally Jiří Filipovič GPU Hardware Performance II Matrix Transposition oooo«ooooooooooooo Tiled Transposition Instructions Speed ooooooooo Revision of Matrix Multiplication ooooooooooooooo __global__ void mtran_coalesced(float *odata, float *idata, int n) { __shared__ float tile[TILE_DIM][TILE_DIM]; int x = blockldx.x * TILE_DIM + threadldx.x; int y = blockldx.y * TILE_DIM + threadldx.y; int index_in = x + y*n; x = blockldx.y * TILE_DIM + threadldx.x; y = blockldx.x * TILE_DIM + threadldx.y; int index_out = x + y*n; for (int i = 0; i < TILE_DIM; i += BL0CK_R0WS) tile[threadldx.y+i][threadldx.x] = idata[index_in+i*n]; __syncthreads(); for (int i = 0; i < TILE_DIM; i += BL0CK_R0WS) odata[index_out+i*n] = tile[threadldx.x][threadldx.y+i]; } Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication ooooo«oooooooooooo ooooooooo ooooooooooooooo Performance The highest performance was measured for 32 x 32 tile size and 32 x 8 thread block size - 75.1 GB/s Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication ooooo«oooooooooooo ooooooooo ooooooooooooooo Performance The highest performance was measured for 32 x 32 tile size and 32 x 8 thread block size - 75.1 GB/s • that's significantly better but still less than simple copying Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication ooooo«oooooooooooo ooooooooo ooooooooooooooo Performance The highest performance was measured for 32 x 32 tile size and 32 x 8 thread block size - 75.1 GB/s • that's significantly better but still less than simple copying • the kernel is more complex, contains synchronization • we need to figure out whether we got the maximum or there's still a problem somewhere Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication ooooo«oooooooooooo ooooooooo ooooooooooooooo Performance The highest performance was measured for 32 x 32 tile size and 32 x 8 thread block size - 75.1 GB/s • that's significantly better but still less than simple copying • the kernel is more complex, contains synchronization • we need to figure out whether we got the maximum or there's still a problem somewhere 9 if we only copy within the blocks, we get 94.9GB/s 9 something is still sub-optimal Jiří Filipovič GPU Hardware Performance II When reading from the global memory, we write into the shared memory row-wise tile[threadldx.y+i][threadldx.x] = idata[index_in+i*n]; Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOO0OOOOOOOOOOO ooooooooo ooooooooooooooo Shared Memory When reading from the global memory, we write into the shared memory row-wise t ile [ threadldx . y+i ] [ threadldx . x ] = idata [ index_in+i*n ] ; When writing to the global memory, we read from the shared memory column-wise odata[index_out+i*n] = tile[threadldx.x][threadldx.y+i ] ; That's reading with interleaving which is multiple of 16, the whole column is in a single memory bank - thus creaing 16-way bank conflict. Jiří Filipovič GPU Hardware Performance II Matrix Transposition OOOOOO0OOOOOOOOOOO Instructions Speed ooooooooo Revision of Matrix Multiplication ooooooooooooooo Shared Memory When reading from the global memory, we write into the shared memory row-wise t ile [ threadldx . y+i ] [ threadldx . x ] = idata[index_in+i*n ] ; When writing to the global memory, we read from the shared memory column-wise odata[index_out+i*n] = tile[threadldx.x][threadldx.y+i ] ; That's reading with interleaving which is multiple of 16, the whole column is in a single memory bank - thus creaing 16-way bank conflict. A solution is padding: __shared__ float tile[TILE_DIM][TILE_DIM + 1]; □ S1 Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOO0OOOOOOOOOO ooooooooo ooooooooooooooo Performance Now our implementations shows 93.4 GB/s. 9 as good as simple copying • it seems we can't do much better for given matrix o beware of different input data sizes (partition camping) Jiří Filipovič GPU Hardware Performance II Matrix Transposition OOOOOOOO0OOOOOOOOO Instructions Speed ooooooooo Revision of Matrix Multiplication ooooooooooooooo Performance 500 1000 1500 £000 £500 velikost matice 3000 3500 4000 Jiří Filipovič GPU Hardware Performance II Matrix Transposition OOOOOOOOO0OOOOOOOO Instructions Speed ooooooooo Revision of Matrix Multiplication ooooooooooooooo Performance 500 1000 1500 £000 £500 3000 3500 4000 velikost matice Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOO0OOOOOOO ooooooooo ooooooooooooooo Performance Drops The performance drops for some size and the behavior is regular Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOO0OOOOOOO ooooooooo ooooooooooooooo Performance Drops The performance drops for some size and the behavior is regular o for matrices sized multiple of 512, we only get 19GB/s for other matrices sized multiple of 256, we only get 35GB/s for other matrices sized multiple of 128, we only get 62GB/s Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication ooooooooooo«oooooo ooooooooo ooooooooooooooo Performance Drops One memory region has width of 2 tiles (256 B / 4 B per float, 32 floats in a tile). If we analyze tiles placement w.r.t. matrix size, we learn that • with multiple of 512 size, the tiles in the same column are in the same region • with multiple of 256 size, each column is at most in two regions • with multiple of 128, each column is at most in four regions We have discovered partition camping. Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication oooooooooooo«ooooo ooooooooo ooooooooooooooo How to Remove Partition Camping? We can pad matrices and avoid bad matrix sizes. • more complicated work with such implementation (all kernels accessing matrix have to implement padding, or we need to convert matrix) • it occupies more memory Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication oooooooooooo«ooooo ooooooooo ooooooooooooooo How to Remove Partition Camping? We can pad matrices and avoid bad matrix sizes. • more complicated work with such implementation (all kernels accessing matrix have to implement padding, or we need to convert matrix) • it occupies more memory We can change the mapping of thread blocks id's on matrix tiles • diagonal mapping ensures access to different regions int blockIdx_y = blockldx.x; int blockIdx_x = (blockldx.x+blockldx.y) % gridDim.x; Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOO0OOOO ooooooooo ooooooooooooooo Performance New implementation gives 80GB/s • performance doesn't drop where we saw it previously • slower for matrices of size not divisible by 128 • the algorithm is more complex • we can use it only for the problematic data sizes For given problem, there may not be (and often there is not) an ideal algorithm for the whole input data size range. It is necessary to benchmark as not all the problems are easily revealed just by looking at the code. Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOO0OOO ooooooooo ooooooooooooooo Performance 100 diagonálni mapováni optimalizace GMEH + SMEM n n m n n n h n rati [\f\ 500 1000 1500 £000 £500 velikost matice 3000 3500 4000 Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOO0OO ooooooooo ooooooooooooooo Performance 500 1000 1500 £000 £500 3000 3500 4000 velikost matice Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOO0O ooooooooo ooooooooooooooo Performance Summary All optimizations were only toward better use of HW properties 9 however, we got 17.6x speedup • when creating an algorithm, it is necessary to understand HW limitations • otherwise we wouldn't have to develop specifically for GPUs -developing a good sequential algorithm would have been just fine.. . Jiří Filipovič GPU Hardware Performance II Matrix Transposition ooooooooooooooooo« Optimizations Effects Beware of optimization effects 9 if we took 4096 x 4096 matrices instead of 4000 x 4000, the shared memory bank conflict removal would have been just marginal • after removing partition camping, the effect of memory bank conflicts becomes visible • thus it makes sense to go from more general/substantial optimizations to the less general ones o if some (provably correct) optimization does not result in performance increase, we need to analyze, what limits the algorithm performance Jiří Filipovič GPU Hardware Performance II Instructions Speed ooooooooo Revision of Matrix Multiplication ooooooooooooooo Matrix Transposition Instructions Speed Revision of Matrix Multiplication oooooooooooooooooo «00000000 ooooooooooooooo Processing of Instructions Processing of instructions on a multiprocessor (c. c. 1.x) • there are 8 SP cores and 2 SFU cores • if the SP and SPU instruction finalization is not overlapped, the multiprocessor can process up to 8 instructions per cycle o one warp is thus done in 4 or more cycles • some instructions are significantly slower • knowledge of instruction processing time helps us to design efficient code Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO O0OOOOOOO ooooooooooooooo Floating Point Operations GPU is designed as a graphical HW • graphical operations mostly use floating point numbers o efficiently implemented in GPUs 9 most GPUs (c. c. > 1.3) can work in double precision while older ones in single precision only, new GPUs (c.c. > 5.3) supports half-precision • some arithmetic operations are used very frequently in graphics • GPU implements them in SFUs • HW implementation provides lower precision (not an issue for lots of applications) • differentiated using prefix Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OO0OOOOOO ooooooooooooooo Arithmetic Operations Floating point operations o addition and multiplication are very fast • multiplication and addition may be combined into a single MAD instruction for c. c. 1.x • lower precision • 1 cycle speed on SP • —fadd-rn() and -JmuLrnQ may be used to enforce avoiding MAD instruction during compilation • MAD is replaced by FMAD for c. c. > 2.0 (the same speed, higher precision) • 64b versions at lower speed: 1/8 (1.3), 1/2 (2.0), 1/12 (2.1), 1/24 (3.0), 1/3 (3.5), 1/32 (5.x), 1/2 (6.0), 1/32 (6.1, 6.2), 1/2 (7.0), 1/32 (7.5), 1/2 (8.0), 1/32 (8.6) • division is relatively slow, reciprocal is faster Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication oooooooooooooooooo ooo«ooooo ooooooooooooooo Arithmetic Operations Transcendental functions • sinf(x), —cosf(x), ^expf(x) 9 sinf(x), cosf(x), expf(x) more precise but an order of magnitude slower o other operations with different speed and precision trade-offs are implemented, see CUDA manual Integer operations o addition and multiplication as for the floating point ops (fast with 24-bit only on c.c. 1.x) • division and modulo are very slow, but if n is power of 2, we can utilize • i/n is equivalent to / >> log2(n) • i%n is equivalent to /&(a? — 1) Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication oooooooooooooooooo oooo«oooo ooooooooooooooo Loops Small loops have significant overhead • jumps • conditions • control variable updates • significant part of instructions may be pointer arithmetics • low I LP Loop unrolling is an option o partially may be done by the compiler • we can do manual unrolling or use #pragma unroll Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOO0OOO ooooooooooooooo Other Instructions Other common instructions are performed at the basic speed (i.e., correspond to number of SPs) 9 comparison • bit operations 9 memory access instructions (given the limitations discussed earlier and memory latency/bandwidth) • the offset may be register value + constant for 32-bit addressing (higher overhead for 64-bit addressing) 9 synchronization (unless we get blocked) Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOO0OO ooooooooooooooo Beware of Shared Memory If memory bank conflict is avoided, the shared memory is as fast as registers at c.c. 1.x But beware • instructions can work with only one operand in the shared memory • if more than one operands in shared memory are used for one instruction, explicit load/store is necessary • MAD instructions run slower (c.c. 1.x) • a + s[i] 4 cycles per warp 9 a + a * s[i] 5 cycles per warp o a + b * s[i] 6 cycles per warp • these details are not published by NVIDIA (revealed through measurements) • interesting only for really performance-critical code Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOO0O ooooooooooooooo Beware of Shared Memory Newer GPUs have relatively slower shared memory (comparing to register speed) • Fermi and newer have lower bandwidth even if only one operand in shared memory is accessed o Kepler uses only 1/2 of available bandwidth for 32-bit access Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication oooooooooooooooooo oooooooo« ooooooooooooooo C for CUDA Compilation Device code can be compiled into PTX assembler and binary files • PTX is intermediate code, does not correspond directly to GPU instructions • easier to read • harder to figure out what really happens on GPU Binary files may be disassembled using cuobjdump tool • for GT200 and newer • decuda for older GPUs (may not work completely reliably) Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO «00000000000000 Naive Implementation global__ void mmul(float *A, float *B, float *C, int n){ int x = blockldx.x*blockDim.x + threadldx.x; int y = blockldx.y*blockDim.y + threadldx.y; float tmp = 0; for (int k = 0; k < n; k++) tmp += A[y*n+k] * B[k*n+x]; C[y*n + x] = tmp; Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO O0OOOOOOOOOOOOO Recapitulation Naive implementation • each thread computes one element of the resulting matrix 9 memory-bound o theoretical peak 66.8GFIops • performance depends on threads arrangement - blocks 128 x 1: 36.6GFIops, blocks 1 x 128: 3.9GFIops Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO O0OOOOOOOOOOOOO Recapitulation Naive implementation • each thread computes one element of the resulting matrix 9 memory-bound 9 theoretical peak 66.8GFIops • performance depends on threads arrangement - blocks 128 x 1: 36.6GFIops, blocks 1 x 128: 3.9GFIops Now, we understand the results 9 theoretical maximum cannot be reached - we access GPU memory in at least 32-byte chunks, so reading from A is not efficient 9 blocks 128 x 1 result in coalesced access into B, blocks 1 x 128 result in strided access Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO OO0OOOOOOOOOOOO Recapitulation We have implemented a tiled algorithm o each thread block read tiles from A, B into shared memory, exploit data locality (data are moved into shared memory once and read many times) • theoretical peak 568GFIops, we have reached 198GFIops We can try to improve the implementation. Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO OOO0OOOOOOOOOOO Tiled Algorithm global__ void mmul(float *A, float *B, int bx = blockldx.x; int by = blockldx.y; int tx = threadldx.x; int ty = threadldx.y; __shared__ float As[BLOCK][BLOCK ] ; __shared__ float Bs[BLOCK][BLOCK ] ; float *C, int n){ float Csub = O.Of; for (int b = 0; b < n/BLOCK; b++){ As[ty][tx] = A[(ty + by*BL0CK)*n + b*BL0CK+tx]; Bs[ty][tx] = B[(ty + b*BL0CK)*n + bx*BLOCK+tx ] ; __syncthreads () ; } for (int k = 0; k < BLOCK; k++) Csub += As[ty][k]* Bs[k][tx]; __syncthreads () ; } C[(ty + by*BL0CK)*n + bx*BL0CK+tx] = Csub; Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO OOOO0OOOOOOOOOO Implementation Pitfalls As[ty][tx] = A[(ty + by*BL0CK)*n + b*BLOCK+tx ] ; Bs[ty][tx] = B[(ty + b*BL0CK)*n + bx*BLOCK+tx ] ; C[(ty + by*BL0CK)*n + bx*BLOCK+tx] = Csub ; Global memory access is OK. Csub += As [ ty ] [ k] * Bs [k ] [ tx ] ; Also shared memory access is OK. • if a thread block x-size is multiple of warp size, variable As is broadcasted 9 array Bs is read in contiguous lines, which is conflict-free Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication oooooooooooooooooo ooooooooo ooooo«ooooooooo Theoretical Peak Can we be more precise in theoretical peak computation? • we have used a theoretical peak of GPU in MAD instructions (622GFIops) o now, we know that MAD instructions with operand in shared memory are 50% slower • the more precise theoretical bound is 415GFIops o our implementation is still far from that Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication oooooooooooooooooo ooooooooo oooooo«oooooooo Performance Pitfalls What causes performance degradation? 9 overhead of kernel execution, thread creation o mainly for fast kernels, or kernels with a few instructions per thread • threads can do more work in serial • instruction overhead • pointer arithmetics, loops • can be reduced • synchronization • may or may not be an issue • load/store in computation • two operands in SMEM per one MAD instruction If we count the performance bound for one load per MAD with operand in SMEM, we get 244GFIops. Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO OOOOOOO0OOOOOOO Searching for Better Implementation Can be a number of load instructions decreased? Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO OOOOOOO0OOOOOOO Searching for Better Implementation Can be a number of load instructions decreased? • exploiting data locality in shared memory decreases global memory pressure Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO OOOOOOO0OOOOOOO Searching for Better Implementation Can be a number of load instructions decreased? • exploiting data locality in shared memory decreases global memory pressure • exploiting data locality in registers decreases shared memory pressure Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO OOOOOOO0OOOOOOO Searching for Better Implementation Can be a number of load instructions decreased? • exploiting data locality in shared memory decreases global memory pressure • exploiting data locality in registers decreases shared memory pressure • how to do it? we reduce number of threads and assign more work to them Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO OOOOOOO0OOOOOOO Searching for Better Implementation Can be a number of load instructions decreased? • exploiting data locality in shared memory decreases global memory pressure • exploiting data locality in registers decreases shared memory pressure • how to do it? we reduce number of threads and assign more work to them Thread block of size m x n will process tile of size m x m, where m = n • k; k 6 N. 9 large m potentially increases synchronization overhead 9 small m reduces shared memory locality 9 small n reduces available parallelism • we will find value for m and n experimentally Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO OOOOOOOO0OOOOOO Searching for Better Implementation Best results found for m = 32, n = 16 (32 x 16 blocks working with 32 x 32 tiles). • one load for two MAD instructions results in theoretical bound 311 GFIops • we have 235.4 GFIops • something is wrong Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO OOOOOOOOO0OOOOO Code disassembly We focus on the inner loop Csubl += As [ty ] [k]*Bs [k] [ tx ] ; Csub2 += As [ty + 16][k]*Bs [k] [tx] ; mov.b32 $r0, s[$ofs4+0x0000] add.b32 $ofs4 , $ofs2 , 0x00000180 mad.rn.f32 $r7, s[$ofs1+0x0008], $r0, $r7 mad.rn.f32 $r8, s[$ofs3+0x0008], $r0, $r8 Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO OOOOOOOOO0OOOOO Code disassembly We focus on the inner loop Csubl += As [ty ] [k]*Bs [k] [ tx ] ; Csub2 += As [ty + 16][k]*Bs [k] [tx] ; mov.b32 $r0, s[$ofs4+0x0000] add.b32 $ofs4 , $ofs2 , 0x00000180 mad.rn.f32 $r7, s[$ofs1+0x0008], $r0, $r7 mad.rn.f32 $r8, s[$ofs3+0x0008], $r0, $r8 Compiler was able to use constant offsets only for As 9 strided access into Bs generates one load and one integer add Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO OOOOOOOOOO0OOOO Removing the ADD instruction We store transposed data into Bs and modify the inner loop Csubl += As [ty ] [k]*Bs [tx ] [k ] ; Csub2 += As [ty + 16][k]*Bs [tx] [k] ; Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO OOOOOOOOOO0OOOO Removing the ADD instruction We store transposed data into Bs and modify the inner loop Csubl += As [ty ] [k]*Bs [tx ] [k ] ; Csub2 += As [ty + 16][k]*Bs [tx] [k] ; After disassembling, we se there is no ADD instruction mov.b32 $r0, s[$ofs4+0x0008] mad.rn.f32 $r6 , s[$ofs3+0x0034] , $r0 , $r6 mad.rn.f32 $r8, s[$ofs1+0x0008], $r0, $r8 □ g - = Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO OOOOOOOOOO0OOOO Removing the ADD instruction We store transposed data into Bs and modify the inner loop Csubl += As [ty ] [k]*Bs [tx ] [k ] ; Csub2 += As [ty + 16][k]*Bs [tx] [k] ; After disassembling, we se there is no ADD instruction mov.b32 $r0, s[$ofs4+0x0008] mad.rn.f32 $r6 , s[$ofs3+0x0034] , $r0 , $r6 mad.rn.f32 $r8, s[$ofs1+0x0008], $r0, $r8 New issue - memory bank conflicts □ s> - = Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO OOOOOOOOOO0OOOO Removing the ADD instruction We store transposed data into Bs and modify the inner loop Csubl += As [ty ] [k]*Bs [tx ] [k ] ; Csub2 += As [ty + 16][k]*Bs [tx] [k] ; After disassembling, we se there is no ADD instruction mov.b32 $r0, s[$ofs4+0x0008] mad.rn.f32 $r6 , s[$ofs3+0x0034] , $r0 , $r6 mad.rn.f32 $r8, s[$ofs1+0x0008], $r0, $r8 New issue - memory bank conflicts • solved by padding □ s> - = Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO OOOOOOOOOO0OOOO Removing the ADD instruction We store transposed data into Bs and modify the inner loop Csubl += As [ty ] [k]*Bs [tx ] [k ] ; Csub2 += As [ty + 16][k]*Bs [tx] [k] ; After disassembling, we se there is no ADD instruction mov.b32 $r0, s[$ofs4+0x0008] mad.rn.f32 $r6 , s[$ofs3+0x0034] , $r0 , $r6 mad.rn.f32 $r8, s[$ofs1+0x0008], $r0, $r8 New issue - memory bank conflicts • solved by padding Resulting speed: 276.2 GFIops. □ s> - = Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO OOOOOOOOOOO0OOO Can we Reach Better Performance? Our results are petty close to theoretical bound for one load per two MADs. o to get better performance, tiled algorithm has to be revised Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO OOOOOOOOOOO0OOO Can we Reach Better Performance? Our results are petty close to theoretical bound for one load per two MADs. o to get better performance, tiled algorithm has to be revised The main issue is that we multiply two tiles in shared memory • need of usage load instructions together with MAD instructions Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication OOOOOOOOOOOOOOOOOO OOOOOOOOO OOOOOOOOOOO0OOO Can we Reach Better Performance? Our results are petty close to theoretical bound for one load per two MADs. o to get better performance, tiled algorithm has to be revised The main issue is that we multiply two tiles in shared memory • need of usage load instructions together with MAD instructions Can we have only one block in shared memory? Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication oooooooooooooooooo ooooooooo oooooooooooo«oo New Tiled Algorithm We will iteratively perform rank-1 update of tiles in C using column in A and row in B Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication oooooooooooooooooo ooooooooo oooooooooooo«oo New Tiled Algorithm We will iteratively perform rank-1 update of tiles in C using column in A and row in B • columns in A are read from shared memory Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication oooooooooooooooooo ooooooooo oooooooooooo«oo New Tiled Algorithm We will iteratively perform rank-1 update of tiles in C using column in A and row in B • columns in A are read from shared memory o rows in B can be read one after another, so we can use register to do so Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication oooooooooooooooooo ooooooooo oooooooooooo«oo New Tiled Algorithm We will iteratively perform rank-1 update of tiles in C using column in A and row in B • columns in A are read from shared memory o rows in B can be read one after another, so we can use register to do so • tile in C can be stored in registers Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication oooooooooooooooooo ooooooooo oooooooooooo«oo New Tiled Algorithm We will iteratively perform rank-1 update of tiles in C using column in A and row in B • columns in A are read from shared memory o rows in B can be read one after another, so we can use register to do so • tile in C can be stored in registers • we work in only one operand in shared memory, so explicit loads are not needed Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication oooooooooooooooooo ooooooooo oooooooooooo«oo New Tiled Algorithm We will iteratively perform rank-1 update of tiles in C using column in A and row in B • columns in A are read from shared memory o rows in B can be read one after another, so we can use register to do so • tile in C can be stored in registers • we work in only one operand in shared memory, so explicit loads are not needed 9 theoretical bound is now done by speed of MAD instruction with operand in shared memory: 415GFIops Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication oooooooooooooooooo ooooooooo ooooooooooooo«o New Tiled Algorithm The best-performing configuration: • matrix A read by 16 x 16 tiles, stored in shared memory • matrix B read by 64 x 1 tiles, stored in registers • tiles of matrix C have 64 x 16 size, they are stored in registers Jiří Filipovič GPU Hardware Performance II Matrix Transposition Instructions Speed Revision of Matrix Multiplication oooooooooooooooooo ooooooooo ooooooooooooo«o New Tiled Algorithm The best-performing configuration: • matrix A read by 16 x 16 tiles, stored in shared memory • matrix B read by 64 x 1 tiles, stored in registers • tiles of matrix C have 64 x 16 size, they are stored in registers The measured performance is 375GFIops. Jiří Filipovič GPU Hardware Performance II Matrix Transposition oooooooooooooooooo Instructions Speed ooooooooo Revision of Matrix Multiplication oooooooooooooo* Summary Implementation performance rel. A abs. A Naive, blocks 1 x 128 3.9GFIops Naive 36.6 GFIops 9.4x 9.4x Tiled algorithm 198GFIops 5.4x 51x Thread blocks 32 x 16, tiles 32 x 32 235 GFIops 1.19x 60 x Removing ADD instruction 276 GFIops 1.17x 71x One block in shared memory 375 GFIops 1.36x 96 x Jiří Filipovič GPU Hardware Performance II Matrix Transposition oooooooooooooooooo Instructions Speed ooooooooo Revision of Matrix Multiplication oooooooooooooo* Summary Implementation performance rel. A abs. A Naive, blocks 1 x 128 3.9GFIops Naive 36.6 GFIops 9.4x 9.4x Tiled algorithm 198GFIops 5.4x 51x Thread blocks 32 x 16, tiles 32 x 32 235 GFIops 1.19x 60 x Removing ADD instruction 276 GFIops 1.17x 71x One block in shared memory 375 GFIops 1.36x 96 x • The most relevant is exploiting memory locality and basic memory access optimization. Jiří Filipovič GPU Hardware Performance II Matrix Transposition oooooooooooooooooo Instructions Speed ooooooooo Revision of Matrix Multiplication oooooooooooooo* Summary Implementation performance rel. A abs. A Naive, blocks 1 x 128 3.9GFIops Naive 36.6 GFIops 9.4x 9.4x Tiled algorithm 198GFIops 5.4x 51x Thread blocks 32 x 16, tiles 32 x 32 235 GFIops 1.19x 60 x Removing ADD instruction 276 GFIops 1.17x 71x One block in shared memory 375 GFIops 1.36x 96 x • The most relevant is exploiting memory locality and basic memory access optimization. o Finer optimizations are relatively challenging, important for really performance critical codes. Jiří Filipovič GPU Hardware Performance II