GPU Hardware Performance II
Jiří Filipovič
Fall 2024

Matrix Transposition

Recapitulation

Global memory
  • a warp should access data coalesced
  • thread blocks should avoid partition camping
Shared memory
  • threads in a warp should access different banks (or the same data)
All memories
  • sufficient occupancy is needed to hide memory latencies

Matrix Transposition

From a theoretical perspective:
  • a trivial problem
  • a trivial parallelization
  • trivially limited by memory throughput (no arithmetic operations are done)

__global__ void mtran(float *odata, float *idata, int n){
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    odata[x*n + y] = idata[y*n + x];
}

Performance

When running the code on a GeForce GTX 280 with a large enough matrix (4000 x 4000), the throughput is 5.3 GB/s.
Where's the problem?

Access to odata is interleaved (strided). After a modification (copying instead of transposing):

odata[y*n + x] = idata[y*n + x];

the throughput is 112.4 GB/s. If idata were accessed in an interleaved way too, the resulting throughput would be 2.7 GB/s.
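Such effective-throughput numbers can be reproduced with a simple host-side measurement. A minimal sketch using CUDA events follows; the matrix size, block shape, and the structure of main() are illustrative, not taken from the original measurement code.

// Sketch: measuring the effective throughput of the naive transpose with
// CUDA events. The uninitialized buffers are fine for a bandwidth test.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void mtran(float *odata, float *idata, int n) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    odata[x*n + y] = idata[y*n + x];
}

int main() {
    const int n = 4000;                       // matrix is n x n
    size_t bytes = (size_t)n * n * sizeof(float);
    float *idata, *odata;
    cudaMalloc(&idata, bytes);
    cudaMalloc(&odata, bytes);

    dim3 block(16, 16);
    dim3 grid(n / block.x, n / block.y);      // assumes n is a multiple of 16

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    mtran<<<grid, block>>>(odata, idata, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    // each element is read once and written once => 2 * bytes moved
    double gbps = 2.0 * bytes / (ms * 1e-3) / 1e9;
    printf("effective throughput: %.1f GB/s\n", gbps);

    cudaFree(idata); cudaFree(odata);
    return 0;
}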
On Removing Interleaving

The matrix can be processed per tiles:
  • we read a tile into shared memory row-wise
  • we store its transposition into global memory row-wise
  • thus both reading and writing are done without interleaving

What size of tiles should be used?
  • let's consider square tiles for simplicity
  • for aligned reading, the row size has to be a multiple of 16
  • we can consider tile sizes of 16 x 16, 32 x 32, and 48 x 48 because of shared memory size limitations
  • the best size can be determined experimentally

Tiled Transposition

__global__ void mtran_coalesced(float *odata, float *idata, int n){
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = x + y*n;
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = x + y*n;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*n];

    __syncthreads();

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[index_out+i*n] = tile[threadIdx.x][threadIdx.y+i];
}

Performance

The highest performance was measured for 32 x 32 tiles and 32 x 8 thread blocks: 75.1 GB/s.
  • that is significantly better, but still less than simple copying
  • the kernel is more complex and contains synchronization
  • we need to figure out whether we have reached the maximum or there is still a problem somewhere
  • if we only copy within the blocks, we get 94.9 GB/s
  • something is still sub-optimal

Shared Memory

When reading from global memory, we write into shared memory row-wise:

tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*n];

When writing to global memory, we read from shared memory column-wise:

odata[index_out+i*n] = tile[threadIdx.x][threadIdx.y+i];

That is reading with a stride that is a multiple of 16: the whole column lies in a single memory bank, creating a 16-way bank conflict.

A solution is padding:

__shared__ float tile[TILE_DIM][TILE_DIM + 1];
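Folding the padding into the tiled kernel gives the following variant (a sketch; the name mtran_padded and the fixed TILE_DIM/BLOCK_ROWS values are illustrative). With 16 banks and 4-byte words, a column of the unpadded 32-wide tile has a stride of 32 floats, a multiple of 16, so all its elements map to the same bank; the extra column changes the stride to 33 and spreads consecutive rows over different banks.

// Sketch: tiled transpose with a padded shared-memory tile.
#define TILE_DIM 32
#define BLOCK_ROWS 8

__global__ void mtran_padded(float *odata, float *idata, int n) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 column of padding

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = x + y*n;
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = x + y*n;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*n];
    __syncthreads();
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[index_out+i*n] = tile[threadIdx.x][threadIdx.y+i];
}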
Performance

Now our implementation shows 93.4 GB/s.
  • as good as simple copying
  • it seems we cannot do much better for the given matrix
  • beware of different input data sizes (partition camping)

[Figure: transposition throughput as a function of matrix size ("velikost matice" = matrix size); the throughput drops sharply at certain sizes.]

Performance Drops

The performance drops for some sizes, and the behavior is regular:
  • for matrices whose size is a multiple of 512, we only get 19 GB/s
  • for other matrices whose size is a multiple of 256, we only get 35 GB/s
  • for other matrices whose size is a multiple of 128, we only get 62 GB/s

One memory region has a width of 2 tiles (256 B / 4 B per float = 64 floats; a tile row is 32 floats). If we analyze the tile placement w.r.t. the matrix size, we learn that
  • with a size that is a multiple of 512, all tiles in the same column lie in the same region
  • with a multiple of 256, each column lies in at most two regions
  • with a multiple of 128, each column lies in at most four regions
We have discovered partition camping.

How to Remove Partition Camping?

We can pad matrices and avoid the bad matrix sizes.
  • it is more complicated to work with such an implementation (all kernels accessing the matrix have to implement the padding, or we need to convert the matrix)
  • it occupies more memory

We can change the mapping of thread block IDs onto matrix tiles: diagonal mapping ensures that concurrently running blocks access different regions (sketched below).

int blockIdx_y = blockIdx.x;
int blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;
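Combining the diagonal remapping with the padded tiled kernel gives a sketch like the following (the kernel name is illustrative; a square grid, gridDim.x == gridDim.y, is assumed so that the remapping is a bijection).

// Sketch: tiled transpose with diagonal block mapping. Blocks scheduled
// together read/write different memory partitions. Assumes a square grid
// and n being a multiple of TILE_DIM.
#define TILE_DIM 32
#define BLOCK_ROWS 8

__global__ void mtran_diagonal(float *odata, float *idata, int n) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    // diagonal reordering of block indices
    int blockIdx_y = blockIdx.x;
    int blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;

    int x = blockIdx_x * TILE_DIM + threadIdx.x;
    int y = blockIdx_y * TILE_DIM + threadIdx.y;
    int index_in = x + y*n;
    x = blockIdx_y * TILE_DIM + threadIdx.x;
    y = blockIdx_x * TILE_DIM + threadIdx.y;
    int index_out = x + y*n;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*n];
    __syncthreads();
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[index_out+i*n] = tile[threadIdx.x][threadIdx.y+i];
}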
Performance

The new implementation gives 80 GB/s.
  • the performance no longer drops where we saw it previously
  • it is slower for matrices whose size is not divisible by 128
  • the algorithm is more complex
  • we can use it only for the problematic data sizes
For a given problem, there may not be (and often there is not) an ideal algorithm for the whole range of input sizes. It is necessary to benchmark, as not all problems are easily revealed just by looking at the code.

[Figure: throughput vs. matrix size ("velikost matice") comparing the diagonal mapping ("diagonální mapování") with the GMEM + SMEM optimized version ("optimalizace GMEM + SMEM").]

Performance Summary

All optimizations were aimed only at a better use of HW properties,
  • yet we got a 17.6x speedup
  • when creating an algorithm, it is necessary to understand HW limitations
  • otherwise we would not have to develop specifically for GPUs; developing a good sequential algorithm would have been just fine...

Optimizations Effects

Beware of the interplay of optimizations:
  • if we took 4096 x 4096 matrices instead of 4000 x 4000, removing the shared memory bank conflicts would have brought only a marginal improvement
  • only after removing partition camping does the effect of memory bank conflicts become visible
  • thus it makes sense to go from the more general/substantial optimizations to the less general ones
  • if some (provably correct) optimization does not increase performance, we need to analyze what limits the algorithm's performance
Instructions Speed

Processing of Instructions

Processing of instructions on a multiprocessor (c. c. 1.x):
  • there are 8 SP cores and 2 SFU cores
  • if the SP and SFU instruction finalization is not overlapped, the multiprocessor can process up to 8 instructions per cycle
  • one warp is thus done in 4 or more cycles
  • some instructions are significantly slower
  • knowledge of instruction processing times helps us design efficient code

Floating Point Operations

GPU is designed as graphics HW:
  • graphics operations mostly use floating point numbers
  • they are efficiently implemented in GPUs
  • most GPUs (c. c. ≥ 1.3) can work in double precision, older ones in single precision only; new GPUs (c. c. ≥ 5.3) support half precision
  • some arithmetic operations are used very frequently in graphics
      • the GPU implements them in SFUs
      • the HW implementation provides lower precision (not an issue for lots of applications)
      • they are differentiated by the __ prefix

Arithmetic Operations

Floating point operations
  • addition and multiplication are very fast
      • multiplication and addition may be combined into a single MAD instruction on c. c. 1.x
          • lower precision
          • speed of 1 cycle on an SP
          • __fadd_rn() and __fmul_rn() may be used to prevent the compiler from generating MAD instructions
      • MAD is replaced by FMAD on c. c. ≥ 2.0 (the same speed, higher precision)
  • 64-bit versions run at lower speed: 1/8 (1.3), 1/2 (2.0), 1/12 (2.1), 1/24 (3.0), 1/3 (3.5), 1/32 (5.x), 1/2 (6.0), 1/32 (6.1, 6.2), 1/2 (7.0), 1/32 (7.5), 1/2 (8.0), 1/32 (8.6), 1/2 (9.0)
  • division is relatively slow, reciprocal is faster

Transcendental functions
  • __sinf(x), __cosf(x), __expf(x)
  • sinf(x), cosf(x), expf(x) are more precise but an order of magnitude slower
  • other operations with different speed and precision trade-offs are implemented, see the CUDA manual

Integer operations
  • addition and multiplication as fast as the floating point ops (fast only with 24-bit operands on c. c. 1.x)
  • division and modulo are very slow, but if n is a power of 2, we can utilize (see the sketch below)
      • i / n is equivalent to i >> log2(n)
      • i % n is equivalent to i & (n - 1)
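A small illustration of the power-of-two trick (the kernel and constant names are illustrative; the equivalences hold for non-negative integers):

// Sketch: replacing integer division and modulo by a power of two with a
// shift and a mask.
#define LOG_BLOCK 5
#define BLOCK (1 << LOG_BLOCK)            // 32

__global__ void tile_coords(int *row, int *col, int count) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < count) {
        row[idx] = idx >> LOG_BLOCK;      // equivalent to idx / BLOCK (idx >= 0)
        col[idx] = idx & (BLOCK - 1);     // equivalent to idx % BLOCK (idx >= 0)
    }
}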
Loops

Small loops have significant overhead:
  • jumps
  • conditions
  • control variable updates
  • a significant part of the instructions may be pointer arithmetic
  • low ILP
Loop unrolling is an option:
  • it may partially be done by the compiler
  • we can unroll manually or use #pragma unroll

Other Instructions

Other common instructions are performed at the basic speed (i.e., corresponding to the number of SPs):
  • comparison
  • bit operations
  • memory access instructions (given the limitations discussed earlier and the memory latency/bandwidth)
      • the offset may be a register value plus a constant for 32-bit addressing (higher overhead for 64-bit addressing)
  • synchronization (unless we get blocked)

Beware of Shared Memory

If memory bank conflicts are avoided, shared memory is as fast as registers on c. c. 1.x. But beware:
  • instructions can work with only one operand in shared memory
  • if more than one operand of an instruction resides in shared memory, an explicit load/store is necessary
  • MAD instructions run slower (c. c. 1.x):
      • a + s[i]       4 cycles per warp
      • a + a * s[i]   5 cycles per warp
      • a + b * s[i]   6 cycles per warp
  • these details are not published by NVIDIA (they were revealed through measurements)
  • interesting only for really performance-critical code

Newer GPUs have relatively slower shared memory (compared to register speed):
  • Fermi and newer have lower bandwidth even if only one operand in shared memory is accessed
  • Kepler uses only 1/2 of the available bandwidth for 32-bit accesses

C for CUDA Compilation

Device code can be compiled into PTX assembly and into binary files.
  • PTX is an intermediate code; it does not correspond directly to GPU instructions
      • easier to read
      • harder to figure out what really happens on the GPU
  • binary files may be disassembled using the cuobjdump tool
      • for GT200 and newer
      • decuda for older GPUs (may not work completely reliably)

Revision of Matrix Multiplication

Naive Implementation

__global__ void mmul(float *A, float *B, float *C, int n){
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    float tmp = 0;
    for (int k = 0; k < n; k++)
        tmp += A[y*n+k] * B[k*n+x];
    C[y*n + x] = tmp;
}

Recapitulation

Naive implementation
  • each thread computes one element of the resulting matrix
  • memory-bound
  • theoretical peak 66.8 GFlops
  • performance depends on the thread arrangement: blocks 128 x 1 give 36.6 GFlops, blocks 1 x 128 give 3.9 GFlops (see the launch sketch below)
Now we understand the results:
  • the theoretical maximum cannot be reached: we access GPU memory in at least 32-byte chunks, so reading from A is not efficient
  • blocks 128 x 1 result in coalesced access to B, blocks 1 x 128 result in strided access
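For concreteness, the two arrangements correspond to launch configurations like these (a sketch; launch_both is a hypothetical helper, n is assumed to be a multiple of 128, and A, B, C are device pointers to the mmul kernel above):

// Sketch: the two thread-block shapes compared above for the naive mmul kernel.
void launch_both(float *A, float *B, float *C, int n) {
    dim3 block_fast(128, 1);            // warp threads vary in x
    dim3 grid_fast(n / 128, n);         // consecutive x -> coalesced access to B[k*n+x]
    mmul<<<grid_fast, block_fast>>>(A, B, C, n);

    dim3 block_slow(1, 128);            // warp threads vary in y
    dim3 grid_slow(n, n / 128);         // access to B is strided by n
    mmul<<<grid_slow, block_slow>>>(A, B, C, n);
}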
Recapitulation

We have implemented a tiled algorithm:
  • each thread block reads tiles of A and B into shared memory and exploits data locality (data are moved into shared memory once and read many times)
  • theoretical peak 568 GFlops, we have reached 198 GFlops
We can try to improve the implementation.

Tiled Algorithm

__global__ void mmul(float *A, float *B, float *C, int n){
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    __shared__ float As[BLOCK][BLOCK];
    __shared__ float Bs[BLOCK][BLOCK];
    float Csub = 0.0f;

    for (int b = 0; b < n/BLOCK; b++){
        As[ty][tx] = A[(ty + by*BLOCK)*n + b*BLOCK+tx];
        Bs[ty][tx] = B[(ty + b*BLOCK)*n + bx*BLOCK+tx];
        __syncthreads();
        for (int k = 0; k < BLOCK; k++)
            Csub += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }
    C[(ty + by*BLOCK)*n + bx*BLOCK+tx] = Csub;
}

Implementation Pitfalls

As[ty][tx] = A[(ty + by*BLOCK)*n + b*BLOCK+tx];
Bs[ty][tx] = B[(ty + b*BLOCK)*n + bx*BLOCK+tx];
C[(ty + by*BLOCK)*n + bx*BLOCK+tx] = Csub;

Global memory access is OK.

Csub += As[ty][k] * Bs[k][tx];

Shared memory access is also OK:
  • if the thread block x-size is a multiple of the warp size, the value As[ty][k] is broadcast
  • array Bs is read along contiguous rows, which is conflict-free

Theoretical Peak

Can we be more precise in the theoretical peak computation?
  • we have used the theoretical peak of the GPU in MAD instructions (622 GFlops)
  • now we know that MAD instructions with an operand in shared memory are 50% slower
  • the more precise theoretical bound is 415 GFlops
  • our implementation is still far from that

Performance Pitfalls

What causes the performance degradation?
  • overhead of kernel execution and thread creation
      • matters mainly for fast kernels, or kernels with a few instructions per thread
      • threads can do more work serially
  • instruction overhead
      • pointer arithmetic, loops
      • can be reduced
  • synchronization
      • may or may not be an issue
  • load/store in the computation
      • two operands in SMEM per one MAD instruction
If we count the performance bound for one load per MAD with an operand in SMEM, we get 244 GFlops.
Searching for a Better Implementation

Can the number of load instructions be decreased?
  • exploiting data locality in shared memory decreases global memory pressure
  • exploiting data locality in registers decreases shared memory pressure
  • how to do it? we reduce the number of threads and assign more work to each of them (see the sketch below)
A thread block of size m x n will process a tile of size m x m, where m = n·k, k ∈ ℕ.
  • a large m potentially increases synchronization overhead
  • a small m reduces shared memory locality
  • a small n reduces the available parallelism
  • we will find the values of m and n experimentally
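A sketch of the coarsened variant for m = 32, n = 16 (k = 2): a 32 x 16 thread block computes a 32 x 32 tile of C, each thread accumulating two results so that one value of Bs is reused for two MADs. The kernel name and the exact load pattern are illustrative; the inner-loop statements correspond to the Csub1/Csub2 code examined on the following slides.

// Sketch: thread coarsening, 32 x 16 block working on a 32 x 32 tile.
#define BLOCK 32

__global__ void mmul_coarsened(float *A, float *B, float *C, int n) {
    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;     // tx in [0,32), ty in [0,16)

    __shared__ float As[BLOCK][BLOCK];
    __shared__ float Bs[BLOCK][BLOCK];

    float Csub1 = 0.0f, Csub2 = 0.0f;

    for (int b = 0; b < n/BLOCK; b++) {
        // each thread loads two elements of each tile (rows ty and ty+16)
        As[ty][tx]      = A[(by*BLOCK + ty)*n      + b*BLOCK + tx];
        As[ty + 16][tx] = A[(by*BLOCK + ty + 16)*n + b*BLOCK + tx];
        Bs[ty][tx]      = B[(b*BLOCK + ty)*n       + bx*BLOCK + tx];
        Bs[ty + 16][tx] = B[(b*BLOCK + ty + 16)*n  + bx*BLOCK + tx];
        __syncthreads();

        for (int k = 0; k < BLOCK; k++) {
            float breg = Bs[k][tx];             // reused from a register
            Csub1 += As[ty][k]      * breg;
            Csub2 += As[ty + 16][k] * breg;
        }
        __syncthreads();
    }

    C[(by*BLOCK + ty)*n      + bx*BLOCK + tx] = Csub1;
    C[(by*BLOCK + ty + 16)*n + bx*BLOCK + tx] = Csub2;
}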
The best results were found for m = 32, n = 16 (32 x 16 blocks working with 32 x 32 tiles).
  • one load per two MAD instructions results in a theoretical bound of 311 GFlops
  • we have 235.4 GFlops
  • something is wrong

Code Disassembly

We focus on the inner loop:

Csub1 += As[ty][k]*Bs[k][tx];
Csub2 += As[ty + 16][k]*Bs[k][tx];

mov.b32    $r0, s[$ofs4+0x0000]
add.b32    $ofs4, $ofs2, 0x00000180
mad.rn.f32 $r7, s[$ofs1+0x0008], $r0, $r7
mad.rn.f32 $r8, s[$ofs3+0x0008], $r0, $r8

The compiler was able to use constant offsets only for As:
  • the strided access into Bs generates one load and one integer add

Removing the ADD Instruction

We store transposed data into Bs and modify the inner loop:

Csub1 += As[ty][k]*Bs[tx][k];
Csub2 += As[ty + 16][k]*Bs[tx][k];

After disassembling, we see there is no ADD instruction:

mov.b32    $r0, s[$ofs4+0x0008]
mad.rn.f32 $r6, s[$ofs3+0x0034], $r0, $r6
mad.rn.f32 $r8, s[$ofs1+0x0008], $r0, $r8

A new issue appears: shared memory bank conflicts
  • solved by padding (see the sketch below)
The resulting speed: 276.2 GFlops.
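A sketch of the modified kernel with Bs stored transposed and padded by one column (the kernel name and the one-element padding are illustrative; the padding follows the same reasoning as in the transposition example):

// Sketch: the coarsened kernel with a transposed, padded Bs tile. The
// transposed layout lets the compiler address Bs[tx][k] with a constant
// offset in the inner loop; the padding removes the bank conflicts the
// transposed layout would otherwise introduce.
#define BLOCK 32

__global__ void mmul_coarsened_tr(float *A, float *B, float *C, int n) {
    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;        // 32 x 16 thread block

    __shared__ float As[BLOCK][BLOCK];
    __shared__ float Bs[BLOCK][BLOCK + 1];         // transposed tile of B, padded

    float Csub1 = 0.0f, Csub2 = 0.0f;

    for (int b = 0; b < n/BLOCK; b++) {
        As[ty][tx]      = A[(by*BLOCK + ty)*n      + b*BLOCK + tx];
        As[ty + 16][tx] = A[(by*BLOCK + ty + 16)*n + b*BLOCK + tx];
        // store B transposed: Bs[column][row]
        Bs[tx][ty]      = B[(b*BLOCK + ty)*n       + bx*BLOCK + tx];
        Bs[tx][ty + 16] = B[(b*BLOCK + ty + 16)*n  + bx*BLOCK + tx];
        __syncthreads();

        for (int k = 0; k < BLOCK; k++) {
            Csub1 += As[ty][k]      * Bs[tx][k];
            Csub2 += As[ty + 16][k] * Bs[tx][k];
        }
        __syncthreads();
    }

    C[(by*BLOCK + ty)*n      + bx*BLOCK + tx] = Csub1;
    C[(by*BLOCK + ty + 16)*n + bx*BLOCK + tx] = Csub2;
}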
Can We Reach Better Performance?

Our results are pretty close to the theoretical bound for one load per two MADs.
  • to get better performance, the tiled algorithm has to be revised
The main issue is that we multiply two tiles stored in shared memory:
  • load instructions have to be issued together with the MAD instructions
Can we have only one block in shared memory?

New Tiled Algorithm

We will iteratively perform a rank-1 update of the tile of C, using a column of the A tile and a row of the B tile (see the sketch below):
  • columns of A are read from shared memory
  • rows of B can be read one after another, so we can keep the current row in registers
  • the tile of C can be stored in registers
  • we work with only one operand in shared memory, so explicit loads are not needed
  • the theoretical bound is now given by the speed of a MAD instruction with one operand in shared memory: 415 GFlops
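One possible realization of this scheme as a sketch (not necessarily the exact kernel behind the measurements): a block of 64 threads computes a 16 x 64 tile of C, each thread keeping one 16-element column of it in registers; a 16 x 16 tile of A sits in shared memory and rows of B are streamed through one register per thread. n is assumed to be a multiple of 64.

// Sketch: rank-1 update matrix multiplication with a single operand in
// shared memory.
#define TILE_K 16     // A tile is TILE_K x TILE_K
#define TILE_N 64     // C tile is TILE_K rows x TILE_N columns

__global__ void mmul_rank1(float *A, float *B, float *C, int n) {
    __shared__ float As[TILE_K][TILE_K];

    int tid = threadIdx.x;                 // 64 threads per block
    int col = blockIdx.x * TILE_N + tid;   // this thread's column of C
    int row0 = blockIdx.y * TILE_K;        // first row of the C tile

    float c[TILE_K];                       // one column of the C tile in registers
    for (int i = 0; i < TILE_K; i++) c[i] = 0.0f;

    for (int b = 0; b < n / TILE_K; b++) {
        // cooperatively load a 16 x 16 tile of A into shared memory
        for (int i = tid; i < TILE_K * TILE_K; i += TILE_N) {
            int r = i / TILE_K, s = i % TILE_K;
            As[r][s] = A[(row0 + r)*n + b*TILE_K + s];
        }
        __syncthreads();

        // series of rank-1 updates: column k of the A tile times row k of B
        for (int k = 0; k < TILE_K; k++) {
            float breg = B[(b*TILE_K + k)*n + col];   // one element of B per thread
            for (int i = 0; i < TILE_K; i++)
                c[i] += As[i][k] * breg;              // single operand in shared memory
        }
        __syncthreads();
    }

    for (int i = 0; i < TILE_K; i++)
        C[(row0 + i)*n + col] = c[i];
}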
New Tiled Algorithm

The best-performing configuration:
  • matrix A is read in 16 x 16 tiles stored in shared memory
  • matrix B is read in 64 x 1 tiles stored in registers
  • tiles of matrix C have size 64 x 16 and are stored in registers
The measured performance is 375 GFlops.

Summary

Implementation                            performance    rel. Δ    abs. Δ
Naive, blocks 1 x 128                     3.9 GFlops     -         -
Naive, blocks 128 x 1                     36.6 GFlops    9.4x      9.4x
Tiled algorithm                           198 GFlops     5.4x      51x
Thread blocks 32 x 16, tiles 32 x 32      235 GFlops     1.19x     60x
Removing the ADD instruction              276 GFlops     1.17x     71x
One block in shared memory                375 GFlops     1.36x     96x

  • The most relevant optimizations are exploiting memory locality and the basic memory access optimizations.
  • The finer optimizations are relatively challenging and important only for really performance-critical code.