GPU Hardware Performance
Jiří Filipovič
Fall 2024

Atomic Operations
• perform read-modify-write operations on shared or global memory
• no interference with other threads
• for 32-bit and 64-bit integers (c. c. ≥ 1.2), float addition (c. c. ≥ 2.0), double addition (c. c. ≥ 6.0)
• available on global memory from c. c. 1.1 and on shared memory from c. c. 1.2
• arithmetic (Add, Sub, Exch, Min, Max, Inc, Dec, CAS) and bitwise (And, Or, Xor) operations

Warp Voting
All threads in one warp evaluate the same predicate and the results are combined across the warp. Available from c. c. 1.2.

    int __all(int predicate);

The result is non-zero iff the predicate is non-zero for all the threads in the warp.

    int __any(int predicate);

The result is non-zero iff the predicate is non-zero for at least one thread in the warp.

    unsigned int __ballot(int predicate);

Returns a bit mask containing the vote of each individual thread.

Shuffle Functions
Threads within a warp can communicate efficiently using warp shuffle functions (from c. c. 3.0).

    float __shfl_sync(unsigned mask, float var, int srcLane, int width=warpSize);

Copies the value from srcLane.

    float __shfl_up_sync(unsigned mask, float var, unsigned int delta, int width=warpSize);

Copies the value from the thread with a lower ID relative to the caller; __shfl_down_sync works analogously.

    float __shfl_xor_sync(unsigned mask, float var, int laneMask, int width=warpSize);

Copies from the thread determined by bitwise XOR of the caller's own ID and laneMask.

The parameter width defines the number of participating threads; it must be a power of two, and indexing starts at 0. Two usage sketches follow: warp voting feeding an atomic counter, and a shuffle-based warp reduction.
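A minimal sketch of warp voting in practice, combined with the atomic operations above. The kernel name and the thresholding task are illustrative assumptions, and it uses the modern _sync variants (which take an explicit mask of participating lanes) rather than the legacy __any/__ballot listed on the slide.

    __device__ unsigned int passed = 0;   // assumed to be zeroed before launch

    // Illustrative kernel: count how many elements exceed a threshold,
    // issuing one atomic per warp instead of one per thread.
    __global__ void count_above(const float *in, int n, float t)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int pred = (i < n) && (in[i] > t);

        // Cheap early exit: the whole warp returns together if nobody passes.
        if (!__any_sync(0xffffffffu, pred))
            return;

        // One bit per lane of the warp; __popc counts the set bits.
        unsigned ballot = __ballot_sync(0xffffffffu, pred);

        // Only lane 0 issues the atomic: 1 atomic per warp instead of 32.
        if ((threadIdx.x & 31) == 0)
            atomicAdd(&passed, __popc(ballot));
    }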
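And a shuffle-based warp reduction, the canonical use of __shfl_down_sync; the kernel and the final per-warp atomicAdd are an illustrative sketch, not code from the lecture.

    // Warp-level sum: each step halves the number of live partial sums.
    __inline__ __device__ float warpReduceSum(float val)
    {
        // Lanes delta apart exchange values; after 5 steps lane 0 holds the sum.
        for (int delta = 16; delta > 0; delta >>= 1)
            val += __shfl_down_sync(0xffffffffu, val, delta);
        return val;
    }

    __global__ void reduce(const float *in, float *result, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = (i < n) ? in[i] : 0.0f;

        v = warpReduceSum(v);

        // One atomic per warp combines the partial sums in global memory.
        if ((threadIdx.x & 31) == 0)
            atomicAdd(result, v);
    }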
Synchronization of Memory Operations
The compiler can optimize operations on shared/global memory (intermediate results may be kept in registers) and can reorder them
• if we need to ensure that the data are visible to other threads, we use __threadfence() or __threadfence_block()
• if a variable is declared volatile, all load/store operations are performed in shared/global memory
• very important if we assume implicit warp synchronization (c. c. 6.0 or lower)

Global Synchronization Using Atomic Operations
Alternative implementation of a vector reduction
• each thread block sums elements in its part of the vector
• barrier (weak global barrier)
• one thread block sums the results of all the blocks

    __device__ unsigned int count = 0;
    __shared__ bool isLastBlockDone;

    __global__ void sum(const float* array, unsigned int N, float* result)
    {
        float partialSum = calculatePartialSum(array, N);
        if (threadIdx.x == 0) {
            result[blockIdx.x] = partialSum;
            __threadfence();
            unsigned int value = atomicInc(&count, gridDim.x);
            isLastBlockDone = (value == (gridDim.x - 1));
        }
        __syncthreads();
        if (isLastBlockDone) {
            float totalSum = calculateTotalSum(result);
            if (threadIdx.x == 0) {
                result[0] = totalSum;
                count = 0;
            }
        }
    }

Global Memory Access Optimization
Performance of global memory easily becomes a bottleneck
• global memory bandwidth is low relative to the arithmetic performance of the GPU (GT200 > 24 FLOPS per float transferred, GF100 > 30, GK110 > 62, GM200 > 73, GP100 > 53, GV100 > 67, TU102 > 76, GA100 > 50, AD102 > 290, GH100 > 80)
• 400-600 cycles of latency
The throughput can be significantly worse with a bad parallel access pattern
• the memory has to be accessed in a coalesced way
• using only a subset of the memory regions should be avoided (partition camping)

Coalesced Memory Access (c. c. < 2.0)
GPU memory needs to be accessed in larger blocks for efficiency
• global memory is split into 64 B segments
• two of these segments are aggregated into 128 B segments
(figure: half warp of threads accessing 64 B and 128 B aligned segments)

Coalesced Memory Access (c. c. < 2.0)
A half-warp can transfer the data using a single transaction, or one to two transactions when transferring a 128 B word
• it is necessary to use large words
• one memory transaction can transfer 32 B, 64 B, or 128 B words
• GPUs with c. c. < 1.2:
    • the accessed block has to begin at an address divisible by 16× the data size
    • the k-th thread has to access the k-th block element
    • some threads may not participate
    • if these rules are not obeyed, each element is retrieved using a separate memory transaction

Coalesced Memory Access (c. c. < 2.0)
GPUs with c. c. ≥ 1.2 are less restrictive
• each transfer is split into 32 B, 64 B, or 128 B transactions so that all requests are served with the least number of transactions
• the order of threads can be arbitrarily permuted w.r.t. the transferred elements

Coalesced Memory Access (c. c. < 2.0)
Threads are aligned, the element block is contiguous, the order is not permuted - coalesced access on all GPUs
(figure: aligned, contiguous, non-permuted access by a half warp)
A sketch contrasting coalesced and strided access follows below.
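The two kernels below contrast the access patterns just discussed; the kernel names are illustrative. With a unit stride, each warp's requests fall into a few contiguous segments; with a large stride, every element costs its own transaction on c. c. < 2.0.

    // Coalesced: consecutive threads read consecutive 4 B elements, so a
    // half-warp's 16 requests fall into a single 64 B segment.
    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Strided: with stride >= 16, every thread of a half-warp touches a
    // different segment, so each element needs its own transaction on
    // c. c. < 2.0 (and wastes cache lines on newer GPUs).
    __global__ void copy_strided(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n)
            out[i * stride] = in[i * stride];
    }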
Unaligned Memory Access (c. c. < 2.0)
A similar case may result in a need for two transactions
(figure: half warp accessing a contiguous but unaligned block spanning two segments)

Unaligned Memory Access Performance (c. c. < 2.0)
Older GPUs perform the smallest possible transfer (32 B) for each element, thus reducing the performance to 1/8
Newer GPUs (c. c. ≥ 1.2) perform two transfers
(figure: bandwidth as a function of the access offset, 0-16)

Interleaved Memory Access Performance (c. c. < 2.0)
The bigger the spaces between the elements, the bigger the performance drop on GPUs with c. c. ≥ 1.2 - the effect is rather dramatic
(figure: bandwidth as a function of the access stride, 0-18)

Global Memory Access on Fermi (c. c. = 2.x)
Fermi has an L1 and an L2 cache
• L1: 128 B per line, 16 kB or 48 kB per multiprocessor in total
• L2: 32 B per line, 768 kB per GPU in total
What are the advantages?
• more efficient programs with unpredictable data locality
• more efficient when shared memory is not used for some reason
• unaligned access - in principle no slowdown
• interleaved access - the data need to be used before they are flushed from the cache, otherwise the problem is the same as or bigger than with c. c. < 2.0 (the L1 cache may be turned off to avoid overfetching)

Global Memory Access on "gaming" Kepler (c. c. = 3.0)
There is only the L2 cache for read/write global memory access
• L2: 32 B per line, up to 1.5 MB per GPU
• L1: used for local memory only, 16 kB, 32 kB, or 48 kB in total

Global Memory Access on fully-featured Kepler and newer (c. c. ≥ 3.5)
Read-only data cache
• shared with textures
• the compiler tries to use it; we can help with __restrict__ and __ldg() (see the sketch after the partition camping slide below)
• slower than Fermi's L1
Maxwell and Pascal do not have an L1 cache for local memory
• inefficient for programs heavily using local memory

Partition Camping
• relevant for c. c. 1.x (and AMD GPUs)
• processors based on G80 have 6 regions of global memory, those based on G200 have 8
• the memory is split into the regions in 256 B chunks
• even access among the regions is needed for maximum performance
    • among individual blocks
    • blocks are usually run in the order given by their position in the grid
• if only a part of the regions is used, the resulting condition is called partition camping
• generally not as critical as coalesced access
    • more tricky: problem-size dependent, not visible from a fine-grained perspective
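A minimal sketch of steering loads through the read-only data cache, as promised above; the kernel is illustrative. Marking pointers const __restrict__ lets the compiler prove read-only access, and __ldg() requests it explicitly for a particular load.

    // Reads of `in` can go through the read-only (texture) data cache on
    // c. c. >= 3.5; const + __restrict__ guarantee the data cannot be
    // written through another pointer inside this kernel.
    __global__ void scale(const float* __restrict__ in,
                          float* __restrict__ out, int n, float alpha)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = alpha * __ldg(&in[i]);   // explicit read-only-cache load
    }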
HW Organization of Shared Memory
Shared memory is organized into memory banks, which can be accessed in parallel
• c. c. 1.x: 16 banks, c. c. ≥ 2.0: 32 banks
• the memory space is mapped onto the banks in an interleaved way with a 32-bit shift, or a 64-bit shift on c. c. 3.x
• to use the full memory performance, we have to access data in different banks
• broadcast is implemented - used if all threads access the same data

Bank Conflict
A bank conflict
• occurs when some threads in a warp/half-warp access data in the same memory bank, with several exceptions
    • the threads access exactly the same data
    • the threads access different half-words of a 64-bit word (c. c. 3.x)
• when it occurs, the memory access gets serialized
• the performance drop is proportional to the number of parallel operations the memory has to perform to serve the request

Access without Conflicts
(figure: each thread of a warp accesses a different bank)

n-Way Conflicts
(figure: several threads access different words in the same bank, serializing the access)

Broadcast
(figure: all threads read the same word, served by a single broadcast)

Access Patterns
Alignment is not needed; no bank conflicts are generated by

    int x = s[threadIdx.x + offset];

Interleaving does not create conflicts if c is odd; on c. c. ≥ 3.0 there is also no conflict for c = 2 when 32-bit values are accessed:

    int x = s[threadIdx.x * c];

Access to the same variable never generates conflicts on c. c. 2.x, while on 1.x it is conflict-free only if the number of threads accessing the variable is a multiple of 16:

    int x = s[threadIdx.x / c];

A padding sketch for avoiding bank conflicts follows at the end of the section.

Other Memory Types
Transfers between the host and GPU memory
• need to be minimized (often at the cost of decreasing the efficiency of the computation on the GPU)
• may be accelerated using page-locked memory
• it is more efficient to transfer large blocks at once
• computation and memory transfers should be overlapped (see the sketch below)
Texture memory
• designed to reduce the number of transfers from the global memory
• does not help if latency is the bottleneck
• may simplify addressing or add filtering

Other Memory Types
Constant memory
• as fast as registers if the same value is read by all threads within a warp
• performance decreases linearly with the number of different values read
Registers
• read-after-write latency, hidden if at least 192 threads are running for c. c. 1.x or at least 768 threads for c. c. 2.x (approximation)
• bank conflicts are possible even in registers
    • the compiler tries to avoid them
    • we can make the compiler's life easier by setting the block size to a multiple of 64
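The classic padding trick for the bank conflicts described above; the transpose kernel and tile size are illustrative. With a 32×32 tile of 32-bit words, a tile column maps entirely to one bank, so padding each row by one element shifts consecutive rows to consecutive banks.

    #define TILE 32

    // Matrix transpose via shared memory; launch with dim3 block(TILE, TILE).
    // The +1 padding puts element (r, c) in bank (r + c) % 32 instead of
    // bank c % 32, so reading a column touches all 32 banks instead of one.
    __global__ void transpose(const float *in, float *out, int n)
    {
        __shared__ float tile[TILE][TILE + 1];   // without +1: 32-way conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < n && y < n)
            tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced load

        __syncthreads();

        // read a column of the tile, write transposed with a coalesced store
        x = blockIdx.y * TILE + threadIdx.x;
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < n && y < n)
            out[y * n + x] = tile[threadIdx.x][threadIdx.y];
    }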
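A host-side sketch of page-locked memory and transfer/compute overlap; the kernel and the chunking scheme are illustrative assumptions. Pinned buffers both speed up the copies and are required for cudaMemcpyAsync to actually run asynchronously.

    #include <cuda_runtime.h>

    // placeholder work; stands in for a real kernel
    __global__ void process(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    void pipeline(int n, int chunks)
    {
        float *h, *d;
        cudaMallocHost(&h, n * sizeof(float));   // page-locked host memory
        cudaMalloc(&d, n * sizeof(float));

        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        int chunk = n / chunks;                  // assumes it divides evenly
        for (int i = 0; i < chunks; ++i) {
            cudaStream_t st = s[i % 2];
            float *hp = h + i * chunk;
            float *dp = d + i * chunk;
            // within a stream the three operations run in order; across the
            // two streams copies overlap with kernels of the other chunk
            cudaMemcpyAsync(dp, hp, chunk * sizeof(float), cudaMemcpyHostToDevice, st);
            process<<<(chunk + 255) / 256, 256, 0, st>>>(dp, chunk);
            cudaMemcpyAsync(hp, dp, chunk * sizeof(float), cudaMemcpyDeviceToHost, st);
        }
        cudaDeviceSynchronize();
        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
        cudaFree(d);
        cudaFreeHost(h);
    }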
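Finally, a small constant-memory sketch matching the broadcast behaviour described above; the coefficient array and the polynomial-evaluation kernel are illustrative.

    __constant__ float coef[16];   // filled from the host before launch

    // In every iteration all threads of a warp read the same coef[j],
    // so the constant cache broadcasts it at register-like speed.
    __global__ void poly(const float *x, float *y, int n, int degree)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float acc = coef[degree];
        for (int j = degree - 1; j >= 0; --j)   // Horner scheme
            acc = acc * x[i] + coef[j];
        y[i] = acc;
    }

    // host side: cudaMemcpyToSymbol(coef, hostCoef, sizeof(hostCoef));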