GPU Hardware Performance
Jiří Filipovič
Fall 2010

Sections: Global Memory Access Optimization, Matrix Transposition, Instruction Speed

Global Memory Access Optimization

The performance of global memory easily becomes a bottleneck
• global memory bandwidth is low relative to the arithmetic performance of the GPU (G200 > 24 FLOPS per transferred float, GF100 > 30)
• 400-600 cycles of latency

The throughput can be significantly worse with a bad parallel access pattern
• the memory has to be accessed continuously (coalescing)
• using only a certain subset of the memory regions should be avoided (partition camping)

Continuous Memory Access (C. C. < 2.0)

GPU memory needs to be accessed in larger blocks for efficiency
• global memory is split into 64 B segments
• two of these segments are aggregated into 128 B segments

[Figure: a half-warp of threads mapped onto a 64 B aligned segment and a 128 B aligned segment]

A half-warp can transfer the data in a single transaction, or in one or two transactions when transferring a 128 B word
• it is necessary to use large words
• one memory transaction can transfer 32 B, 64 B, or 128 B words
• GPUs with c. c. < 1.2
  • the accessed block has to begin at an address divisible by 16x the data size
  • the k-th thread has to access the k-th element of the block
  • some threads needn't participate
  • if these rules are not obeyed, each element is retrieved by a separate memory transaction
• GPUs with c. c. ≥ 1.2 are less restrictive
  • each transfer is split into 32 B, 64 B, or 128 B transactions so that all requests are served with the least number of transactions
  • the order of threads can be arbitrarily permuted w.r.t. the transferred elements

When the threads are aligned, the element block is continuous, and the order is not permuted, the access is continuous (coalesced) on all GPUs.

[Figure: aligned, continuous, unpermuted access by a half-warp]

Unaligned Memory Access (C. C. < 2.0)

When the threads are not aligned but the element block is continuous and the order is not permuted, GPUs with c. c. ≥ 1.2 need one transaction.

[Figure: unaligned but continuous access by a half-warp]

A similar case may result in a need for two transactions.

[Figure: continuous access crossing a segment boundary, resulting in two transactions]

Unaligned Memory Access Performance (C. C. < 2.0)

Older GPUs perform the smallest possible transfer (32 B) for each element, thus reducing performance to 1/8; newer GPUs (c. c. ≥ 1.2) perform two transfers.

[Figure: effective bandwidth as a function of the access offset, 0-16]

Interleaved Memory Access Performance (C. C. < 2.0)

The bigger the spaces between the elements, the bigger the performance drop on GPUs with c. c. ≥ 1.2 - the effect is rather dramatic.

[Figure: effective bandwidth as a function of the access stride, 0-18]
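The offset and stride effects above can be reproduced with simple copy microbenchmarks. The following is a minimal sketch of such kernels; it is not taken from the original slides, and the kernel names, sizes, and launch parameters are illustrative assumptions.

```c
#include <cuda_runtime.h>

// Copy where the whole half-warp is shifted by 'offset' elements:
// offset == 0 gives fully coalesced access, offset > 0 forces
// unaligned transactions on c. c. < 2.0.
__global__ void offsetCopy(float *out, const float *in, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    out[i] = in[i];
}

// Copy where consecutive threads are 'stride' elements apart:
// a larger stride spreads a half-warp over more 64 B / 128 B segments,
// so more transactions are needed and effective bandwidth drops.
__global__ void strideCopy(float *out, const float *in, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}

int main(void)
{
    const int n = 1 << 20;          // threads per launch
    const int maxStride = 18;
    float *in, *out;
    // allocate with headroom for the largest offset/stride
    cudaMalloc((void**)&in,  (size_t)n * maxStride * sizeof(float));
    cudaMalloc((void**)&out, (size_t)n * maxStride * sizeof(float));

    // time each launch (e.g., with cudaEvent_t) to obtain the
    // bandwidth-vs-offset and bandwidth-vs-stride curves
    for (int offset = 0; offset <= 16; offset++)
        offsetCopy<<<n / 256, 256>>>(out, in, offset);
    for (int stride = 1; stride <= maxStride; stride++)
        strideCopy<<<n / 256, 256>>>(out, in, stride);

    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

On devices with c. c. ≥ 2.0 the same kernels mostly exhibit the caching behaviour described below rather than the sharp penalties of c. c. 1.x hardware.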
Global Memory Access with Fermi (C. C. ≥ 2.0)

Fermi has an L1 and an L2 cache
• L1: 128 B cache lines, 16 kB or 48 kB per multiprocessor in total
• L2: 32 B cache lines, 768 kB per GPU in total

What are the advantages?
• more efficient execution of programs with unpredictable data locality
• unaligned access - in principle no slowdown
• interleaved access - the data needs to be used before it is flushed from the cache, otherwise the problem is the same as or bigger than with c. c. < 2.0 (the L1 cache may be turned off to avoid overfetching)

Partition Camping

• relevant for c. c. 1.x
• processors based on G80 have 6 regions of global memory, those based on G200 have 8
• the memory is split into 256 B regions
• for maximum performance, the regions have to be accessed evenly
  • among individual blocks
  • blocks are usually run in the order given by their position in the grid
• if only a part of the regions is used, the resulting condition is called partition camping
• generally not as critical as continuous access
  • trickier: it depends on the problem size and is disguised from the fine-grained perspective

Instruction Speed

Newer GPUs (c. c. ≥ 1.3) can work in double precision, while older ones support single precision only.
• some arithmetic operations are used very frequently in graphics, so GPUs implement them in HW
• the HW implementation provides lower precision (not an issue for lots of applications)
• the variants are differentiated by a prefix (e.g., __sinf() vs. sinf())

Arithmetic Operations

Floating point operations (throughput per multiprocessor)
• addition, multiplication: 8 (1.x), 32 (2.0), 48 (2.1)
  • multiplication and addition may be combined into a single MAD instruction on c. c. 1.x
    • lower precision
    • takes 1 cycle on an SP
    • __fadd_rn() and __fmul_rn() may be used to prevent the compiler from generating MAD instructions
  • on c. c. 2.x, MAD is replaced by FMAD (same speed, higher precision)
• 64-bit (double) versions: 1/8 (1.3), 1/2 (2.0), 1/12 (2.1)
• reciprocal: 2 (1.x), 4 (2.0), 8 (2.1)
• division is relatively slower: 1.23 on average (c. c. 1.x)
  • the faster variant __fdividef(x, y): 1.6 (c. c. 1.x)
• reciprocal square root: 2 (1.x), 4 (2.0), 8 (2.1)
• type conversion: 8 (c. c. 1.x), 16 (c. c. 2.x)
• __sinf(x), __cosf(x), __expf(x): 2 (c. c. 1.x), 4 (c. c. 2.0), 8 (c. c. 2.1)
• sinf(x), cosf(x), expf(x): more precise but an order of magnitude slower
• other operations with different speed/precision trade-offs are implemented, see the CUDA manual

Integer operations
• addition: same speed as the floating point additions
• multiplication on c. c. 1.x: 2 per multiprocessor, while __mul24(x, y) and __umul24(x, y) reach 8
• multiplication on c. c. 2.x is as fast as the floating point operations; there the 24-bit versions are slow
• division and modulo are very slow, but if n is a power of 2, we can use shifts and masks instead (see the sketch below)
  • i / n is equivalent to i >> log2(n)
  • i % n is equivalent to i & (n - 1) (for non-negative i)
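A minimal sketch of these integer tricks follows; it is not from the original slides, and the kernel name and constants are illustrative.

```c
// Index arithmetic written to avoid slow 32-bit multiply, divide,
// and modulo on c. c. 1.x devices.
__global__ void indexDemo(int *out, const int *in, int n)
{
    // __mul24() is faster than the 32-bit '*' on c. c. 1.x
    // (on c. c. 2.x the plain '*' is the fast variant)
    int i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    if (i >= n)
        return;

    const int BANK = 16;          // divisor must be a power of two
    int row = i >> 4;             // i / BANK, since log2(16) = 4
    int col = i & (BANK - 1);     // i % BANK (valid for i >= 0)
    out[i] = in[i] + row + col;
}
```

The compiler typically performs this strength reduction itself when the divisor is a compile-time constant; the manual form matters when the divisor is a runtime value that is known to be a power of two.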
Loops

Small loops have significant overhead
• jumps need to be implemented
• the control variable has to be updated
• a significant part of the instructions may be pointer arithmetic

Loop unrolling is an option
• it may partially be done by the compiler
• we can unroll manually or use #pragma unroll

Other Instructions

Other common instructions are executed at the basic speed (i.e., corresponding to the number of SPs)
• comparison
• bit operations
• memory access instructions (subject to the limitations discussed earlier and to the memory latency/bandwidth)
  • the address offset may be a register value plus a constant
• synchronization (unless we get blocked)

Beware of Shared Memory

If memory bank conflicts are avoided, the shared memory is as fast as registers.

But beware
• an instruction can work with only one operand in the shared memory
• if more than one operand of an instruction resides in the shared memory, an explicit load/store is necessary
• MAD instructions run slower (c. c. 1.x)
  • a + s[i]: 4 cycles per warp
  • a + a * s[i]: 5 cycles per warp
  • a + b * s[i]: slower still
• these details are not published by NVIDIA (they were revealed through measurements)
  • they may change with future GPU generations and are interesting only for really critical code

C for CUDA Compilation

Device code can be compiled into PTX assembler and into binary files
• PTX is an intermediate code, it does not correspond directly to the GPU instructions
  • easier to read
  • harder to figure out what really happens on the GPU
  • a compiler producing native GPU code is to be released
• binary files may be disassembled using the decuda tool
  • a third-party product
  • may not work completely reliably
  • still quite useful
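As a concrete illustration of this workflow (not part of the original slides; file names are assumptions, and the exact decuda invocation depends on its version):

```c
// saxpy.cu - a toy kernel for inspecting the generated code.
//
// PTX:    nvcc -ptx   saxpy.cu -o saxpy.ptx
// cubin:  nvcc -cubin saxpy.cu -o saxpy.cubin
// the resulting .cubin can then be fed to the third-party decuda
// disassembler to see an approximation of the native instructions
__global__ void saxpy(float *y, const float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // should compile to a single MAD/FMAD
}
```

Comparing the PTX with the disassembled binary is a good way to verify, for example, whether a MAD was generated or whether a loop was actually unrolled.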