GPU Hardware and Parallelism
Jiří Filipovič
Fall 2013

Alternatives to CUDA

CUDA is (and probably will remain) available only for nVidia GPUs.

OpenCL
• a standard for various types of accelerators (independent of HW vendor and OS)
• strongly inspired by CUDA, so the transition is very easy

DirectX Compute
• various GPU vendors, one OS

Brook(+)
• multi-platform, but only for AMD/ATI GPUs
• restricted to stream processing

Why to Learn CUDA?

Why CUDA and not OpenCL?
• published results still show higher speed
• better stability of the environment
• the largest number of applications
• the largest number of libraries
• the largest number of publications
• easier to learn
• its similarity to OpenCL allows an easy transition later
• PGI x86 CUDA compiler

Differences among CUDA GPUs

New generations bring higher performance and new computing capabilities.
• compute capability describes the richness of the GPU instruction set and the amount of resources, such as registers or the number of concurrently running threads
• performance grows with the number of cores that can be placed on a GPU

Cards within one generation also differ substantially in performance
• to produce more affordable cards
• due to changes introduced later in the manufacturing process
• to minimize the power consumption of mobile GPUs
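The compute capability and the per-device resources mentioned above can be queried at runtime. Below is a minimal sketch using the standard CUDA runtime API; the variable names are illustrative, and the memoryBusWidth field may be missing in very old toolkits.

#include <stdio.h>
#include <cuda_runtime.h>

// Minimal sketch: print the compute capability and basic resources of each GPU.
int main() {
    int devCount = 0;
    cudaGetDeviceCount(&devCount);
    for (int i = 0; i < devCount; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s\n", i, prop.name);
        printf("  compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  multiprocessors:    %d\n", prop.multiProcessorCount);
        printf("  clock rate:         %.0f MHz\n", prop.clockRate / 1000.0);
        printf("  memory bus width:   %d bit\n", prop.memoryBusWidth);
        printf("  global memory:      %lu MB\n", (unsigned long)(prop.totalGlobalMem >> 20));
    }
    return 0;
}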
GPUs Available Today

Currently available GPUs
• compute capability 1.0 - 2.1
  • we will learn the differences later
• 1-30 multiprocessors (19.2 - 1 345.0 GFLOPS)
• frequencies of 800 MHz - 1.836 GHz
• various widths and speeds of the data bus (64-512 bit, 6.4-177 GB/s)

Available Products

GeForce graphics cards
• mainstream solution for gaming
• cheap, widely used, broad range of performance
• disadvantage: limited memory (up to 1.5 GB on the GPU)

Professional Quadro graphics cards
• the same as GeForce from the CUDA perspective
• up to 4 GB of memory on the GPU
• several times more expensive

Tesla
• a solution designed specifically for CUDA computing
• one GPU per generation (basic variant), always with large memory
• available as a PCIe card or as standalone multi-GPU machines
• expensive, interesting for computing centers and personal supercomputers

GPU Parallelism

Parallel algorithms need to be designed with respect to the parallelism available in the HW
• GPU: an array of SIMT multiprocessors working over a common memory

Decomposition for GPU
• coarse-grained decomposition of the problem into parts that do not need intensive communication
• fine-grained decomposition similar to vectorization (but SIMT is more flexible)

Task Hierarchy

Figure: the task hierarchy. A grid consists of thread blocks, and each block consists of individual threads indexed within the block.

SIMT

A multiprocessor has a single instruction-issue unit
• all 8 SPs have to execute the same instruction
• a new instruction is issued every 4 cycles
• 32 threads (a so-called warp) therefore need to execute the same instruction

How about code branching?
• if different parts of a warp execute different instructions, the paths are serialized
• this decreases performance and should be avoided
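As a hedged illustration of the last point (the kernel names and the 32-thread warp size are assumptions of this sketch, not taken from the slides): a branch whose condition varies inside a warp forces serialization, while a branch whose condition is uniform across each warp does not.

// Neighbouring threads take different paths, so every warp executes
// both branches one after the other (divergence).
__global__ void divergentBranch(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}

// All 32 threads of a warp share the same value of (i / 32) % 2,
// so each warp follows a single path and no serialization occurs.
__global__ void warpUniformBranch(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}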
From the programmer's perspective, the multiprocessor is thus MIMD (Multiple-Instruction Multiple-Thread); from the performance perspective it is SIMT (Single-Instruction Multiple-Thread).

GPU Architecture

Figure: GPU architecture. The device consists of multiprocessors 1 to N.

Thread Properties

GPU threads are very lightweight compared to CPU threads.
• their run time can be very short (even tens of instructions)
• there may be (and should be) many of them
• they should not use a large amount of resources

Threads are aggregated into blocks
• blocks are run on individual multiprocessors
• having a sufficient number of blocks is essential for good scalability

The number of threads and thread blocks per multiprocessor is limited.
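A minimal sketch of these lightweight threads in practice (the kernel name vectorAdd and the block size of 256 are illustrative assumptions): each thread performs only a single addition, and the work is split into many blocks so that all multiprocessors are kept busy.

// Each thread does only a tiny amount of work: one addition.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

// Host-side launch: many blocks of 256 threads.
// int threads = 256;
// int blocks = (n + threads - 1) / threads;
// vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);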
Memory Latency Masking

Memory has latency
• global memory has a high latency (hundreds of cycles)
• registers and shared memory have a read-after-write latency

Memory latency hiding differs from the CPU
• no instructions are executed out of order
• most memory types have no cache

When a warp waits for data from memory, another warp may be executed
• this allows memory latency hiding
• it requires running an order of magnitude more threads than the number of GPU cores
• thread scheduling and switching is implemented directly in HW, without overhead

The same mechanism also hides the latency of synchronization.

Thread-Local Memory

Registers
• the fastest memory, directly usable in instructions
• local variables in a kernel and variables holding intermediate results are placed in registers automatically
  • if there is a sufficient number of registers
  • if the compiler can determine static array indexing
• thread (warp) scoped

Local memory
• data that does not fit into the registers goes into local memory
• local memory is stored in DRAM
• thread (warp) scoped

Block-Local Memory

Shared memory
• as fast as registers on c. c. 1.x
  • if memory bank conflicts are avoided
  • instructions can use only one operand in shared memory (otherwise an explicit load/store is needed)
• declared using __shared__ in C for CUDA
• a variable in shared memory can have a dynamic size (determined at kernel startup) if it is declared as extern without a size specification
• block scoped

Shared Memory

Static shared memory declaration
__shared__ float myArray[128];

Dynamic allocation
extern __shared__ char myArray[];
float *array1 = (float*)myArray;
int   *array2 = (int*)&array1[128];
short *array3 = (short*)&array2[256];

This creates an array array1 of 128 floats, an array array2 of 256 ints, and an array array3 whose size is given by the remaining dynamically allocated memory. The total size has to be specified at kernel startup:
myKernel<<<grid, block, n>>>();

GPU-Local Memory

Global memory
• an order of magnitude lower bandwidth compared to shared memory
• latency in the order of hundreds of GPU cycles
• addressing needs to be aligned to get optimum performance
• application scoped
• L1 cache (128-byte lines) and L2 cache (32-byte lines) on the Fermi architecture

It may be allocated dynamically using cudaMalloc or statically using a __device__ declaration.
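A minimal sketch contrasting the two allocation styles just mentioned (the names d_lookupTable and prepareData, as well as the sizes, are illustrative):

#include <cuda_runtime.h>

// Statically allocated global memory, application scoped.
__device__ float d_lookupTable[1024];

// Dynamically allocated global memory (host side).
void prepareData(const float *h_data, float **d_data, size_t n) {
    cudaMalloc((void**)d_data, n * sizeof(float));                           // allocate on the GPU
    cudaMemcpy(*d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);  // copy the input in
}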
GPU-Local Memory

Constant memory
• read-only
• cached
• a cache hit is as fast as a register (under certain constraints), a cache miss is as fast as global memory
• limited size (64 KB on currently available GPUs)
• application scoped

Constant Memory

Declared using the __constant__ keyword; the following function is used to copy data into constant memory:

cudaError_t cudaMemcpyToSymbol(const char *symbol, const void *src, size_t count, size_t offset, enum cudaMemcpyKind kind);

Data is copied from system memory (cudaMemcpyHostToDevice) or from global memory (cudaMemcpyDeviceToDevice), from src into symbol. The copied block has count bytes and is written at the given offset within the symbol's memory.

GPU-Local Memory

Texture memory
• cached, with 2D locality
• read-only (for cache-coherency reasons)
• high latency
• several addressing modes
  • normalization of coordinates into the [0, 1] range
  • clamping or wrapping of out-of-range coordinates
• optional data filtering
  • linear interpolation or nearest value
• this functionality is "for free" (implemented in HW)

More details are available in the CUDA Programming Guide.

System-Local Memory

System RAM
• connected to the GPU via PCIe
• CPU (host) and GPU (device) memory transfers are complicated by virtual addressing
• it is possible to allocate so-called page-locked memory areas
  • overall system performance may be reduced
  • limited size
  • data is transferred faster over PCIe
  • allows kernel execution and data copying to overlap
  • allows mapping of the host address space onto the device
  • allows write-combining access (data is not cached by the CPU)

Page-Locked Memory

cudaMallocHost() is used instead of malloc() to allocate the memory; the memory is freed using cudaFreeHost()
• the cudaHostAllocPortable flag makes the memory page-locked for all CPU threads
• the cudaHostAllocWriteCombined flag turns off CPU caching of the allocated memory
• the cudaHostAllocMapped flag maps the host memory into the device address space

Page-Locked Memory

Mapped memory
• the same location has a different address in device code and in host code
• the device address may be obtained using cudaHostGetDevicePointer()
• before calling any other CUDA API function, it is necessary to call cudaSetDeviceFlags() with the cudaDeviceMapHost flag

Asynchronous transfers
• API functions with the Async suffix
• both overlaps are possible: data transfer with CPU computation, and data transfer with GPU computation (a more detailed explanation will come with streams)

Non-cached (write-combined) memory
• slow access from host code
• faster access from the device side
• the CPU cache does not get flushed
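A hedged sketch of how the calls above fit together for mapped (zero-copy) page-locked memory; cudaHostAlloc is used here as the flag-taking allocation variant, and the buffer size is illustrative:

#include <cuda_runtime.h>

int main() {
    // Must be called before any other CUDA call that creates a context.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *h_buf, *d_buf;
    size_t n = 1 << 20;

    // Page-locked host memory mapped into the device address space.
    cudaHostAlloc((void**)&h_buf, n * sizeof(float), cudaHostAllocMapped);

    // The device-side address of the same memory differs from the host address.
    cudaHostGetDevicePointer((void**)&d_buf, h_buf, 0);

    // ... kernels may now read and write d_buf directly over PCIe ...

    cudaFreeHost(h_buf);
    return 0;
}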
Synchronization within the Block

• native barrier synchronization
  • all threads have to reach it (beware of conditional code!)
  • a single instruction, very fast if it does not degrade parallelism
  • called as __syncthreads() in C for CUDA
  • Fermi extensions: count, and, or variants
• communication through shared memory
  • threads can exchange data
  • synchronized using atomic variables or a barrier

Atomic Operations

• perform read-modify-write operations on shared or global memory
• no interference with other threads
• for 32-bit and 64-bit integers (c. c. ≥ 1.2) and for float (c. c. ≥ 2.0)
• on global memory for c. c. ≥ 1.1, on shared memory for c. c. ≥ 1.2
• arithmetic (Add, Sub, Exch, Min, Max, Inc, Dec, CAS) and bitwise (And, Or, Xor) operations

Warp Voting

All threads in a warp evaluate the same predicate, and the results are combined across the warp. Available for c. c. ≥ 1.2.

int __all(int predicate);
The result is non-zero iff the predicate is non-zero for all threads in the warp.

int __any(int predicate);
The result is non-zero iff the predicate is non-zero for at least one thread in the warp.

unsigned int __ballot(int predicate);
Returns a bit mask containing the vote of each individual thread (c. c. ≥ 2.0).
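A small hedged sketch of using a warp vote (the kernel, the helper expensiveWork, and the threshold are illustrative assumptions): the whole warp either takes the expensive path together or skips it together, so the vote itself does not introduce divergence.

// Hypothetical per-element work standing in for an expensive computation.
__device__ float expensiveWork(float v) {
    return sqrtf(v) * logf(v + 1.0f);
}

__global__ void filterWarps(const float *data, float *out, float threshold, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;
    float v = data[i];

    // Every active thread of the warp evaluates the predicate on its own value.
    // If it holds for no thread, the whole warp skips the expensive path at once.
    if (__any(v > threshold))
        out[i] = expensiveWork(v);
    else
        out[i] = 0.0f;
}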
Synchronization of Memory Operations

Shared memory is usually used for communication among threads or as a cache for the data the threads work on.
• threads use data stored by other threads
• it is necessary to ensure that we do not read data that is not available yet
• if we need to wait, we use __syncthreads()

Synchronization of Memory Operations

The compiler can optimize operations on shared/global memory (intermediate results may be kept in registers) and can reorder them
• if we need to ensure that the data is visible to other threads, we use __threadfence() or __threadfence_block()
• if a variable is declared as volatile, all load/store operations are performed in shared/global memory
  • very important if we rely on implicit warp synchronization

Block Synchronization

Among blocks
• global memory is visible to all blocks
• native support for synchronization is poor
  • no global barrier
  • atomic operations on global memory on newer GPUs
• a global barrier can be implemented using separate kernel calls (other solutions are quite tricky)
• the limited options for global synchronization make programming harder but allow for very good scalability

Global Synchronization Using Atomic Operations

Problem: summing the elements of a vector
• each block sums the elements of its part of the vector
• global barrier
• one block sums the results of all blocks

__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;

__global__ void sum(const float *array, unsigned int N, float *result) {
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        result[blockIdx.x] = partialSum;
        __threadfence();
        unsigned int value = atomicInc(&count, gridDim.x);
        isLastBlockDone = (value == (gridDim.x - 1));
    }
    __syncthreads();

    if (isLastBlockDone) {
        float totalSum = calculateTotalSum(result);
        if (threadIdx.x == 0) {
            result[0] = totalSum;
            count = 0;
        }
    }
}

Matrix Multiplication

We want to multiply matrices A and B and store the result in C. For the sake of simplicity, we only consider matrices of size n x n.

$C_{i,j} = \sum_{k=1}^{n} A_{i,k} \cdot B_{k,j}$

In C:

for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
        C[i*n + j] = 0.0;
        for (int k = 0; k < n; k++)
            C[i*n + j] += A[i*n + k] * B[k*n + j];
    }
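As a hedged sketch of what this looks like on the GPU (the kernel name matmulNaive and the 16 x 16 block size are illustrative assumptions, not part of the slides): one thread computes one element of C, directly following the formula above.

__global__ void matmulNaive(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; k++)
            acc += A[row * n + k] * B[k * n + col];  // C_ij = sum_k A_ik * B_kj
        C[row * n + col] = acc;
    }
}

// Host-side launch:
// dim3 block(16, 16);
// dim3 grid((n + 15) / 16, (n + 15) / 16);
// matmulNaive<<<grid, block>>>(d_A, d_B, d_C, n);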