GPU Acceleration of General Computing Tasks

Jiří Filipovič

spring 2023

Motivation - arithmetic performance of GPUs

[Figure: theoretical GFLOP/s of GPUs vs. CPUs, 2001-2014]

Motivation - memory bandwidth of GPUs

[Figure: theoretical memory bandwidth in GB/s, CPUs (Northwood and newer) vs. GPUs (up to GeForce 780 Ti), 2003-2013]

Motivation - programming complexity

OK, so GPUs are fast, but aren't they much more difficult to program?
• well, it is much more complicated than writing serial C++ code...
• but is that a fair comparison?

Moore's Law: the number of transistors that can be placed on a single chip doubles roughly every 18 months.

The performance growth is driven by:
• in the past: higher frequency, instruction-level parallelism, out-of-order instruction execution, etc.
• nowadays: wider vector instructions, more cores

Motivation - the paradigm shift

Consequences of Moore's Law:
• in the past: changes in processor architecture were relevant mainly to compiler developers
• nowadays: we need to explicitly parallelize and vectorize the code to keep scaling the performance
  • still a lot of work for developers, as compilers have very limited capabilities here
  • writing really efficient code is similarly difficult for GPUs and CPUs

What makes GPU powerful?
Parallelism types
• Task parallelism
  • the problem is decomposed into parallel tasks
  • tasks are typically complex and can perform different jobs
  • complex synchronization
  • best for a lower number of high-performance processors/cores
• Data parallelism
  • parallelism at the level of data structures
  • typically the same operation on multiple elements of a data structure
  • can be executed on simpler processors

What makes GPU powerful?

Programmer's point of view
• some problems are more task-parallel, some more data-parallel (tree traversal vs. vector addition)

Hardware designer's point of view
• processors for data-parallel computations can be simpler, so we can get more arithmetic power per square centimeter (i.e., for the same amount of transistors)
• simpler memory access patterns allow building a memory with higher bandwidth

GPU Architecture

CPU vs. GPU
• hundreds of ALUs in tens of cores vs. tens of thousands of ALUs in tens of multiprocessors
• out-of-order vs. in-order execution
• MIMD, and SIMD for short vectors, vs. SIMT for long vectors
• big caches vs. small caches, often read-only

GPUs use more transistors for ALUs than for caches and instruction control => higher peak performance, less universal.

A high-end GPU:
• a co-processor with dedicated memory
• asynchronous instruction execution
• connected via PCI-E to the rest of the system

CUDA

CUDA (Compute Unified Device Architecture)
• an architecture for parallel computations developed by NVIDIA
• a programming model allowing to implement general-purpose programs on GPUs
• can be used with multiple programming languages

Processor G80
• the first CUDA processor
• contains 16 multiprocessors
• a multiprocessor:
  • 8 scalar processors
  • 2 special function units
  • up to 768 threads
  • HW switching and scheduling of threads
• threads are grouped into warps of 32 threads
• SIMT execution
• native synchronization within a multiprocessor
Memory model of G80
• 8192 registers shared among all threads within a multiprocessor
• 16 KB shared memory
  • local within a multiprocessor
  • close to the registers' speed (under some circumstances)
• constant memory
  • cached, optimized for broadcast, read-only
• texture memory
  • cached, 2D spatial locality, read-only
• global memory
  • read-write, not cached
• transfers between system and global memory go via PCI-E

Processor G80

[Figure: G80 device composed of multiprocessors 1..N]

Newer GPUs

Similar architecture, new features:
• double precision
• relaxed rules for efficient access to global memory
• L1 and L2 data caches
• a higher amount of on-chip resources (registers, shared memory, threads, etc.)
• wider synchronization options (e.g., atomic operations)
• nested parallelism
• unified memory

C for CUDA

C for CUDA extends the C/C++ language for parallel computations on GPUs:
• explicit separation of host (CPU) and device (GPU) code
• threads hierarchy
• memory hierarchy
• synchronization mechanisms
• API (context manipulation, memory management, error handling, etc.)

Threads hierarchy
• threads are organized into thread-blocks
• thread-blocks form a grid
• a computational problem is typically decomposed into independent sub-problems, solved by thread-blocks
• sub-problems are further parallelized and solved by (potentially collaborating) threads
• this ensures good scaling

[Figure: a 2D grid of thread-blocks, Block (0,0) .. Block (2,1); each block is a 2D array of threads, Thread (0,0) .. Thread (3,2)]

Memory hierarchy

Multiple types of memory:
• differ in visibility
• differ in life-time
• differ in latency and bandwidth

[Figure: per-thread local memory and per-block shared memory]
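For illustration, a minimal sketch (not from the slides; all names are mine) of how the visibility of these memory spaces looks in C for CUDA source code — the __device__ and __shared__ declarations are covered in detail later in the lecture:

__device__ float globalBuf[256];        // global memory: visible to every thread, application life-time

__global__ void scopes(void) {
    __shared__ float blockBuf[256];     // shared memory: visible within one thread-block only
    float mine = (float)threadIdx.x;    // register (or local memory): private to a single thread

    blockBuf[threadIdx.x] = mine;       // assumes blockDim.x <= 256
    globalBuf[threadIdx.x] = blockBuf[threadIdx.x];
}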
Example - vector addition

We want to add vectors a and b, and store the result into vector c.

We need to parallelize the problem. Serial code:

for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];

The iterations are independent - easy to parallelize, scales with the vector size. The i-th thread adds the i-th elements of a and b:

c[i] = a[i] + b[i];

How do we find out which index to pick?

Identification of a thread and a block

Each thread in C for CUDA has built-in variables:
• threadIdx.{x, y, z} contains the position of the thread within its block
• blockDim.{x, y, z} contains the size of the block
• blockIdx.{x, y, z} contains the position of the block within the grid
• gridDim.{x, y, z} contains the size of the grid

Example - vector addition

We need to compute the global position of the thread (using 1D blocks and a 1D grid):

int i = blockIdx.x*blockDim.x + threadIdx.x;

The complete function for the parallel vector addition:

__global__ void addvec(float *a, float *b, float *c) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

The code defines a kernel (a parallel function executed on the GPU). When executing a kernel, the block size and the number of blocks have to be defined.
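The same indexing pattern generalizes to more dimensions. A minimal sketch (a hypothetical kernel, not from the slides) computing a 2D global position for element-wise matrix addition:

__global__ void addmat(float *a, float *b, float *c, int w, int h) {
    // global 2D position of this thread within the grid
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x < w && y < h)                  // guard against out-of-range threads
        c[y*w + x] = a[y*w + x] + b[y*w + x];
}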
Function type quantifiers

The syntax of C is extended by function type quantifiers, determining from where a function can be called and where it is executed:
• __device__ functions are executed on the device (GPU) and called from device code
• __global__ functions are executed on the device and called from host (CPU) code
• __host__ functions are executed on the host and called from the host
• __host__ and __device__ can be combined; the function is then compiled for both host and device, and can be called from both host and device
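A minimal sketch (hypothetical functions, not from the slides) of the combined qualifier:

// compiled twice, once for the CPU and once for the GPU;
// callable from both host and device code
__host__ __device__ float axpy(float alpha, float x, float y) {
    return alpha * x + y;
}

__global__ void scale_add(float *x, float *y, float alpha) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    y[i] = axpy(alpha, x[i], y[i]);   // device-side call
}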
Example - vector addition

For the complete computation of the vector addition, we need to:
• allocate memory for the vectors and fill it with some data
• allocate GPU memory
• copy vectors a and b to the GPU memory
• compute the vector addition on the GPU
• copy the result back from the GPU memory into c
• use c somehow :-)

When managed memory is used (supported from compute capability 3.0 and CUDA 6.0), we do not need to perform the explicit allocation and copy steps.

Example - vector addition

CPU code that fills a and b, and prints c:

#include <stdio.h>
#define N 64

int main() {
    float *a, *b, *c;
    cudaMallocManaged(&a, N*sizeof(*a));
    cudaMallocManaged(&b, N*sizeof(*b));
    cudaMallocManaged(&c, N*sizeof(*c));

    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = i*2;
    }

    // placeholder for GPU computation

    for (int i = 0; i < N; i++)
        printf("%f, ", c[i]);

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}

GPU memory management

We use managed memory, so CUDA automatically copies data between the CPU and the GPU:
• memory coherency is ensured automatically
• we cannot access managed memory while any GPU kernel is running (even if it does not touch the buffer we want to use)

Alternatively, we can allocate memory and copy data explicitly:

cudaMalloc(void** devPtr, size_t count);
cudaFree(void* devPtr);
cudaMemcpy(void* dst, const void* src, size_t count,
           enum cudaMemcpyKind kind);
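For illustration, a minimal sketch (not from the slides; the function and buffer names are mine) of the same host-side flow with explicit allocation and copies instead of managed memory:

#include <stdlib.h>

#define N 64

void vecadd_explicit(void) {
    // host buffers, filled as in the managed-memory version
    float *ha = (float*)malloc(N*sizeof(*ha));
    float *hb = (float*)malloc(N*sizeof(*hb));
    float *hc = (float*)malloc(N*sizeof(*hc));
    for (int i = 0; i < N; i++) { ha[i] = i; hb[i] = i*2; }

    // device buffers
    float *da, *db, *dc;
    cudaMalloc((void**)&da, N*sizeof(*da));
    cudaMalloc((void**)&db, N*sizeof(*db));
    cudaMalloc((void**)&dc, N*sizeof(*dc));

    // copy the inputs to the GPU
    cudaMemcpy(da, ha, N*sizeof(*da), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, N*sizeof(*db), cudaMemcpyHostToDevice);

    // the kernel launch (shown below) would go here

    // copy the result back to the host
    cudaMemcpy(hc, dc, N*sizeof(*dc), cudaMemcpyDeviceToHost);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
}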
Example - vector addition

Kernel execution:
• the kernel is called like a C function; between the name and the arguments, there are triple angle brackets with the specification of the grid and block size
• we need to know the block size and the number of blocks
• we will use a 1D block and grid with a fixed block size
• the size of the grid is determined so that the whole vector sum is computed

For a vector size divisible by 32:

#define BLOCK 32
addvec<<<N/BLOCK, BLOCK>>>(a, b, c);
cudaDeviceSynchronize();

The synchronization after the kernel call ensures that c is accessed by the host code only after the kernel finishes.

Example - vector addition

How do we handle a general vector size? We modify the kernel source:

__global__ void addvec(float *a, float *b, float *c, int n) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

And call the kernel with a sufficient number of threads:

addvec<<<(N + BLOCK - 1)/BLOCK, BLOCK>>>(a, b, c, N);

Compilation

Now we just need to compile it :-)

nvcc -o vecadd vecadd.cu

Thread-local memory

Registers
• the fastest memory, directly used by instructions
• local variables and intermediate results are stored in registers
  • if there are enough registers
  • if the compiler can determine array indexes at compile time
• life-time of a thread

Local memory
• what does not fit into registers goes to the local memory
• physically stored in the global memory; has longer latency and lower bandwidth
• life-time of a thread

Block-local memory

Shared memory
• the speed is close to registers
  • if there are no bank conflicts
  • typically requires some load/store instructions
• declared with __shared__
• can have a dynamic size (determined at kernel execution), if declared as extern without specification of the array size
• life-time of a thread block

GPU-local memory

Global memory
• an order-of-magnitude lower bandwidth compared to the shared memory
• latency in hundreds of GPU clock cycles
• coalesced access is necessary for efficiency
• life-time of an application
• can be cached (depending on GPU architecture)

Dynamic allocation with cudaMalloc, static allocation by declaring a variable as __device__.

Other memories
• constant memory
• texture memory
• system memory

Thread block-scope synchronization

Native barrier:
• has to be visited by all threads within a thread-block
• a single instruction, very fast if it does not reduce parallelism
• __syncthreads()
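For illustration, a minimal sketch (not from the slides; the kernel name and the power-of-two block size are my assumptions) combining dynamically sized shared memory with the block-wide barrier — a block-local sum reduction:

__global__ void blockSum(const float *in, float *out) {
    extern __shared__ float buf[];             // dynamic shared memory, sized at launch
    unsigned int tid = threadIdx.x;

    buf[tid] = in[blockIdx.x*blockDim.x + tid];
    __syncthreads();                           // all loads done before reading neighbors

    // tree reduction; assumes blockDim.x is a power of two
    for (unsigned int s = blockDim.x/2; s > 0; s >>= 1) {
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = buf[0];              // one partial sum per block
}

// launched e.g. as: blockSum<<<numBlocks, blockSize, blockSize*sizeof(float)>>>(in, out);
// the third launch parameter sets the size of the extern __shared__ array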
Atomic operations
• read-modify-write operations on shared or global memory, performed without interference from other threads
• for 32-bit and 64-bit integers (compute capability ≥ 1.2; float add with c.c. ≥ 2.0)
• arithmetic (Add, Sub, Exch, Min, Max, Inc, Dec, CAS) and bitwise (And, Or, Xor) operations

Synchronization of memory operations

The compiler can optimize access to shared and global memory by placing intermediate results into registers, and it can change the order of memory operations:
• __threadfence() and __threadfence_block() can be used to ensure that the data we are storing are visible to others
• variables declared as volatile are always read/written from/to the global or shared memory

Thread-block synchronization

Thread-block communication:
• the global memory is visible to all blocks
• but the possibilities to synchronize between blocks are weak
  • in general, there is no global barrier (it can be implemented if all blocks are persistent on the GPU)
  • atomic operations can solve some problems
  • a generic global barrier only by kernel invocation
• harder to program, but allows better scaling

Global synchronization via atomic operations

An alternative implementation of vector reduction:
• each thread-block reduces a subvector
• the last running thread-block adds the results of all thread-blocks
• an implementation of a weak global barrier: after blocks 1..n-1 finish, block n continues

__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;

__global__ void sum(const float* array, unsigned int N, float* result) {
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        result[blockIdx.x] = partialSum;
        __threadfence();
        unsigned int value = atomicInc(&count, gridDim.x);
        isLastBlockDone = (value == (gridDim.x - 1));
    }
    __syncthreads();

    if (isLastBlockDone) {
        float totalSum = calculateTotalSum(result);
        if (threadIdx.x == 0) {
            result[0] = totalSum;
            count = 0;
        }
    }
}

Materials

CUDA documentation (part of the CUDA Toolkit, downloadable from developer.nvidia.com):
• CUDA C Programming Guide (CUDA essentials)
• CUDA C Best Practices Guide (more details on optimization)
• CUDA Reference Manual (complete C for CUDA API reference)
• a lot of other useful documents (nvcc manual, documentation of PTX and assembly, documentation of various accelerated libraries, etc.)

CUDA, Supercomputing for the Masses
• http://www.ddj.com/cpp/207200659

Today, we learned:
• what CUDA is good for
• basic GPU architecture
• basic C for CUDA programming

In the next lecture, we will focus on:
• how to write efficient GPU code