About The Class Motivation GPU Architecture CUDA Sample Code Conclusions OOOOO ooooooooooo ooooooo oooooo oooooooooooo Introduction, CUDA Basics Jiří Fi li povič Fall 2013 Jiří Filipovič Introduction, CUDA Basics About The Class •OOOO Motivation GPU Architecture OOOOOOOOOOO OOOOOOO CUDA oooooo Sample Code oooooooooooo Conclusions What is included The class is focused on algorithm design and programming of general purpose computing applications on graphical processors Jiří Filipovič Introduction, CUDA Basics About The Class •OOOO Motivation GPU Architecture ooooooooooo ooooooo CUDA oooooo Sample Code Conclusions oooooooooooo What is included The class is focused on algorithm design and programming of general purpose computing applications on graphical processors We will learn: • design of parallel algorithms with focus on utilization of programming model available in todays GPU Jiří Filipovič Introduction, CUDA Basics About The Class •OOOO Motivation GPU Architecture OOOOOOOOOOO OOOOOOO CUDA oooooo Sample Code oooooooooooo Conclusions What is included The class is focused on algorithm design and programming of general purpose computing applications on graphical processors We will learn: • design of parallel algorithms with focus on utilization of programming model available in todays GPU • CUDA-based GPU architectures Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions •oooo ooooooooooo ooooooo oooooo oooooooooooo What is included The class is focused on algorithm design and programming of general purpose computing applications on graphical processors We will learn: • design of parallel algorithms with focus on utilization of programming model available in todays GPU • CUDA-based GPU architectures • programming in C for CUDA Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions •oooo ooooooooooo ooooooo oooooo oooooooooooo What is included The class is focused on algorithm design and programming of general purpose computing applications on graphical processors We will learn: • design of parallel algorithms with focus on utilization of programming model available in todays GPU • CUDA-based GPU architectures • programming in C for CUDA • tools and libraries Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions •oooo ooooooooooo ooooooo oooooo oooooooooooo What is included The class is focused on algorithm design and programming of general purpose computing applications on graphical processors We will learn: • design of parallel algorithms with focus on utilization of programming model available in todays GPU • CUDA-based GPU architectures • programming in C for CUDA • tools and libraries • code optimization for CUDA Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions •oooo ooooooooooo ooooooo oooooo oooooooooooo What is included The class is focused on algorithm design and programming of general purpose computing applications on graphical processors We will learn: • design of parallel algorithms with focus on utilization of programming model available in todays GPU • CUDA-based GPU architectures • programming in C for CUDA • tools and libraries • code optimization for CUDA • case studies Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions •oooo ooooooooooo ooooooo oooooo oooooooooooo What is included The class is focused on algorithm design and programming of general purpose computing applications on graphical processors We will learn: • design of parallel algorithms with focus on utilization of programming model available in todays GPU • CUDA-based GPU architectures • programming in C for CUDA • tools and libraries • code optimization for CUDA • case studies The class is practically orented - GPU is constant-times faster than CPU, therefore besides time complexity, writing an optimal code is important. Jiří Filipovič Introduction, CUDA Basics About The Class oaooo Motivation ooooooooooo GPU Architecture ooooooo Sample Code oooooooooooo What is expected from you During the semester, you will work on a practically oriented project • important part of your total score in the class • the same task for everybody, we will compare speed of your implementation • 50 + 20 points of total score • working code: 25 points • efficient implementation: 25 points • speed of your code relative to your class mates: 20 points (only to improve your final grading) Exam (oral or written, depending on the number of students) • 50 points Introduction, CUDA Basics m -OQ.O About The Class Motivation GPU Architecture CUDA Sample Code Conclusions oo*oo ooooooooooo ooooooo oooooo oooooooooooo Grading For those finishing by exam: • A: 92-100 • B: 86-91 • C: 78-85 • D: 72-77 « E: 66-71 • F: 0-65 pts For those finishing by colloquium: • 50 pts Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions 000»0 OOOOOOOOOOO OOOOOOO OOOOOO OOOOOOOOOOOO Materials - CUDA CUDA documentation (installed as a part of CUDA Toolkit, downloadable from developer.nvidia.com) • CUDA C Programming Guide (most important properties of CUDA) • CUDA C Best Practices Guide (more detailed document focusing on optimizations) • CUDA Reference Manual (complete description of C for CUDA API) • other useful documents (nvcc guide, PTX language description, library manuals, ...) University of Illinois textbook • available from http://courses.ece. illinois.edu/ece498/al/Syl la bus. htm I CUDA article series, Supercomputing for the Masses • http: //www.ddj.com/cpp/207200659 Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions oooo* ooooooooooo ooooooo oooooo oooooooooooo Materials - Parallel Programming • Ben-Ari M., Principles of Concurrent and Distributed Programming, 2nd Ed. Addison-Wesley, 2006 • Timothy G. Mattson, Beverly A. Sanders, Berna L. Massingill, Patterns for Parallel Programming, Addison-Wesley, 2004 Jiří Filipovič Introduction, CUDA Basics About The Class OOOOO Motivation #0000000000 GPU Architecture OOOOOOO CUDA OOOOOO Sample Code OOOOOOOOOOOO Conclusions Motivation - Moore's Law Moore's Law Number of transistors on a single chip doubles every 18 months Jiří Filipovič Introduction, CUDA Basics loore's Law Number of transistors on a single chip doubles every 18 months Corresponding growth of performance comes from • in the past: frequency increase, parallelism of instructions, of-of-order instruction processing, caches, etc. • today: vector instructions, increase in number of cores Introduction, CUDA Basics m -OQ.O About The Class Motivation GPU Architecture ooooo o«ooooooooo ooooooo Motivation - paradigm change Moore's Law consequences: • in the past: speed of a single-threaded program doubled each 18 months • changes were important for compiler developers; application developers didn't need to worry • today: speed of prcessing of a parallel program having sufficient number of processes/threads doubles every 18 months • in order to utilize state-of-the-art processors, it is necessary to devleop parallel algorithms • it is necessary to find parallelism in the problem being solved, which is a task for a programmer, not for a compiler (at least for now) Jiří Filipovič Introduction, CUDA Basics Motivation oo«oooooooo GPU Architecture ooooooo Sample Code oooooooooooo Motivation - Types of Parallelism • Task parallelism • decomposition of a task into the problems that may be processed in parallel • usually more complex tasks performing different actions • ideal for small number of high-performance processor goals • more frequent (and complex) synchronization, usually • Data parallelism • paralellism on the level of data structures • usually the same operations on many items of a data structure • finer-grained parallelism allows for simple construction of individual processors Introduction, CUDA Basics m -OQ.O Motivation ooo«ooooooo GPU Architecture ooooooo Sample Code oooooooooooo Motivation - Types of Parallelism • from programmer's perspective • different paradigm requires different approach to algorithm design • some problems are rather data-parallel, some task-parallel • from hardware perspective • processors for data-parallel tasks may be simpler • it si possible to achieve higher arithmetic performance with the same number of processors • simpler memory access patterns allow for high-throughput memory designs Introduction, CUDA Basics m -OQ.O About The Class OOOOO Motivation 0000*000000 GPU Architecture CUDA ooooooo oooooo Sample Code Conclusions OOOOOOOOOOOO Motivace - Graphical Computations • Data parallel • the same task implemented for each pixel/vertex • Predefined functions • Programmable functions o special graphics effects • GPU become more and more programmable • it is possible to implement also non-graphics tasks Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions ooooo ooooo«ooooo ooooooo oooooo oooooooooooo Motivation - Performance Jiří Filipovič Introduction, CUDA Basics About The Class OOOOO Motivation GPU Architecture oooooo»oooo ooooooo CUDA oooooo Sample Code Conclusions oooooooooooo Motivation - Performance Theoretical GB/s GeFo rc e GTX 680 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA ooooo ooooooo«ooo ooooooo oooooo Motivation - Summary Sample Code Conclusions OOOOOOOOOOOO • GPUs are powerful • an order of magnitude performance increase is worth studying a new programming model • for full utilization of modern GPUs and CPUs , parallel programming is necessary • parallel architecture of GPUs ceases to be an order of magnitude harder to master • GPUs are widespread • cheap • lots of users have a desktop supercomputer Jiří Filipovič Introduction, CUDA Basics About The Class OOOOO Motivation GPU Architecture OOOOOOOO0OO OOOOOOO CUDA oooooo Sample Code Conclusions OOOOOOOOOOOO Motivation - Applications Use of GPU for general computations is a dynamically developing field with broad applicability Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions ooooo oooooooo»oo ooooooo oooooo oooooooooooo Motivation - Applications Use of GPU for general computations is a dynamically developing field with broad applicability • high-performance scientific calculations • computational chemistry • physical simulations • image processing o and others. .. Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions ooooo oooooooo»oo ooooooo oooooo oooooooooooo Motivation - Applications Use of GPU for general computations is a dynamically developing field with broad applicability • high-performance scientific calculations • computational chemistry • physical simulations • image processing o and others. .. • performance-hungry home and desktop applications • encoding/decoding of multimedia data • game physics • image editing, 3D rendering o etc. Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions ooooo ooooooooo«o ooooooo oooooo oooooooooooo Motivation - Applications SW developers are still a sought-for scarce resource... Jiří Filipovič Introduction, CUDA Basics About The Class OOOOO Motivation GPU Architecture ooooooooo«o OOOOOOO CUDA oooooo Sample Code Conclusions oooooooooooo Motivation - Applications SW developers are still a sought-for scarce resource... SW developers capable of parallel SW development are extremely sought-for scarce resource Jiří Filipovič Introduction, CUDA Basics Motivation ooooooooo«o GPU Architecture ooooooo Sample Code oooooooooooo Motivation - Applications SW developers are still a sought-for scarce resource... SW developers capable of parallel SW development are extremely sought-for scarce resource A lot of existing software is not parallel » it is necessary to make it parallel in order to increase performance • and somebody has to do it :-) Introduction, CUDA Basics m -OQ.O About The Class Motivation GPU Architecture CUDA Sample Code Conclusions OOOOO 0000000000» ooooooo oooooo oooooooooooo Historic Excursion • SIMD model since '60s • Solomon project by Westinghouse company at the beginning of '60s • transferred to University of Illinois as ILLIAC IV • separate ALU for each data element - massively parallel • original plan: 256 ALUs, 1 GFLOPS • finished in 1972, 64 ALUs, 100-150 MFLOPS • in '80s-90s: vector supercomputers, TOP500 • in todays CPUs: SSE (x86), ActiVec (PowerPC) • Cg: programming vertex and pixel shaders in graphics grads (cca 2003) • CUDA: general GPU programming, SIMT model (first released on 15. February 2007) • future? • OpenCL o higher programming languages, automatic parallelization Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture OOOOO OOOOOOOOOOO COOOOOO CUDA oooooo Sample Code oooooooooooo Conclusions GPU Architecture CPU vs. GPU • couple of cores vs. vs. tens of multiprocessors • out of order vs. in order • MIMD, SIMD short vectors vs. SIMT for long vectors • large cache vs. small cache, often read-only GPU uses more transistors for computating units then for cache and control =>■ higher performance, less flexibility Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture ooooo ooooooooooo oo«oooo CUDA oooooo Sample Code oooooooooooo Conclusions GPU Architecture Within the system: • co-processor with dedicated memory • asychnornous processing of instructions • attached using PCI-E to the rest of the system Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture ooooo ooooooooooo ooo»ooo CUDA oooooo Sample Code oooooooooooo Conclusions G80 Processor G80 • first CUDA processor • 16 multiprocessors • each multiprocessor • 8 scalar processors • 2 units for special functions • up to 768 threads • HW for thread switching and scheduling • threads are grouped into warps by 32 • SIMT • native synchronization within the multiprocessor Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions OOOOO OOOOOOOOOOO oooo«oo OOOOOO OOOOOOOOOOOO G80 Memory Model Memory model • 8192 registers shared among all threads of a multiprocessor • 16 kB of shared memory • local within the multiprocessor • as fast as registry (under certain constraints) • constant memory • cached, read-only • texture memory • cached with 2D locality, read-only • global memory • non cached, read-write • data transfers between global memory and system memory through PCI-E Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions OOOOO OOOOOOOOOOO OOOOO0O OOOOOO OOOOOOOOOOOO G80 Processor Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions OOOOO OOOOOOOOOOO 000000« oooooo oooooooooooo Further Development Processors based on G80 • double-precision calculations • relaxed rules for efficient memory access to global memory • more of on-chip resources (more registers, more threads per MP) • better sychronization options (atomic operations, warp voting) Fermi • higher parallelization on multiprocessor level (more cores, two warp schedulers, higher double-precission performance) • configurable LI and shared L2 cache • flat address space • better floating point precision • parallel run of kernels • better synchronization tools • other changes stemming from a different architecture, -00.0 Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions OOOOO OOOOOOOOOOO OOOOOOO OOOOOO ooooo»oooooo The following steps are needed for the full computation: • allocate memory for vectors and fill it with data » allocate memory on GPU • copy vectors a a b to GPU About The Class Motivation GPU Architecture CUDA Sample Code Conclusions OOOOO ooooooooooo ooooooo oooooo ooooo»oooooo The following steps are needed for the full computation: • allocate memory for vectors and fill it with data » allocate memory on GPU • copy vectors a a b to GPU • compute the sum on GPU Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions OOOOO ooooooooooo ooooooo oooooo ooooo»oooooo The following steps are needed for the full computation: • allocate memory for vectors and fill it with data » allocate memory on GPU • copy vectors a a b to GPU • compute the sum on GPU • store the result from GPU into c 4 □ ► < fil ► 4 ► 4 -Š ► -Š -o^o Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions OOOOO OOOOOOOOOOO OOOOOOO OOOOOO ooooo»oooooo The following steps are needed for the full computation: • allocate memory for vectors and fill it with data » allocate memory on GPU • copy vectors a a b to GPU • compute the sum on GPU • store the result from GPU into c • use the result in c :-) m -OQ.O Introduction, CUDA Basics About The Class OOOOO Motivation GPU Architecture OOOOOOOOOOO OOOOOOO CUDA oooooo Sample Code Conclusions oooooo«ooooo An Example - Sum of Vectors CPU code that fills a and b and computes c #include Sdefine N 64 int main(){ float a[N] , b[N] , c [ N ] ; for (int i = 0; i < N; i++) = = i; // GPU code will be here for (int i = 0; i < N; i++) printf("%f , " , c[i]); return 0; } Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions ooooo ooooooooooo ooooooo oooooo ooooooo»oooo GPU Memory Management It is necessary to allocate the memory dynamically. cudaMalloc(void** devPtr, size_t count); allocates memory of the count size and sets the pointer devPtr to it. To release the memory: cudaFree(void* devPtr); To copy the memory: cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind); copies count bytes from src to dst, kind determins copying direction (e.g., cudaMemcpyHostToDevice, or cudaMemcpyDevice ToHost). Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions ooooo ooooooooooo ooooooo oooooo oooooooo«ooo An Example - Sum of Vectors We allocate the memory and transfer the data: float *d_a, *d_b, *d_c; cudaMalloc((void**)&d_a, N*sizeof(*d_a)) cudaMalloc((vo id **)&d_b , N*sizeof(*d_b)) cudaMalloc((void**)&d_c, N*sizeof(*d_c)) cudaMemcpy(d_a, a, N*sizeof(*d_a), cudaMemcpyHostToDevice); cudaMemcpy(d_b, b, N*sizeof(*d_b), CudaMemcpyHostToDevice); // the kernel will be run here cudaMemcpy(c, d_c, N*sizeof(*c), cudaMemcpyDeviceToHost); cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); Jiří Filipovič Introduction, CUDA Basics About The Class OOOOO Motivation GPU Architecture OOOOOOOOOOO OOOOOOO CUDA oooooo Sample Code Conclusions ooooooooo»oo An Example - Sum of Vectors Running the kernel: • kernel is called as a function; between the name and the arguments, there are three brackets with specification of grid and block size • we need to know block size and their count • we will use ID block and grid with fixed block size • the size of the grid is determined in a way to compute the whole problem of vector sum For vector size dividable by 32: Sdefine BLOCK 32 addvec«(d_a , d_b , d_c ) ; How to solve a general vector size? Jiří Filipovič Introduction, CUDA Basics About The Class OOOOO Motivation GPU Architecture OOOOOOOOOOO OOOOOOO CUDA oooooo Sample Code Conclusions oooooooooo«o An Example - Sum of Vectors We will modify the kernel source: __global__ void addvec(float *a, float *b, float *c, int n){ int i = blockldx.x*blockDim.x + threadldx.x; if (i < n) c[i] = a[i] + b[i]; } And call the kernel with sufficient number of threads: addvec«(d_a , d_b , d_c , N); Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions OOOOO OOOOOOOOOOO OOOOOOO OOOOOO 00000000000» An Example - Running It Now we just need to compile it :-) nvcc -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcudart \ -o vecadd vecadd.cu Where to work with CUDA? • on a remote computer: barracuda.fi.muni.cz, airacuda.fi.muni.cz, accounts will be made • Windows stations in computer halls (will be specified later) • your own machine: download and install CUDA toolkit and SDK from developer.nvidia.com • source code used in lectures will be published as a part of course materials Jiří Filipovič Introduction, CUDA Basics About The Class Motivation GPU Architecture CUDA Sample Code Conclusions OOOOO OOOOOOOOOOO OOOOOOO OOOOOO OOOOOOOOOOOO Today we have demonstrated • why it is good to know CUDA • differences of GPUs • C for CUDA basics Introduction, CUDA Basics m -OQ.O About The Class Motivation GPU Architecture CUDA Sample Code Conclusions OOOOO OOOOOOOOOOO OOOOOOO OOOOOO OOOOOOOOOOOO Today we have demonstrated • why it is good to know CUDA • differences of GPUs • C for CUDA basics Next lecture will focus on • more detailed introduction to GPU from hardware perspective • parallelism provided by GPU • memory available to GPU • more complex examples of GPU implementations About The Class Motivation GPU Architecture CUDA Sample Code Conclusions OOOOO OOOOOOOOOOO OOOOOOO OOOOOO OOOOOOOOOOOO Today we have demonstrated • why it is good to know CUDA • differences of GPUs • C for CUDA basics Next lecture will focus on • more detailed introduction to GPU from hardware perspective • parallelism provided by GPU • memory available to GPU • more complex examples of GPU implementations An assignment for you: • try to compile your first CUDA program • play with it if you like Jiří Filipovič Introduction, CUDA Basics