About The Class Motivation GPU Architecture C for CUDA Sample Code
ooooooo oooooooo ooooooooo ooooo ooooooooooo
Introduction, CUDA Basics
in Fihpovic
Fall 2022
Jiří Filipovič        Introduction, CUDA Basics
About The Class •oooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
About the class
The class is focused on algorithm design and programming of general purpose computing applications on many-core vector processors
Jiří Filipovič        Introduction, CUDA Basics
About The Class •oooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
About the class
The class is focused on algorithm design and programming of general purpose computing applications on many-core vector processors
We will focus to CUDA GPUs first:
• C for CUDA is good for teaching (easy API, a lot of examples available, mature compilers and tools)
• restricted to NVIDIA GPUs and x86 CPUs (with PGI)
Jiří Filipovič        Introduction, CUDA Basics
About The Class •oooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
About the class
The class is focused on algorithm design and programming of general purpose computing applications on many-core vector processors
We will focus to CUDA GPUs first:
• C for CUDA is good for teaching (easy API, a lot of examples available, mature compilers and tools)
• restricted to NVIDIA GPUs and x86 CPUs (with PGI) After learning CUDA, we focus to OpenCL
• programming model very similar to CUDA, easy to learn when you already know CUDA
• can be used with various HW devices
o we will focus on code optimizations for x86, Intel MIC (Xeon Phi) and AMD GPUs
The class is practically oriented - besides efficient parallelization, we will focus on writing efficient code.
J in Fihpovic
Introduction, CUDA Basics
About The Class
O0OOOOO
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
What is offered
You will learn:
• architecture of NVIDIA and AMD GPUs, Xeon Phi
• architecture-aware design of data-parallel algorithms 9 programming in C for CUDA and OpenCL
• performance tuning and profiling
• use cases
Jiří Filipovič        Introduction, CUDA Basics
About The Class
OO0OOOO
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
What is expected from you
During the semester, you will work on a practically oriented project
a important part of your total score in the class
• the same task for everybody, we will compare speed of your implementation
9 50 + 20 points of total score
• working code: 25 points
9 efficient implementation: 25 points
• speed of your code relative to your class mates: at most 20 points (only to improve your final grading, from 1 to 20 points will be granted for projects above average)
Exam (oral or written, depending on the number of students)
50 points
Jiří Filipovič        Introduction, CUDA Basics
About The Class
OOO0OOO
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Grading
For those finishing by exam:
o	A:	92-100
9	B:	86-91
9	C:	78-85
9	D:	72-77
9	E:	66-71
9	F:	0-65 pts
For those finishing by colloquium: • 50 pts
Jiří Filipovič        Introduction, CUDA Basics
About The Class
OOOO0OO
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Materials - CUDA
CUDA documentation (installed as a part of CUDA Toolkit, downloadable from developer.nvidia.com)
9 CUDA C Programming Guide (most important properties of CUDA)
• CUDA C Best Practices Guide (more detailed document focusing on optimizations)
• CUDA Reference Manual (complete description of C for CUDA API)
9 other useful documents (nvcc guide, PTX language description, library manuals, .. .)
CUDA article series, Supercomputing for the Masses
• http://www.ddj.com/cpp/207200659
J in Fihpovic
Introduction, CUDA Basics
About The Class ooooo«o
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Materials - OpenCL
• OpenCL 1.1 Specification
• AMD Accelerated Parallel Processing Programming Guide 9 Intel OpenCL SDK Programming Guide
9 Writing Optimal OpenCL Code with Intel OpenCL SDK
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooooo*
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Materials - Parallel Programming
a Ben-Ari M., Principles of Concurrent and Distributed Programming, 2nd Ed. Addison-Wesley, 2006
9 Timothy G. Mattson, Beverly A. Sanders, Berna L. Massingill, Patterns for Parallel Programming, Addison-Wesley, 2004
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
OOOOOOO «0000000 ooooooooo ooooo ooooooooooo
Motivation - GPU arithmetic performance
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
ooooooo o»oooooo ooooooooo ooooo ooooooooooo
Motivation - GPU memory bandwidth
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
OOOOOOO OO0OOOOO ooooooooo ooooo ooooooooooo
Motivation - programming complexity
OK, GPUs are more powerful, but GPU programming is substantially more difficult, right?
9 well, it is more difficult comparing to writing serial C/C++ code...
9 but can we compare it to serial code?
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
OOOOOOO OO0OOOOO ooooooooo ooooo ooooooooooo
Motivation - programming complexity
OK, GPUs are more powerful, but GPU programming is substantially more difficult, right?
9 well, it is more difficult comparing to writing serial C/C++ code...
9 but can we compare it to serial code?
Moore's Law
Number of transistors on a single chip doubles every 18 months J
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
OOOOOOO OO0OOOOO ooooooooo ooooo ooooooooooo
Motivation - programming complexity
OK, GPUs are more powerful, but GPU programming is substantially more difficult, right?
9 well, it is more difficult comparing to writing serial C/C++ code...
9 but can we compare it to serial code?
Moore's Law
Number of transistors on a single chip doubles every 18 months
Corresponding growth of performance comes from
• in the past: frequency increase, instruction parallelism, out-of-order instruction processing, caches, etc.
• today: vector instructions, increase in number of cores
J in Fihpovic
Introduction, CUDA Basics
About The Class
ooooooo
Motivation OOO0OOOO
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Motivation - paradigm change
Moore's Law consequences:
• in the past:changes were important for compiler developers; application developers didn't need to worry
o today: in order to utilize state-of-the-art processors, it is necessary to write parallel and vectorized code
• it is necessary to find parallelism in the problem being solved, which is a task for a programmer, not for a compiler (at least for now)
• writing efficient code for modern CPUs is similarly difficult as writing for GPUs
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation oooo«ooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Electrostatic Potential Map
Important problem from computational chemistry
9 we have a molecule defined by position and charges of its atoms
• the goal is to compute charges at a 3D spatial grid around the molecule
In a given point of the grid, we have
Wj
J
u
Where wj is charge of the j-th atom, r,y is Euclidean distance between atom j and the grid point / and eo is vacuum permittivity.
J in Fihpovic
Introduction, CUDA Basics
About The Class
ooooooo
Motivation OOOOO0OO
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Electrostatic Potential Map
Initial implementation
• suppose we know nothing about HW, just know C++
• algorithm needs to process 3D grid such that it sums potential of all atoms for each grid point
• we will iterate over atoms in outer loop, as it allows to precompute positions of grid points and minimizes number of accesses into input/output array
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation OOOOOO0O
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Electrostatic Potential Map
void  coulomb(const   sAtom*  atoms,   const   int  nAtoms, const   float  gs ,   const   int  gSize ,   float  *grid) { for   (int  a = 0;   a < nAtoms;   a++) { sAtom myAtom = atoms[a]; for   (int  x = 0;   x < gSize;   x++) {
float dx2 = powf((float)x * gs — myAtom.x, 2.Of) for  (int  y = 0;   y < gSize;   y++) {
float  dy2 = powf((float)y *  gs — myAtom.y); for   (int  z = 0;   z < gSize;   z++) {
float  dz = (float)z  *  gs — myAtom.z;
float  e = myAtom.w /  sqrtf(dx2 + dy2 + dz*dz
grid[z*gSize*gSize + y*gSize + x] += e;
}
}
}
}
}
J in Fihpovic
Introduction, CUDA Basics
About The Class
ooooooo
Motivation ooooooo*
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Electrostatic Potential Map
Execution on 4-core CPU at 3.6GHz (Sandy Bridge) + GeForce GTX 1070 (Pascal)
• naive implementation 164.7 millions of atoms evaluated per second (MEvals/s)
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation ooooooo*
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Electrostatic Potential Map
Execution on 4-core CPU at 3.6GHz (Sandy Bridge) + GeForce GTX 1070 (Pascal)
• naive implementation 164.7 millions of atoms evaluated per second (MEvals/s)
9 476.9 Mevals/s when optimized cache: 2.9x speedup
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation ooooooo*
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Electrostatic Potential Map
Execution on 4-core CPU at 3.6GHz (Sandy Bridge) + GeForce GTX 1070 (Pascal)
• naive implementation 164.7 millions of atoms evaluated per second (MEvals/s)
9 476.9 Mevals/s when optimized cache: 2.9x speedup
• 2,577 Mevals/s when vectorized: 15.6x speedup
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation ooooooo*
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Electrostatic Potential Map
Execution on 4-core CPU at 3.6GHz (Sandy Bridge) + GeForce GTX 1070 (Pascal)
• naive implementation 164.7 millions of atoms evaluated per second (MEvals/s)
9 476.9 Mevals/s when optimized cache: 2.9x speedup
• 2,577 Mevals/s when vectorized: 15.6x speedup
• 9,914 Mevals/s when parallelized: 60.2x speedup
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation ooooooo*
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Electrostatic Potential Map
Execution on 4-core CPU at 3.6GHz (Sandy Bridge) + GeForce GTX 1070 (Pascal)
• naive implementation 164.7 millions of atoms evaluated per second (MEvals/s)
9 476.9 Mevals/s when optimized cache: 2.9x speedup
• 2,577 Mevals/s when vectorized: 15.6x speedup
• 9,914 Mevals/s when parallelized: 60.2x speedup
• 537,900 Mevals/s GPU version: 3266x speedup
GPU speedup over already tuned CPU code is 54x, but the optimization effort is similar for CPU and GPU. In this class, you will learn how to optimize the code.
Jiří Filipovič        Introduction, CUDA Basics
About The Class ooooooo	Motivation                  GPU Architecture OOOOOOOO «00000000	C for CUDA ooooo	Sample Code ooooooooooo
			
Why are	GPUs so powerful?		
Types of Parallelism
• Task parallelism
• decomposition of a task into the problems that may be processed in parallel
• usually more complex tasks performing different actions
• usually more frequent (and complex) synchronization
• ideal for small number of high-performance processors
• Data parallelism
• parallelism on the level of data structures
9 usually the same operations on many items of a data structure
• finer-grained parallelism allows for simple construction of individual processors
Jiří Filipovič        Introduction, CUDA Basics
About The Class ooooooo	Motivation                  GPU Architecture OOOOOOOO O0OOOOOOO	C for CUDA ooooo	Sample Code ooooooooooo
			
Why are	GPUs so powerful?		
From programmer's perspective
• some problems are rather data-parallel, some task-parallel (graph traversal vs. matrix multiplication)
From hardware perspective
9 processors for data-parallel tasks may be simpler
9 it is possible to achieve higher arithmetic performance with the same size of a processor
• simpler memory access patterns allow for high-throughput memory designs
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture OO0OOOOOO
C for CUDA
ooooo
Sample Code
ooooooooooo
GPU Architecture
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture OOO0OOOOO
C for CUDA
ooooo
Sample Code
ooooooooooo
GPU Architecture
Main differences compared to CPU
• high parallelism: hundreds thousands threads needed to utilize high-end GPUs
• SIMT model: subsets of threads runs in lock-step mode
• distributed on-chip memory: subsets of threads shares their private memory
o restricted caching capabilities: small cache, often read-only Algorithms usually need to be redesigned to be efficient on GPU.
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture oooo«oooo
C for CUDA
ooooo
Sample Code
ooooooooooo
GPU Architecture
Within the system:
• co-processor with dedicated memory (discrete GPU)
• asynchronous processing of instructions
• attached using PCI-E to the rest of the system (discrete GPU)
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
CUDA
Motivation
oooooooo
GPU Architecture OOOOO0OOO
C for CUDA
ooooo
Sample Code
ooooooooooo
CUDA (Compute Unified Device Architecture)
• architecture for parallel computations developed by NVIDIA
provides a new programming model, allows efficient implementation of general GPU computations
• may be used in multiple programming languages
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture OOOOOO0OO
C for CUDA
ooooo
Sample Code
ooooooooooo
G80 Processor
9 the first CUDA processor
• 16 multiprocessors
• each multiprocessor
8 scalar processors
• 2 units for special functions
• up to 768 threads
• HW for thread switching and scheduling
• threads are grouped into warps by 32
• SIMT
• native synchronization within the multiprocessor
J in Fihpovic
□ S1
Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture OOOOOOO0O
C for CUDA
ooooo
Sample Code
ooooooooooo
G80 Memory Model
Memory model
• 8192 registers shared among all threads of a multiprocessor
• 16 kB of shared memory
• local within the multiprocessor
• as fast as registry (under certain constraints)
• constant memory
• cached, read-only
• texture memory
• cached with 2D locality, read-only
• global memory
o non cached, read-write
• data transfers between global memory and system memory through PCI-E
J in Fihpovic
Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture oooooooo«
C for CUDA
ooooo
Sample Code
ooooooooooo
G80 Processor
Device
Multiprocessor N
Multiprocessor 2 Multiprocessor 1
J in Fihpovic
Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA •oooo
Sample Code
ooooooooooo
C for CUDA
C for CUDA is an extension of C for parallel computations
• explicit separation of host (CPU) and device (GPU) code
• thread hierarchy
• memory hierarchy
• synchronization mechanisms » API
Jin Fihpovic
□ S1
Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
O0OOO
Sample Code
ooooooooooo
Thread Hierarchy
Thread hierarchy
• threads are organized into blocks
• blocks form a grid
• problem is decomposed into sub-problems that can be run independently in parallel (blocks)
o individual sub-problems are divided into small pieces that can be run cooperatively in parallel (threads)
9 all threads from a block run on the same multiprocessor
• scales well
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA oo«oo
Sample Code
ooooooooooo
Thread Hierarchy
Grid
Block (0, 0)     Block (1, 0)     Block (2, 0)
Block (1,1)
Thread (0, 0)	Thread (1, 0)	Thread (2, 0)	Thread (3, 0)
Thread (0,1)	Thread (1,1)	Thread (2,1)	Thread (3,1)
Thread (0, 2)	Thread (1, 2)	Thread (2, 2)	Thread (3, 2)
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA ooo«o
Sample Code
ooooooooooo
Memory Hierarchy
More memory types: 9 different visibility 9 different lifetime 9 different speed and behavior • brings good scalability
Jin Fihpovic
□ S
Introduction, CUDA Basics
About The Class ooooooo	Motivation oooooooo	GPU Architecture ooooooooo	C for CUDA oooo»	Sample Code ooooooooooo
				
Memory	Hierarchy			
Thread
Per-thread local memory
Thread Block
Per-block shared memory
GridO
Block (0, 0)    Block (1, 0)    Block (2, 0)
Block (0,1)    Block (1,1)    Block (2,1)
Gridl
Block (0,0) Block (1,0)
Block (0,1)
Block (1,1)
Block (0, 2) Block (1, 2)
Global memory
J in Fihpovic
Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code •oooooooooo
An Example - Sum of Vectors
We want to sum vectors a and b and store the result in vector c
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code •oooooooooo
An Example - Sum of Vectors
We want to sum vectors a and b and store the result in vector c We need to find parallelism in the problem.
□    g     - =
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code •oooooooooo
An Example - Sum of Vectors
We want to sum vectors a and b and store the result in vector c We need to find parallelism in the problem. Serial sum of vectors:
for   (int  i = 0;   i < N; i++) c[i]  = a[i]  + b[i];
□      g       - =
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code •oooooooooo
An Example - Sum of Vectors
We want to sum vectors a and b and store the result in vector c We need to find parallelism in the problem. Serial sum of vectors:
for   (int  i = 0;   i < N; i++) c[i]  = a[i]  + b[i];
Individual iterations are independent - it is possible to parallelize, scales with the size of the vector.
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code •oooooooooo
An Example - Sum of Vectors
We want to sum vectors a and b and store the result in vector c We need to find parallelism in the problem. Serial sum of vectors:
for   (int  i = 0;   i < N; i++) c[i]  = a[i]  + b[i];
Individual iterations are independent - it is possible to parallelize,
scales with the size of the vector.
i-th thread sums i-th component of the vector:
c[i]  = a[i]  + b[i];
How do we find id of the thread?
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
O0OOOOOOOOO
Thread Hierarchy
Grid
Block (0, 0)     Block (1, 0)     Block (2, 0)
Block (1,1)
Thread (0, 0)	Thread (1, 0)	Thread (2, 0)	Thread (3, 0)
Thread (0,1)	Thread (1,1)	Thread (2,1)	Thread (3,1)
Thread (0, 2)	Thread (1, 2)	Thread (2, 2)	Thread (3, 2)
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code oo«oooooooo
Thread and Block Identification
C for CUDA has built-in variables:
• threadldx.jx, y, z} tells position of a thread in a block
• blockDim.jx, y, z} tells size of the block
• blockldx.jx, y, z} tells position of the block in grid (z always equals 1)
9 gridDim.{x, y, z} tells grid size (z always equals 1)
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
OOO0OOOOOOO
An Example - Sum of Vectors
Thus we calculate the position of the thread (grid and block are one-dimensional):
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
OOO0OOOOOOO
An Example - Sum of Vectors
Thus we calculate the position of the thread (grid and block are one-dimensional):
int   i = blockldx.x*blockDim.x + threadldx.x;
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
OOO0OOOOOOO
An Example - Sum of Vectors
Thus we calculate the position of the thread (grid and block are one-dimensional):
int   i = blockldx.x*blockDim.x + threadldx.x;
Whole function for parallel summation of vectors:
__global__  void  addvec(float  *a,   float  *b ,   float *c){ int  i = biockldx.x*biockDim.x + threadldx.x; c[i]  = a[i]  + b[i];
}
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
OOO0OOOOOOO
An Example - Sum of Vectors
Thus we calculate the position of the thread (grid and block are one-dimensional):
int   i = blockldx.x*blockDim.x + threadldx.x;
Whole function for parallel summation of vectors:
__global__  void  addvec(float  *a,   float  *b ,   float *c){ int  i = biockldx.x*biockDim.x + threadldx.x; c[i]  = a[i]  + b[i];
}
The function defines so called kernel; we specify how meny threads and what structure will be run when calling.
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code oooo«oooooo
Function Type Quantifiers
C syntax enhanced by quantifiers defining where the code is executed and from where it can be called:
• __device__ function is run on device (GPU) only and can be called from the device code only
• __global__ function is run on device (GPU) only and can be called from the host (CPU) code only
• __host__ function is run on host only and can be called from the host only
• __host__ and __device__ may be combined - function is compiled for both then
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
ooooooo oooooooo ooooooooo ooooo ooooo«ooooo
The following steps are needed for the full computation:
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
ooooooo	oooooooo	ooooooooo	ooooo	ooooo«ooooo
The following steps are needed for the full computation: o allocate memory for vectors and fill it with data
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
ooooooo	oooooooo	ooooooooo	ooooo	ooooo«ooooo
The following steps are needed for the full computation
• allocate memory for vectors and fill it with data
• allocate memory on GPU
Jin Fihpovic
Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
ooooooo	oooooooo	ooooooooo	ooooo	ooooo«ooooo
The following steps are needed for the full computation:
o allocate memory for vectors and fill it with data
• allocate memory on GPU
• copy vectors a and b to GPU
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
ooooooo	oooooooo	ooooooooo	ooooo	ooooo«ooooo
The following steps are needed for the full computation
o allocate memory for vectors and fill it with data • allocate memory on GPU 9 copy vectors a and b to GPU 9 compute the sum on GPU
□ S1
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
ooooooo	oooooooo	ooooooooo	ooooo	ooooo«ooooo
The following steps are needed for the full computation:
o allocate memory for vectors and fill it with data
• allocate memory on GPU
9 copy vectors a and b to GPU 9 compute the sum on GPU
• store the result from GPU into c
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
ooooooo	oooooooo	ooooooooo	ooooo	ooooo«ooooo
The following steps are needed for the full computation:
o allocate memory for vectors and fill it with data
• allocate memory on GPU
9 copy vectors a and b to GPU
9 compute the sum on GPU
• store the result from GPU into c 9 use the result in c :-)
When managed memory is used (requires GPU with computing capability 3.0 and CUDA 6.0 or better), steps written in italics are not required.
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
OOOOOO0OOOO
An Example - Sum of Vectors
CPU code that fills a and b and computes c
#include <stdio.h> #define  N 64 int main(){
float  *a,   *b,   *c ;
cudaMallocManaged(&a, N*sizeof(* a ) ) ; cudaMallocManaged(&b, N*sizeof(*b)); cudaMallocManaged(<^c , N*sizeof (* c ) ) ; for   (int   i = 0;   i < N;   i++) {
a[i]  = i;
b[i] = i*3;
}
// GPU code   will   be here
for   (int   i = 0;   i < N; i++) printf("%f,   " , c[i]);
cudaFree(a);   cudaFree(b); cudaFree(c);
return 0;
}
J in Fihpovic
Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
0000000*000
GPU Memory Management
Using managed memory, CUDA maintains memory transfers between CPU and GPU automatically.
• memory coherency is guaranteed
• GPU memory cannot be used when any GPU kernel is running Memory operations can be programmed explicitly
cudaMalloc(void**  devPtr,   size_t count); cudaFree(void*  devPtr);
cudaMemcpy(void*  dst ,   const  void*  src ,   size_t count, enum  cudaMemcpyKind  kind ) ;
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
OOOOOOOO0OO
An Example - Sum of Vectors
Running the kernel:
• kernel is called as a function; between the name and the arguments, there are triple angle brackets with specification of grid and block size
• we need to know block size and their count
• we will use ID block and grid with fixed block size
• the size of the grid is determined in a way to compute the whole problem of vector sum
For vector size divisible by 32:
#define BLOCK 32
addvec«<N/BLOCK ,   BLOCK»>(a,   b, c);
How to solve a general vector size?
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code ooooooooo«o
An Example - Sum of Vectors
We will modify the kernel source:
__global__  void  addvec(float  *a,   float  *b,   float  *c ,   int n){ int  i = biockldx.x*biockDim.x + threadldx.x; if   (i < n)  c[i]  = a[i]  + b[i];
}
And call the kernel with sufficient number of threads:
addvec«<N/BLOCK + 1,   BL0CK>»(a,   b,   c,   N) ;
Jiří Filipovič        Introduction, CUDA Basics
About The Class
ooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
OOOOOOOOOCM
An Example - Running It
Now we just need to compile it :-)
nvcc -o vecadd vecadd.cu
Where to work with CUDA?
• on a remote computer: airacuda.fi.muni.cz, barracuda.fi.muni.cz, accounts will be made
o your own machine: download and install CUDA toolkit and SDK from developer.nvidia.com
Jiří Filipovič        Introduction, CUDA Basics