About The Class Motivation GPU Architecture C for CUDA Sample Code
oooooooo oooooooo ooooooooo ooooo ooooooooooo
Introduction, CUDA Basics
in Fihpovic
Fall 2019
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
•ooooooo	oooooooo	ooooooooo	ooooo	ooooooooooo
Language
I speak English as long as there is some foreign student in the room
• expect Slovak :-) But I understand that
• my English is not perfect
o your English may not be perfect
o so feel free to interrupt me and ask me, if you do not understand
If you do not feel comfortable to ask me in English
• ask me in Czech/Slovak
Jiří Filipovič        Introduction, CUDA Basics
About The Class
O0OOOOOO
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
About the class
The class is focused on algorithm design and programming of general purpose computing applications on many-core vector processors
Jiří Filipovič        Introduction, CUDA Basics
About The Class
O0OOOOOO
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
About the class
The class is focused on algorithm design and programming of general purpose computing applications on many-core vector processors
We will focus to CUDA GPUs first:
• C for CUDA is good for teaching (easy API, a lot of examples available, mature compilers and tools)
• restricted to NVIDIA GPUs and x86 CPUs (with PGI)
Jiří Filipovič        Introduction, CUDA Basics
About The Class
O0OOOOOO
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
About the class
The class is focused on algorithm design and programming of general purpose computing applications on many-core vector processors
We will focus to CUDA GPUs first:
• C for CUDA is good for teaching (easy API, a lot of examples available, mature compilers and tools)
• restricted to NVIDIA GPUs and x86 CPUs (with PGI) After learning CUDA, we focus to OpenCL
• programming model very similar to CUDA, easy to learn when you already know CUDA
• can be used with various HW devices
o we will focus on code optimizations for x86, Intel MIC (Xeon Phi) and AMD GPUs
The class is practically oriented - besides efficient parallelization, we will focus on writing efficient code.
J in Fihpovic
Introduction, CUDA Basics
About The Class
OO0OOOOO
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
What is offered
You will learn:
• architecture of NVIDIA and AMD GPUs, Xeon Phi
• architecture-aware design of data-parallel algorithms 9 programming in C for CUDA and OpenCL
• performance tuning and profiling
• basic tools and libraries for CUDA GPUs <* use cases
Jiří Filipovič        Introduction, CUDA Basics
About The Class
OOO0OOOO
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
What is expected from you
During the semester, you will work on a practically oriented project o important part of your total score in the class
• the same task for everybody, we will compare speed of your implementation
• 50 + 20 points of total score
• working code: 25 points
a efficient implementation: 25 points
• speed of your code relative to your class mates: at most 20 points (only to improve your final grading)
Exam (oral or written, depending on the number of students)
50 points
Jiří Filipovič        Introduction, CUDA Basics
About The Class
OOOO0OOO
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Grading
For those finishing by exam:
o	A:	92-100
9	B:	86-91
9	C:	78-85
9	D:	72-77
9	E:	66-71
9	F:	0-65 pts
For those finishing by colloquium: • 50 pts
Jiří Filipovič        Introduction, CUDA Basics
About The Class ooooo«oo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Materials - CUDA
CUDA documentation (installed as a part of CUDA Toolkit, downloadable from developer.nvidia.com)
9 CUDA C Programming Guide (most important properties of CUDA)
• CUDA C Best Practices Guide (more detailed document focusing on optimizations)
• CUDA Reference Manual (complete description of C for CUDA API)
9 other useful documents (nvcc guide, PTX language description, library manuals, .. .)
CUDA article series, Supercomputing for the Masses
• http://www.ddj.com/cpp/207200659
J in Fihpovic
Introduction, CUDA Basics
About The Class OOOOOO0O
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Materials - OpenCL
• OpenCL 1.1 Specification
• AMD Accelerated Parallel Processing Programming Guide 9 Intel OpenCL SDK Programming Guide
• Writing Optimal OpenCL Code with Intel OpenCL SDK
Jin Fihpovic
□ S
Introduction, CUDA Basics
About The Class ooooooo*
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Materials - Parallel Programming
a Ben-Ari M., Principles of Concurrent and Distributed Programming, 2nd Ed. Addison-Wesley, 2006
9 Timothy G. Mattson, Beverly A. Sanders, Berna L. Massingill, Patterns for Parallel Programming, Addison-Wesley, 2004
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
OOOOOOOO «0000000 ooooooooo ooooo ooooooooooo
Motivation - GPU arithmetic performance
Theoretical GFLOP/s at base clock
10500 10000 9500 9000 8500 8000 7500
-NVIDIA GPU Single Precision
7000 -f 6500
-NVIDIAGPU Double Precision
-Intel CPU Single Precision
Intel CPU Double Precision
5000 5500 5000 4500 4000 3500 3000 2500 2000 1500 1000 500 0
2003
2005
2009
2011
2013
2015
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
oooooooo o«oooooo ooooooooo ooooo ooooooooooo
Motivation - GPU memory bandwidth
Theoretical Peak GB/s
800
2003 2005 2007 2009 2011 2013 2015
About The Class Motivation GPU Architecture C for CUDA Sample Code
OOOOOOOO OO0OOOOO ooooooooo ooooo ooooooooooo
Motivation - programming complexity
OK, GPUs are more powerful, but GPU programming is substantially more difficult, right?
9 well, it is more difficult comparing to writing serial C/C++ code...
9 but can we compare it to serial code?
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
OOOOOOOO OO0OOOOO ooooooooo ooooo ooooooooooo
Motivation - programming complexity
OK, GPUs are more powerful, but GPU programming is substantially more difficult, right?
9 well, it is more difficult comparing to writing serial C/C++ code...
9 but can we compare it to serial code?
Moore's Law
Number of transistors on a single chip doubles every 18 months J
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
OOOOOOOO OO0OOOOO ooooooooo ooooo ooooooooooo
Motivation - programming complexity
OK, GPUs are more powerful, but GPU programming is substantially more difficult, right?
9 well, it is more difficult comparing to writing serial C/C++ code...
9 but can we compare it to serial code?
Moore's Law
Number of transistors on a single chip doubles every 18 months
Corresponding growth of performance comes from
• in the past: frequency increase, instruction parallelism, out-of-order instruction processing, caches, etc.
• today: vector instructions, increase in number of cores
J in Fihpovic
Introduction, CUDA Basics
About The Class
oooooooo
Motivation OOO0OOOO
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Motivation - paradigm change
Moore's Law consequences:
• in the past:changes were important for compiler developers; application developers didn't need to worry
o today: in order to utilize state-of-the-art processors, it is necessary to write parallel and vectorized code
• it is necessary to find parallelism in the problem being solved, which is a task for a programmer, not for a compiler (at least for now)
• writing efficient code for modern CPUs is similarly difficult as writing for GPUs
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation oooo«ooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Electrostatic Potential Map
Important problem from computational chemistry
9 we have a molecule defined by position and charges of its atoms
• the goal is to compute charges at a 3D spatial grid around the molecule
In a given point of the grid, we have
Wj
J
u
Where wj is charge of the j-th atom, r,y is Euclidean distance between atom j and the grid point / and eo is vacuum permittivity.
J in Fihpovic
Introduction, CUDA Basics
About The Class
oooooooo
Motivation OOOOO0OO
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Electrostatic Potential Map
Initial implementation
• suppose we know nothing about HW, just know C++
• algorithm needs to process 3D grid such that it sums potential of all atoms for each grid point
• we will iterate over atoms in outer loop, as it allows to precompute positions of grid points and minimizes number of accesses into input/output array
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation OOOOOO0O
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Electrostatic Potential Map
void  coulomb(const   sAtom*  atoms,   const   int  nAtoms, const   float  gs ,   const   int  gSize ,   float  *grid) { for   (int  a = 0;   a < nAtoms;   a++) { sAtom myAtom = atoms[a]; for   (int  x = 0;   x < gSize;   x++) {
float dx2 = powf((float)x * gs — myAtom.x, 2.Of) for  (int  y = 0;   y < gSize;   y++) {
float  dy2 = powf((float)y *  gs — myAtom.y); for   (int  z = 0;   z < gSize;   z++) {
float  dz = (float)z  *  gs — myAtom.z;
float  e = myAtom.w /  sqrtf(dx2 + dy2 + dz*dz
grid[z*gSize*gSize + y*gSize + x] += e;
}
}
}
}
}
J in Fihpovic
Introduction, CUDA Basics
About The Class
oooooooo
Motivation ooooooo*
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Electrostatic Potential Map
Execution on 4-core CPU at 3.6GHz (Sandy Bridge) + GeForce GTX 1070 (Pascal)
• naive implementation 164.7 millions of atoms evaluated per second (MEvals/s)
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
OOOOOOOO 0000000« ooooooooo
Electrostatic Potential Map
Execution on 4-core CPU at 3.6GHz (Sandy Bridge) + GeForce GTX 1070 (Pascal)
• naive implementation 164.7 millions of atoms evaluated per second (MEvals/s)
• 476.9 Mevals/s when optimized cache: 2.9x speedup
Jin Fihpovic
□ a
Introduction, CUDA Basics
About The Class
oooooooo
Motivation ooooooo*
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Electrostatic Potential Map
Execution on 4-core CPU at 3.6GHz (Sandy Bridge) + GeForce GTX 1070 (Pascal)
• naive implementation 164.7 millions of atoms evaluated per second (MEvals/s)
9 476.9 Mevals/s when optimized cache: 2.9x speedup
• 2,577 Mevals/s when vectorized: 15.6x speedup
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation ooooooo*
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Electrostatic Potential Map
Execution on 4-core CPU at 3.6GHz (Sandy Bridge) + GeForce GTX 1070 (Pascal)
• naive implementation 164.7 millions of atoms evaluated per second (MEvals/s)
9 476.9 Mevals/s when optimized cache: 2.9x speedup
• 2,577 Mevals/s when vectorized: 15.6x speedup
• 9,914 Mevals/s when parallelized: 60.2x speedup
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation ooooooo*
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
Electrostatic Potential Map
Execution on 4-core CPU at 3.6GHz (Sandy Bridge) + GeForce GTX 1070 (Pascal)
• naive implementation 164.7 millions of atoms evaluated per second (MEvals/s)
9 476.9 Mevals/s when optimized cache: 2.9x speedup
• 2,577 Mevals/s when vectorized: 15.6x speedup
• 9,914 Mevals/s when parallelized: 60.2x speedup
• 537,900 Mevals/s GPU version: 3266x speedup
GPU speedup over already tuned CPU code is 54x, but the optimization effort is similar for CPU and GPU. In this class, you will learn how to optimize the code.
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooooooo	Motivation                  GPU Architecture OOOOOOOO «00000000	C for CUDA ooooo	Sample Code ooooooooooo
			
Why are	GPUs so powerful?		
Types of Parallelism
• Task parallelism
• decomposition of a task into the problems that may be processed in parallel
• usually more complex tasks performing different actions
• usually more frequent (and complex) synchronization
• ideal for small number of high-performance processors
• Data parallelism
• parallelism on the level of data structures
9 usually the same operations on many items of a data structure
• finer-grained parallelism allows for simple construction of individual processors
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooooooo	Motivation                  GPU Architecture oooooooo o«ooooooo	C for CUDA ooooo	Sample Code ooooooooooo
			
Why are	GPUs so powerful?		
From programmer's perspective
• some problems are rather data-parallel, some task-parallel (graph traversal vs. matrix multiplication)
From hardware perspective
9 processors for data-parallel tasks may be simpler
9 it is possible to achieve higher arithmetic performance with the same size of a processor
• simpler memory access patterns allow for high-throughput memory designs
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture oo»oooooo
C for CUDA
ooooo
Sample Code
ooooooooooo
GPU Architecture
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture OOO0OOOOO
C for CUDA
ooooo
Sample Code
ooooooooooo
GPU Architecture
Main differences compared to CPU
• high parallelism: hundreds thousands threads needed to utilize high-end GPUs
• SIMT model: subsets of threads runs in lock-step mode
• distributed on-chip memory: subsets of threads shares their private memory
o restricted caching capabilities: small cache, often read-only Algorithms usually need to be redesigned to be efficient on GPU.
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture oooo«oooo
C for CUDA
ooooo
Sample Code
ooooooooooo
GPU Architecture
Within the system:
• co-processor with dedicated memory (discrete GPU)
• asynchronous processing of instructions
• attached using PCI-E to the rest of the system (discrete GPU)
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
CUDA
Motivation
oooooooo
GPU Architecture OOOOO0OOO
C for CUDA
OOOOO
Sample Code
OOOOOOOOOOO
CUDA (Compute Unified Device Architecture)
• architecture for parallel computations developed by NVIDIA
provides a new programming model, allows efficient implementation of general GPU computations
• may be used in multiple programming languages
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture oooooo«oo
C for CUDA
ooooo
Sample Code
ooooooooooo
G80 Processor
9 the first CUDA processor
• 16 multiprocessors
• each multiprocessor
8 scalar processors
• 2 units for special functions
• up to 768 threads
• HW for thread switching and scheduling
• threads are grouped into warps by 32
• SIMT
• native synchronization within the multiprocessor
J in Fihpovic
□ S1
Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture OOOOOOO0O
C for CUDA
ooooo
Sample Code
ooooooooooo
G80 Memory Model
Memory model
• 8192 registers shared among all threads of a multiprocessor
• 16 kB of shared memory
• local within the multiprocessor
• as fast as registry (under certain constraints)
• constant memory
• cached, read-only
• texture memory
• cached with 2D locality, read-only
• global memory
o non cached, read-write
• data transfers between global memory and system memory through PCI-E
J in Fihpovic
Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture oooooooo«
C for CUDA
ooooo
Sample Code
ooooooooooo
G80 Processor
Device
Multiprocessor N
Multiprocessor 2 Multiprocessor 1
□ S1
J in Fihpovic
Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA •oooo
Sample Code
ooooooooooo
C for CUDA
C for CUDA is an extension of C for parallel computations 9 explicit separation of host (CPU) and device (GPU) code
• thread hierarchy
• memory hierarchy
• synchronization mechanisms
• API
J in Fihpovic
Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
O0OOO
Sample Code
ooooooooooo
Thread Hierarchy
Thread hierarchy
• threads are organized into blocks
• blocks form a grid
• problem is decomposed into sub-problems that can be run independently in parallel (blocks)
o individual sub-problems are divided into small pieces that can be run cooperatively in parallel (threads)
9 all threads from a block run on the same multiprocessor
• scales well
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA oo«oo
Sample Code
ooooooooooo
Thread Hierarchy
Grid
Block (0, 0)     Block (1, 0)     Block (2, 0)
Block (1,1)
Thread (0, 0)	Thread (1, 0)	Thread (2, 0)	Thread (3, 0)
Thread (0,1)	Thread (1,1)	Thread (2,1)	Thread (3,1)
Thread (0, 2)	Thread (1, 2)	Thread (2, 2)	Thread (3, 2)
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA ooo«o
Sample Code
ooooooooooo
Memory Hierarchy
More memory types:
• different visibility
• different lifetime
9 different speed and behavior
• brings good scalability
J in Fihpovic
□ a
Introduction, CUDA Basics
About The Class oooooooo	Motivation OOOOOOOO	GPU Architecture OOOOOOOOO	C for CUDA OOOO»	Sample Code OOOOOOOOOOO
				
Memory	Hierarchy			
Thread
Per-thread local memory
Thread Block
Per-block shared memory
GridO
Block (0, 0)    Block (1, 0)    Block (2, 0)
Block (0,1)    Block (1,1)    Block (2,1)
Gridl
Block (0,0) Block (1,0)
Block (0,1)
Block (1,1)
Block (0, 2) Block (1, 2)
Global memory
J in Fihpovic
Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code •oooooooooo
An Example - Sum of Vectors
We want to sum vectors a and b and store the result in vector c
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code •oooooooooo
An Example - Sum of Vectors
We want to sum vectors a and b and store the result in vector c We need to find parallelism in the problem.
□    g     - =
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code •oooooooooo
An Example - Sum of Vectors
We want to sum vectors a and b and store the result in vector c We need to find parallelism in the problem. Serial sum of vectors:
for   (int  i = 0;   i < N; i++) c[i]  = a[i]  + b[i];
□      g       - =
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code •oooooooooo
An Example - Sum of Vectors
We want to sum vectors a and b and store the result in vector c We need to find parallelism in the problem. Serial sum of vectors:
for   (int  i = 0;   i < N; i++) c[i]  = a[i]  + b[i];
Individual iterations are independent - it is possible to parallelize, scales with the size of the vector.
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code •oooooooooo
An Example - Sum of Vectors
We want to sum vectors a and b and store the result in vector c We need to find parallelism in the problem. Serial sum of vectors:
for   (int  i = 0;   i < N; i++) c[i]  = a[i]  + b[i];
Individual iterations are independent - it is possible to parallelize,
scales with the size of the vector.
i-th thread sums i-th component of the vector:
c[i]  = a[i]  + b[i];
How do we find id of the thread?
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
O0OOOOOOOOO
Thread Hierarchy
Grid
Block (0, 0)     Block (1, 0)     Block (2, 0)
Block (1,1)
Thread (0, 0)	Thread (1, 0)	Thread (2, 0)	Thread (3, 0)
Thread (0,1)	Thread (1,1)	Thread (2,1)	Thread (3,1)
Thread (0, 2)	Thread (1, 2)	Thread (2, 2)	Thread (3, 2)
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code oo«oooooooo
Thread and Block Identification
C for CUDA has built-in variables:
• threadldx.jx, y, z} tells position of a thread in a block
• blockDim.jx, y, z} tells size of the block
• blockldx.jx, y, z} tells position of the block in grid (z always equals 1)
9 gridDim.{x, y, z} tells grid size (z always equals 1)
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
OOO0OOOOOOO
An Example - Sum of Vectors
Thus we calculate the position of the thread (grid and block are one-dimensional):
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
OOO0OOOOOOO
An Example - Sum of Vectors
Thus we calculate the position of the thread (grid and block are one-dimensional):
int   i = blockldx.x*blockDim.x + threadldx.x;
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
OOOOOOOOO
C for CUDA
OOOOO
Sample Code
OOO0OOOOOOO
An Example - Sum of Vectors
Thus we calculate the position of the thread (grid and block are one-dimensional):
int   i = blockldx.x*blockDim.x + threadldx.x;
Whole function for parallel summation of vectors:
__global__  void  addvec(float  *a,   float  *b ,   float *c){ int  i = biockldx.x*biockDim.x + threadldx.x; c[i]  = a[i]  + b[i];
}
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
OOO0OOOOOOO
An Example - Sum of Vectors
Thus we calculate the position of the thread (grid and block are one-dimensional):
int   i = blockldx.x*blockDim.x + threadldx.x;
Whole function for parallel summation of vectors:
__global__  void  addvec(float  *a,   float  *b ,   float *c){ int  i = biockldx.x*biockDim.x + threadldx.x; c[i]  = a[i]  + b[i];
}
The function defines so called kernel; we specify how meny threads and what structure will be run when calling.
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code oooo«oooooo
Function Type Quantifiers
C syntax enhanced by quantifiers defining where the code is executed and from where it can be called:
• __device__ function is run on device (GPU) only and can be called from the device code only
• __global__ function is run on device (GPU) only and can be called from the host (CPU) code only
• __host__ function is run on host only and can be called from the host only
• __host__ and __device__ may be combined - function is compiled for both then
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
oooooooo oooooooo ooooooooo ooooo ooooo«ooooo
The following steps are needed for the full computation:
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
oooooooo	oooooooo	ooooooooo	ooooo	ooooo«ooooo
The following steps are needed for the full computation o allocate memory for vectors and fill it with data
□ S1
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
oooooooo	oooooooo	ooooooooo	ooooo	ooooo«ooooo
The following steps are needed for the full computation
• allocate memory for vectors and fill it with data
• allocate memory on GPU
Jin Fihpovic
Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
oooooooo	oooooooo	ooooooooo	ooooo	ooooo«ooooo
The following steps are needed for the full computation:
o allocate memory for vectors and fill it with data • allocate memory on GPU 9 copy vectors a a b to GPU
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
oooooooo	oooooooo	ooooooooo	ooooo	ooooo«ooooo
The following steps are needed for the full computation
o allocate memory for vectors and fill it with data • allocate memory on GPU 9 copy vectors a a b to GPU 9 compute the sum on GPU
□ r3>
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
oooooooo	oooooooo	ooooooooo	ooooo	ooooo«ooooo
The following steps are needed for the full computation
o allocate memory for vectors and fill it with data
• allocate memory on GPU 9 copy vectors a a b to GPU
9 compute the sum on GPU
• store the result from GPU into c
□ S1
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
oooooooo	oooooooo	ooooooooo	ooooo	ooooo«ooooo
The following steps are needed for the full computation:
o allocate memory for vectors and fill it with data
• allocate memory on GPU 9 copy vectors a a b to GPU
9 compute the sum on GPU
• store the result from GPU into c 9 use the result in c :-)
When managed memory is used (requires GPU with computing capability 3.0 and CUDA 6.0 or better), steps written in italics are not required.
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
OOOOOO0OOOO
An Example - Sum of Vectors
CPU code that fills a and b and computes c
#include <stdio.h> #define  N 64 int main(){
float  *a,   *b,   *c ;
cudaMallocManaged(&a, N*sizeof(* a ) ) ; cudaMallocManaged(&b, N*sizeof(*b)); cudaMallocManaged(&c , N*sizeof (* c ) ) ; for   (int   i = 0;   i < N;   i++) {
a[i]  = i;
b[i] = i*3;
}
// GPU code   will   be here
for   (int   i = 0;   i < N; i++) printf("%f,   " , c[i]);
cudaFree(a);   cudaFree(b); cudaFree(c);
return 0;
}
J in Fihpovic
Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code
0000000*000
GPU Memory Management
Using managed memory, CUDA maintains memory transfers between CPU and GPU automatically.
• memory coherency is guaranteed
• GPU memory cannot be used when any GPU kernel is running Memory operations can be programmed explicitly
cudaMalloc(void**  devPtr,   size_t count); cudaFree(void*  devPtr);
cudaMemcpy(void*  dst ,   const  void*  src ,   size_t count, enum  cudaMemcpyKind  kind ) ;
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
OOOOOOOOO
C for CUDA
ooooo
Sample Code
OOOOOOOO0OO
An Example - Sum of Vectors
Running the kernel:
• kernel is called as a function; between the name and the arguments, there are triple angle brackets with specification of grid and block size
• we need to know block size and their count
• we will use ID block and grid with fixed block size
• the size of the grid is determined in a way to compute the whole problem of vector sum
For vector size divisible by 32:
#define BLOCK 32
addvec«<N/BLOCK ,   BLOCK»>(a,   b, c);
How to solve a general vector size?
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code ooooooooo«o
An Example - Sum of Vectors
We will modify the kernel source:
__global__  void  addvec(float  *a,   float  *b,   float  *c ,   int n){ int  i = biockldx.x*biockDim.x + threadldx.x; if   (i < n)  c[i]  = a[i]  + b[i];
}
And call the kernel with sufficient number of threads:
addvec«<N/BLOCK + 1,   BL0CK>»(a,   b,   c,   N) ;
Jiří Filipovič        Introduction, CUDA Basics
About The Class
oooooooo
Motivation
oooooooo
GPU Architecture
ooooooooo
C for CUDA
ooooo
Sample Code oooooooooo«
An Example - Running It
Now we just need to compile it :-)
nvcc -o vecadd vecadd.cu
Where to work with CUDA?
• on a remote computer: airacuda.fi.muni.cz (more machines will appear), accounts will be made
o your own machine: download and install CUDA toolkit and SDK from developer.nvidia.com
Jiří Filipovič        Introduction, CUDA Basics