<span id="page-0-0"></span>

## Code Optimizations

### Jiří Filipovič

### Fall 2023

メロメメ 御 メメ きょくきょ

È

 $299$ 

<span id="page-1-0"></span>

## Let v be the vector of size *n*. We want to compute  $x = \sum_{i=1}^{n} v_i$ .

 $\mathcal{A} \subseteq \mathcal{P} \times \mathcal{A} \subseteq \mathcal{P} \times \mathcal{A} \subseteq \mathcal{P} \times \mathcal{A} \subseteq \mathcal{P}$ 

重

 $299$ 



```
Let v be the vector of size n. We want to compute x = \sum_{i=1}^{n} v_i.
C code (not very reasonable for floats)
```

```
int x = 0;
for (int i = 0; i < n; i++)
  x \leftarrow v[i];
```
There is flow dependency across iterations.

 $2Q$ 

```
Let v be the vector of size n. We want to compute x = \sum_{i=1}^{n} v_i.
C code (not very reasonable for floats)
```

```
int x = 0;
for (int i = 0; i < n; i++)
  x \neq v[i];
```
There is flow dependency across iterations.

- we cannot compute completely parallel
- addition is (at least in theory :-)) associative
- so, we do not need to add numbers in sequential order

### Parallel Algorithm

The sequential algorithm performs seven steps:

$$
(((((((v_1 + v_2) + v_3) + v_4) + v_5) + v_6) + v_7) + v_8
$$

メロト メタト メミト メミト

重

 $2Q$ 

 $2Q$ 

## Parallel Algorithm

The sequential algorithm performs seven steps:

$$
(((((((v_1 + v_2) + v_3) + v_4) + v_5) + v_6) + v_7) + v_8
$$

Addition is associative... so let's reorder brackets:  $((v_1 + v_2) + (v_3 + v_4)) + ((v_5 + v_6) + (v_7 + v_8))$ 

**K ロ ト K 御 ト K ヨ ト** 

∽≏ດ

## Parallel Algorithm

The sequential algorithm performs seven steps:

$$
(((((v_1 + v_2) + v_3) + v_4) + v_5) + v_6) + v_7) + v_8
$$

Addition is associative... so let's reorder brackets:  $((v_1 + v_2) + (v_3 + v_4)) + ((v_5 + v_6) + (v_7 + v_8))$ We can work in parallel now:

- four additions in the first step
- two additions in the second step
- one addition in the third step

In summary, we perform  $n-1$  additions in log<sub>2</sub> n parallel steps!

つくい

## Parallel Algorithm

We have found the parallel algorithm

- the same number of additions as the serial algorithm
- in logarithmic time (if we have enough cores)

We add results of previous additions

- flow-dependency across threads
- we need global barrier

4日)



The simplest scheme of the algorithm:

- for even i,  $i < n$  perform  $v[i]$  + =  $v[i+1]$
- repeat for  $n \neq 2$  untill  $n > 1$

The performance is not ideal

- 2n numbers loaded from global memory
- *n* numbers stored to global memory
- $\bullet$  log<sub>2</sub> n kernel invocations

We have three memory accesses to one arithmetics operation and considerable kernel invocation overhead.

## Exploiting Data Locality

We can add more than pairs during single kernel call.

- $\bullet$  each block bx loads m numbers into shared memory
- it reduces the input (in shared memory in  $log_2 m$  steps)
- it stores only one number containing  $\sum_{i=m\cdot bx}^{m\cdot bx+m-1}$  v

Reduces both memory transfers and number of kernel invocations

- number of loads:  $n + \frac{n}{m} + \frac{n}{m^2} + ... + \frac{n}{m^{\log_m n}} = (n 1) \frac{m}{m 1}$
- approximately  $n + \frac{n}{\alpha}$  $\frac{n}{m}$  numbers read,  $\frac{n}{m}$  written
- $\bullet$  log<sub>m</sub> n kernel invocations

イロメ マ桐 メメモ レマモメ

つくい

 $QQ$ 

メロメ メ母メ メミメ メミメー

### Implementation 1

```
\text{I-global}_- void reduce1 (int *v) {
  extern \_shared int sv[];
  unsigned int tid = threadIdx x:
  unsigned int i = blockIdx x*blockDim x + threedIdx x;
  sv [tid] = v[i];__syncthreads ( ) ;
  for (unsigned int s=1; s < blockDim x; s == 2) {
    if (tid % (2*s) == 0)sv [ tid ] += sv [ tid + s ];
    __syncthreads ( ) ;
  }
  if (tid = 0)v[blockIdxx] = sv[0];
}
```
つくい

Performance

Beware modulo operation. High degree of divergence

- **•** during the first iteration, only half of threads is working
- during the second iteration, only quarter of threads is working
- etc.

Performance on GTX 280: 3.77 GB/s (0.94 MElem/s).

**COLLA** 

つくい

### Implementation 2

### We will modify indexation

```
for (unsigned int s = 1; s < blockDim x; s == 2) {
  int index = 2 * s * tid:
  if ( index < blockDim . x )
    sv [index] += sv [index + s];
  __syncthreads ( ) ;
}
```
Performance: 8.33 GB/s (2.08 MElem/s).

The code is free of modulo and divergence, but generates shared memory bank conflicts.

イロト イ押ト イヨト

### Implementation 3

### So we can try another indexing...

```
for (unsigned int s = \text{blockDim.x}/2; s > 0; s \gg = 1) {
  if (tid < s)sv [ tid ] += sv [ tid + s ];
  __syncthreads ( ) ;
}
```
No divergence and no conflicts. Performance 16.34 GB/s (4.08 MElem/s). Half of threads do not compute...

a miller

 $\left\{ \begin{array}{ccc} \overline{a} & \overline{b} & \overline{c} & \overline{d} \\ \overline{c} & \overline{c} & \overline{d} & \overline{d} \end{array} \right.$ 

つくい

### Implementation 4

We can add numbers during loading them from global memory.

```
unsigned int i = blockIdx x * (blockDim x * 2) + threadIdx x;
sv [tid] = v[i] + v[i+blockDim.x];
```
Performance 27.16 GB/s (6.79 MElem/s).

There is no problem with data access, but the performance is still low – we will focus to instructions.

**K ロ ▶ K 御 ▶ K 舌** 

### Implementation 5

The number of active threads decreases during computation in shared memory.

- in the last six iterations, only the last warp is active
- $\bullet$  the warp is synchronized implicitly on GPUs with c.c.  $<$  7.0, so we do not need  $\_switchreads()$ 
	- we need volatile variable in this case
- condition if (tid  $\lt s$ ) does not spare any computation

So we can unroll the last warp...



We save time in all warps (the last warp is simpler, others exits earlier from the for loop). Performance: 37.68 GB/s (9.42 MElem/s).

 $2Q$ 

### Implementation 5

### For c.c. 3.0 or greater, we can use warp shuffle:

```
if (tid < 32){
  mySum += sdata [tid + 32]; __syncthreads ();
  for (int offset = warpSize/2; offset > 0; offset /= 2)
    mySum += __shfl_down_sync ( mySum , offset ) ;
}
```
メロト メタト メミト メミト

 $2Q$ 

つくい

### Implementation 6

Can we unroll the for loop?

If we know the number of iterations, we can unroll it

• the number of iterations depends on the block size

Can we implement it generically?

- algorithm uses blocks of size  $2<sup>n</sup>$
- the block size is upper-bound
- $\bullet$  if we know the block size during compilation, we can use a template

```
template <unsigned int blockSize>
_{-\frac{\text{global}}{-}} void reduce6 (int *v)
```


Conditions using blockSize are evaluated during compilation:

```
if (blockSize > = 512){
  if (tid < 256)sv [tid] += sv [tid + 256];\Boxsyncthreads () ;
}
if (blockSize > = 256){
  if (tid < 128)sv [tid] += sv [tid + 128];__syncthreads ( ) ;
}
if (blockSize > = 128){
  if (tid < 64)sv [tid] += sv [tid + 64];__syncthreads ( ) ;
}
```
Performance: 50.64 GB/s (12.66 MElem/s).

つくい

Implementation 7

Can we implement faster algorithm? Let's reconsider the complexity:

- $\bullet$  log *n* parallel steps
- $n-1$  additions
- $\bullet$  time complexity for  $p$  threads running in parallel (using  $p$ processors):  $\mathcal{O}(\frac{n}{p} + \log n)$

Cost of parallel computation

- defined as number of processors multiplied by time complexity
- if we assign one thread to one data element, we get  $p = n$
- and the cost is  $\mathcal{O}(n \cdot \log n)$
- which is not efficient

つくい



Decreasing the cost

- we use  $\mathcal{O}(\frac{n}{\log n})$  $\frac{n}{\log n}$ ) threads
- each thread performs  $O(\log n)$  sequential steps
- after that, it performs  $O(\log n)$  parallel steps
- time complexity is the same
- the cost is  $\mathcal{O}(n)$

What it means in practice?

- we reduce overhead of the computation (e.g., integer arithmetics)
- advantage if we have much more threads that is needed to saturate GPU



### We modify loading into shared memory

```
unsigned int gridSize = blockSize *2* gridDim.x;
svtild = 0;
while (i < n) {
  sv [tid] += v[i] + v[i+blockSize];i \neq gridSize;
}
__syncthreads ( ) ;
```
Performance: 77.21 GB/s (19.3 MElem/s).

 $4.171 \pm$ 

 $\left\{ \begin{array}{ccc} \overline{a} & \overline{b} & \overline{c} & \overline{d} \\ \overline{c} & \overline{c} & \overline{d} & \overline{d} \end{array} \right.$ 

 $QQ$ 



We modify loading into shared memory

```
unsigned int gridSize = blockSize *2* gridDim x;
svtild = 0;
while (i < n) {
  sv[tid] += v[i] + v[i+blockSize];i += gridSize ;
}
__syncthreads ( ) ;
```
Performance: 77.21 GB/s (19.3 MElem/s).

You can find those implementations in CUDA SDK.

 $2Q$ 

### <span id="page-24-0"></span>Intra-kernel Optimizations

The compiler optimizes each kernel separately, so it may miss some optimization opportunities.

- kernel fusion gluing code from several kernels into one kernel
- kernel fission splitting a kernel into several smaller kernels

€ □ F

## Kernel fusion

Performance impact of kernel fusion

- **•** reduce kernel execution overhead
- may add more parallelism
- allow more scalar code optimizations: common subexpression elimination, loop fusion, condition fusion
- reduce global memory transfers if kernels are flow-dependent or input-dependent

**Correctness** 

- no flow dependency between thread blocks
- shared memory and registers locality has to be maintained

### Kernel fission

Kernel fission reduces resources consumption

- increases occupancy
- may allow to use algorithms which would be inefficient when run in one kernel (e.g., if part of the algorithm uses different amount of parallelism or different amount of resources)
- more complicated and divergent codes may be separated (e.g., handling array boundaries)

**Correctness** 

much easier, we just need to transfer data between new kernels

<span id="page-27-0"></span>Before we start with code acceleration, we should consider carefully, if it is meaningful.

The accelerated code should be

- critical for application performance (profile... and profile on real data)
- large enough (usually not ideal for relatively simple but latency critical application)
- parallelizable (problematic, e.g., in simulation of a small system evolving for a long time)
- sufficient number of flops to memory transfers (consider slow PCI-E)

 $2Q$ 

### Problem Choice

Do we optimize running time or power consumption?

- accelerators are usually faster, but also have higher power consumption
- how to deal with hybrid systems (e.g., CPU, GPU and Xeon Phi)
- influences decision what to buy as well as what to use (which resources let in power-saving mode)

## Algorithm Design

### Parallelization

- we need to parallelize computational problem
- we should be aware about target architecture even in this stage (consider, e.g., graph algorithms)
- It is difficult to accelerate codes on GPU:
	- if threads within the warp access rather random addresses in the memory
	- $\bullet$  if threads within the warp diverges (by nature of the algorithm)
	- if the parallelism is insufficient in certain parts of computation

### How to Write Bug-Free Code

Test if API and kernel calls are successful

 $\bullet$  otherwise, errors can appear later...

The memory allocation on GPU occurs seldom

- if your modify your code in a way your kernel does not write any result, you got a result from its previous run
- clear output arrays for debugging purposes

Be aware of out-of-bounds shared memory access

• kernel usually runs successfully, but one block interferes with another

<span id="page-31-0"></span>

Start with the most important optimizations and continue with less important (so the effect of less important optimizations is not hidden). In general, this order should work well:

- PCI-E transfers reduction/overlay
- global memory access (bandwidth, latency)
- access to other types of memory
- **o** divergence
- parallelism configuration (block size, amount of serial work per thread)
- **•** instruction optimization
- It is good idea to write your code configurable
	- **•** block size, number of serial iterations per thread, loop unrolling factor, used algorithm . . .
	- use macros or templates to ease configutation of the optimizations

### <span id="page-32-0"></span>Interpretation of Algorithm Performance

Some optimizations may be hidden

- e.g., optimizing instruction cannot help when code is bound by wrong global memory access
- can be reduced by applying more important optimizations earlier
- use the profiler
- The optimization space is not continuous
	- **•** due to restricted amount of GPU resources
	- e.g., improving efficiency of scalar code by using one more register may decrease the performance by restricting GPU occupancy

Performance is data-dependent

- **o** data size: partition camping, underutilized GPU
- data content: sparse data with varying [str](#page-31-0)[uct](#page-33-0)[u](#page-31-0)[re](#page-32-0)

### <span id="page-33-0"></span>What is real speedup over CPU?

Comparison of a theoretical peak is basic metric

- **•** however, the speedup can be lower
	- insufficient parallelism
	- inappropriate data structures, random access
	- PCI-E bottleneck (especially multi-GPU algorithms)
- however, the speedup can be also higher
	- frequent usage of SFUs
	- complicated vectorization on CPU
	- insufficient scaling on SMP (cache interferences, NUMA)
- **•** different scaling of CPU and GPU with growing problem size

## <span id="page-34-0"></span>Searching for Bottlenecks

The amount of arithmetic operations and memory transfers tells us what is expected to be a limit for algorithm

- sometimes bottleneck is not clear (overhead instructions, irregular memory access)
- $\bullet$  code profiling suitable to identify issues with instructions throughput or bad memory access pattern, more difficult to identify source of latency problems
- $\bullet$  code modifications more difficult and not usable in all cases. but may help to clarify some issues

<span id="page-35-0"></span>How close is the code to the hardware limits?

profiler shows the overall utilization of particular GPU subsystems, such as cache, global memory, FP instructions etc.

### Issues identification

**•** profiler detects some issues, such as shared memory bank conflicts or code divergence

We can inspect a code in details

- $\bullet$  time spent on particular instructions/C for CUDA lines of code
- we need to compile the code with flag -lineinfo

 $\Omega$ 

### <span id="page-36-0"></span>Code Modifications

Global memory performance

- we comment-out the computation
- but we need to somehow use loaded data (to disallow compiler to exclude loading)
- we can check with profiler that the same amount of data are transfered

Instructions performance

- we comment-out data movement
- but the resulting data is needed to be stored (to disallow compiler to exclude computation)
	- **•** but we do not want to store data...
	- we can move the code storing data into condition which is evaluated as false during computation (but not during compilation)
- be aware of execution overhead in the c[as](#page-35-0)e [o](#page-37-0)[f](#page-35-0) [fa](#page-36-0)[s](#page-37-0)[t](#page-33-0) [k](#page-34-0)[er](#page-38-0)[n](#page-33-0)[e](#page-34-0)[ls](#page-38-0)

# <span id="page-37-0"></span>Code Modifications

Be aware of occupancy changes

- code modifications can release some resources
- we can restrict occupancy by allocating some dummy shared memory array

### Interpretation of measured times

- **•** original kernel execution time is close to sum of computation and memory kernel time – the latency is an issue
- computation or memory kernel time dominates and is closed to original kernel time – the performance is bounded by computation or memory
- computation and memory kernel times are similar to original kernel time – we need to optimize both

4 n + 4 n + 4 =

∽≏ດ

## <span id="page-38-0"></span>Code Modifications

Approximation of an optimization effect

- when we already know some performance issue
- when we want to know the effect of optimization before we actually implement it
- we can modify the code without preserving original functionality, but preserving amount of work and removing performance issue
	- cannot be done in all cases
	- may show us if we really address the performance issue
	- see matrix transposition example