C2115 Practical introduction to supercomputing Lesson 12 -1-
C2115
Practical introduction to
supercomputing
Petr Kulhanek
kulhanek@chemi.muni.cz
National Center for Biomolecular Research, Faculty of Science,
Masaryk University, Kotlářská 2, CZ-61137 Brno
Lesson 12
Revision 1
C2115 Practical introduction to supercomputing Lesson 12 -2-
Contents
➢ Representation of numbers ​​in digital technology
integers, real numbers
➢ From problem to result
algorithm, source codes, translation, program execution, programming
languages
➢ numerical integration
➢ matrix multiplication
C2115 Practical introduction to supercomputing Lesson 12 -3-
Conclusion
To solve problems, it is always advisable
to use existing software libraries or
programs, that are greatly optimized for
the given problem and hardware.*
* May not always be appropriate in case of design
verification (proof of concept), as the use of
optimized approaches may not be trivial at first.
C2115 Practical introduction to supercomputing Lesson 12 -4Representation
of
numbers
C2115 Practical introduction to supercomputing Lesson 12 -5Typical
computer scheme
CPU
north
bridge
north
bridge
south
bridge
south
bridge
USB
mouse, keyboard
real time clock SATA controllers
hard drives
BIOS
graphics
system
memory
memory controller
peripherals with quick access via
PCI Express
network (ethernet)sound
PCI bus
memory controller becomes part
of the latest processors
GPGPU
C2115 Practical introduction to supercomputing Lesson 12 -6-
CPU
Processor also CPU (Central Processing Unit) is an essential part of the computer; it is a
very complex sequential circuit that executes machine code stored in the computer's
operating memory. The machine code consists of individual machine instructions of a
computer programs loaded into the operating memory. www.wikipedia.org
CPU
controller
ALU
ALU (arithmetic and logic unit),
performs arithmetic operations,
evaluates conditions
reads machine instructions and data
and prepares their processing in ALU
sequential processing of machine instructions is
controlled by an internal clock cycle
How does the CPU (ALU) work with numeric values?
C2115 Practical introduction to supercomputing Lesson 12 -7Integer
numbers
The smallest unit of information in digital technology is one bit. Bits are formed into words.
The smallest word is byte which contains 8 bits.
One byte can describe integers ranging from 0 to 255.
128 64 32 16 8 4 2 1
0 1 0 1 0 1 1 1 = 87
Signed integers can also be expressed. In this case, one bit is reserved for the sign, the
remaining bits for the number. There are several implementation options. Intel
architecture uses two’s complement, which leads to range from -128 to 127.
128 64 32 16 8 4 2 1
0 1 1 1 1 1 1 1 = 127
0 1 0 1 0 1 1 1 = 87
0 0 0 0 0 0 0 1 = 1
0 0 0 0 0 0 0 0 = 0
1 1 1 1 1 1 1 1 = -1
1 0 1 0 1 0 0 1 = -87
1 0 0 0 0 0 0 0 = -128
bit reserved for
sign
C2115 Practical introduction to supercomputing Lesson 12 -8Integer
numbers, II
Integers with greater dynamic range can be expressed using larger words typically
composed of four bytes (32 bit word) or eight bytes (64 bit word).
When working with integers it is necessary to take into account that you cannot express an
arbitrarily large number and the option of underflow or overflow of values must be
consistently avoided.
32-bit unsigned integer: 0 to 4.294.967.295
32-bit signed integer: −2.147.483.648 to 2.147.483.647
64-bit unsigned integer: 0 to 18.446.744.073.709.551.615
64-bit signed integer: −9.223.372.036.854.775.808 to
9.223.372.036.854.775.807
C2115 Practical introduction to supercomputing Lesson 12 -9Real
numbers
Real numbers are expressed in the following format (floating point format):
mantisa
exponent
In digital technology, real numbers are most often expressed in a format defined by a
standard IEEE 754.
type width mantisa exponent
single precision 32 23 8
double precision 64 52 11
...
2
1
2
1
2
1
2
1
44332211 mmmmQ +++=
m1, m2, m3 are the bits of the ma
C2115 Practical introduction to supercomputing Lesson 12 -10Real
numbers, II
By a special combination of value of the mantissa and the exponent, following special
values can be expressed:
0 positive zero
-0 negative zero
NaN not a number, e.g., the result of division by zero
+Inf positive infinity (number is too large to express)
-Inf negative infinity (number is too large to express)
When working with real numbers, it is necessary to pay attention to propagation of
rounding errors, in logical comparisons, it is not appropriate to use operators equals and
does not equal, except for the zero comparison situation.
Type Range Accuracy
single precision ±1.18×10−38 to ±3.4 × 1038 approximately 7 decimal
places
double precision ±2.23×10−308 to ±1.80×10308 approximately 15 decimal
places
C2115 Practical introduction to supercomputing Lesson 12 -11Exercise
1
1. Variable of type signed char (signed byte) contains the number 127. What value will
the variable have if we increase it by one?
2. Variable of type unsigned char (unsigned byte) contains the number 88. How does the
numeric value change if the bit representation of the number is shifted one position to
the right or left? What is the mathematical meaning of the operation.
3. What will be the result of the sum of real numbers represented in double precision and
having values:
0,1346978.10-12
1,2312657.106
4. What is big-endian and little-endian? Indicate architecture that use given type of
endianity. What effect does the endianity have on transfer binary data?
Joint exercise: conversion of
numerical values ​​from decimal to
binary and hexadecimal
C2115 Practical introduction to supercomputing Lesson 12 -12-
Conclusions
➢ CPUs (or other computing units, e.g., GPGPUs)
operate with some numerical precision.
➢ Errors (numerical errors) can occur in numerical
calculations, which can lead to incorrect results
(predictions).
➢ When designing computer programs, it is therefore
necessary to use such algorithms that are either not
sensitive to rounding errors or significantly reduce
their effect.
C2115 Practical introduction to supercomputing Lesson 12 -13From
problem
to results
C2115 Practical introduction to supercomputing Lesson 12 -14From
problem to results ...
Problem Results
C2115 Practical introduction to supercomputing Lesson 12 -15From
problem to results ...
Problem Results
algorithm
source
code
program
calculation
C2115 Practical introduction to supercomputing Lesson 12 -16From
problem to results ...
Problem Results
algorithm
source
code
program
calculation
flowcharts
language selection,
manual optimization
compilation
machine optimization
allocation and
efficient use of
resources
C2115 Practical introduction to supercomputing Lesson 12 -17From
problem to result ...
Problem Results
algorithm
source
code
program
calculation
flowcharts
language selection,
manual optimization
compilation
machine optimization
allocation and
efficient use of
resources
hardware
C2115 Practical introduction to supercomputing Lesson 12 -18From
problem to result ...
algorithm
source
code
program
calculation
flowcharts
language selection,
manual optimization
compilation
machine optimization
allocation and
efficient use of
resources
hardware
When solving problems using computer technology (supercomputers), it is necessary to
comprehensively evaluate several aspects, including used hardware and its architecture.
C2115 Practical introduction to supercomputing Lesson 12 -19Covered
topics ...
algorithm
source
code
program
calculation
hardware
basic problems with creating applications
for demanding calculations, parallelization
When solving problems using computer technology (supercomputers), it is necessary to
comprehensively evaluate several aspects, including used hardware and its architecture.
C2115 Practical introduction to supercomputing Lesson 12 -20Topics
covered ...
algorithm
source
code
program
calculation
hardware
When solving problems using computer technology (supercomputers) it is necessary
comprehensively evaluate a number of aspects, including the hardware used and its
architecture.
efficient running of
computational chemistry
applications
(MetaCentrum, small clusters)
C2115 Practical introduction to supercomputing Lesson 12 -21-
Numeric
integration
C2115 Practical introduction to supercomputing Lesson 12 -22Exercise
LIII.3
1. Write a program that calculates a certain integral below. Use the trapezoidal method
for integration.
dx
x
I  +
=
1
0
2
1
4
a certain integral is the area
under the curve in the range of
integration limits
C2115 Practical introduction to supercomputing Lesson 12 -23Trapezoidal
vs rectangular method
trapezoidal method rectangular method
yi
yi
yi+1
h h
h
yy
I ii
i
2
)( 1++
= hyI ii =
numerically more accurate method
numerically less accurate method
easier implementation and parallelization
C2115 Practical introduction to supercomputing Lesson 12 -24Sequential
implementation
program integral
implicit none
integer(8) :: i
integer(8) :: n
double precision :: rl,rr,h,v,y,x
!---------------------------------------------------
rl= 0.0d0
rr= 1.0d0
n = 2000000000
h = (rr-rl)/n
v = 0.0d0
do i=1,n
x = (i-0.5d0)*h + rl
y = 4.0d0 / (1.0d0 + x**2)
v = v + y*h
end do
write(*,*) 'integral = ',v
end program integral
rectangular method
dx
x
I  +
=
1
0
2
1
4
C2115 Practical introduction to supercomputing Lesson 12 -25Exercise
2
1. Compile the program integral.f90 with optimization -O3
2. Measure application run time required to integrate the function. Use the program
/usr/bi /time to measure the time.
3. What is the value of the integral equal to?
4. What effect does the value of the variable "n" (i.e., size h) have on the accuracy of the
calculation? Use programs integral-errors_sp.f90 and integral-errors_dp.f90 to assess.
Briefly discuss obtained results.
Source codes:
/home/kulhanek/Documents/C2115/code/integral/single
C2115 Practical introduction to supercomputing Lesson 12 -26Matrix
multiplication
C2115 Practical introduction to supercomputing Lesson 12 -27-
Content
➢ Matrix multiplication
implementation, complexity, computing power, exercises
➢ Explanation of the obtained results
computer architecture and its bottlenecks
➢ Use of optimized libraries
BLAS, LAPACK, LINPACK, comparison of results, exercises
C2115 Practical introduction to supercomputing Lesson 12 -28Matrix
multiplication
A(n, m) B(m, k) C(n, k)
x =
Use:
• finding eigenvalues ​​and vectors of square matrices (quantum chemistry)
• solution of a system of linear equations (QSAR, QSPR)
• transformations (displacement, rotation, scaling - display and graphics)
C2115 Practical introduction to supercomputing Lesson 12 -29Matrix
multiplication
A(n, m) B(m, k) C(n, k)
x =
=
=
m
l
ljilij BAC
1
Element of the resulting matrix C is the scalar product of the vectors formed by the line i of
A matrix and column j of B matrix
C2115 Practical introduction to supercomputing Lesson 12 -30Matrix
mult., implementation
subroutine mult_matrices(A,B,C)
implicit none
double precision :: A(:,:)
double precision :: B(:,:)
double precision :: C(:,:)
!---------------------------------------
integer :: i,j,k
!-------------------------------------------------------------------
if( size(A,2) .ne. size(B,1) ) then
stop 'Error: Illegal shape of A and B matrices!'
end if
do i=1,size(A,1)
do j=1,size(B,2)
C(i,j) = 0.0d0
do k=1,size(A,2)
C(i,j) = C(i,j) + A(i,k)*B(k,j)
end do
end do
end do
end subroutine mult_matrices
C2115 Practical introduction to supercomputing Lesson 12 -31Number
of operations
do i=1,size(A,1)
do j=1,size(B,2)
C(i,j) = 0.0d0
do k=1,size(A,2)
C(i,j) = C(i,j) + A(i,k)*B(k,j)
end do
end do
end do
N * N * N * (1 + 1) = 2 * N3
Assuming that matrices A and B are square with dimensions NxN:
In computer technology, computing power is assed via FLOPS (FLoating-point Operations
Per Second) value, which expresses how many floating point operations a given device
performs per second.
C2115 Practical introduction to supercomputing Lesson 12 -32-
Results
wolf21: gfortran 4.6.3, optimalizace O3, Intel(R) Core(TM) i5 CPU 750 @ 2.67GHz
N NR NOPs Time MFLOPS
----- ----- ---------------- ---------------- -------
50 50000 12500000000 6.1843858 2021.2
100 500 1000000000 0.5200334 1923.0
150 50 337500000 0.1760106 1917.5
200 50 800000000 0.4280272 1869.0
250 50 1562500000 0.8440533 1851.2
300 50 2700000000 1.4640903 1844.1
350 50 4287500000 2.3441458 1829.0
400 50 6400000000 5.7083569 1121.2
450 50 9112500000 5.9363708 1535.0
500 50 12500000000 10.3366470 1209.3
550 1 332750000 0.6880417 483.6
600 1 432000000 1.1600723 372.4
650 1 549250000 1.8601189 295.3
700 1 686000000 2.5881615 265.1
750 1 843750000 3.2762032 257.5
800 1 1024000000 3.8522377 265.8
850 1 1228250000 4.7883034 256.5
900 1 1458000000 5.6963577 256.0
950 1 1714750000 6.5044060 263.6
1000 1 2000000000 7.9444962 251.7
Key:
N - dimension of the matrix
NR - number of repetitions
NOPs - number of
operations in FP
Time - execution time in s
MFLOPS - computing power
C2115 Practical introduction to supercomputing Lesson 12 -33-
Results
wolf21
N – matrix dimension
Computingpower[MFLOPS]
C2115 Practical introduction to supercomputing Lesson 12 -34N
– matrix dimension
Computingpower[MFLOPS]
Results
significant drop in
performance
wolf21 wolf21
C2115 Practical introduction to supercomputing Lesson 12 -35Exercise
3
1. Compile the program mult_mat_naive_dp.f90 with gfortran compiler, use -O3
optimization.
2. Run the program and display the obtained dependence of the computational power
depending on the size of the matrix in the form of a graph (interactive mode gnuplot).
3. Compare the results for the optimization levels -O3 and -O0. Display the obtained
dependencies in one graph. Insert the graph into the protocol. Be sure to specify the
CPU type (command lscpu).
4. Discuss obtained results.
Source codes:
/home/kulhanek/Documents/C2115/code/matrix
C2115 Practical introduction to supercomputing Lesson 12 -36Matrix
multiplication
vs
Architecture of
computers
C2115 Practical introduction to supercomputing Lesson 12 -37Architecture,
overal view
CPU
north
bridge
north
bridge
south
bridge
south
bridge
USB
mouse, keyboard
real time clock
BIOS
graphics
system
memory
memory controller
peripherals with quick access via
PCI Express
Network (ethernet)sound
PCI bus
memory controller becomes part
of the latest processors
SATA controllers
hard drives
C2115 Practical introduction to supercomputing Lesson 12 -38Architecture,
bottleneck
CPU
memory
memory controller
Bottleneck: data transfer rate between memory and CPU is slower than the speed at which
the CPU is able to process data
C2115 Practical introduction to supercomputing Lesson 12 -39Hierarchical
model of memory
memory
L3 L2
L1
L1
CPU
fast cache, different levels with different access speeds
wolf21 - transfer rates (memtest86 +, http: // www.memtest.org/)
Type Size Speed
L1 32kB 89 GB/s
L2 256 kB 35 GB/s
L3 8192 kB 24 GB/s
paměť 8192 MB 12 GB/s
C2115 Practical introduction to supercomputing Lesson 12 -40Hierarchical
model of memory
When the size of problem exceeds
size of CPU cache, data transfer rate
between physical memory and CPU
becomes the speed limiting step.
N=600
600x600x3x8 = 8437 kB
A, B, C double precision
Type Size Speed
L1 32kB 89 GB/s
L2 256 kB 35 GB/s
L3 8192 kB 24 GB/s
paměť 8192 MB 12 GB/s
memory
L3 L2
L1
L1
CPU
fast cache, different levels with different access speeds
wolf21 - transfer rates (memtest86 +, http: // www.memtest.org/)
C2115 Practical introduction to supercomputing Lesson 12 -41N
– matrix dimension
Computingpower[MFLOPS]
Results
significant drop in
performance
wolf21
C2115 Practical introduction to supercomputing Lesson 12 -42Libraries
for linear algebra
BLAS
The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building
blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar,
vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations,
and the Level 3 BLAS perform matrix-matrix operations. Because the BLAS are efficient,
portable, and widely available, they are commonly used in the development of high quality
linear algebra software, LAPACK for example.
LAPACK
LAPACK is written in Fortran 90 and provides routines for solving systems of simultaneous
linear equations, least-squares solutions of linear systems of equations, ownproblems, and
singular value problems. The associated matrix factorizations (LU,Cholesky, QR, SVD, Schur,
generalized Schur) are also provided, as are related computations such as reordering of the
Schurfactorizations and estimating condition numbers. Dense and banded matrices are
handled, but not general sparse matrices. In all areas, similar functionality is provided for
real and complex matrices, in both single and double precision.
http://netlib.org
C2115 Practical introduction to supercomputing Lesson 12 -43Optimized
libraries
Optimized BLAS and LAPACK libraries
➢ optimized by the hardware vendor
➢ ATLAS http://math-atlas.sourceforge.net/
➢ MKL http://software.intel.com/en-us/intel-mkl
➢ ACML http://developer.amd.com/tools/cpu-development/
amd-core-math-library-acml/
➢ cuBLAS https://developer.nvidia.com/cublas
Optimized FFT libraries (Fast Fourier Transform)
➢ optimized by the hardware vendor
➢ MKL http://software.intel.com/en-us/intel-mkl
➢ ACML http://developer.amd.com/tools/cpu-development/
amd-core-math-library-acml/
➢ FFTW http://www.fftw.org/
➢ cuFFT https://developer.nvidia.com/cufft
C2115 Practical introduction to supercomputing Lesson 12 -44Matrix
multiplication via BLAS - dp
subroutine mult_matrices_blas(A,B,C)
implicit none
double precision :: A(:,:)
double precision :: B(:,:)
double precision :: C(:,:)
!----------------------------------------------------------
if( size(A,2) .ne. size(B,1) ) then
stop 'Error: Illegal shape of A and B matrices!'
end if
call dgemm('N','N',size(A,1),size(B,2),size(A,2),1.0d0, &
A,size(A,1),B,size(B,1),0.0d0,C,size(C,1))
end subroutine mult_matrices_blas
F77 interface of BLAS library does not contain information about argument type.
Programmer must enter all arguments in the correct order and type!!!!
Compilation:
$ gfortran -O3 mult_mat_blas_dp.f90 -O mult_mat_blas_dp -lblas
C2115 Practical introduction to supercomputing Lesson 12 -45Matrix
multiplication via BLAS - dp
subroutine mult_matrices_blas(A,B,C)
implicit none
real(4) :: A(:,:)
real(4) :: B(:,:)
real(4) :: C(:,:)
!----------------------------------------------------------
if( size(A,2) .ne. size(B,1) ) then
stop 'Error: Illegal shape of A and B matrices!'
end if
call sgemm('N','N',size(A,1),size(B,2),size(A,2),1.0, &
A,size(A,1),B,size(B,1),0.0,C,size(C,1))
end subroutine mult_matrices_blas
Compilation:
$ gfortran -O3 mult_mat_blas_sp.f90 -O mult_mat_blas_sp -lblas
C2115 Practical introduction to supercomputing Lesson 12 -46Naive
vs optimized solution
N – matrix dimension
Computingpower[MFLOPS]
Naive implementation
Optimized BLAS
C2115 Practical introduction to supercomputing Lesson 12 -47N
– matrix dimension
Computingpower[MFLOPS]
Naive implementation
Optimized BLAS
Naive vs optimized solution
~ 10x
C2115 Practical introduction to supercomputing Lesson 12 -48Exercise
4
1. Compile the program mult_mat_blas_dp.f90 with gfortran compiler, use -O3
optimization.
2. Run the program and display the obtained dependence of computing power depending
on the size of the matrix in the form of a graph (interactive mode gnuplot).
3. Determine the computational power for the optimization levels -O3 and -O0. Display
the obtained dependencies in one graph. Insert the graph into the protocol. Be sure to
specify the CPU type (command lscpu).
4. Compare computing power for native and blas approaches in the optimized version
(option -O3). Display the obtained dependencies in one graph. Insert the graph into the
protocol. Be sure to specify the CPU type (command lscpu).
5. Discuss obtained results.
Source codes:
/home/kulhanek/Documents/C2115/code/matrix
Compilation:
$ gfortran -O3 mult_mat_blas_dp.f90 -o mult_mat_blas_dp -lblas