C2115 Practical Introduction to Supercomputing 5th Lesson -1C2115
Practical Introduction to
Supercomputing
Petr Kulhánek, Jakub Štěpán
kulhanek@chemi.muni.cz
National Centre for Biomolecular Research, Faculty of Science
Masaryk University, Kotlářská 2, CZ-61137 Brno
CZ.1.07/2.2.00/15.0233
5th Lesson
C2115 Practical Introduction to Supercomputing 5th Lesson -2-
Contents
 Exercise LIII.3 solution
input, matrix multiplication
 Result explanation
computer architecture and its bottlenecks
 Optimized libraries usage
BLAS, LAPACK, LINPACK, result comparison
C2115 Practical Introduction to Supercomputing 5th Lesson -3Exercise
LIII.3 solution
 Input, matrix multiplication
C2115 Practical Introduction to Supercomputing 5th Lesson -4Exercise
LIII.3
1. Write program, that dynamically allocates two dimensional array A of size n x n. Items
will be initialized by random numbers from range <-10 ;20>. Print array to terminal.
2. Create two separate arrays (matrices) A and B of size n x n. Initialize arrays in same way
as in previous exercise. Write code for matrix A and B multiplication, save result to
matrix C.
3. How many floating point operations will be done during matrix multiplication?
Measure time necessary for matrix multiplication (do not include matrix initiation and
creation). Calculate approximate processor power in MFLOPS from operation number.
4. Calculate processor performance for different matrix A and B sizes. Create graph for
values of n in range 10 to 1000.
C2115 Practical Introduction to Supercomputing 5th Lesson -5Matrix
multiplication
A(n,m) B(m,k) C(n,k)
x =
C2115 Practical Introduction to Supercomputing 5th Lesson -6Matrix
multiplication
A(n,m) B(m,k) C(n,k)
x =



m
l
ljilij
BAC
1
Resulting C matrix item is scalar product of vectors given by i-th row of matrix A and j-th
column of matrix B
C2115 Practical Introduction to Supercomputing 5th Lesson -7Matrix
multiplication, program
subroutine mult_matrices(A,B,C)
implicit none
double precision :: A(:,:)
double precision :: B(:,:)
double precision :: C(:,:)
!---------------------------------------
integer :: i,j,k
!-------------------------------------------------------------------
if( size(A,2) .ne. size(B,1) ) then
stop 'Error: Illegal shape of A and B matrices!'
end if
do i=1,size(A,1)
do j=1,size(B,2)
C(i,j) = 0.0d0
do k=1,size(A,2)
C(i,j) = C(i,j) + A(i,k)*B(k,j)
end do
end do
end do
end subroutine mult_matrices
C2115 Practical Introduction to Supercomputing 5th Lesson -8Number
of operations
do i=1,size(A,1)
do j=1,size(B,2)
C(i,j) = 0.0d0
do k=1,size(A,2)
C(i,j) = C(i,j) + A(i,k)*B(k,j)
end do
end do
end do
N * N * N * (1 + 1) = 2*N3
Expect that matrices A and B are square matrices of NxN size:
Computing measures computational performance as number of FLOPS (FLoating-point
Operations Per Second), that is how many floating point operations are done in second.
C2115 Practical Introduction to Supercomputing 5th Lesson -9-
Results
wolf21: gfortran 4.6.3, optimization O3, Intel(R) Core(TM) i5 CPU 750 @ 2.67GHz
N NR NOPs Time MFLOPS
----- ----- ---------------- ---------------- -------
50 50000 12500000000 6.1843858 2021.2
100 500 1000000000 0.5200334 1923.0
150 50 337500000 0.1760106 1917.5
200 50 800000000 0.4280272 1869.0
250 50 1562500000 0.8440533 1851.2
300 50 2700000000 1.4640903 1844.1
350 50 4287500000 2.3441458 1829.0
400 50 6400000000 5.7083569 1121.2
450 50 9112500000 5.9363708 1535.0
500 50 12500000000 10.3366470 1209.3
550 1 332750000 0.6880417 483.6
600 1 432000000 1.1600723 372.4
650 1 549250000 1.8601189 295.3
700 1 686000000 2.5881615 265.1
750 1 843750000 3.2762032 257.5
800 1 1024000000 3.8522377 265.8
850 1 1228250000 4.7883034 256.5
900 1 1458000000 5.6963577 256.0
950 1 1714750000 6.5044060 263.6
1000 1 2000000000 7.9444962 251.7
Legend:
N – matrix size
NR – number of iterations
NOPs – Floating Point operations
Time – runtime in seconds
MFLOPS – performance
C2115 Practical Introduction to Supercomputing 5th Lesson -10-
Results
C2115 Practical Introduction to Supercomputing 5th Lesson -11-
Results
Significant
performance drop
C2115 Practical Introduction to Supercomputing 5th Lesson -12Results
explanation
 Computer architecture
 Bottlenecks
C2115 Practical Introduction to Supercomputing 5th Lesson -13Architecture,
general overview
CPU
North bridge
South
bridge
USB
Mouse, keyboard
Real time clock
BIOS
Graphics
system
Memory
Memory controller
Peripheries with fast access over
PCI Express
Network (ethernet)Sound
PCI bus
Memory controller becomes part
of new processors
SATA controller
Hard drives
C2115 Practical Introduction to Supercomputing 5th Lesson -14Architecture,
bottleneck
CPU
Memory
Memory controller
Bottleneck: data transfer rate between memory and CPU is usually slower then speed that
CPU processes data.
C2115 Practical Introduction to Supercomputing 5th Lesson -15Hierarchy
memory model
Memory
L3 L2
L1
L1
CPU
Fast cache memory (cache), various levels with different access rates.
wolf21 – transfer rates (memtest86+, http://www.memtest.org/)
Type Size Rate
L1 32kB 89 GB/s
L2 256 kB 35 GB/s
L3 8192 kB 24 GB/s
Memory 8192 MB 12 GB/s
C2115 Practical Introduction to Supercomputing 5th Lesson -16Hierarchy
memory model
L3 L2
L1
L1
CPU
If problem size exceeds CPU cache
memory size, then transfer rate
between CPU and physical memory
becomes speed limiting factor.
N=600
600x600x3x8 = 8437 MB
A,B,C double precision
Type Size Rate
L1 32kB 89 GB/s
L2 256 kB 35 GB/s
L3 8192 kB 24 GB/s
Memory 8192 MB 12 GB/s
Memory
Fast cache memory (cache), various levels with different access rates.
wolf21 – transfer rates (memtest86+, http://www.memtest.org/)
C2115 Practical Introduction to Supercomputing 5th Lesson -17Optimized
libraries usage
 BLAS
 LAPACK
 LINPACK
 Result comparison
C2115 Practical Introduction to Supercomputing 5th Lesson -18Linear
algebra libraries
BLAS
The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building
blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar,
vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations,
and the Level 3 BLAS perform matrix-matrix operations. Because the BLAS are efficient,
portable, and widely available, they are commonly used in the development of high quality
linear algebra software, LAPACK for example.
LAPACK
LAPACK is written in Fortran 90 and provides routines for solving systems of simultaneous
linear equations, least-squares solutions of linear systems of equations, eigenvalue
problems, and singular value problems. The associated matrix factorizations (LU, Cholesky,
QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as
reordering of the Schur factorizations and estimating condition numbers. Dense and
banded matrices are handled, but not general sparse matrices. In all areas, similar
functionality is provided for real and complex matrices, in both single and double
precision.
http://netlib.org
C2115 Practical Introduction to Supercomputing 5th Lesson -19Optimized
libraries
Optimized libraries BLAS and LAPACK
 optimized by hardware producer
 ATLAS http://math-atlas.sourceforge.net/
 MKL http://software.intel.com/en-us/intel-mkl
 ACML http://developer.amd.com/tools/cpu-development/
amd-core-math-library-acml/
 cuBLAS https://developer.nvidia.com/cublas
Optimized libraries FFT (Fast Fourier Transform)
 optimized by hardware producer
 MKL http://software.intel.com/en-us/intel-mkl
 ACML http://developer.amd.com/tools/cpu-development/
amd-core-math-library-acml/
 FFTW http://www.fftw.org/
 cuFFT https://developer.nvidia.com/cufft
C2115 Practical Introduction to Supercomputing 5th Lesson -20Matrix
multiplication using BLAS
subroutine mult_matrices_blas(A,B,C)
implicit none
double precision :: A(:,:)
double precision :: B(:,:)
double precision :: C(:,:)
!----------------------------------------------------------
if( size(A,2) .ne. size(B,1) ) then
stop 'Error: Illegal shape of A and B matrices!'
end if
call dgemm('N','N',size(A,1),size(B,2),size(A,2),1.0d0, &
A,size(A,1),B,size(B,1),0.0d0,C,size(C,1))
end subroutine mult_matrices_blas
Compilation:
$ gfortran -03 mutl_mat.f90 -o mult_mat -lblas
C2115 Practical Introduction to Supercomputing 5th Lesson -21Naive
vs. optimized solution
C2115 Practical Introduction to Supercomputing 5th Lesson -22Naive
vs. optimized solution
~10x
C2115 Practical Introduction to Supercomputing 5th Lesson -23-
Summary
It is always appropriate to use existing library to solve
problem, because these are usually highly hardware
optimized.