C2115 Practical Introduction to Supercomputing
Lesson 13, Revision 2
Petr Kulhanek, kulhanek@chemi.muni.cz
National Center for Biomolecular Research, Faculty of Science, Masaryk University, Kotlářská 2, CZ-61137 Brno

Content
➢ Increasing performance of supercomputers
  SMP, multi-core CPU, NUMA
➢ Parallelization of programs
  numerical integration
  ▪ OpenMP
  ▪ MPI
  parallelization pitfalls, Amdahl's law

Architecture of the Computer: Increasing Computational Power

CPU
The processor, or CPU (Central Processing Unit), is an essential part of the computer; it is a very complex circuit that (sequentially) executes the machine code stored in the computer's memory. The machine code is composed of instructions, which are loaded into operational memory. (www.wikipedia.org)

[Diagram: a CPU consists of a control unit and an ALU.]
The ALU (arithmetic and logic unit) executes arithmetic operations and evaluates logical conditions. The control unit reads machine code (instructions) and data and prepares them for execution on the ALU. The (sequential) execution of machine code is driven by an internal clock.
How does the CPU (ALU) work with numerical values?

Increasing Computing Power
Strategies:
• increasing the clock frequency – physical limitations (miniaturization, lowering voltage)
• increasing the number of ALUs and their specialization (out-of-order execution, speculative execution, vector instructions) – efficiency limited by the executed code
• sharing ALUs among control units (hyperthreading) – efficiency limited by the executed code
• multi-core processors – efficiency limited by the executed code
[Diagram: a physical CPU core contains one control unit and its ALUs; with hyperthreading, two control units (CU#1, CU#2) share one set of ALUs and appear as logical CPU cores; a multi-core processor replicates whole physical cores.]
Software optimization or new algorithms are necessary to benefit from these features.

Symmetric Multiprocessing (SMP)
Symmetric multiprocessing denotes a system containing identical CPUs that access shared memory.
[Diagram: two processors (packages, sockets), CPU#1 and CPU#2, connected by a bus to a common RAM; in the modern variant, multi-core processors access RAM modules through a vendor-specific interconnect.]
Using more CPUs increases the computing power of the system; however, new algorithms are required to employ it.

Processor Caches
The processor cache improves the efficiency of CPU access to central memory (latency and bandwidth).
[Diagram: memory hierarchy inside a processor – central memory, shared L3 cache, then L2 and L1 caches in front of each CPU core; PU = processing unit (= control unit).]
Two processing units per CPU core indicate hyperthreading.
L1i – instruction cache, L1d – data cache; L2, L3 – higher-level caches.
Speed: L1i, L1d >> L2 > L3

NUMA (Non-Uniform Memory Access)
[Diagram: four multi-core processors (#1 to #4), each with its own cache and memory driver attached to local RAM (RAM#1 to RAM#4); the processors are connected by NUMA links (vendor-specific interconnect).]
Compare the communication between Processor#1 <-> RAM#1 (local, faster) and Processor#1 <-> RAM#4 (remote over NUMA links, slower). NUMA links can have various topologies to speed up CPU access to memory.
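The NUMA layout of a machine can be inspected from the command line before tackling the exercise below. The following session is only an illustrative sketch: lscpu and lstopo are the tools used in Exercise M1.1, numactl is an additional, commonly available utility (an assumption, not part of the course toolset), and the output shown is for a hypothetical two-node machine:

$ lscpu | grep -i numa            # number of NUMA nodes and the CPUs belonging to each
NUMA node(s):        2
NUMA node0 CPU(s):   0-7
NUMA node1 CPU(s):   8-15
$ numactl --hardware              # per-node memory sizes and relative access costs
node distances:
node   0   1
  0:  10  21
  1:  21  10

The distance matrix quantifies the local versus remote asymmetry discussed above: in this illustrative case, access to the memory of a remote node is roughly twice as expensive as local access.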
Exercise M1.1
1. Examine the type and parameters of the processor on your workstation (command lscpu, file /proc/cpuinfo).
2. Examine the NUMA topology of your workstation (command lstopo, module hwloc).
3. Does your CPU support hyperthreading?
4. What is a process?
5. What is the difference between CPU-intensive and data-intensive tasks?
6. A parallel task is data intensive and each of its processes works with a different data set. What is better for speeding up the calculation:
   a) doubling the number of CPU cores, or
   b) doubling the number of processors (sockets)?

Useful commands:
$ lscpu
$ lstopo            # module add hwloc
$ cat /proc/cpuinfo
$ ams-host          # Infinity

Numerical Integration: Parallelization Using OpenMP

Sequential implementation
The program computes $I = \int_0^1 \frac{4}{1+x^2}\,\mathrm{d}x$ (= $\pi$) by the rectangular (midpoint) method:

program integral
implicit none
integer(8) :: i
integer(8) :: n
double precision :: rl,rr,h,v,y,x
!---------------------------------------------------
rl = 0.0d0
rr = 1.0d0
n  = 2000000000
h  = (rr-rl)/n
v  = 0.0d0
do i=1,n
    x = (i-0.5d0)*h + rl
    y = 4.0d0 / (1.0d0 + x**2)
    v = v + y*h
end do
write(*,*) 'integral = ',v
end program integral

Parallelization – OpenMP
OpenMP is a system of compiler directives and library procedures for parallel programming. It is a standard for programming computers with shared memory. OpenMP makes it easy to create multi-threaded programs in the Fortran, C, and C++ programming languages. (www.wikipedia.org)
Specifications: www.openmp.org
[Diagram: one process contains thread #1 (the main thread), thread #2, and so on; all threads share the process memory and each runs on its own CPU core.]
OpenMP is limited to an SMP computing node, a multi-core CPU, or a combination of both.

OpenMP implementation

ncpu = 1
!$ ncpu = omp_get_max_threads()
write(*,*) 'Number of threads = ',ncpu

!$omp parallel
!$omp do private(i,x,y),reduction(+:v)
do i=1,n
    x = (i-0.5d0)*h + rl
    y = 4.0d0/(1.0d0+x**2)
    v = v + y*h
end do
!$omp end do
!$omp end parallel

write(*,*) 'integral = ',v

OpenMP compilation

$ gfortran -O3 integral.f90 -o integral
$ ldd ./integral
    linux-vdso.so.1 =>
    libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6
    libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6
    /lib64/ld-linux-x86-64.so.2

$ gfortran -O3 -fopenmp integral.f90 -o integral
$ ldd ./integral
    linux-vdso.so.1 => (0x00007fff593ff000)
    libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3
    libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6
    libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0
    /lib64/ld-linux-x86-64.so.2

With -fopenmp, the OpenMP runtime (libgomp.so.1) and the threading libraries (libpthread.so.0, librt.so.1) appear among the dependencies.
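For reference, the two fragments above (the sequential program and the OpenMP directives) assemble into a complete program along the following lines. This is a sketch; the file in the course repository may differ in details:

program integral
    !$ use omp_lib            ! compiled only with -fopenmp; provides omp_get_max_threads()
    implicit none
    integer(8)       :: i
    integer(8)       :: n
    integer          :: ncpu
    double precision :: rl,rr,h,v,y,x
    !---------------------------------------------------
    rl = 0.0d0
    rr = 1.0d0
    n  = 2000000000
    h  = (rr-rl)/n
    v  = 0.0d0

    ncpu = 1
    !$ ncpu = omp_get_max_threads()
    write(*,*) 'Number of threads = ',ncpu

    ! every thread accumulates into its own private copy of v;
    ! reduction(+:v) sums the private copies once the loop is finished
    !$omp parallel
    !$omp do private(i,x,y),reduction(+:v)
    do i=1,n
        x = (i-0.5d0)*h + rl
        y = 4.0d0/(1.0d0+x**2)
        v = v + y*h
    end do
    !$omp end do
    !$omp end parallel

    write(*,*) 'integral = ',v
end program integral

Compiled without -fopenmp, the !$ lines are treated as ordinary comments and the program runs sequentially; with -fopenmp they become active.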
OpenMP launch

$ export OMP_NUM_THREADS=4    # number of threads that the application can use
$ ./integral
Number of threads = 4
integral = 3.1415925965295672

Note: if the OMP_NUM_THREADS variable is not set, the maximum number of available CPU cores is used (however, on the WOLF cluster the default value of OMP_NUM_THREADS is explicitly set to 1).

OpenMP and the PBSPro batch system:
➢ depending on the configuration, the batch system can set the OMP_NUM_THREADS variable automatically (according to the values of ncpus and mpiprocs in the definition of a chunk)
➢ the value of OMP_NUM_THREADS can also be set explicitly:
  export OMP_NUM_THREADS=$PBS_NCPUS
  (the number of assigned CPUs; the job must request only one computing node)

Exercise M2.1
1. Compile the program integral.f90 with optimization -O3 and without OpenMP support.
2. Measure the application run time required for the integration. Use the /usr/bin/time program to measure the time.
3. Compile the program integral.f90 with optimization -O3 and with OpenMP support turned on.
4. Determine the number of CPU cores on your computer (lscpu).
5. Run the program successively with 1, 2, 3, up to N threads, where N is the maximum available number of CPU cores. Measure the run time of each run. Record the obtained data in the following table and evaluate it.
6. Does the number of CPU cores affect the resulting value of the integral? Why is that so?

N    Treal [s]    Speedup    CPU efficiency [%]
1    27.8         1.0        100.0
2    14.7         1.9         94.8
3    11.0         2.5         84.1
4     8.2         3.4         84.7

Treal is the measured wall-clock time; the derived quantities are
$\mathrm{Speedup} = T_\mathrm{real}(1) / T_\mathrm{real}(N)$ and $\mathrm{CPU\ efficiency} = (\mathrm{Speedup} / N) \times 100$.

Source codes: /home/kulhanek/Documents/C2115/code/integral/openmp

Numerical Integration: Parallelization Using MPI

Sequential implementation
The starting point is the same rectangular-method program integral.f90 as in the OpenMP section.

Parallelization – MPI
The Message Passing Interface (hereinafter referred to as MPI) is a library implementing the specification (protocol) of the same name to support the parallel solution of computational problems on computer clusters. Specifically, it is an application programming interface (API) based on message passing between individual nodes, comprising both point-to-point messages and global (collective) operations. The library supports both shared- and distributed-memory architectures.
[Diagram: processes #1 and #2 run on worker node WN #1 (CPU cores #1 and #2) and process #3 runs on WN #2 (CPU core #1); each process has its own main thread, and the processes exchange MPI messages.]
➢ MPI jobs can be run on one or more WNs
➢ MPI can be combined with OpenMP

MPI implementation

do i=1,n
    x = (i-0.5d0)*h + rl
    y = 4.0d0 / (1.0d0 + x**2)
    v = v + y*h
end do

➢ The problem must be divided so that each process calculates a part of the loop.
➢ The processes do not share memory, so the information about the division of the task and the transfer of the partial results must be handled by passing messages.
➢ One of the processes plays the role of an organizer, which manages the other processes and communicates with the environment.
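The full MPI source is provided in the course repository (see below) and is not reproduced on the slides. Purely as an illustration, a minimal MPI division of the loop might look like the following sketch; the actual integral.f90 may organize the work differently:

program integral
    use mpi
    implicit none
    integer(8)       :: i, n
    integer          :: rank, nprocs, ierr
    double precision :: rl, rr, h, v, vtotal, y, x
    !---------------------------------------------------
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)    ! id of this process (0..nprocs-1)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)  ! total number of processes

    rl = 0.0d0
    rr = 1.0d0
    n  = 2000000000
    h  = (rr-rl)/n
    v  = 0.0d0

    ! each process sums every nprocs-th rectangle (cyclic division of the loop)
    do i = rank+1, n, nprocs
        x = (i-0.5d0)*h + rl
        y = 4.0d0 / (1.0d0 + x**2)
        v = v + y*h
    end do

    ! collect the partial sums on process 0 (the "organizer")
    call MPI_REDUCE(v, vtotal, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                    MPI_COMM_WORLD, ierr)

    if( rank .eq. 0 ) then
        write(*,*) 'integral = ',vtotal
    end if

    call MPI_FINALIZE(ierr)
end program integral

The cyclic division shown here is only one possible strategy (a block division would work equally well); MPI_REDUCE with MPI_SUM implements the message-based collection of partial results described in the bullets above.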
Source codes: /home/kulhanek/Documents/C2115/code/integral/mpi
(A recorded commentary on the file integral.f90 is available.)

MPI compilation

$ module add openmpi:3.1.5-gcc          # activation of the MPI development environment (OpenMPI)
$ mpif90 -O3 integral.f90 -o integral   # Fortran compiler wrapper with MPI support (internally uses gfortran)

[kulhanek@wolf mpi]$ ldd integral
    linux-vdso.so.1 (0x00007ffe24944000)
    libmpi_mpifh.so.40 => /software/ncbr/softrepo/devel/openmpi/3.1.5-gcc/x86_64/para/lib/libmpi_mpifh.so.40 (0x0000149914f34000)
    libgfortran.so.4 => /usr/lib/x86_64-linux-gnu/libgfortran.so.4 (0x0000149914b55000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000149914764000)
    libmpi.so.40 => /software/ncbr/softrepo/devel/openmpi/3.1.5-gcc/x86_64/para/lib/libmpi.so.40 (0x0000149914453000)
    libopen-rte.so.40 => /software/ncbr/softrepo/devel/openmpi/3.1.5-gcc/x86_64/para/lib/libopen-rte.so.40 (0x000014991419e000)
    libopen-pal.so.40 => /software/ncbr/softrepo/devel/openmpi/3.1.5-gcc/x86_64/para/lib/libopen-pal.so.40 (0x0000149913ed9000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x0000149913cd5000)
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x0000149913acd000)
    libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00001499138ca000)
    libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00001499136ad000)
    libhwloc.so.5 => /software/ncbr/softrepo/devel/hwloc/1.11.13/x86_64/single/lib/libhwloc.so.5 (0x0000149913470000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00001499130d2000)
    libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x0000149912ec7000)
    libxml2.so.2 => /usr/lib/x86_64-linux-gnu/libxml2.so.2 (0x0000149912b06000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00001499128e7000)
    libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00001499126a7000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x000014991248f000)
    /lib64/ld-linux-x86-64.so.2 (0x000014991538e000)
    libicuuc.so.60 => /usr/lib/x86_64-linux-gnu/libicuuc.so.60 (0x00001499120d7000)
    liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x0000149911eb1000)
    libicudata.so.60 => /usr/lib/x86_64-linux-gnu/libicudata.so.60 (0x0000149910308000)
    libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x000014990ff7f000)

MPI launch

$ mpirun -np 2 ./integral
(-np gives the number of processes that the application uses for the calculation)

$ mpirun -np 2 -machinefile nodes ./integral
(the machinefile nodes contains the list of nodes on which the processes run)

File nodes:
wolf01 slots=2
wolf02 slots=3
(name of the computational node and its CPU count)

Requirements:
• ssh without a password
• the application must be at the same path on all nodes where the processes are run

MPI and the PBSPro batch system:
➢ a correctly configured MPI is able to obtain the assigned resources automatically from the batch system
➢ in manual mode, it is possible to use the file with the computing nodes (variable PBS_NODEFILE) and the number of assigned CPUs (variable PBS_NCPUS), as sketched below
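In manual mode, a PBSPro job script might combine these variables as follows. This is a sketch only: the resource request on the #PBS line is an assumed example, and site configurations differ.

#!/bin/bash
# minimal PBSPro job script sketch; the select line is an assumed example
#PBS -l select=2:ncpus=4:mpiprocs=4

# activate the same MPI environment that was used for compilation
module add openmpi:3.1.5-gcc

# PBS starts the job in the home directory; switch to the submission directory
cd $PBS_O_WORKDIR

# PBS_NODEFILE lists one line per assigned MPI slot,
# so its line count equals the total number of MPI processes
mpirun -np $(wc -l < $PBS_NODEFILE) -machinefile $PBS_NODEFILE ./integral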
Exercise M3.1
1. Compile the program integral.f90 with optimization -O3 and with MPI support.
2. Run the program integral successively with 1, 2, 3, up to N processes, where N is the maximum available number of CPU cores. Measure the execution time of each run. Record the obtained data in a table and evaluate it as in Exercise M2.1.
3. Run the program integral successively with 1, 2, 4, 8, up to N processes (powers of 2), where N is the maximum available number of CPUs on two nodes of the WOLF cluster (select unoccupied nodes). Measure the execution time of each run. Record the obtained data in a table and evaluate it as in Exercise M2.1. In another terminal, monitor the running processes on both computing nodes with the top command.
4. Does the number of processes affect the resulting value of the integral? Why is that so?

Source codes: /home/kulhanek/Documents/C2115/code/integral/mpi

Parallel Jobs
Concurrency errors, numerical errors, running applications, efficiency of parallelization (Amdahl's law)

Parallelization pitfalls: concurrency errors
Parallel jobs are prone to concurrency errors (race conditions). Considerable attention must therefore be paid to this pitfall when designing a parallel application.

v = 0.0d0
!$omp do private(i,x,y),reduction(+:v)   ! the reduction clause is the key to correct functionality
do i=1,n
    x = (i-0.5d0)*h + rl
    y = 4.0d0/(1.0d0+x**2)
    v = v + y*h
end do
!$omp end do

[Diagram: in a parallel run with two threads, each thread starts from v = 0 and accumulates its own private copy of v; a barrier first waits for all threads to finish, and only then are the partial results combined into the final value v.]
Without the barrier, the variable v would be updated in an undefined order, with the possibility of overwriting intermediate results.

Parallelization pitfalls: numerical errors
The order in which the arithmetic operations are performed may change from run to run, so you can obtain a different result with a different number of CPUs or when repeating the same parallel task (typical for jobs with dynamic load balancing). Jobs with dynamic load balancing change the division of the problem among the individual CPUs (e.g., which atoms will be processed by which CPU in molecular dynamics) so that the task runs as fast as possible.

[kulhanek@wolf openmp]$ OMP_NUM_THREADS=1 ./integral
Number of threads = 1
integral = 3.1415926535894521
[kulhanek@wolf openmp]$ OMP_NUM_THREADS=2 ./integral
Number of threads = 2
integral = 3.1415926535888206
[kulhanek@wolf openmp]$ OMP_NUM_THREADS=4 ./integral
Number of threads = 4
integral = 3.1415926535898944

Running parallel tasks
➢ For new jobs, always verify that you are actually running the parallel version of the application (program).
➢ If the task does not run in parallel, verify that you are running the correct version of the program and in the correct way (OpenMP vs MPI).
Useful commands:
▪ ldd – lists the dynamic libraries that the program uses
▪ top – lists the running processes
▪ pstree – lists the running processes as a tree

Running parallel tasks, cont.
[Screenshots of top: the OpenMP run shows 1 process with 4 threads; the MPI run shows 4 processes (plus service threads used by MPI).]

Running parallel tasks, cont.
[Diagram: MPI processes on WN #1 exchange messages locally (fast), while messages between WN #1 and WN #2 must cross the network (bottleneck).]
➢ Avoid running parallel jobs over multiple nodes (MPI) where possible.
➢ If it is necessary, use the fastest available network connection (InfiniBand vs Ethernet).
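As a concrete illustration of these checks, the following session sketches how an OpenMP run can be distinguished from an MPI run with the commands above; the PID 1234 is only a placeholder:

$ pgrep integral        # one PID for an OpenMP run, several PIDs for an MPI run
1234
$ top -H                # -H displays individual threads instead of processes
$ pstree -p 1234        # threads of process 1234 are shown in curly braces, e.g. {integral}

If an OpenMP binary shows only a single thread, check that it was compiled with -fopenmp (ldd should list libgomp.so.1) and that OMP_NUM_THREADS is set as intended.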
Parallelization efficiency
[Diagram: run time of a job on 1 CPU vs 2 CPUs; the parallel part shortens, while the sequential part stays the same.]
Sequential parts are an integral (however unwanted) part of a parallel application. This part is not accelerated by increasing the number of CPUs, which in turn leads to a decrease in the efficiency of utilization of the individual CPUs.

Amdahl's law
Amdahl's law expresses the maximum expected acceleration of a parallel job (https://en.wikipedia.org/wiki/Amdahl's_law).
Theoretical speedup:
$S = \dfrac{1}{(1-p) + \dfrac{p}{N}}$
where p is the parallel part of the program (as a fraction) and N is the number of CPUs.
➢ ALWAYS verify the efficiency of a parallel application run (especially for new problems or when their size changes).
➢ The verification is done as in exercise L13.M2.1; the acceptable minimum efficiency is about 80 % per CPU.

Exercise M4.1
1. Compile the program integral_rc.f90 with optimization -O3 and with OpenMP support.
2. Successively run the program integral_rc with 1, 2, 3, up to N threads, where N is the maximum available number of CPU cores. How does the result change and why?
3. Repeatedly run the program integral_rc on 2 threads. How does the result change and why?

Source codes: /home/kulhanek/Documents/C2115/code/integral/openmp
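As a worked example of the verification recommended above (a sketch of the evaluation, using the measured table from Exercise M2.1), Amdahl's law can be inverted to estimate the parallel fraction p from a measured speedup:

\[
S = \frac{1}{(1-p) + p/N}
\quad\Longrightarrow\quad
p = \frac{1/S - 1}{1/N - 1}
\]
\[
N = 4,\ S = 3.4
\quad\Longrightarrow\quad
p = \frac{1/3.4 - 1}{1/4 - 1} \approx 0.94
\]

About 94 % of the run time is therefore parallelized, and the theoretical speedup is bounded by $1/(1-p) \approx 17$ no matter how many CPUs are added; this is consistent with the CPU efficiency in the M2.1 table dropping below 85 % already at 3 to 4 cores.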