E7441: Scientific computing in biology and biomedicine
Introduction to parallel computing
Vlad Popovici, Ph.D.
Fac. of Science - RECETOX

Outline
1 A historical perspective
2 Why parallel computing?
3 Principles of parallel computing
  Introduction
  Programming models
  Implementations

"I think there is a world market for maybe five computers." (Thomas Watson, chairman of IBM, 1943)
"There is no reason for any individual to have a computer in their home." (Ken Olson, founder of Digital Equipment Corporation, 1977)
"640K of memory ought to be enough for anybody." (Bill Gates, chairman of Microsoft, 1981)

~2500 BC: Babylon - the first abacus
~100 BC: Antikythera device - believed to be the first mechanical computer
first half of the 19th century: Charles Babbage's difference engine (to tabulate polynomials) and analytical engine (design only)
~1840: Babbage produces the difference engine, a mechanical computer (source: Wikipedia)
1941: Z3 computer by Konrad Zuse - the first programmable, fully automatic computing machine; electro-mechanical, ~2000 relays, 22-bit words, operating at 5-10 Hz (source: Wikipedia)
1946: ENIAC (Electronic Numerical Integrator And Computer) - used initially by the US Army to compute artillery firing tables; uses vacuum tubes as switches (source: Wikipedia)
1976: Cray-1 - the first successful supercomputer

...fast forward: Tianhe-2 (top supercomputer as of Nov. 2013): 33.86 PFlop/s, 3,120,000 cores, 1,024,000 GB of memory, Intel Xeon CPUs

Moore's law
Gordon E. Moore (co-founder of Intel): "Cramming More Components onto Integrated Circuits", Electronics Magazine, 1965 (source: Wikipedia)

Software and hardware
Software crises:
'60s-'70s: assembly language too difficult to use for large, complex problems → Fortran, C: provide abstraction and portability for uniprocessors
'80s-'90s: problems in maintaining complex systems → object-oriented programming (C++, Java)
~2000s: sequential performance lags behind Moore's law → programmers are oblivious to hardware; better compilers, higher-level languages, virtual machines

parallel computing: using multiple execution units concurrently to solve a problem
examples:
▶ multi-core processors: several processors (cores) on one chip
▶ shared-memory processors (SMP): several processors interconnected through a shared memory
▶ cluster computers: several computers interconnected through a high-speed network

Issues with the traditional model: power density (Ross: Why CPU Frequency Stalled, IEEE Spectrum Magazine, 2008)
Issues cont'd: gains from implicit parallelism tapped out
Example: instruction-level parallelism. A machine instruction is decomposed into a 4-stage pipeline: fetch, decode, execute and write-back.

Issues cont'd
Other issues:
increase in production costs (decrease in "chip yield")
increase in the amount of data to be processed
Solution: explicit parallelism - multi-core, multi-processor, multi-machine

Principles
identifying parallelism
granularity: many smaller tasks or fewer larger ones?
locality: where data and instructions reside
load balancing - aim: no lost CPU cycles
synchronization overhead

Identifying parallelism
Amdahl's law:

    S_n = T_1 / T_n ≤ 1 / (α + (1 − α)/n) ≤ 1/α

where α is the fraction of the program that is strictly sequential, T_i is the execution time on i processors and S_i is the speed-up obtained by using i processors instead of 1.

Identifying parallelism
implicit parallelism
▶ hardware level: superscalar processors, multi-core, cluster computing
▶ compiler level: parallelizing compilers
explicit parallelism
▶ programming language level
▶ library level

Processing architectures
Flynn's taxonomy (the "old way"): Single/Multiple Instruction × Single/Multiple Data (source: Wikipedia)
Examples: SISD: mainframes; SIMD: GPUs; MISD: fault-tolerant systems; MIMD: most computers nowadays
Locality: a box in a box in a box...

Computing topologies (source: Grama - Introduction to Parallel Computing)

Shared memory: multicore or multi-CPU machines
Distributed memory: clusters with single-CPU nodes

Hybrid systems
a limited number of CPUs have access to a pooled memory
using more CPUs implies communication over the network through message passing

Hybrid systems with multicore CPUs
extension of the hybrid model
communication becomes increasingly complex
many levels in the memory hierarchy: cache(s), local main memory, other nodes' memory, etc.
accelerators (e.g., GPUs) can be added
requires a new programming model and different communication protocols

Load balancing
aim: distribute the load (work) evenly over all available resources...
...and thus minimize the time a resource sits idle
causes of imbalanced load:
▶ insufficient parallelism
▶ unequal task sizes (poor design?)

Types of parallelism
data parallelism: each processor performs the same task on different data (h/w: SIMD, MIMD)
task parallelism: each processor performs a different task on the same data (h/w: MISD, MIMD)
usually, both types of parallelism are present
Example: re-annotation of a microarray chip (an embarrassingly parallel problem)
Problem: map (BLAST) each probe from a microarray against the latest version of the human genome (RefSeq).
Naive implementation on 2 CPUs:

    program:
      ...
      if CPU == 'CPU1' then
        idx = 1, ..., Np/2
      else if CPU == 'CPU2' then
        idx = Np/2 + 1, ..., Np
      endif
      BLAST(Probes[idx])
      ...

or, equivalently, two separate programs, one per CPU:

    program 1:                program 2:
      ...                       ...
      idx = 1, ..., Np/2        idx = Np/2 + 1, ..., Np
      BLAST(Probes[idx])        BLAST(Probes[idx])
      ...                       ...

Better ways of distributing the data exist for this problem! E.g., distribute also the RefSeq...

Problem decomposition
split the computations into concurrent tasks
build the task-dependency graph
there is no one-size-fits-all technique
some methods: recursive decomposition, data decomposition, exploratory decomposition and speculative decomposition

Recursive decomposition: example
Problem: find the minimum of a vector.

    proc serial_min(A, n)
      min = A[1]
      for i = 2 to n do
        if A[i] < min then min = A[i]
      end for
      return min
    end serial_min

    proc rec_min(A, i, j)
      if i == j then
        min = A[i]
      else
        mid = (i + j) / 2
        lmin = rec_min(A, i, mid)
        rmin = rec_min(A, mid + 1, j)
        if lmin < rmin then min = lmin else min = rmin end if
      end if
      return min
    end rec_min

The two recursive calls are independent and can run concurrently.

Data decomposition: example
Matrix multiplication: A · B = C. Write it in block form as

    [ A11 A12 ]   [ B11 B12 ]   [ C11 C12 ]
    [ A21 A22 ] · [ B21 B22 ] = [ C21 C22 ]

and distribute the four tasks:
Task 1: C11 = A11·B11 + A12·B21
Task 2: C12 = A11·B12 + A12·B22
Task 3: C21 = A21·B11 + A22·B21
Task 4: C22 = A21·B12 + A22·B22
Other decompositions
exploratory decomposition: decompose the search space and look for a solution in each subspace; then choose among the solutions found
speculative decomposition: launch alternative computation branches in parallel while waiting for the input that decides which branch to use
hybrid decompositions

Mapping techniques
problem decomposition → tasks
the tasks need to be allocated (mapped) to processors/processes
objective: minimize the execution time
overheads: time spent on everything other than actually solving the problem:
▶ inter-process interaction - synchronization and control
▶ time spent being idle - poor load balancing
reduce inter-process dependencies and communication: e.g., maximize data locality
improve load balancing
reduce blocking operations

Implementations on multithread/multicore machines
POSIX threads (pthreads): OS-level parallelism
▶ threads: lightweight processes
▶ the same program runs on single- or multi-core machines
▶ the OS is responsible for mapping the threads onto cores
▶ needs low-level programming and a dedicated library
OpenMP: built on top of threads, for SIMD-like parallelism
▶ implemented through compiler directives
▶ easier to use than pthreads
▶ performance depends on the compiler's 'intelligence'

OpenMP: what does it look like?
Computing the sum Σ_i a[i]:

    double a[N];
    double sum = 0.0;
    int i, n, tid;

    #pragma omp parallel shared(a) private(i)
    {
        tid = omp_get_thread_num();

        /* Only one of the threads does this */
        #pragma omp single
        {
            n = omp_get_num_threads();
            printf("Number of threads = %d\n", n);
        }

        /* Initialize a */
        #pragma omp for
        for (i = 0; i < N; i++) {
            a[i] = 1.0;
        }

        /* Parallel for loop computing the sum of a[i] */
        #pragma omp for reduction(+:sum)
        for (i = 0; i < N; i++) {
            sum = sum + a[i];
        }
    } /* End of parallel region */

Implementations on distributed-memory systems
MPI: Message Passing Interface
▶ the de facto standard for distributed-memory programming (clusters)
▶ data must be manually decomposed
▶ uses special libraries
▶ based on sending and receiving messages: data and synchronization
PVM: Parallel Virtual Machine
▶ an earlier library for cluster programming
▶ based on the message-passing principle
▶ supplanted by MPI

MPI: what does it look like?

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int numprocs, myid;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);

        /* print out my rank and this run's PE size */
        printf("Hello from %d of %d\n", myid, numprocs);

        MPI_Finalize();
        return 0;
    }
Implementations in R
parallelism came as an afterthought
target: massive-data applications
tries to bring to R some of the libraries existing in other languages
snow: for traditional clusters; supports PVM, MPI, ...; portable (UNIX, Windows)
multicore: targets multi-core/multi-CPU machines; simple; does not run on Windows; does not handle parallel RNGs
parallel: snow + multicore in newer R (>= 2.14); strange interactions with the OS
R+Hadoop: based on a Hadoop cluster
RHIPE: based on Hadoop; targets map-reduce operations
Segue: apply-like calculations on Hadoop clusters, using Amazon's Elastic MapReduce

Questions?