FACULTY OF INFORMATICS, Masaryk University
PA039: Supercomputer Architecture and Intensive Computing
Luděk Matýska
Spring 2023
Luděk Matýska • Introduction • Spring 2023 1/69

Rules of engagement - proposal
■ Regular lectures (like this one) versus pre-recorded lectures combined with an interactive seminar at the regular scheduled time
■ Interactive seminar form:
  ■ Your questions
  ■ More detailed discussion around selected topics
  ■ Questions through Sli.do
  ■ Examination questions/subjects
■ Examination will be in an Open Book format
  ■ IS MU
  ■ The first (mandatory) examination on 18th May at 4 PM (the time of the last possible lesson in the semester)
  ■ Other terms as needed (for those who can't make it for a serious reason)
However, due to the AI ChatGPT (and derivatives), this form can't …

High Performance Computing
■ Formula One in the IT area
  ■ Extremely expensive machines, but with exceptional features (performance, memory, ...)
■ Specific users' groups
  ■ Extensive simulations
  ■ Modelling (automobiles, airplanes, physical phenomena, ...)
  ■ Recently also Artificial Intelligence (DNN)
■ Appetite comes with eating
  ■ Requirements rise faster than performance
  ■ Also the complexity of processors rises
Quality of programming (code generation) defines the usability and real performance

High Performance Computing II
■ Processors
  ■ CISC
  ■ RISC
  ■ Vector processors
  ■ Streaming processors (e.g. GPU, TPU)
■ Special processors
  ■ Programmable - FPGA
  ■ Static - ASICs
■ Memories - performance (speed) lagging behind processors

HPC - requirements
■ The ratio of actual to theoretical performance decreases
■ Reaction: need to better understand
  ■ the architecture of the processor and computer used
  ■ the reasons why a specific code is much faster than seemingly equivalent code
  ■ tools and methods for measuring real performance (of a program or a computer)

High Throughput Computing
■ Highest actual performance vs. highest utilization
  ■ long-term efficient use of computer systems
  ■ large number of smaller tasks
  ■ the processing time of a single task is not critical
  ■ the time of processing all tasks is critical
■ Efficiency
  ■ maximizing "investment"
  ■ total throughput of the system

Clouds and HPC
■ Clouds - virtualized infrastructure
  ■ higher flexibility (use as much as you need (and can pay for))
  ■ robustness (high availability)
  ■ hidden and exposed heterogeneity
  ■ massive capacity
  ■ resembles High Throughput Computing goals
■ Basic scenario - overcommitment
  ■ not directly usable in the HPC environment
■ The flexibility and fast availability of resources may be the primary force for using clouds in the HPC environment
  ■ But it may break some of the efficiency expectations
  ■ New features are needed
■ Area to follow (e.g. EOSC or Gaia-X)

Fundamental aspects - what influences performance
■ Latency (delay)
  ■ processing and transmission of signals within processors and memories
  ■ transmission between processor and memory
  ■ intra-memory latency
■ Speed of (signal) recovery (cycle times)
  ■ speed of circuit switching
  ■ circuit frequency (internal "clock")
  ■ memory refresh (dynamic memories)
■ Throughput (speed of the data unit transfer)
  ■ speed of data transport on a chip
  ■ number of instructions per cycle
  ■ transport speed between components
■ Granularity
  ■ density on a chip
  ■ memory density

… computer families (IBM 360, 370, VAX, ...)
Disadvantages: too complex instructions, increasingly complex instruction analysis, cross-instruction relationships, backward compatibility cost (within a family)
Luděk Matýska • Processors • Spring 2023 11/69

Performance increase
■ Clock cycles define a processor's performance
  ■ Limited by contemporary technology
  ■ Impossible to increase continuously
    ■ dependencies between components
    ■ signal transport speed
■ Solution: parallelization

Pipelining
Overlap of instructions in different stages of processing
  instruction 1:  1 - 2 - 3 - 4 - 5
  instruction 2:      1 - 2 - 3 - 4 - 5
  instruction 3:          1 - 2 - 3 - 4 - 5
  instruction 4:              1 - 2 - 3 - 4 - 5
  instruction 5:                  1 - 2 - 3 - 4 - 5
  instruction 6:                      1 - 2 - 3 - 4 - 5
  instruction 7:                          1 - 2 - 3 - 4 - 5
Three basic areas:
1. Instruction processing
2. Memory access
3. Floating point instructions results

Pipelining II
■ "Standard" (classical) instruction decomposition (five-stage pipelining):
  Instruction Fetch: instruction is loaded from memory
  Instruction Decode: instruction is decoded (recognized)
  Operand Fetch: operands are ready (fetched from registers and/or memory)
  Execute: instruction is executed
  Writeback: results are written back (to registers and/or memory)
Individual stages are processed in parallel, shifted by one stage

Pipelines and memory
■ "Invisible" pipelines
  ■ Reading (writing) from (to) memory is moved ahead of the actual instruction that works with the data
■ "Visible" pipelines
  ■ Explicit instructions, with a known number of cycles to complete
  ■ E.g. Intel 80860

Processors - RISC
Reduced Instruction Set Computer
■ First RISC: CDC 6600 (Seymour Cray)
  ■ First half of the sixties (1964)
■ Explicit RISC concept during the eighties
■ (Favourable) conditions for RISC processors
  ■ Introduction of caches
  ■ Dramatic decrease in memory cost paralleling the increase in memory size
  ■ Better pipelining
  ■ Improved compilers

RISC conditions II
■ Architecture removed the memory access speed bottleneck
  ■ use of caches
  ■ use of internal registers (decreased number of direct memory accesses)
■ Size of a program became irrelevant (even extensive code can fit into memory)
■ Problem: stall when waiting for the next instruction to finish executing (too complex relationships between instructions, microcode, ...)
■ Solution: complex instructions are not needed, microprograms can be replaced by explicit code
  ■ also, readability of (assembler) code is no longer critical

RISC characteristics
■ All instructions of the same size/length (e.g. 4 bytes)
■ Careful selection of really needed instructions
■ Simple addressing
■ Load/Store architecture
■ Sufficient number of internal registers
■ "Delayed" branches
■ Examples:
  ■ Initially some forerunners: MIPS (Stanford) and SUN SPARC (UoC, Berkeley) architectures
  ■ IBM and their Power Architecture (PowerPC, family of POWER processors)
  ■ HP with PA-RISC
  ■ DEC Alpha
  ■ Intel i860 and i960 or Motorola 88000
  ■ ARC, ARM, ...

RISC - advanced design
■ First generation RISC ideal:
  ■ One instruction finished per clock tick
■ Reality nowadays:
  ■ Several instructions graduated in a single clock tick

New features
■ Superscalar
■ Superpipeline
■ (Very) Long Instruction Word, (V)LIW

Superscalar processors
■ Multiple processing units
  ■ Arithmetic (ALU), floating point (FPU) and others
■ Examples:
  ■ RS/6000, SuperSPARC and newer, Motorola 88110, HP PA 7100 and newer, DEC Alpha, MIPS R8000 and newer, Intel processors, IBM POWER processors, ARM

Superscalar processors - features
■ Parallelism in hardware
  ■ Sequential programs
  ■ "Automatic" parallelization (intra-processor parallelization)
■ Several instructions fetched to the pipeline
■ MADD (Multiply Add)
  ■ Operation X*Y+Z

Superpipeline
■ Further circuit simplification
■ More extensive pipeline decomposition
  ■ Faster execution of individual stages
  ■ resulting in faster processing
■ A different form of parallelism
■ These pipelines are also called deep pipelines
  ■ 16 and more stages
  ■ instructions use only some of the whole set of stages

16 stage pipeline
[Figure: 16-stage processor pipeline (Intel). Stage labels: IF1, IF2, IF3, ID1, ID2, ID3, SC, IS, IRF, AG, DC1, DC2, EX1, FT1, FT2, IWB/DC1. Stage groups: Instruction Fetch, Instruction Decode, Instruction Dispatch, Operand Read, Data Cache Access, Execute, Exceptions and MT handling, Commit. Axis: Frequency (GHz). Caption: "Silicon is capable of high core frequencies."]

VLIW
■ Analogy of superscalar processors (many units)
■ Parallelization under compiler control
  ■ Increased complexity of compilers
  ■ Simplified hardware leads to higher performance
  ■ The decision which instructions can run in parallel is taken by the compiler
■ Advantages:
  ■ Simpler instructions
  ■ No complex control hardware needed
  ■ Lower energy consumption (at least a potential for it)
■ Examples:
  ■ Intel i860
  ■ TriMedia media processors
  ■ C6000 DSP family (Texas Instruments)
  ■ Itanium IA-64 EPIC (partially)
  ■ Crusoe processors from Transmeta
  ■ Elbrus computers

RISC - additional features
■ Register bypass
■ Register renaming
■ Branches
  ■ null operation
  ■ conditional assignment (a = b …

[Figure: IBM POWER10 core block diagram: predecode with fusion/prefix, instruction table (512 entries), instruction buffer (128 entries), decode/fuse, MMA accelerator (2x512b), four 128b execution slices, load queue (128 entries SMT, 64 entries ST), load miss queue (12 entries), prefetch (16 streams), L1 data cache (32k, 8-way)]

Multiprocessor systems
■ Problems with further frequency increase
  ■ Heat dissipation
■ Parallelization
  ■ Performance increase via multiple cores
  ■ Performance increase via multiple CPUs (sockets)

Multiprocessor systems
■ Scaling ratio (number of sockets) for symmetric memory
  ■ 2-8 for AMD and Intel, 16-32 for the IBM POWER family
  ■ Special solutions up to hundreds (SGI, now HPE)
■ Distributed memory
  ■ Centralized (symmetric) memory is a bottleneck
  ■ NUMA (Non-Uniform Memory Access)
■ Clusters with huge numbers of multiprocessor nodes
  ■ Symmetric memory within a node
  ■ Distributed memory across nodes

Multiprocessor systems
■ Cache coherency
  ■ I see what I wrote
  ■ I see what anyone wrote before me
  ■ Order of writes is globally identical
■ Cache line state
  ■ uncached, shared, modified, ...
■ Cache coherency protocols
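
The cache line states listed above (uncached, shared, modified) can be illustrated with a small simulation. The following is only a minimal sketch, assuming a simple write-invalidate scheme in the style of an MSI protocol over a single memory word. The slides do not name a specific protocol, and the names Bus, Cache and State are hypothetical, introduced purely for illustration.

```python
from enum import Enum


class State(Enum):
    UNCACHED = "uncached"   # line not present (invalid) in this cache
    SHARED = "shared"       # clean copy; other caches may hold it too
    MODIFIED = "modified"   # the only valid copy, newer than main memory


class Bus:
    """Hypothetical snooping bus in front of a single memory word."""

    def __init__(self, initial_value=0):
        self.memory = initial_value
        self.caches = []


class Cache:
    """Hypothetical private cache holding one line (assumed MSI-like rules)."""

    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.state = State.UNCACHED
        self.value = None
        bus.caches.append(self)

    def _others(self):
        return [c for c in self.bus.caches if c is not self]

    def read(self):
        if self.state is State.UNCACHED:
            # Read miss: a modified copy elsewhere is written back first,
            # so this reader sees what anyone wrote before it.
            for other in self._others():
                if other.state is State.MODIFIED:
                    self.bus.memory = other.value
                    other.state = State.SHARED
            self.value = self.bus.memory
            self.state = State.SHARED
        return self.value

    def write(self, value):
        # Write: all other copies are invalidated, so only one cache
        # can hold the line in the modified state at any time.
        for other in self._others():
            if other.state is State.MODIFIED:
                self.bus.memory = other.value  # write back before invalidation
            other.state = State.UNCACHED
            other.value = None
        self.value = value
        self.state = State.MODIFIED


bus = Bus(initial_value=0)
a, b = Cache("A", bus), Cache("B", bus)
a.write(1)             # A: modified, B: uncached
print(b.read())        # B misses, A writes back; both end up shared
b.write(2)             # B: modified, A: invalidated (uncached)
print(a.read(), a.state.value, b.state.value)
```

Real protocols (e.g. MESI and its variants) add further states and work on whole cache lines with many addresses; the sketch keeps a single word so that the state transitions stay visible.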