PA152: Efficient Use of DB
2. Data Storage
Vlastislav Dohnal

Data Exchange – Overview
[Figure: data exchange between the HW and SW levels – disk (512 B / 4 KiB sectors) ↔ block device and file-system cache of the operating system (4 KiB blocks) ↔ DBMS buffers in RAM (8 KiB blocks) ↔ CPU cache]

Optimize Disk I/Os
◼ Access techniques
  - Minimize random accesses
◼ Data volume
  - Block size
◼ Storage organization
  - Disk array

Techniques of Accessing Data
◼ App: double buffering
◼ OS: prefetching
◼ OS: defragmentation
  - Arrange blocks in the order of processing
  - File system
    - Addressed at the file level
    - Allocate multiple blocks at once; disk defragmentation tool
◼ HW: planning of accesses (elevator algorithm)
  - The head moves in one direction
  - Disk requests are re-ordered accordingly
◼ Writes go to a battery-backed cache or to a log

Single Buffer
◼ Task
  - Read block B1 → buffer
  - Process data in the buffer
  - Read block B2 → buffer
  - Process data in the buffer
  - …
◼ Costs
  - P = processing time of a block
  - R = time to read a block
  - n = number of blocks to process
◼ Single-buffer time = n(R+P)

Double Buffering
◼ Two buffers in memory, used alternately: while the data in one buffer is being processed, the next block is read into the other buffer
[Figure: blocks A–G on disk; block A is read into buffer 1, then A is processed while B is read into buffer 2, and so on]
◼ Costs
  - P = processing time of a block
  - R = time to read a block
  - n = number of blocks to process
◼ Single-buffer time = n(R+P)
◼ Double-buffer time = R + nP
  - Assuming P ≥ R
  - Otherwise = nR + P
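The two cost formulas can be made concrete with a minimal sketch (Python; the block count and timings are hypothetical):

```python
def single_buffer_time(n, r, p):
    """One buffer: reading and processing strictly alternate."""
    return n * (r + p)

def double_buffer_time(n, r, p):
    """Two buffers: reading block i+1 overlaps processing block i,
    so the slower of the two stages dominates."""
    if p >= r:
        return r + n * p   # reads are hidden behind processing
    return n * r + p       # processing is hidden behind reads

# Hypothetical timings: 10 ms to read a block, 12 ms to process it
n, r, p = 1000, 0.010, 0.012
print(single_buffer_time(n, r, p))   # 22.0 s
print(double_buffer_time(n, r, p))   # 12.01 s
```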
Aims Increase transfer rate by splitting data into multiple disks Parallelize long reads to reduce response time Load balancing → increase throughput Decreasing reliability PA152, Vlastislav Dohnal, FI MUNI, 2023 20 Data Striping ◼ Bit-level striping Distribute bits of each byte to among disks Access time worse than that of one disk Rarely used ◼ Block-level striping n disks A block i is stored on disk (i mod n)+1 Reading of different blocks is parallelized ◼ If on different disks Large reading may utilize all disks PA152, Vlastislav Dohnal, FI MUNI, 2023 21 RAID ◼ Redundant Arrays of Independent Disks ◼ Different variants for different requirements Different performance Different availability ◼ Combinations RAID1+0 (or RAID10) ◼ RAID0 built over RAID1 arrays PA152, Vlastislav Dohnal, FI MUNI, 2023 22 RAID0, RAID1 ◼ RAID0  Block striping, non-redundant  High performance, non-increased data availability  No reduced capacity ◼ RAID1  Mirrored disks ◼ Sometimes limited to two disks  Capacity 1/n; fast reading; writing as of 1 disk  Suitable for DB logs, etc. ◼ when writes are sequential  RAID1E – combines mirroring and striping … PA152, Vlastislav Dohnal, FI MUNI, 2023 23 RAID2, RAID3 ◼ RAID2  Bit-striping, Hamming Error-Correcting-Code  Recovers from of 1 disk failure  Error is not detected by the drive! ◼ RAID3  Byte-striping with parity  1 parity disk, errors detected by the drive!  Writing: calculate and store parity  Restoring disk data ◼ XOR of bits of the other disks PA152, Vlastislav Dohnal, FI MUNI, 2023 24 RAID4 ◼ Uses block-striping (compared to RAID3) Parity blocks on a separate disk Writing: calculate and store parity Restoring disk data ◼ XOR of bits of the other disks Faster than RAID3 ◼ Block read from 1 disk only → can parallelize ◼ Disks may not be fully synchronized PA152, Vlastislav Dohnal, FI MUNI, 2023 25 RAID4 (cont.) ◼ Block write → calculation of parity block Take the original parity, the original block and the new block (2 reads and 2 writes) Or calculate the new parity of all blocks (n-1 reads and 2 writes) Efficient for large sequential writes ◼ Parity disk is a bottleneck! Writing a block induces writing the parity block ◼ RAID3, RAID4 – at least 3 disks (2+1) Capacity decreased by the parity disk PA152, Vlastislav Dohnal, FI MUNI, 2023 26 RAID5 ◼ Block-Interleaved Distributed Parity Splits data and also parity among n disks Load on parity disk of RAID4 removed Parity block for i-th block is on disk Τ𝑖 𝑛−1 mod 𝑛 ◼ Example with 5 disks Parity for block i is on Τ𝑖 4 mod 5 PA152, Vlastislav Dohnal, FI MUNI, 2023 27 RAID5 (cont.) ◼ Faster than RAID4 Writing parallel if to different disks Replaces RAID4 ◼ Same advantages, but removes disadvantage of separate parity disk ◼ Frequently used solution PA152, Vlastislav Dohnal, FI MUNI, 2023 28 RAID6 ◼ P+Q Redundancy scheme Similar to RAID5, but stores extra information to recover from failures of more disks Two parity disks (dual distributed parity) ◼ Min. 
RAID
◼ Redundant Arrays of Independent Disks
◼ Different variants for different requirements
  - Different performance
  - Different availability
◼ Combinations
  - RAID1+0 (or RAID10)
    - RAID0 built over RAID1 arrays

RAID0, RAID1
◼ RAID0
  - Block striping, non-redundant
  - High performance, but no increase in data availability
  - No capacity reduction
◼ RAID1
  - Mirrored disks
    - Sometimes limited to two disks
  - Capacity 1/n; fast reading; writing as fast as with 1 disk
  - Suitable for DB logs, etc.
    - when writes are sequential
  - RAID1E – combines mirroring and striping …

RAID2, RAID3
◼ RAID2
  - Bit striping, Hamming error-correcting code
  - Recovers from a failure of 1 disk
    - even when the error is not detected (reported) by the drive itself!
◼ RAID3
  - Byte striping with parity
  - 1 parity disk; relies on errors being detected by the drive!
  - Writing: calculate and store the parity
  - Restoring disk data
    - XOR of the bits of the other disks

RAID4
◼ Uses block striping (compared to RAID3)
  - Parity blocks on a separate disk
  - Writing: calculate and store the parity
  - Restoring disk data
    - XOR of the bits of the other disks
◼ Faster than RAID3
  - A block is read from 1 disk only → reads can be parallelized
  - Disks need not be fully synchronized

RAID4 (cont.)
◼ Block write → calculation of the parity block (see the sketch after the RAID5 slides)
  - Either take the original parity, the original block, and the new block (2 reads and 2 writes)
  - Or calculate the new parity from all blocks (n−1 reads and 2 writes)
    - Efficient for large sequential writes
◼ The parity disk is a bottleneck!
  - Writing any block induces writing the parity block
◼ RAID3, RAID4 – at least 3 disks (2+1)
  - Capacity decreased by the parity disk

RAID5
◼ Block-Interleaved Distributed Parity
  - Splits data and also parity among n disks
  - The load on the dedicated parity disk of RAID4 is removed
  - The parity block for the i-th block is on disk ⌊i/(n−1)⌋ mod n
◼ Example with 5 disks
  - Parity for block i is on disk ⌊i/4⌋ mod 5

RAID5 (cont.)
◼ Faster than RAID4
  - Writes are parallel if they go to different disks
◼ Replaces RAID4
  - Same advantages, but removes the disadvantage of the separate parity disk
◼ Frequently used solution
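A small sketch (Python, illustrative names) of the two rules just described: the read-modify-write parity update of RAID4/5 and the RAID5 parity placement. Blocks are modeled as ints, so ⊕ is the ^ operator:

```python
def updated_parity(old_parity: int, old_block: int, new_block: int) -> int:
    """Small write in RAID4/5: parity' = parity XOR old_data XOR new_data
    (2 reads: old block + old parity; 2 writes: new block + new parity)."""
    return old_parity ^ old_block ^ new_block

def raid5_parity_disk(i: int, n: int) -> int:
    """RAID5: the parity for the i-th block is on disk floor(i/(n-1)) mod n."""
    return (i // (n - 1)) % n

# With 5 disks, the parity disk rotates after every 4 blocks:
print([raid5_parity_disk(i, 5) for i in range(12)])   # [0,0,0,0, 1,1,1,1, 2,2,2,2]
```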
RAID6
◼ P+Q redundancy scheme
  - Similar to RAID5, but stores extra information to recover from failures of more disks
  - Two parity disks (dual distributed parity)
    - Min. 4 disks in the array (capacity shrunk by 2 disks)
  - Error-correcting codes (Hamming codes)
◼ Repairs a failure of two disks

RAID Combinations
◼ Different variants combined into one system
  - An array assembled from physical disks
  - A resulting array built over these arrays
◼ Used to increase performance and reliability
◼ Example
  - RAID5+0 over 6 physical disks
    - Each 3 disks create a RAID5 array
    - The RAID5 arrays form one RAID0
[Figure: two RAID5 arrays of three 1 TB disks each, combined into one RAID0. Source: Wikipedia]

RAID1+0 vs. RAID0+1
◼ RAID1+0 is more resistant to failures
  - Failure of a disk in any RAID1: OK
◼ RAID0+1
  - Failure of a disk in the 1st RAID0 and a failure of a disk in the 2nd RAID0 → data loss

RAID Summary
◼ RAID0 – when data availability is not important
  - Data can be easily and quickly restored
    - from a backup, …
◼ RAID2, 3 and 4 superseded by RAID5
  - Bit-/byte-striping leads to the utilization of all disks for any read/write access; non-distributed parity
◼ RAID6 – less used than RAID5
  - RAID1 and 5 provide sufficient reliability
  - RAID6 for very high-capacity disks
◼ Combinations used: RAID1+0, RAID5+0
◼ Choosing between RAID1 and RAID5 (see below)

RAID Summary (cont.)
◼ RAID1+0
  - Much faster writing than RAID5
  - For applications with a large number of writes
  - More expensive than RAID5 (lower capacity)
◼ RAID5
  - Each write typically requires 2 reads and 2 writes
    - RAID1+0 needs just 2 writes
  - Suitable for apps with a smaller number of writes
  - Check the "chunk" size
◼ I/O requirements of current apps
  - Very high (e.g., web servers, …)
  - Need to buy many disks to fulfill the requirements
    - If their capacity is sufficient, use RAID1 (no further costs)
    - Preferably RAID1+0

RAID Summary (cont.)
◼ RAID does not substitute backups!!!
◼ Implementation
  - SW – supported in almost any OS
  - HW – a special disk controller
    - Necessary to use a battery-backed cache or non-volatile RAM
    - Double-check the controller's CPU performance – it can be slower than a SW implementation!!!
◼ Hot-swapping
  - Usually supported in HW implementations
  - No problem in a SW implementation, if the HW supports it
◼ Spare disks
  - Presence of extra disks in the array

Disk Failures
◼ Intermittent failure
  - Error during read/write → repeat → OK
◼ Medium defect
  - Permanent fault of a sector
  - Modern disks detect and correct it
    - The sector is reallocated from spare capacity
◼ Permanent failure
  - Total damage → replace the disk

Coping with Disk Failures
◼ Detection
  - Checksums
◼ Correction by redundancy
  - Stable storage
    - Disk array
    - Storing at multiple places of the same disk → super-block; ZFS does so for data
  - Journal (log / journal of modifications)
  - Error-correcting codes (ECC)
    - Hamming codes, …

Stable Storage and Databases
[Figure: the operating system keeps two copies (Copy1, Copy2) of each logical block; the database system writes the current DB and keeps a backup DB plus the DB log]

Error Correcting Codes
◼ Parity bit = even / odd parity
  - Used in RAID3, 4, 5
◼ Example of even parity: RAID4 over 4 disks, block no. 1:
  Disk 1: 11110000…
  Disk 2: 10101010…
  Disk 3: 00111000…
  Disk P: 01100010…
◼ After a failure, the lost content is recomputed from the other disks:
  Disk 1: 11110000…
  Disk 2: ????????… ← failure
  Disk 3: 00111000…
  Disk P: 01100010…

Error Correcting Codes
◼ Algebra with the operator ⊕ = sum modulo 2 (over bit vectors)
  - Even parity, i.e., adding a 1 to make an even number of 1s
  - x ⊕ y = y ⊕ x
  - (x ⊕ y) ⊕ z = x ⊕ (y ⊕ z)
  - x ⊕ 0 = x
  - x ⊕ x = 0
◼ If x ⊕ y = z, then y = x ⊕ z
  - Add x to both sides…
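A short sketch (Python) of the parity-based recovery above; single bytes stand in for whole blocks, and the failed disk is recomputed by XOR-ing all surviving disks, parity included:

```python
from functools import reduce

def parity(blocks):
    """Parity block = bytewise XOR of all blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

d1, d2, d3 = bytes([0b11110000]), bytes([0b10101010]), bytes([0b00111000])
p = parity([d1, d2, d3])            # 01100010, as on the slide

# Disk 2 fails: XOR of the surviving disks and the parity restores it
lost = parity([d1, d3, p])
assert lost == d2
print(f"{p[0]:08b} {lost[0]:08b}")  # 01100010 10101010
```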
Error Correcting Codes
◼ Hamming code
  - Example of recovery from 2 crashes
    - 7 disks → four data disks (1–4) and three parity disks (5–7)
  - Redundancy schema:
    - A parity disk stores the even parity of the data disks marked with 1 in its row

    Disk no.:       1  2  3  4 | 5  6  7
    Parity disk 5:  1  1  1  0 | 1  0  0
    Parity disk 6:  1  1  0  1 | 0  1  0
    Parity disk 7:  1  0  1  1 | 0  0  1

Error Correcting Codes (cont.)
◼ Hamming code – content sample and writing
  Disk 1: 11110000…
  Disk 2: 10101010…
  Disk 3: 00111000…
  Disk 4: 01000001…
  Disk 5: 01100010…
  Disk 6: 00011011…
  Disk 7: 10001001…
◼ Writing 00001111… to disk 2 updates the parity disks whose rows contain disk 2 (disks 5 and 6):
  Disk 2: 00001111…
  Disk 5: 11000111…
  Disk 6: 10111110…
  (Disks 1, 3, 4 and 7 are unchanged.)

Error Correcting Codes (cont.)
◼ Hamming code – disk failure (disks 2 and 5 fail)
  Disk 1: 11110000…
  Disk 2: ????????…
  Disk 3: 00111000…
  Disk 4: 01000001…
  Disk 5: ????????…
  Disk 6: 10111110…
  Disk 7: 10001001…
◼ Recovery of disk 2
  - Use the row of the redundancy schema that has a 0 for disk 5 (the row of parity disk 6): disk 2 = disk 1 ⊕ disk 4 ⊕ disk 6 = 00001111…
◼ Recovery of disk 5
  - Now recompute it from its own row: disk 5 = disk 1 ⊕ disk 2 ⊕ disk 3 = 11000111…

Error Correcting Codes (cont.)
◼ Definition of Hamming code
  - A code of length n is a set of n-bit vectors (code words).
  - Hamming distance is the number of positions in which two n-bit vectors differ.
  - Minimum distance of a code is the smallest Hamming distance between any two different code words.
  - A Hamming code is a code with minimum distance 3:
    - up to two bit flips can be detected (but not corrected);
    - a single bit flip is detected and corrected.

Error Correcting Codes (cont.)
◼ Generating a Hamming code (n, d); p = n − d parity bits
  - Number the bit positions from 1 and write the numbers in columns in binary
  - Every position whose binary number has a single bit set (a power of two) is a parity bit
  - A row shows the sources for computing that parity bit
  - A column shows which parity bits cover a data bit
◼ Layout for n up to 17:
  - Positions: 1=p1, 2=p2, 3=d1, 4=p3, 5=d2, 6=d3, 7=d4, 8=p4, 9=d5, 10=d6, 11=d7, 12=d8, 13=d9, 14=d10, 15=d11, 16=p5, 17=d12
  - p1 (2^0) covers positions 1, 3, 5, 7, 9, 11, 13, 15, 17
  - p2 (2^1) covers positions 2, 3, 6, 7, 10, 11, 14, 15
  - p3 (2^2) covers positions 4, 5, 6, 7, 12, 13, 14, 15
  - p4 (2^3) covers positions 8, 9, 10, 11, 12, 13, 14, 15
  - p5 (2^4) covers positions 16, 17

Error Correcting Codes (cont.)
◼ Store data bits 1010 in Hamming code (7,4)
  - p1 = d1⊕d2⊕d4 = 1, p2 = d1⊕d3⊕d4 = 0, p3 = d2⊕d3⊕d4 = 1 → code word 1011010
◼ To correct errors in data read from storage:
  - Check all parity bits.
  - Sum the positions of the bad ones to get the address of the wrong bit.
◼ Examples (the stored word was 1011010):
  - 1111010 → one bit was flipped (position 2), so it can be corrected.
  - 1011110 → one bit was flipped (position 5), so it can be corrected.
  - 1001110 → two bits were flipped, so the code cannot distinguish a 2-bit from a 1-bit error.
  - 1110000 → three bits were flipped here (the result is again a valid code word, so the error even passes undetected).
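A compact sketch (Python) of Hamming(7,4) encoding and syndrome decoding as described above; positions are 1-based and the parity bits sit at the powers of two:

```python
def hamming74_encode(d1, d2, d3, d4):
    """Return the code word as bits [p1, p2, d1, p3, d2, d3, d4]."""
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(word):
    """Re-check every parity bit; the sum of the failing parity positions
    is the position of the flipped bit (assuming a single-bit error)."""
    w = list(word)
    syndrome = 0
    for p in (1, 2, 4):                              # parity positions
        covered = [i for i in range(1, 8) if i & p]
        if sum(w[i - 1] for i in covered) % 2 == 1:  # parity violated
            syndrome += p
    if syndrome:
        w[syndrome - 1] ^= 1                         # flip the bad bit back
    return w

print(hamming74_encode(1, 0, 1, 0))              # [1, 0, 1, 1, 0, 1, 0]
print(hamming74_correct([1, 1, 1, 1, 0, 1, 0]))  # position 2 corrected
```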
Error Correcting Codes (cont.)
◼ Extended Hamming code
  - Add an extra parity bit pX (position 0) computed over all bits
    - It tells whether an even or odd number of errors occurred
  - For (7,4): pX covers positions 0–7; p1, p2, p3 cover the same positions as before
◼ Store data bits 1010 in the extended Hamming code (7,4) → 01011010
◼ Detect/correct an error, if any:
  - 01111010 → one bit was flipped (position 2) → corrected
  - 01011110 → one bit was flipped (position 5) → corrected
  - 01001110 → 2 bits were flipped; but no clue which.
  - 01110000 → an odd number of bits (>1) were flipped; but no clue which.

Error Correcting Codes (cont.)
◼ Reed-Solomon code (n, d)
  - An ECC adding t = n − d check symbols to d data symbols
  - Can detect up to t errors
  - Can correct up to ⌊t/2⌋ errors
  - So the minimum Hamming distance is n − d + 1

Failures
◼ Mean Time To Failure (MTTF)
  - Also: Mean Time Between Failures (MTBF)
  - Corresponds to the failure likelihood
  - Average operating time between failures
    - Half of the disks fail during this period
    - Assumes a uniform distribution of failures
  - Decreases with disk age
  - Usually 1,000,000 hours or more
    - ≈ 114 years
    - i.e., a disk fails with 100% probability within 228 years
    - → P(failure in a year) = 0.44%
    - ≈ Annualized Failure Rate (AFR)

Failures (cont.)
◼ Example
  - MTTF = 1,000,000 hours → consider a population of 2,000,000 disks
  - One disk fails per hour, i.e., 8,760 disks a year
  - → probability of failure in a year = 8,760 / 2,000,000 ≈ 0.44%

Failures (cont.)
◼ Alternative measures
  - Annualized Failure Rate (AFR)
  - Component Design Life
  - Annual Replacement Rate (ARR), or Annualized Return Rate
    - Not all failures are caused by disk faults
      - defective cable, etc.
    - It is stated that 40% of ARR is "No Trouble Found" (NTF)
    - AFR = ARR × 0.6, i.e., ARR = AFR / 0.6

Failures and Manufacturers
◼ Seagate
  - http://www.seagate.com/docs/pdf/whitepaper/drive_reliability.pdf (November 2000)
  - Savvio® 15K.2 hard disks – 73 GB
    - AFR = 0.55%
  - Seagate estimates the MTTF of a disk as the number of power-on hours (POH) per year divided by the first-year AFR
  - The AFR is derived from Reliability-Demonstration Tests (RDT)
    - RDT at Seagate = hundreds of disks operating at 42 °C ambient temperature

Failures and Manufacturers
◼ Influence of temperature on the MTTF during the 1st year (Seagate)
[Figure: adjusted MTTF vs. operating temperature]

Failures and Manufacturers
◼ Seagate Barracuda ES.2 – a near-line Serial ATA disk
[Figure: reliability specification table]
  - Note 1: Weibull – a statistical method for modeling the progress of failures
  - Note 2: 2,400 hours/yr means 6.5 hours of operation a day!

Failures – Practice
◼ Google
  - http://research.google.com/archive/disk_failures.pdf (FAST conference, 2007)
  - Test on 100,000 disks
    - SATA, PATA disks; 5,400–7,200 rpm; 80–400 GB
[Figure: observed AFR by disk age]

Failures – Practice
◼ Study on 100,000 SCSI, FC, and SATA disks
  - http://www.cs.cmu.edu/~bianca/fast07.pdf (FAST conference, 2007)
[Figure: annual replacement rate (%) per system – e.g., HPC3 with 3,064 SCSI disks (15k rpm, 146 GB), 11,000 SATA disks (7,200 rpm, 250 GB), and 144 SCSI disks (15k rpm, 73 GB); the average ARR exceeds the datasheet AFRs of 0.58% and 0.88%]

Failures – Practice
◼ Conclusions:
  - ARR increases with rising temperature
    - Not confirmed in the data by Google
  - SMART parameters are well correlated with higher failure probabilities (Google)
    - After the first scan error, a disk is 39 times more likely to fail within 60 days.
    - First errors in reallocations, offline reallocations, and probational counts are strongly correlated with higher failure probabilities.
  - It is appropriate to use an AFR of 3–4% in evaluations
    - If you plan for an AFR that is 50% higher than the MTTF suggests, you'll be better prepared.
  - Be ready to replace disks after 3 years of operation

Failure Recovery
◼ We know: AFR = 1 / (2 × MTTF)
◼ Mean Time To Repair (MTTR)
  - Time from the failure to the recovery of operation
  - = time to replace the failing unit + data recovery
  - P(failure during repair) = PFDR = (2 × MTTR) / 1 year
    - Assuming the repair time is very short
◼ Mean Time To Data Loss (MTTDL)
  - Depends on MTTF and MTTR
  - Mean time between two data-loss events
  - For one disk (i.e., data stored on one disk)
    - MTTDL = MTTF

Failure Recovery – Set of Disks
◼ Assumption
  - The failure of each disk is equally probable and independent of the others
◼ Example for RAID0
  - One disk
    - AFR(1 disk) = 0.44% (MTTF = 1,000,000 hrs ≈ 114 yrs)
  - A system of 100 disks (MTTF(100 disks) = MTTF(1 disk) / 100)
    - AFR(100 disks) = 44% (MTTF = 10,000 hrs ≈ 1.14 yrs)
    - 1 disk fails each year on average
  - Probability (exactly 1 of n fails vs. at least 1 of n fails)
    - P(exactly 1 of 100 fails) = 28.43%; P(at least 1 of 100 fails) = 35.66%
    - P(exactly 1 of 10 fails) = 4.23%; P(at least 1 of 10 fails) = 4.31%
◼ AFR(n disks) = AFR(1 disk) × n
◼ MTTDL = 0.5 / AFR
  - e.g., MTTDL(100 disks) = 0.5 / 0.44 ≈ 1.136 yrs

RAID1: Example of Reliability
◼ 2 mirrored 500 GB disks
  - AFR of each = 3%
◼ Replacement of the failed disk and array recovery within 3 hrs
  - MTTR = 3 hrs (at 100 MB/s the copying takes 1.5 hrs)
◼ Probability of data loss:
  - P(1-disk failure) = AFR = 0.03
  - P(1 out of 2 fails) = 0.06
  - PFDR = 2 × MTTR / 1 year = 2×3 / 8,760 = 0.000 685
  - P(data loss) = P(1 out of 2 fails) × PFDR × P(1-disk failure) = 0.000 001 233
  - MTTDL = 0.5 / P(data loss) = 405,515 yrs

RAID0: Example of Reliability
◼ 1 disk: AFR = 3% (P(1-disk failure))
◼ RAID0 – two disks, striping
  - P(data loss) = P(1 out of 2 fails) = 6%
  - MTTDL = 0.5 / 0.06 ≈ 8.3 yrs
    - i.e., AFR(array) = 6%

RAID4: Example of Reliability
◼ 1 disk: AFR = 3% (P(1-disk failure))
◼ RAID4 – repairs a failure of 1 disk
  - 4 disks (3+1)
  - MTTR = 3 hrs
    - PFDR = 2×3 / 8,760 = 0.000 685
  - P(data loss) = P(1 out of 4 fails) × PFDR × P(1 out of 3 fails)
  - P(data loss) = 4×0.03 × 2/2,920 × 3×0.03 = 0.000 007 397
    - which is the AFR of this array
  - MTTDL = 0.5 / P(data loss) = 67,592 yrs

RAID6: Example of Reliability
◼ RAID6 – repairs a failure of 2 disks
  - 4 disks (2+2)
  - P(data loss) = P(1 out of 4 fails) × PFDR × P(RAID4 over the remaining 3 disks loses data)

Array Reliability
◼ n disks in the array in total (incl. parity disks)
◼ 1 parity disk ensures data redundancy
  - AFR(array, 1 parity) = n × AFR(1 disk) × PFDR × (n−1) × AFR(1 disk)
  - MTTDL = 0.5 / AFR(array)
◼ 2 parity disks
  - AFR(array, 2 parities) = n × AFR(1 disk) × PFDR × AFR(array, 1 parity)
    - where AFR(array, 1 parity) is evaluated for the remaining n−1 disks!
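A numeric sketch (Python, with illustrative function names) of the formulas above; it reproduces the RAID1 and RAID4 example figures and composes nested arrays the way the combination slides below do:

```python
HOURS_PER_YEAR = 8760

def pfdr(mttr_hours):
    """Probability that another failure hits during the repair window."""
    return 2 * mttr_hours / HOURS_PER_YEAR

def afr_array_1p(n, afr_disk, mttr_hours):
    """Single-parity array of n disks: any disk fails, then one of the
    remaining n-1 disks fails before the repair finishes."""
    return n * afr_disk * pfdr(mttr_hours) * (n - 1) * afr_disk

def mttdl(afr):
    """Mean time to data loss in years."""
    return 0.5 / afr

afr_disk, mttr = 0.03, 3
# RAID1 with 2 disks behaves as a single-parity array with n = 2:
print(mttdl(afr_array_1p(2, afr_disk, mttr)))  # ~405,556 yrs (slide: 405,515, rounded PFDR)
# RAID4 over 4 disks (3+1):
afr_raid4 = afr_array_1p(4, afr_disk, mttr)
print(mttdl(afr_raid4))                        # ~67,592 yrs
# RAID4+0 over 8 disks = RAID0 of two RAID4 arrays, so the AFRs add up:
print(mttdl(2 * afr_raid4))                    # ~33,796 yrs
```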
Example of Reliability: RAID Combinations
◼ Combination of arrays
  - Evaluate the MTTDL of the individual components
    - Use it as the MTTF of a virtual disk
  - Evaluate the final MTTDL
[Figure: two RAID5 arrays of three 1 TB disks each form virtual disks combined into one RAID0]

Example of Reliability: RAID Combinations
◼ RAID5+0
  - 1 disk: AFR(disk)
  1) Evaluate AFR(RAID5)
  2) Evaluate AFR(RAID0) = 2 × AFR(RAID5)
  3) MTTDL(RAID5+0) = 0.5 / AFR(RAID0)

Example of Reliability: RAID Combinations
◼ RAID4+0 over 8 disks
  - 1 disk: AFR = 3%, MTTR = 3 hrs
  - Assemble one RAID4 over every 4 disks
    - AFR(RAID4) = 4×AFR × PFDR × 3×AFR = … = 7.4×10⁻⁶
  - Assemble the two RAID4s into a RAID0
    - AFR(RAID4+0) = 2 × AFR(RAID4) = 1.48×10⁻⁵
  - MTTDL = 0.5 / AFR(RAID4+0) = 33,796 yrs

Failures: "Write Hole" Phenomenon
◼ = Data is not written to all disks of a stripe.
◼ Severity
  - Can go unnoticed
  - Discoverable during array reconstruction
◼ Solutions
  - UPS
  - Journaling
    - but with a "data written" commit message (-:
  - Synchronizing the array
  - A special file system (ZFS)
    - uses "copy-on-write" to provide write atomicity

File Systems
◼ Storing a data block:
  1. Add an unused block to the list of used space
  2. Write the data block
  3. Write the file metadata referencing that data block
◼ A modern FS uses journaling
  - Start a transaction in the journal
  - Store info about steps 1.–3. in the journal
  - Do steps 1.–3.
  - End the transaction in the journal

File System Tuning
◼ Match the FS block size to the DB block size
  - ZFS has 128 KB by default!
◼ DB journal (WAL in PostgreSQL)
  - ext2; ext3/4 with data=writeback (data journaling off)
◼ DB data
  - ext3/4 with data=ordered (only metadata journaled)
◼ Switch off file access times (noatime)
◼ Eliminate swapping (vm.swappiness = 0)
◼ Process memory allocation (vm.overcommit_memory = 2)
◼ …

RAID over SSD
◼ SSD – the issue of wear
  - The limited number of writes is handled by moving writes to other areas, i.e., wear-leveling
  - Consequence: total failure after some time
◼ RAID over SSDs
  - Worse data availability/reliability
    - It is almost certain that the SSDs will fail at once
  - Diff-RAID
    - Distributes parity unevenly
    - After replacing a failed SSD with a brand-new one, parity is moved primarily to the most worn-out drive.

Recommended Reading
◼ Dual parity
  - https://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
◼ Software RAID in SUSE
  - https://www.suse.com/documentation/sles10/stor_admin/data/raidevms.html
  - Sections:
    - Managing Software RAIDs with EVMS
    - Managing Software RAIDs 6 and 10 with mdadm
◼ SSD on Wikipedia
  - https://en.wikipedia.org/wiki/Solid-state_drive
◼ Živě.cz: How much written data does a modern SSD really endure? (in Czech)
  - http://m.zive.cz/ze-sveta-kolik-realne-zapsanych-dat-vydrzi-moderni-ssd/a-177557/?textart=1
◼ Chunk size and performance
  - https://raid.wiki.kernel.org/index.php/Performance

Takeaways
◼ Understanding of IOPS
◼ Failures – meaning and terms
  - MTTR, MTTF, MTTDL, AFR
◼ Computation of array reliability, incl. nested (combined) arrays
◼ The write-hole phenomenon
◼ Implications of SSDs in arrays