PA152: Efficient Use of DB
2. Data Storage
Vlastislav Dohnal

Data Exchange – Overview
[Figure: data exchange between the HW and SW levels – disk (512 B / 4 KiB sectors) ↔ block device and file-system cache of the operating system (4 KiB blocks) ↔ DBMS buffers in RAM (8 KiB blocks) ↔ CPU cache]

Optimize Disk I/Os
◼ Access techniques
  - Minimize random accesses
◼ Data volume
  - Block size
◼ Storage organization
  - Disk array

Techniques of Accessing Data
◼ App: double buffering
◼ OS: prefetching
◼ OS: defragmentation
  - Arrange blocks in the order of processing
  - File system
    - Addressed at the file level
    - Allocate multiple blocks at once; disk defragmentation tool
◼ HW: planning of accesses (elevator algorithm)
  - The head moves in one direction
  - Disk requests are re-ordered accordingly
◼ Writes go to a battery-backed cache or to a log

Single Buffer
◼ Task
  - Read block B1 → buffer
  - Process data in the buffer
  - Read block B2 → buffer
  - Process data in the buffer
  - …
◼ Costs
  - P = processing time of a block
  - R = time to read a block
  - n = number of blocks to process
◼ Single-buffer time = n(R+P)

Double Buffering
◼ Two buffers in memory, used alternately: while the data in one buffer is being processed, the next block is read into the other buffer
[Figure: blocks A–G on disk; block A is read into buffer 1, then A is processed while B is read into buffer 2, and so on]
◼ Costs
  - P = processing time of a block
  - R = time to read a block
  - n = number of blocks to process
◼ Single-buffer time = n(R+P)
◼ Double-buffer time = R + nP
  - Assuming P ≥ R
  - Otherwise = nR + P
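The two cost formulas can be made concrete with a minimal sketch (Python; the block count and timings are hypothetical):

```python
def single_buffer_time(n, r, p):
    """One buffer: reading and processing strictly alternate."""
    return n * (r + p)

def double_buffer_time(n, r, p):
    """Two buffers: reading block i+1 overlaps processing block i,
    so the slower of the two stages dominates."""
    if p >= r:
        return r + n * p   # reads are hidden behind processing
    return n * r + p       # processing is hidden behind reads

# Hypothetical timings: 10 ms to read a block, 12 ms to process it
n, r, p = 1000, 0.010, 0.012
print(single_buffer_time(n, r, p))   # 22.0 s
print(double_buffer_time(n, r, p))   # 12.01 s
```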
Aims Increase transfer rate by splitting data into multiple disks Parallelize long reads to reduce response time Load balancing → increase throughput Decreasing reliability PA152, Vlastislav Dohnal, FI MUNI, 2023 20 Data Striping ◼ Bit-level striping Distribute bits of each byte to among disks Access time worse than that of one disk Rarely used ◼ Block-level striping n disks A block i is stored on disk (i mod n)+1 Reading of different blocks is parallelized ◼ If on different disks Large reading may utilize all disks PA152, Vlastislav Dohnal, FI MUNI, 2023 21 RAID ◼ Redundant Arrays of Independent Disks ◼ Different variants for different requirements Different performance Different availability ◼ Combinations RAID1+0 (or RAID10) ◼ RAID0 built over RAID1 arrays PA152, Vlastislav Dohnal, FI MUNI, 2023 22 RAID0, RAID1 ◼ RAID0  Block striping, non-redundant  High performance, non-increased data availability  No reduced capacity ◼ RAID1  Mirrored disks ◼ Sometimes limited to two disks  Capacity 1/n; fast reading; writing as of 1 disk  Suitable for DB logs, etc. ◼ when writes are sequential  RAID1E – combines mirroring and striping … PA152, Vlastislav Dohnal, FI MUNI, 2023 23 RAID2, RAID3 ◼ RAID2  Bit-striping, Hamming Error-Correcting-Code  Recovers from of 1 disk failure  Error is not detected by the drive! ◼ RAID3  Byte-striping with parity  1 parity disk, errors detected by the drive!  Writing: calculate and store parity  Restoring disk data ◼ XOR of bits of the other disks PA152, Vlastislav Dohnal, FI MUNI, 2023 24 RAID4 ◼ Uses block-striping (compared to RAID3) Parity blocks on a separate disk Writing: calculate and store parity Restoring disk data ◼ XOR of bits of the other disks Faster than RAID3 ◼ Block read from 1 disk only → can parallelize ◼ Disks may not be fully synchronized PA152, Vlastislav Dohnal, FI MUNI, 2023 25 RAID4 (cont.) ◼ Block write → calculation of parity block Take the original parity, the original block and the new block (2 reads and 2 writes) Or calculate the new parity of all blocks (n-1 reads and 2 writes) Efficient for large sequential writes ◼ Parity disk is a bottleneck! Writing a block induces writing the parity block ◼ RAID3, RAID4 – at least 3 disks (2+1) Capacity decreased by the parity disk PA152, Vlastislav Dohnal, FI MUNI, 2023 26 RAID5 ◼ Block-Interleaved Distributed Parity Splits data and also parity among n disks Load on parity disk of RAID4 removed Parity block for i-th block is on disk Τ𝑖 𝑛−1 mod 𝑛 ◼ Example with 5 disks Parity for block i is on Τ𝑖 4 mod 5 PA152, Vlastislav Dohnal, FI MUNI, 2023 27 RAID5 (cont.) ◼ Faster than RAID4 Writing parallel if to different disks Replaces RAID4 ◼ Same advantages, but removes disadvantage of separate parity disk ◼ Frequently used solution PA152, Vlastislav Dohnal, FI MUNI, 2023 28 RAID6 ◼ P+Q Redundancy scheme Similar to RAID5, but stores extra information to recover from failures of more disks Two parity disks (dual distributed parity) ◼ Min. 
RAID
◼ Redundant Arrays of Independent Disks
◼ Different variants for different requirements
  - Different performance
  - Different availability
◼ Combinations
  - RAID1+0 (or RAID10)
    - RAID0 built over RAID1 arrays

RAID0, RAID1
◼ RAID0
  - Block striping, non-redundant
  - High performance, but no increase in data availability
  - No capacity reduction
◼ RAID1
  - Mirrored disks
    - Sometimes limited to two disks
  - Capacity 1/n; fast reading; writing as fast as with 1 disk
  - Suitable for DB logs, etc.
    - when writes are sequential
  - RAID1E – combines mirroring and striping …

RAID2, RAID3
◼ RAID2
  - Bit striping, Hamming error-correcting code
  - Recovers from a failure of 1 disk
    - even when the error is not detected (reported) by the drive itself!
◼ RAID3
  - Byte striping with parity
  - 1 parity disk; relies on errors being detected by the drive!
  - Writing: calculate and store the parity
  - Restoring disk data
    - XOR of the bits of the other disks

RAID4
◼ Uses block striping (compared to RAID3)
  - Parity blocks on a separate disk
  - Writing: calculate and store the parity
  - Restoring disk data
    - XOR of the bits of the other disks
◼ Faster than RAID3
  - A block is read from 1 disk only → reads can be parallelized
  - Disks need not be fully synchronized

RAID4 (cont.)
◼ Block write → calculation of the parity block (see the sketch after the RAID5 slides)
  - Either take the original parity, the original block, and the new block (2 reads and 2 writes)
  - Or calculate the new parity from all blocks (n−1 reads and 2 writes)
    - Efficient for large sequential writes
◼ The parity disk is a bottleneck!
  - Writing any block induces writing the parity block
◼ RAID3, RAID4 – at least 3 disks (2+1)
  - Capacity decreased by the parity disk

RAID5
◼ Block-Interleaved Distributed Parity
  - Splits data and also parity among n disks
  - The load on the dedicated parity disk of RAID4 is removed
  - The parity block for the i-th block is on disk ⌊i/(n−1)⌋ mod n
◼ Example with 5 disks
  - Parity for block i is on disk ⌊i/4⌋ mod 5

RAID5 (cont.)
◼ Faster than RAID4
  - Writes are parallel if they go to different disks
◼ Replaces RAID4
  - Same advantages, but removes the disadvantage of the separate parity disk
◼ Frequently used solution
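A small sketch (Python, illustrative names) of the two rules just described: the read-modify-write parity update of RAID4/5 and the RAID5 parity placement. Blocks are modeled as ints, so ⊕ is the ^ operator:

```python
def updated_parity(old_parity: int, old_block: int, new_block: int) -> int:
    """Small write in RAID4/5: parity' = parity XOR old_data XOR new_data
    (2 reads: old block + old parity; 2 writes: new block + new parity)."""
    return old_parity ^ old_block ^ new_block

def raid5_parity_disk(i: int, n: int) -> int:
    """RAID5: the parity for the i-th block is on disk floor(i/(n-1)) mod n."""
    return (i // (n - 1)) % n

# With 5 disks, the parity disk rotates after every 4 blocks:
print([raid5_parity_disk(i, 5) for i in range(12)])   # [0,0,0,0, 1,1,1,1, 2,2,2,2]
```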
RAID6
◼ P+Q redundancy scheme
  - Similar to RAID5, but stores extra information to recover from failures of more disks
  - Two parity disks (dual distributed parity)
    - Min. 4 disks in the array (capacity shrunk by 2 disks)
  - Error-correcting codes (Hamming codes)
◼ Repairs a failure of two disks

RAID Combinations
◼ Different variants combined into one system
  - An array assembled from physical disks
  - A resulting array built over these arrays
◼ Used to increase performance and reliability
◼ Example
  - RAID5+0 over 6 physical disks
    - Each 3 disks create a RAID5 array
    - The RAID5 arrays form one RAID0
[Figure: two RAID5 arrays of three 1 TB disks each, combined into one RAID0. Source: Wikipedia]

RAID1+0 vs. RAID0+1
◼ RAID1+0 is more resistant to failures
  - Failure of a disk in any RAID1: OK
◼ RAID0+1
  - Failure of a disk in the 1st RAID0 and a failure of a disk in the 2nd RAID0 → data loss

RAID Summary
◼ RAID0 – when data availability is not important
  - Data can be easily and quickly restored
    - from a backup, …
◼ RAID2, 3 and 4 superseded by RAID5
  - Bit-/byte-striping leads to the utilization of all disks for any read/write access; non-distributed parity
◼ RAID6 – less used than RAID5
  - RAID1 and 5 provide sufficient reliability
  - RAID6 for very high-capacity disks
◼ Combinations used: RAID1+0, RAID5+0
◼ Choosing between RAID1 and RAID5 (see below)

RAID Summary (cont.)
◼ RAID1+0
  - Much faster writing than RAID5
  - For applications with a large number of writes
  - More expensive than RAID5 (lower capacity)
◼ RAID5
  - Each write typically requires 2 reads and 2 writes
    - RAID1+0 needs just 2 writes
  - Suitable for apps with a smaller number of writes
  - Check the "chunk" size
◼ I/O requirements of current apps
  - Very high (e.g., web servers, …)
  - Need to buy many disks to fulfill the requirements
    - If their capacity is sufficient, use RAID1 (no further costs)
    - Preferably RAID1+0

RAID Summary (cont.)
◼ RAID does not substitute backups!!!
◼ Implementation
  - SW – supported in almost any OS
  - HW – a special disk controller
    - Necessary to use a battery-backed cache or non-volatile RAM
    - Double-check the controller's CPU performance – it can be slower than a SW implementation!!!
◼ Hot-swapping
  - Usually supported in HW implementations
  - No problem in a SW implementation, if the HW supports it
◼ Spare disks
  - Presence of extra disks in the array

Disk Failures
◼ Intermittent failure
  - Error during read/write → repeat → OK
◼ Medium defect
  - Permanent fault of a sector
  - Modern disks detect and correct it
    - The sector is reallocated from spare capacity
◼ Permanent failure
  - Total damage → replace the disk

Coping with Disk Failures
◼ Detection
  - Checksums
◼ Correction by redundancy
  - Stable storage
    - Disk array
    - Storing at multiple places of the same disk → super-block; ZFS does so for data
  - Journal (log / journal of modifications)
  - Error-correcting codes (ECC)
    - Hamming codes, …

Stable Storage and Databases
[Figure: the operating system keeps two copies (Copy1, Copy2) of each logical block; the database system writes the current DB and keeps a backup DB plus the DB log]

Error Correcting Codes
◼ Parity bit = even / odd parity
  - Used in RAID3, 4, 5
◼ Example of even parity: RAID4 over 4 disks, block no. 1:
  Disk 1: 11110000…
  Disk 2: 10101010…
  Disk 3: 00111000…
  Disk P: 01100010…
◼ After a failure, the lost content is recomputed from the other disks:
  Disk 1: 11110000…
  Disk 2: ????????… ← failure
  Disk 3: 00111000…
  Disk P: 01100010…

Error Correcting Codes
◼ Algebra with the operator ⊕ = sum modulo 2 (over bit vectors)
  - Even parity, i.e., adding a 1 to make an even number of 1s
  - x ⊕ y = y ⊕ x
  - (x ⊕ y) ⊕ z = x ⊕ (y ⊕ z)
  - x ⊕ 0 = x
  - x ⊕ x = 0
◼ If x ⊕ y = z, then y = x ⊕ z
  - Add x to both sides…
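A short sketch (Python) of the parity-based recovery above; single bytes stand in for whole blocks, and the failed disk is recomputed by XOR-ing all surviving disks, parity included:

```python
from functools import reduce

def parity(blocks):
    """Parity block = bytewise XOR of all blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

d1, d2, d3 = bytes([0b11110000]), bytes([0b10101010]), bytes([0b00111000])
p = parity([d1, d2, d3])            # 01100010, as on the slide

# Disk 2 fails: XOR of the surviving disks and the parity restores it
lost = parity([d1, d3, p])
assert lost == d2
print(f"{p[0]:08b} {lost[0]:08b}")  # 01100010 10101010
```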
Error Correcting Codes
◼ Hamming code
  - Example of recovery from 2 crashes
    - 7 disks → four data disks (1–4) and three parity disks (5–7)
  - Redundancy schema:
    - A parity disk stores the even parity of the data disks marked with 1 in its row

    Disk no.:       1  2  3  4 | 5  6  7
    Parity disk 5:  1  1  1  0 | 1  0  0
    Parity disk 6:  1  1  0  1 | 0  1  0
    Parity disk 7:  1  0  1  1 | 0  0  1

Error Correcting Codes (cont.)
◼ Hamming code – content sample and writing
  Disk 1: 11110000…
  Disk 2: 10101010…
  Disk 3: 00111000…
  Disk 4: 01000001…
  Disk 5: 01100010…
  Disk 6: 00011011…
  Disk 7: 10001001…
◼ Writing 00001111… to disk 2 updates the parity disks whose rows contain disk 2 (disks 5 and 6):
  Disk 2: 00001111…
  Disk 5: 11000111…
  Disk 6: 10111110…
  (Disks 1, 3, 4 and 7 are unchanged.)

Error Correcting Codes (cont.)
◼ Hamming code – disk failure (disks 2 and 5 fail)
  Disk 1: 11110000…
  Disk 2: ????????…
  Disk 3: 00111000…
  Disk 4: 01000001…
  Disk 5: ????????…
  Disk 6: 10111110…
  Disk 7: 10001001…
◼ Recovery of disk 2
  - Use the row of the redundancy schema that has a 0 for disk 5 (the row of parity disk 6): disk 2 = disk 1 ⊕ disk 4 ⊕ disk 6 = 00001111…
◼ Recovery of disk 5
  - Now recompute it from its own row: disk 5 = disk 1 ⊕ disk 2 ⊕ disk 3 = 11000111…

Error Correcting Codes (cont.)
◼ Definition of Hamming code
  - A code of length n is a set of n-bit vectors (code words).
  - Hamming distance is the number of positions in which two n-bit vectors differ.
  - Minimum distance of a code is the smallest Hamming distance between any two different code words.
  - A Hamming code is a code with minimum distance 3:
    - up to two bit flips can be detected (but not corrected);
    - a single bit flip is detected and corrected.

Error Correcting Codes (cont.)
◼ Generating a Hamming code (n, d); p = n − d parity bits
  - Number the bit positions from 1 and write the numbers in columns in binary
  - Every position whose binary number has a single bit set (a power of two) is a parity bit
  - A row shows the sources for computing that parity bit
  - A column shows which parity bits cover a data bit
◼ Layout for n up to 17:
  - Positions: 1=p1, 2=p2, 3=d1, 4=p3, 5=d2, 6=d3, 7=d4, 8=p4, 9=d5, 10=d6, 11=d7, 12=d8, 13=d9, 14=d10, 15=d11, 16=p5, 17=d12
  - p1 (2^0) covers positions 1, 3, 5, 7, 9, 11, 13, 15, 17
  - p2 (2^1) covers positions 2, 3, 6, 7, 10, 11, 14, 15
  - p3 (2^2) covers positions 4, 5, 6, 7, 12, 13, 14, 15
  - p4 (2^3) covers positions 8, 9, 10, 11, 12, 13, 14, 15
  - p5 (2^4) covers positions 16, 17

Error Correcting Codes (cont.)
◼ Store data bits 1010 in Hamming code (7,4)
  - p1 = d1⊕d2⊕d4 = 1, p2 = d1⊕d3⊕d4 = 0, p3 = d2⊕d3⊕d4 = 1 → code word 1011010
◼ To correct errors in data read from storage:
  - Check all parity bits.
  - Sum the positions of the bad ones to get the address of the wrong bit.
◼ Examples (the stored word was 1011010):
  - 1111010 → one bit was flipped (position 2), so it can be corrected.
  - 1011110 → one bit was flipped (position 5), so it can be corrected.
  - 1001110 → two bits were flipped, so the code cannot distinguish a 2-bit from a 1-bit error.
  - 1110000 → three bits were flipped here (the result is again a valid code word, so the error even passes undetected).
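A compact sketch (Python) of Hamming(7,4) encoding and syndrome decoding as described above; positions are 1-based and the parity bits sit at the powers of two:

```python
def hamming74_encode(d1, d2, d3, d4):
    """Return the code word as bits [p1, p2, d1, p3, d2, d3, d4]."""
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(word):
    """Re-check every parity bit; the sum of the failing parity positions
    is the position of the flipped bit (assuming a single-bit error)."""
    w = list(word)
    syndrome = 0
    for p in (1, 2, 4):                              # parity positions
        covered = [i for i in range(1, 8) if i & p]
        if sum(w[i - 1] for i in covered) % 2 == 1:  # parity violated
            syndrome += p
    if syndrome:
        w[syndrome - 1] ^= 1                         # flip the bad bit back
    return w

print(hamming74_encode(1, 0, 1, 0))              # [1, 0, 1, 1, 0, 1, 0]
print(hamming74_correct([1, 1, 1, 1, 0, 1, 0]))  # position 2 corrected
```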
Error Correcting Codes (cont.)
◼ Extended Hamming code
  - Add an extra parity bit pX (position 0) computed over all bits
    - It tells whether an even or odd number of errors occurred
  - For (7,4): pX covers positions 0–7; p1, p2, p3 cover the same positions as before
◼ Store data bits 1010 in the extended Hamming code (7,4) → 01011010
◼ Detect/correct an error, if any:
  - 01111010 → one bit was flipped (position 2) → corrected
  - 01011110 → one bit was flipped (position 5) → corrected
  - 01001110 → 2 bits were flipped; but no clue which.
  - 01110000 → an odd number of bits (>1) were flipped; but no clue which.

Error Correcting Codes (cont.)
◼ Reed-Solomon code (n, d)
  - An ECC adding t = n − d check symbols to d data symbols
  - Can detect up to t errors
  - Can correct up to ⌊t/2⌋ errors
  - So the minimum Hamming distance is n − d + 1

Failures
◼ Mean Time To Failure (MTTF)
  - Also: Mean Time Between Failures (MTBF)
  - Corresponds to the failure likelihood
  - Average operating time between failures
    - Half of the disks fail during this period
    - Assumes a uniform distribution of failures
  - Decreases with disk age
  - Usually 1,000,000 hours or more
    - ≈ 114 years
    - i.e., a disk fails with 100% probability within 228 years
    - → P(failure in a year) = 0.44%
    - ≈ Annualized Failure Rate (AFR)

Failures (cont.)
◼ Example
  - MTTF = 1,000,000 hours → consider a population of 2,000,000 disks
  - One disk fails per hour, i.e., 8,760 disks a year
  - → probability of failure in a year = 8,760 / 2,000,000 ≈ 0.44%

Failures (cont.)
◼ Alternative measures
  - Annualized Failure Rate (AFR)
  - Component Design Life
  - Annual Replacement Rate (ARR), or Annualized Return Rate
    - Not all failures are caused by disk faults
      - defective cable, etc.
    - It is stated that 40% of ARR is "No Trouble Found" (NTF)
    - AFR = ARR × 0.6, i.e., ARR = AFR / 0.6

Failures and Manufacturers
◼ Seagate
  - http://www.seagate.com/docs/pdf/whitepaper/drive_reliability.pdf (November 2000)
  - Savvio® 15K.2 hard disks – 73 GB
    - AFR = 0.55%
  - Seagate estimates the MTTF of a disk as the number of power-on hours (POH) per year divided by the first-year AFR
  - The AFR is derived from Reliability-Demonstration Tests (RDT)
    - RDT at Seagate = hundreds of disks operating at 42 °C ambient temperature

Failures and Manufacturers
◼ Influence of temperature on the MTTF during the 1st year (Seagate)
[Figure: adjusted MTTF vs. operating temperature]

Failures and Manufacturers
◼ Seagate Barracuda ES.2 – a near-line Serial ATA disk
[Figure: reliability specification table]
  - Note 1: Weibull – a statistical method for modeling the progress of failures
  - Note 2: 2,400 hours/yr means 6.5 hours of operation a day!

Failures – Practice
◼ Google
  - http://research.google.com/archive/disk_failures.pdf (FAST conference, 2007)
  - Test on 100,000 disks
    - SATA, PATA disks; 5,400–7,200 rpm; 80–400 GB
[Figure: observed AFR by disk age]

Failures – Practice
◼ Study on 100,000 SCSI, FC, and SATA disks
  - http://www.cs.cmu.edu/~bianca/fast07.pdf (FAST conference, 2007)
[Figure: annual replacement rate (%) per system – e.g., HPC3 with 3,064 SCSI disks (15k rpm, 146 GB), 11,000 SATA disks (7,200 rpm, 250 GB), and 144 SCSI disks (15k rpm, 73 GB); the average ARR exceeds the datasheet AFRs of 0.58% and 0.88%]

Failures – Practice
◼ Conclusions:
  - ARR increases with rising temperature
    - Not confirmed in the data by Google
  - SMART parameters are well correlated with higher failure probabilities (Google)
    - After the first scan error, a disk is 39 times more likely to fail within 60 days.
    - First errors in reallocations, offline reallocations, and probational counts are strongly correlated with higher failure probabilities.
  - It is appropriate to use an AFR of 3–4% in evaluations
    - If you plan for an AFR that is 50% higher than the MTTF suggests, you'll be better prepared.
  - Be ready to replace disks after 3 years of operation

Failure Recovery
◼ We know: AFR = 1 / (2 × MTTF)
◼ Mean Time To Repair (MTTR)
  - Time from the failure to the recovery of operation
  - = time to replace the failing unit + data recovery
  - P(failure during repair) = PFDR = (2 × MTTR) / 1 year
    - Assuming the repair time is very short
◼ Mean Time To Data Loss (MTTDL)
  - Depends on MTTF and MTTR
  - Mean time between two data-loss events
  - For one disk (i.e., data stored on one disk)
    - MTTDL = MTTF

Failure Recovery – Set of Disks
◼ Assumption
  - The failure of each disk is equally probable and independent of the others
◼ Example for RAID0
  - One disk
    - AFR(1 disk) = 0.44% (MTTF = 1,000,000 hrs ≈ 114 yrs)
  - A system of 100 disks (MTTF(100 disks) = MTTF(1 disk) / 100)
    - AFR(100 disks) = 44% (MTTF = 10,000 hrs ≈ 1.14 yrs)
    - 1 disk fails each year on average
  - Probability (exactly 1 of n fails vs. at least 1 of n fails)
    - P(exactly 1 of 100 fails) = 28.43%; P(at least 1 of 100 fails) = 35.66%
    - P(exactly 1 of 10 fails) = 4.23%; P(at least 1 of 10 fails) = 4.31%
◼ AFR(n disks) = AFR(1 disk) × n
◼ MTTDL = 0.5 / AFR
  - e.g., MTTDL(100 disks) = 0.5 / 0.44 ≈ 1.136 yrs

RAID1: Example of Reliability
◼ 2 mirrored 500 GB disks
  - AFR of each = 3%
◼ Replacement of the failed disk and array recovery within 3 hrs
  - MTTR = 3 hrs (at 100 MB/s the copying takes 1.5 hrs)
◼ Probability of data loss:
  - P(1-disk failure) = AFR = 0.03
  - P(1 out of 2 fails) = 0.06
  - PFDR = 2 × MTTR / 1 year = 2×3 / 8,760 = 0.000 685
  - P(data loss) = P(1 out of 2 fails) × PFDR × P(1-disk failure) = 0.000 001 233
  - MTTDL = 0.5 / P(data loss) = 405,515 yrs

RAID0: Example of Reliability
◼ 1 disk: AFR = 3% (P(1-disk failure))
◼ RAID0 – two disks, striping
  - P(data loss) = P(1 out of 2 fails) = 6%
  - MTTDL = 0.5 / 0.06 ≈ 8.3 yrs
    - i.e., AFR(array) = 6%

RAID4: Example of Reliability
◼ 1 disk: AFR = 3% (P(1-disk failure))
◼ RAID4 – repairs a failure of 1 disk
  - 4 disks (3+1)
  - MTTR = 3 hrs
    - PFDR = 2×3 / 8,760 = 0.000 685
  - P(data loss) = P(1 out of 4 fails) × PFDR × P(1 out of 3 fails)
  - P(data loss) = 4×0.03 × 2/2,920 × 3×0.03 = 0.000 007 397
    - which is the AFR of this array
  - MTTDL = 0.5 / P(data loss) = 67,592 yrs

RAID6: Example of Reliability
◼ RAID6 – repairs a failure of 2 disks
  - 4 disks (2+2)
  - P(data loss) = P(1 out of 4 fails) × PFDR × P(RAID4 over the remaining 3 disks loses data)

Array Reliability
◼ n disks in the array in total (incl. parity disks)
◼ 1 parity disk ensures data redundancy
  - AFR(array, 1 parity) = n × AFR(1 disk) × PFDR × (n−1) × AFR(1 disk)
  - MTTDL = 0.5 / AFR(array)
◼ 2 parity disks
  - AFR(array, 2 parities) = n × AFR(1 disk) × PFDR × AFR(array, 1 parity)
    - where AFR(array, 1 parity) is evaluated for the remaining n−1 disks!
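A numeric sketch (Python, with illustrative function names) of the formulas above; it reproduces the RAID1 and RAID4 example figures and composes nested arrays the way the combination slides below do:

```python
HOURS_PER_YEAR = 8760

def pfdr(mttr_hours):
    """Probability that another failure hits during the repair window."""
    return 2 * mttr_hours / HOURS_PER_YEAR

def afr_array_1p(n, afr_disk, mttr_hours):
    """Single-parity array of n disks: any disk fails, then one of the
    remaining n-1 disks fails before the repair finishes."""
    return n * afr_disk * pfdr(mttr_hours) * (n - 1) * afr_disk

def mttdl(afr):
    """Mean time to data loss in years."""
    return 0.5 / afr

afr_disk, mttr = 0.03, 3
# RAID1 with 2 disks behaves as a single-parity array with n = 2:
print(mttdl(afr_array_1p(2, afr_disk, mttr)))  # ~405,556 yrs (slide: 405,515, rounded PFDR)
# RAID4 over 4 disks (3+1):
afr_raid4 = afr_array_1p(4, afr_disk, mttr)
print(mttdl(afr_raid4))                        # ~67,592 yrs
# RAID4+0 over 8 disks = RAID0 of two RAID4 arrays, so the AFRs add up:
print(mttdl(2 * afr_raid4))                    # ~33,796 yrs
```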
Example of Reliability: RAID Combinations
◼ Combination of arrays
  - Evaluate the MTTDL of the individual components
    - Use it as the MTTF of a virtual disk
  - Evaluate the final MTTDL
[Figure: two RAID5 arrays of three 1 TB disks each form virtual disks combined into one RAID0]

Example of Reliability: RAID Combinations
◼ RAID5+0
  - 1 disk: AFR(disk)
  1) Evaluate AFR(RAID5)
  2) Evaluate AFR(RAID0) = 2 × AFR(RAID5)
  3) MTTDL(RAID5+0) = 0.5 / AFR(RAID0)

Example of Reliability: RAID Combinations
◼ RAID4+0 over 8 disks
  - 1 disk: AFR = 3%, MTTR = 3 hrs
  - Assemble one RAID4 over every 4 disks
    - AFR(RAID4) = 4×AFR × PFDR × 3×AFR = … = 7.4×10⁻⁶
  - Assemble the two RAID4s into a RAID0
    - AFR(RAID4+0) = 2 × AFR(RAID4) = 1.48×10⁻⁵
  - MTTDL = 0.5 / AFR(RAID4+0) = 33,796 yrs

Failures: "Write Hole" Phenomenon
◼ = Data is not written to all disks of a stripe.
◼ Severity
  - Can go unnoticed
  - Discoverable during array reconstruction
◼ Solutions
  - UPS
  - Journaling
    - but with a "data written" commit message (-:
  - Synchronizing the array
  - A special file system (ZFS)
    - uses "copy-on-write" to provide write atomicity

File Systems
◼ Storing a data block:
  1. Add an unused block to the list of used space
  2. Write the data block
  3. Write the file metadata referencing that data block
◼ A modern FS uses journaling
  - Start a transaction in the journal
  - Store info about steps 1.–3. in the journal
  - Do steps 1.–3.
  - End the transaction in the journal

File System Tuning
◼ Match the FS block size to the DB block size
  - ZFS has 128 KB by default!
◼ DB journal (WAL in PostgreSQL)
  - ext2; ext3/4 with data=writeback (data journaling off)
◼ DB data
  - ext3/4 with data=ordered (only metadata journaled)
◼ Switch off file access times (noatime)
◼ Eliminate swapping (vm.swappiness = 0)
◼ Process memory allocation (vm.overcommit_memory = 2)
◼ …

RAID over SSD
◼ SSD – the issue of wear
  - The limited number of writes is handled by moving writes to other areas, i.e., wear-leveling
  - Consequence: total failure after some time
◼ RAID over SSDs
  - Worse data availability/reliability
    - It is almost certain that the SSDs will fail at once
  - Diff-RAID
    - Distributes parity unevenly
    - After replacing a failed SSD with a brand-new one, parity is moved primarily to the most worn-out drive.

Recommended Reading
◼ Dual parity
  - https://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
◼ Software RAID in SUSE
  - https://www.suse.com/documentation/sles10/stor_admin/data/raidevms.html
  - Sections:
    - Managing Software RAIDs with EVMS
    - Managing Software RAIDs 6 and 10 with mdadm
◼ SSD on Wikipedia
  - https://en.wikipedia.org/wiki/Solid-state_drive
◼ Živě.cz: How much written data does a modern SSD really endure? (in Czech)
  - http://m.zive.cz/ze-sveta-kolik-realne-zapsanych-dat-vydrzi-moderni-ssd/a-177557/?textart=1
◼ Chunk size and performance
  - https://raid.wiki.kernel.org/index.php/Performance

Takeaways
◼ Understanding of IOPS
◼ Failures – meaning and terms
  - MTTR, MTTF, MTTDL, AFR
◼ Computation of array reliability, incl. nested (combined) arrays
◼ The write-hole phenomenon
◼ Implications of SSDs in arrays