Estimating Drive Reliability in Desktop Computers and Consumer Electronics Systems

Introduction

Historically, desktop computers have been the primary application for hard disc storage devices. However, the market for disc drives in consumer electronic devices is growing rapidly. This paper presents a method for estimating drive reliability in desktop computers and consumer electronics devices using the results of Seagate’s standard laboratory tests. It provides a link between Seagate’s published reliability specifications and real-world drive reliability as experienced by the end-user.

Definitions

Seagate estimates the mean time between failures (MTBF) for a drive as the number of power-on hours (POH) per year divided by the first-year annualized failure rate (AFR). This is a suitable approximation for small failure rates, and we intend it to represent a first-year MTBF. The annualized failure rate for a drive is derived from time-to-fail data collected during a reliability-demonstration test (RDT). Factory reliability-demonstration tests (FRDT) are similar but are performed on drives pulled from the volume production line. For the purposes of this paper, we assume that any concept that applies to an RDT also applies to an FRDT.

Seagate Reliability Tests

At Seagate’s Personal Storage Group in Longmont, Colorado, desktop disc drive reliability tests are normally conducted in ovens at 42ºC ambient temperature to provide accelerated failure rates. In addition, the drives are operated at the highest possible duty cycle (a drive’s duty cycle is defined by the number of seeks, reads and writes it performs over a specific time period). We do this to discover as many failure modes as possible during the product development cycle. By fixing any problems we see at this stage, we can make sure that our customers won’t see the same problems.

Estimating Weibull Parameters

Let’s assume we have an RDT with 500 drives, all run for 672 hours at 42ºC ambient temperature.
During this test, further assume that we observe three failures (at 12, 133 and 232 hours). This means that, of the 500 drives tested, 497 ran the entire test without failing. To analyze and extrapolate from the test results, we perform Weibull modeling using SuperSmith software from Fulton Findings. [1] Specifically, we use the Maximum Likelihood method to estimate the Weibull-distribution parameters Beta (a shape parameter) and Eta (a scale parameter).

In tests with five or fewer failures, the Beta parameter cannot be well defined by the test data. Because such cases are common in drive testing, we analyze the data using a WeiBayes [2] approach. This approach requires that we estimate the Beta parameter using historical data. In the desktop products lab, we currently assume that Beta = 0.55. This value is based on the manufacturing data shown in the following table, which includes all desktop products tested prior to March 1999.

1. SuperSmith, Fulton Findings, WinSMITH and WinSMITH Weibull are trademarks of Fulton Findings, 1251 W. Sepulveda Blvd., #800, Torrance, CA 90502, USA.
2. Abernethy, Dr. Robert B., The New Weibull Handbook, Second Edition, published by the author, 1996, Chapter 5.

From: Gerry Cole, Seagate Personal Storage Group, Longmont, Colorado
Date: November 2000
Number: TP-338.1
Technology Paper from Seagate
Corporate Headquarters: Scotts Valley, California, USA, +1-831-438-6550
Asia/Pacific Headquarters: Singapore, +65-488-7200
Europe, Middle East and Africa Headquarters: Boulogne-Billancourt, France, +33 1-41 86 10 00

The graph below shows the results of both the Weibull and the WeiBayes analysis. The solid line in the figure shows the Weibull Beta and Eta parameters (Beta = 0.443, Eta = 69,331,860 hours) estimated using the Maximum Likelihood [3] (MLE) approach on only three failures out of 500 drives. As mentioned before, these results are considered less accurate than those of the WeiBayes method for small failure rates.
The results of the WeiBayes method (with Beta = 0.55) are shown as a dashed line in the figure below. Because 672 test hours at 42ºC should be a sufficiently long run time for an RDT, we use our internal test-exit confidence level [4] of 63.2 percent for the WeiBayes analysis. The WeiBayes calculations indicate that, at 42ºC, given a historical Beta = 0.55, a reasonable value for Eta is 3,787,073 hours.

The next step in the analysis is to convert the value for Eta that was based on tests at 42ºC to a value that reflects our specified operational temperature (25ºC). Using the Arrhenius Model, [5] an acceleration factor of 2.2208 can be used to account for this difference in temperature. Therefore, the value for Eta at 25ºC (Eta25) is assumed to be equal to the value for Eta at 42ºC (Eta42) times 2.2208, or 8,410,332 hours.

3. Abernethy, Dr. Robert B., The New Weibull Handbook, Second Edition, published by the author, 1996, Appendix D.
4. Earlier in the RDT, a larger confidence level would be used to reflect the uncertainty in Weibull parameter estimation due to the limited run time.
5. Nelson, Wayne, Applied Life Data Analysis, John Wiley & Sons, 1982.

Desktop drive site     Database          Mean Beta   Standard Deviation of Beta
Longmont               37 RDT, 5 FRDT    0.546       0.176
Perai                  2 RDT, 4 FRDT     0.617       0.068
Wuzi                   1 RDT             0.388       n/a
Pooled desktop data    49 tests          0.552       0.167

[Figure: Example of Weibull and WeiBayes Analysis. X axis: test time at 42ºC (hours), logarithmic; Y axis: cumulative percent failure. Series: observed failures; Weibull fit via MLE (Eta = 69,331,860, Beta = 0.443, n/s = 500/497); WeiBayes fit at 63.2 percent confidence (Eta = 3,787,073, Beta = 0.55).]

Applying the Estimated Weibull Parameters to Estimate First-Year MTBF

Using the temperature-adjusted estimated values of the Weibull Beta and Eta parameters, we can calculate the cumulative-percent-failure rate at any time.
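For reference, the cumulative failure fraction of a two-parameter Weibull distribution, and the fraction failing within an interval, can be written as below. The symbols F, T, t, β (Beta) and η (Eta) are notation assumed here for illustration; the paper itself names only Beta and Eta.

```latex
F(t) = 1 - \exp\!\left[-\left(\frac{t}{\eta}\right)^{\beta}\right],
\qquad
P(t_1 < T \le t_2) = F(t_2) - F(t_1)
```

With β = 0.55 and η = Eta25 = 8,410,332 hours, F(2,424 hours) evaluates to about 1.123 percent and F(24 hours) to about 0.089 percent.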
By subtracting the cumulative-percent-failure rates for two different times (t1 and t2), and using appropriate values for Beta and Eta25, we can estimate the percent of drives that are likely to fail at 25ºC during any time interval from t1 to t2.

To estimate the AFR for the first year of drive operation in a desktop computer setting, we assume that the drive is used at a rate of 2,400 power-on hours (POH) per customer year. In addition, we assume that drives are subjected to a 24-POH integration period by the device manufacturer. Because any drives that fail during this period are returned to Seagate and are not shipped to the end-user, they are not counted in the first-year AFR and MTBF.

Based on these assumptions (100% duty cycle, Eta25 = 8,410,332 hours, Beta = 0.55, and 2,400 POH per year), the percent failure rate in the first customer year after integration can be calculated as the percent failure rate between 24 hours (t1) and 2,424 hours (t2). The results of this calculation are shown in the table below, which derives a first-year MTBF from the RDT data.

Input area:
  POH per year:                    2,400
  Weibull shape factor (Beta):     0.55
  Weibull scale factor (Eta25):    8,410,332 hours

  P(fail), 0 to 2,424 hours:       1.123%
  P(fail), 0 to 24 hours:        - 0.089%
  First-year AFR:                = 1.0338% (before rounding)

  POH per year:                    2,400
  First-year AFR:                ÷ 0.010338
  First-year Weibull MTBF:       = 232,140 hours

Accounting for Actual User Conditions

The calculations above suggest that if a customer were to use our drive at 25ºC and 2,400 POH per year, the expected customer MTBF in the first year would be 232,140 hours. However, these conditions may not always apply to the consumer electronics environment. For example, in some consumer devices, the drive may be powered on almost 100 percent of the time and yearly usage rates may be much higher than 2,400 POH. In other devices, such as video game players, the POH per year may be relatively low.
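The first-year calculation tabulated above can be reproduced with a few lines of Python. This is a minimal sketch under the paper's stated assumptions (Beta = 0.55, Eta25 = 8,410,332 hours, 2,400 POH per year, 24-hour integration period); the function and variable names are illustrative and do not come from Seagate's tools.

```python
import math

def weibull_cdf(t_hours, beta, eta_hours):
    """Cumulative fraction failed by t_hours for a two-parameter Weibull."""
    return 1.0 - math.exp(-((t_hours / eta_hours) ** beta))

def first_year_afr(beta, eta_hours, poh_per_year=2400.0, integration_hours=24.0):
    """Fraction failing between the end of integration and one customer year later."""
    t1 = integration_hours
    t2 = integration_hours + poh_per_year
    return weibull_cdf(t2, beta, eta_hours) - weibull_cdf(t1, beta, eta_hours)

BETA = 0.55            # historical shape parameter quoted in the paper
ETA_25 = 8_410_332.0   # temperature-adjusted scale parameter, in hours
POH_PER_YEAR = 2400.0  # assumed desktop usage per customer year

afr = first_year_afr(BETA, ETA_25, POH_PER_YEAR)
mtbf = POH_PER_YEAR / afr  # MTBF as defined in the paper: POH per year / AFR

print(f"First-year AFR:  {afr:.4%}")
print(f"First-year MTBF: {mtbf:,.0f} hours")
```

The results should land close to the tabulated 1.0338 percent and 232,140 hours; small differences come from rounding in the published table.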
The following sections describe how we can adjust the calculated MTBF so that it applies to various usage levels, duty cycles and ambient temperatures.

Usage Levels

To account for variation in MTBF due to different levels of usage, we may use the MTBF adjustment curve shown in the figure below. For example, to adjust an MTBF from 2,400 POH per year to a maximum usage rate of 8,760 POH per year, the MTBF would be multiplied by about 1.8. Conversely, for low-usage environments, as in some video games, the MTBF may be decreased by as much as a factor of two.

[Figure: Adjusted MTBF as Function of Expected POH per Year. X axis: expected POH per year (492 to 8,760); Y axis: MTBF specification multiplier (0.00 to 2.00).]

Temperature

Next let’s look at the effects of elevated operating temperature. The same Arrhenius Model that we used to develop an acceleration factor may also be used to generate an MTBF temperature derating-factor (DF) curve. The following table shows the decrease in first-year MTBF (at 100% duty cycle) as ambient temperature increases above 25ºC.

Temp (ºC)   Acceleration Factor   Derating Factor   Adjusted MTBF (hours)
25          1.0000                1.00              232,140
26          1.0507                0.95              220,533
30          1.2763                0.78              181,069
34          1.5425                0.65              150,891
38          1.8552                0.54              125,356
42          2.2208                0.45              104,463
46          2.6465                0.38              88,123
50          3.1401                0.32              74,284
54          3.7103                0.27              62,678
58          4.3664                0.23              53,392
62          5.1186                0.20              46,428
66          5.9779                0.17              39,464
70          6.9562                0.14              32,500

From the table above, it is clear that as the ambient temperature rises, the derating factor and the adjusted MTBF become significantly smaller. For example, at 42ºC, we find the 2.2208 acceleration factor referred to previously in this analysis. Its reciprocal, 0.45, is the DF value, which indicates that the MTBF at 42ºC is less than half as long as the MTBF at 25ºC.

Duty Cycle

Most disc drives in PCs are operated at duty cycles of 20 percent to 30 percent. However, consumer electronics devices may have lower or higher duty cycles.
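Before continuing with duty cycle, the usage-level and temperature adjustments described above can be sketched in Python. Two assumptions are made here that the paper does not state. First, the usage curve is approximated by a small-failure-rate Weibull scaling, MTBF proportional to POH raised to the power (1 - Beta), ignoring the integration period. Second, the Arrhenius constant is back-solved from the published 42ºC acceleration factor, because the paper gives no activation energy. All names are illustrative.

```python
import math

BETA = 0.55        # historical Weibull shape parameter from the paper
SPEC_POH = 2400.0  # POH per year behind the specified MTBF
AF_42C = 2.2208    # acceleration factor the paper quotes for 42ºC vs 25ºC

def usage_multiplier(poh_per_year, beta=BETA, spec_poh=SPEC_POH):
    """Approximate MTBF multiplier for a different yearly usage level.

    Assumption: failure fractions are small, so F(t) is close to
    (t / eta) ** beta and MTBF = POH / AFR scales as POH ** (1 - beta).
    """
    return (poh_per_year / spec_poh) ** (1.0 - beta)

def kelvin(temp_c):
    return temp_c + 273.15

# Assumption: back-solve Ea/k (in kelvin) from the published 42ºC factor.
# The result, about 4,410 K, corresponds to roughly 0.38 eV.
EA_OVER_K = math.log(AF_42C) / (1.0 / kelvin(25.0) - 1.0 / kelvin(42.0))

def acceleration_factor(temp_c, ref_c=25.0):
    """Arrhenius acceleration factor relative to the reference temperature."""
    return math.exp(EA_OVER_K * (1.0 / kelvin(ref_c) - 1.0 / kelvin(temp_c)))

def derating_factor(temp_c):
    """MTBF derating factor: reciprocal of the acceleration factor."""
    return 1.0 / acceleration_factor(temp_c)

for poh in (492, 2920, 8760):
    print(f"{poh:>5} POH/year -> usage multiplier {usage_multiplier(poh):.2f}")

for temp in (26, 38, 42, 50, 70):
    print(f"{temp}ºC -> AF {acceleration_factor(temp):.4f}, "
          f"DF {derating_factor(temp):.2f}")
```

Under these assumptions the sketch gives multipliers near 0.5, 1.09 and 1.8 for 492, 2,920 and 8,760 POH per year, and acceleration factors that agree with the temperature table to about three decimal places.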
Seagate has measured average daily data-transfer rates on existing consumer electronics devices and found duty cycles as low as 2.5 percent. To compare the effect of a 2.5 percent duty cycle with that of a 100 percent duty cycle (used in RDT testing), we can examine the effect of duty-cycle-dependent components in the drive relative to other components.

The number of duty-cycle-dependent components in a hard disc drive is proportional to the number of discs in the drive. The relationship between disc count and AFR is shown in the following figure. In this graph, the area below the dotted line indicates the “base” or non-duty-cycle-dependent failure rate for a hypothetical drive with no discs (or a drive that is not reading, writing or seeking). The solid line indicates estimated failure rates as a function of the number of discs present.

[Figure: Effect of Disc Count on Total and Base AFR. X axis: disc count (0 to 4; 4 is the maximum); Y axis: normalized AFR (0 to 1.6). Solid line: total AFR; dotted line: base AFR.]

From the previous graph it is clear that reducing a drive’s duty cycle reduces only the duty-cycle-dependent failures (those between the dotted and solid lines). Using the ratio between duty-cycle-dependent and total failures, we can estimate the effect of duty cycle on AFR. For example, consider a four-disc drive with a total AFR of 1.4 percent and a base AFR of 0.6 percent. The duty-cycle-dependent share of its failures is (1.4 - 0.6)/1.4 = 57 percent. In accounting for reduced duty cycle on a four-disc drive, therefore, we can only reduce 57 percent of the failures; the remainder are treated as independent of duty cycle. The resulting MTBF multipliers for drives with different numbers of discs are shown in the following figure.

Combining Multiple Factors

To continue the analysis, we combine a range of duty cycles and temperature derating factors (DF) for several different drives.
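The duty-cycle adjustment and its combination with the temperature derating factor can be sketched as follows. The paper gives only the duty-cycle-dependent share of failures; the assumption here, not stated in the paper, is that this share scales linearly with duty cycle while the base failure rate stays fixed. The function names are illustrative.

```python
def duty_cycle_fraction(total_afr, base_afr):
    """Share of failures that depends on duty cycle (57% in the paper's example)."""
    return (total_afr - base_afr) / total_afr

def duty_cycle_multiplier(duty_cycle, total_afr, base_afr):
    """MTBF multiplier relative to the 100% duty cycle used in RDT testing.

    Assumption: duty-cycle-dependent failures scale linearly with duty cycle,
    so AFR(d) = base + (total - base) * d and the multiplier is total / AFR(d).
    """
    fraction = duty_cycle_fraction(total_afr, base_afr)
    return 1.0 / (1.0 - fraction * (1.0 - duty_cycle))

# Four-disc example from the text: total AFR 1.4%, base AFR 0.6%.
TOTAL_AFR = 1.4
BASE_AFR = 0.6

share = duty_cycle_fraction(TOTAL_AFR, BASE_AFR)
mult_30 = duty_cycle_multiplier(0.30, TOTAL_AFR, BASE_AFR)

# Combine with the 38ºC derating factor (reciprocal of the tabulated 1.8552).
df_38c = 1.0 / 1.8552
combined = mult_30 * df_38c

print(f"Duty-cycle-dependent share:  {share:.0%}")
print(f"Multiplier at 30% duty:      {mult_30:.2f}")
print(f"Combined with 38ºC derating: {combined:.2f}")
```

Under this linear assumption the combined factor for a four-disc drive at 38ºC and a 30 percent duty cycle comes out near 0.90, which agrees with the factor used in the paper's worked example later on.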
The first of the two thermal-derating figures below shows MTBF multipliers at a variety of duty cycles and temperatures for a high-capacity, four-disc drive. The second shows the same multipliers as applied to a drive with only one disc. As shown in these figures, depending on the duty cycle and the ambient temperature of the drive in the customer’s PC, the first-year effective MTBF may be greater than, equal to, or less than the MTBF that we estimate based on in-house testing. For the one-disc drive, the effects of varying duty cycles are less significant and the MTBF multipliers tend to be significantly smaller.

[Figure: MTBF Multiplier vs Duty Cycle and Platter Count. X axis: duty cycle (100% down to 10%); Y axis: MTBF multiplier (1.00 to 2.20). Series: 1-disc (minimum capacity), 2-disc, 3-disc and 4-disc (maximum capacity) MTBF multipliers.]

[Figure: Thermal Derating for a Range of Duty Cycles (maximum-capacity, 4-disc drive). X axis: ambient temperature (26ºC to 70ºC); Y axis: MTBF multiplier (DF), 0.00 to 2.50. Series: DF at 100%, 30%, 20%, 10%, 5% and 1% duty cycle.]

[Figure: Thermal Derating for a Range of Duty Cycles (minimum-capacity, 1-disc drive). X axis: ambient temperature (26ºC to 70ºC); Y axis: MTBF multiplier (DF), 0.00 to 1.40. Series: DF at 100%, 30%, 20%, 10%, 5% and 1% duty cycle.]

Reliability after the First Year

The Weibull distribution of time-to-failure, with a Beta less than one, is a distribution of decreasing failure rate over time. Because of this, MTBF values for a drive’s first year in the field are likely to be lower than for subsequent years. What would the failure rate or MTBF look like if averaged over the entire useful lifetime of the drive?
Three possible methods for estimating reliability over a drive lifetime are listed below:

• We could use the Weibull [Beta, Eta25] analysis to estimate failures after the first year. However, this would require extending the RDT test results up to an order of magnitude beyond the duration of the test. This would not be a very conservative practice.
• We could use data from the Seagate warranty-return database, from which we may estimate the returns in the second and third years relative to the number of drives returned in the first year. This data is only applicable to the first three years, which is the limit of most current Seagate desktop-drive warranties, but it has the advantage of being based on only Seagate desktop products.
• We could assume a model that would “flatline,” or maintain a constant failure rate after the end of the first year. In other words, we could assume that after the first year, all yearly failure rates would be equal to the second-year failure rate. Since failure rates would, if anything, decline over time, this would be a conservative estimate of averaged MTBF for the life of the drive.

These models are compared in the table below.
                        Weibull               Warranty Data (OEM only)   Flatline Model
Year   Cumulative POH   Yearly   Cumulative   Yearly   Cumulative        Yearly   Cumulative
1      2,400            1.20%    1.20%        1.20%    1.20%             1.20%    1.20%
2      4,800            0.55%    1.75%        0.78%    1.98%             0.55%    1.75%
3      7,200            0.43%    2.18%        0.39%    2.37%             0.55%    2.30%
4      9,600            0.37%    2.55%                                   0.55%    2.86%
5      12,000           0.33%    2.88%                                   0.55%    3.41%
6      14,400           0.30%    3.18%                                   0.55%    3.96%
7      16,800           0.28%    3.46%                                   0.55%    4.51%
8      19,200           0.26%    3.72%                                   0.55%    5.06%
9      21,600           0.24%    3.96%                                   0.55%    5.62%
10     24,000           0.23%    4.19%                                   0.55%    6.17%

(Yearly and cumulative columns are failure rates. Warranty data is available only for the first three years.)

To further illustrate the differences between these models, let’s look at the cumulative percent failure rates for the three different models, each assuming a 200,000-hour first-year MTBF.

As the graph shows, the “flatline” model is less aggressive than the pure Weibull model, and comes close to the model based on Seagate warranty returns in the first three years. For simplicity, and to provide a conservative estimate, we have chosen to use the flatline model for our calculations. Using the flatline model, the results of lifetime-averaged MTBF compared to first-year MTBF may be summarized as follows:

                                    Years 1-3   Years 1-5   Years 1-10
Failures per year (average)         0.768%      0.682%      0.617%
MTBF (hours)                        312,500     352,113     389,105
Improvement over first-year MTBF    1.56        1.76        1.95

The improvement factors are ratios to the 200,000-hour first-year MTBF assumed for this comparison (for example, 312,500 / 200,000 = 1.56). The same factors are then applied to the noncorrected first-year MTBF of 232,140 hours derived earlier.

These calculations indicate that you can multiply the first-year MTBF (at the appropriate duty cycle and ambient temperature) by 1.56 to estimate the averaged MTBF over a three-year drive lifetime. Similarly, to estimate the average MTBF over a drive lifetime of five or ten years, multiply the first-year MTBF by 1.76 or 1.95, respectively.
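The flatline averaging described above can be sketched in Python. The first-year rate of 1.20 percent follows from the assumed 200,000-hour first-year MTBF at 2,400 POH per year. The table prints the later-year rate as 0.55 percent; the value 0.552 percent used below is an inference, chosen because it reproduces the tabulated MTBF figures, and is not stated in the paper. Names are illustrative.

```python
POH_PER_YEAR = 2400.0
FIRST_YEAR_RATE = 0.0120   # 2,400 POH / 200,000-hour first-year MTBF
LATER_YEAR_RATE = 0.00552  # assumption: unrounded form of the tabulated 0.55%

def flatline_average_rate(years, first=FIRST_YEAR_RATE, later=LATER_YEAR_RATE):
    """Average yearly failure rate when every year after the first is constant."""
    cumulative = first + later * (years - 1)
    return cumulative / years

def lifetime_mtbf(years, poh_per_year=POH_PER_YEAR):
    """Lifetime-averaged MTBF: POH per year divided by the average yearly rate."""
    return poh_per_year / flatline_average_rate(years)

first_year_mtbf = POH_PER_YEAR / FIRST_YEAR_RATE  # 200,000 hours

for years in (3, 5, 10):
    mtbf = lifetime_mtbf(years)
    print(f"Years 1-{years}: average rate {flatline_average_rate(years):.3%}, "
          f"MTBF {mtbf:,.0f} hours, factor {mtbf / first_year_mtbf:.2f}")
```

With these inputs the sketch returns lifetime MTBF values of about 312,500, 352,113 and 389,105 hours, and factors of about 1.56, 1.76 and 1.95 relative to the 200,000-hour first-year MTBF.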
[Figure: Cumulative Yearly Failure Rate by Customer Year, Weibull and Flatline Models Compared to Warranty Returns. X axis: customer year (1 to 10); Y axis: cumulative failure rate per customer year (0% to 7%). Series: Weibull analysis; “flatline” model; model based on OEM drive warranty data.]

Putting it All Together

By combining the multipliers and derating factors described above, we can convert the Seagate-specified MTBF (first year, at 25ºC ambient temperature, 2,400 POH per year and 100 percent duty cycle) into an MTBF that applies to a drive in a customer’s device at an appropriate ambient temperature and duty cycle. We can then estimate the average MTBF over the drive’s lifetime.

The following example demonstrates the calculation of first-year and drive-lifetime MTBF for a drive operated at 2,400 POH per year at an ambient operating temperature of 38ºC, a duty cycle of 30 percent and a five-year useful life.

First-year MTBF (based on Weibull parameters Beta, Eta25):     232,140 hours
Temperature derating for 38ºC and 30% duty cycle:              × 0.90
Customer first-year MTBF:                                      208,926 hours

Customer first-year MTBF:                                      208,926 hours
Factor for averaging over five-year lifetime:                  × 1.76
Customer drive-lifetime MTBF:                                  367,710 hours

As a final example, consider the case of a 1-disc Seagate drive with a specified first-year MTBF of 500,000 hours, which is being operated in a consumer electronics device at a usage rate of 2,920 POH per year (eight hours a day, seven days a week), an ambient temperature of 42ºC and a duty cycle of 5 percent.
First-year MTBF (based on Weibull parameters Beta, Eta25):     500,000 hours
Adjustment for 2,920 POH per year:                             × 1.09
Derating for temperature of 42ºC and 5% duty cycle:            × 0.59
Factor for averaging over 10-year drive lifetime:              × 1.95
Customer average MTBF:                                         627,023 hours

Conclusion

The method outlined above allows us to use Seagate laboratory test data to estimate the reliability of drives in desktop computers and consumer electronic devices in real-world settings. The method can be summarized as follows:

• Use Weibull or historical RDT/FRDT test data to estimate Weibull parameters for drive tests.
• Use WeiBayes analysis of test data for a specific type of drive to estimate first-year AFR and MTBF under RDT test conditions.
• Correct for any differences from the assumed usage rate of 2,400 POH per year.
• Correct these values to take into account differences between RDT conditions and the real-life temperature and duty cycles experienced by the drive after it reaches the customer.
• Extend the first-year customer reliability estimates over a three- to ten-year drive lifetime, using the conservative assumption that failure rates remain constant after the drive’s first year in the field.

In conclusion, this approach provides a mathematically reasonable way to use Seagate test results to estimate drive reliability in consumer electronics.
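As a compact illustration of the summary above, the two worked examples reduce to multiplying a specified first-year MTBF by a chain of factors. The sketch below uses the rounded factors quoted in the paper's examples; the helper name adjusted_mtbf is hypothetical.

```python
from math import prod

def adjusted_mtbf(spec_mtbf_hours, *multipliers):
    """Apply usage, temperature/duty-cycle and lifetime factors to a specified MTBF."""
    return spec_mtbf_hours * prod(multipliers)

# Example 1: 38ºC, 30% duty cycle, five-year life, 2,400 POH per year.
first_year = adjusted_mtbf(232_140, 0.90)
lifetime = adjusted_mtbf(232_140, 0.90, 1.76)

# Example 2: 1-disc drive, 2,920 POH per year, 42ºC, 5% duty cycle, 10-year life.
consumer = adjusted_mtbf(500_000, 1.09, 0.59, 1.95)

print(f"Example 1 first-year MTBF: {first_year:,.0f} hours")
print(f"Example 1 lifetime MTBF:   {lifetime:,.0f} hours")
print(f"Example 2 average MTBF:    {consumer:,.0f} hours")
```

Because the factors are rounded to two decimal places, results obtained this way carry the same rounding as the paper's examples.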