RAID? Was: PATA hard disks, anyone?

Fred Cisin cisin at xenosoft.com
Wed Mar 28 18:43:43 CDT 2018


On Wed, 28 Mar 2018, Richard Pope via cctalk wrote:
>    I have been kind of following this thread. I have a question about MTBF. 
> I have four HGST UltraStar Enterprise 2TB drives set up in a Hardware RAID 10 
> configuration. If the MTBF is 100,000 Hrs for each drive does this mean 
> that the total MTBF is 25,000 Hrs?

<pedantic sadistics>
Probably NOT.
It depends extremely heavily on the shape of the curve of failure times.
MEAN Time Between Failures, of course, is the average of the failure 
times.  Only if the curve is symmetrical does that also mean that, for a 
large enough sample, half the drives fail before 100,000 hours, and half 
after, so that at 100,000 hours, half are dead.

But, how evenly distributed are the failures?
Besides the MTBF, it would help to know the variance or standard 
deviation.
It is unlikely that the failures follow a "normal distribution" 
(or "Laplace-Gauss") bell curve.  And, other distributions are 
certainly not ABnormal :-)

If the curve is symmetrical, then the mean, median, and mode will all be 
the same.  If it is not symmetrical, then they won't be.  Hence the use of 
MEDIAN - at that point half are dead, half are still alive.
In toxicology, there is a concept of an LD-50 dosage - the dosage that 
will kill half.  Antibiotic resistant bacteria, for example, might 
require an incredibly large dosage to get that last one, but LD-50 
provides a convenient way to get a single number.
IF the curve is symmetrical, 100,000 hours is the LD-50 of those drives.
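
To see how far the mean and the half-dead point can drift apart on a 
lopsided curve, here is a rough sketch in Python.  The exponential shape 
(constant failure rate), the sample size, and the seed are assumptions 
picked only because the arithmetic is easy; nothing says real drives 
follow that curve.

```python
import math
import random
import statistics

MTBF_HOURS = 100_000  # per-drive figure from the question


def sample_lifetimes(n, seed=1):
    """Draw n lifetimes from an exponential distribution with mean MTBF_HOURS.

    The exponential shape is an assumption for illustration only.
    """
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / MTBF_HOURS) for _ in range(n)]


lifetimes = sample_lifetimes(200_000)
mean_life = statistics.fmean(lifetimes)      # the "MTBF" of the sample
median_life = statistics.median(lifetimes)   # the "LD-50" of the sample

# For this shape, theory says median = mean * ln(2).
theory_median = MTBF_HOURS * math.log(2)

print(f"mean   ~ {mean_life:,.0f} hours")
print(f"median ~ {median_life:,.0f} hours (theory: {theory_median:,.0f})")
```

For this particular shape, half the drives are gone at roughly 69,000 
hours even though the mean is 100,000.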


If it turns out that the drives last 100,000 hours, plus or minus 10%, 
then you have a curve with a very steep slope.  It is still half dead at 
100,000, but maybe hardly any dead until 90,000, hardly any left alive at 
110,000.  With four of those, the first failure is still not due until 
somewhere past 90,000.

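OTOH, if the failures were evenly distributed throughout a life of 0 to 
200,000 hours, with the same number going every day, then that also would 
have a MTBF of 100,000.   In THAT case, the first failure among four 
drives would be expected at about 40,000 (200,000/5).  To get all the way 
down to 25,000 (100,000/4), you need failures that strike at a constant 
rate regardless of age.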
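
The effect of the curve's shape on the FIRST failure among four drives 
can be sketched with a small simulation.  Everything here beyond the 
100,000 hour MTBF and the four drives is an assumption: the three shapes, 
the sigma used for the "steep" case, the trial count, and the seed.  It 
also only looks at the first drive to die, not at when the array itself 
is lost.

```python
import random
import statistics

MTBF = 100_000   # hours per drive, from the question
DRIVES = 4
TRIALS = 50_000  # simulated four-drive sets per shape (arbitrary)


def mean_first_failure(draw, seed=1):
    """Average, over many simulated sets, of the earliest of DRIVES lifetimes."""
    rng = random.Random(seed)
    return statistics.fmean(
        min(draw(rng) for _ in range(DRIVES)) for _ in range(TRIALS)
    )


shapes = {
    # Steep curve: nearly everything within +/-10% (sigma is an assumption).
    "steep (normal, sigma=MTBF/30)": lambda rng: rng.gauss(MTBF, MTBF / 30),
    # The same number dying every day between 0 and 200,000 hours.
    "even (uniform 0..200,000)": lambda rng: rng.uniform(0, 2 * MTBF),
    # Constant failure rate regardless of age.
    "constant rate (exponential)": lambda rng: rng.expovariate(1 / MTBF),
}

results = {name: mean_first_failure(draw) for name, draw in shapes.items()}
for name, hours in results.items():
    print(f"{name:32s} first failure at ~{hours:,.0f} hours")
```

All three shapes have the same per-drive MTBF.  On paper the even case 
works out to 200,000/5 = 40,000 hours and the constant-rate case to 
100,000/4 = 25,000, while the steep case stays well above 90,000.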


They rarely work that way.  Often our devices will have what is sometimes 
called a "bathtub curve".  There are a few failures IMMEDIATELY ("infant 
mortality") falling off rapidly, and then few failures for quite a while, 
and then, as random parts start to wear out, the failures rise. 
In fact, with the same MTBF of 100,000, it could be that once the early 
demise ones are discarded, the MTBF of the REMAINDER might be 
200,000.
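
The arithmetic behind that is just a weighted average, sketched below.  
The function name, the infant-mortality fractions, and the 1,000 hour 
average life of the early failures are all hypothetical numbers chosen 
for illustration, not measurements of any real drive.

```python
def remainder_mtbf(overall_mtbf, early_fraction, early_mean):
    """Mean life of the survivors once the early failures are set aside.

    Uses the weighted average
        overall = early_fraction * early_mean + (1 - early_fraction) * rest
    solved for rest.
    """
    if not 0 <= early_fraction < 1:
        raise ValueError("early_fraction must be in [0, 1)")
    return (overall_mtbf - early_fraction * early_mean) / (1 - early_fraction)


# Hypothetical fractions of drives that die young (averaging 1,000 hours).
for fraction in (0.02, 0.10, 0.50):
    rest = remainder_mtbf(100_000, fraction, 1_000)
    print(f"{fraction:4.0%} die early -> remainder MTBF ~ {rest:,.0f} hours")
```

Under these assumptions, a few percent of infant mortality only nudges 
the remainder's MTBF up a little; getting it near 200,000 takes about 
half the drives dying young.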

IFF you are willing to deal with the DOA and infant mortality cases, then 
by discarding or ignoring those outlying numbers, you might get a more 
realistic evaluation of what to expect.
</pedantic sadistics>

--
Grumpy Ol' Fred     		cisin at xenosoft.com

