Fred,
I appreciate the explanation. So without 1,000, 10,000, or even
100,000 drives there is no way to know how long the drives in my RAID
will last. All I know for sure is that I can lose any one drive and the
RAID can be rebuilt.
GOD Bless and Thanks,
rich!
On 3/28/2018 4:43 PM, Fred Cisin via cctalk wrote:
On Wed, 28 Mar 2018, Richard Pope via cctalk wrote:
I have been kind of following this thread. I have a question about
MTBF. I have four HGST UltraStar Enterprise 2TB drives set up in a
Hardware RAID 10 configuration. If the MTBF is 100,000 Hrs for
each drive does this mean that the total MTBF is 25,000 Hrs?
<pedantic sadistics>
Probably NOT.
It depends extremely heavily on the shape of the curve of failure times.
MEAN Time Between Failures is an average. Only if the failure curve is
symmetrical does it mean that, for a large enough sample, half the
drives fail before 100,000 hours, and half after. In that case, at
100,000 hours, half are dead.
But, how evenly distributed are the failures?
Besides the MTBF, it would help to know the variance or standard
deviation.
It is unlikely that the failures follow a "normal distribution" (or
"Laplace-Gauss") bell curve. And, other distributions are certainly
not ABnormal :-)
If the curve is symmetrical, then the mean, median, and mode will all
be the same. If it is not symmetrical, then they won't be. Hence the
use of MEDIAN - at that point half are dead, half are still alive.
In toxicology, there is a concept of an LD-50 dosage - the dosage that
will kill half, since for example, antibiotic resistant bacteria might
require an incredibly large dosage to get that last one, but LD-50
provides a convenient way to get a single number.
100,000 hours is the LD-50 of those drives.
If it turns out that the drives last 100,000 hours, plus or minus 10%,
then you have a curve with a very steep slope. It is still half dead
at 100,000, but maybe hardly any dead until 90,000, hardly any left
alive at 110,000.
OTOH, if the failures were evenly distributed throughout a life of 0
to 200,000 hours, with the same number going every day, then that also
would have a MTBF of 100,000. In THAT case, the mean time to the first
failure among four drives would be far below 100,000: about 40,000.
(Dividing by four to get 25,000 is exact only for a constant failure
rate.)
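To make the dependence on the shape of the curve concrete, here is a
rough simulation sketch. It is not from the original thread: the three
distributions, the function name mean_first_failure, and the trial count
are all assumptions chosen for illustration, and it assumes the four
drives fail independently. Each distribution has a mean of 100,000 hours,
and the sketch estimates the mean time until the first of four drives
fails.

```python
import random

MTBF = 100_000  # hours; every distribution below has this mean


def mean_first_failure(draw_lifetime, n_drives=4, trials=20_000, seed=1):
    """Estimate the mean time until the first of n_drives fails.

    draw_lifetime is a function taking a random.Random instance and
    returning one drive lifetime in hours (drives assumed independent).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(draw_lifetime(rng) for _ in range(n_drives))
    return total / trials


# Illustrative (assumed) lifetime distributions, all with mean 100,000.
distributions = {
    "steep": lambda rng: rng.uniform(0.9 * MTBF, 1.1 * MTBF),  # +/- 10%
    "flat": lambda rng: rng.uniform(0, 2 * MTBF),              # 0 to 200,000
    "constant_rate": lambda rng: rng.expovariate(1 / MTBF),    # exponential
}

results = {name: mean_first_failure(draw) for name, draw in distributions.items()}

for name, hours in results.items():
    print(f"{name:>13}: first of 4 fails after about {hours:,.0f} hours")
```

Under these assumptions the estimates should land near 94,000 hours for
the steep curve, 40,000 for the flat one, and 25,000 for the constant
failure rate, so the same per-drive MTBF gives very different answers.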
They rarely work that way. Often our devices will have what is
sometimes called a "bathtub curve". There are a few failures
IMMEDIATELY ("infant mortality") falling off rapidly, and then few
failures for quite a while, and then, as random parts start to wear
out, the failures rise. In fact, with the same MTBF of 100,000, it
could be that once the early failures are discarded, the MTBF
of the REMAINDER is 200,000.
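As a back-of-the-envelope check on how much the early deaths can move
the number, here is a small sketch. It assumes a simplified two-group
model that is not in the original post: a fraction of drives dies early
with some mean lifetime, and the rest form the "remainder". The function
name remainder_mean and the example fractions are hypothetical.

```python
def remainder_mean(overall_mean, early_fraction, early_mean=0.0):
    """Mean lifetime of the drives left after discarding early failures.

    Assumes a two-group mixture:
        overall_mean = early_fraction * early_mean
                       + (1 - early_fraction) * remainder_mean
    """
    if not 0 <= early_fraction < 1:
        raise ValueError("early_fraction must be in [0, 1)")
    return (overall_mean - early_fraction * early_mean) / (1 - early_fraction)


# 5% of drives dead on arrival, overall MTBF of 100,000 hours.
print(round(remainder_mean(100_000, 0.05)))

# Half of all drives dead on arrival, same overall MTBF.
print(round(remainder_mean(100_000, 0.50)))
```

In this simplified model the remainder only reaches 200,000 hours if
about half of the drives die almost immediately; a 5% early failure rate
moves the figure to roughly 105,000.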
IFF you are willing to deal with the DOA and infant mortality cases,
then by discarding or ignoring those outlying numbers, you might get a
more realistic evaluation of what to expect.
</pedantic sadistics>
--
Grumpy Ol' Fred     cisin at xenosoft.com