On Mar 27, 2018, at 8:51 PM, Fred Cisin via cctalk
<cctalk at classiccmp.org> wrote:
> Well outside my realm of expertise (as if I had a realm!), . . .
> How many drives would you need, to be able to set up a RAID, or hot
> swappable RAUD (Redundant Array of Unreliable Drives), that could give
> decent reliability with such drives?
> How many to be able to not have data loss if a second one dies before
> the first casualty is replaced?
> How many to be able to avoid data loss if a third one dies before the
> first two are replaced?
These are straightforward questions of probability math, but it takes some time to get the
details right. For one thing, you need believable numbers for the underlying error
probabilities. And you have to analyze the cases carefully.
The basic assumption is that failures are "fail stop", i.e., a drive refuses to
deliver data. (In particular, it doesn't lie -- deliver wrong data. You can build
systems that deal with lying drives but RAID is not such a system.) The failure may be
the whole drive ("it's a door-stop") or individual blocks (hard read
errors).
In either case, RAID-1 and RAID-5 handle single faults. RAID-6 isn't a single
well-defined thing, but as the term is normally used it means a system that handles
double faults. So a RAID-1 system with a double fault may fail to give you your data.
(It may also be ok -- it depends on where the faults are.) RAID-5 ditto.
The tricky part is what happens when a drive breaks. Consider RAID-5 with a single dead
drive, and the others are 100% ok. Your data is still good. When the broken drive is
replaced, RAID rebuilds the bits that belong on that drive. Once that rebuild finishes,
you're once again fault tolerant. But a second failure prior to rebuild completion
means loss of data.
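To make the rebuild step concrete, here is a minimal Python sketch of RAID-5-style
XOR parity: any one missing block in a stripe can be recomputed from the surviving
blocks plus parity. The block contents and stripe width are made up for illustration.

    import functools, operator

    def xor_blocks(blocks):
        # Byte-wise XOR of equal-length blocks.
        return bytes(functools.reduce(operator.xor, col) for col in zip(*blocks))

    # Three data blocks plus one parity block, as on one 4-drive RAID-5 stripe.
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(data)

    # If the drive holding data[1] dies, its block is recovered from the
    # surviving data blocks plus the parity block.
    rebuilt = xor_blocks([data[0], data[2], parity])
    assert rebuilt == data[1]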
So one way to look at it: given the MTBF, calculate the probability of two drives failing
within N hours (where N is the time required to replace the failed drive and then rebuild
the data onto the new drive). But that is not the whole story.
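As a rough sketch of that first calculation (my numbers, not from any particular
datasheet): assuming independent drives with exponentially distributed lifetimes,
the chance that one of the surviving drives dies inside the replace-and-rebuild
window is roughly:

    import math

    def p_second_failure(mtbf_hours, window_hours, surviving_drives):
        # P(a given drive fails within the window), exponential lifetime model.
        p_one = -math.expm1(-window_hours / mtbf_hours)
        # P(at least one of the surviving drives fails within the window).
        return -math.expm1(surviving_drives * math.log1p(-p_one))

    # Illustrative: 5-drive RAID-5 with one dead drive, 1,000,000-hour MTBF,
    # 24 hours to swap the drive and finish the rebuild.
    print(p_second_failure(1_000_000, 24, 4))   # about 1e-4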
The other part of the story is that drives have a non-zero probability of a hard read
error. So during rebuild, you may encounter a sector on one of the remaining drives that
can't be read. If so, that sector is lost.
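A companion sketch for the read-error part, again with illustrative numbers:
datasheets quote an unrecoverable read error (URE) rate on the order of one error
per 1e14 bits for many consumer drives, and a rebuild has to read every surviving
drive end to end.

    import math

    def p_ure_during_rebuild(drive_bytes, surviving_drives, ure_per_bit=1e-14):
        # P(at least one unrecoverable read error while reading all the
        # remaining data), treating bit errors as independent (a crude model,
        # so take the result as an order-of-magnitude estimate).
        bits_to_read = drive_bytes * 8 * surviving_drives
        return -math.expm1(bits_to_read * math.log1p(-ure_per_bit))

    # Illustrative: rebuilding a 5-drive RAID-5 of 4 TB drives means
    # reading the other four drives in full.
    print(p_ure_during_rebuild(4e12, 4))   # about 0.7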
The probability of hard read error varies with drive technology. And of course, the
larger the drive, the greater the probability (all else being equal) of having SOME sector
be unreadable. For drives small enough to have PATA interfaces, the probability of hard
read error is probably low enough that you can *usually* read the whole drive without
error. That translates to: RAID-1 and RAID-5 are generally adequate for PATA disks.
On the very large drives currently available, it's a different story, and the
published drive specs make this quite clear. This is why RAID-6 is much more popular now
than it was earlier. It isn't the probability of two nearly simultaneous drive
failures, but rather the probability of a hard sector read error while a drive has failed,
that argues for the use of RAID-6 in modern storage systems.
paul