Jules Richardson wrote:
Don Y wrote:
For 64 bit memory (assuming you want to treat it as a 64 bit
entity) you need 7 bits (minimum) to "correct a single bit error".
Doesn't that mean that there's roughly a 10% chance that a memory error
is going to be in the ECC bits rather than the main operating memory -
and that a problem there could result in the system correcting an
imaginary fault?
No. The code words represent the only "legitimate" way of encoding
a particular "value" (data bits plus check bits). So, an error in
a check bit is likewise detected (and corrected -- assuming it
is a single error).
My explanation wasn't meant to be thorough. Rather, just a simple
aid to figuring out how many bits you need (at a minimum) for a
*single* bit correction.
If you *really* want me to dig out formal definitions for
all of these, I can -- but *you* can probably google an
explanation just as quickly! :>
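
For reference, the usual rule of thumb for that minimum is the Hamming
bound: with m data bits you need the smallest r such that
2^r >= m + r + 1, because the check bits have to be able to point at any
one of the m + r stored bits, plus a "nothing wrong" case. A minimal
sketch in Python (the function name min_check_bits is made up for
illustration):

```python
def min_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction only)."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


if __name__ == "__main__":
    for m in (8, 16, 32, 64):
        r = min_check_bits(m)
        # Share of the stored word that is check bits rather than data.
        print(f"{m} data bits -> {r} check bits ({r / (m + r):.1%} of the word)")
```

For 64 data bits this gives 7, so the check bits are 7 of the 71 stored
bits, which is where the "roughly 10%" figure comes from. Adding
double-error *detection* costs one more bit on top of that.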
Do systems generally do exhaustive tests periodically on the ECC bits
only in order to minimise such problems? Or is the memory for the ECC
bits somehow made from more reliable (but more expensive) memory ICs?
No. To paraphrase a (bad) commercial: "bits is bits".
The memory controller checks each "entity" fetched from
the memory array. It dynamically determines what the
check bits *should* be for those data bits and compares
to the check bits read. If a discrepancy exists, the
controller figures out WHICH bit(s) need to be corrected
and makes the adjustment before presenting the corrected
*data* bits to the host/cpu.
On each write operation, the check bits corresponding to the
data being written are synthesized and stored in the memory
array alongside the data bits.
(how errors are reported/handled is immaterial to the controller)
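
The write and read paths described above can be sketched in software.
This is only an illustration, assuming a plain single-error-correcting
Hamming code over an 8 bit word; real controllers do this in hardware
on wider words, and the names write_word, read_word and
is_check_position are invented for the example. Check bits sit at
positions 1, 2, 4, 8, ... and the XOR of the positions of all set bits
(the "syndrome") is zero for a valid code word, or the position of the
bad bit after a single flip.

```python
def is_check_position(pos):
    """Positions 1, 2, 4, 8, ... hold check bits; all others hold data."""
    return pos & (pos - 1) == 0


def write_word(data_bits):
    """Write path: synthesize check bits and return the full stored word."""
    m = len(data_bits)
    r = 1
    while 2 ** r < m + r + 1:
        r += 1
    n = m + r
    word = [0] * (n + 1)  # index 0 unused so positions are 1-based
    data = iter(data_bits)
    for pos in range(1, n + 1):
        if not is_check_position(pos):
            word[pos] = next(data)
    for i in range(r):
        p = 1 << i
        # Check bit p makes the parity of every position containing bit p even.
        word[p] = sum(word[pos] for pos in range(1, n + 1) if pos & p) % 2
    return word[1:]


def read_word(stored):
    """Read path: recompute the checks, fix a single bad bit, return the data."""
    syndrome = 0
    for pos, bit in enumerate(stored, start=1):
        if bit:
            syndrome ^= pos
    fixed = list(stored)
    if 1 <= syndrome <= len(fixed):
        fixed[syndrome - 1] ^= 1  # the syndrome names the failing position
    data = [b for pos, b in enumerate(fixed, start=1)
            if not is_check_position(pos)]
    return data, syndrome


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 0, 1, 0]
    stored = write_word(data)
    bad = list(stored)
    bad[3] ^= 1  # flip position 4, which is a check bit, not a data bit
    print(read_word(bad))
```

Note that the flipped bit here is a *check* bit. The syndrome points at
position 4, the controller flips it back, and the data bits come out
untouched: the check bits get no special treatment and need none.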
Note that ECC is not infallible. It only guarantees good data
*if* the number of simultaneous bit failures in a word
never exceeds the maximum number for which the code was designed.
E.g., typically, you can detect two bit errors and correct *one*
(whereas simple parity detects one and corrects zero). Note that
a system designed to "detect 2, correct 1" will often gladly
"correct" *3* errors as if they were a single error (handing back
bad data without complaint) and can report *4* errors as "no
errors" (just like flipping *two* bits in a simple parity scheme
results in NO parity error -- even if one of those bits is the
parity bit).
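
To make the failure modes concrete, here is a small sketch of a
"detect 2, correct 1" code pushed past its limits. It assumes the
textbook extended Hamming (8,4) code (4 data bits, 3 check bits, one
overall parity bit) rather than anything a particular memory controller
uses; encode_secded, decode_secded and flip are names invented for the
example.

```python
def encode_secded(d):
    """Positions 1..7 form a Hamming(7,4) word; index 0 is overall parity."""
    w = [0] * 8
    w[3], w[5], w[6], w[7] = d
    for p in (1, 2, 4):
        w[p] = sum(w[pos] for pos in range(1, 8) if pos & p) % 2
    w[0] = sum(w[1:]) % 2
    return w


def decode_secded(w):
    """Return (data bits, what the decoder *believes* happened)."""
    s = 0
    for pos in range(1, 8):
        if w[pos]:
            s ^= pos
    odd = sum(w) % 2  # overall parity across all 8 bits
    fixed = list(w)
    if s == 0 and not odd:
        status = "no error"
    elif odd:
        fixed[s] ^= 1  # s == 0 means the overall parity bit itself
        status = "corrected 1"
    else:
        status = "uncorrectable"
    return [fixed[3], fixed[5], fixed[6], fixed[7]], status


def flip(w, positions):
    """Copy of w with the given positions inverted."""
    out = list(w)
    for pos in positions:
        out[pos] ^= 1
    return out


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    stored = encode_secded(data)
    for bad_bits in ([5], [5, 6], [3, 5, 7], [0, 3, 5, 6]):
        got, status = decode_secded(flip(stored, bad_bits))
        print(len(bad_bits), "flipped:", status, "| data intact:", got == data)
```

One flip is repaired and two are flagged, as designed. The triple flip
comes back labelled as a corrected single error with the wrong data,
and the quadruple flip lands on another valid code word, so the decoder
sees nothing wrong at all.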