I spent 3 hours this morning attempting to figure out where to start
looking.
The problem shows up as a reported difference between files which are
almost certainly identical. The exception is one copy which I am confident
really does differ, every time, at the reported byte (an error made at the
time the copy was created).
Rather than attempting to hide that this is current PC hardware, it is
probably best to list:
Hardware: Intel E8200 CPU (2.66 GHz), 2 * 2 GB memory in
ASUS P5B motherboard with 2 * Seagate 320 GB SATA II hard drives
Software: Windows XP with probably SP3
The problem shows up between 5% and 10% of the time with very
large files ranging from 1/2 GB to 2 GB in size. The command:
COPY /B D:*.GHO G:*.GHx
was used to create 4 files each time, with x = A, B, C, D or E
FC /B D:*.GHO G:*.GHx
is used to compare the files 4 at a time. It takes about 2 1/2 minutes to
compare the group of 4 files for a total of 5 GB being compared with
5 GB (which is why I said it must be current hardware). The result
often (between 5% and 10% of the time) shows a single difference
between the files at byte xxxxx35A (the xxxxx is random but the last
3 characters of the address are always the same) with the two hex
characters for each byte being different by 1 bit:
e.g. 22 vs 32 about 25% of the time with an extra bit 4
e.g. 94 vs 84 about 75% of the time with a missing bit 4
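
The two examples above can be checked by XORing the byte pairs. This is a minimal Python sketch; the function name `describe_flip` is made up for illustration, and it assumes the first value comes from the D: file and the second from the G: file, in the order FC /B prints them.

```python
def describe_flip(first: int, second: int) -> str:
    """Describe which bits differ between two byte values."""
    diff = first ^ second  # XOR leaves a 1 in every bit position that differs
    parts = []
    for bit in range(8):
        if diff & (1 << bit):
            # Report whether the second value gained or lost this bit.
            state = "set" if second & (1 << bit) else "cleared"
            parts.append(f"bit {bit} {state} in second")
    return ", ".join(parts) if parts else "identical"


print(describe_flip(0x22, 0x32))  # bit 4 set in second
print(describe_flip(0x94, 0x84))  # bit 4 cleared in second
```

Both pairs differ by the mask 0x10, so it is the same bit 4 in each case, with only the direction of the change varying.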
I probably performed between 75 and 150 FC commands
during which about 200 comparisons were made. I collected
about 15 cases when a difference was noted.
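
A comparison like the FC /B runs above could also be scripted so that every difference is logged with its offset and XOR mask. The following Python sketch is only an illustration: `find_differences` and the temporary file names are hypothetical, and the demo builds two small files with a planted error rather than using the real .GHO files. The last 3 hex digits of an offset are its low 12 bits, i.e. the position within a 4096-byte block of the file; whether that corresponds to a fixed position within a page of RAM is an assumption, not something this script can show.

```python
import os
import tempfile

CHUNK = 1 << 20  # read 1 MiB at a time so multi-GB files are never held whole


def find_differences(path_a, path_b, limit=100):
    """Return up to `limit` tuples (offset, byte_a, byte_b, xor_mask).

    Only bytes present in both files are compared; a length mismatch
    is not reported by this sketch.
    """
    found = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while len(found) < limit:
            a = fa.read(CHUNK)
            b = fb.read(CHUNK)
            if not a and not b:
                break
            if a != b:  # only scan byte by byte when the chunks differ
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        found.append((offset + i, x, y, x ^ y))
                        if len(found) >= limit:
                            break
            offset += max(len(a), len(b))
    return found


# Demo with a planted single-bit error at an offset ending in 35A.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.bin")
    bad = os.path.join(tmp, "bad.bin")
    data = bytearray(0x20000)
    data[0x1235A] = 0x22
    with open(good, "wb") as f:
        f.write(data)
    data[0x1235A] ^= 0x10  # flip bit 4
    with open(bad, "wb") as f:
        f.write(data)
    diffs = find_differences(good, bad)

for off, a, b, mask in diffs:
    print(f"{off:08X}: {a:02X} vs {b:02X}  xor={mask:02X}  low 12 bits={off & 0xFFF:03X}")
```

Collecting the output over many runs would show whether the low 12 bits and the XOR mask really are the same in every case.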
I can also perform an MD5 valuation of each file. So long
as the files tested just before it are larger than the available
memory (so that the file is no longer cached), the MD5 valuation
takes about 40 seconds for a 2 GB file and the disk activity
light is on the whole time. If the MD5 valuation is repeated on
the same file, the disk activity light stays off and the
MD5 valuation takes about 9 seconds. The MD5 valuation
is probably incorrect between 5% and 10% of the time.
Repeating the MD5 valuation on a different copy of the same file
provides a cross check: when what seems like an incorrect MD5
valuation appears, a different copy will almost always
(19 times out of 20) show the correct MD5 valuation. When
the MD5 valuation is then redone on the first file, the probably
correct MD5 valuation appears, i.e. different from the previous
MD5 valuation on that file.
However, if the MD5 valuation is done again without flushing the
cached copy, then the same memory contents seem to be used
each subsequent time yielding the same (probably incorrect if
that was the situation) MD5 valuation from the last time that the
disk drive was actually read. My assumption is that as the byte
is read into memory, one of the bytes of RAM (possibly the same
one each time) has an extra bit set or has a bit that fails to be
set - always that same bit 4, of course. Other reasons may also
have caused the byte in RAM to be incorrect.
However, once incorrect, it seems to stay the same until it is
modified again. So it does not seem to be a problem with
reading RAM.
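
The repeated MD5 cross check described above could be automated along these lines. This is a hedged Python sketch using the standard `hashlib` module; the names `md5_of_file` and `tally_digests` are invented here. It cycles through all the files on each pass, on the assumption (taken from the timings above) that hashing more data than fits in memory forces each file to be re-read from disk on the next pass. Whether the cache is really flushed depends on the operating system.

```python
import hashlib
from collections import Counter


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tally_digests(paths, passes=3):
    """Hash every path once per pass and count each digest seen per path."""
    tallies = {p: Counter() for p in paths}
    for _ in range(passes):
        for p in paths:  # visit all files before repeating any one of them
            tallies[p][md5_of_file(p)] += 1
    return tallies


# Example use (file names are placeholders):
# for path, counts in tally_digests(["a.GHO", "b.GHO"], passes=10).items():
#     print(path, dict(counts))
```

A file that is read correctly every time ends up with a single digest in its tally; more than one digest for the same file means at least one read returned different data.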
So (FINALLY) this is my question:
If the cached copy (probably incorrect) of a hard drive file (which
is used to perform the MD5 valuation) stays
the same, what is the probability that it is a problem with RAM
as opposed to a hard drive or controller error? Since
the error suggests that it is a single bit causing the problem (the
same bit seems to be either on or off), my intuition
suggests that the RAM has a problem. However, since
my hardware experience is minimal, I am asking for suggestions
as to what can be done and the order they should be attempted:
(a) Replace the RAM
(b) Replace the disk drive(s)
(c) Replace the motherboard (if the disk controller is the problem)
(d) Something else I have not thought of
I know that Tony Duell would want to fix the component, but
I doubt that is even possible in this case.
Does anyone have any different suggestions?
Thank you for any help and please suggest any different tests
that might help to locate or better identify the cause of the problem.
Finally, just to make sure, does this problem constitute a serious
difficulty which should (must?) be fixed before any actual use is made
of this new system?
Sincerely yours,
Jerome Fine