----- Original Message -----
From: "Pete Turnbull" <pete(a)dunnington.u-net.com>
To: <classiccmp(a)classiccmp.org>
Sent: Friday, December 15, 2000 8:27 PM
Subject: Re: The debate on what per say is a mini...
To solve the problem properly (after adding a second PSU and moving all
the disks to separate housings) it was decided that we'd buy a
replacement system, from another Computer Company of high repute (this
one with six letters in its name). The new, overkill, spec was for
dual-processor, two banks of ECC RAM, triple redundant hot-swappable
power supplies, hot-swappable RAID disks, two network interfaces, and a
UPS. And just for good measure, we wanted a pair of these, linked
directly by a crossover cable on the second network interfaces, and
running some smart software that allowed one to mirror the other. In
theory, if the "live" server failed, the other would adopt its IP
address and take over.

In theory, theory and practice are the same. In practice, they are
different.

In practice, our network turns out not to like duplicate IP addresses,
that is, two devices with different MAC addresses but using the same IP
address -- and the second machine was not always perfectly silent. In
practice, the backup server was always a bit too enthusiastic. The live
server would see a glitch on the RAID disks and report it, and the
backup would try to take over. But the live one wouldn't let go, and
they'd fight. Almost daily, partly because the RAID system was perfectly
capable of correcting errors much of the time, but its controller was
perfectly capable of generating them as well.

In the end, we found it better to switch one off. The live one fails
only occasionally, usually when doing an overnight backup. And we have a
heavy box to prop open the machine room door. Or run VMware from time to
time.

Moral: there is such a thing as overkill, and such a thing as
over-engineering.

Saw a presentation on this last week from Red Hat. Their Piranha and
STONITH protocol deal with this. We set up a demonstration with 3
back-end web servers and 2 front-end machines. We then alternately
unplugged various machines and all the while kept serving the web pages.
The second machine, when it detected trouble with the primary, would
power off the first machine (Shoot The Other Node In The Head). It was
quite impressive but had me thinking of VMS and MVS and wishing everyone
would talk to each other instead of reinventing the wheel.
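
For anyone who wants the idea in concrete form, here is a rough Python
sketch of "fence first, then take over". It is only an illustration of
the ordering, not Red Hat's code: the Node class, fence() and
take_over() are names I made up, and fence() merely flips a flag where a
real setup would drive a remote power switch.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    powered: bool = True     # is the machine switched on?
    holds_ip: bool = False   # does it currently answer on the service IP?


def fence(node: Node) -> None:
    """Hypothetical stand-in for a remote power switch: cut the node's power."""
    node.powered = False
    node.holds_ip = False


def take_over(backup: Node, primary: Node, primary_looks_healthy: bool) -> bool:
    """Fence the primary first, then claim the service address.

    Returns True if the backup took over, False if it left things alone.
    """
    if primary_looks_healthy:
        return False
    fence(primary)           # shoot first, so the address is never shared
    backup.holds_ip = True
    return True


primary = Node("live", holds_ip=True)
backup = Node("standby")
took_over = take_over(backup, primary, primary_looks_healthy=False)
holders = [n.name for n in (primary, backup) if n.holds_ip]
print(took_over, holders)
```

Because the primary is powered off before the backup claims the address,
there is never a moment when both machines answer on the same IP, which
is exactly the fight described in the quoted message.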