On Dec 15, 11:51, Sellam Ismail wrote:
> You can also get fairly robust PCs, with dual hot-swap power supplies,
> dual (or more) CPUs, mirrored hard drives (or RAID), error-correcting
> RAM, etc.
Indeed. I can tell you a story about that...
A few years ago, our Department tendered for a Novell server, to act as
file and mail server for our student network. The tender was won by A
Well-Known PC Company with 4 letters in their name (how appropriate, as it
turned out) who supplied the hardware and on-site maintenance.
Well, the on-site maintenance came in the form of trained monkeys who were
moderately capable of, say, swapping a disk drive, and not very good at
solving problems when the system went wrong -- as it did, rather
frequently. So frequently, in fact, that for a while the Department
insisted on a rota for support staff "on call", something which, as far as
I know, had never been done before. After many "discussions" with the
trained monkeys and their management, who insisted that our experienced and
qualified staff must not tamper with "their" hardware, we stopped calling
them, and did some tests. Basically, the power supply couldn't cope with
the sum of the disks and RAM.
To solve the problem properly (after adding a second PSU and moving all the
disks to separate housings) it was decided that we'd buy a replacement
system, from another Computer Company of high repute (this one with six
letters in its name). The new, overkill, spec was for dual processors, two
banks of ECC RAM, triple redundant hot-swappable power supplies,
hot-swappable RAID disks, two network interfaces, and a UPS. And just for
good measure, we wanted a pair of these, linked directly by a crossover
cable on the second network interfaces, and running some smart software
that allowed one to mirror the other. In theory, if the "live" server
failed, the other would adopt its IP address and take over.
In theory, theory and practice are the same. In practice, they are
different.
In practice, our network turns out not to like duplicate IP addresses, that
is, two devices with different MAC addresses but using the same IP address
-- and the second machine was not always perfectly silent. In practice,
the backup server was always a bit too enthusiastic. The live server would
see a glitch on the RAID disks and report it, and the backup would try to
take over. But the live one wouldn't let go, and they'd fight. Almost
daily, partly because the RAID system was perfectly capable of correcting
errors much of the time, but its controller was perfectly capable of generating
them as well.
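
For anyone who wants to see the fight in miniature, here is a rough sketch of the failure mode as I understand it, not of the actual mirroring software. The node names, the MAC addresses and the takeover rule are all made up for illustration; the only point is that if the backup claims the shared address on any reported fault, and nothing forces the live server to release it, you end up with two MAC addresses answering for one IP.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    mac: str               # hypothetical MAC address, for illustration only
    holds_ip: bool = False


def owners(nodes):
    """Return the MAC addresses currently answering for the shared IP."""
    return [n.mac for n in nodes if n.holds_ip]


def naive_takeover(live, backup, fault_reported):
    """Assumed rule: the backup claims the IP on ANY reported fault.

    Nothing here makes the live server give the address up, which is
    the flaw being illustrated.
    """
    if fault_reported:
        backup.holds_ip = True


def simulate(fault_reports):
    """Return how many nodes hold the shared IP after each report."""
    live = Node("live", "00:00:00:00:00:01", holds_ip=True)
    backup = Node("backup", "00:00:00:00:00:02")
    history = []
    for reported in fault_reports:
        naive_takeover(live, backup, reported)
        history.append(len(owners([live, backup])))
    return history
```

With one correctable glitch in the middle, simulate([False, True, False]) gives [1, 2, 2]: the fault goes away, but the duplicate address does not.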
In the end, we found it better to switch one off. The live one fails only
occasionally, usually when doing an overnight backup. And we have a heavy
box to prop open the machine room door. Or run VMware from time to time.
Moral: there is such a thing as overkill, and such a thing as
over-engineering.
--
Pete
Peter Turnbull
Network Manager
Dept. of Computer Science
University of York