Yeah, I re-read that later (see my later mails).
The whole setup sounds like a huge lesson in project mismanagement,
even leaving out the OS decision. (NT was not really ready for
mission-critical tasks. My UNIX experience at the time (1998?) was with
Linux, and I don't know that I'd trust it with the task either. Maybe
Solaris? No idea. What was the recommended high-availability UNIX OS
at the time?)
But back to the Yorktown: So you're the designer, and you're telling me
that:
1) You're putting commodity PC hardware (regardless of the OS) in the
_sole_ position of controlling important functions of the ship, without
ANY sort of backup?
1.5) Really?
2) You're putting said system in place in a very ad-hoc manner -- quote
from the Wired article: "They rushed this stuff on the ship, there was no
real prototype, and then they tried to make things work as they went
along," (And it does sound very ad-hoc -- the database does no data
validation? ("a crew member entered a zero into a /database/ field
causing a divide by zero...") The client app doesn't bother to check the
data either? There's no backup system? There's _no_ backup system of
ANY kind?)
Yeah, that sounds like a recipe for success no matter how you slice it :).
I admit, my experience/bias with Windows NT 4.0 makes me want to know
whether the divide by zero / buffer overflow was a _complete_ system
crash (i.e. a BSOD) which is what everyone who hates Windows assumes it
was, or whether it was more along the lines of: 1) operator puts
invalid data in the database, 2) all client machines pick up and use new
invalid data without validations, 3) all client ship-controlling apps
crash due to bad data, bringing down the ship system.
I say this, because as many problems as NT had, it wasn't _all_ that
easy to write a client app that would BSOD it :). (Don't get me started
about a couple of in-box hardware drivers, though...) Oh, there was
that infamous CSRSS.exe bug, but you'd have to WANT to trigger that one :).
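To make that second scenario concrete, here's a rough sketch of the two
places a bad zero could have been caught. Everything in it is made up
for illustration (the function names, the fallback value, the numbers);
I have no idea what the actual Yorktown software looked like, and this
is Python only because it's short.

```python
# Hypothetical illustration only: none of these names come from the real system.

def validate_field(value):
    """Check at data-entry time: refuse a value the clients can't divide by."""
    if value == 0:
        raise ValueError("field must be non-zero")
    return value


def naive_client(total, divisor):
    """A client that trusts whatever it reads from the database."""
    return total / divisor  # raises ZeroDivisionError when divisor is 0


def defensive_client(total, divisor, fallback=None):
    """A client that checks the data itself and degrades instead of dying."""
    if divisor == 0:
        return fallback  # report "no reading" rather than take the app down
    return total / divisor
```

Either check alone stops the chain: validate_field() keeps the zero out
of the database in step 1, and defensive_client() survives it in step 3.
With neither one, every client reading the same bad value goes through
the naive_client() path and falls over, no OS bug required.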
There's a lot of bias in the articles I read on the Yorktown fiasco,
generally anti-Windows (see: the Wired article I'm quoting above). I've
never seen any _real_ information on the crash other than generic
clauses like "the system(s) crashed" which could mean anything between
"an outright BSOD of every system on the network" and "a poorly
written/specified/designed app/distributed system going down". If there
is a real post-mortem analysis of this that's been publicly released,
I'd be interested in reading it.
Just my (at this point, way more than) two cents.
And I keep promising myself I won't get involved in these kinds of
off-topic discussions. I'm a bad boy. I'm done now :).
- Josh
Sridhar Ayengar wrote:
Josh Dersch wrote:
Apparently. Go look up the news archives for the Yorktown story.
Read the facts before you make assumptions and accuse people of not
knowing what they're talking about just because you don't like what
they're saying.
I've never seen an explanation of what the failure actually was, just
lots of articles stating that Windows NT was being used as the OS and
-something- went wrong. It could just as easily have been buggy
client software that crashed.
The explanation at the time was that one of the operators put a zero
into a database field he shouldn't have, which caused a divide-by-zero
problem which led to a buffer overrun which cascaded to all of the
workstations on the network.
Peace... Sridhar