I know there's been discussion in the not too distant past about troubleshooting a
Xerox Alto, but I just want to add a couple of words on the subject.
Ours went dead suddenly: it had been left running with the monitor turned off and, when I
turned the monitor on, the cursor was on the screen in the upper left corner and was
unresponsive. I rebooted with the button on the back of the keyboard and the screen went
blank white. There didn't appear to be any response (seek behavior) from the disk
(the Diablo 31 shakes things pretty well as it seeks). I shut down the machine and began
to strategize.
First I wanted to be sure the basics were OK. I used a scope and DVM to check all the
power supplies, both for voltage level and noise: good. I scoped the clocks: good. Damn,
nothing easy.
I checked with the man who had sold us the machine (and who had originally restored it) to
see if he had any advice on where to start. He wished me luck. :-) Since the Alto was a
research platform, neither the documentation nor the engineering drawings were created
with the idea of troubleshooting. I suspect the original Xerox repair strategy was,
"Ask so-and-so at PARC."
I took a hard look at the disk subsystem, since according to the documentation
(AltoSubsystems.pdf) the first thing the firmware does on reboot is load a sector from the
disk. I scoped the signals going into and coming out of the disk and saw that a key enable
line from the Alto, *RDGATE, was not asserting. The documentation for the Diablo 31 was
helpful here.
There were two possible reasons: it was not being 'told' to assert by the
microcode, or there was a logic flaw along the combinatorial path. Recalling that the
cursor was still being generated (a microcode task) when I turned off the machine, I
initially figured the processor was OK. I traced through the disk controller's logic
with the scope but found no smoking gun (or chip). So I reversed myself on the processor
- time to get out the logic analyzer.
Now don't get me wrong, I love the logic analyzer (HP 1630G). It's a powerful
tool, and it's geek fun to play with one. But it's also a pain to hook up all
those little leads, especially if, like me, you have "mature eyes." Further,
the Alto doesn't make it all that easy: there are no extension boards available for
its 122-pin backplane. (I plan to make some.) Fortunately, many key signals are present
on the backplane, and the disk interface card is in the bottom slot of the card cage with
several empty slots above it. It was tight in there, but I managed to get DIP clips on
ICs and hook up analyzer lines.
First I looked at the processor bus. AltoSubsystems.pdf documents that the
microcode looks at the word coming from the keyboard to determine which disk sector to
load (or whether to net-boot). At first, it looked like I had a stuck bit on the
processor bus. Examining the drawings, I learned that the processor bus is a pretty
conventional wire-OR design with termination on one end at the Memory Extension And
Termination (MEAT) board, and on the other end on... the disk interface board! I verified
the seating of pull-up and pull-down resistor packs in their sockets, tested some more -
and ultimately figured out that one bit was bad on the analyzer pod! <sigh> Using
a different pod, I was now seeing activity on all 16 lines (remember, bit 0 is MSB). I
was able to capture the assertion of the keyboard word on the bus as well as the changing
response with various key presses, giving me confidence that the microcode engine was
running and sane. But *RDGATE still wasn't asserting. Damn, back into the disk board
logic.
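
For anyone who would rather check an analyzer capture programmatically than by eye, here is a minimal Python sketch of the stuck-bit check described above. It assumes the capture has been exported as a list of 16-bit integers. The function names are hypothetical and not part of any analyzer software. It uses the Alto convention that bit 0 is the MSB.

```python
WORD_BITS = 16  # the Alto processor bus is 16 bits wide


def alto_bit(word, n):
    """Return Alto bit n of a 16-bit word, where bit 0 is the MSB."""
    return (word >> (WORD_BITS - 1 - n)) & 1


def stuck_bits(samples):
    """Map each Alto bit number that never changes to the level it holds.

    samples is a sequence of 16-bit bus values captured over time.
    """
    if not samples:
        raise ValueError("need at least one sample")
    always_one = 0xFFFF   # bits that were 1 in every sample
    ever_one = 0x0000     # bits that were 1 in at least one sample
    for word in samples:
        always_one &= word
        ever_one |= word
    stuck = {}
    for n in range(WORD_BITS):
        if alto_bit(always_one, n):
            stuck[n] = 1
        elif not alto_bit(ever_one, n):
            stuck[n] = 0
    return stuck
```

A result such as `{12: 0}` only says that the line never moved in the capture. As the story above shows, that can just as easily be a bad analyzer pod channel as a real bus fault, so it is worth repeating the capture on a different pod before believing it.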
From there it was a simple but tedious matter of tracing back from the disk interface
toward the logic that consumes the microcode's output to drive the assertion of the
control line. With the logic analyzer, I traced the failure to A41, a
74H11 AND gate that wasn't AND-ing (output stuck low). Problem solved.
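
The check that exposes a gate like this can be sketched in a few lines of Python. This is an illustration only, assuming the three inputs and the output of one gate section were captured together as settled logic levels (0 or 1). The function name and the sample layout are hypothetical.

```python
def and3_faults(samples):
    """Return the samples that violate 3-input AND behavior.

    Each sample is a tuple (a, b, c, y) of logic levels for one gate
    section: three inputs and the observed output.
    """
    return [s for s in samples if s[3] != (s[0] & s[1] & s[2])]


# A gate whose output is stuck low looks fine until all inputs go high.
capture = [(0, 0, 0, 0), (1, 1, 0, 0), (1, 1, 1, 0)]
bad = and3_faults(capture)
```

Here `bad` contains only `(1, 1, 1, 0)`, the one sample where all three inputs are high and the output should have followed. Samples taken during a transition would produce false alarms, so this assumes the analyzer was clocked at a point where the signals had settled.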
So I think the takeaways are that it's important to know the processor is working, and
looking for the appearance of the keyboard word address (177034 octal) on boot is one way
to know that the firmware is being run. (The blank white screen is meaningless - the
display controller H/V oscillators free-run until they're brought under control by the
microcode.)
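
If the analyzer can export its capture, looking for the keyboard word address can be automated. The following is a minimal Python sketch, assuming the trace is a list of integer address values in capture order. The function name and the trace format are assumptions, not something the HP 1630G provides.

```python
KEYBOARD_WORD_ADDR = 0o177034  # octal, as cited above; 0xFE1C in hex


def first_keyboard_read(trace):
    """Return the index of the first sample equal to the keyboard word
    address, or None if it never appears in the trace."""
    for index, addr in enumerate(trace):
        if addr == KEYBOARD_WORD_ADDR:
            return index
    return None
```

A return value of `None` on a capture triggered at reboot would suggest the boot microcode never got as far as reading the keyboard, which points back at the processor rather than the disk logic.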
Also, nearly every signal line is a complex combination of both active logic driven by the
microcode and "control" logic representing various inputs from hardware elements
that run concurrently with but separately from the microcode engine. IMHO this makes it
hard to find even moderately involved failures with a scope alone. The engineering
drawings (AltoIIMaintSchem_1978.pdf, for our machine) document the combinatorial logic
pretty well, but the alphabet soup of the signal names may take a bit of puzzling to
decipher sometimes.
One critical set of signal lines is from the task priority encoder: unlike a modern
system, the tasks are cooperative and are represented by signal lines that enable parts of
the control logic. A particular value on e.g. the processor bus will "mean"
something completely different depending on which task's signal line is asserted. The
hardware manual (Alto_Hardware_Manual_May79.pdf) does contain a lot of information about
this, but you need to read both the "Microprocessor" section and the
"Control RAM, ROM, and S Registers" section to get the whole picture.
The documents I've cited are all available on (and were retrieved from) BitSavers -
thanks, Al!
I hope my experience is helpful for someone out there.
-- Ian
UNIX is user friendly. It's just selective about who its friends are.
Ian S. King, Sr. Vintage Systems Engineer
Living Computer Museum
A project of Vulcan, Inc.
http://www.livingcomputermuseum.org