In early 1971, I started working as an on-site software tech on a rather large (for the
time), dual-processor Burroughs B6500. It was early days for that system, and we had a lot
of problems with it in the field. We got those ironed out pretty well and eventually got
the system up to an acceptable level of operation.
About a year later, a major upgrade to the MCP (the OS) was released. A major component of
that release was a completely rewritten I/O subsystem with much higher reliability and
much, much better performance, especially in "Logical I/O," the interface
between user programs and the memory buffers and physical I/O mechanism. Soon after
installing this release, we started to get fatal system crashes in Logical I/O.
Describing what was happening requires a little background on the system architecture. The
B6500, like its predecessor the B5500, is known as an ALGOL-oriented stack machine, but it
is less well-known as a type of capability system. To support that, it used a segmented
memory model and tagged memory. Each word had an extra three bits, not accessible to user
programs, that identified the type of data in a word. For example, tag 0 was ordinary
data, tag 2 indicated a double-precision word, tag 5 was a data descriptor through which
data segments were addressed, and tag 7 was a Program Control Word (PCW) that effectively
addressed a location in an object-code segment. It was used primarily as a procedure
(subroutine) entry-point address.
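
As a rough illustration of the tagging idea, here is a minimal Python sketch. The four tag values are the ones mentioned above; the `Word` class, its field names, and the `describe` helper are hypothetical names made up for this example and are not part of the B6500 or the MCP.

```python
from dataclasses import dataclass

# Tag values mentioned in the text. Other tag values are left out here.
TAG_DATA = 0        # ordinary data
TAG_DOUBLE = 2      # double-precision word
TAG_DESCRIPTOR = 5  # data descriptor, used to address a data segment
TAG_PCW = 7         # Program Control Word, a procedure entry point

TAG_NAMES = {
    TAG_DATA: "ordinary data",
    TAG_DOUBLE: "double-precision word",
    TAG_DESCRIPTOR: "data descriptor",
    TAG_PCW: "program control word",
}


@dataclass(frozen=True)
class Word:
    """Hypothetical model of a memory word: three tag bits plus data bits."""
    tag: int   # three bits, not accessible to user programs
    data: int  # the ordinary data bits

    def __post_init__(self):
        if not 0 <= self.tag <= 0b111:
            raise ValueError("a tag must fit in three bits")


def describe(word: Word) -> str:
    """Return a readable name for the kind of word, based only on its tag."""
    return TAG_NAMES.get(word.tag, f"tag {word.tag} (not covered in this sketch)")


print(describe(Word(TAG_PCW, 0o1234)))
print(describe(Word(TAG_DESCRIPTOR, 0o1234)))
```

The point of the sketch is that two words with identical data bits mean entirely different things depending on the tag.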
In reading the dumps from the crashes and the MCP source code, we started to learn how the
new Logical I/O mechanism worked. It cleverly used the stack addressing of the system to
implement a very object-oriented interface. The "methods" of this interface were
small procedures that were customized to handle very specific cases of record handling --
random vs. sequential I/O, blocked vs. unblocked I/O, translation or no translation, etc.
There were scores of these methods. The idea was to do as little as possible for each user
request, to avoid making as many decisions as possible, and to optimize the buffer
handling in each case. There were about a half-dozen different types of user requests, and
the methods for those were accessed through a branch table in the FIB (control block) for
each open file. That branch table contained PCWs for the appropriate methods needed by
that file. The table was set up during file open, but could be changed as the nature of
the user program's requests changed, e.g., from sequential to random access.
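
The per-file branch table can be sketched in Python as a table of function references. This is only an analogy under stated assumptions: the class `FIB`, the functions `read_sequential`, `read_random`, `open_file`, `switch_to_random`, and `user_read` are all hypothetical names, and the real table held PCWs rather than Python callables.

```python
class FIB:
    """Hypothetical stand-in for the per-file control block."""

    def __init__(self, records):
        self.records = records
        self.next_index = 0
        self.branch_table = {}  # request type -> specialized method


def read_sequential(fib, key=None):
    """Specialized method: return the next record, no other decisions."""
    record = fib.records[fib.next_index]
    fib.next_index += 1
    return record


def read_random(fib, key):
    """Specialized method: return the record at the given position."""
    return fib.records[key]


def open_file(records):
    """Set up the branch table at open time for the expected access pattern."""
    fib = FIB(records)
    fib.branch_table["read"] = read_sequential
    return fib


def switch_to_random(fib):
    """Swap the table entry when the program's request pattern changes."""
    fib.branch_table["read"] = read_random


def user_read(fib, key=None):
    """The request path makes no decisions; it just goes through the table."""
    return fib.branch_table["read"](fib, key)


fib = open_file(["rec0", "rec1", "rec2"])
first = user_read(fib)
switch_to_random(fib)
third = user_read(fib, 2)
```

The choice of method is made once, when the table entry is written, so the per-request path stays as short as possible.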
We discovered that the crash was caused by some PCWs in the file-level branch tables
having tags of 5 instead of tags of 7. Attempting to call a procedure using a tag-5 word
was a no-no that was trapped by the hardware, hence the fatal dump. Then we discovered
that the branch tables were loaded from a master array of PCWs for all of the possible
methods, and when we looked at that array in the dump, ALL of the PCWs in the array had
tags of 5! We knew that the array must initially have held words with tags of 7, because the
system had run for quite a while before crashing, so how could all of the words in the
array suddenly have changed to tags of 5? There wasn't any straightforward way to do
that in software.
That master array was loaded from the OS image on disk at the initial boot, but then
we found out the array was overlayable -- it could be paged out and back in by the virtual
memory mechanism. So we began to suspect there was an issue with I/O. Normally, a disk
read stored words in memory with a constant tag of 0, but there was a special I/O mode,
termed "tag transfer," that would read and write the tag bits along with the
regular data bits.
Fortunately, the other tech on the site had worked in the MCP group for a while and knew
the I/O hardware pretty well, so he started writing some standalone programs to exercise
the hardware in specific ways. This system had two I/O Multiplexors (multi-channel DMA
units) numbered 0 and 1, and the disk drives were dual-ported so that either Mux could
address any of them. My colleague's programs tried doing I/Os with various
combinations of Mux and channel assignments. And as you might expect by now, he discovered
that doing a tag transfer read through Mux 1 always dropped the middle tag bit.
We turned that finding over to the on-site Field Engineers, who pulled out their
schematics and started tracing signals. It took several hours, but eventually they
discovered not a loose wire, but that Mux 1 was completely missing a wire. Of course, it
was the wire that carried the middle bit of the tag during tag transfer.
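
The arithmetic behind the symptom is simple: a tag of 7 is binary 111, and with the middle bit forced to zero it becomes 101, which is 5. The following Python sketch models that under simplifying assumptions. The function `tag_transfer_read` and the (tag, data) tuple representation are hypothetical, and the data values are made up.

```python
MIDDLE_TAG_BIT = 0b010  # the bit carried by the missing wire


def tag_transfer_read(stored_words, mux):
    """Model a tag-transfer read of (tag, data) pairs through a given Mux.

    Mux 0 delivers tags intact. Mux 1 models the missing wire by always
    delivering the middle tag bit as zero.
    """
    if mux == 1:
        return [(tag & ~MIDDLE_TAG_BIT & 0b111, data) for tag, data in stored_words]
    return list(stored_words)


# A made-up master array of PCWs (tag 7) as written out to disk.
master_array = [(7, 0o100), (7, 0o140), (7, 0o200)]

via_mux0 = tag_transfer_read(master_array, mux=0)
via_mux1 = tag_transfer_read(master_array, mux=1)

print([tag for tag, _ in via_mux0])
print([tag for tag, _ in via_mux1])
```

In this model, words with a tag of 0 or 5 come back unchanged through Mux 1, and only tags with the middle bit set (such as 2 and 7) are altered.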
We finally deduced that the problem had been present since the system left the factory
floor. The original I/O software had been so poor that the system was seldom (if ever)
able to initiate more than one I/O to disk at a time, but the new version we had recently
installed was really good at initiating multiple I/Os. Mux 1 had a lower selection
priority than Mux 0, so under the old software it was seldom selected, and perhaps never
for tag-transfer I/Os, which were relatively rare. The new software allowed the system
to get busy enough that Mux 1 started to be used a lot more often, and eventually it got
busy enough that a paging I/O for that master PCW array got scheduled to Mux 1, and the
system just didn't survive for very long after that.
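
Why the fault stayed hidden can be shown with a toy model in Python. It assumes, purely for illustration, that each Mux handles one I/O at a time and that the lowest-numbered free Mux is always chosen. The real hardware had multiple channels per Mux, and the names `select_mux` and `count_mux1_uses` are hypothetical.

```python
def select_mux(busy):
    """Return the lowest-numbered free Mux, or None if both are busy."""
    for mux in (0, 1):
        if not busy[mux]:
            return mux
    return None


def count_mux1_uses(outstanding_per_step):
    """Count how many I/Os land on Mux 1.

    outstanding_per_step lists, for each time step, how many I/Os the
    software tries to have in flight at once.
    """
    uses = 0
    for outstanding in outstanding_per_step:
        busy = {0: False, 1: False}
        for _ in range(outstanding):
            mux = select_mux(busy)
            if mux is None:
                break  # both busy: remaining requests wait in this toy model
            busy[mux] = True
            if mux == 1:
                uses += 1
    return uses


old_software = [1, 1, 1, 1, 1, 1]  # never more than one I/O at a time
new_software = [1, 2, 3, 2, 1, 2]  # overlapping I/Os are common

print(count_mux1_uses(old_software), count_mux1_uses(new_software))
```

With only one I/O in flight at a time, the lower-priority Mux is never picked in this model, so a fault on it has no chance to show up.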