PDP-11/45 RSTS/E boot problem

Thu Feb 7 11:47:32 CST 2019

    > From: Fritz Mueller

    > is it possible for you deduce where Unix _should_ be placing these "bad"
    > bits (from file offset octal 4220)?

Yes, it's quite simple: just add the virtual address in the code to the
physical address of the bottom of the text segment (given in UISA0). The VA
is actually 04200, though: the 04220 includes 020 to hold the a.out
header at the start of the command file.

So, with UISA0 containing 01614, that gives us PA:161400 + 04200 = PA:165600,
I think. And it wound up at PA:171600 - off by 04000 (higher) - which is
obviously an interesting number.
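
For anyone who wants to check the octal arithmetic, here is a minimal sketch. The function and constant names are mine, purely for illustration. It assumes, as above, that the text starts at VA:0 in the segment described by UISA0, and that the KT11 address registers count in units of 0100 (64.) bytes.

```python
HEADER_BYTES = 0o20    # a.out header at the start of the command file
CLICK_BYTES = 0o100    # UISA0 holds the segment base in 64-byte units

def text_pa(file_offset, uisa0):
    """Physical address where a byte of pure text should land."""
    va = file_offset - HEADER_BYTES      # file offset 04220 -> VA 04200
    return uisa0 * CLICK_BYTES + va      # base of text segment + VA

expected = text_pa(0o4220, 0o1614)
observed = 0o171600                      # where the bits actually turned up
print(oct(expected), oct(observed - expected))   # prints: 0o165600 0o4000
```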

    > Maybe a comparison of addresses where the bits should be, with
    > addresses where the "bad" copy ends up, could point us at some particular
    > failure modes to check in the KT11, CPU, or RK11...

Here's where it gets 'interesting'.

Executing a command with pure text on V6 is a very complicated process. The
shell fork()s a copy of itself, and does an exec() system call to overlay
the entire memory in the new process with a copy of the command (which sounds
fairly simple, at a high level) - but the code path to do the exec() with a
pure text is incredibly hairy, in detail. In particular, for a variety of
reasons, the memory of the process can get swapped in and out several times
during that. I apparently used to understand how this all worked; see this
message:

  https://minnie.tuhs.org/pipermail/tuhs/2018-February/014299.html

but it's so complicated it's going to take a while to really comprehend
it again. (The little grey cells are aging too, sigh...)

The interesting point is that when V6 first copies the text in from the file
holding the command (using readi(), Lions 6221 for anyone who's masochistic
enough to try and actually follow this :-), it reads it in starting from the
bottom, one disk block at a time (since in V6, files are not stored
contiguously).

So, if it starts from the bottom, and copies the wrong thing from low in the
file _up_ to VA:010200, when it later gets to VA:010200 in the file contents,
that _should_ over-write the stuff that got put there in the wrong place
_earlier_. Unless there's _another_ problem which causes that later write
to _also_ go somewhere wrong...
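
The overwrite argument can be shown with a toy model. This is only a sketch under simplifying assumptions of mine, not what readi() actually does: the text is copied in 01000-byte blocks aligned on block boundaries, a 'bad' write lands exactly 04000 higher, and the text is a hypothetical 16. blocks long.

```python
BLOCK = 0o1000    # 512-byte disk block
SKEW = 0o4000     # the observed displacement

def load_text(nblocks, bad_vas):
    """Copy blocks bottom-up; a write to a VA in bad_vas lands SKEW higher.

    Returns a map from destination VA to the file block that ended up there.
    """
    mem = {}
    for n in range(nblocks):
        va = n * BLOCK
        dest = va + SKEW if va in bad_vas else va
        mem[dest] = n                    # a later write replaces an earlier one
    return mem

# One misplaced write: the later, correct write to VA:010000 repairs it.
one_fault = load_text(16, {0o4000})
# Two misplaced writes: the wrong data at VA:010000 survives.
two_faults = load_text(16, {0o4000, 0o10000})
print(one_fault[0o10000], two_faults[0o10000])   # prints: 8 4
```

In the first case VA:010000 ends up holding the right block (8); only when the later write also goes astray does the misplaced block (4) stay there.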

So, I'm not sure when this trashage is happening, but because of the above,
my _guess_ is that it's in one of the two swap operations on the text (out,
and then back in). (Although it might be interesting to look at PA:165600 and
see what's actually _there_.) Unix does swapping of pure texts in a single,
multi-block transfer (although not always as an integral number of blocks, as
we found out the hard way with the QSIC :-).

So my suspicions have now switched back to the RK11... One way to proceed
would be to stop the system after the pure text is first read in (say around
Lions 4465), and look to see what the text looks like in main memory at
_that_ point. (This will require looking at KT11 registers to see where it's
holding the text segment, first.)

If that all looks good, we'll have to figure out how to stop the system
after the pure text is read back in (which does not happen in exec(),
it's done by the normal system operation to swap in the text and data
of a process which is ready to run).

We could also stop the system after the text is swapped out, and key in
a short (~ a dozen words) program to read the text back in from the
swap device, and examine it - although we'd have to grub around in the
system a bit to figure out where it got written to. (It might just be
easier to stop it at, say, Lions 5196 and look at the arguments on the
kernel stack.)
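
As a rough aid for that keyed-in program, here is a sketch that works out the values it would have to load into the RK11 registers. It is written from memory of the V6 RK driver (block number split as cylinder/surface above a 12-sector field) and of the usual RK11 register assignments, so treat the layout as an assumption to check against the RK11 manual. The block number, buffer address and word count passed in are hypothetical placeholders; the real ones have to come from the system.

```python
SECTORS_PER_TRACK = 12

def rkda(block, drive=0):
    """Disk address register value for a block number (assumed V6 layout)."""
    track = block // SECTORS_PER_TRACK       # cylinder and surface together
    sector = block % SECTORS_PER_TRACK
    return (drive << 13) | (track << 4) | sector

def read_setup(block, pa, nwords, drive=0):
    """(register address, value) pairs, in the order they would be loaded."""
    assert pa < 0o200000    # fits in 16 bits, so extended-address bits stay 0
    return [
        (0o777412, rkda(block, drive)),       # RKDA: where on the disk
        (0o777410, pa),                       # RKBA: main memory address
        (0o777406, (-nwords) & 0o177777),     # RKWC: negative word count
        (0o777404, 0o5),                      # RKCS: read function + GO
    ]

# Hypothetical example: one block from swap block 13 into PA:161400.
for reg, val in read_setup(13, 0o161400, 256):
    print(oct(reg), oct(val))
```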

    > a suggestion here to check the KT11 address translation adders ... A
    > bug in one of the carry lookahead generators used between the bit
    > slices of that adder could cause a mistranslation on only a fairly
    > selective subset of virtual addresses

This could be happening, but from the reasoning above about the order that
the blocks of the text are read in, something would have to interfere with
the later read of the higher memory blocks, too, no? So I'd discount the KT11
_for the moment_.

    > *IF* that's the case and we can chase the IR trace upstream to the
    > place of an unlucky mistranslation, it will be pretty easy to track
    > down then in the hw and fix.

It'll be interesting to look at the text after it's read in (i.e. before it's
swapped out). If it's OK there, that's pretty conclusive that it _can't_ be
the KT11 - because from then on, the kernel doesn't _do anything_ to that
binary, except swap it out and in with the RK11. And since those are both
single I/O operations (with swapping on the RK11, at least, which can do
multi-block transfers), _and_ the bottom of the text segment comes in OK (so
the RK11 is being set up with correct disk and main memory addresses for both
the out and in), I can't think of a fault _elsewhere_ in the system that
could cause that 'stuff winds up in the wrong place' error.

I know this is complicated, but look at the bright side: we started with
three apparently un-connected problems:

* R5 trashage
* an 'impossible' MM fault
* bad text data

The first one turned out to be non-existent (my fault in interpreting the
kernel stack in the process core dump), the second was also not really there
(although a hardware fault in the console gave us bad data, so there really
was a hardware issue there), and now we're down to one - albeit a tricky one.

So we were dealing with two un-related hardware problems - now we're down to
one, and hopefully soon will have it isolated to a single sub-system!

(And thanks to whoever gave us the voltage tip, that fixed the first one.)

	Noel