Tony Duell <ard(a)p850ug1.demon.co.uk> wrote:
The PERQ 1a and later have a very interesting feature: a register index
register (!). These machines have 256 processor registers on the CPU
card. A microinstruction selects which registers to use by a couple of
8-bit fields in the microcode word. But on the 1a and later, there's an
8-bit register on the CPU card that's ORed with the register address from
these microcode fields if the instruction references one of the first 64
registers.
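
A minimal sketch of the PERQ indexing rule as described above. The function and parameter names are made up for illustration; only the OR-into-the-first-64-registers behaviour comes from the description.

```python
def perq_register(field: int, index_register: int) -> int:
    """Return the effective register number (0..255) for one microcode field.

    `field` is the 8-bit register address from the microcode word and
    `index_register` is the 8-bit index register on the CPU card.
    Both names are hypothetical.
    """
    if field < 64:
        # Only references to the first 64 registers are modified.
        return (field | index_register) & 0xFF
    return field
```

With an index register of 0x40, for example, a microinstruction that names register 3 would end up at register 0x43, while one that names register 0x80 is left alone.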
The AMD 29000 family of RISC processors has something similar, but arguably
even better. It's also similar to register windows as used on processors
like the SPARC, but it's more flexible.
They also use 8-bit fields for register access. Only a few special registers
are in the range of 0..63. 64..127 are global registers. But registers
128..255 are local registers, and the low seven bits of one of the special
registers, the local register pointer, are added to the low seven bits of any
register address in that range.
The high 25 bits of the local register pointer are not used by the processor,
but they play a key role in the software.
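
A rough sketch of that address calculation, following the simplified description above. The function name is made up, and the wrap-around within the 128 local registers is an assumption; the exact bit positions used by real hardware should be checked against the AMD manual.

```python
def physical_register(field: int, local_register_pointer: int) -> int:
    """Map an 8-bit register field to a physical register number.

    Fields below 128 (special and global registers) are used as-is.
    Fields 128..255 are offset by the low seven bits of the local
    register pointer, wrapping within the local registers (assumed).
    """
    if field < 128:
        return field
    offset = (field & 0x7F) + (local_register_pointer & 0x7F)
    return 128 + (offset % 128)
```

So with a pointer whose low seven bits are 5, register field 130 lands on physical register 135, and the high bits of the pointer make no difference to the result.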
The local registers are used as a cache of the top stack frames. Only
complete stack frames can reside in the local registers, which means that
the maximum size of a stack frame is 128 32-bit words.
When a non-leaf function is called, the function prolog tests whether there are
a sufficient number of local registers available for its frame. If not,
one or more of the oldest stack frames are spilled to memory to make room.
Then the local register pointer is updated so that the new stack frame starts
at local register 128.
Note that even stack frames in the local registers have a corresponding area
in memory associated with them, into which they might get spilled by another
function call. Thus the local register pointer always contains the full
32-bit address of the memory assigned to the current stack frame.
The really clever thing that they did in the 29000 was that the function
prolog was only about three instructions long. One of the instructions
was an assert, which would cause a trap if a spill was necessary. But if
no spill was necessary, the assert took only a single CPU cycle.
Similarly, there is an epilog on function exit containing an assert that
will trap to fill an old stack frame back from memory into the local registers
if necessary.
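
The prolog and epilog checks can be modelled with nothing more than three pointers. This is a sketch under assumptions, not AMD's code: the names rsp, rab and rfb (frame pointer, allocate bound, free bound), the use of word addresses, and a stack that grows toward lower addresses are all choices made for the illustration.

```python
WINDOW = 128  # number of local registers, in 32-bit words


def prolog(rsp, rab, rfb, frame_words):
    """Allocate a frame; return (rsp, rab, rfb, words_spilled)."""
    if frame_words > WINDOW:
        raise ValueError("a frame cannot exceed the local register file")
    rsp -= frame_words            # assumed: stack grows toward lower addresses
    if rsp >= rab:                # the assert holds: fast path, no trap
        return rsp, rab, rfb, 0
    spilled = rab - rsp           # trap: spill only as many words as needed
    return rsp, rsp, rfb - spilled, spilled


def epilog(rsp, rab, rfb, frame_words, caller_words):
    """Release a frame; return (rsp, rab, rfb, words_filled)."""
    rsp += frame_words
    caller_top = rsp + caller_words   # first word above the caller's frame
    if caller_top <= rfb:             # the assert holds: caller is in registers
        return rsp, rab, rfb, 0
    filled = caller_top - rfb         # trap: bring back the rest of the caller
    return rsp, caller_top - WINDOW, caller_top, filled
```

In this model the common case costs one comparison, and the pointers only move when the comparison fails, which is the point of putting the test in an assert instruction.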
The downside of all this is that if you need to do a context switch, you
have to save a huge number of registers.
There was a clever hack to avoid needing to save too many registers in
interrupt handlers. Note that an interrupt can occur during a function prolog
or epilog, though not in the spill or fill traps. But because the function
prolog and epilog are manipulating various pointers used by the spill/fill
mechanism, when the interrupt occurs those registers may be in an inconsistent
state. Daniel Mann of AMD wrote an application note in which he described a
method for the interrupt handler to detect whether a prolog or epilog had been
interrupted, and deal with it accordingly. But unfortunately I was not able
to get it to work. It turns out that if you don't need the interrupt handler
to be able to cause a context switch (i.e., you're not writing a preemptive OS
kernel), you can ignore much of this complexity. The interrupt handler's
assembly language glue can simply save the local register pointer, and load it
with a new value that points to an interrupt stack, and makes it look like the
local registers contain a maximum-sized stack frame, which I'll call the
pseudo-frame. Note that the pseudo-frame may actually contain any number of
user frames, and that they are not necessarily aligned in any particular order
since the local register pointer has been changed.
Now when the interrupt happens, the interrupt handler's function prolog will
have to cause a spill in order to make room for its own stack frame. But
because of the way the stack has been swapped, the needed registers will
spill not to the task's stack, but the interrupt stack.
Note that a spill does not necessarily write an entire stack frame to memory,
but only enough registers to make room for the new frame. However, a fill
has to make sure that at least one complete frame is available.
When the interrupt handler returns, its epilog will fill the pseudo-frame
back into the local registers, then the assembly language glue will reload
the original value of the local register pointer.
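
Here is a toy model of the pseudo-frame trick, to show why the task's registers come back intact. Everything in it is an assumption made for illustration: the class and function names, the rsp/rab/rfb pointer names, word addressing, a downward-growing stack, and the choice to save all three pointers in the glue rather than just one.

```python
WINDOW = 128  # local registers, in words


class Machine:
    def __init__(self, stack_top):
        self.regs = [0] * WINDOW          # physical local registers
        self.mem = {}                     # word address -> spilled value
        self.rsp = self.rfb = stack_top   # frame pointer and free bound
        self.rab = stack_top - WINDOW     # allocate bound
        self.frames = []                  # frame sizes, innermost last

    def prolog(self, words):
        self.rsp -= words
        self.frames.append(words)
        if self.rsp < self.rab:           # spill trap
            new_rfb = self.rfb - (self.rab - self.rsp)
            for addr in range(new_rfb, self.rfb):
                self.mem[addr] = self.regs[addr % WINDOW]
            self.rfb, self.rab = new_rfb, self.rsp

    def epilog(self):
        self.rsp += self.frames.pop()
        caller_top = self.rsp + (self.frames[-1] if self.frames else 0)
        if caller_top > self.rfb:         # fill trap
            for addr in range(self.rfb, caller_top):
                self.regs[addr % WINDOW] = self.mem[addr]
            self.rfb, self.rab = caller_top, caller_top - WINDOW


def interrupt(m, int_stack_top, handler_words):
    """Glue: swap to the interrupt stack, run a handler, swap back."""
    saved = (m.rsp, m.rab, m.rfb, m.frames)
    # Pretend the registers hold one maximum-sized frame on the interrupt stack.
    m.rfb = int_stack_top
    m.rsp = m.rab = int_stack_top - WINDOW
    m.frames = [WINDOW]
    m.prolog(handler_words)               # spills to the interrupt stack
    for i in range(handler_words):        # the handler clobbers its own frame
        m.regs[(m.rsp + i) % WINDOW] = -1
    m.epilog()                            # fills the pseudo-frame back
    m.rsp, m.rab, m.rfb, m.frames = saved
```

In this model a handler with a 10-word frame spills exactly 10 registers, all of them to addresses on the interrupt stack, and the physical registers are identical before and after the call to interrupt().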
Another amusing feature of the 29000 is that some of AMD's software used
global register 4. I was stunned when I noticed this, because the
documentation denies the existence of global register 4. My own attempts to
use global register 4 failed. It turns out that the 29000, like most RISC
processors (and even some CISCs), has a feed-forward data path: if the
result of one instruction is used by the next instruction, that data is fed
directly from the ALU output back to its input. Otherwise a
pipeline stall would be necessary so that the second instruction wouldn't
read stale data from the register file.
Because of this feed-forward mechanism, it is possible to use any unimplemented
register as a single-cycle temporary register. But this is only safe when
interrupts are disabled, because if an interrupt were to happen after the
first instruction of the pair, the data for the unimplemented register can't
be saved.
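
A toy simulation of the effect, not a model of the real pipeline. The instruction format, the function name, and the choice to have an unimplemented register read back as 0 are all assumptions; the interrupt is modelled simply as losing the forwarded value.

```python
def run(program, unimplemented=frozenset({4}), interrupt_after=None):
    """Execute (op, dest, a, b) tuples and return the register file as a dict."""
    regs = {}        # implemented registers only
    forward = None   # (register, value) produced by the previous instruction

    def read(reg):
        if forward is not None and forward[0] == reg:
            return forward[1]        # bypass: value comes from the ALU output
        return regs.get(reg, 0)      # assumed: missing registers read as 0

    for i, (op, dest, a, b) in enumerate(program):
        if op == "const":
            result = a
        elif op == "add":
            result = read(a) + read(b)
        else:
            raise ValueError(f"unknown op {op!r}")
        if dest not in unimplemented:
            regs[dest] = result      # write-back is dropped for missing registers
        forward = (dest, result)
        if i == interrupt_after:
            forward = None           # the handler's instructions replace it
    return regs
```

Running the pair [("const", 4, 21, None), ("add", 64, 4, 4)] leaves 42 in register 64, but with interrupt_after=0 the second instruction sees only the stale value and register 64 ends up as 0.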
Eric