On Jun 13, 2024, at 6:22 PM, Jonathan Stone via cctalk
<cctalk(a)classiccmp.org> wrote:
On Thursday, June 13, 2024 at 03:00:22 PM PDT, Maciej W. Rozycki via cctalk
<cctalk(a)classiccmp.org> wrote:
The architecture designers cheated, however, even in the original ISA,
in that moves from the MD accumulator did interlock. I guess they
figured people (either doing it by hand or by writing a compiler)
wouldn't get that right anyway. ;)
I always assumed that was because the latency of multiply, let alone divide, was far too
many cycles for anyone to plausibly schedule "useful" instructions into.
Wasn't R4000 divide latency over 60 cycles?
Probably, because divide is inherently an iterative operation, and is usually implemented
to produce one bit of result per cycle. A notable exception is the CDC 6600, which throws
a whole lot of logic at the problem to produce two bits of result per cycle. The usual
divide amounts to a trial subtraction and shift; the 6600 implementation does THREE trial
subtractions concurrently. Not cheap when you're using discrete transistor logic.
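
For reference, here is a minimal C sketch of that trial-subtraction-and-
shift scheme, producing one quotient bit per iteration -- the software
analogue of one bit of result per cycle in hardware. (The function name
is mine, purely for illustration, not from any real implementation.)

#include <stdint.h>
#include <stdio.h>

/* Restoring division: one trial subtraction, hence one quotient bit,
   per loop iteration.  A 32-bit divide takes 32 iterations, which is
   why hardware built this way needs roughly one cycle per result bit.
   Caller must ensure divisor != 0. */
static uint32_t udiv32(uint32_t dividend, uint32_t divisor, uint32_t *rem)
{
    uint64_t r = 0;   /* partial remainder; 64-bit so the shift can't overflow */
    uint32_t q = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((dividend >> i) & 1);  /* shift in next dividend bit */
        if (r >= divisor) {                    /* trial subtraction succeeds */
            r -= divisor;
            q |= 1u << i;                      /* record a 1 quotient bit */
        }
    }
    *rem = (uint32_t)r;
    return q;
}

int main(void)
{
    uint32_t r, q = udiv32(100, 7, &r);
    printf("100 / 7 = %u remainder %u\n", q, r);  /* prints 14 remainder 2 */
    return 0;
}

Producing two bits per step, as on the 6600, amounts to resolving among
several trial subtractions in each iteration, which is where all the
extra logic goes.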
Multiply is an entirely different matter: it can be done in a few cycles
if you throw enough logic at the problem. Signal processors are an
extreme example of this, because multiply/add sequences are the essence
of what they need to do. This is also why Alpha omitted integer divide
entirely and made programs multiply by the reciprocal instead.
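
To make the reciprocal trick concrete, here is a sketch (my own example,
not actual Alpha compiler output) of unsigned division by the constant
10 turned into a multiply by a precomputed fixed-point reciprocal:

#include <stdint.h>
#include <stdio.h>

/* Divide by the constant 10 without a divide instruction: multiply by
   ceil(2^35 / 10) = 0xCCCCCCCD and shift the 64-bit product right by
   35.  This magic constant is exact for every 32-bit unsigned
   dividend. */
static uint32_t div10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

int main(void)
{
    for (uint32_t n = 0; n < 100000; n++)
        if (div10(n) != n / 10)
            printf("mismatch at %u\n", n);
    printf("div10(12345) = %u\n", div10(12345));  /* prints 1234 */
    return 0;
}

The compiler only needs to know the divisor at compile time to pick the
magic constant, so the expensive divide never happens at run time.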
The best argument for doing interlocking in the hardware isn't that it's
hard for software to get right. Code generators can do it, and that's a
one-time effort. But the required delays often depend on things that are
not known at compile time, for example load/store latencies, or whether
branches are taken. Run-time interlocks deal with the actual conflicts
as they occur, while compiler or programmer conflict avoidance has to
assume the worst case.
paul