[cctalk] Re: Delay slots, was: Re: Re: early microprocessor limited pipelining [was: Intel 8086 - 46 yrs. ago]

15 Jun 2024

On 6/15/24 12:39, Paul Koning wrote:
...

 I learned it from OS code reading and adopted some of it for my own work, but not much
because I actually only worked on the 6500 -- which doesn't have multiple functional
units.
 Writing good code for those machines was further complicated by the fact that
instructions were either 1/4 or 1/2 word long, could not split across word boundaries, and
branches would only go to the start of the word.  So there tended to be NOPs to pad out
the word, which the assembler would supply.  Avoiding them would make the code go faster
and of course make it smaller.
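As a rough illustration of the packing rules described above (instructions of 1/4 or 1/2 word, no splitting across word boundaries, branch targets only at the start of a word), here is a minimal Python sketch that counts the padding NOPs for a given instruction sequence. The function name `pack` and the tuple layout are invented for this example, and it models only the rules mentioned in this message, not everything a real assembler did.

```python
PARCELS_PER_WORD = 4  # a word holds four quarter-word parcels


def pack(instrs):
    """Count words and padding NOPs for a sequence of instructions.

    instrs is a list of (size, is_branch_target) pairs, where size is
    1 (quarter word) or 2 (half word). Hypothetical helper for illustration.
    """
    used = 0   # parcels filled in the current word
    words = 0  # completed words
    nops = 0   # quarter-word NOPs inserted as padding
    for size, is_target in instrs:
        room = PARCELS_PER_WORD - used
        # Start a new word if the instruction will not fit, or if it is
        # a branch target and the current word is already partly filled.
        if used and (is_target or size > room):
            nops += room
            words += 1
            used = 0
        used += size
        if used == PARCELS_PER_WORD:
            words += 1
            used = 0
    if used:  # pad out the final, partly filled word
        nops += PARCELS_PER_WORD - used
        words += 1
    return words, nops


original = [(1, False), (2, False), (2, False), (1, True), (1, False)]
reordered = [(2, False), (2, False), (1, True), (1, False), (1, False)]
print(pack(original))   # (3, 5)
print(pack(reordered))  # (2, 1)
```

The second sequence is a hypothetical reordering of the same instruction sizes; whether such a reordering is legal depends on the actual code, but it shows how much padding the ordering alone can cost.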
 The other complication was a fairly limited set of registers, and the fact that loads
would go only to X1..X5 while stores could only come from X6 or X7.  So a memcpy would
involve a register to register transfer.  That takes 3 cycles on a 6600, so a skillful
memcpy implementation would use two load registers, both store registers, and two separate
functional units for the R-R move (one via the "boolean" unit and one via the
"shift" unit).  I remember my bafflement the first time I saw a shift (by zero)
used to do just a register to register move; on a 6500 you wouldn't have any reason to
write that.
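The point of using two different functional units for the two register moves can be sketched with a toy issue model in Python. This is a deliberately simplified assumption, not a 6600 timing simulator: it assumes one instruction issues per cycle at most, that a functional unit stays busy for the full 3 cycles quoted above, and that an instruction waits if its unit is busy. The name `issue_cycles` is made up for the example.

```python
MOVE_LATENCY = 3  # cycles for a register-to-register move, per the text


def issue_cycles(units):
    """Return the cycle at which each move issues, in program order.

    units names the functional unit used by each move, for example
    ["boolean", "shift"]. Toy model: at most one issue per cycle, and a
    unit cannot accept a new instruction until its previous one is done.
    """
    free_at = {}  # unit name -> first cycle at which it is free again
    cycle = 0
    issued = []
    for unit in units:
        cycle = max(cycle, free_at.get(unit, 0))  # wait for a busy unit
        issued.append(cycle)
        free_at[unit] = cycle + MOVE_LATENCY
        cycle += 1  # the next instruction can issue one cycle later at best
    return issued


print(issue_cycles(["boolean", "boolean"]))  # [0, 3]
print(issue_cycles(["boolean", "shift"]))    # [0, 1]
```

Under these assumptions, two moves through the same unit serialize, while a boolean copy and a shift by zero can issue on consecutive cycles, which is why the odd-looking shift shows up in 6600 copy loops.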
 I once crashed the PLATO system in mid-day, when the load hit peak (600 users logged on)
because I had slowed down a critical terminal output processing step and the machinery
didn't have flow control there.  My bosses were NOT happy.  I solved the issue by
cleaning up that block of code to avoid all NOPs; the result was that it was both shorter
and faster than the previous version while still delivering the new feature.  :-)

At CDC SSD SVLOPS, it was all big gummint stuff, so we had
clusters of Cyber 74s and 73s (6600/6400) linked with a few million
words of ECS (we had a QSE that expanded it to 4M words).  6600/Cyber 74
programming was the rule.  A short loop was considered to be optimal, if
it kept the instruction issue to 1/cycle and kept the whole thing "in
stack" (basically an 8-word buffer, not really a cache) to avoid
accessing CM for instructions.  Lots of bit-twiddling fun!

The 6600 had an interesting feature we called "shortstop" where the
result of an operation was available for use by a subsequent instruction
1 cycle before it materialized in a register.

On early 6600s, there was a so-called "store out of order" problem where
two closely-timed stores to the same location would result in the
earlier result overwriting the later ones.  An ECO fixed that--it was
pretty fundamental.

STAR initially mapped the user's low-memory to the 256-word register
file, such that one could have vectors occupying several registers
addressed by memory location, while referring to the registers by
register number.  That apparently resulted in some serious issues,
solved eventually by simply locking out access to the first 16Kbits
(recall that the STAR is bit-addressed) of memory.  The so-called "Rev
R" ECO, if my mind isn't playing tricks on me.
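The 16Kbit figure follows from the size of the register file. A minimal Python sketch of the arithmetic, assuming 64-bit STAR words (the word size is not stated in this message, but it is consistent with the 16Kbit figure); the helper names are invented for the example:

```python
REGISTERS = 256      # size of the register file, per the text
BITS_PER_WORD = 64   # assumption: STAR word size, not stated in the text

# Bit addresses below this limit overlap the register file.
SHADOWED_BITS = REGISTERS * BITS_PER_WORD


def overlaps_register_file(bit_address):
    """True if a bit address falls in the range shadowed by the registers."""
    return 0 <= bit_address < SHADOWED_BITS


def register_number(bit_address):
    """Register that a shadowed bit address would map to."""
    if not overlaps_register_file(bit_address):
        raise ValueError("address is outside the register-mapped range")
    return bit_address // BITS_PER_WORD


print(SHADOWED_BITS)          # 16384, i.e. 16 Kbits
print(register_number(64))    # 1
```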
CDC had a pretty close relationship with Fairchild during this time;
initially for the silicon transistors in the 6600 and later the register
file for the STAR.

Fun times!
--Chuck
