Pipelining and DEC Jupiter thoughts...

Thu May 6 21:35:14 CDT 2021

> Sort of.  But while a lot of things happen in parallel, out of order, speculatively, etc., the programming model exposed by the hardware still is the C sequential model.  A whole lot of logic is needed to create that appearance, and in fact you can see that all the way back in the CDC 6600 "scoreboard" and "stunt box".  Some processors occasionally relax the software-visible order, which tends to cause bugs, create marketing issues, or both -- Alpha comes to mind as an example.

Interesting to see this.

I've been reading a lot recently about the Jupiter/Dolphin project and 
the more I read the more I understand why it just could not be done. At 
the time (and to an extent even now) the only way to really improve a 
system's performance was to pipeline the processor, and the PDP-10 
instruction set just didn't lend itself to that.

They had a great concept: an instruction fetch/decode system (IBOX), an 
execution engine (EBOX), the obligatory vector processor or FPU (HBOX) 
and of course the memory system (MBOX). Break the process up into steps 
and have the parts all work in parallel to boost performance.

Unfortunately they started to find way too many cases where the IBOX 
would fetch an indirect instruction whose address depended on an AC 
that was being changed by another instruction still in the EBOX. This 
would blow out all the prefetched work in the pipe, forcing the IBOX to 
do a costly reload.
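
As a rough illustration of why the IBOX needs the AC contents so early, 
here is a minimal Python sketch of PDP-10 effective address calculation 
for the non-extended (section 0, 18-bit) case only. The helper names 
(make_word, fields, effective_address), the dict-based memory and the 
chain limit are assumptions made for this example, not anything taken 
from the Jupiter design.

```python
# Toy model of PDP-10 effective address calculation (section 0 only).
# 36-bit word, bit 0 is the most significant bit:
#   opcode bits 0-8, AC bits 9-12, I bit 13, X bits 14-17, Y bits 18-35
# The ACs are modeled as memory locations 0-15 (0-17 octal).

MASK18 = 0o777777


def make_word(opcode, ac, i, x, y):
    """Pack instruction fields into a 36-bit word (hypothetical helper)."""
    return (opcode << 27) | (ac << 23) | (i << 22) | (x << 18) | (y & MASK18)


def fields(word):
    """Return the (I, X, Y) fields used for address calculation."""
    return (word >> 22) & 1, (word >> 18) & 0o17, word & MASK18


def effective_address(word, memory, max_steps=64):
    """Follow index and indirect fields until the I bit is clear."""
    i, x, y = fields(word)
    for _ in range(max_steps):  # max_steps is an arbitrary guard for the toy
        e = y
        if x != 0:
            # Indexing: add the right half of AC X to Y, modulo 2**18.
            e = (e + (memory.get(x, 0) & MASK18)) & MASK18
        if not i:
            return e
        # Indirection: fetch the word at E and repeat with its I, X, Y.
        i, x, y = fields(memory.get(e, 0))
    raise RuntimeError("indirect chain too long")


# MOVE 1,@1000(5): indexed by AC 5, then indirect.
memory = {
    5: 0o100,
    0o1100: make_word(0, 0, 0, 0, 0o2000),
    0o1200: make_word(0, 0, 0, 0, 0o3000),
}
word = make_word(0o200, 1, 1, 5, 0o1000)

before = effective_address(word, memory)
memory[5] = 0o200  # an earlier instruction finally writes AC 5
after = effective_address(word, memory)
```

The same instruction word resolves to a different address once AC 5 
changes, so any address the fetch stage computed ahead of that write is 
stale and the prefetched work behind it has to be thrown away.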

Likewise, branch prediction couldn't be done well because most branches 
and skips depended on the value in the AC, which was once again usually 
being modified in the EBOX down the pipe. As soon as it was modified the 
pipe had to be flushed and reloaded. It looks like they tried to put 
that logic into the IBOX to catch these issues, but that resulted in a 
flat processor that wasn't going to benefit from any parallelism, an 
endless series of bugs, and an IBOX that was pretty much running with 
its own EBOX.
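
The flush problem described above can be sketched with a toy model. 
This is only an illustration under stated assumptions: the program 
representation, the count_flushes name and the fixed pipeline depth are 
all invented for the example, and a real design would stall or forward 
in some cases rather than flush every time.

```python
# Toy model: count how often the front end has used an AC (for indexing,
# indirection or a skip/branch test) that an older, still in-flight
# instruction has not yet written.


def count_flushes(program, depth=3):
    """program is a list of (dest_ac, early_reads) tuples.

    dest_ac     -- AC number written at execute time, or None
    early_reads -- set of AC numbers the fetch/decode stage needs
    depth       -- assumed number of older instructions still in flight
    """
    flushes = 0
    for n, (_, early_reads) in enumerate(program):
        in_flight = program[max(0, n - depth):n]
        pending_writes = {dest for dest, _ in in_flight if dest is not None}
        if early_reads & pending_writes:
            # The front end worked from a stale AC value.
            flushes += 1
    return flushes


program = [
    (1, set()),   # writes AC 1
    (None, {1}),  # skip tested on AC 1 while the write is in flight
    (2, set()),   # writes AC 2
    (None, {3}),  # indexes through AC 3, which nobody is writing
]
flush_count = count_flushes(program)
```

In this example only the second instruction conflicts, so the count is 
1. Code in which nearly every instruction tests or indexes through an 
AC written just before it would push the count toward one flush per 
instruction, which is the "flat processor" outcome.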

It got worse when they realized that the extended memory segments in the 
2060 architecture totally wrecked the concept of an instruction 
decoder/execution box. There were just too many places where an indirect 
reference into another section, itself based on the ACs, would result in 
the IBOX tossing the queue and invalidating the translation buffers. 
Increasing the translation buffer helped (I think that's one of the 
things they did on the final 2065 to make it faster) but they couldn't 
make it big and fast enough. I guess an indirect jump instruction based 
on comparing the AC to an indirect address pointing to an extended 
segment would be enough to make any decoder just cry.

It's sad to read; you can almost see them realizing it was doomed. The 
Foonly F1 was a screamer, but it was basically the KA10 instruction set 
and couldn't run extended memory segments like the 2060. And when they 
tried to do the same thing with the F4 it came out to be a little slower 
than a 2060. I used to think they put only one extended segment in the 
2020 to cripple the box, but maybe they started running into the same 
problem and ran out of microcode space to try and address it.

Couple this with the fact that many of the 20-series programs were built 
in assembler (and why not, it was an amazing thing to program) and you 
just had too many programs with cool bespoke code that would totally 
trash a pipeline. Fixing compilers to order instructions properly could 
have worked, but since people just wrote in assembler it wasn't going to 
happen, and they weren't about to re-code their apps to please the new 
scheduler God.

The VAX instruction set was a lot less beautiful, but it could be 
pipelined more easily, especially with the dedicated MMU, so they took 
the people and pipelined the hell out of the 780, resulting in the nifty 
8600/8650 and later the 8800s. DEC learned their lesson when they built 
Alpha, and even Intel realized that their instruction set needed to be 
pipelined for the Pentium Pro and later processors.

Ah well. I don't think it was evil marketing or VAX monsters that killed 
the KC10; it was simply the fact that the amazing instruction set 
couldn't be pipelined to make it more efficient for hardware, and the 
memory management system wasn't as efficient as the PDP-11/VAX MMU 
concept.