Pipelining and Dec Jupiter thoughts....

Fri May 7 04:55:24 CDT 2021

http://bitsavers.informatik.uni-stuttgart.de/pdf/dec/pdp10/

Has a *lot* of stuff. As I start cranking up my brain to drag BLT out of
the shed and start working on it, I'm finding this stuff to be a serious
refresher.

C

On 5/7/2021 2:06 AM, Lee Courtney wrote:
> Chris - great and interesting overview. Do you have a reading list for 
> more details? Thanks!
>
> Lee Courtney
>
> On Thu, May 6, 2021 at 7:35 PM Chris Zach via cctalk
> <cctalk at classiccmp.org> wrote:
>
>
>     > Sort of.  But while a lot of things happen in parallel, out of
>     order, speculatively, etc., the programming model exposed by the
>     hardware still is the C sequential model.  A whole lot of logic is
>     needed to create that appearance, and in fact you can see that all
>     the way back in the CDC 6600 "scoreboard" and "stunt box".  Some
>     processors occasionally relax the software-visible order, which
>     tends to cause bugs, create marketing issues, or both -- Alpha
>     comes to mind as an example.
>
>     Interesting to see this.
>
>     I've been reading a lot recently about the Jupiter/Dolphin project,
>     and the more I read, the more I understand why it just could not be
>     done. At the time (and to an extent even now) the only way to really
>     improve a system's performance was to pipeline the processor, and
>     the PDP-10 instruction set just wasn't easy to do that with.
>
>     They had a great concept: an instruction fetch/decode system
>     (IBOX), an execution engine (EBOX), the obligatory vector processor
>     or FPU (HBOX), and of course the memory system (MBOX). Break the
>     process up into steps and have the parts all work in parallel to
>     boost performance.
>
>     Unfortunately they started to find way too many cases where an
>     indirect instruction would be fetched that would be based on the AC,
>     which was being changed by another instruction in the EBOX. This
>     would blow out all the prefetched work in the pipe, forcing the IBOX
>     to do a costly reload.
>
>     Likewise branch prediction couldn't be done well because most
>     branches and skips depended on the value in the AC, which was once
>     again usually being modified in the EBOX down the pipe. As soon as
>     it was modified the pipe had to be flushed and reloaded. It looks
>     like they tried to put that logic into the IBOX to catch these
>     issues, but that resulted in a flat processor that wasn't going to
>     benefit from any parallelism, an endless series of bugs, and an IBOX
>     that was pretty much running with its own EBOX.
>
>     It got worse when they realized that the extended memory segments in
>     the 2060 architecture totally wrecked the concept of an instruction
>     decoder/execution box. There were just too many places where an
>     indirect instruction to another section, which was then based on the
>     ACs, would result in the IBOX tossing the queue and invalidating the
>     translation buffers. Increasing the translation buffer helped (I
>     think that's one of the things they did on the final 2065 to make it
>     faster) but they couldn't make that big and fast enough. I guess an
>     indirect jump instruction based on comparing the AC to an indirect
>     address pointing to an extended segment would be enough to make any
>     decoder just cry.
>
>     It's sad to read; you can almost see them realizing it was doomed.
>     The Foonly F1 was a screamer, but it was basically the KA10
>     instruction set and couldn't run extended memory segments like the
>     2060. And when they tried to do the same thing with the F4 it came
>     out to be a little slower than a 2060. I used to think they put only
>     one extended segment in the 2020 to cripple the box, but maybe they
>     started running into the same problem and ran out of microcode space
>     to try and address it.
>
>     Couple this with the fact that many of the 20 series programs were
>     built in assembler (and why not, it was an amazing thing to program)
>     and you just had too many programs with cool bespoke code that would
>     totally trash a pipeline. Fixing compilers to order instructions
>     properly could have worked, but since people just wrote in
>     assembler, it wasn't going to happen, and they weren't about to
>     re-code their app to please the new scheduler God.
>
>     The VAX instruction set was a lot less beautiful, but could be
>     pipelined more easily, especially with the dedicated MMU, so they
>     took the people and pipelined the hell out of the 780, resulting in
>     the nifty 8600/8650 and later the 8800s. DEC learned their lesson
>     when they built Alpha, and even Intel realized that their
>     instruction set needed to be pipelined for the Pentium Pro and above
>     processors.
>
>     Ah well. I don't think it was evil marketing or VAX monsters that
>     killed the KC10; it was simply the fact that the amazing instruction
>     set couldn't be pipelined to make it more efficient for hardware and
>     the memory management system wasn't as efficient as the PDP-11/VAX
>     MMU concept.
>
>
>
> -- 
> Lee Courtney
> +1-650-704-3934 cell
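
For anyone who wants to poke at the AC dependency problem described in the
quoted message, here is a rough toy sketch in Python. It is not a model of
the real Jupiter IBOX/EBOX: the Instr record, the count_hazards function,
the two-instruction "in flight" window, and the rule that every instruction
writes its AC unless flagged otherwise are all assumptions made up for
illustration. It only counts how often an instruction needs an index
register (an AC) for its address calculation while an earlier instruction
that writes that same AC would still be in the execution stage.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str               # mnemonic, used only as a label
    ac: int                 # accumulator field of the instruction
    index: int = 0          # index register used for the address, 0 = none
    writes_ac: bool = True  # toy assumption: most instructions write their AC


def count_hazards(program, depth=2):
    """Count instructions whose index register is written by one of the
    `depth` preceding instructions (assumed still in flight in the EBOX)."""
    hazards = 0
    for i, ins in enumerate(program):
        if ins.index == 0:
            continue  # address does not depend on any AC
        in_flight = program[max(0, i - depth):i]
        if any(p.writes_ac and p.ac == ins.index for p in in_flight):
            hazards += 1  # the fetch side would have to wait or redo work
    return hazards


# Hand-written order: AC1 is modified and then immediately used as an index.
program = [
    Instr("ADDI", ac=1),
    Instr("MOVE", ac=2, index=1),
    Instr("MOVE", ac=3),
    Instr("JRST", ac=0, index=5, writes_ac=False),
]

# Same kind of work, but the dependent instruction is moved further away.
reordered = [
    Instr("ADDI", ac=1),
    Instr("MOVE", ac=3),
    Instr("MOVE", ac=4),
    Instr("MOVE", ac=2, index=1),
]

print(count_hazards(program), count_hazards(reordered))
```

In this toy model the first ordering hits the dependency once and the
reordered one does not, which is the sort of thing an instruction-scheduling
compiler could arrange and hand-written assembler usually would not.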