On 2024-06-10 10:18 a.m., Joshua Rice via cctalk wrote:
> On 10/06/2024 05:54, dwight via cctalk wrote:
>> No one is mentioning multiple processors on a single die, and cache
>> that is bigger than most systems of that time's complete RAM.
>> Clock speed was dealt with via clever register renaming, pipelining
>> and prediction.
>> Dwight
> Pipelining has always been a double-edged sword. Splitting the
> instruction cycle into smaller, faster chunks that can run
> simultaneously is a great idea, but as the latency of a single
> instruction grows, failed branch predictions and the subsequent
> pipeline flushes can truly bog down real-world IPS. This is
> ultimately what made the NetBurst architecture the dead end it became.
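To put rough numbers on that (the figures below are invented for
illustration, nothing like NetBurst's actual ones), a sketch of how a
deep pipe's flush penalty eats into throughput:

    #include <stdio.h>

    /* Toy model of pipelined throughput under branch mispredictions:
     * effective CPI = base CPI + branch_fraction * miss_rate * penalty.
     * All constants are illustrative guesses, not measured values. */
    int main(void)
    {
        double base_cpi = 1.0;         /* ideal: one instruction/cycle  */
        double branch_fraction = 0.20; /* ~1 in 5 instructions branches */
        double miss_rate = 0.10;       /* predictor misses 10% of them  */
        double penalty = 20.0;         /* cycles to refill a deep pipe  */

        double cpi = base_cpi + branch_fraction * miss_rate * penalty;
        printf("effective CPI %.2f, throughput %.0f%% of ideal\n",
               cpi, 100.0 * base_cpi / cpi);
        return 0;
    }

With those guesses you are already down to about 70% of the ideal
rate, and the deeper the pipe, the bigger the penalty term gets.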
The other gotcha with pipelining is that you have to have equal-size
chunks.
A 16-word register file seems to be the right size for a 16-bit ALU;
64 words for a 32-bit ALU; 256 words for a 64-bit ALU, as a guess.
You never see gate-level delays on a spec sheet.
Our pipeline is X gate delays of logic + N delays for a latch.
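Spelled out (my notation, with toy numbers), that gives a cycle time
something like this:

    #include <stdio.h>

    /* The slowest stage sets the clock:
     *   T = max(stage logic delays) + latch overhead
     * Stage figures are invented gate-delay counts, purely for show. */
    int main(void)
    {
        int stage[] = { 8, 10, 7, 9 }; /* logic delays per stage (X)    */
        int latch = 2;                 /* delays per pipeline latch (N) */
        int n = sizeof stage / sizeof stage[0];
        int worst = 0, total = 0;

        for (int i = 0; i < n; i++) {
            total += stage[i];
            if (stage[i] > worst)
                worst = stage[i];
        }

        /* Every stage pays for the slowest one, which is why the
         * chunks need to be equal: imbalance is pure wasted time. */
        printf("actual clock:   %d delays (worst %d + latch %d)\n",
               worst + latch, worst, latch);
        printf("balanced clock: %.1f delays (total %d/%d stages + latch)\n",
               (double)total / n + latch, total, n);
        return 0;
    }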
"How Fast Can Computers Add?", Scientific American, Vol. 219, No. 4
(October 1968), pp. 93-101.
I do not think that will change under MORE's law, LESS's law, or
BIG MONEY's law.
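If memory serves, that article is about exactly this: the time to add
grows with word width. A ripple-carry adder is linear in the word
length, while lookahead schemes get closer to logarithmic. A toy
comparison (the delay constants are schematic, not from the article):

    #include <stdio.h>
    #include <math.h>   /* link with -lm */

    /* Rough gate-delay depth of n-bit adders:
     * ripple carry grows with n, carry lookahead with log2(n). */
    int main(void)
    {
        for (int n = 16; n <= 64; n *= 2) {
            int ripple = 2 * n;                      /* ~2 delays/bit */
            int lookahead = 4 * (int)ceil(log2(n));  /* ~4/tree level */
            printf("%2d-bit add: ripple %3d delays, lookahead %2d\n",
                   n, ripple, lookahead);
        }
        return 0;
    }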
> DEC came across another issue with the PDP-11 vs the VAX. Although
> the pipelined architecture of the VAX was much faster than the
> PDP-11, the actual time for a single instruction cycle was much
> increased, which led customers requiring real-time operation to
> stick with the PDP-11, as it was much quicker in those operations.

Forget that; noise. PDP-11s were dirt cheap compared to the VAX.

> This, along with its large software back-catalog and established
> platform, led to the PDP-11 outliving its successor.
>
> Josh Rice

Now that makes more sense.