Chuck Guzis wrote:
I'm talking about SIMD implementations here.
<load R1 with address and length of source vector 1>
<load R2 with address and length of source vector 2>
<load R3 with address and length of destination vector 3>
<load R4 with address of sparse bit vector 1>
<load R5 with address of sparse bit vector 2>
<load R6 with address of sparse bit vector 3>
ADDSV (R1,R4),(R2,R5),(R3,R6)   ; add the two sparse vectors
The point being that most vector capabilities are three-address
implementations--because they run fastest.
Cray went so far as to adopt 3-address architecture using vector
registers rather than 3-address memory-to-memory operations.
Yes, and IBM went so far as to implement AltiVec, etc. A stack
architecture is not mutually exclusive with a 3-address opcode, nor is a
dedicated vector engine incompatible with a stack machine, any more than
an FPU, MMU, or anything else is.
I believe you're confusing different features here. There is no
marriage between a register file implementation and a 3-address
instruction set architecture, nor any between a register file
implementation and a vector unit, nor between a register window
architecture and a 3-address ISA.
But no stack vector machines exist to the best of my
knowledge,
although a vector stack would certainly be possible to construct.
Exactly. Just because it hasn't been done in the past doesn't mean that
a) it can't be done, or b) if implemented properly it won't be faster at
running C/C++ code than a similar architecture that isn't a stack
machine. Nor does it mean that the first implementation of such a thing
will be faster than current technology, any more than the first RISCs
were substantially faster than the CISCs that preceded them.
No, what I've seen are your dumb old three-address
implementations.
I think that speaks volumes regarding performance.
It doesn't actually.
Nor would a nascent stack machine built today necessarily be much faster
than its peers - for the first iterations! Example: for a lot of
things, a 68040 running at 40 MHz was a lot faster than a 40 MHz SPARC v7.
But after a while, as the SPARC ISA progressed, you saw better and better
implementations that left the Motorola line in the dust.
Let's take this analogy further, to, say, software floating-point
routines vs. hardware FPUs:
The one and only thing an FPU buys you is fast floating-point math. If
you look at Intel's early implementations, you'll find references saying
that software floating-point routines on an 8 MHz 68000 were faster.
You can't say that any more of software-implemented 680x0 floating-point
instructions versus the FPU of a P4-class Intel chip. Over time,
things improve. Mostly incrementally, but they do improve.
Expand it a bit beyond FPUs:
Nor can you say that an FPU is great or poor at vector processing. A
mature FPU will run rings around a nascent vector unit, but give it a
few generations of design and you'll find the now-mature vector units
will do much better at vector processing than current FPUs.
What I'm saying is that I believe you're comparing oranges and
pineapples here. They do completely different things. Up until
recently you didn't see vector units inside microprocessors. You do now
because they're useful and, more importantly, because they're becoming
practical to implement. In the early days you didn't see FPUs or
MMUs, nor SIMD units, nor multiple cores built into microprocessors.
You do now.
The various technologies involved do different things and speed up
processing in different areas of code, (or in the case of an MMU provide
memory protection and virtual memory capabilities).
Whether a specific CPU implementation uses a two-address or three-
address ISA is independent of how it implements its register access, and
independent of how it implements a vector unit or an FPU. Whether
your architecture has 2- or 3-address instructions depends on only one
thing: the guy that mapped out the bits of the opcodes and operands for
the opcode decode unit. Did he have enough bits for 3 operands, or did
he give up and use 2 like most folks? There's a lot you can do with a
64-bit-wide instruction that you can't with a 32- or 16-bit one -
(unless you have variable-sized opcodes and think that the guys using
your chip will benefit from what you put in there.)
You don't have to take my word for it. Here's the proof!
While the SPARC ISA is not a stack machine, it's based on the next
closest thing: register windows, which behave identically to a stack
architecture machine up until you have to access main memory.
Take a look at this example of SPARC assembly code:
loop:
        ld      [%l2], %o0      ! load next element
        add     %l1, %o0, %l1   ! accumulate into %l1
        add     %l2, 4, %l2     ! advance the pointer
        ba      loop
        inc     %l0             ! executes in the branch delay slot
See those two ADD instructions? How many addresses do they use? Why, 3!
Would you say that machines with register windows are faster than those
without? How far off is a register window from a stack? Don't answer
that; let's look at an actual stack machine.
Here's a direct quote from the AT&T 92010 Hobbit CPU manual as used in the Eo
440/880's:
"True three-operand (triadic) instructions are not
provided. However, instruction encodings that provide two source operands
and store the full 32-bit result in the accumulator are provided. This
instruction is called a two-and-a-half-operand instruction.
"For example, the mnemonic for an addition instruction
is ADD3, while a two-operand (dyadic) addition is ADD. For this
instruction, the two source operands are added and the full 32-bit result
is stored in the accumulator."
...
Ok, so far so good, but what exactly is the accumulator? Sounds like a
register, right? Not quite!
"1.4.3 Integer Accumulator
The integer accumulator is not an actual hardware
register. It is the word in memory above the word addressed by the current
stack pointer (CSP). The CSP is either the stack pointer (SP) or the interrupt
stack pointer (ISP) as determined by the program status word (PSW).
"The integer accumulator normally resides on-chip in
the stack cache, but it may be off-chip if the SP = MSP or CSP = ISP."
...
So there you have it, a 3-register operation on a stack machine. Ok,
granted the third register isn't addressable like the others, but
nonetheless, it is on the stack, just like the rest of the operands.
Why did they do it that way? Same reason everyone does stupid things
(at least from some points of view): compromise! They didn't think it
was important enough to expand the ISA for it, so they went a bit
cheaper rather than pay the cost of fully supporting it.
This is because the ISA doesn't have room in the opcode definition for 3
register indexes. But either way, it's still in there. There is
absolutely nothing stopping anyone from building an ISA with 3 addresses,
as you call them, or triadic operands, as AT&T calls them.
Wow! What a miracle! A Stack machine that takes 3 addresses!
Why does this machine suck, you ask? Simple: it only has enough room for
64 stack entries in the cache. It doesn't have an FPU. It doesn't have
a vector unit. It has only a single core, and the fastest they built
ran at 32 MHz. But you know what, for its day and age, and considering
that it was meant to be used in portable machines such as PDAs, it
compared very nicely to other CPUs of the era. If you look at the size
of its internal cache, it was certainly on par with low-end 32-bit
CPUs. What else did it feature?
MMU with dual 32-entry TLBs
3-stage instruction pipeline
32-bit integer unit
Tagged math for OOP coding
Prefetch instructions for on-demand prefetching
Branch prediction and branch folding
Instruction tracing
Vector base for interrupts, so you can have multiple processes
Kernel/user mode
7-level interrupt model like the 68000 (NMI + 6 IRQs)
3K instruction cache, implemented as a 3-way set-associative cache
256-byte stack cache
Big-endian/little-endian byte ordering
Was it CISC or RISC? Neither; it was something in between, which they
called a CRISP ISA. But that was not because it's a stack machine;
rather, it's because it didn't use a fixed opcode size. Like CISC
machines it used variable-sized opcodes, and like RISC machines it had
lots of registers. Granted, most of those registers were on the stack.
If you were to build a modern multi-core 64 bit version of the Hobbit
CPU, add an FPU and vector unit and large enough split caches, you'd
find it would work quite nicely.
But was it fast, you ask? To answer that, get your hands on two
machines running the PenPoint OS. Two machines from the 1992 era:
an AT&T Eo 440 and an IBM ThinkPad 730.
The TP730 used a 486SL/33 (no FPU), 8MB of RAM, and optional hard drives.
The Eo 440 used the Hobbit 92010 running at 20 MHz, and the Eo 880 ran at
30 MHz. They had 4MB of RAM and optional hard drives. Wanna guess which
was faster?
I'll give you a hint: it's the machine with the famous Death Star
logo on it!
Yes, even the low-end 440 running at *20 MHz* with *HALF* the RAM outruns
the ThinkPad!
Now, that's comparing apples to oranges, sure, since they're different
machines with different motherboards and chips, but sorry, I've no other
way to compare them. That said, if you open up an Eo, you'll find
mostly standard PC hardware beyond the AT&T Hobbit chips. And they ran
the same operating system: PenPoint. Yet if you compare something very
basic, such as the response time when it refreshes the screen (I don't
mean just bitblt drawing, but rather when it performed CPU-intensive
operations), the Eo outran it.
In fact, it outran it in every operation I tried, including the crappy
handwriting recognition.
Is it possible that a stack machine could actually be fast? Wow! We must
be in an alternate universe or something!
http://pencomputing.com/old_pcm_website/PCM_7/review_thinkpad_730te.html
http://www.bebox.nu/history.php?s=history/eo
http://www.quepublishing.com/articles/article.asp?p=481859&seqNum=15&am…