Chuck Guzis wrote:
...and at some point, the stack overflows the L1 cache
and degrades to
memory-to-memory speed, no?
Except that it doesn't matter. The older frames fall out of the cache
anyway, and the cache now sees a new area of memory, so the very early
calls are no longer cached and you take a few misses. But the stuff the
CPU is currently working on is almost always in cache. Once you return
back toward the beginning, you get another cache miss and a reload. You
can engineer the stack cache with this in mind, i.e. fetch lots and lots
at once and don't have too many cache lines.
The stack can and should have its own separate cache, just like some
architectures split their data and code caches.
As long as the cache part of it does lazy writes back to memory and is
optimized to cache a stack rather than general memory, it'll fly.
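Here's a minimal sketch in plain C of what I mean -- not any real
design, just the idea: a handful of wide lines tracking the top of the
stack, with lazy write-back so dirty lines only reach memory when they
get evicted. The line size, line count, and memory size are arbitrary
assumptions, and addresses are assumed to stay in range.

#include <stdint.h>
#include <string.h>

#define LINE_WORDS 64u    /* "fetch lots and lots at once": wide lines */
#define NUM_LINES   4u    /* "don't have too many cache lines"         */
#define MEM_WORDS  (1u << 16)

static uint32_t memory[MEM_WORDS];        /* backing store, word addressed */

typedef struct {
    uint32_t base;                        /* word address of the line's first word */
    uint32_t data[LINE_WORDS];
    int      valid, dirty;
} line_t;

static line_t   cache[NUM_LINES];
static unsigned next_victim;              /* simple round-robin replacement */

/* Find the line holding word address addr, loading it on a miss. */
static line_t *lookup(uint32_t addr)
{
    uint32_t base = addr & ~(LINE_WORDS - 1);
    for (unsigned i = 0; i < NUM_LINES; i++)
        if (cache[i].valid && cache[i].base == base)
            return &cache[i];

    /* Miss: pick a victim, write it back only if it is dirty (lazy). */
    line_t *v = &cache[next_victim];
    next_victim = (next_victim + 1) % NUM_LINES;
    if (v->valid && v->dirty)
        memcpy(&memory[v->base], v->data, sizeof v->data);
    memcpy(v->data, &memory[base], sizeof v->data);
    v->base = base;
    v->valid = 1;
    v->dirty = 0;
    return v;
}

static void stack_write(uint32_t addr, uint32_t value)   /* push / local store */
{
    line_t *l = lookup(addr);
    l->data[addr & (LINE_WORDS - 1)] = value;
    l->dirty = 1;                         /* memory is not touched yet */
}

static uint32_t stack_read(uint32_t addr)                /* pop / local load */
{
    return lookup(addr)->data[addr & (LINE_WORDS - 1)];
}

int main(void)
{
    uint32_t sp = 0x8000;                 /* hypothetical stack pointer */
    stack_write(--sp, 42);                /* push: stays in the cache   */
    return (int)stack_read(sp);           /* pop: hits the same line    */
}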
While it's true that stack-based architectures can
provide a lot of
bang-for-the-buck, like single-accumulator machines, it's difficult
to leverage parallelism to improve performance.
Multi-cores would work, each with their own stack caches. The odds are
that two cores will never access the same stack, so it should work
nicely. You can even dictate that this must never happen and let the OS
check the stack ranges to enforce it.
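Something like this, say -- the struct and the check are hypothetical,
not from any real OS, but it's all the enforcement you'd need: before
scheduling a thread onto a core, confirm its stack range is disjoint
from every stack currently resident on another core, so no two stack
caches ever hold the same lines.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uintptr_t stack_base;   /* lowest address of the thread's stack */
    uintptr_t stack_limit;  /* one past the highest address         */
} thread_t;

static bool ranges_overlap(const thread_t *a, const thread_t *b)
{
    return a->stack_base < b->stack_limit && b->stack_base < a->stack_limit;
}

/* True if 'incoming' may run alongside the threads already resident
 * on the other cores without sharing any stack lines. */
static bool stacks_disjoint(const thread_t *incoming,
                            const thread_t *running, int ncores)
{
    for (int i = 0; i < ncores; i++)
        if (ranges_overlap(incoming, &running[i]))
            return false;
    return true;
}

int main(void)
{
    thread_t running[2] = {
        { 0x10000, 0x20000 },   /* core 0's current thread */
        { 0x30000, 0x40000 },   /* core 1's current thread */
    };
    thread_t incoming = { 0x20000, 0x30000 };   /* disjoint from both */
    return stacks_disjoint(&incoming, running, 2) ? 0 : 1;
}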
I'd venture that it's awfully difficult to
design a stack machine
that can compete with a 3-address architecture with a large register
file, multiple pipelined functional units and instruction scheduling.
Maybe. But unless the above machine uses register windows, or has some
other way to use the registers to pass data between functions, every
call still spills its arguments to memory. The expense is that using an
L1 cache as the register file means you need lots of L1 and you need to
make it as fast as possible. At some point just having lots of
registers doesn't help you if the compiled function doesn't use them.
On context changes you always have to save all the registers to memory,
so you'd lose on any interrupt. On the stack machine, you only have to
flush the dirty cache lines, and you're done. Don't have to save very
many registers at all! (Likely just PC, SP, and status registers.)
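To make that concrete, here's a back-of-envelope sketch; the 64-entry
register file and 64-bit registers are assumptions of mine, not taken
from any particular machine.

#include <stdint.h>
#include <stdio.h>

/* Large-register-file machine: every architected register has to be
 * saved on any interrupt or context switch. */
typedef struct {
    uint64_t pc, sp, status;
    uint64_t gpr[64];                 /* 64 general registers */
} regfile_context_t;

/* Stack machine: the stack already lives in memory (via the cache),
 * so only the control registers need saving; the hardware just
 * flushes whatever stack-cache lines are still dirty. */
typedef struct {
    uint64_t pc, sp, status;
} stack_context_t;

int main(void)
{
    printf("register-file machine: %zu bytes saved per switch\n",
           sizeof(regfile_context_t));
    printf("stack machine: %zu bytes saved, plus dirty lines only\n",
           sizeof(stack_context_t));
    return 0;
}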
It would also depend on the type of code you run there. Hand-tuned
assembly that almost never touches the stack would not benefit at all on
a stack machine: the kind of thing you'd see in a single-purpose
real-time controller with very little RAM, for example. But for generic
C/C++/Pascal/OOP code, well-designed stack architectures should
run beautifully.
The next big slowdown will be threading, as switching between threads
will require flushing the stack cache (and all the other caches). A
multicore design can help eliminate a lot of the context switches caused
by threading, and besides, you'd have the same problem on any
architecture there.
IMHO, larger register windows would win over a stack machine, however,
especially when you have a function call chain where one fn passes its
operands to another fn, which passes them along in turn.
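A tiny example of the kind of call chain I mean, where each function
just forwards its operands to the next. On a SPARC-style register-window
machine the caller's out registers become the callee's in registers, so
forwarding the operands costs no memory traffic at all.

#include <stdio.h>

static long leaf(long a, long b)   { return a + b; }
static long middle(long a, long b) { return leaf(a, b); }   /* forwards operands */
static long outer(long a, long b)  { return middle(a, b); } /* forwards again    */

int main(void)
{
    printf("%ld\n", outer(40, 2));   /* three frames deep, same two operands */
    return 0;
}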
Silicon-wise, you'd spend about the same on the stack arch as you would
on the register-window arch. The flushes to memory would be less
frequent on a register-window arch, but would take longer to complete,
while on the stack arch there'd be more of them, but they'd be more
spread out.
Would you consider the Burroughs B5000 to be a stack
machine?
I'm not familiar with that machine. What's it look like?