"Dwight K. Elvey" wrote:
From: "Ross Archer"
<archer(a)topnow.com>
"Dwight K. Elvey" wrote:
From: "Ross Archer"
<dogbert(a)mindless.com>
Jerome H. Fine wrote:
>>Jim Kearney wrote:
>>
>>I just had an email exchange with someone at Intel's Museum
>>(http://www.intel.com/intel/intelis/museum/index.htm)
>>
>
>Jerome Fine replies:
>
>I am not sure why the information is so blatant in its
>stupid attempt to ignore anything but Intel hardware
>as far as anything that even looks like a CPU chip, but
>I guess it is an "Intel" museum.
>
>Of course, even now, Intel, in my opinion, is so far
>behind from a technical point of view that it is a sad
>comment just to read about products that were, and
>still are, way behind the excellence of other products.
>No question that if the Pentium 4 had been produced
>10 years ago, it would have been a major accomplishment.
>
Harsh! :)
Guess it depends on what you mean by "far behind from a
technical point of view."
If you mean that x86 is an ugly legacy architecture, with
not nearly enough registers, an instruction set that
doesn't fit any reasonable pipeline, that's ugly to decode
and not particularly orthogonal, and that for purely technical
reasons ought to have died a timely death in 1990,
I'd have to agree.
However, look at the performance. P4 is up near the
top of the tree with the best RISC CPUs, which have
the advantage of clean design and careful evolution.
It surely takes a great deal of inspiration, creativity,
and engineering talent to take something as ill-suited
as the x86 architecture and get this kind of performance
out of it. IMHO.
In other words, making x86 fast must be a lot like
getting Dumbo off the ground. That ought to count as
some kind of technical achievement. :)
---snip---
It is all done with smoke and mirrors.
Anything that results in a net faster CPU isn't, in my book,
akin to smoke and mirrors.
If anyone's guilty of "smoke and mirrors", it's probably
Intel, by making a ridiculously long (20-24 stage) pipeline
just to allow the wayupcrankinzee of clock rates so they can
be the first CPU to X GHz. Why not a 50-stage pipeline that hits
8 GHz, never mind the hideous branch-misprediction penalties
and exception overhead?
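To put rough numbers on why ever-deeper pipelines stop paying off,
here's a toy model; every figure in it (branch frequency, misprediction
rate, latch overhead) is an illustrative assumption, not real P4 or
Athlon data:

# Back-of-the-envelope: time per instruction vs. pipeline depth.
# All numbers are illustrative assumptions, not measured CPU figures.
def effective_tpi(stages, logic_ns=1.0, latch_ns=0.05,
                  branch_freq=0.2, mispredict_rate=0.1):
    """Deeper pipelines raise the clock (each stage does less logic),
    but a mispredicted branch flushes roughly the whole pipe."""
    cycle_ns = logic_ns / stages + latch_ns   # per-stage logic plus latch overhead
    penalty_cycles = stages                   # refill ~ the full pipeline on a miss
    cpi = 1.0 + branch_freq * mispredict_rate * penalty_cycles
    return cpi * cycle_ns                     # average ns per instruction

for depth in (5, 10, 20, 50):
    print(f"{depth:2d} stages: {effective_tpi(depth):.3f} ns/insn")

With these made-up numbers, going from 20 to 50 stages buys essentially
nothing: the misprediction penalty grows as fast as the clock does.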
We do the same
here at AMD. The trick is to trade immediate execution
for known execution. The x86 code is translated to run
on a normal RISC engine.
Yes, and this in and of itself must be rather tricky, no?
X86 instructions are variable-length, far from load/store,
have gobs of complexity in protected nonflat mode, etc.
I'd bet a significant portion of the Athlon or P4 is devoted
just to figuring out how to translate/align/schedule/dispatch
such a mess with a RISC core under the hood. :)
It doesn't take as much as one would think but it is a hit
on speed and space. Still, the overall hit is really quite
small.
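As a toy illustration of the kind of cracking you're describing
(entirely invented; this isn't anybody's actual decoder, and the
instruction/micro-op format is made up):

# Toy sketch: cracking a memory-operand x86-style instruction into
# load/op/store RISC-style micro-ops.  Purely illustrative.
def crack(insn):
    op, dst, src = insn                        # e.g. ("add", "[mem]", "eax")
    if dst.startswith("["):                    # read-modify-write on memory
        return [("load",  "tmp", dst),         # bring the operand into a register
                (op,      "tmp", src),         # do the ALU work register-to-register
                ("store", dst,   "tmp")]       # write the result back out
    return [insn]                              # register-only forms pass through

print(crack(("add", "[mem]", "eax")))
# -> [('load', 'tmp', '[mem]'), ('add', 'tmp', 'eax'), ('store', '[mem]', 'tmp')]

The hard part you allude to (variable-length decode, alignment,
scheduling) all happens before anything this tidy, of course.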
Based on what you're saying, it follows that a multi-level
instruction-set implementation (a lower-level microarchitecture
plus a higher-level user-visible architecture) is not only
feasible, but might even be superior in some cases to a
one-level implementation tuned either for CPU speed or for
compiler convenience.
What follows is that the user-level instruction set ought to
be organized for compiler code-generation efficiency (less
code, fewer instructions, less semantic gap between the
compiler and the compiler-visible CPU to make optimizations
more obvious, etc.). The microarchitecture is then designed
to keep the execution units and pipelines as busy as possible,
without regard to the semantic gap from the outside world.
The hybrid might eventually surpass the best purely RISC or
CISC approaches simply because there are two optimization points:
at the compiler/assembler and at the internal hardware.
This means that the same tricks
on a normal RISC engine would most likely only buy about
a couple percent. It would only show up on the initial
load of the local cache. Once that is done, there is
really little difference.
Choices of pipeline depth, out-of-order execution, multiple
execution engines, and such are just the fine tuning.
Intel, like us, is just closer to the fine edge of what
the silicon process can do; it isn't anything tricky that
people like MIPS don't know about.
Well, why isn't something elegant like Alpha, HP-PA, or MIPS
at the top of the performance tree then? (Or are they, and I'm
just not aware of the latest new products?)
My pet theory is that the higher code density of x86
vs. mainline RISC helps utilize the memory subsystem
more efficiently, or at least overtaxes it less often.
The decoding for RISC is a lot simpler, but if the caching
systems can't completely compensate for the higher memory
bandwidth requirements, you're stalling more often, or
indirectly limiting the maximum internal CPU speed due to the
mismatch. And decoding on-chip can go much faster than any
sort of external memory these days.
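To put very rough numbers on that pet theory (the average instruction
sizes, issue rate, and clock below are assumptions for illustration,
not measurements of any real CPU):

# Rough instruction-fetch bandwidth estimate for the code-density argument.
# The instruction sizes, issue rate, and clock are illustrative assumptions.
def fetch_bw_gb_per_s(bytes_per_insn, insns_per_cycle, clock_ghz):
    """Bytes of instruction stream the front end must supply per second."""
    return bytes_per_insn * insns_per_cycle * clock_ghz   # ~GB/s

x86_bw  = fetch_bw_gb_per_s(bytes_per_insn=3.0, insns_per_cycle=2.0, clock_ghz=1.5)
risc_bw = fetch_bw_gb_per_s(bytes_per_insn=4.0, insns_per_cycle=2.0, clock_ghz=1.5)
print(f"x86-ish:  {x86_bw:.0f} GB/s of instruction fetch")
print(f"RISC-ish: {risc_bw:.0f} GB/s of instruction fetch")

With those assumptions the fixed 4-byte encoding needs about a third
more fetch bandwidth (and I-cache capacity) for the same work, which is
exactly the kind of pressure I mean.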
This is why the newer processor chips are really a memory
chip with some processor attached, rather than a processor
with some memory attached. We and Intel are turning into
RAM makers. Memory bandwidth is on the increase but it
isn't keeping up with chip speed.
And unless you go with 1024+ bit wide SDRAM buses or such,
it's hard to see how you could have the external memory keep
up. The "happy" (well, carefree) days of 1000 ns instruction
cycles are long gone. :)
Still, I don't understand why many are not going to more
efficient memory optimization rather than apparent execution
speed. The compiler writers have a ways to go. The day is gone
when peephole optimization buys much.
For RISC targets, the semantic gap between an HLL statement
in "C", for example, and the target code is wider. Intuitively,
anyway, this means more instructions are output and fewer
optimizations are found for a given level of effort in the
code-generation logic. And since it's an "all things being
equal" deal, you can bet the optimization will be better with
a friendly target.
Peephole optimization would be particularly difficult where
there is basically only one way to do something in the target.
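For anyone who hasn't bumped into the term: a peephole optimizer just
slides a small window over the emitted code and rewrites known-wasteful
patterns. A minimal sketch, with invented mnemonics and only one pattern:

# Minimal peephole pass: scan adjacent instructions and drop a reload
# of a value that was just stored.  Mnemonics and pattern are invented.
def peephole(code):
    out, i = [], 0
    while i < len(code):
        cur = code[i]
        nxt = code[i + 1] if i + 1 < len(code) else None
        # "store r, [m]" immediately followed by "load r, [m]": drop the load
        if nxt and cur[0] == "store" and nxt[0] == "load" and cur[1:] == nxt[1:]:
            out.append(cur)
            i += 2
            continue
        out.append(cur)
        i += 1
    return out

before = [("store", "r1", "[x]"), ("load", "r1", "[x]"), ("add", "r1", "r2")]
print(peephole(before))
# -> [('store', 'r1', '[x]'), ('add', 'r1', 'r2')]

On a target where there's only one way to say anything, there are simply
fewer such patterns to catch.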
Perversely, all this argues for a RISC engine optimized for
internal speed and a CISC engine optimized to be
compiler-friendly, or in other words, "Q: Which technology is
better: RISC or CISC? A: Both are better than either." :)
At last, a possible explanation for why x86, which is so ugly
from a performance-theory point of view, really does work so
well in practice?
Keeping the process in on-chip cache is really the important
thing. There isn't an application out there that, if one
removed the large data arrays and image bit tables, couldn't
completely fit in the caches that are being used today.
The problem is cache is generally implemented as "n" parallel
direct-mapped caches (n-way) rather than truly associative,
because fully associative memory is impossibly complex and
expensive at any decent size.
So if you have a 2-way cache, that means you can only have two
data items in the cache whose addresses happen to map to the
same cache set, regardless of how big your cache is. For an
"n"-way cache, all you need is "n+1" frequently-used data items
that map to the same cache index, and musical chairs tosses out
vital data on every load. :(
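Here's a tiny simulation of that musical-chairs effect with a 2-way LRU
set-associative cache; the geometry and the access pattern are made up
purely to show the conflict misses:

# Toy 2-way set-associative cache with LRU replacement, showing how three
# hot lines that share one set thrash forever.  Geometry is invented.
NUM_SETS, WAYS, LINE = 64, 2, 32

def simulate(addresses):
    sets = [[] for _ in range(NUM_SETS)]   # each set holds up to WAYS tags, LRU order
    hits = misses = 0
    for addr in addresses:
        index = (addr // LINE) % NUM_SETS
        tag = addr // LINE // NUM_SETS
        ways = sets[index]
        if tag in ways:
            hits += 1
            ways.remove(tag)               # refresh: move to most-recently-used
        else:
            misses += 1
            if len(ways) >= WAYS:
                ways.pop(0)                # evict least recently used
        ways.append(tag)
    return hits, misses

# Three "hot" addresses that all land in set 0 (stride = NUM_SETS * LINE):
hot = [0, NUM_SETS * LINE, 2 * NUM_SETS * LINE]
print(simulate(hot * 1000))                # (0, 3000): every single access misses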
A compiler/linker would not only have to know what data was
dynamically most frequently used at any given time, but also
the method by which an item's address maps it to a cache index,
and how many cache ways there are, to prevent ugly *stuff* like
this from happening, by locating data so that frequently-used
data is always paired in the other ways with infrequently-used
data.
One thing a compiler *could* do is set a "hint" bit in the load
and store instructions (provided the CPU provides a bit in
load/store for this purpose) when the code generator thinks the
data just loaded/stored will be used again especially often in
the *near* future. The CPU could let that bit stay set for, say,
one million CPU cycles before clearing it, and try its damnedest
not to toss out a data item with this bit set if there's an
alternative in the other n-1 ways that has no such bit set.
That might help quite a bit. Actual implementation would
undoubtedly be very different (timestamp?), but the idea is to
"hint" the CPU into making a better choice of who to toss into
the street vs. keep in the shelter. :)
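As a sketch of how such hinted victim selection might look inside one
cache set (entirely hypothetical; this is not any real CPU's policy):

# Hypothetical victim selection: prefer evicting a way whose "hint" bit is
# clear; fall back to plain LRU when every way in the set is hinted.
from dataclasses import dataclass

@dataclass
class Way:
    tag: int
    hinted: bool      # set by the imagined compiler "keep me" hint
    last_used: int    # cycle of last access, for LRU ordering

def pick_victim(ways):
    """Return the index of the way to evict from this set."""
    unhinted = [i for i, w in enumerate(ways) if not w.hinted]
    candidates = unhinted if unhinted else range(len(ways))
    return min(candidates, key=lambda i: ways[i].last_used)

ways = [Way(tag=0x12, hinted=True,  last_used=100),
        Way(tag=0x34, hinted=False, last_used=500)]
print(pick_victim(ways))   # 1: the unhinted line goes, even though it's newer

The aging-out part (clearing the bit after a million cycles or so) is
left out of the sketch, but it would keep stale hints from pinning dead
data.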
The compilers just don't
write code well enough to keep the size down. It is just
that we've gotten into the poor choice of languages and poor
connection of software writers to the actual machine code
that is run.
I'd have to agree with the poor connection and code-size parts.
I think it's a bit unfair to blame the high-level languages for
this problem, though. It seems to me that the code-generation
phase is where things are broken. And since most compilers have
a lexical view of the world rather than a run-time view of the
world, it is also kind of difficult to predict what needs to be
optimized without some fancy simulation technology that AFAIK
isn't used as a rule as part of code generation, but perhaps
should be. :)
Just my opinion.
Dwight
I've learned some great things so far.
It's safe to say I already think of things quite differently
than I did just yesterday.
-- Ross
>
>This isn't really a discussion for classiccmp, but I couldn't
>resist since I'm sure at least some folks enjoy speculationalism
>on such topics. :)
>
>
>>
>> On a separate subject, I was very disappointed in the
>> Intel Museum. I'd thought it might be a good place to
>> research early software or early ICs. They have very
>> little to offer to someone looking into this level of
>> stuff. Any local library has better references on this
>> kind of stuff (and that isn't saying much).
>> Dwight
Yup. Even corporate boosterism shouldn't keep one from a
graceful acknowledgement of the contributions of others. :|
-- Ross