Request from Intel's Museum - test-drb@ccmp.vtda.org

9 Oct 2002

...
 From: "Ross Archer"
&lt;archer(a)topnow.com&gt;

"Dwight K. Elvey" wrote:

 From: "Ross Archer"
&lt;dogbert(a)mindless.com&gt;

Jerome H. Fine wrote:

 >Jim Kearney wrote:
>
>

>I just had an email exchange with someone at Intel's Museum
>(http://www.intel.com/intel/intelis/museum/index.htm)
>
>

Jerome Fine replies:

I am not sure why the information is so blatant in its
stupid attempt to ignore anything but Intel hardware
as far a anything that even look like a CPU chip, but
I guess it is an "Intel" museum.

Of course, even now, Intel, in my opinion, is so far
behind from a technical point of view that is is a sad
comment just to read about the products that were
way behind, and still are, the excellence of other
products.  No question that if the Pentium 4 had been
produced 10 years ago, it would have been a major
accomplishment.
 Harsh! :)

Guess it depends on what you mean by "far behind from a
technical point of view."

If you mean that x86 is an ugly legacy architecture, with
not nearly enough registers, an instruction set which
doesn't fit any reasonable pipeline, that's ugly to decode
and not particularly orthogonal, that from purely technical
reasons ought to have died a timely death in 1990,
I'd have to agree.

However, look at the performance.  P4 is up near the
top of the tree with the best RISC CPUs,  which have
the advantage of clean design and careful evolution.

It surely takes a great deal of inspiration, creativity,
and engineering talent to take something as ill-suited
as the x86 architecture and get this kind of performance
out of it.  IMHO.

In other words, making x86 fast must be a lot like
getting Dumbo off the air.  That ought to count as
some kind of technical achievement. :)  
 ---snip---

  It is all done with smoke and mirrors.  
Anything the results in a net faster CPU isn't, in my book,
akin to smoke and mirrors. 

If anyone's guilty of "smoke and mirrors", it's probably
Intel by making a ridiculous long (20-24 stage) pipeline 
just to allow the wayupcrankinzee of clock rates so they can
be first CPU to X Ghz. Why not a 50 stage pipeline that hits
8 Ghz, nevermind the hideous branch-misprediction penalties
and exception overhead?

  We do the same
 here at AMD. The trick is to trade immediate execution
 for known execution. The x86 code is translated to run
 on a normal RISC engine.  
Yes, and this in and of itself must be rather tricky, no?
X86 instructions are variable-length, far from load/store, 
have gobs of complexity in protected nonflat mode, etc.
I'd bet a significant portion of the Athlon or P4 is devoted
just to figuring out how to
translate/align/schedule/dispatch
such a mess with a RISC core under the hood. :) 
 It doesn't take as much as one would think but it is a hit
on speed and space. Still, the overall hit is really quite
small.

...

  This means that the same tricks
 on a normal RISC engine would most likely only buy about
 a couple percent. It would only show up on the initial
 load of the local cache. Once that is done, there is
 really little difference.
  Choices of pipeline depth, out of order execution, multiple
 execution engines and such are just the fine tuning.
 Intel, like us is just closer to the fine edge of what
 the silicon process can do than anything tricky that
 people like MIPS don't know about. 
Well, why isn't something elegant like Alpha, HP-PA, or MIPS
at the top of the performance tree then?  (Or are they and
I'm
just not aware of the latest new products.)

My pet theory is that the higher code density of x86
vs. mainline RISC helps utilize the memory subsystem
more efficiently, or at least overtaxes it less often.
The decoding for RISC is a lot simpler, but
if the caching systems can't completely compensate for the
higher
memory bandwidth requirements, you're stalling more often or
limiting
the maximum internal CPU speed indirectly due to the
mismatch.
And decoding on-chip can go much faster than any sort of
external
memory these days. 
 This is why the newer processor chips are really a memory
chip with some processor attached, rather than a processor
with some memory attached. We and Intel are turning into
RAM makers. Memory bandwidth is on the increase but it
isn't keeping up with chip speed.
 Still, I don't understand why many are not going to more
efficient memory optimization than apparent execution speed.
The compiler writers have a ways to go. The day is gone
when pinhole optimization buys much. Keeping the process
in on chip cache is really the important thing. There isn't
an application out there that if one removed the large data
array and image bit tables, couldn't completely fit in the
caches that are being used today. The compilers just don't
write code well enough to keep the size down. It is just
that we've gotten into the poor choice of languages and poor
connection of software writers to the actual machine code
that is run.
Just my opinion.
Dwight

...

This isn't really a discussion for classiccmp, but I
couldn't
resist since I'm sure at least some folks enjoy
speculationalism
on such topics. :)

  On a separate subject, I was very disappointed in the
 Intel Museum. I'd thought it might be a good place to
 research early software or early IC's. They have vary
 little to offer to someone looking into this level of
 stuff. Any local library has better references on this
 kind of stuff ( and that isn't saying much ).
 Dwight 
n