Randy Dawson wrote:
...
Lost, sadly, was the machine between then and now: the Graphics
Supercomputer. In an effort to add computational speed to graphics and
scientific visualization, two vendors went head to head on this problem,
Ardent and Stellar.
If you were around at the time and saw one of these, I would love to
hear from you. The performance was truly spectacular. I had a chance
to use one for a couple of years, and it still comes pretty close to
current GPU tech in graphics performance. With a pipelined vector
processor and a compiler to unroll loops, it was WOW. Today's GHz
processors cannot beat a vector machine in computation; Titan had a
16 MHz 1K floating point vector ALU.
Randy, no doubt you know a lot more about the Ardent/Stellar/Stardent
stuff than I do. I was aware of it, and I once got hold of the design
spec for the TOE processor (the 4x4 pixel "stamper"). However, I think
you aren't aware of how sophisticated today's GPUs are.
Taking your figures at face value, 16 MHz * 1K flops per cycle = 16
GFLOPS. A single top end GPU is more like 500 GFLOPS (single precision
only, though). Today's GPUs have myriad pixel formats, including ARGB
with an FP32 for each component. Pixel shaders are highly programmable.
A single GPU can have > 80 GB/sec of bandwidth to DRAM (not cache).
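
For anyone who wants to check the arithmetic, here is a minimal sketch
in Python. It only multiplies out the figures quoted in this thread;
reading "1K" as 1024 flops per clock is my assumption, and the variable
names are made up for illustration.

```python
# Figures quoted in this thread; none of them are independent measurements.
CLAIMED_CLOCK_HZ = 16e6          # "16 MHz" from the quoted post
CLAIMED_FLOPS_PER_CYCLE = 1024   # ASSUMPTION: "1K" read as 1024 flops per clock
GPU_PEAK_FLOPS = 500e9           # "more like 500 GFLOPS", single precision


def peak_flops(clock_hz, flops_per_cycle):
    """Peak rate = clock frequency times flops issued per cycle."""
    return clock_hz * flops_per_cycle


claimed_peak = peak_flops(CLAIMED_CLOCK_HZ, CLAIMED_FLOPS_PER_CYCLE)
gpu_ratio = GPU_PEAK_FLOPS / claimed_peak

print(f"claimed Titan peak: {claimed_peak / 1e9:.1f} GFLOPS")
print(f"GPU peak / claimed Titan peak: {gpu_ratio:.1f}x")
```

So even if the 1K figure were right, the GPU would still be ahead by
well over an order of magnitude on peak rate alone.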
The TOE processor was a fixed-point affair with limited precision.
There is no comparison. I wish I still had the spec to make a more
concrete comparison.
A Google search turned up this quote:
With the Dore' rendering package [Borden89], each processor is capable
of rendering a maximum of 20,000 smoothly shaded small polygons/seconds.
Today's GPUs can render thousands of times more triangles per second,
antialiased, with multiple, high quality texture maps and arbitrary
blending.
Another Google search turned up
http://www.ece.cmu.edu/~ece548/handouts/17v_perf.pdf
which says that the Titan 1 had a 125 ns clock period and two FPUs, for
16 MFLOP/s peak. Perhaps you recall 1K FPUs, but maybe it was a 1K
vector register length. The same PDF (written by Philip Koopman) says
that even with four processors, and with a large (1000x1000) array
size, the Titan 1 peaked at 15.7 MFLOP/s. It attributes this to the
fact that the aggregate bus bandwidth of the Titan was 256 MB/sec. By
rewriting the LINPACK code to block the data appropriately, they got it
up to 46 MFLOP/s.
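
To put those numbers side by side, here is a rough sketch in Python
using only the figures quoted above. Two things in it are my
assumptions, not statements from the handout: that the 16 MFLOP/s peak
is per processor (the 46 MFLOP/s blocked result on four processors
suggests so), and that operands are 8-byte doubles with MB meaning
10**6 bytes.

```python
# Figures as quoted above from the Koopman handout; nothing here is measured.
CLOCK_PERIOD_S = 125e-9      # Titan 1 clock period
PEAK_MFLOPS = 16.0           # quoted peak with two FPUs
PROCESSORS = 4
MEASURED_MFLOPS = 15.7       # 1000x1000 LINPACK on four processors
BLOCKED_MFLOPS = 46.0        # after blocking the data
BUS_MB_PER_S = 256.0         # aggregate bus bandwidth
GPU_PEAK_MFLOPS = 500e3      # the 500 GFLOPS figure from earlier in this post

# Clock frequency implied by the period: 1 / 125 ns.
clock_mhz = 1.0 / CLOCK_PERIOD_S / 1e6

# ASSUMPTION: the quoted 16 MFLOP/s peak is per processor.
aggregate_peak_mflops = PEAK_MFLOPS * PROCESSORS

# ASSUMPTION: 8-byte operands and MB = 10**6 bytes, so this is the most
# operands per second (in millions) the bus could deliver.
bus_moperands_per_s = BUS_MB_PER_S / 8.0

blocking_speedup = BLOCKED_MFLOPS / MEASURED_MFLOPS
gpu_vs_blocked = GPU_PEAK_MFLOPS / BLOCKED_MFLOPS

print(f"implied clock: {clock_mhz:.1f} MHz")
print(f"aggregate peak (assumed per-processor): {aggregate_peak_mflops:.0f} MFLOP/s")
print(f"bus limit: {bus_moperands_per_s:.0f} M operands/s")
print(f"speedup from blocking: {blocking_speedup:.2f}x")
print(f"GPU peak vs blocked Titan: {gpu_vs_blocked:.0f}x")
```

Note that a 125 ns period works out to 8 MHz, not 16 MHz, and that the
GPU peak is being compared against a measured Titan number, so the last
ratio flatters the GPU somewhat. The gap is still about four orders of
magnitude.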
So, overall, I think there is no comparison. The rose-colored glasses
of time have fooled you.
...
Are there any graphics guys on the list?
Yes, from the hw end of things.