On Tue, Apr 7, 2009 at 4:23 PM, William Donzelli <wdonzelli at gmail.com> wrote:
> > I understand.
> > There's no longer any need for an in-depth
> > understanding of the functioning of one's tools. Write garbage and
> > leave it to the compiler (both to be correct and to do the best job
> > of optimization).
>
> I do not think a human could do reasonable optimization for today's
> processors - even by some assembler guru. Even back with STRETCH, we
> started to see the future is in the compiler.
One problem is the variability among processors that even share the
same instruction set. I'm writing a fairly simple bit of code that I
need to be as fast as possible on a wide variety of machines. One
part is to multiply two large floating point vectors together as
quickly as possible. I haven't started processor specific
optimization yet, but already I've got 12 methods for doing the
multiply. Thank the deity of your choice for metaprogramming. On the
machine I'm sitting at right now, the best choice is
vpf_multiply<float,16,256> at 3.4 ns per op (on arrays much larger
than the cache). On the next nearest machine it's
v_multiply<float,4>: 5.5 ns per op. (Different clock speeds, of
course, so not directly comparable. Machine #2 does more flops per
cycle.) The difference between the various methods on the same machine
is 20%, so when your run time is measured in months, it makes a
difference.
Just in case you're wondering what could make multiplying two arrays
be so different, here's one of the routines.
template <typename T, const int size, const int prefetch>
static void vpf_multiply(const T *v1, const T *v2, T *v3, IDL_LONG n) {
    IDL_LONG rem = n % size, i = 0;
    T *ve = v3 + n - rem;            // end of the unrollable region
    while (v3 < ve) {
        register char pf;
        // Touch the inputs 'prefetch' bytes ahead; the volatile reads
        // keep the compiler from eliminating the loads.
        pf = *(reinterpret_cast<volatile const char *>(v1) + prefetch);
        pf = *(reinterpret_cast<volatile const char *>(v2) + prefetch);
        for (i = 0; i < size; i++) { // inner loop of 'size' multiplies
            v3[i] = v2[i] * v1[i];
        }
        v1 += size;
        v2 += size;
        v3 += size;
    }
    multiply<T>(v1, v2, v3, rem);    // plain scalar loop for the remainder
}
Yes, your eyes are not betraying you: that's an attempt at prefetching
the input arrays, written in standard C++, followed by what is probably
an unrolled loop. And yes, you need to allocate at least prefetch
extra bytes past the end of your arrays. (v_multiply<float,4> is just
a loop unrolled to 4 ops per iteration, without the prefetch.) And in
case you were wondering, the code is multithreaded at the level above
this one.
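I didn't show v_multiply, but from the description above it would be
the same loop without the prefetch reads. A sketch of what it probably
looks like - IDL_LONG assumed here to be a signed integer type, and
multiply<T> assumed to be the plain scalar fallback:

```cpp
#include <cstddef>

typedef long IDL_LONG;  // assumption: IDL_LONG is a signed integer type

// Assumed scalar fallback that handles the leftover elements.
template <typename T>
static void multiply(const T *v1, const T *v2, T *v3, IDL_LONG n) {
    for (IDL_LONG i = 0; i < n; i++)
        v3[i] = v1[i] * v2[i];
}

// Guess at v_multiply<T,size>: the multiply loop processed 'size'
// elements per outer iteration, with no prefetch trickery.
template <typename T, const int size>
static void v_multiply(const T *v1, const T *v2, T *v3, IDL_LONG n) {
    IDL_LONG rem = n % size;
    T *ve = v3 + n - rem;            // end of the unrollable region
    while (v3 < ve) {
        for (int i = 0; i < size; i++)
            v3[i] = v1[i] * v2[i];
        v1 += size;
        v2 += size;
        v3 += size;
    }
    multiply<T>(v1, v2, v3, rem);    // scalar loop for the remainder
}
```

With size a compile-time constant, the compiler is free to fully
unroll the inner loop, which is the whole point of the template
parameter.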
Of course, once I start delving into assembly language and SIMD
optimizations, there will be a real prefetch instruction, and
instructions to multiply four floats at once, so there is probably a
speedup there. The compiler doesn't find these on its own unless you
get very specific about which processor you target. (And on some
processors the vector instructions are no better than the scalars.)
So the options are: compile separately on every machine, or put the
effort into hand-optimizing specific parts of a generically optimized
compile.
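For the curious, here's roughly what the SIMD version would look like
on x86 with SSE - this is my sketch, not the actual code, and it
assumes SSE hardware and the usual intrinsics header:

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_mul_ps, _mm_prefetch, etc.

typedef long IDL_LONG;  // assumption: IDL_LONG is a signed integer type

// SIMD variant: four float multiplies per instruction, with an
// explicit prefetch hint 256 bytes ahead (a number picked for
// illustration; the real distance would be tuned per machine).
static void sse_multiply(const float *v1, const float *v2, float *v3,
                         IDL_LONG n) {
    IDL_LONG rem = n % 4, i;
    for (i = 0; i < n - rem; i += 4) {
        // Prefetch never faults, so reading "past the end" is safe here.
        _mm_prefetch(reinterpret_cast<const char *>(v1 + i) + 256,
                     _MM_HINT_T0);
        _mm_prefetch(reinterpret_cast<const char *>(v2 + i) + 256,
                     _MM_HINT_T0);
        _mm_storeu_ps(v3 + i,
                      _mm_mul_ps(_mm_loadu_ps(v1 + i),
                                 _mm_loadu_ps(v2 + i)));
    }
    for (; i < n; i++)  // scalar tail for the remainder
        v3[i] = v1[i] * v2[i];
}
```

The unaligned loads/stores keep it general; with 16-byte-aligned
allocations you'd use the aligned forms and likely go a bit faster.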
So whereas in 1986 I'd be optimizing for a specific processor, now I'm
throwing multiple options at every processor and using the one that's
fastest. And yes, you still need to know assembly language.
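The "throw multiple options at every processor" part boils down to a
little timing harness. A minimal sketch of the idea - the names and
the single-shot timing are mine, and a real harness would average many
repetitions on realistically sized arrays:

```cpp
#include <ctime>

typedef long IDL_LONG;  // assumption: IDL_LONG is a signed integer type
typedef void (*mul_fn)(const float *, const float *, float *, IDL_LONG);

// One plain candidate so the sketch is self-contained.
static void scalar_multiply(const float *v1, const float *v2, float *v3,
                            IDL_LONG n) {
    for (IDL_LONG i = 0; i < n; i++)
        v3[i] = v1[i] * v2[i];
}

// Run each candidate once on representative data and keep the fastest.
static mul_fn pick_fastest(mul_fn *cand, int ncand, const float *v1,
                           const float *v2, float *v3, IDL_LONG n) {
    mul_fn best = cand[0];
    double best_t = 1e30;
    for (int c = 0; c < ncand; c++) {
        std::clock_t t0 = std::clock();
        cand[c](v1, v2, v3, n);
        double t = double(std::clock() - t0) / CLOCKS_PER_SEC;
        if (t < best_t) { best_t = t; best = cand[c]; }
    }
    return best;  // use this one for the month-long production run
}
```

The selection cost is noise when the production run takes months, so
even a crude bake-off like this pays for itself immediately.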