It was thus said that the Great Guy Sotomayor once stated:
The big problem is the code expansion, but some compilers (at suitably
high optimization levels) will do loop unrolling. To see why compilers
don't do this take a look at the implementation of memcpy in glibc.
There are *a lot* of cases to handle. Most of the time the compiler
writers just let the runtime/intrinsics deal with those cases rather
than trying to figure out how to make the code generator "do the right
thing".
Which is why for copying, setting, or otherwise manipulating memory (in C)
I use the ANSI C functions memmove(), memcpy() and memset() (or the string
equivalents) rather than rolling my own routines to do the same. The compiler can
then "see" what I'm trying to do and figure out what's going on, and I
almost never try to second guess the compiler (I write the code, then if
it's important, generate the assembly to see what the compiler is doing).
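To make that concrete, here's a rough sketch of the difference. The
function names and buffers are made up for illustration; only memcpy(),
memmove() and memset() are the real ANSI C calls. Something like
"gcc -O2 -S" on a file like this is one way to see what the compiler
actually does with each version.

```c
#include <stddef.h>
#include <string.h>

/* Hand-rolled byte copy (hypothetical helper). Correct, but the
 * compiler has to rediscover that this loop is a plain memory copy
 * before it can do anything clever with it. */
void copy_by_hand(unsigned char *dst, const unsigned char *src, size_t n)
{
  size_t i;

  for (i = 0; i < n; i++)
    dst[i] = src[i];
}

/* Same job via the standard library. The intent is stated outright,
 * so the compiler or the runtime can pick the strategy. */
void copy_by_library(unsigned char *dst, const unsigned char *src, size_t n)
{
  memcpy(dst, src, n);
}

/* Shift the contents of buf left by `shift` bytes and zero the tail.
 * Source and destination overlap, so this needs memmove(), not
 * memcpy(). (Hypothetical helper, just to show all three calls.) */
void shift_left(unsigned char *buf, size_t len, size_t shift)
{
  if (shift >= len)
  {
    memset(buf, 0, len);
    return;
  }

  memmove(buf, buf + shift, len - shift);
  memset(buf + (len - shift), 0, shift);
}
```

Both copy routines leave the same bytes in the destination; the only
difference is how much the compiler is told up front.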
It also helps to know the language. For instance:
struct foo a;
struct foo b;
b = a; /* Legal under ANSI C! */
GCC can assume (since it's the one doing the compiling) that both a and b
are aligned for best access, and it can also pad out struct foo to be a
multiple of, say 4 bytes (on a 32-bit system) so the assignment (and yes,
you can now do structure assignment in ANSI C) can be done with a "rep
movsl" (which it did when I tested it). I could have done:
struct foo a;
struct foo b;
memcpy(&b,&a,sizeof(struct foo));
and GCC would probably have generated the same sequence. Since memcpy() is
defined by ANSI C, the compiler has some leeway to reason about the
parameters to memcpy() (in this case, a and b are aligned and the size is a
known multiple of the word size, so it can inline the copy as a "rep
movsl"). I like the first sequence though, since the intent is a bit clearer.
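For anyone who wants to try this at home, a minimal sketch of the two
forms side by side. The layout of struct foo here is invented (I never
showed the real one), and the two function names are just for
illustration. Whether either one becomes a "rep movsl" depends on the
compiler, the target and the optimization level, so check the generated
assembly rather than taking my word for it.

```c
#include <string.h>

/* Hypothetical layout; any struct will do. */
struct foo
{
  int    id;
  char   name[12];
  double value;
};

/* Structure assignment: the compiler knows the type, size and
 * alignment of both objects. */
void copy_by_assignment(struct foo *dst, const struct foo *src)
{
  *dst = *src;
}

/* The memcpy() form: note that it takes the ADDRESSES of the
 * structures, and the size has to be spelled out by hand. */
void copy_by_memcpy(struct foo *dst, const struct foo *src)
{
  memcpy(dst, src, sizeof(struct foo));
}
```

Either way, every member of the destination ends up equal to the
corresponding member of the source.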
But yes, the generalized memcpy() in glibc is rather complicated, trying
to figure out (at run time) the best way to copy memory (I seem to recall
discussion about this several years ago in comp.lang.asm.x86 (I think that's
the group---it's been a while) that with Pentium class machines, you may get
better performance using the floating point registers to copy memory).
Actually most compilers still do a pretty poor job of optimization and
the windows they use for peep-hole optimization appear to be way too
small to do anything really useful.
I know the IRIX C compiler can do global optimizations but that takes
quite a bit of time and processing power; I never bothered with it when I
was programming under IRIX. And this was at least 10 years ago so it's
marginally on topic here 8-P
-spc (been programming in ANSI C for over ten years ... )