My suspicion is you are seeing cache line conflicts within your
code, such that code/data accesses keep pushing valid data
out of your cache. When your code fits, and the address does
not collide with your data accesses, it runs 'fast'. When you
get address collisions that cause cache invalidate/replacement,
it runs slow.
A little bit of research indicates:
Model   Cache Size   Set Size   Block Size   Type
11/44   8KB          1          2B           Unified I/D
11/60   2KB          1          2B           Unified I/D
11/70   2KB          2          4B           Unified I/D
11/73   8KB          1          2B           Unified I/D
As you can see, the 11/73 has a very simple direct-mapped cache.
Any two memory locations that have the same low 13 physical address
bits will map to the same cache location, and contend for that slot.
By varying how your loop is unrolled, and the placement of your code
and data in memory, you can either allow thrashing to never occur
or always occur. Since the PDP-11 caches were all unified I/D caches
(not separate I/D caches like modern processors) and have very
simplistic set organizations (direct mapped, except for 11/70) thrashing
is very easy to achieve (either on purpose or by accident).
For your code, you want to guarantee that no code loops or data buffers
are separated by a multiple of 8KB in physical address (the caches are
physically tagged).
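The index arithmetic for a direct-mapped cache can be sketched in C
(a hypothetical model for illustration, not DEC's implementation; the
8KB / 2-byte figures come from the table above):

```c
#include <stdint.h>

#define CACHE_BYTES 8192u   /* 11/73: 8 KB, direct mapped */
#define BLOCK_BYTES 2u      /* 2-byte (one word) blocks */
#define NUM_LINES   (CACHE_BYTES / BLOCK_BYTES)   /* 4096 lines */

/* A direct-mapped cache uses part of the physical address as the
   line index, so two addresses that are a multiple of 8KB apart
   always map to the same line and evict each other. */
unsigned cache_line(uint32_t paddr)
{
    return (paddr / BLOCK_BYTES) % NUM_LINES;
}
```

So an instruction fetch and a data access whose physical addresses
are congruent mod 8KB contend for the very same line on every pass
through the loop.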
Jerome H. Fine wrote:
Hi All,
I am not sure if there is still sufficient interest, however, I have
encountered a rather bizarre timing situation when a specific
subroutine is executed on my PDP-11/73.
Background: I want to calculate the reciprocal (inverse) of a
number between 1 and 65535 to an accuracy of 256 bits after
the binary (or decimal) point. This result will then be used to
calculate the logarithm of that value which in turn will be used
to calculate li(x) for values of x up to 10**38. The code is in
FORTRAN 77 (could also be in FORTRAN IV) which calls
MACRO-11 code to do the really low level stuff such as
repeated addition of many words with MANY Adc instructions
following each addition (99.9% of the time the Adc instructions
are never used).
Preliminary Details: The subroutine first converts the INTEGER * 4
to a REAL * 8 to produce:
R8ARG = I2ARG
R8RCP = 1.0D0 / R8ARG
as the initial 56 bit accurate estimate.
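As an aside, the 256-bit fractional reciprocal being targeted can
also be produced directly by schoolbook long division, 16 bits per
step. The following C sketch is my own illustration of the target
value, not the poster's FORTRAN/MACRO-11 method; it assumes
2 <= n <= 65535 so the result is a pure fraction:

```c
#include <stdint.h>

/* Compute 1/n to 256 fractional bits as sixteen 16-bit words,
   most significant word first, by long division: at each step,
   bring down 16 zero bits of the dividend and divide by n. */
void recip256(uint32_t n, uint16_t frac[16])
{
    uint32_t rem = 1;            /* dividend starts as 1.0 */
    for (int i = 0; i < 16; i++) {
        rem <<= 16;              /* append 16 fraction bits */
        frac[i] = (uint16_t)(rem / n);
        rem %= n;                /* carry the remainder down */
    }
}
```

For n = 3 every word comes out 0x5555 (the repeating binary
fraction 0.0101...), which is handy for checking a hand-rolled
multi-precision routine.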
The subroutine then continues: I must convert R8RCP to a:
REAL * 64 = INTEGER * 32 / FRACTION * 32
which is actually quite simple since the R8RCP value already
contains the exponent EXP (as in 2 ** EXP), which is the number
of bits to be shifted right after the 56-bit R8RCP fraction is unpacked.
Next the unpacked 56 bits are moved by a whole number of
words for each multiple of 16 bits to be right shifted. Completion
is done by shifting right a group of 5 words when the last portion
of the total shift is less than 16 bits of shifting.
Unpacking uses the instructions:
Mov (R0)+,-(R1)
Word shifts use the following pair of instructions four times:
Mov (R0),-(R1)
Clr (R0)+
The final shifting (if needed) uses five instructions of:
Ror -(R0)
as many times as are required (1 to 15 times).
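The whole-word-then-Ror scheme described above behaves like this C
sketch (the names and the 5-word width are my illustrative
assumptions; on the PDP-11 the residual 1-to-15-bit shift is done by
repeating the Ror chain, which propagates the bit falling out of each
word into the next through the carry bit):

```c
#include <stdint.h>

#define NW 5   /* five 16-bit words, w[0] most significant */

/* Right-shift a multi-word value: first move whole words (the
   Mov / Clr pairs), then shift the residual 0-15 bits, carrying
   the bits that drop out of each word into the word below it
   (the role of the Ror chain and the C bit). */
void shift_right(uint16_t w[NW], unsigned bits)
{
    unsigned words = bits / 16, b = bits % 16;
    for (int i = NW - 1; i >= 0; i--)           /* word moves */
        w[i] = (i >= (int)words) ? w[i - words] : 0;
    if (b) {
        uint16_t carry = 0;
        for (int i = 0; i < NW; i++) {          /* bit shifts */
            uint16_t out = (uint16_t)(w[i] << (16 - b));
            w[i] = (uint16_t)((w[i] >> b) | carry);
            carry = out;
        }
    }
}
```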
The problem is as follows: When the destination argument
address is in PAR0 (address is less than 17600), there is
no problem. When I use an argument address in PAR1
(address is above 21000), there is a substantial speed reduction
in the subroutine. It takes an average of 1611 microseconds
for each of the 65535 cases when the destination address is
in PAR0 (specifically 15236), while it takes an average of 6055
microseconds for each of the 65535 cases when the destination
address is in PAR1 (specifically 21000), or about 3.75 times as
long.
What is EXTREMELY interesting is that the above timing is also
repeated under E11, although everything is over TWENTY times
as fast. Under E11, it takes 65 microseconds using PAR0 and
230 microseconds using PAR1 with the identical program.
I run using RT-11 with RT11XM for all testing, although I did
check with RT11FB under E11 with almost the same timing results.
I could also check with RT11FB on the real DEC PDP-11/73
if anyone thinks it might be worthwhile.
-----------------------------------------------------------------
HOT FROM THE COMPUTER - A BIT MORE DETAIL!
At first I was most suspicious of the Mov / Clr pairs since they
are the code that is extra. HOWEVER, if I eliminate the last loop
with the five Ror instructions, then (although the answer is incorrect -
up to 2 ** 15 too large without the final shifts) the timing is the same
for both PAR0 and PAR1 arguments. But this result contradicts the
earlier version of the code which uses ONLY Ror instructions (a total
of 16 Ror instructions are needed to span the 32 bytes in question) to
accomplish all the shifting. Because the timing for the earlier version
of the code (which is still present, just not used when the branch is
taken to the "faster??" version) can now be compared by changing
the branch instruction to use the less efficient code, I am still able to
compare the timing for both versions - and the timing for the original
less efficient version (which still includes the four unpack instructions
and many more Ror instructions) is still the same as before for both
PAR0 and PAR1 destination arguments and equal to each other.
SO I AM STUMPED!! Even when I turn the five-Ror instruction loop
into five Nop instructions (as opposed to skipping the loop entirely -
the code checks the remaining shift count, so pretending the remaining
shift count is always zero to skip the final loop was easy), the timing
test with a PAR1 destination argument takes longer even though
NOTHING is being done! In fact, using a PAR1 destination argument
with the five-Nop instruction loop still takes more than half of the
extra time (i.e. much more time than for the PAR0 destination argument
which still uses the five-Ror instruction loop).
----------------------------------------------------------------------
HOTTER FROM THE COMPUTER - STILL MORE DETAILS!
To my surprise, there is considerable variation even when different PAR0
destination addresses are used. Since the portion of the code up to the
final right shifting within 5 words is all that can be different, I am
even more mystified. However, after running 3 different code versions at
3 different addresses (2 in PAR0 and 1 in PAR1) on both a real DEC
PDP-11/73 and under E11 on a Pentium III 750 (at over 20 times the
speed of the PDP-11/73), the results are consistent. When I gradually
increase the PAR0 destination argument address right up to the end of
PAR0, the timing gradually decreases. When the address is in PAR1
(all or none - the field is 64 bytes long, so the last example in PAR0
is 17600), the average time suddenly jumps, but only for the "more
efficient" code version.
Since the results of all calculations are correct (verified against a
version which uses the less efficient code), I am now totally mystified.
I have never seen this occur before in all of the 45 years I have been
using computers.
Sincerely yours,
Jerome Fine
--
If you attempted to send a reply and the original e-mail
address has been discontinued due to a high volume of junk
e-mail, then the semi-permanent e-mail address can be
obtained by replacing the four characters preceding the
'at' with the four digits of the current year.