[cctalk] Re: Elliott Algol

17 Feb 2025

On Mon, Feb 17, 2025 at 09:11:17AM -0500, Paul Koning via cctalk wrote:
[...]
...
  	int f1(int i) {
  	  int j;
 	  int f2(int x) { int y; y = j*2; ... f1(x+1); }
 	  f2(...);
 	}
 	f1(42);

 The local variables go into stack frames, one for each call of each
 function. So when recursive calls are made as in this example, there are
 multiple stack frames belonging to calls of f1 and of f2. Each f1 frame
 has a j in it. So how does the code for f2 find "the right f1 frame" to
 resolve the reference to j that I showed?

 The answer is to have a display, which is a vector of pointers, indexed by
 "static scope", in other words by textual nesting level. In this case that
 vector would have three entries: one for the program, f1, and f2. When a
 call to f1 is made, the stack frame is created. In the stack frame the
 previous value of the display entry for f1 is saved, and that entry is
 then set to the stack frame address. On function return, the previous
 display entry is restored.

 To resolve the reference to j in f2, the code would load the display entry
 for f1 (the second entry in the display vector), and add the offset to j
 in the stack frame. 
That's the very old-school approach for weird CISC architectures which were
mainly programmed in assembly language or with naivee compilers. It's most
definitely not a *good* way to do it, performance-wise. So of course x86
supports it. When I first encountered x86's ENTER and LEAVE instructions, I
thought "that's on crack". Being x86, it almost never removes features and
often even doubles-down, and so this still works and is even supported in
64-bit long mode.

Although the exact terminology and implementations can vary by language,
these days that inner function would normally be called a "closure" which
"captures" variables from the outer function. In your example, j but not i
is captured by f2, and so f2 doesn't need access to all of f1's stack frame,
just its j.

I wrote a contrived test program in Rust and inspected the machine code, and
what it did was to pass the address of j as a hidden parameter to f2. This
meant that f2 already had that address in a register and didn't need to do
an extra memory access to find it. The nested call used the normal CALL/RET
instructions which do not do the slow and complex microcoded operations on
stack frames that ENTER does.

How slow are we talking? ENTER/LEAVE is also at least order of magnitude
slower than CALL/RET on modern x86. On my Haswell, CALL is ~2 cycles,
whereas ENTER is ~80. It's a bit better on e.g. Zen 4 at ~20 cycles, but
still not great.

However, I'd contrived the code so that it didn't get optimised away. Since
j is not modified by the code (unless your "..." elision does it), the
compiler could have passed the *value* of j as a parameter as the changed
value doesn't need to be propagated back up the call stack, saving another
memory access. Or most likely it'd have noticed that f2 was a good candidate
for inlining and now f1 and f2 share the same stack frame and can trivially
access each other's variables.

ENTER/LEAVE might have perhaps been more useful on the 286 and mitigated
some of its many performance problems in protected mode, but I'd rather just
use pretty much anything else, possibly even pen and paper, in preference to
segmented x86.

2025

2024

2023

2022

[cctalk] Re: Elliott Algol