Hi, Walter.
"Walter F.J. Mueller" <w.f.j.mueller at gsi.de> wrote:
Johnny Billquist <bqt at softjar.se> wrote:
> "Walter F.J. Mueller"
<W.F.J.Mueller at gsi.de> wrote:
> I've also implemented a PDP-11 on an FPGA. It is a full 11/70 with
> split I&D, MMU and cache. No FPP so far. Available peripherals so far
> are DL11, LP11, KW11L, PC11, and RK11. All I/O is channeled via a
> 'remote-register-interface' onto a single bi-directional byte-stream
> interface, so the FPGA board needs a backend PC with a server program
> to handle the I/O requests.
Any plans on the FPP? It would be really nice and
useful to have.
Hi Johnny,
sure, an FPP is on the 'todo-list', but it doesn't have the highest
priority. After having put the first version on OpenCores I'd like to
add a trace/debug unit (allowing hardware breakpoints, etc.), and add
a few more peripherals, especially larger disks. Currently I have
only an RK11 controller, good enough for proof-of-principle, but
not enough for real usage.
Disks are definitely a good thing. And as usual, I'll advocate MSCP.
Even though it's not the simplest, it's simply just the best. :-)
But FPP is among the most important things in there as well, I'd say.
Lots of software that won't be happy without it.
As for traps
and double errors, feel free to ask. I don't know if I have
all the answers, but I might be able to figure them out. Besides, I also
have access to one (or three) functional 11/70 machines.
I've tested much of the implementation against simh and xxdp's, but there
are still a few loose ends regarding corner cases. It would be great to run a
few test programs on simh, a real 11/70, and my FPGA implementation,
now called w11a.
Feel free to talk with me more offlist, and we can see when we can
schedule some testing on real hardware.
The 11/53 is a
really slow machine. Not that helpful to compare with. But you
seem to push a nice number anyway. But 50 MHz... The J11 in an 11/9x machine
runs at 20 MHz, which would suggest that you should only be able to push about
2.5 times the performance, unless you do some more clever tricks.
(The 11/9x machine runs all memory as cache.)
I know, but the 11/53 is the only PDP-11 for which I know the Unix Benchmark
and thus the Dhrystone results, so it became the reference.
Even though my implementation is quite different from the organization of
the original 11/70, it has essentially the same instruction timing as an
11/45 or 11/70 when expressed in clock cycles. The 11/45 and 11/70 CPUs
ran with a 150 ns clock period (ignoring clock stretching here), thus a
6.7 MHz clock. A register-register operation takes 2 cycles, a
"mov r0,(r1)+", for example, 5 cycles.
Because the CPI (cycles per instruction) for the 11/70 and the w11a is very
similar and both have a good cache, the w11a should simply be 50/6.7, or a
factor of about 7.5, faster than an 11/70.
The 11/70 and the w11a have some pipelining: instruction fetch and
decode/operate can overlap for register-destination instructions. The
J11 is more pipelined; there the fetch, decode, and operate stages can
all overlap. Register-register instructions therefore take 1 cycle in the
best case, a "mov r0,(r1)+", for example, 3 cycles.
As a result, a 50 MHz w11a will not be 2.5 times faster than a 20 MHz J11,
maybe just 1.5 times faster. The w11a is intentionally implemented in a
quite simple and conservative way; the prime goal was to get it right and
working, not to get it fast.
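
Written out as a quick back-of-envelope calculation (C just for concreteness;
the only real inputs are the clock rates and cycle counts quoted above, and
the "mov r0,(r1)+" mix is of course only one data point):

#include <stdio.h>

int main(void)
{
    /* Cycle counts quoted above for "mov r0,(r1)+":
     * 11/70 and w11a take 5 cycles, the more pipelined J11 takes 3. */
    double mhz_1170 = 6.7, mhz_j11 = 20.0, mhz_w11a = 50.0;
    double cyc_1170 = 5.0, cyc_j11 =  3.0, cyc_w11a =  5.0;

    /* Throughput for this instruction = clock / cycles-per-instruction. */
    double ips_1170 = mhz_1170 / cyc_1170;   /* ~1.3 */
    double ips_j11  = mhz_j11  / cyc_j11;    /* ~6.7 */
    double ips_w11a = mhz_w11a / cyc_w11a;   /* 10.0 */

    printf("w11a vs 11/70: %.1f\n", ips_w11a / ips_1170);  /* ~7.5 */
    printf("w11a vs J11:   %.1f\n", ips_w11a / ips_j11);   /* ~1.5 */
    return 0;
}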
Good thinking.
But I'm surprised by some numbers here.
The J11 at 20 MHz is only slightly faster than an 11/70. In fact, if you
can throw the 11/70 into running all from cache, it might even be
slightly faster than an 11/9x.
Or so I seem to remember from looking at the numbers back when I was last
digging into this.
Maybe I'm mixing some numbers up here... What I do remember for sure is
that the 11/9x machines run at 20 MHz, and that they are not more than
maybe 1.2 times the speed of an 11/70 in general.
At some later time maybe I'll try a really fast
design, with separate
instruction and data caches and significantly more parallelism than
the J11 had.
Hmm. I wonder if that might cause headaches? There might be code out
there that requires your i-cache and d-cache to be consistent with each
other.
IIST is needed
for RSX to be happy (the only OS that supports the 11/74),
and you also need to implement parts of the memory bus behaviour with
interlocking. You can ignore the MK11 box CSRs, even though it will look
a little funny, but you do need separate DL11s for each CPU core, along with
the rest of the I/O bus, or else things will probably not work. The 11/74
is a shared-memory machine, but the I/O bus is not shared.
I'm fully aware of this; the MP version will have one I/O bus per CPU,
a shared memory with an asrb interlock, and caches with proper cache coherency.
Yes. But what I was thinking of was the fact that at a level below this,
you have the CPU issuing read-modify-write cycles to memory, and
those need to be interlocked at the memory.
At a higher level, the 11/70 was modified for asrb to always bypass
cache, and then you had two other ways to bypass cache as well. But
bypassing cache is only half the problem, as you also need to make sure
some memory operations are atomic, as seen from other CPUs.
But if you know a thing or two about cpu and memory design (which it
would appear you do), then you probably understand the problem already.
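
To make that concrete, here is a generic C sketch of an interlocked
test-and-set lock; it is not the actual asrb sequence or anything from RSX
or the w11a, and the names are made up, but it shows why the
read-modify-write on the lock byte has to be a single interlocked memory
operation:

#include <stdatomic.h>

/* A one-byte-style lock in the spirit of the asrb locks on the '74.
 * If CPU A's read and write of the lock could be split by CPU B's,
 * both CPUs could end up "owning" the lock at the same time. */
typedef atomic_flag lock_t;
#define LOCK_INIT ATOMIC_FLAG_INIT

static void lock_acquire(lock_t *l)
{
    /* The test-and-set is the interlocked RMW; on the real machine
     * this is exactly where the memory interlock has to happen. */
    while (atomic_flag_test_and_set(l))
        ;   /* spin until the current owner releases it */
}

static void lock_release(lock_t *l)
{
    atomic_flag_clear(l);
}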
By the way, you don't have to worry about cache coherency. The PDP-11/74
does not do that. Cache coherency is managed by software on the PDP-11
(well, in RSX, since that's the only system that supports the hardware).
In short, the real hardware does not implement any sort of cache coherency.
It's true that RSX is the only OS that supports an
11/74. Unfortunately I
don't have an RSX-11M-Plus license. So the plan is to patch 2.11BSD to support
an MP system. Sounds like a long shot, but looking into the kernel sources
I concluded that funneling or 'big kernel lock' style MP support seems to
be quite feasible. It will not scale well, but for a 'dual-core' this is
likely good enough.
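
Just to show the shape of the idea, a minimal sketch of such a funnel in C;
none of these names exist in 2.11BSD, and the real thing would of course
live in the assembler trap/interrupt glue:

#include <stdatomic.h>

/* Hypothetical "big kernel lock" for a dual-CPU 2.11BSD: every entry
 * into the kernel (syscall, trap, interrupt) takes one global lock,
 * so at most one CPU executes kernel code at any time. */
static atomic_flag kernel_lock = ATOMIC_FLAG_INIT;

void kernel_enter(void)            /* on syscall/trap/interrupt entry */
{
    while (atomic_flag_test_and_set(&kernel_lock))
        ;                          /* spin until the other CPU leaves */
}

void kernel_exit(void)             /* just before return to user mode */
{
    atomic_flag_clear(&kernel_lock);
}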
It's definitely doable. However, it is not that simple.
The reason why DEC chose RSX as the OS for implementing multi-processor
support is that it does not, in general, use interrupt priority levels
to serialize access to data, protect sections of code, or implement locks.
Unix does. So, in short, everywhere the interrupt priority is
changed, you potentially need to change the code, since another
processor might still get interrupts at that level and do things you
thought you had locked out.
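
Schematically, the kind of region I mean looks like this; splbio()/splx()
follow the usual BSD convention, but the buffer structure, the lock and the
helper are all hypothetical, not 2.11BSD source:

#include <stdatomic.h>

struct buf { int b_flags; };            /* stand-in, not the real struct buf */
#define B_BUSY 0x1

extern int  splbio(void);               /* raises the IPL on *this* CPU only */
extern void splx(int s);                /* restores the previous IPL         */

static atomic_flag buf_lock = ATOMIC_FLAG_INIT;

static void mark_buffer_busy(struct buf *bp)
{
    int s = splbio();                   /* enough on a uniprocessor...       */
    while (atomic_flag_test_and_set(&buf_lock))
        ;                               /* ...but another CPU ignores our IPL,
                                           so an MP kernel also needs a lock */
    bp->b_flags |= B_BUSY;
    atomic_flag_clear(&buf_lock);
    splx(s);
}

That, repeated for every spl-protected region in the kernel, is where the
work is.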
But I see your problem. It would be great if we could solve the
situation with RSX at some point...
Johnny