On Nov 20, 2021, at 11:15 AM, Jon Elson via cctalk
<cctalk at classiccmp.org> wrote:
> On 11/20/21 1:30 AM, Joerg Hoppe via cctalk wrote:
>> Hi Friends,
>> Micro fiche scans of the PDP-11 XXDP listings are online now:
> Wow, took a quick look. The scans are likely not good enough to run through an OCR
> program, but certainly good enough to read through when trying to understand what a
> program is doing.
I only tried tesseract once, years ago, and it wasn't useful at all for the particular
material I gave it. Quite possibly it's better now.
Instead, I ended up buying a commercial OCR program, "FineReader" from ABBYY,
which has served me well. I used it on CDC 6600 wire list scans, and it handled those nicely.
I also tried it on the THE source listings in the Knuth archive; those are
hopeless for OCR, partly due to the overprinting convention used, and required manual
entry.
So... it might be worth trying some of those images with current commercial OCR
programs. FineReader has a "learn" capability that does a decent job of adapting
it to the peculiarities of a particular piece of source material.
paul