[cctalk] Re: OCR and line printers

28 Nov 2025

I too have used Fine Reader (paid for, but the price is not all that high) to OCR various
listings.  It's way better than the Acrobat OCR, at least the one I tried ages ago on
the Ethernet spec (back before Bitsavers, or at least before I found it).  The learning
feature is very good, and important when dealing with low quality input.  You can also
tell it only to recognize what you taught it, i.e., not try to match any built-in patterns.
If you're doing line printer listings with 64-character sets, that's helpful;
otherwise it may mistake something blurry for a pound-sterling sign.
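
If your OCR tool can't be restricted that way, a rough equivalent is to flag anything outside the printer's repertoire after the fact.  A minimal Python sketch, assuming (and this is only an assumption, check your printer's actual character set) the 64-character ASCII subset from space through underscore; the name find_suspects is made up for illustration:

```python
# Assumption: the printer used the 64-character ASCII subset 0x20..0x5F
# (space, digits, punctuation, upper case letters).  Adjust ALLOWED to
# match the real print train or chain.
ALLOWED = {chr(code) for code in range(0x20, 0x60)}


def find_suspects(text):
    """Return (line, column, character) for every character the printer
    could not have produced.  Line and column numbers start at 1."""
    suspects = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch not in ALLOWED:
                suspects.append((line_no, col, ch))
    return suspects


if __name__ == "__main__":
    sample = "LDA X\nSTA \u00a3Y"   # the OCR slipped in a pound-sterling sign
    for line_no, col, ch in find_suspects(sample):
        print(f"line {line_no}, column {col}: unexpected {ch!r}")
```

This only tells you where to look; deciding what the character should have been is still a job for a human or a pattern-specific fix.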
In some cases OCR just can't hack what it is given and the only option is to type it
all in again.  I've done that with some listings that were ugly enough I sometimes had
to zoom in to have one letter take up most of the screen, just to figure out what it was.
In that particular case, it also used a lot of overprinting to deal with mixed case text:
upper case letters represented lower case, upper case with dot overprint for actual upper
case.  You'd think that OCR training could handle that, but the difference is subtle
enough it doesn't really work.
Supposedly current versions of the open source program Tesseract are pretty good, but I
haven't tried it.  Looking into how to train it left me confused; it didn't seem
to be at all convenient, unlike the interactive training feature of
Fine Reader.
OCR likely will not handle fixed layout well (not unless you can treat it as tables).  If
that's important, some Python or Emacs post-processing can clean up a lot.  The same goes
for common recognition errors you can spot by pattern matching.  Scanning the CDC
6600 wire lists goes well this way, because the data have a very consistent pattern.  For
example, the OCR might mix up zero and oh, but an edit pass can fix those 100%.
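
As a sketch of what such an edit pass might look like in Python: the rule below is a made-up illustration, not the actual wire list format.  It assumes that any token made up only of digits and the letter O, with at least one real digit in it, was meant to be a number.  The name fix_numeric_tokens is likewise just for illustration.

```python
import re

# Hypothetical rule: a whole token consisting only of digits and capital O,
# containing at least one digit, is a number with oh/zero confusion.
# Tokens with no digit at all (e.g. "OO") and tokens containing other
# letters (e.g. "LOOP", "1O4A") are left alone.
NUMERIC_TOKEN = re.compile(r"\b(?=[0-9O]*[0-9])[0-9O]+\b")


def fix_numeric_tokens(line):
    """Replace letter O with digit 0 inside tokens that look numeric."""
    return NUMERIC_TOKEN.sub(lambda m: m.group(0).replace("O", "0"), line)


if __name__ == "__main__":
    print(fix_numeric_tokens("LOOP 1O4 O7 OO"))
```

With data as regular as a wire list you can usually tighten the rule further, for instance by applying it only to the columns known to hold numbers.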
        paul
...
  On Nov 28, 2025, at 10:56 AM, David Wade via cctalk
<cctalk(a)classiccmp.org> wrote:
 I have a copy of ABBYY FineReader Pro which I got free on a magazine many years ago.
 If it reads a character incorrectly you can add to the image <=> character map so
it can adapt for example to a damaged slug on a line printer train or other type element.
 It's not 100% but I used it to scan the IBM1130 CSMP from the manual....
 Dave
 On 28/11/2025 14:57, Guy Fedorkow via cctalk wrote:
  Greetings Restorers,
   I think a number of us have wanted to restore software that's only available as a
scanned listing from a line printer.  The original printout probably wasn't the best
typographic quality, and scanning doesn't improve it.
   As a first pass, OCR with tools like Adobe Acrobat can easily produce a rough draft of
the content in text form, but it takes almost as much work to correct the many
"typos" as it does to simply re-type the listing.
   It seems like, with all this high-tech AI processing around, it should be possible to
take advantage of the limited character set, fixed fonts, and restricted grammar that one
might find in a listing to resolve more of the ambiguities in character recognition.
   Does anyone have an approach that's more efficient than generic OCR and a long
process of correcting typos on every line of code or comment?
   Thanks
 /guy
