On May 11, 2023, at 12:12 PM, Marc Howard via cctalk
<cctalk(a)classiccmp.org> wrote:
Marc Howard <cramcram(a)gmail.com>
May 10, 2023, 8:58 PM
to cctalk-owner
I have some listings I want to convert to ASCII. They're line printer
output from a computer that existed from the mid-sixties to the early
seventies (Adage AGT series).
I can't find any OCR package that can take scanner output (either PDF or
JPEG) and convert it to text with roughly the same number of spaces between
words as was there originally.
Seems like it would be an easy task. The input is non-proportional text
from line printer output (actually it might have been printed on a Diablo
HyType). And yet all I get is most of the characters, with either no
spacing or single spacing between words. And it misses quite a few of the
scanned characters at that.
Tesseract supposedly can do this. There's a Tesseract fork, I don't remember the
name, that was tweaked specifically for listings. I believe it was a Japanese project.
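For stock Tesseract, the setting I would try first is the
preserve_interword_spaces config variable, together with page segmentation
mode 6 (treat the page as a single uniform block of text). Here is a minimal
sketch of driving that from Python. The function names and file names are my
own placeholders, it assumes the tesseract binary is on your PATH, and I have
not tested it against listings like yours, so treat it as a starting point
rather than a known-good recipe.

```python
import subprocess


def tesseract_command(image_path, output_base):
    """Build a Tesseract CLI command that tries to keep inter-word spacing.

    image_path and output_base are placeholder names for illustration.
    """
    return [
        "tesseract", image_path, output_base,
        "--psm", "6",                         # assume one uniform block of text
        "-c", "preserve_interword_spaces=1",  # keep runs of spaces between words
    ]


def ocr_listing(image_path, output_base):
    """Run Tesseract on one page image; needs the tesseract binary on PATH.

    Tesseract writes its text output to output_base + ".txt".
    """
    subprocess.run(tesseract_command(image_path, output_base), check=True)
    return output_base + ".txt"
```

Calling ocr_listing("page001.png", "page001") would run one scanned page
through Tesseract. How well the spacing survives will depend on the scan
quality and the Tesseract version.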
I often use ABBYY FineReader, which does a good job with tough source material and has a
good training feature. It will not lose spaces entirely, but as you said, it does
collapse multiple spaces. For dealing with listings of structured material, like
assembler output listings, I found that telling the program to interpret the page as
tabular material works well. That (usually) preserves line endings, which is also
important, and it breaks the material up into columns so at least you can do a
"pretty printer" type of cleanup on the rows of "table fields" that
result.
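To give an idea of the kind of cleanup I mean, here is a rough sketch in
Python. It assumes the OCR table export has already been read in as a list of
rows, each row being a list of field strings. That input shape, the function
name realign and the sample assembler fields are all made up for
illustration. It pads every field to the widest entry in its column, which
gives aligned columns but not necessarily the original column positions.

```python
def realign(rows, gap=2):
    """Pad each field to its column's widest entry so the columns line up.

    rows is a list of lists of strings (one inner list per listing line).
    gap is the number of spaces placed after the widest entry of a column.
    """
    if not rows:
        return []
    ncols = max(len(row) for row in rows)
    widths = [0] * ncols
    for row in rows:
        for i, field in enumerate(row):
            widths[i] = max(widths[i], len(field.strip()))
    lines = []
    for row in rows:
        cells = [field.strip().ljust(widths[i] + gap)
                 for i, field in enumerate(row)]
        # Drop the padding after the last field on the line.
        lines.append("".join(cells).rstrip())
    return lines


# Hypothetical fields as they might come out of a "table" OCR pass.
sample = [
    ["0100", "START", "LDA", "COUNT"],
    ["0101", "", "ADD", "ONE"],
    ["0102", "LOOP", "JMP", "START"],
]
for line in realign(sample):
    print(line)
```

An empty field (like the missing label in the second row) still takes up its
column, so the opcodes and operands stay lined up.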
paul