An issue dear to my heart - I have quite a quantity of documentation here to scan so I did
quite a bit of homework and testing on this. It may not exactly be your issue, but I hope
it helps.
For me, the issue was the quality of the documentation - print back in those days was
quite variable of course and over time documents may have deteriorated.
I find Adobe Acrobat the best tool, using the ClearScan method for OCR. Most OCR tools
working with PDF place a searchable image or overlay in the document. This can bloat file
size and is not the same as ClearScan - retaining reasonable file sizes was one of my
criteria. I specifically tested this and found ClearScan documents to have smaller file
sizes than those produced by other OCR methods. In my experience ClearScan also tended to
improve the quality of the type while faithfully preserving it.
For documents that I obtained as PDFs that ran into trouble being processed like this, I
found that exporting the file to TIFF and then creating a new PDF from the TIFFs worked
best (doing this is like dry cleaning for PDF).
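For anyone without Acrobat, the same round trip can be roughly sketched with command-line tools. This is not the Acrobat export described above; it assumes Poppler's pdftoppm and the img2pdf utility are installed, and the function names, the "page" prefix and the 300 dpi default are only illustrative.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf, prefix, dpi=300):
    # pdftoppm (Poppler) renders each page to its own TIFF: <prefix>-1.tif, ...
    return ["pdftoppm", "-tiff", "-r", str(dpi), str(pdf), str(prefix)]


def rebuild_cmd(tiffs, out_pdf):
    # img2pdf wraps the page images in a fresh PDF container
    return ["img2pdf", *[str(t) for t in tiffs], "-o", str(out_pdf)]


def dry_clean(pdf, out_pdf, workdir, dpi=300):
    """Flatten a troublesome PDF to page images, then build a new PDF."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, workdir / "page", dpi), check=True)
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order
    tiffs = sorted(workdir.glob("page-*.tif"))
    subprocess.run(rebuild_cmd(tiffs, out_pdf), check=True)
    return out_pdf
```

The rebuilt PDF contains only page images, so it still has to go through OCR afterwards.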
Downside - Acrobat is probably the most expensive of the PDF tools out there.
This might help explain it a bit better:
https://acrobatusers.com/tutorials/better-pdf-ocr-clearscan-smaller-looks-b…
There are some open-source alternatives that use a similar approach to ClearScan but I
have not specifically tested or evaluated them viz:
https://github.com/ncraun/smoothscan
Hope this helps!
Kevin Parker
-----Original Message-----
From: Marc Howard via cctalk <cctalk(a)classiccmp.org>
Sent: Friday, May 12, 2023 2:13 AM
To: General Discussion: On-Topic Posts Only <cctech(a)classiccmp.org>
Cc: Marc Howard <cramcram(a)gmail.com>
Subject: [cctalk] Are there any useful OCR programs for scanning old listings and
producing text with proper formatting
I have some
listings I want to convert to ASCII. They're line printer output from a computer that
existed from the mid-sixties to the early 70's (Adage AGT series).
I can't find any OCR package that can take scanner output (either PDF or JPEG) and
convert it to text with roughly the same number of spaces between words as was there
originally.
Seems like it would be an easy task. The input is non-proportional text from line printer
output (actually it might have been printed on a Diablo HyType). And yet all I get is
most of the characters with either no or single spacing between words. And it misses
quite a few of the scanned characters at that.
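Because the text is fixed-pitch, the original spacing could in principle be recovered from word positions alone. Here is a minimal sketch of that idea, assuming the OCR engine can report the left-edge x coordinate of each word and that the character pitch in the same units is known. The function name and input format are hypothetical and not tied to any particular OCR package.

```python
def rebuild_line(words, char_width, left_margin=0.0):
    """Rebuild one printed line from (x_left, text) pairs using a fixed pitch."""
    out = ""
    for x, text in sorted(words):
        # Map the word's left edge to the nearest character column
        col = round((x - left_margin) / char_width)
        # Keep at least one space between words if the coordinates are a bit off
        if out and col <= len(out):
            col = len(out) + 1
        out = out.ljust(col) + text
    return out


# Three words on a line scanned at 10 units per character
line = rebuild_line([(0, "LOOP"), (80, "LDA"), (160, "COUNT")], char_width=10)
print(repr(line))  # 'LOOP    LDA     COUNT'
```

The character pitch would have to be measured from the scan itself, for instance from the width of a long unbroken word.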
Anyone have any good experiences trying to do this? I've attached a PDF scan if you
have a way to do a test run.
Thanks,
Marc Howard