On May 12, 2023, at 10:03 PM, trash80--- via cctalk
<cctalk(a)classiccmp.org> wrote:
> An issue dear to my heart - I have quite a quantity of documentation here to scan so I
> did quite a bit of homework and testing on this. It may not exactly be your issue, but I
> hope it helps.
> For me, the issue was the quality of the documentation - print back in those days was
> quite variable of course and over time documents may have deteriorated.
> I find Adobe Acrobat the best tool using the ClearScan method for OCR.
Admittedly it's been quite a while, but years ago Adobe offered, for a time, a
free OCR plugin for Adobe Acrobat (full edition). I tried it on one or two documents --
for example, a high-quality scan of the Ethernet V2 spec (DIX spec).
It sort of worked, but the results were very bad. No training capability, and the editing
features were even worse than the already pathetically bad PDF editing features of
Acrobat.
I also used it on a scan of the A10-A flight manual. Same sort of outcome: it sort of
worked, but the quality was really poor.
After that experience I tried Tesseract, which at the time wasn't ready yet. (That
was before the current neural net version.) I ended up buying ABBYY FineReader, which was
much better, particularly because it has a good-quality training mechanism. I've
still encountered material so bad that it isn't usable, but it can handle a lot of
stuff, including line printer listings, well enough.
One of these days I should try the new Tesseract.
paul