On Dec 31, 2018, at 7:13 PM, dwight via cctalk
<cctalk at classiccmp.org> wrote:
Fred is right, OCR is only worth it if the document is in perfect condition. I just
finished getting an old 4004 listing working. I made only two mistakes in the 4K of code
that were not the fault of the poor quality of the listing. Twice I put LDM instead of LD. LDM
was the most commonly used.
I wouldn't put it quite so strongly. OCR, even if not perfect, can help a lot. You can
often OCR + test assemble + proofread faster than you can retype, especially since retyping
also requires fixing typos and proofreading. Many OCR errors are caught by the assembler, though
not all of them of course. I've done both in an ongoing software preservation
project; my conclusion still is to use OCR when it works "well enough". A
couple of errors per page is definitely "well enough".
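
To illustrate the kind of error an assembler-style check catches and the kind it
misses, here is a minimal Python sketch. It is not the tool either of us used. The
mnemonic set is a small illustrative subset of the 4004 instructions, and the
function name, the ';' comment convention and the "mnemonic first, no labels" line
layout are all assumptions made for the example.

```python
# Illustrative subset of Intel 4004 mnemonics; a real check would use the
# full instruction set of the target assembler (assumption for this sketch).
KNOWN_MNEMONICS = {
    "LD", "LDM", "XCH", "ADD", "SUB", "INC", "FIM",
    "SRC", "JUN", "JMS", "JCN", "ISZ", "BBL", "NOP",
}


def suspicious_lines(listing):
    """Return (line_number, text) pairs whose first token is not a known mnemonic.

    Assumes one instruction per line, mnemonic first, no labels,
    and comments introduced by ';'.
    """
    flagged = []
    for number, line in enumerate(listing.splitlines(), start=1):
        code = line.split(";", 1)[0].strip()  # drop any trailing comment
        if not code:
            continue  # blank or comment-only line
        mnemonic = code.split()[0].upper()
        if mnemonic not in KNOWN_MNEMONICS:
            flagged.append((number, line))
    return flagged


# "L0M" (zero misread for D) is flagged; "LD" typed where "LDM" was meant is not,
# because both are valid mnemonics.
sample = "LDM 5\nXCH 2\nL0M 3\nLD 2 ; should have been LDM\n"
print(suspicious_lines(sample))
```

The LD/LDM substitution passes because both are legal instructions, which is
why proofreading is still needed after a clean assembly.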
The program used matters. I looked at Tesseract a bit but its quality was vastly inferior
to commercial products in the examples I tried. I now use ABBYY FineReader, which handles
a lot of line printer and typewriter material quite well.
paul