On Jan 23, 2022, at 12:09 PM, Gavin Scott <gavin at learn.bio> wrote:
On Sun, Jan 23, 2022 at 9:11 AM Paul Koning via cctalk <cctalk at classiccmp.org> wrote:
One consideration is the effort required to
repair transcription errors. Those that produce syntax errors aren't such an issue;
those that pass the assembler or compiler but result in bugs (say, a mistyped register
number) are harder to find.
You can always have it "turked" twice and compare the results.
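Something like this would do for the comparison step (a minimal Python
sketch; the file names are made up):

    # Compare two independent keyings of the same listing and flag
    # every disagreement for manual review.  The same typo rarely
    # lands in the same place in both passes, so any mismatch is
    # worth a human look.
    import itertools

    def compare_transcriptions(path_a, path_b):
        with open(path_a) as fa, open(path_b) as fb:
            pairs = itertools.zip_longest(fa, fb, fillvalue="")
            for num, (a, b) in enumerate(pairs, start=1):
                if a.rstrip("\n") != b.rstrip("\n"):
                    print(f"line {num}:")
                    print(f"  A: {a.rstrip()}")
                    print(f"  B: {b.rstrip()}")

    compare_transcriptions("listing_keyed_a.txt", "listing_keyed_b.txt")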
This is also the sort of problem that modern Deep Machine Learning
will just crush. Identifying individual characters should be trivial;
you just have to figure out where the characters are first, which
could also be done with ML, or you could try to do it some other way
(with a really well-registered scan, maybe, if it's all fixed-width
characters).
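With a fixed grid the slicing is just arithmetic. A minimal Python
sketch of that part, using Pillow; every dimension below is an
assumption you'd measure from the real scan:

    # Slice a well-registered scan of fixed-pitch print into
    # per-character images, ready to feed to whatever classifier
    # you train.  All dimensions here are hypothetical.
    from PIL import Image

    CELL_W, CELL_H = 12, 20       # pixels per character cell (assumed)
    ORIGIN_X, ORIGIN_Y = 40, 60   # top-left of the text block (assumed)
    COLS, ROWS = 132, 66          # a 132-column, 66-line printer page

    def character_cells(path):
        page = Image.open(path).convert("L")   # grayscale
        for row in range(ROWS):
            for col in range(COLS):
                x = ORIGIN_X + col * CELL_W
                y = ORIGIN_Y + row * CELL_H
                yield row, col, page.crop((x, y, x + CELL_W, y + CELL_H))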
Maybe. But OCR programs have had learning features for decades. I've spent quite a
lot of time in FineReader learning mode. Material produced on a moderate-quality
typewriter, like the CDC 6600 wire lists on Bitsavers, can be handled tolerably
well, especially with post-processing that knows what the text patterns should be
and maps common misreadings back to them (there's a sketch of that idea below,
after this paragraph). But the listings I mentioned before were
entirely unmanageable even after a lot of "learning mode" effort. An annoying
wrinkle was that I wasn't dealing with greenbar but rather with Dutch line printer
paper that has every other line marked with 5 thin horizontal lines, almost like music
score paper. Faded printout with a worn ribbon on a substrate like that is a challenge
even for human eyeballs, and all the "machine learning" hype can't conceal
the fact that no machine comes anywhere close to a human at image recognition
under tough conditions.
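To give a concrete picture of that post-processing: if a field is known to
be, say, an octal address column, the usual confusions can be mapped back
mechanically. A minimal Python sketch; the confusion table and column layout
are illustrative assumptions, not any particular listing format:

    # Pattern-aware correction: in a column that must be octal,
    # O/o/l/I/S/B can only be misread digits, so map them back.
    import re

    CONFUSIONS = str.maketrans({"O": "0", "o": "0", "l": "1",
                                "I": "1", "S": "5", "B": "8"})

    ADDRESS_FIELD = re.compile(r"^(\S{6})\s")  # assumed 6-char address column

    def correct_line(line):
        m = ADDRESS_FIELD.match(line)
        if m:
            line = m.group(1).translate(CONFUSIONS) + line[6:]
        return line

    print(correct_line("O012S7  LDA  BUFFER"))   # -> 001257  LDA  BUFFER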
That said, if you have access to a particularly good OCR, it can't hurt to spend a few
hours trying to make it cope with the source material in question. But be prepared for
disappointment.
paul