Typing in lost code

Paul Koning paulkoning at comcast.net
Sun Jan 23 12:16:34 CST 2022



> On Jan 23, 2022, at 12:09 PM, Gavin Scott <gavin at learn.bio> wrote:
> 
> On Sun, Jan 23, 2022 at 9:11 AM Paul Koning via cctalk
> <cctalk at classiccmp.org> wrote:
>> One consideration is the effort required to repair transcription errors.  Those that produce syntax errors aren't such an issue;
>> those that pass the assembler or compiler but result in bugs (say, a mistyped register number) are harder to find.
> 
> You can always have it "turked" twice and compare the results.
> 
> This is also the sort of problem that modern Deep Machine Learning
> will just crush. Identifying individual characters should be trivial,
> you just have to figure out where the characters are first which could
> also be done with ML or you could try to do it some other way (with a
> really well registered scan maybe if it's all fixed-width characters).

Maybe.  But OCR programs have had learning features for decades, and I've spent quite a lot of time in FineReader's learning mode.  Material produced on a moderate-quality typewriter, like the CDC 6600 wire lists on Bitsavers, can be handled tolerably well, especially with post-processing that knows what the text patterns should be and corrects common misreadings.  But the listings I mentioned before were entirely unmanageable even after a lot of "learning mode" effort.

An annoying wrinkle was that I wasn't dealing with greenbar but rather with Dutch line printer paper that has every other line marked with 5 thin horizontal lines, almost like music score paper.  Faded printout from a worn ribbon on a substrate like that is a challenge even for human eyeballs, and all the "machine learning" hype can't conceal the fact that no machine comes anywhere close to a human at image recognition under tough conditions.
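
To make the pattern-aware post-processing mentioned above concrete, here is a minimal sketch in Python.  It assumes a field that is known to be octal digits only; the confusion table and the name fix_octal_field are hypothetical, for illustration, and are not taken from FineReader or any other OCR package.  A token that still fails the pattern after correction is flagged for a human rather than guessed at.

```python
import re

# Hypothetical confusion table: letters an OCR engine might emit where a
# digit was printed.  Tune it to the actual source material.
DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
)

# Assumption: the field being checked contains octal digits only.
OCTAL_FIELD = re.compile(r"[0-7]+")


def fix_octal_field(token):
    """Return (candidate, ok) for one OCR'd token.

    candidate is the token with look-alike letters mapped to digits;
    ok is False when the result still is not a valid octal field and
    therefore needs a human look.
    """
    candidate = token.translate(DIGIT_LOOKALIKES)
    return candidate, bool(OCTAL_FIELD.fullmatch(candidate))


if __name__ == "__main__":
    for raw in ["O17l", "1B3", "12X"]:
        print(raw, fix_octal_field(raw))
```

This only helps where the expected pattern is strict.  A misread that turns one valid digit into another valid digit passes straight through, which is the same class of error as the mistyped register number.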

That said, if you have access to a particularly good OCR program, it can't hurt to spend a few hours trying to make it cope with the source material in question.  But be prepared for disappointment.
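
For the "typed twice and compare" suggestion quoted above, the comparison step might look like the following sketch, which uses Python's standard difflib module.  The function name disagreements and the sample listing lines are made up for illustration.  It reports only the places where the two transcriptions differ, so a proofreader can check those against the original printout.

```python
import difflib


def disagreements(first, second):
    """List the places where two transcriptions of one listing differ.

    Each entry is (line number in the first transcription, lines from
    the first, lines from the second).  Line numbers start at 1.
    """
    a = first.splitlines()
    b = second.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            result.append((i1 + 1, a[i1:i2], b[j1:j2]))
    return result


if __name__ == "__main__":
    # Made-up sample lines: the second typist misread a register number.
    typist_a = "MOV R1,R2\nADD R3,R4\nHALT"
    typist_b = "MOV R1,R2\nADD R3,R5\nHALT"
    for line_no, from_a, from_b in disagreements(typist_a, typist_b):
        print(line_no, from_a, from_b)
```

Note that this catches only errors the two typists did not share.  If the printout is bad enough that both misread the same character the same way, the comparison shows nothing.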

	paul



