Typing in lost code
Paul Koning
paulkoning at comcast.net
Sun Jan 23 12:16:34 CST 2022
> On Jan 23, 2022, at 12:09 PM, Gavin Scott <gavin at learn.bio> wrote:
>
> On Sun, Jan 23, 2022 at 9:11 AM Paul Koning via cctalk
> <cctalk at classiccmp.org> wrote:
>> One consideration is the effort required to repair transcription errors. Those that produce syntax errors aren't such an issue;
>> those that pass the assembler or compiler but result in bugs (say, a mistyped register number) are harder to find.
>
> You can always have it "turked" twice and compare the results.
>
> This is also the sort of problem that modern Deep Machine Learning
> will just crush. Identifying individual characters should be trivial,
> you just have to figure out where the characters are first which could
> also be done with ML or you could try to do it some other way (with a
> really well registered scan maybe if it's all fixed-width characters).
Maybe. But OCR programs have had learning features for decades, and I've spent quite a lot of time in FineReader's learning mode. Material produced on a moderate-quality typewriter, like the CDC 6600 wire lists on Bitsavers, can be handled tolerably well, especially with post-processing that knows what the text patterns should be and corrects common misreadings.
But the listings I mentioned before were entirely unmanageable even after a lot of "learning mode" effort. An annoying wrinkle was that I wasn't dealing with greenbar but with Dutch line printer paper that has every other line marked with 5 thin horizontal lines, almost like music score paper. Faded printout from a worn ribbon on a substrate like that is a challenge even for human eyeballs, and all the "machine learning" hype can't conceal the fact that no machine comes anywhere close to a human at image recognition under tough conditions.
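To illustrate the kind of pattern-aware post-processing mentioned above, here is a minimal Python sketch. Everything specific in it is an assumption for illustration: the line layout (a 4-digit octal address, a 6-digit octal word, then free text), the confusion table, and the function names. It is not the procedure used on the wire lists. The idea is only that a field known to be octal can have look-alike letters mapped back to digits, and anything that still doesn't fit gets flagged for a human.

```python
import re

# Hypothetical confusion table: letters an OCR might emit in place of digits.
DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})

# Assumed layout: 4-character address, 6-character word, then free text.
LINE_RE = re.compile(r"(\S{4})\s+(\S{6})\s+(.*)")
OCTAL_RE = re.compile(r"[0-7]+")


def fix_octal_field(field):
    """Map look-alike letters to digits; return None if still not octal."""
    fixed = field.translate(DIGIT_FIXES)
    return fixed if OCTAL_RE.fullmatch(fixed) else None


def postprocess_line(line):
    """Return (line, needs_review). Lines that don't fit are left unchanged."""
    m = LINE_RE.fullmatch(line)
    if m is None:
        return line, True
    addr, word, rest = m.groups()
    fixed_addr = fix_octal_field(addr)
    fixed_word = fix_octal_field(word)
    if fixed_addr is None or fixed_word is None:
        return line, True
    return f"{fixed_addr} {fixed_word} {rest}", False
```

Note that this only repairs errors that violate the expected pattern. A misread digit that is still a valid octal digit passes straight through, which is exactly the hard case.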
That said, if you have access to a particularly good OCR program, it can't hurt to spend a few hours trying to make it cope with the source material in question. But be prepared for disappointment.
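As for the suggestion quoted earlier of having the listing typed in twice and comparing the results, the comparison step is easy to sketch with Python's standard difflib module. The function name and the sample input are made up for illustration. Keep in mind that this catches only errors the two typists did not both make.

```python
import difflib


def transcription_diffs(first, second):
    """Return (line_number_in_first, lines_a, lines_b) where the texts differ."""
    a = first.splitlines()
    b = second.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Line numbers are 1-based to match what a text editor shows.
            diffs.append((i1 + 1, a[i1:i2], b[j1:j2]))
    return diffs
```

Each reported spot still has to be checked against the paper by eye, since the comparison cannot tell which of the two versions is right.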
paul