On Sun, Jan 23, 2022 at 9:11 AM Paul Koning via cctalk
<cctalk at classiccmp.org> wrote:
> One consideration is the effort required to repair
> transcription errors. Those that produce syntax errors aren't such an
> issue; those that pass the assembler or compiler but result in bugs
> (say, a mistyped register number) are harder to find.
You can always have it "turked" (crowd-transcribed, Mechanical Turk
style) twice and compare the results.
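The comparison step is easy to script, for what it's worth. A minimal
sketch in Python, assuming two plain-text transcriptions of the same
listing (the file names are made-up stand-ins):

import difflib

# Two independent transcriptions of the same listing; the file names
# here are hypothetical.
with open("transcription_a.txt") as fa, open("transcription_b.txt") as fb:
    a_lines = fa.read().splitlines()
    b_lines = fb.read().splitlines()

# Independent transcribers rarely make the same mistake in the same
# place, so every disagreement marks a line worth a second look.
for num, (a, b) in enumerate(zip(a_lines, b_lines), start=1):
    if a != b:
        print(f"line {num}:")
        for d in difflib.ndiff([a], [b]):
            print("  " + d)

If the two copies can drift out of line-sync (a dropped or doubled
line), difflib.unified_diff over the whole files is more forgiving
than the zip above.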
This is also the sort of problem that modern Deep Machine Learning
will just crush. Identifying individual characters should be trivial;
the harder part is figuring out where the characters are in the first
place. That segmentation could also be done with ML, or some other way
entirely: with a really well-registered scan of an all-fixed-width
listing, you can just slice the page on a grid.
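For the fixed-width case the slicing really can be that dumb. A rough
sketch, assuming Pillow and NumPy; the cell geometry and page size are
made-up numbers you would measure from your own scans:

from PIL import Image
import numpy as np

CELL_W, CELL_H = 12, 20      # pixels per character cell (hypothetical)
ORIGIN_X, ORIGIN_Y = 40, 60  # top-left of the text area (hypothetical)
COLS, ROWS = 132, 66         # a 132-column greenbar page

page = np.asarray(Image.open("page.png").convert("L"))  # grayscale scan

# Carve the page into one small image per character cell.
cells = []
for r in range(ROWS):
    for c in range(COLS):
        y = ORIGIN_Y + r * CELL_H
        x = ORIGIN_X + c * CELL_W
        cells.append(page[y:y + CELL_H, x:x + CELL_W])

Each entry in cells is then a single-character image you can hand to a
classifier, which is the easy part.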
I think if I had a whole lot of old faded greenbar etc. I would
consider manually converting a few pages, then setting up a Kaggle
competition for it, and maybe investing a bit of money as a prize.
Someone may even have done this already (there have certainly been a
number of "OCR historical documents" competitions), but I didn't spend
much time searching. I'm sure you're not the only one who has had this
problem to solve.