Typing in lost code

ben bfranchuk at jetnet.ab.ca
Mon Jan 24 16:57:35 CST 2022

On 2022-01-23 12:47 p.m., Chuck Guzis via cctalk wrote:
> On 1/23/22 10:16, Paul Koning via cctalk wrote:
>> Maybe.  But OCR programs have had learning features for decades.  I've spent quite a lot of time in FineReader learning mode.  Material produced on a moderate-quality typewriter, like the CDC 6600 wire lists on Bitsavers, can be handled tolerably well.  Especially with post-processing that knows what the text patterns should be and converts common misreadings to what they should be.  But the listings I mentioned before were entirely unmanageable even after a lot of "learning mode" effort.  An annoying wrinkle was that I wasn't dealing with greenbar but rather with Dutch line printer paper that has every other line marked with 5 thin horizontal lines, almost like music score paper.  Faded printout with a worn ribbon on a substrate like that is a challenge even for human eyeballs, and all the "machine learning" hype can't conceal the fact that no machine can come anywhere close to a human for dealing with image recognition under tough conditions.
> The problem is that OCR needs to be 100% accurate for many purposes.
> Much short of that requires that the result be inspected by hand
> line-by-line with the knowledge of what makes sense.   Mistaking a
> single fuzzy 8 for a 6 or a 3, for example can render code inoperative
> with a very difficult to locate bug.   Perhaps an AI might be programmed
> to separate out the nonsense typos.
> Old high-speed line printers weren't always wonderful with timing the
> hammer strikes.  I recall some nearly impossible to read Univac 1108
> engineering documents, printed on a drum printer.  Gave me headaches.
> At least that's my take.
> --Chuck
The document source is also a problem.
You would want to keep the scan in the best data format,
not something in a lossy format.
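
On the post-processing idea mentioned above (software that knows what the
text patterns should be), here is a rough sketch of what that might look
like for an octal dump. Everything in it is an assumption for illustration:
the confusion table is made up rather than measured, and the name
check_octal_line is hypothetical, not taken from FineReader or any other
OCR tool.

```python
import re

# Hypothetical confusion table: letters an OCR pass might emit in place
# of digits. The entries are illustrative, not measured from real scans.
CONFUSIONS = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})

# A valid token in an octal dump contains only the digits 0 through 7.
OCTAL_WORD = re.compile(r"^[0-7]+$")


def check_octal_line(line):
    """Return (fixed_tokens, suspects) for one line of an octal dump.

    fixed_tokens: the tokens after applying the confusion table.
    suspects: indexes of tokens that are still not valid octal and
    need a human to compare them against the scan.
    """
    fixed = [tok.translate(CONFUSIONS) for tok in line.split()]
    suspects = [i for i, tok in enumerate(fixed) if not OCTAL_WORD.match(tok)]
    return fixed, suspects


if __name__ == "__main__":
    fixed, suspects = check_octal_line("0O1777 l23456 017780")
    print(fixed, suspects)
```

This only catches characters that cannot occur in the expected pattern,
such as an 8 in an octal field. A 6 misread as a 3 is still valid octal
and passes unnoticed, so the line-by-line inspection is still needed.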
