On 1/23/22 10:16, Paul Koning via cctalk wrote:
Maybe. But OCR programs have had learning
features for decades. I've spent quite a lot of time in FineReader learning mode.
Material produced on a moderate-quality typewriter, like the CDC 6600 wire lists on
Bitsavers, can be handled tolerably well. Especially with post-processing that knows what
the text patterns should be and converts common misreadings to what they should be. But
the listings I mentioned before were entirely unmanageable even after a lot of
"learning mode" effort. An annoying wrinkle was that I wasn't dealing with
greenbar but rather with Dutch line printer paper that has every other line marked with 5
thin horizontal lines, almost like music score paper. Faded printout with a worn ribbon
on a substrate like that is a challenge even for human eyeballs, and all the "machine
learning" hype can't conceal the fact that no machine can come anywhere close to
a human for dealing with image recognition under tough conditions.
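As a rough illustration of the post-processing idea mentioned above (code that knows what the text pattern should be and converts common misreadings), here is a minimal sketch. It assumes the fields on a line are whitespace-separated octal words. The confusion table and the function names are made up for this example and are not taken from FineReader or any other OCR package.

```python
import re

# Hypothetical confusion table: letters an OCR engine might emit in place of digits.
CONFUSIONS = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1",
                            "S": "5", "B": "8", "Z": "2"})

# Assumed pattern: every field is an octal word.
OCTAL_WORD = re.compile(r"^[0-7]+$")

def fix_octal_field(token):
    """Apply the confusion table; report whether the result is valid octal."""
    candidate = token.translate(CONFUSIONS)
    return candidate, bool(OCTAL_WORD.match(candidate))

def postprocess_line(line):
    """Return the corrected line plus the tokens that still need a human look."""
    fixed, suspects = [], []
    for tok in line.split():
        candidate, ok = fix_octal_field(tok)
        fixed.append(candidate if ok else tok)  # leave unfixable tokens untouched
        if not ok:
            suspects.append(tok)
    return " ".join(fixed), suspects

print(postprocess_line("0l7 12O4 7B1"))
```

A pass like this only catches misreadings that fall outside the expected alphabet. A digit misread as another valid digit goes through unflagged, so it does not replace proofreading.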
The problem is that OCR needs to be 100% accurate for many purposes.
Much short of that requires that the result be inspected by hand
line-by-line with the knowledge of what makes sense. Mistaking a
single fuzzy 8 for a 6 or a 3, for example, can render code inoperative
with a very difficult-to-locate bug. Perhaps an AI might be programmed
to separate out the nonsense typos.
Old high-speed line printers weren't always wonderful with timing the
hammer strikes. I recall some nearly impossible to read Univac 1108
engineering documents, printed on a drum printer. Gave me headaches.
At least that's my take.
--Chuck
Document source is also a problem.
You would want to keep the scan in the best data format,
not something in a lossy format.
Ben.