On Sep 26, 2015, at 5:42 PM, Toby Thain <toby at telegraphics.com.au> wrote:
...
> Software which "recreates" the typography of a document from OCR does not
> produce an acceptable substitute, I've yet to see a book that wasn't ruined by it.
True. But that's not the biggest problem with OCR. The biggest problem is that even
professional-grade OCR programs have rather low accuracy. Maybe they do acceptably well
on high-quality scans of very clean new documents, but on books, typewritten
documents, etc., even after you use the "train" feature you need to spend a long
time cleaning up. If you're lucky, it may be faster than retyping things. If you're
not, it isn't: two of us recently retyped 300 pages of line printer listing because that
was faster and more accurate than OCR on that particular printout.
Given that OCR can only do, at best, a just barely acceptable recognition of the letters
of the alphabet, it follows that recognizing the actual font used will be
vastly less accurate still. And indeed you can see that clearly.
I wonder if there are OCR programs that can be told to choose among 2 or 3 fonts, as
opposed to guessing from the entire inventory of the machine. If so, and if those fonts
are sufficiently distinct, then maybe you'd stand a chance. Especially if it also added
heuristics like "never change fonts in mid-word" -- an obvious rule but not one
I have seen implemented.
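To make the idea concrete, here is a minimal sketch of those two heuristics in Python. It is not taken from any real OCR program: the function name, the input layout (one dict of font scores per recognized character) and the font names are all assumptions for illustration. It restricts the choice to a short list of candidate fonts and then picks a single font per word by summing the per-character scores, so the font can never change in mid-word.

```python
def assign_word_fonts(words, candidate_fonts):
    """Pick exactly one font per word from a short list of candidates.

    words: list of words; each word is a list of per-character dicts
           mapping a font name to a confidence score (hypothetical
           output of some per-character font classifier).
    candidate_fonts: the 2 or 3 fonts the user says the document uses.
    """
    chosen = []
    for char_scores in words:
        # Sum the evidence for each candidate font over the whole word.
        totals = {font: 0.0 for font in candidate_fonts}
        for scores in char_scores:
            for font in candidate_fonts:
                # Fonts outside the candidate list are never looked at.
                totals[font] += scores.get(font, 0.0)
        # One decision per word: the font cannot change in mid-word.
        chosen.append(max(candidate_fonts, key=lambda f: totals[f]))
    return chosen


# Made-up scores for two words. The second character of the first word
# looks more like Times on its own, but the word as a whole is Courier.
words = [
    [
        {"Courier": 0.9, "Times": 0.1},
        {"Courier": 0.4, "Times": 0.6, "Helvetica": 0.95},
        {"Courier": 0.8, "Times": 0.2},
    ],
    [
        {"Courier": 0.3, "Times": 0.7},
        {"Courier": 0.45, "Times": 0.55},
    ],
]
fonts = assign_word_fonts(words, ["Courier", "Times"])
print(fonts)
```

Voting per word means a single badly scored character gets outvoted by its neighbors instead of producing a one-letter font change.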
paul