On 2015-09-27 2:33 PM, Fred Cisin wrote:
On Sun, 27 Sep 2015, Pontus Pihlgren wrote:
It seems to me that a better tool could solve the issue: one that
could display only the OCR'd content, and the scanned content
only when desired, for instance when you suspect an error.
Is there such a reader? Is the content organised to make it
possible?
I haven't seen one.
I did start trying to write a heuristic, probabilistic OCR one 25 years
ago. The idea was to overlay the OCR'd text (displayed in matching
fonts) on the scanned content. Besides visual confirmation and
indication of probability of accuracy with each character, it lends
itself well to hiring neighborhood kids to type in just the "wrong"
characters to clean up the OCR'd file, and heuristically tune the font
database, including adding new fonts - EVERY character is "wrong" until
it repeats a few times in the document. ("clean up" a NYT article, and
the OCR now has their font).
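A rough sketch of that review step, in Python. Everything named here (OcrChar, chars_needing_review, the threshold and repeat count) is made up for illustration, and it counts repeats by the guessed character rather than by matching glyph shapes, which a real tool would have to do:

```python
from dataclasses import dataclass

@dataclass
class OcrChar:
    char: str          # the OCR engine's best guess (hypothetical structure)
    confidence: float  # assumed probability in [0, 1] that the guess is right

def chars_needing_review(chars, threshold=0.9, min_repeats=3):
    """Return the indices of characters a human should check.

    A character is flagged if its confidence is below `threshold`, or if
    its guessed value has not yet appeared `min_repeats` times, i.e. every
    character counts as "wrong" until it has repeated a few times.
    """
    seen = {}      # guessed character -> occurrences so far
    flagged = []
    for i, c in enumerate(chars):
        seen[c.char] = seen.get(c.char, 0) + 1
        if c.confidence < threshold or seen[c.char] < min_repeats:
            flagged.append(i)
    return flagged

page = [OcrChar("a", 0.99)] * 4 + [OcrChar("b", 0.5)]
review = chars_needing_review(page)
```

Only the flagged positions would be handed to the typist; the corrections would then feed back into the font database.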
DJVU compression is somewhat analogous to this process, because,
font-like, it builds a set of master glyphs then uses them as a
compression dictionary (if everyone will forgive my simplistic
explanation). Being lossy, like OCR, it inherently adds the risk of
picking the wrong (but visually almost indistinguishable) glyph -- the
WORST kind of typo for being so insidious.
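To make the risk concrete, here is a toy glyph-dictionary matcher in Python. It is only a sketch of the general idea, not DjVu's actual encoder: the 3x5 bitmaps, the Hamming-distance test, and the max_diff tolerance are all assumptions chosen for illustration.

```python
def hamming(a, b):
    """Number of pixels that differ between two equal-sized bitmaps."""
    return sum(x != y for x, y in zip(a, b))

def compress(glyphs, max_diff):
    """Replace each glyph with an index into a dictionary of master glyphs.

    A glyph reuses the first master that differs by at most `max_diff`
    pixels; otherwise it becomes a new master itself.
    """
    dictionary = []
    indices = []
    for g in glyphs:
        for k, master in enumerate(dictionary):
            if len(master) == len(g) and hamming(master, g) <= max_diff:
                indices.append(k)
                break
        else:
            dictionary.append(g)
            indices.append(len(dictionary) - 1)
    return dictionary, indices

# Toy 3x5 bitmaps, flattened row by row; they differ by a single pixel.
six = "".join(["###", "#..", "###", "#.#", "###"])
eight = "".join(["###", "#.#", "###", "#.#", "###"])

lossless_dict, lossless_idx = compress([six, eight], max_diff=0)
lossy_dict, lossy_idx = compress([six, eight], max_diff=1)
```

With a tolerance of zero the two digits stay distinct. With a tolerance of one pixel the 8 is stored as a reference to the 6, so the decoded page shows two 6s and nothing in the file records that a substitution happened.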
There was a somewhat scary case study on the web a few years ago (not
sure if it's still out there, haven't been able to find it) where the
JBIG2 compression in a Xerox copier (a glyph-matching scheme much like
DJVU's) was quietly changing digits on scanned schematics to different
digits. Close enough for the compressor -- but wrong. The risks are
obvious(*).
--Toby
* - Hat tip to PGN. comp.risks digest.