If you OCR, always archive the bitmaps too - Re: Regarding Manuals

Fred Cisin cisin at xenosoft.com
Sun Sep 27 13:33:19 CDT 2015


On Sun, 27 Sep 2015, Pontus Pihlgren wrote:
> It seems to me that a better tool could solve the issue. One that
> could display the OCR:ed content only and the scanned content
> only when desired, for instance when you suspect an error.
> Is there such a reader? Is the content organised to make it
> possible?

I haven't seen one.


I did start trying to write a heuristic, probabilistic OCR reader 25 years
ago.  The idea was to overlay the OCR'd text (displayed in matching fonts)
on the scanned content.  Besides giving visual confirmation and an
indication of the probability of accuracy for each character, it lends
itself well to hiring neighborhood kids to type in just the "wrong"
characters to clean up the OCR'd file, and to heuristically tuning the font
database, including adding new fonts: EVERY character is "wrong" until it
repeats a few times in the document.  ("Clean up" one NYT article, and the
OCR now has their font.)
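
For illustration only, here is a rough Python sketch of that review loop. It is not the original program. The names (Glyph, shape_id, flag_for_review, apply_corrections, font_db) and the thresholds are all made up for the example, and it assumes some OCR engine has already produced a best-guess character, a confidence, and a shape key for each glyph.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str          # the OCR engine's best guess
    confidence: float  # 0.0 .. 1.0, as reported by the (hypothetical) engine
    shape_id: str      # hypothetical key for a cluster of similar bitmaps


def flag_for_review(glyphs, min_repeats=3, min_confidence=0.9):
    """Return the indices of glyphs a human should look at.

    A glyph is treated as "wrong" if the engine is unsure of it, or if
    its (shape, character) pairing has not yet repeated min_repeats
    times in this document.
    """
    seen = Counter((g.shape_id, g.char) for g in glyphs)
    return [
        i for i, g in enumerate(glyphs)
        if g.confidence < min_confidence
        or seen[(g.shape_id, g.char)] < min_repeats
    ]


def apply_corrections(glyphs, corrections, font_db):
    """Apply human corrections ({index: char}) and record them in font_db.

    font_db maps shape_id -> character, so the next document set in the
    same font starts with these shapes already known.
    """
    for i, ch in corrections.items():
        glyphs[i].char = ch
        glyphs[i].confidence = 1.0
        font_db[glyphs[i].shape_id] = ch
    return "".join(g.char for g in glyphs)
```

Only the flagged indices need to be shown to the typist, overlaid on the bitmap; everything else passes through untouched.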
