If you OCR, always archive the bitmaps too - Re: Regarding Manuals

27 Sep 2015

On Sun, 27 Sep 2015, Pontus Pihlgren wrote:
...
 It seems to me that a better tool could solve the issue. One that
 could display the OCR:ed content only and the scanned content
 only when desired, for instance when you suspect an error.
 Is there such a reader? Is the content organised to make it
 possible?
I haven't seen one.
I did start trying to write a heuristic, probabilistic OCR program 25 years
ago.  The idea was to overlay the OCR'd text (displayed in matching fonts)
on the scanned content.  Besides giving visual confirmation and an indication
of the probability of accuracy for each character, it lends itself well to
hiring neighborhood kids to type in just the "wrong" characters to clean
up the OCR'd file, and to heuristically tuning the font database, including
adding new fonts: EVERY character is "wrong" until it repeats a few times
in the document.  ("Clean up" a NYT article, and the OCR now has their
font.)
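
The "wrong until it repeats" rule can be sketched roughly as below. This is
only an illustration in Python, not the original program: the names (Glyph,
shape_id, review_queue) and both thresholds are made up, and it assumes some
earlier step has already grouped similar character bitmaps under a shape_id
and produced a confidence value for each guess.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    shape_id: str      # hypothetical key for a cluster of similar bitmaps
    guess: str         # the OCR's best guess for this character
    confidence: float  # estimated probability (0..1) that the guess is right


def review_queue(glyphs, min_conf=0.9, trust_after=3):
    """Return the positions of characters a human should check.

    A character is flagged if its confidence is below min_conf, or if its
    (shape, guess) pair has not yet been seen trust_after times in the
    document. Both thresholds are assumed values.
    """
    seen = Counter()
    flagged = []
    for i, g in enumerate(glyphs):
        key = (g.shape_id, g.guess)
        seen[key] += 1
        if g.confidence < min_conf or seen[key] < trust_after:
            flagged.append(i)
    return flagged


# Four confident copies of the same shape, then one doubtful character.
page = [Glyph("s1", "e", 0.95)] * 4 + [Glyph("s2", "l", 0.5)]
flagged = review_queue(page)
```

Here the first two occurrences of shape "s1" are flagged because the shape
has not repeated enough yet, the later ones are trusted, and the doubtful
"l" is flagged regardless. A viewer could then show the scanned bitmap for
just the flagged positions.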

