Richard wrote:
> In article <456EEA43.7050601 at yahoo.co.uk>,
> Jules Richardson <julesrichardsonuk at yahoo.co.uk> writes:
>> [...] it's easy
>> to make sure that the page was scanned straight etc., but easy to miss things
>> which might hinder some future OCR process.
> To be honest, almost every time I have tried to OCR something (even a
> pristine original), it was simply faster and more accurate to type it
> in myself. I don't know why, but I have been singularly unimpressed
> with OCR software. Obviously lots of people do OCR, but the amount of
> rework and editing necessary to get high accuracy is just as much work
> as typing it in yourself for someone like me who is a fast touch
> typist.
Oh, I agree. Twenty years down the line I expect it'll be a lot better, but by
then the original paper copies of some of the material out there might be
long gone - hence my concern about improving the quality of some scans.
I suppose a vague rule of thumb might be that if it's not readable by a human
then it's never going to be readable via OCR :-) The thing is, to maximise the
chances, every single letter in every single scan would have to be proof-read
for legibility - which is obviously unrealistic.
Hence my feeling that bi-level just isn't good enough for some docs, because
it won't necessarily discriminate between real text and a hair / dirt / pen
mark where greyscale *might*. It's not infallible either, of course - a blue
biro mark might be indistinguishable from the faded text beneath it after
scanning; give it five years and I'll probably be advocating full-colour scans :-)
cheers
Jules