Richard wrote:
> In article <456EEA43.7050601 at yahoo.co.uk>,
> Jules Richardson <julesrichardsonuk at yahoo.co.uk> writes:
>> [...] it's easy
>> to make sure that the page was scanned straight etc., but easy to miss things
>> which might hinder some future OCR process.
> To be honest, almost every time I have tried to OCR something (even a
> pristine original), it was simply faster and more accurate to type it
> in myself. I don't know why, but I have been singularly unimpressed
> with OCR software. Obviously lots of people do OCR, but the amount of
> rework and editing necessary to get high accuracy is just as much work
> as typing it in yourself for someone like me who is a fast touch
> typist.
Oh, I agree. Twenty years down the line I expect it'll be a lot better, but by
then the original paper copies of some of the material out there might be
long gone - hence my concern about improving the quality of some scans.
I suppose a vague rule of thumb might be that if it's not readable by a human
then it's never going to be readable via OCR :-) The thing is, to maximise the
chances, every single letter in every single scan would have to be proof-read
for legibility - which is obviously unrealistic.
Hence my feeling that bi-level just isn't good enough for some docs, because
it won't necessarily discriminate between real text and a hair / dirt / pen
mark where greyscale *might*. It's not infallible either, of course - a blue
biro mark might be indistinguishable from the faded text beneath it after
scanning; give it five years and I'll probably be advocating full-colour scans :-)
cheers
Jules