On Mon, 2004-06-28 at 14:33, Paul Koning wrote:
The trick is to look for a setting of the black vs. white threshold
that results in minimal stray pixels on the page, but not so high
that the letters lose their shape. This is a compromise -- the edge
of a printed letter is not really sharp in a scan, so as you raise the
threshold some of the outer pixels change from black to white -- your
letter gets "thinner". If you can't get both a clean page and an
acceptable letter shape, then the source material isn't good enough to
support bitonal scanning. If so, you'll need a grayscale scan and
you'll have to put up with the larger file sizes that result.
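
To make the trade-off concrete, here is a minimal sketch in plain Python. The pixel values are invented for illustration, and they are expressed as "darkness" (0 = white paper, 255 = solid ink) so that raising the threshold turns pixels white, matching the description above; real scanner software may well use the opposite convention. The names `row`, `binarize` and `black_count` are hypothetical.

```python
# Hypothetical darkness values across one letter stroke (0 = white, 255 = solid ink).
# Index 1 is a stray speck on the paper; indices 3..10 are a stroke with soft edges.
row = [5, 70, 10, 60, 130, 220, 245, 245, 220, 130, 60, 10, 5]

def binarize(pixels, threshold):
    """Return 1 (black) where darkness >= threshold, else 0 (white)."""
    return [1 if p >= threshold else 0 for p in pixels]

def black_count(pixels, threshold):
    """Number of pixels that come out black at this threshold."""
    return sum(binarize(pixels, threshold))

for t in (50, 100, 200, 240):
    picture = "".join("#" if b else "." for b in binarize(row, t))
    print(t, picture, black_count(row, t))

# 50  .#.########..  9   (speck survives, stroke is full width)
# 100 ....######...  6   (speck gone, stroke slightly thinner)
# 200 .....####....  4   (stroke noticeably thinner)
# 240 ......##.....  2   (letter shape mostly lost)
```

In this toy case a threshold of about 100 gives a clean page with an acceptable stroke. If the speck were as dark as the stroke's edge pixels, no single threshold would separate them, which is the situation where bitonal scanning stops being adequate.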
Personally I'd make sure I had copies of the *original* non-processed
scans archived, though. It's really easy to lose quality when messing
around with threshold settings for 1bpp scans or tweaking
brightness/contrast for greyscale scans. Diagrams typically
suffer more than text, and it's very hard to "proof read" them after
processing to make sure they're spot-on - it's all too easy to miss
something.
Remember that on old documents the original page quality and contrast
can vary quite a bit, plus some pages may have aged differently to
others - so it's not like you can pick a setting that works for one page
and apply it to all.
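
As a rough illustration of why one setting doesn't carry over between pages, the sketch below picks a threshold per page with a simple iterative "mean of the two class means" scheme (often called isodata thresholding). That technique is my choice for the example, not something any particular scanning package is known to use, and the two "pages" are invented darkness values (0 = white, 255 = solid ink).

```python
def isodata_threshold(pixels):
    """Iteratively place the threshold midway between the mean of the
    lighter pixels and the mean of the darker pixels."""
    t = sum(pixels) / len(pixels)
    while True:
        light = [p for p in pixels if p < t]
        dark = [p for p in pixels if p >= t]
        if not light or not dark:
            return t  # page is uniform; nothing to separate
        new_t = (sum(light) / len(light) + sum(dark) / len(dark)) / 2
        if abs(new_t - t) < 0.5:
            return new_t
        t = new_t

def black_pixels(pixels, threshold):
    """Count pixels that would be rendered black at this threshold."""
    return sum(1 for p in pixels if p >= threshold)

# Hypothetical pages, each 90% paper and 10% ink.
clean_page = [20] * 90 + [230] * 10    # bright paper, dark ink
aged_page = [130] * 90 + [200] * 10    # yellowed paper, faded ink

t_clean = isodata_threshold(clean_page)   # 125.0
t_aged = isodata_threshold(aged_page)     # 165.0

print(black_pixels(aged_page, t_clean))   # 100: the whole aged page turns black
print(black_pixels(aged_page, t_aged))    # 10: only the ink remains
```

The setting that suits the clean page wipes out the aged one entirely, so any threshold has to be chosen (or at least checked) page by page.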
Personally I'd only want to be doing the scanning once as it's such a
time-consuming job (and OCRing is even worse!). Plus (as I'm sure is the
case with others on the list) I have some original documents where the
number of surviving copies worldwide is probably in single figures. I
wouldn't want a scan to exist where information is missing, and the
original source document is impossible to track down. For more common
documents it's less of an issue - but still a pain to re-scan anything
(and it means there are then both a fixed and a broken version floating
around out there too!)
I just do everything as greyscale, save and archive the scans with no
processing whatsoever, and only tweak
colour/brightness/contrast/threshold/whatever settings just prior to
feeding into OCR software. Storage really isn't an issue these days (I
keep data on both tape and hard disk).
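
For what it's worth, that workflow can be sketched in a few lines of Python using only the standard library. The directory layout and the function names (`make_working_copy`, `archive_unchanged`) are hypothetical; the idea is just that processing only ever touches copies, and that checksums of the untouched originals can be re-checked later.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path):
    """Hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def make_working_copy(archive_dir, work_dir):
    """Copy every file in archive_dir to work_dir.
    Returns {filename: sha256} for the original files."""
    archive_dir, work_dir = Path(archive_dir), Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    checksums = {}
    for src in sorted(archive_dir.iterdir()):
        if src.is_file():
            checksums[src.name] = sha256_of(src)
            shutil.copy2(src, work_dir / src.name)
    return checksums

def archive_unchanged(archive_dir, checksums):
    """True if every archived file still matches its recorded checksum."""
    return all(
        sha256_of(Path(archive_dir) / name) == digest
        for name, digest in checksums.items()
    )
```

Any threshold or contrast tweaking then happens on the files in the working directory, and `archive_unchanged` can be run afterwards (or after restoring from tape) to confirm the raw scans are still intact.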
cheers,
Jules