Jim Battle wrote:
> When you scan to bilevel, exactly where an edge crosses the threshold
> is subject to the exact placement of the page, what the scanner's
> threshold is, and probably what phase the 60 Hz AC is in, since it
> couples to some degree to the lamp brightness (hopefully not much at
> all, but if you are splitting hairs...). Thus there is no "perfect"
> scan.
Never claimed there was. But I don't want software to DELIBERATELY
muck about with the image, replacing one glyph with another. That's
potentially MUCH WORSE than any effect you're going to get from the
page being shifted or skewed a tiny amount.
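To make that threshold point concrete, here is a toy sketch in Python;
the 128 cutoff and the 10% brightness wobble are numbers I made up, not
anything measured from a real scanner:

    # Toy model: each gray sample is compared against a fixed threshold,
    # so a small brightness change flips the pixels sitting near the cutoff.
    edge = [10, 60, 120, 180, 240]   # gray ramp across a glyph edge (0..255)
    THRESHOLD = 128                  # assumed scanner threshold

    def to_bilevel(samples, gain=1.0):
        """Threshold gray samples to 1 (ink) / 0 (paper) after a gain tweak."""
        return [1 if s * gain < THRESHOLD else 0 for s in samples]

    print(to_bilevel(edge))            # [1, 1, 1, 0, 0]
    print(to_bilevel(edge, gain=1.1))  # [1, 1, 0, 0, 0]: the edge moved a pixel

That one-pixel wobble is about the worst a straight scan does to a
glyph; deliberate substitution can do far more damage.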
> If you are scanning at such a low resolution that two "e"s from
> different fonts might get confused with each other, your OCR attempts
> will be hopeless as well.
But that's exactly what you yourself said the DjVu software does: it
replaces glyphs with other glyphs that it thinks are similar. No matter
how good a job it thinks it can do of that, I DO NOT WANT IT FOR
ARCHIVAL DOCUMENTS.
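To spell out what that substitution amounts to, here is a toy sketch;
the Hamming-distance test and the MAX_DIFF cutoff are stand-ins of my
own, not DjVu's actual JB2 matcher:

    # Toy glyph substitution: reuse an earlier "close enough" bitmap instead
    # of the one actually scanned.  Assumes all bitmaps are the same size.
    def hamming(a, b):
        """Count differing pixels between two same-sized bilevel bitmaps."""
        return sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

    MAX_DIFF = 2  # "close enough" cutoff; this is where distinct glyphs merge

    def substitute(glyphs):
        """Replace each glyph with the first earlier prototype within MAX_DIFF."""
        prototypes, out = [], []
        for g in glyphs:
            match = next((p for p in prototypes if hamming(p, g) <= MAX_DIFF), None)
            if match is None:
                prototypes.append(g)
                match = g
            out.append(match)  # lossy: small real differences are discarded
        return out

    e = [[0,1,1,0],[1,0,0,1],[1,0,0,0],[0,1,1,1]]  # crude 4x4 "e"
    c = [[0,1,1,0],[1,0,0,1],[1,0,0,0],[0,1,1,0]]  # crude 4x4 "c", one pixel off
    print(substitute([e, c])[1] == e)              # True: the "c" came back as an "e"

A cutoff like that is exactly how a worn "c" can end up rendered as an
"e" in the archived image.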
I normally scan at 300 or 400 DPI; when there is very tiny text I
sometimes use 600 DPI.
Even at those resolutions, it can be difficult to tell some characters
apart, especially from poor-quality originals. But usually I can do
it if I study the scanned page very closely. No, OCR today cannot do
as good a job at that as I can. Someday OCR may be better. But
arbitrarily replacing the glyphs with other ones the software considers
"good enough" is going to f*&# up any possibility of doing this by
either a human OR OCR.
And all to make the file a little smaller. A DVD-R costs about $0.25
and stores 4.7 GB of data, so I just can't get excited about using lossy
encoding for text and line-art pages that usually encode to no more than
50K bytes per page with lossless G4.
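The arithmetic, using those numbers (the figures are the ones above;
only the computation is mine):

    DISC_COST = 0.25    # dollars per DVD-R
    DISC_BYTES = 4.7e9  # 4.7 GB per disc
    PAGE_BYTES = 50e3   # ~50K bytes per lossless G4 page

    pages_per_disc = DISC_BYTES / PAGE_BYTES
    print(f"{pages_per_disc:,.0f} pages per disc")        # 94,000 pages per disc
    print(f"${DISC_COST / pages_per_disc:.7f} per page")  # $0.0000027 per page

At well under a thousandth of a cent per page, lossy glyph substitution
buys essentially nothing.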
Eric