Eric Smith wrote:
Jim Battle wrote:
When you scan to bilevel, exactly where an edge crosses the threshold
is subject to the exact placement of the page, what the scanner's
threshold is, and probably what phase the 60 Hz AC is at, since it
couples to some degree with the lamp brightness (hopefully not much at
all, but if you are splitting hairs...). Thus there is no "perfect" scan.
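(As a toy illustration of why there is no "perfect" scan, here is a
quick Python sketch; the pixel values, threshold, and gain drift are
invented, not any particular scanner's numbers.)

    # Invented grayscale values along one edge; a tiny lamp-brightness
    # drift (gain) flips whichever pixel sits nearest the threshold.
    edge_pixels = [118, 124, 129, 133]
    threshold = 128

    def to_bilevel(pixels, gain=1.0):
        # 1 = black, 0 = white
        return [1 if p * gain < threshold else 0 for p in pixels]

    print(to_bilevel(edge_pixels))             # [1, 1, 0, 0]
    print(to_bilevel(edge_pixels, gain=0.97))  # [1, 1, 1, 0]  one pixel flips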
Never claimed there was. But I don't want software to DELIBERATELY
muck about with the image, replacing one glyph with another. That's
potentially MUCH WORSE than any effect you're going to get from the
page being shifted or skewed a tiny amount.
"Potentially" is the key word. If the encoding software is crappy, then
such a substitution could turn all "e"s into "x"s, sure. But the djvu
encoder doesn't make gross substitutions like that.
Contrary to what you say, skew has a much larger effect on the sampling
than djvu's encoder does. Which scanner you use has a much larger
effect on the sampling too.
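To make that concrete, here is a simplified sketch of the kind of
shape-dictionary matching a JB2-style encoder performs. The matching
rule and the tolerance below are illustrative, not DjVu's actual
algorithm; the point is that a glyph is reused only when it differs
from a stored shape by a few pixels, far less than the difference
between an "e" and an "x".

    # Simplified symbol matcher in the spirit of a JB2-style encoder
    # (illustrative only; not LizardTech's actual matching criterion).
    def pixel_diff(a, b):
        # count mismatched pixels between two same-sized bitmaps (rows of 0/1)
        return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

    def encode_symbols(glyphs, tolerance=2):
        dictionary, output = [], []
        for g in glyphs:
            match = next((i for i, d in enumerate(dictionary)
                          if len(d) == len(g) and pixel_diff(d, g) <= tolerance),
                         None)
            if match is None:
                dictionary.append(g)       # unfamiliar shape: store it exactly
                match = len(dictionary) - 1
            output.append(match)           # page is rebuilt from stored shapes
        return dictionary, output

    e  = [[0,1,1,0],[1,0,0,1],[1,1,1,1],[1,0,0,0],[0,1,1,1]]
    e2 = [[0,1,1,0],[1,0,0,1],[1,1,1,1],[1,0,0,1],[0,1,1,1]]  # 1 pixel off
    print(encode_symbols([e, e2]))   # ([e], [0, 0]): second "e" reuses the first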
...
I normally scan at 300 or 400 DPI; when there is very tiny text I
sometimes use 600 DPI.
Even at those resolutions, it can be difficult to tell some characters
apart, especially from poor-quality originals. But usually I can do
it if I study the scanned page very closely. No, OCR today cannot do
as good a job at that as I can. Someday OCR may be better. But
arbitrarily replacing the glyphs with other ones the software considers
"good enough" is going to f*&# up any possibility of doing this by
either a human OR OCR.
Eric, in picking a case where the djvu algorithm *might* cause problems,
you must also concede that scanning in bilevel, even losslessly, is
going to be a bad choice too. If the page is that poor, you should be
using grayscale.
Why be religious about losslessness and claim anything less is going to
"f*&#" up your efforts when you've just tossed away the bulk of the
information?
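For a rough sense of how much gets thrown away going to bilevel
(back-of-envelope only, for an 8.5 x 11 inch page at 300 DPI):

    # Raw data in one 8.5 x 11 inch page scanned at 300 DPI (illustrative).
    pixels = int(8.5 * 300) * int(11 * 300)   # 2550 x 3300 = 8,415,000 pixels
    print(pixels // 8)    # ~1.0 MB of raw bilevel data (1 bit per pixel)
    print(pixels)         # ~8.4 MB of raw 8-bit grayscale (8 bits per pixel)

Seven of every eight bits are gone before any encoder ever sees the page.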
And all to make the file a little smaller. DVD-R costs about $0.25 to
store 4.7GB of data, so I just can't get excited about using lossy
encoding for text and line art pages that usually don't encode with
lossless G4 to more than 50K bytes per page.
"A little" can be 3x. For distribution, it is a big deal. Until
recently, it made a significant difference in disk cost too, but now
that you can get 120 GB hard drives in a box of cereal, that isn't so
much of a concern.
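A quick back-of-envelope, using the 50K-per-page and 3x figures from
this thread (actual ratios vary by document):

    # Pages per disc at the sizes quoted above.
    dvd_bytes = 4.7e9
    g4_page   = 50e3            # a typical lossless G4 page, per the figure above
    djvu_page = g4_page / 3     # assuming the roughly 3x lossy advantage

    print(round(dvd_bytes / g4_page))    # ~94,000 pages per DVD-R, lossless G4
    print(round(dvd_bytes / djvu_page))  # ~282,000 pages with the lossier encoding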
Of course you can use whatever format you want for your archiving.
Making it available in a more accessible format means that more people
are likely to take advantage of it.
For most documents, it is the information that I care about preserving,
not the pixels. I would be overjoyed if Adobe would buy out LizardTech
and adopt some of their technology, even the lossy bits.