Eric Smith wrote:
> Jim Battle wrote about DjVu:
>> So for OCR purposes, I don't think this type of compression really
>> hurts -- it replaces one plausible "e" image with another one.
>
> No, that's exactly the kind of BS you DO NOT WANT for a file that you
> plan to OCR. What if you've got a mathematical formula that has some
> Latin "e" letters and some Greek epsilons in it? Or perhaps normal
> and italic "e" letters? DjVu may well think they are "close enough",
> while a good OCR program might be able to tell them apart accurately.
Eric, the level of difference it allows is nothing so gross as "gee,
these are both serif fonts, so they're close enough." No. When you
scan to bilevel, exactly where an edge crosses the threshold depends on
the exact placement of the page, on the scanner's threshold setting, and
probably on the phase of the 60 Hz AC, since that couples to some degree
into the lamp brightness (hopefully not much at all, but if you are
splitting hairs...). Thus there is no "perfect" scan. And if you are
scanning at such a low resolution that two "e"s from different fonts
might be confused with each other, your OCR attempts will be hopeless
anyway.
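To see why there is no "perfect" scan, here is a toy simulation (not a model of any real scanner) of bilevel thresholding: the same soft intensity edge, re-scanned with a small random page-placement offset and a little sensor noise, produces different bitmaps from run to run. All the names and parameters here are made up for illustration.

```python
import random

def scan_edge(offset, noise, n=20, threshold=0.5):
    """Simulate bilevel scanning of one soft edge: each pixel samples an
    ideal intensity ramp shifted by a sub-pixel placement offset plus a
    little sensor/lamp noise, then is thresholded to 0 or 1."""
    bits = []
    for x in range(n):
        ideal = min(1.0, max(0.0, (x - n / 2 + offset) / 4 + 0.5))
        sensed = ideal + random.uniform(-noise, noise)
        bits.append(1 if sensed >= threshold else 0)
    return bits

random.seed(1)
a = scan_edge(offset=0.0, noise=0.05)
b = scan_edge(offset=0.3, noise=0.05)
print("scan 1:", a)
print("scan 2:", b)

# Re-scan the "same" edge 1000 times with small random page placement:
scans = {tuple(scan_edge(offset=random.uniform(-0.5, 0.5), noise=0.05))
         for _ in range(1000)}
print("distinct bitmaps from 1000 scans:", len(scans))
```

The count of distinct bitmaps comes out greater than one: the pixels sitting near the threshold flip depending on placement and noise, which is exactly the error floor being argued about.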
> The point of wanting lossless compression is that even if a good
> OCR program today can't tell them apart accurately, a good OCR program
> ten years from now might.
Lossless from what? Scan your perfectly clean page with the most
expensive scanner you can afford, then reseat the page and do it again,
1000 times; no two scans are going to be exactly the same.
So "lossless" in this context means perfectly preserving one known
imperfect image. Given that there is an error floor anyway, allowing
substitutions that stay within that error bound in exchange for a 3x
compression improvement sounds like a great algorithm to me.
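The substitution trade-off can be sketched with a toy greedy symbol matcher (this is an illustration of the idea, not DjVu's actual JB2 algorithm; the function names and the 5x4 "glyphs" are invented). With an error bound of zero every glyph gets its own dictionary entry; with a small bound, two noisy scans of the same "e" share one entry while a genuinely different letter still gets its own:

```python
def hamming(a, b):
    """Number of differing pixels between two equal-sized bitmaps."""
    return sum(x != y
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))

def compress(glyphs, max_err):
    """Greedy symbol-dictionary compression: each glyph either reuses an
    existing dictionary entry within max_err differing pixels, or
    becomes a new entry. Returns (dictionary, per-glyph indices)."""
    dictionary, indices = [], []
    for g in glyphs:
        for i, d in enumerate(dictionary):
            if hamming(g, d) <= max_err:
                indices.append(i)
                break
        else:
            dictionary.append(g)
            indices.append(len(dictionary) - 1)
    return dictionary, indices

e1 = [[0,1,1,0],[1,0,0,1],[1,1,1,1],[1,0,0,0],[0,1,1,1]]  # an "e"
e2 = [[0,1,1,0],[1,0,0,1],[1,1,1,1],[1,0,0,1],[0,1,1,1]]  # same "e", one noise pixel
c  = [[0,1,1,1],[1,0,0,0],[1,0,0,0],[1,0,0,0],[0,1,1,1]]  # a "c"

d0, idx0 = compress([e1, e2, c], max_err=0)  # lossless
d2, idx2 = compress([e1, e2, c], max_err=2)  # lossy, within the error floor
print(len(d0), idx0)  # 3 [0, 1, 2]
print(len(d2), idx2)  # 2 [0, 0, 1]
```

The lossy dictionary is a third smaller here, and the only "loss" is a single pixel that scan-to-scan noise would have flipped anyway; the dispute in this thread is whether real pages stay that tame.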
> But if you use lossy compression now, you are likely discarding
> information that the OCR program will need.
If you scan at a good enough resolution (300 dpi for any normal-sized
text), I doubt it makes one whit of difference.
As an aside, I worked for four years at a company that ended up merging
with what was Caere and is now ScanSoft. If Bob Stek is still
subscribed, he will remember the company. He used the Calera
Recognition Systems scanners to scan all of the Sherlock Holmes books
and made the first electronic archive of all of the stories. What I
knew about OCR is now 10 years out of date, so perhaps things have
changed radically, but I doubt it.