Antonio Carlini wrote:
>> Although it doesn't really know what text is, per se, one of its
>> algorithms is to find glyph-like things. Once it has all glyph-like
>> things isolated on a page, it compares them all to each other, and if
>> two glyphs are similar enough, it will just represent them both (or N
>> of them) with one compressed glyph image.
>
> That looks like information loss to me.
yes, it is information loss. scanning bilevel is a much worse
information loss. scanning at 300 dpi, or 600 dpi, or 1000 dpi is
information loss. viewing the document on a CRT is information loss.
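
To make the matching step concrete, here is a rough sketch of the idea
in Python. This is not DjVu's actual JB2 encoder; the pixel-count
comparison and the threshold are made-up stand-ins:

    def mismatch(a, b):
        # count differing pixels between two same-sized bilevel bitmaps
        return sum(pa != pb for ra, rb in zip(a, b)
                   for pa, pb in zip(ra, rb))

    def encode_glyphs(glyphs, max_mismatch=3):
        # keep one representative image per cluster of near-identical
        # glyphs; each input glyph becomes an index into that list
        reps, indices = [], []
        for g in glyphs:
            for i, r in enumerate(reps):
                if (len(r) == len(g) and len(r[0]) == len(g[0])
                        and mismatch(g, r) <= max_mismatch):
                    indices.append(i)  # near-duplicate: reuse stored image
                    break
            else:
                reps.append(g)         # genuinely new shape: keep its pixels
                indices.append(len(reps) - 1)
        return reps, indices

The only information loss is the branch that reuses a stored image
instead of the pixels that were actually scanned.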
> If one of those glyph-like things was not the same symbol as the
> others, then the algorithm has just introduced an error.
yes, you are right, *if*. And that *if* is where you go wrong: in
practice it is very unlikely to make a difference.
>> So for OCR purposes, I don't think this type of compression really
>> hurts -- it replaces one plausible "e" image with another one.
>
> But one of them might have been something other than an "e".
Antonio --
yes, if you assume that the encoder is going to make gross errors, then
it is a bad program and it shouldn't be used. but have you ever used
it? it doesn't do anything of the sort.
imagine a page with 2000 characters, all of one font and one point size,
and that 150 of them are the letter "e". In a tiff image, there will be
150 copies of that e, all very slightly different. In the djvu version,
the number of unique 'e's will depend on the scanned image, but it isn't
going to replace them all with a single 'e' -- there might be 50 'e's
instead of 150. Think about that -- to the naked eye, all 150 look
identical unless you blow up the image with a magnifying tool. djvu is
still being selective enough about what matches and what doesn't that it
still has 50 copies of the 'e' after it has collapsed the ones that are
similar enough. It isn't very aggressive at all about coalescing glyphs.
As far as I know there is also a lower bound on the glyph size it will
try to group, so that at really small point sizes nothing bad happens at
all. The differences it does allow are truly inconsequential.
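
You can convince yourself of this with a toy experiment. The noise
model, the threshold, and the minimum-size bound below are assumptions
for illustration, not DjVu's real parameters:

    import random

    random.seed(1)
    # crude 8x8 'e'-ish glyph: top bar, middle bar, left stem
    BASE = [[1 if r in (0, 4) or c == 0 else 0 for c in range(8)]
            for r in range(8)]

    def noisy_scan(bitmap, flips=2):
        # simulate scanner noise by flipping a couple of random pixels
        out = [row[:] for row in bitmap]
        for _ in range(flips):
            r = random.randrange(len(out))
            c = random.randrange(len(out[0]))
            out[r][c] ^= 1
        return out

    def mismatch(a, b):
        return sum(pa != pb for ra, rb in zip(a, b)
                   for pa, pb in zip(ra, rb))

    def count_classes(glyphs, max_mismatch=2, min_pixels=16):
        reps = []
        for g in glyphs:
            if len(g) * len(g[0]) < min_pixels:
                reps.append(g)      # tiny glyphs are never coalesced
                continue
            if not any(mismatch(g, r) <= max_mismatch for r in reps):
                reps.append(g)
        return len(reps)

    scans = [noisy_scan(BASE) for _ in range(150)]
    print(count_classes(scans))     # well above 1, well below 150

The exact count depends on the noise, but the point stands: with a
conservative threshold you end up with dozens of classes for the same
letter, not one.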
It is like complaining that mp3 (or insert your favorite encoder here)
sucks because in theory it can do a poor job. In practice, encoders that
do a poor job get left behind and the ones that do a good job get used.