Antonio Carlini wrote:
>> Although it doesn't really know what text is, per se, one of its
>> algorithms is to find glyph-like things. Once it has all glyph-like
>> things isolated on a page, it compares them all to each other, and if
>> two glyphs are similar enough, it will just represent them both (or N
>> of them) with one compressed glyph image.
>
> That looks like information loss to me.
yes, it is information loss. scanning bilevel is a much worse
information loss. scanning at 300 dpi, or 600 dpi, or 1000 dpi is
information loss. viewing the document on a CRT is information loss.
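
To make the matching step concrete, here is a rough sketch of the idea
in Python. This is not DjVu's actual JB2 encoder; the pixel-count
comparison and the threshold are made-up stand-ins:

    def mismatch(a, b):
        # count differing pixels between two same-sized bilevel bitmaps
        return sum(pa != pb for ra, rb in zip(a, b)
                   for pa, pb in zip(ra, rb))

    def encode_glyphs(glyphs, max_mismatch=3):
        # keep one representative image per cluster of near-identical
        # glyphs; each input glyph becomes an index into that list
        reps, indices = [], []
        for g in glyphs:
            for i, r in enumerate(reps):
                if (len(r) == len(g) and len(r[0]) == len(g[0])
                        and mismatch(g, r) <= max_mismatch):
                    indices.append(i)  # near-duplicate: reuse stored image
                    break
            else:
                reps.append(g)         # genuinely new shape: keep its pixels
                indices.append(len(reps) - 1)
        return reps, indices

The only information loss is the branch that reuses a stored image
instead of the pixels that were actually scanned.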
> If one of those glyph-like things was not the same symbol as the
> others, then the algorithm has just introduced an error.
yes, you are right, *if*. And that *if* is where you go wrong: in
practice it is very unlikely to make a difference.
>> So for OCR purposes, I don't think this type of compression really
>> hurts -- it replaces one plausible "e" image with another one.
>
> But one of them might have been something other than an "e".
Antonio --
yes, if you assume that the encoder is going to make gross errors, then
it is a bad program and it shouldn't be used. but have you ever used
it? it doesn't do anything of the sort.
imagine a page with 2000 characters, all of one font and one point size,
and that 150 of them are the letter "e". In a tiff image, there will be
150 copies of that e, all very slightly different. In the djvu version,
the number of unique 'e's will depend on the scanned image, but it isn't
going to replace them all with a single 'e' -- there might be 50 'e's
instead of 150. Think about that -- to the naked eye, all 150 look
identical unless you blow up the image with a magnifying tool. djvu is
still being selective enough about what matches and what doesn't that it
still has 50 copies of the 'e' after it has collapsed the ones that are
similar enough. It isn't very aggressive at all about coalescing glyphs.
As far as I know there is also a lower bound on the glyph size it will
try to group, so that at really small point sizes nothing bad happens at
all. The differences it does allow are truly inconsequential.
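
You can convince yourself of this with a toy experiment. The noise
model, the threshold, and the minimum-size bound below are assumptions
for illustration, not DjVu's real parameters:

    import random

    random.seed(1)
    # crude 8x8 'e'-ish glyph: top bar, middle bar, left stem
    BASE = [[1 if r in (0, 4) or c == 0 else 0 for c in range(8)]
            for r in range(8)]

    def noisy_scan(bitmap, flips=2):
        # simulate scanner noise by flipping a couple of random pixels
        out = [row[:] for row in bitmap]
        for _ in range(flips):
            r = random.randrange(len(out))
            c = random.randrange(len(out[0]))
            out[r][c] ^= 1
        return out

    def mismatch(a, b):
        return sum(pa != pb for ra, rb in zip(a, b)
                   for pa, pb in zip(ra, rb))

    def count_classes(glyphs, max_mismatch=2, min_pixels=16):
        reps = []
        for g in glyphs:
            if len(g) * len(g[0]) < min_pixels:
                reps.append(g)      # tiny glyphs are never coalesced
                continue
            if not any(mismatch(g, r) <= max_mismatch for r in reps):
                reps.append(g)
        return len(reps)

    scans = [noisy_scan(BASE) for _ in range(150)]
    print(count_classes(scans))     # well above 1, well below 150

The exact count depends on the noise, but the point stands: with a
conservative threshold you end up with dozens of classes for the same
letter, not one.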
It is like complaining that mp3 (or insert your favorite encoder here)
sucks because in theory it can do a poor job. In practice, encoders that
do a poor job get left behind and the ones that do a good job get used.