Eric Smith wrote:
Jim Battle wrote:
When you scan to bilevel, exactly where an edge crosses the threshold
is subject to the exact placement of the page, what the scanner's
threshold is, and probably what phase the 60 Hz AC is at, since it
couples to some degree with the lamp brightness (hopefully not much at
all, but if you are splitting hairs...). Thus there is no "perfect" scan.
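(As a toy illustration of why there is no "perfect" scan, here is a
quick Python sketch; the pixel values, threshold, and gain drift are
invented, not any particular scanner's numbers.)

    # Invented grayscale values along one edge; a tiny lamp-brightness
    # drift (gain) flips whichever pixel sits nearest the threshold.
    edge_pixels = [118, 124, 129, 133]
    threshold = 128

    def to_bilevel(pixels, gain=1.0):
        # 1 = black, 0 = white
        return [1 if p * gain < threshold else 0 for p in pixels]

    print(to_bilevel(edge_pixels))             # [1, 1, 0, 0]
    print(to_bilevel(edge_pixels, gain=0.97))  # [1, 1, 1, 0]  one pixel flips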
Never claimed there was. But I don't want software to DELIBERATELY
muck about with the image, replacing one glyph with another. That's
potentially MUCH WORSE than any effect you're going to get from the
page being shifted or skewed a tiny amount.
"Potentially" is the key word. If the encoding software is crappy, then
such a substitution could turn all "e"s into "x"s, sure. But the djvu
encoder doesn't make gross substitutions like that.
Contrary to what you say, skew has a much larger effect on the sampling
than djvu's encoder does. Which scanner you use has a much larger
effect on the sampling too.
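To make that concrete, here is a simplified sketch of the kind of
shape-dictionary matching a JB2-style encoder performs. The matching
rule and the tolerance below are illustrative, not DjVu's actual
algorithm; the point is that a glyph is reused only when it differs
from a stored shape by a few pixels, far less than the difference
between an "e" and an "x".

    # Simplified symbol matcher in the spirit of a JB2-style encoder
    # (illustrative only; not LizardTech's actual matching criterion).
    def pixel_diff(a, b):
        # count mismatched pixels between two same-sized bitmaps (rows of 0/1)
        return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

    def encode_symbols(glyphs, tolerance=2):
        dictionary, output = [], []
        for g in glyphs:
            match = next((i for i, d in enumerate(dictionary)
                          if len(d) == len(g) and pixel_diff(d, g) <= tolerance),
                         None)
            if match is None:
                dictionary.append(g)       # unfamiliar shape: store it exactly
                match = len(dictionary) - 1
            output.append(match)           # page is rebuilt from stored shapes
        return dictionary, output

    e  = [[0,1,1,0],[1,0,0,1],[1,1,1,1],[1,0,0,0],[0,1,1,1]]
    e2 = [[0,1,1,0],[1,0,0,1],[1,1,1,1],[1,0,0,1],[0,1,1,1]]  # 1 pixel off
    print(encode_symbols([e, e2]))   # ([e], [0, 0]): second "e" reuses the first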
...
I normally scan at 300 or 400 DPI; when there is very tiny text I
sometimes use 600 DPI.
Even at those resolutions, it can be difficult to tell some characters
apart, especially from poor-quality originals. But usually I can do
it if I study the scanned page very closely. No, OCR today cannot do
as good a job at that as I can. Someday OCR may be better. But
arbitrarily replacing the glyphs with other ones the software considers
"good enough" is going to f*&# up any possibility of doing this by
either a human OR OCR.
Eric, in picking a case where the djvu algorithm *might* cause problems,
you must also concede that scanning in bilevel, even losslessly, is
going to be a bad choice too. If the page is that poor, you should be
using grayscale.
Why be religious about losslessness and claim anything less is going to
"f*&#" up your efforts when you've just tossed away the bulk of the
information?
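For a rough sense of how much gets thrown away going to bilevel
(back-of-envelope only, for an 8.5 x 11 inch page at 300 DPI):

    # Raw data in one 8.5 x 11 inch page scanned at 300 DPI (illustrative).
    pixels = int(8.5 * 300) * int(11 * 300)   # 2550 x 3300 = 8,415,000 pixels
    print(pixels // 8)    # ~1.0 MB of raw bilevel data (1 bit per pixel)
    print(pixels)         # ~8.4 MB of raw 8-bit grayscale (8 bits per pixel)

Seven of every eight bits are gone before any encoder ever sees the page.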
And all to make the file a little smaller. DVD-R costs about $0.25 to
store 4.7GB of data, so I just can't get excited about using lossy
encoding for text and line art pages that usually don't encode with
lossless G4 to more than 50K bytes per page.
"A little" can be 3x. For distribution, it is a big deal. Until
recently, it made a significant difference in disk cost too, but now
that you can get 120 GB hard drives in a box of cereal, that isn't so
much of a concern.
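A quick back-of-envelope, using the 50K-per-page and 3x figures from
this thread (actual ratios vary by document):

    # Pages per disc at the sizes quoted above.
    dvd_bytes = 4.7e9
    g4_page   = 50e3            # a typical lossless G4 page, per the figure above
    djvu_page = g4_page / 3     # assuming the roughly 3x lossy advantage

    print(round(dvd_bytes / g4_page))    # ~94,000 pages per DVD-R, lossless G4
    print(round(dvd_bytes / djvu_page))  # ~282,000 pages with the lossier encoding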
Of course you can use whatever format you want for your archiving.
Making it available in a more accessible format means that more people
are likely to take advantage of it.
For most documents, it is the information that I care about preserving,
not the pixels. I would be overjoyed if Adobe would buy out LizardTech
and adopt some of their technology, even the lossy bits.