>> And compress (.Z) compression still leaves a lot of redundancy
>> in the compressed data.
> Actually, it doesn't -- you can't compress a .Z very well.
Sure it does. It's just not in a form very usable by most compression
programs.
As a simple example, if the high bit of a compressed code is 1, the
next bit is significantly more likely to be 0 than 1 - at least until
the table fills up. For a more complex example, consider the Shannon
estimate of one bit per letter for normal connected English text. At
one bit per character, the first five lines of this paragraph weigh in
at 352 bits, plus a type tag meaning "English text", which should fit
within 32 bits; compress(1) output for that input is well over two
thousand bits - that's still over 75% redundant.
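If you want to see that skew for yourself, here is a rough Python
sketch - a toy LZW coder (codes growing from 9 to 16 bits, no CLEAR
code, table frozen when full, so it is *not* compress(1)-compatible)
that tallies, over the codes it emits, how often the bit below a set
high bit is 0 versus 1.  The function names are mine, not anything
standard:

def lzw_codes(data: bytes):
    """Yield (code, width) pairs from a toy LZW coder."""
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    w = b""
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc
        else:
            # Width needed to hold the largest code assigned so far.
            width = min(16, max(9, (next_code - 1).bit_length()))
            yield table[w], width
            if next_code < 1 << 16:     # freeze table at 16 bits
                table[wc] = next_code
                next_code += 1
            w = bytes([byte])
    if w:
        width = min(16, max(9, (next_code - 1).bit_length()))
        yield table[w], width

def high_bit_stats(data: bytes):
    """Among codes whose high bit (at the current width) is set,
    count how often the next bit down is 0 versus 1."""
    zeros = ones = 0
    for code, width in lzw_codes(data):
        if (code >> (width - 1)) & 1:
            if (code >> (width - 2)) & 1:
                ones += 1
            else:
                zeros += 1
    return zeros, ones

if __name__ == "__main__":
    import sys
    zeros, ones = high_bit_stats(open(sys.argv[1], "rb").read())
    print("high bit set: next bit 0 in %d codes, 1 in %d" % (zeros, ones))

Since high-bit-set codes at width w start at 2^(w-1) and fill upward,
the second bit stays 0 for the whole first half of each width step -
the same sort of skew the real compress(1) bit stream shows.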
That's a rather small sample, and LZW works better on long texts. I
happen to have the King James bible online, and tried the same thing
with Genesis. The version I have is 214518 characters; if we accept
the Shannon estimate, there are 214518 bits of information there.
compress(1) output is 76583 bytes, or 612664 bits, or about 65%
redundant. Even if you think Shannon was wrong by a factor of two
(unlikely, IMO), 30% of the compress output is redundancy. (By way of
comparison, gzip --best gave 62540 bytes, about 57% redundant;
bzip2 -9, 48392 bytes, 44+% redundant.)
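(For the record, the arithmetic behind those percentages is just 1
minus information bits over output bits:

def redundancy(info_bits: int, output_bytes: int) -> float:
    """Fraction of the compressed output that is redundant, taking
    the Shannon estimate as the true information content."""
    return 1.0 - info_bits / (8 * output_bytes)

info = 214518   # 1 bit/char for Genesis, per the Shannon estimate
for name, size in [("compress", 76583),
                   ("gzip --best", 62540),
                   ("bzip2 -9", 48392)]:
    print("%s: %.1f%% redundant" % (name, 100 * redundancy(info, size)))

which prints 65.0%, 57.1%, and 44.6% for the three cases above.)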
The trouble is, a lot of that redundancy takes the form of things like
the rules of English grammar, facts like "`theesompression' is not an
English word", properties like the use of the biblical register
(instead of, say, the scientific grant proposal register), which
compression programs are not good enough to be able to take advantage
of - but a human, driving a good human's-helper program, can.
> What I think you're getting at is that, since the source data has
> a lot of redundancy *and* is human parsable, it can be
> reconstructed more easily in the case of mangled data.
Well, yeah; "is human parsable" is a form of redundancy, but one that
is almost impossible for programs to take advantage of - certainly
impossible for current programs. That's (part of) why my suggestions
for recovering from a damaged .Z file keep a human in the loop - to
leverage the human ability to detect bogus decompressions based on
redundancy programs have trouble using.
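To make that concrete, here is the rough shape such a helper might
take - purely illustrative, assuming the damage is a single corrupted
byte at a known offset, and leaning on gzip -dc (which understands .Z)
for the actual decompression; the function name and arguments are made
up for the example:

import subprocess, sys

def try_candidates(path, offset):
    """Substitute every possible value for one damaged byte, attempt
    decompression, and show a human the survivors for judgment."""
    data = bytearray(open(path, "rb").read())
    for candidate in range(256):
        data[offset] = candidate
        proc = subprocess.run(["gzip", "-dc"], input=bytes(data),
                              capture_output=True)
        if proc.returncode == 0:
            print("--- byte value %#04x decodes; tail of output:" % candidate)
            print(proc.stdout[-200:].decode("latin-1"))

if __name__ == "__main__":
    try_candidates(sys.argv[1], int(sys.argv[2]))

Since LZW carries no checksum, plenty of wrong candidates will still
decode without complaint; it's the human looking at the output who
throws out the gibberish.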
/~\ The ASCII                           der Mouse
\ / Ribbon Campaign
 X  Against HTML               mouse at rodents.montreal.qc.ca
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B