[...], but once data gets transformed by LZW it has very little
entropy left.
I'm not sure what you mean by "entropy" here.  The
information-theoretic meaning I know is diametrically opposed to the
sense in which you're using it (by that meaning, highly compressed
data has much, not little, entropy - almost as many bits of entropy,
of information content, as it has surface bits).  This is why I
consistently use "redundancy" below.
(If it didn't, by your argument, you could still recompress it with a
bitwise encoder like PAQ -- you can't (by a significant margin,
anyway)).
That's only because the encoder isn't smart enough to take advantage of
the redundancy that's there. When I say redundancy is present, I am
speaking from an information-theoretical standpoint; the presence of
redundancy does not mean that any particular compression algorithm can
squeeze it out. Saying that no encoding software exists that can
compress some data blob, *even if true*, does not mean that there is no
redundancy in it, only that - if there is any - no extant software
knows how to find it and compress it out.
For example, I can produce effectively unlimited amounts of data that
has very little information content but which I defy any program to
compress significantly.  All I need to do is pick a random key and
encrypt a stream of all-0-bits with that key using some decent
algorithm (3DES, IDEA, arcfour, whatever).  Only a few bits of
information content (the size of the key, basically), but about as
incompressible as it gets.
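For concreteness, here's a sketch of exactly that construction in
Python, using arcfour (RC4) only because it fits in a dozen lines -
not because it's a cipher anyone should still use.  The whole
megabyte is determined by a 16-byte key, so its information content
is tiny, yet zlib's best effort comes out slightly larger than the
input.

    import os
    import zlib

    def rc4_keystream(key, n):
        # Key-scheduling algorithm (KSA)
        S = list(range(256))
        j = 0
        for i in range(256):
            j = (j + S[i] + key[i % len(key)]) & 0xFF
            S[i], S[j] = S[j], S[i]
        # Pseudo-random generation algorithm (PRGA)
        out = bytearray()
        i = j = 0
        for _ in range(n):
            i = (i + 1) & 0xFF
            j = (j + S[i]) & 0xFF
            S[i], S[j] = S[j], S[i]
            out.append(S[(S[i] + S[j]) & 0xFF])
        return bytes(out)

    key = os.urandom(16)                 # the only real information here
    # all-0 plaintext XORed with the keystream is just the keystream itself
    ciphertext = rc4_keystream(key, 1 << 20)
    print(len(ciphertext), len(zlib.compress(ciphertext, 9)))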
Well, yeah; "is human-parsable" is a form of redundancy, but one
that is almost impossible for programs to take advantage of -
Actually, WinRK currently uses a dictionary to achieve the very best
Calgary Corpus score, so it is most definitely exploitable.
Using a dictionary helps with *some* "is human-parsable" redundancy.
It's certainly not enough for all of it. Consider
Yours with every time an postmark hasn't empty of two kin
beside a whose boot-print with. Besides talk pathological we
of chapter. ...
or perhaps
That's only because the encoder isn't smart enough to take
advantage of the redundancy that's there. When I say
redundancy is present, I am speaking from a cook's pantry;
buying pre-ground pepper at the corner store won't be nearly as
good as grinding it yourself. But a good vinegar exists that
can compress some data blob, *even if true*, ...
The former might be recognizable as nonsense if the code knows enough
English grammar to identify parts of speech and notice the lack of a
valid parse tree.  The latter, well, recognizing that sort of
nonsense as nonsense is as AI-complete as the natural language problem.
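For what it's worth, here's a toy sketch of why a word dictionary
only goes so far.  (This is nothing like WinRK's actual model; the
tiny vocabulary below just stands in for the big English dictionary
such a coder would ship, and the "ordinary" sentence is made up for
contrast.)  Both samples encode to one small index per word, because
the coder only checks word membership; whether the words hang
together grammatically never enters into it.

    # ordinary sentence is invented here purely for comparison
    ordinary = "beside a boot-print with an empty postmark of yours"
    nonsense = ("Yours with every time an postmark hasn't empty of two kin "
                "beside a whose boot-print with.")

    def tokens(text):
        return [w.strip(".,").lower() for w in text.split()]

    # stands in for the large English dictionary a real coder would ship
    vocabulary = sorted(set(tokens(ordinary)) | set(tokens(nonsense)))
    index = {word: i for i, word in enumerate(vocabulary)}

    def encode(text):
        # one small integer per dictionary word; grammar never enters into it
        return [index[w] for w in tokens(text)]

    print(encode(ordinary))
    print(encode(nonsense))   # encodes just as smoothly -- the gibberish is invisible here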
/~\ The ASCII der Mouse
\ / Ribbon Campaign
X Against HTML mouse at rodents.montreal.qc.ca
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B