[...], but once data gets transformed by LZW it has very little
entropy left.
I'm not sure what you mean by "entropy" here.  The
information-theoretic meaning I know is diametrically opposed to the
sense in which you're using it (by that meaning, highly compressed
data has much, not little, entropy - almost as many bits of entropy,
of information content, as it has surface bits).  This is why I
consistently use "redundancy" below.
(If it didn't, by your argument, you could still recompress it with a
bitwise encoder like PAQ -- you can't (by a significant margin,
anyway)).
That's only because the encoder isn't smart enough to take advantage of
the redundancy that's there. When I say redundancy is present, I am
speaking from an information-theoretical standpoint; the presence of
redundancy does not mean that any particular compression algorithm can
squeeze it out. Saying that no encoding software exists that can
compress some data blob, *even if true*, does not mean that there is no
redundancy in it, only that - if there is any - no extant software
knows how to find it and compress it out.
For example, I can produce effectively unlimited amounts of data that
has very little information content but which I defy any program to
compress significantly.  All I need to do is pick a random key and
encrypt a stream of all-0-bits with that key using some decent
algorithm (3DES, IDEA, arcfour, whatever).  Only a few bits of
information content (the size of the key, basically), but about as
incompressible as it gets.
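For concreteness, here's a sketch of exactly that construction in
Python, using arcfour (RC4) only because it fits in a dozen lines -
not because it's a cipher anyone should still use.  The whole
megabyte is determined by a 16-byte key, so its information content
is tiny, yet zlib's best effort comes out slightly larger than the
input.

    import os
    import zlib

    def rc4_keystream(key, n):
        # Key-scheduling algorithm (KSA)
        S = list(range(256))
        j = 0
        for i in range(256):
            j = (j + S[i] + key[i % len(key)]) & 0xFF
            S[i], S[j] = S[j], S[i]
        # Pseudo-random generation algorithm (PRGA)
        out = bytearray()
        i = j = 0
        for _ in range(n):
            i = (i + 1) & 0xFF
            j = (j + S[i]) & 0xFF
            S[i], S[j] = S[j], S[i]
            out.append(S[(S[i] + S[j]) & 0xFF])
        return bytes(out)

    key = os.urandom(16)                 # the only real information here
    # all-0 plaintext XORed with the keystream is just the keystream itself
    ciphertext = rc4_keystream(key, 1 << 20)
    print(len(ciphertext), len(zlib.compress(ciphertext, 9)))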
Well, yeah; "is human-parsable" is a form of redundancy, but one
that is almost impossible for programs to take advantage of -
Actually, WinRK currently uses a dictionary to achieve the very best
Calgary Corpus score, so it is most definitely exploitable.
Using a dictionary helps with *some* "is human-parsable" redundancy.
It's certainly not enough for all of it. Consider
Yours with every time an postmark hasn't empty of two kin
beside a whose boot-print with. Besides talk pathological we
of chapter. ...
or perhaps
That's only because the encoder isn't smart enough to take
advantage of the redundancy that's there. When I say
redundancy is present, I am speaking from a cook's pantry;
buying pre-ground pepper at the corner store won't be nearly as
good as grinding it yourself. But a good vinegar exists that
can compress some data blob, *even if true*, ...
The former might be recognizable as nonsense if the code knows enough
English grammar to identify parts of speech and notice the lack of a
valid parse tree.  The latter, well, recognizing that sort of
nonsense as nonsense is as AI-complete as the natural language problem.
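For what it's worth, here's a toy sketch of why a word dictionary
only goes so far.  (This is nothing like WinRK's actual model; the
tiny vocabulary below just stands in for the big English dictionary
such a coder would ship, and the "ordinary" sentence is made up for
contrast.)  Both samples encode to one small index per word, because
the coder only checks word membership; whether the words hang
together grammatically never enters into it.

    # ordinary sentence is invented here purely for comparison
    ordinary = "beside a boot-print with an empty postmark of yours"
    nonsense = ("Yours with every time an postmark hasn't empty of two kin "
                "beside a whose boot-print with.")

    def tokens(text):
        return [w.strip(".,").lower() for w in text.split()]

    # stands in for the large English dictionary a real coder would ship
    vocabulary = sorted(set(tokens(ordinary)) | set(tokens(nonsense)))
    index = {word: i for i, word in enumerate(vocabulary)}

    def encode(text):
        # one small integer per dictionary word; grammar never enters into it
        return [index[w] for w in tokens(text)]

    print(encode(ordinary))
    print(encode(nonsense))   # encodes just as smoothly -- the gibberish is invisible here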
/~\ The ASCII der Mouse
\ / Ribbon Campaign
X Against HTML mouse at rodents.montreal.qc.ca
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B