Don't suppose anyone's come across anything
that'll attempt to fix a
corrupt .Z (Unix compress) file, have they?
Not me, but it may be doable manually, depending. What kind of error
do you have? Wrong data, dropout, insertion, you don't know, what?
I've got a 40MB compressed tar archive here, but
uncompress barfs
after the first 23MB - it'd be nice if there was a way of skipping
over the duff bits if possible and reading *something* from the last
17MB!
In theory it may be possible. compress (.Z) uses Lempel-Ziv-Welch. An
insertion or dropout is relatively hard to fix, in large part because
it means you have trouble telling where compressed values' boundaries
are. Wrong data is comparatively easy; it will give you a corrupt
decoding table (or an outright error if the coded value is out of range
and the decoder bothers to check), but if the decompressed data is
highly redundant it's often fixable.
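To make the table mechanics concrete, here is a minimal toy LZW codec
in Python. It works on whole integer codes, not the real variable-width
9-to-16-bit .Z bitstream, so it is a sketch of the algorithm rather
than of compress itself. It shows how one wrong code value garbles the
decompression table and the output without stopping the decode:

```python
# Toy LZW sketch (NOT the real .Z format: compress uses variable-width
# 9..16-bit codes plus a clear code; here codes are plain Python ints,
# which is enough to show the table mechanics).

def lzw_encode(data):
    table = {bytes([i]): i for i in range(256)}
    nxt, w, codes = 256, b"", []
    for byte in data:
        wb = w + bytes([byte])
        if wb in table:
            w = wb
        else:
            codes.append(table[w])
            table[wb] = nxt          # encoder grows its table...
            nxt += 1
            w = bytes([byte])
    if w:
        codes.append(table[w])
    return codes

def lzw_decode(codes):
    table = {i: bytes([i]) for i in range(256)}
    nxt, prev, out = 256, None, bytearray()
    for c in codes:
        if c in table:
            entry = table[c]
        elif c == nxt and prev is not None:
            entry = prev + prev[:1]  # the classic KwKwK special case
        else:
            raise ValueError("code out of range")  # corruption detected
        out += entry
        if prev is not None:
            table[nxt] = prev + entry[:1]  # ...and the decoder mirrors it
            nxt += 1
        prev = entry
    return bytes(out)

data = b"the cat sat on the mat, the cat sat on the mat"
codes = lzw_encode(data)
assert lzw_decode(codes) == data

# Replace one code with a different in-range value: decoding still
# runs, but the output (and one table entry) is wrong from then on.
bad = list(codes)
bad[3] = 65
garbled = lzw_decode(bad)
assert garbled[:3] == b"the" and garbled != data
```

Note that the wrong code poisons exactly one table entry; everything
else decodes correctly except the spans built from that entry, which is
what makes the manual repair below feasible.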
To fix an insertion, I'd basically just try deleting various numbers
of bits or bytes (the probable unit of insertion depends on the
medium) until I got sane decoded output.
To fix a dropout, I'd try inserting varying numbers of 0 bits (or
bytes) until I started getting garbled (instead of totally garbage)
decoded output, then treat it as corrupted data.
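Both searches can be mechanized once you have a crude "does this look
sane" test; for a text-heavy archive, the fraction of printable bytes
is a reasonable first cut. Here is a standalone sketch of the
deletion search over toy integer codes (the decoder, the scoring
metric, and the function names are all my own assumptions, not an
existing tool); for real .Z data you would delete bits from the
bitstream instead:

```python
import string

def lzw_decode(codes):
    # Compact LZW decoder over integer codes (a toy stand-in for the
    # real variable-width .Z bitstream).
    table = {i: bytes([i]) for i in range(256)}
    nxt, prev, out = 256, None, bytearray()
    for c in codes:
        if c in table:
            entry = table[c]
        elif c == nxt and prev is not None:
            entry = prev + prev[:1]
        else:
            raise ValueError("code out of range")
        out += entry
        if prev is not None:
            table[nxt] = prev + entry[:1]
            nxt += 1
        prev = entry
    return bytes(out)

def lzw_encode(data):
    table = {bytes([i]): i for i in range(256)}
    nxt, w, codes = 256, b"", []
    for byte in data:
        wb = w + bytes([byte])
        if wb in table:
            w = wb
        else:
            codes.append(table[w])
            table[wb] = nxt
            nxt += 1
            w = bytes([byte])
    if w:
        codes.append(table[w])
    return codes

PRINTABLE = set(string.printable.encode())

def sanity(data):
    # Crude redundancy test: fraction of printable bytes.  For tar
    # headers you would check magic fields and checksums instead.
    return sum(b in PRINTABLE for b in data) / max(len(data), 1)

def repair_by_deletion(codes):
    # Try deleting each single position; keep the decodes that score
    # best.  A human still picks among ties by what makes sense.
    best, candidates = -1.0, []
    for i in range(len(codes)):
        try:
            out = lzw_decode(codes[:i] + codes[i + 1:])
        except ValueError:
            continue
        s = sanity(out)
        if s > best:
            best, candidates = s, [out]
        elif s == best:
            candidates.append(out)
    return candidates

data = b"the quick brown fox jumps over the lazy dog, twice over"
codes = lzw_encode(data)
corrupt = codes[:10] + [300] + codes[10:]   # simulate a one-code insertion
assert data in repair_by_deletion(corrupt)
```

Deleting anywhere but the true position leaves the bogus code in
place, where it is out of range and the attempt is discarded; that is
why the search converges quickly here.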
To fix corrupted data, I'd use a decoder that lets you look at the
decompression table. (This description assumes you know LZW. If you
don't, and you can't find a good description easily, write me off-list;
it's a relatively simple algorithm.) When you get the first garbled
output, look at what decompression table entry the bogus bits came
from. Then look back to where that entry was created and see what
you'd have to change to make it come out right. Repeat until you get a
clean decompression. (If you get a code-out-of-range error, you know
which compressed code is broken; try the different possible valid codes
until you get something that makes sense locally, and of the few that
do, look to see which produce clean data following.)
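A decoder that "lets you look at the decompression table" is easy to
instrument, at least in toy form: record, for every table entry, the
input position at which it was created, and for every output span,
which code produced it. The first garbled bytes then point straight
back at the code that built the bad entry. The names and trace format
below are my own sketch, not any existing tool:

```python
def lzw_decode_traced(codes):
    # Returns (output, trace).  trace[k] = (input_index, code,
    # entry_created_at): which code produced output span k, and the
    # input position at which that table entry was created (None for
    # the 256 built-in single-byte entries).
    table = {i: bytes([i]) for i in range(256)}
    created = {i: None for i in range(256)}   # entry -> input index
    nxt, prev, out, trace = 256, None, bytearray(), []
    for i, c in enumerate(codes):
        if c in table:
            entry = table[c]
        elif c == nxt and prev is not None:
            entry = prev + prev[:1]           # KwKwK case: born right here
            created[c] = i
        else:
            raise ValueError(f"code {c} out of range at input index {i}")
        trace.append((i, c, created.get(c)))
        out += entry
        if prev is not None:
            table[nxt] = prev + entry[:1]     # entry nxt born at index i
            created[nxt] = i
            nxt += 1
        prev = entry
    return bytes(out), trace

# [97, 98, 256, 256] is the code sequence for b"ababab": entry 256
# ("ab") is created while decoding input index 1, so if spans decoded
# from code 256 look garbled, index 1 is where to go fix things.
out, trace = lzw_decode_traced([97, 98, 256, 256])
assert out == b"ababab"
assert trace[2] == (2, 256, 1)
```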
There is another possibility: ignore the garbling and keep
decompressing. If you're lucky, the encoder will emit a clear code
soon and you'll start over with a clean table, at which point you will
suddenly start getting non-garbled decompression. (If you're unlucky,
the encoder will hold off on the clear code.)
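In compress's block mode, code 256 is the reserved clear code and
dynamic entries start at 257. The following toy demonstration (again
integer codes, not the real bitstream, and the encoder emits its one
clear code by construction rather than on table pressure as the real
compress does) shows why everything after a clear code survives
corruption before it:

```python
CLEAR = 256   # in block-mode compress, code 256 resets the table

def lzw_encode(data):
    table = {bytes([i]): i for i in range(256)}
    nxt, w, codes = 257, b"", []   # dynamic codes start at 257
    for byte in data:
        wb = w + bytes([byte])
        if wb in table:
            w = wb
        else:
            codes.append(table[w])
            table[wb] = nxt
            nxt += 1
            w = bytes([byte])
    if w:
        codes.append(table[w])
    return codes

def lzw_decode(codes):
    table = {i: bytes([i]) for i in range(256)}
    nxt, prev, out = 257, None, bytearray()
    for c in codes:
        if c == CLEAR:             # start over with a fresh table
            table = {i: bytes([i]) for i in range(256)}
            nxt, prev = 257, None
            continue
        if c in table:
            entry = table[c]
        elif c == nxt and prev is not None:
            entry = prev + prev[:1]
        else:
            raise ValueError("code out of range")
        out += entry
        if prev is not None:
            table[nxt] = prev + entry[:1]
            nxt += 1
        prev = entry
    return bytes(out)

part1 = b"some heavily garbled stretch of the archive, "
part2 = b"followed by a clean stretch after the clear code"
codes = lzw_encode(part1) + [CLEAR] + lzw_encode(part2)
assert lzw_decode(codes) == part1 + part2

# Corrupt a code before the clear: everything up to the clear code is
# suspect, but the output after it is decoded with a fresh table and
# comes out intact.
bad = list(codes)
bad[2] = 65 if bad[2] != 65 else 66
garbled = lzw_decode(bad)
assert garbled != part1 + part2 and garbled.endswith(part2)
```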
Of course, all this depends on the decompressed data being relatively
nonrandom (so you can tell when you have garbled data). tar headers
certainly qualify; depending on what files the archive contains, their
contents may or may not. Running text in a natural language, and
programs in most computer languages, qualify as well.
/~\ The ASCII der Mouse
\ / Ribbon Campaign
X Against HTML mouse at rodents.montreal.qc.ca
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B