On Wednesday 11 August 2004 09:50, Jules Richardson wrote:
> On Wed, 2004-08-11 at 13:13, Steve Thatcher wrote:
> > I would encode binary data as hex to keep everything ASCII. Data
> > size would expand, but the data would also be compressible, so
> > things could be kept in ZIP files or whatever format a person
> > preferred for their archiving purposes.
>
> "could be kept" in zip files, yes - but then that's no use in 50
> years' time if someone stumbles across a compressed file and has no
> idea how to decompress it in order to read it and see what it is :-)
UNIX's compress format has been around for decades now... have its
patents expired yet? If not, there's always gzip...
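And if anyone ever does need to script it, gzip is trivial to drive -
a quick round-trip sketch in Python, with the file names made up:

  import gzip
  import shutil

  # compress an archive file (names are just examples)
  with open("disk-image.arc", "rb") as src, \
       gzip.open("disk-image.arc.gz", "wb") as dst:
      shutil.copyfileobj(src, dst)

  # ...and decompress it again to check the round trip
  with gzip.open("disk-image.arc.gz", "rb") as src, \
       open("disk-image.check", "wb") as dst:
      shutil.copyfileobj(src, dst)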
> Hence keeping the archive small would seem sensible, so that it can
> be left as-is without any compression. My wild guesstimate on archive
> size would be to aim for 110-120% of the raw data size if possible...
You could uuencode the data... For long-term archiving it might be
advisable to have plain-text paper copies of the data (hoping in 50+
years they'll have decent OCR technology :). So, you *need* something
that only uses some sane subset of characters, such as what uuencode
or Base64 encoding gives you. Uuencode is a bit older, so I'd tend to
lean towards that over Base64.
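To give a feel for what those encodings look like in practice, here's
a rough sketch using Python's standard base64/binascii modules (the
track00.bin file name is just an example):

  import base64
  import binascii

  raw = open("track00.bin", "rb").read()   # hypothetical raw track dump

  # Base64: 3 raw bytes -> 4 printable characters (~33% overhead)
  b64 = base64.b64encode(raw)

  # uuencode-style body: at most 45 raw bytes per output line
  uu_lines = [binascii.b2a_uu(raw[i:i + 45]) for i in range(0, len(raw), 45)]

  print(len(raw), len(b64), sum(len(line) for line in uu_lines))

Either way you end up with nothing but printable ASCII, which is the
point.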
> > XML has become a storage format choice for a lot of different
> > commercial packages. My knowledge is more based on the Windows
> > world, but I doubt that other software houses are avoiding XML.
> > Sun/Java certainly embraces it.
> My background (in XML terms) is with Java - but I've not come across
> software that mixes a human-readable format with a large amount of
> binary data (whether encoded or not). Typically the metadata's kept
> separate from the binary data itself, either in parallel files (not
> suitable in our case) or as separate sections within the same file.
I'm personally prejudiced against XML, but that's just me. : )
> > I don't quite understand why representing binary as hex would
> > affect the ability to have command-line utilities.
> See my posting the other week when I was trying to convert
> ASCII-based hex data back into binary on a Unix platform :-) There's
> no *standard* utility to do it (which means there certainly isn't on
> Windows). If the data within the file is raw binary, then it's just a
> case of using dd to extract it, even if there's no high-level utility
> available to do it.
You could decode it by hand (ick) or write a Q&D program to do it for
you. I'd hope that *programming* won't be a lost art in 50+ years.
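Something like this would do as the Q&D program - a minimal sketch in
Python, assuming the data really is just hex digits plus whitespace
(the file names on the command line are whatever you like):

  import sys

  # usage: unhex.py input.hex output.bin
  with open(sys.argv[1], "r") as src, open(sys.argv[2], "wb") as dst:
      digits = "".join(src.read().split())   # drop spaces and newlines
      dst.write(bytes.fromhex(digits))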
> > Certainly more CPU cycles are needed for conversion, and the image
> > file size is larger, but we need a readable format
> I'm not quite sure what representing the original disk data as hex
> gives you over having the raw binary data itself - all it seems to do
> is make the resultant file bigger and add an extra conversion step to
> the decode process.
Again, producing paper copies of stuff with non-printable characters
becomes "problematic".
> > and I would think that CPU cycles are not as much of a concern, nor
> > is file size.
> In terms of CPU cycles, for me, no - I can't see myself ever using
> the archive format except on modern hardware. I can't speak for
> others on the list though.
>
> As for file size, encoding as hex at least doubles the size of your
> archive file compared to the original media (whatever it may be).
> That's assuming no padding between hex characters. Seems like a big
> waste to me :-(
Then use uuencode or something similar that does a less wasteful
conversion. Anyway, the only computer media type where KB/in^2 (or
KB/in^3) isn't increasing rapidly is paper.
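Back-of-the-envelope numbers, taking a hypothetical 360KB floppy image
as the raw data:

  raw = 360 * 1024                   # hypothetical 360KB image, in bytes

  hex_size = raw * 2                 # 1 byte -> 2 hex characters
  b64_size = (raw + 2) // 3 * 4      # 3 bytes -> 4 characters
  uu_size  = (raw + 44) // 45 * 62   # 45 bytes -> 62-character line

  for name, size in (("hex", hex_size), ("base64", b64_size), ("uuencode", uu_size)):
      print(name, round(100.0 * (size - raw) / raw), "% overhead")

That works out to roughly 100% overhead for hex against 33-38% for
Base64/uuencode, which is what I mean by "less wasteful".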
> > The only difference I see in the sections that were described is
> > that the first one encompasses the format info and the data. My
> > description had the first one as being a big block that contained
> > the two other sections, as well as length and CRC info to verify
> > data consistency. Adding author, etc. to the big block would make
> > perfect sense.
> Yep, I'm with you there. CRCs are a nice idea. Question: does it make
> sense to make CRC info a compulsory section in the archive file? Does
> it make sense to have it always present, given that it's *likely*
> that these archive files will only ever be transferred from place to
> place using modern hardware? I'm not sure. If you're spitting data
> across a buggy serial link, then the CRC info is nice to have - but
> maybe it should be an optional inclusion rather than mandatory, so
> that in a lot of cases archive size can be kept down? (and the
> assumption made that there exists source code / spec for a utility to
> add CRC info to an existing archive file if desired)
Yes. A CRC is *always* a good idea. Or, you could do an ECC even ;)
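It's cheap, too - CRC32 is four bytes per block checked. A rough
per-sector sketch with Python's zlib (the 512-byte sector size and the
disk.img name are just assumptions for illustration):

  import zlib

  SECTOR = 512                             # assumed sector size

  image = open("disk.img", "rb").read()    # hypothetical raw image

  # one CRC32 per sector; 4 bytes of check data per 512 bytes stored
  crcs = [zlib.crc32(image[i:i + SECTOR]) for i in range(0, len(image), SECTOR)]

  for n, crc in enumerate(crcs):
      print("sector %4d  CRC32 %08X" % (n, crc))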
I don't really understand why you're quite so concerned about archive
size bloat, at least over things like CRCs (which, applied liberally,
might add 4% or so to the size) or plain-text encoding (which would
add between 33% and roughly 100% to the size). I'd rather give up some
efficiency in this case for ensuring that the data is stored
correctly, and can be properly read (and easily decoded) in 50+ years.
Pat
--
Purdue University ITAP/RCS --- http://www.itap.purdue.edu/rcs/
The Computer Refuge --- http://computer-refuge.org