On Wed, 11 Aug 2004, Jules Richardson wrote:

> > I would encode binary data as hex to keep everything ASCII. Data size
> > would expand, but the data would also be compressible, so things could
> > be kept in ZIP files or whatever a person would want for their
> > archiving purposes.
>
> "could be kept" in zip files, yes - but then that's no use in 50 years
> time if someone stumbles across a compressed file and has no idea how to
> decompress it in order to read it and see what it is :-)
>
> Hence keeping the archive small would seem sensible so that it can be
> left as-is without any compression. My wild guesstimate on archive size
> would be to aim for 110-120% of the raw data size if possible...
I agree, but we'd definitely have to include compression features if we
are to meet this goal. Using a floppy disk as an example, a worst-case
scenario is that the image would be maybe 205% of the size of the original
media (200% because you are now using two bytes to store one, and roughly
5% for all the markup tags).

Keeping the archive small should be a major goal, since that would
encourage people to keep the images stored uncompressed. Hard drives are
getting larger and all that, and my guess is that at some point this issue
will be moot, but we can't know that for certain, so we should always
assume a worst-case scenario (i.e. pessimism will be useful when designing
this specification :)
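To put rough numbers on that overhead, here is a minimal sketch (in Python, using made-up sector contents - any real disk image would differ) of how hex encoding doubles the raw size while the resulting ASCII text remains highly compressible:

```python
import zlib

# Made-up track contents for illustration only.
raw = bytes(range(256)) * 18          # 4608 bytes of sample data

hex_text = raw.hex()                  # two ASCII characters per byte
packed = zlib.compress(hex_text.encode("ascii"))

print(len(hex_text) / len(raw))       # 2.0 -- the "200%" figure
print(len(packed) < len(raw))         # True: the hex text compresses well
```

The exact compression ratio depends on the data, but hex text drawn from a 16-character alphabet is exactly the sort of input DEFLATE-style compressors handle well.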
> > Certainly more cpu cycles are needed for conversion and image file
> > size is larger, but we need a readable format
>
> But the data describing all aspects of the disk image would be readable
> by a human; it's only the raw data itself that wouldn't be - for both
> efficiency and for ease of use.

This is a point that needs to be highlighted. These images are meant to
be human readable, first and foremost. Machine readable is a secondary
concern. We know there will definitely be humans in the future (and if
not, then who cares about this anyway). There will probably be machines.
Said machines may not be useful to the task of decoding these images, so
the format must be designed with human readability in mind.

> The driving force for having human-readable data in the archive is so
> that it can be reconstructed at a later date, possibly without any
> reference to any spec, is it not?

Indeed.

> If it was guaranteed that a spec was *always* going to be available,
> having human-readable data at all wouldn't make much sense as it just
> introduces bloat; a pure binary format would be better.

Correct. So even if the spec was lost, people (who could read English at
least) would be able to figure out how to reconstruct the image from the
archive.
> I'm not quite sure what having binary data represented as hex for the
> original disk data gives you over having the raw binary data itself -
> all it seems to do is make the resultant file bigger and add an extra
> conversion step into the decode process.

But it also makes it human readable, and readable in any standard text
editor. Mixing binary data in with human-readable data in a format that's
meant, first and foremost, to be human readable is antithetical to the
idea.
> As for file size, encoding as hex at least doubles the size of your
> archive file compared to the original media (whatever it may be).
> That's assuming no padding between hex characters. Seems like a big
> waste to me :-(

Nope. Not a waste. Essential.
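As a concrete illustration of that trade-off (the sector bytes below are hypothetical, not from any real disk): the hex form is pure ASCII that any text editor can display, and the "extra conversion step" back to raw binary is a single, losslessly reversible operation.

```python
sector = b"\x01\x04\x00\x10HELLO"      # hypothetical sector fragment

encoded = sector.hex()                 # "0104001048454c4c4f" -- plain ASCII
assert all(c in "0123456789abcdef" for c in encoded)

decoded = bytes.fromhex(encoded)       # the extra decode step Jules mentions
assert decoded == sector               # the round trip is lossless
```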
> Yep, I'm with you there. CRC's are a nice idea. Question: does it make
> sense to make CRC info a compulsory section in the archive file?

Yes. It's only one or two added bytes at the end of each data segment.

> Does it make sense to have it always present, given that it's *likely*
> that these archive files will only ever be transferred from place to
> place using modern hardware? I'm not sure. If you're spitting data
> across a buggy serial link, then the CRC info is nice to have - but
> maybe it should be an optional inclusion rather than mandatory, so that
> in a lot of cases archive size can be kept down? (and the assumption
> made that there exists source code / spec for a utility to add CRC info
> to an existing archive file if desired)

It doesn't hurt. It only adds negligible overhead. Certainly something
to discuss more.
I would make the specification unassuming about anything like this. For
example, say there is an optional CRC feature. I would make the default
for the image be that no CRC is added to the data segments unless a meta
tag is included in the header explicitly specifying that CRCs are added.
This makes it ever so slightly easier for someone who knows nothing of
the spec to decode the image data. No assumptions are made regarding what
people in the future will know about these images.
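That default-off, explicitly-tagged behaviour could look something like this sketch. The tag name, the hex layout, and the choice of CRC-32 are all assumptions for illustration, not part of any actual spec (a shorter CRC-16 would match the "one or two added bytes" figure above):

```python
import zlib

def encode_segment(data: bytes, header: dict) -> str:
    """Hex-encode one data segment; append a CRC only if the header's
    (hypothetical) "crc" meta tag explicitly asks for one."""
    text = data.hex()
    if header.get("crc") == "crc32":
        text += format(zlib.crc32(data), "08x")  # 4 bytes -> 8 hex chars
    return text

plain = encode_segment(b"\x00\xff\x10", header={})
tagged = encode_segment(b"\x00\xff\x10", header={"crc": "crc32"})

print(plain)                      # "00ff10" -- no CRC unless the tag says so
print(len(tagged) - len(plain))   # 8 extra hex characters for the CRC
```

A decoder that has never seen the spec can still recover the data from the untagged form, which is exactly the point of making no CRC the default.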
--
Sellam Ismail                                     Vintage Computer Festival
------------------------------------------------------------------------------
International Man of Intrigue and Danger              http://www.vintage.org

[ Old computing resources for business || Buy/Sell/Trade Vintage Computers ]
[ and academia at www.VintageTech.com  || at http://marketplace.vintage.org ]