Let's develop an open-source media archive standard

11 Aug 2004

On Wed, 2004-08-11 at 18:15, Vintage Computer Festival wrote:
...
> On Wed, 11 Aug 2004, Jules Richardson wrote:
...
>> I'm not quite sure what having binary data represented as hex for the
>> original disk data gives you over having the raw binary data itself -
>> all it seems to do is make the resultant file bigger and add an extra
>> conversion step into the decode process.
>
> But it also makes it human readable, and readable in any standard text
> editor.  Mixing binary data in with human readable data in a format that's
> meant, first and foremost, to be human readable is antithetical to the
> idea.

Ahh, I was talking about having the binary data as a separate section in
the archive from the human-readable part, for precisely that reason.
So, what difference does it make to a human analyst whether the data is
stored as hex pairs or binary data? Both need decoding by some process
to make them usable. A human eye is no better off viewing a stream of
hex digits than it is viewing a stream of arbitrary ASCII data.
Actually, binary data could possibly be more useful to the human eye in
a browsing scenario, as at least the eye can quickly pick out meaningful
strings - such as filenames on the original media - from a sea of binary
data, without needing to do any decoding. At least viewing a file
containing binary data could give a clue as to what it contained if the
archive metadata (eg. description) wasn't up to much.
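
To make the size and readability trade-off concrete, here is a minimal Python sketch. The 32-byte `raw` fragment and the `printable_runs` helper are made up for illustration; nothing here is part of any proposed spec. It shows that hex encoding doubles the size, and that a filename is visible in the raw bytes but not in the hex text.

```python
import binascii
import re

# Hypothetical 32-byte fragment of a sector: a filename surrounded by
# non-text bytes (illustrative data only, not from a real disk).
raw = b"\x00\x01\xfe" + b"README  TXT" + b"\x00\x10\xff\xe5" + b"\x00" * 14

# The same data stored as hex pairs: two text characters per byte.
hex_text = binascii.hexlify(raw).decode("ascii")


def printable_runs(data, min_len=4):
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = ("[\\x20-\\x7e]{%d,}" % min_len).encode("ascii")
    return [m.group().decode("ascii") for m in re.finditer(pattern, data)]


print(len(raw), len(hex_text))   # byte count vs. hex character count
print(printable_runs(raw))       # strings an eye could pick out of the raw data
print("README" in hex_text)      # the filename is not visible in the hex form
```

Either form still round-trips: `binascii.unhexlify(hex_text)` gives back the original bytes, so the choice is about size and browsability rather than fidelity.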
As raised earlier though, I do wonder if it's an idea to define several
possible encoding methods as part of the spec. Maximum flexibility
always seems the key to long-lived data formats, so it perhaps makes a
lot of sense to do so anyway. Who's to tell what use such archives might
be put to in the future - but if the spec covers a reasonable base for
now (with extensibility in mind such that others can be added if needs be
in future versions) then everyone's happy, and future generations can
always convert between formats as they see fit.
Oh, next random thought (which I expect someone's already raised) - the
addressing format for the data on the original media needs to be
flexible enough to cope with different classes of data. Or rather, I'd
expect different addressing classes. For hard disk and floppy archives,
head/sector/track seem a logical addressing scheme. But for say a ROM
image, there's no concept of head/sector/track; maybe just an index to
the data and a length. Maybe someone will want to add scans of
documentation pages to an archive, in which case chapter / page
addressing is logical.
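
As a rough illustration of what separate addressing classes might look like in code, here is a minimal Python sketch. The class names (`DiskAddress`, `RomAddress`, `PageAddress`) and their fields are assumptions chosen to mirror the three examples above, not anything defined by a spec.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class DiskAddress:
    """Hypothetical address for hard disk and floppy archives."""
    head: int
    track: int
    sector: int


@dataclass(frozen=True)
class RomAddress:
    """Hypothetical address for a ROM image: just an index and a length."""
    index: int
    length: int


@dataclass(frozen=True)
class PageAddress:
    """Hypothetical address for scanned documentation pages."""
    chapter: int
    page: int


Address = Union[DiskAddress, RomAddress, PageAddress]


def describe(addr: Address) -> str:
    """Render an address as a short human-readable label."""
    if isinstance(addr, DiskAddress):
        return f"head {addr.head}, track {addr.track}, sector {addr.sector}"
    if isinstance(addr, RomAddress):
        return f"index {addr.index}, length {addr.length}"
    if isinstance(addr, PageAddress):
        return f"chapter {addr.chapter}, page {addr.page}"
    raise TypeError(f"unknown address class: {type(addr).__name__}")


print(describe(DiskAddress(head=0, track=0, sector=1)))
print(describe(RomAddress(index=0, length=8192)))
```

Keeping each class separate means a new kind of media only needs a new class, without touching the existing ones.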
Personally I wouldn't worry too much about what types of information
people might want to store in such an archive at this stage - just be
aware that the flexibility needs to be there just in case :-)
E.g. something like (using vague XML just to get the idea across):
<config>
  <metadata>
    <version>1.00</version>
    <description>blah blah blah</description>
    ...
  </metadata>
  <definition>
    <dataformat>xyz</dataformat>
    <compression>none</compression>
    ...
    <index type="headtracksector">
      <block id="0" crc="xyz" head="0" track="0" sector="0" offset="0"/>
      <block id="1" crc="xyz" head="0" track="0" sector="1" offset="256"/>
      ...
      <block id="12345" crc="xyz" head="15" track="31" sector="17" offset="7654321"/>
    </index>
  </definition>
</config>
...............
{bunch of data in format xxxx}
...............
Here the index element's 'type' field gives the type of addressing used,
in this case using data blocks split by head / track / sector. An
appropriate handler for that type would then take over and actually
process the data within the index element accordingly.
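
The "appropriate handler takes over" idea could be sketched roughly as below, in Python with the standard `xml.etree.ElementTree` parser. The `HANDLERS` table, the function names and the cut-down `INDEX_XML` sample are all assumptions for illustration; the sample uses self-closing block elements so that it parses as real XML.

```python
import xml.etree.ElementTree as ET

# Cut-down, well-formed version of the index element (illustrative only).
INDEX_XML = """
<index type="headtracksector">
  <block id="0" crc="xyz" head="0" track="0" sector="0" offset="0"/>
  <block id="1" crc="xyz" head="0" track="0" sector="1" offset="256"/>
</index>
"""


def handle_headtracksector(index):
    """Turn each block element into a dict of integer address fields."""
    fields = ("id", "head", "track", "sector", "offset")
    return [
        {name: int(block.get(name)) for name in fields}
        for block in index.findall("block")
    ]


# Hypothetical registry: one handler per addressing type named in the index.
HANDLERS = {"headtracksector": handle_headtracksector}


def process_index(xml_text):
    """Parse an index element and hand it to the handler for its type."""
    index = ET.fromstring(xml_text.strip())
    kind = index.get("type")
    if kind not in HANDLERS:
        raise ValueError(f"no handler for index type {kind!r}")
    return HANDLERS[kind](index)


for entry in process_index(INDEX_XML):
    print(entry)
```

Supporting another addressing scheme would then mean registering one more function in the table, leaving the parsing code alone.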
I used the idea of the media data following the config section here for
clarity, but mixing media data with the semantic data in the index
section would work too (although I'm starting to realise that doing that
is going to make things about as human-readable as a typical
large HTML file :-)
I'd say that important field values like the compression type should
always be human-readable rather than a numeric id, just rigidly defined
by the spec (e.g. 'none', 'base64', 'uuencode' etc.). That makes life a
lot easier for someone potentially looking at this in the future if they
don't happen to have a copy of the spec handy!
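
A small Python sketch of how named, rigidly defined values might map onto decoders is shown below. The `DECODERS` table and `decode_payload` function are hypothetical. The 'hex' entry is an assumption based on the hex-pair idea discussed earlier in this thread, and 'uuencode' is left out only to keep the sketch short.

```python
import base64
import binascii

# Hypothetical table: the name in the archive is the key, so a reader
# without the spec can still guess what each value means.
DECODERS = {
    "none": lambda data: data,
    "base64": base64.b64decode,
    "hex": binascii.unhexlify,
}


def decode_payload(encoding, data):
    """Decode a payload according to the encoding name given in the archive."""
    try:
        decoder = DECODERS[encoding]
    except KeyError:
        raise ValueError(f"unsupported encoding name: {encoding!r}") from None
    return decoder(data)


print(decode_payload("none", b"hello"))
print(decode_payload("base64", b"aGVsbG8="))
print(decode_payload("hex", b"68656c6c6f"))
```

An unknown name fails loudly with the name in the error message, which is much easier to diagnose than an unrecognised numeric id.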
Hmm, I miss the old days of everyone chucking ideas around like this :-)
cheers
Jules
