As coincidence would have it, I work at Indiana University's Digital
Library Program, and there was a lecture on archiving audio which hit
many of the same issues that have come up here. The conclusions they
came up with for the project included:
* There's no such thing as an eternal medium: the data must be
transportable to the latest generation of storage
* Metadata should be bundled with the content
* Act like you get one chance to read the media :(
While this is a different context, the principles are basically the same.
I've got a pile of TK50 tapes I'm backing up using the SIMH tape format,
so this is relevant to that process as well.
I think the optimum format for doing this isn't a single file, but a
collection of files bundled into a single package. Someone mentioned
tar, I think, and zip would work just as well. The container could
contain these components (a rough sketch of assembling such a package
follows the list):
* content metadata - info from the disk's label/sleeve, etc
* media metadata - the type of media this came from
* archivist metadata - who did it, methods used, notes, etc
* badblock information - which of the zero-filled blocks in the
content are actually bad reads.
* content - a bytestream of the data
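A minimal sketch of assembling such a package, using Python's standard
zipfile module (all the file names here are invented for illustration;
nothing about the layout is standardized):

    import zipfile

    # The five components described above. Every name below is
    # illustrative only - no agreed-on layout exists.
    components = [
        "content-metadata.txt",    # label/sleeve info
        "media-metadata.txt",      # media type plus any overrides
        "archivist-metadata.txt",  # who dumped it, methods, notes
        "badblocks.txt",           # zero-filled blocks that were bad reads
        "content.img",             # the raw bytestream
    ]

    # Bundle everything into one package. Zip stores a CRC per member,
    # which gets checked on extraction - a nice bonus for archival use.
    with zipfile.ZipFile("archive.zip", "w", zipfile.ZIP_DEFLATED) as pkg:
        for name in components:
            pkg.write(name)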
I don't think there's any real need to document the physical properties
of the media for EVERY disk archived -- there should probably be a
repository of 'standard' media types (the 1541's variable
sectors-per-track info, per-track FM vs MFM information, etc.) plus
overrides in the media metadata (uses fat tracks, is 40-track vs 35,
etc.).
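Purely as an illustration (the element names are invented; no such
schema actually exists), a media metadata entry might name a standard
type and list only the deviations:

    <media standard-type="c64-1541">
        <!-- overrides to the standard 1541 profile -->
        <override tracks="40"/>       <!-- the profile assumes 35 -->
        <override fat-tracks="yes"/>
    </media>

Anything not overridden would be taken from the standard type's entry
in the repository.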
Emulators could use the content part of the file as-is, and collectors
would have enough information to recreate the original media. It would
also make cataloguing fairly easy.
Brian
On Tue, 2005-05-17 at 21:04 +0000, Jules Richardson wrote:
On Tue, 2005-05-17 at 15:07 -0500, Randy McLaughlin wrote:
I like and prefer media images as straight data dumps, but I want the
formatting information of the original media somewhere. I even want
data from media that is incomplete or has errors, also documented.
Yep, me too. From when we were bashing ideas around a few months back,
though, it seems that's a minority viewpoint; most people want the data
embedded in the metadata.
For hard drive images I zero-pad any bad data but also include metadata
in a separate file - including disk geometry, which blocks are bad, the
checksum of the resulting dump, a timestamp, etc., along with anything
else that might be particularly useful. For floppy images things would
be significantly more complex though (due to the factors mentioned
above - varying sectors per track, different encodings on different
tracks, etc.).
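Purely for illustration - the field names below are invented, not an
actual format - such a sidecar file might look like:

    # sidecar metadata for a hard drive image
    geometry:  <cylinders / heads / sectors per track / bytes per sector>
    badblocks: <block numbers that were zero-padded in the image>
    checksum:  <checksum of the resulting dump>
    timestamp: <when the dump was made>
    notes:     <anything else particularly useful>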
The idea behind futurekeep though was to make the metadata highly
structured, in a similar vein to HTML in that clients could handle as
much of the data as needed (e.g. someone not dealing with variable
bit rate images wouldn't need a decoder that could handle them).
Ideally it'd be human-readable too (after a fashion) - e.g. XML - so
that the data could be reconstructed into a disk image "by hand" even
if some whizzy util to do it wasn't present. (Understanding it at a
file level is obviously outside the scope.)
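Purely illustrative - the element names are invented, not an actual
futurekeep schema - a record in that spirit might look something like:

    <diskimage>
        <media type="5.25in floppy" tracks="40" sides="1"/>
        <track number="0" encoding="FM" sectors="10">
            <sector number="0" status="ok">...data in hex...</sector>
            <sector number="1" status="bad"/>  <!-- unreadable -->
        </track>
        <!-- a client that doesn't understand an element (say, variable
             bit rate data) can skip it, much as browsers skip unknown
             HTML tags -->
    </diskimage>

The data's right there in text form, so in a pinch an image could be
rebuilt by hand with nothing more than the spec and a hex editor.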
That doesn't seem *too* much to ask; basic metadata can be created for
existing images without a lot of hassle *if desired*. To me such a
format's more useful for future image creation though, particularly in
the case of less-common systems; the popular machines are likely to be
covered by their own archive formats already, and their following is
large enough that lack of data is not (yet) a problem. Rather than messing
around with proprietary image formats for those, or formats that aren't
particularly descriptive, it'd be nice to start from day 1 using
something that allows us to capture all the useful stuff that goes along
with the raw data.
cheers
Jules