All -
Thought I'd de-cloak from lurk mode long enough to canvass opinion on
archival formats (again). I'm considering this from a digital-media angle (not
paper to digital) at the moment. Considerations for paper and software are
similar, though. There are many dozens of image formats for diskettes and disk
drives, just as there are for photographic and paper images. My thought was
to avoid the problem by using a non-format. Stay with me on this...
basically persist the data in as raw a format as possible, with an
externalised, self-defining descriptor, all wrapped up in an open archive
format to form a single file. The idea is that the long-term storage
format is deliberately not the viewer format. You maintain a converter from the
storage format to a current viewer format, but you don't actually store the
data in the current viewer format. By current viewer, I mean PhotoShop,
Acrobat, etc. Here's the basic file layout:
xxxxx.zip :
    raw.bin          (a simple sequential byte copy)
    descriptor.xml   (instructions on how to carve it into sectors, raster
                      lines, etc.)
*Don't get hung up on *.zip... I know LZW is encumbered... this is just an
example.
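Just to make the wrapping concrete, here's a rough sketch in Python (the
descriptor fields and values are purely illustrative, not a proposed schema):

import zipfile

# Illustrative only -- the descriptor schema is whatever you decide to
# record at capture time, not a standard I'm proposing.
DESCRIPTOR = """<?xml version="1.0"?>
<archive>
  <source media="5.25in diskette" capture-tool="example-reader"/>
  <layout unit="sector" unit-size="512" units-per-track="9" tracks="80" sides="2"/>
</archive>
"""

def pack(archive_name, raw_path):
    # Store the raw byte copy and its descriptor side by side in one container.
    with zipfile.ZipFile(archive_name, "w", zipfile.ZIP_DEFLATED) as z:
        z.write(raw_path, arcname="raw.bin")
        z.writestr("descriptor.xml", DESCRIPTOR)

# pack("xxxxx.zip", "raw.bin")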
The actual scan output is the raw.bin. It has no formatting data at all. The
descriptor tells you everything you recorded about the file at scan time,
which helps you to interpret the data stream later. That could be raster
format, page markers, etc. This should work for a broad range of data that is
already in a digital format.
The problem I have with this approach is where a scan doesn't result in
actual digital information. Yeah, OK, it's all bytes on disk, but what I
mean is sampled data versus a byte-for-byte representation. Paper tape and
punch cards are examples of data which is unambiguous in a digital sense,
because it contains a fixed encoding stream with a self-evident data word
boundary. A card is one character per column, tape one character per linear
unit. All you have to do is store the whole thing as a byte-encoded stream,
and put the metadata in the descriptor (card 1, card 2, etc...).
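For the card case the "converter" back to something viewable is trivial. A
minimal sketch, assuming the descriptor says 80 bytes per card, one character
per column (an assumption for illustration, not a rule):

# Minimal sketch: carve raw.bin into cards, assuming the descriptor says
# each card is 80 bytes, one character per column. The card length would
# really come from descriptor.xml, not a constant.
CARD_COLUMNS = 80

def cards(raw_bytes, columns=CARD_COLUMNS):
    # Yield (card number, card image) pairs from the undifferentiated stream.
    for offset in range(0, len(raw_bytes), columns):
        yield offset // columns + 1, raw_bytes[offset:offset + columns]

with open("raw.bin", "rb") as f:
    for number, image in cards(f.read()):
        print(number, image)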
Where this falls down is, say for instance, an MFM diskette scanned with a
sampling board like a Catweasel. In this case the data returned is
basically analogous to sampled analog signals. It's clock ticks between
fluctuations. The fluctuations are binary. But their interpretation is
based on sampling rate and media rotation. If you are working blind, you can
sample at higher and higher rates and just collect more sample data. This is
less of a problem if you know something about the subject, but otherwise
discretion has to be applied to interpret what you've captured. That
worries me. I like deterministic outcomes, especially in archival work.
You could force an 8-bit boundary on the resulting data, but things like
sector headers are sometimes deliberately encoded in fluctuation sequences
that don't conform to the rest of the data encoding. That throws your
interpretation of the data off, unless you already know what the sector
header formats look like. The only way I can think of to stay with the
inside-out / no-proprietary-archival-format approach is to not transpose to
bytes at all, but leave it as sampled fluctuations... not even
bits. As with other viewer approaches, you apply the transposition only
when you attempt to view the data, not when you store it for archival
purposes. Thoughts?
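To illustrate what I mean by deferring the transposition, the conversion
would only ever run at view time, along these lines (the sample clock and
cell length are assumptions you'd pull from the descriptor, not constants):

# View-time sketch only: turn stored flux intervals (sample-clock ticks
# between fluctuations) into raw bit cells. The archive keeps the intervals;
# this interpretation can be thrown away and redone with better knowledge.
SAMPLE_CLOCK_MHZ = 28.0   # assumed capture clock, recorded in descriptor.xml
CELL_MICROSECONDS = 2.0   # assumed nominal cell length for DD media

def intervals_to_cells(intervals):
    cell_ticks = SAMPLE_CLOCK_MHZ * CELL_MICROSECONDS
    for ticks in intervals:
        # A transition roughly N cells after the last one reads as
        # (N - 1) zero cells followed by a one.
        cells = max(1, round(ticks / cell_ticks))
        yield [0] * (cells - 1) + [1]

# Intervals captured blind; interpreted only when someone wants to view them.
for bits in intervals_to_cells([56, 84, 112, 56]):
    print(bits)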
Third example... a book scanned becomes a byte stream of bitmap data
stored in the raw.bin. The descriptor then encodes the page transitions and
raster format, plus discretionary metadata. I like that much better than
PDF, TIFF and JFIF. I realise these formats are unlikely to die out, but I
prefer the idea of a common, self-documenting archival format.
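Same principle as before: the page geometry lives in the descriptor, the
pixels live in raw.bin, and reassembling a page is purely a view-time
operation. A sketch with made-up geometry:

# Sketch: pull one page of 1-bit raster data back out of raw.bin using
# geometry that would really come from descriptor.xml (numbers made up).
PAGE_WIDTH = 2480                                   # pixels per raster line
PAGE_HEIGHT = 3508                                  # raster lines per page
BYTES_PER_PAGE = (PAGE_WIDTH // 8) * PAGE_HEIGHT    # 1 bit per pixel

def page(raw_path, number):
    # Pages are stored back to back; the descriptor supplies the geometry.
    with open(raw_path, "rb") as f:
        f.seek((number - 1) * BYTES_PER_PAGE)
        return f.read(BYTES_PER_PAGE)

# first_page = page("raw.bin", 1)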
I'd be interested in hearing people's (non-flame) comments.
Regards,
Colin Eby