On 02/09/11 5:01 AM, Colin Eby wrote:
All -
Thought I'd de-cloak from lurk mode long enough to canvass opinion on
archival formats (again). I'm considering this from a digital-media angle
(not paper-to-digital) at the moment, though the considerations for paper
and software are similar. There are many dozens of image formats for
diskettes and disk drives, just as there are for photographic and paper
images. My thought was to avoid the problem by using a non-format. Stay
with me on this...
basically, persist the data in as raw a form as possible, with an
externalised, self-defining descriptor, all wrapped up in an open archive
format to form a single file. The concept is that the long-term storage
format is just that, not the viewer format. You maintain a converter from
the storage format to a current viewer format, but you don't actually store
the data in the current viewer format. By current viewer, I mean Photoshop,
Acrobat, etc. Here's the basic file layout:
xxxxx.zip:
    raw.bin         (a simple sequential byte copy)
    descriptor.xml  (instructions on how to carve it into sectors,
                     raster lines, etc.)
Don't get hung up on *.zip... I know there are patent-encumbrance worries
around compression schemes like LZW... this is just an example.
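
To make the idea concrete, here's a minimal sketch of the writer side in
Python. The descriptor schema and the scan_output.bin input are invented
for illustration only, not a proposal for the real schema:

import zipfile

with open("scan_output.bin", "rb") as f:  # hypothetical scan output
    raw_data = f.read()

descriptor = """<?xml version="1.0"?>
<descriptor>
  <source>punched-card deck</source>
  <record length="80" count="{count}"/>
  <encoding>EBCDIC</encoding>
</descriptor>
""".format(count=len(raw_data) // 80)

# ZIP_STORED sidesteps the compression question entirely.
with zipfile.ZipFile("xxxxx.zip", "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("raw.bin", raw_data)           # simple sequential byte copy
    zf.writestr("descriptor.xml", descriptor)  # how to carve it up later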
The actual scan output is the raw.bin. It has no formatting data at all.
The descriptor tells you everything you recorded about the file at scan
time, which helps you to interpret the data stream later. That could be
the raster format, page markers, etc. ... Paper tape and
punch cards are examples of data which is unambiguous in a digital sense,
because each contains a fixed encoding stream with a self-evident data word
boundary. A card is one character per column, tape one character per linear
unit. All you have to do is store the whole thing as a byte-encoded stream,
and put the metadata in the descriptor (card 1, card 2, etc...).
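
A sketch of the converter side for the card-deck case, reusing the
hypothetical descriptor above:

import xml.etree.ElementTree as ET
import zipfile

with zipfile.ZipFile("xxxxx.zip") as zf:
    raw = zf.read("raw.bin")
    desc = ET.fromstring(zf.read("descriptor.xml"))

# Carve the undifferentiated byte stream into cards using only the
# descriptor; raw.bin itself carries no framing.
reclen = int(desc.find("record").get("length"))  # 80 for a card deck
cards = [raw[i:i + reclen] for i in range(0, len(raw), reclen)]
for n, card in enumerate(cards, 1):
    print("card", n, card[:16].hex(), "...")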
...
Third example... a scanned book becomes a byte stream of bitmap data
stored in the raw.bin. The descriptor then encodes the page transitions and
raster format, plus discretionary metadata. I like that much better than
PDF, TIFF and JFIF. I realise these formats are unlikely to die out, but I
prefer the idea of a common archival format which is self-documenting.
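
For the book case, a converter sketch might look like this, assuming a
hypothetical descriptor that records a fixed 1-bit raster per page (width
a multiple of 8); PBM is used here only because it's about the simplest
viewer target going:

import xml.etree.ElementTree as ET
import zipfile

with zipfile.ZipFile("book.zip") as zf:
    raw = zf.read("raw.bin")
    desc = ET.fromstring(zf.read("descriptor.xml"))

width = int(desc.find("raster").get("width"))    # hypothetical schema
height = int(desc.find("raster").get("height"))
page_bytes = (width // 8) * height  # 1 bit/pixel, width a multiple of 8

for n in range(len(raw) // page_bytes):
    page = raw[n * page_bytes:(n + 1) * page_bytes]
    with open("page%04d.pbm" % (n + 1), "wb") as f:
        f.write(b"P4\n%d %d\n" % (width, height))  # PBM binary header
        f.write(page)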
I'd be interested in hearing people's (non-flame) comments.
One thing to consider is how the format deals with damage. For example,
a series of fixed-length records (like an 80-column card deck) is trivial
to deal with while it is intact, but as soon as the framing gets out of
whack, it's not going to be easy to resync. In my opinion, file formats
should include some provision for scavenging later (which won't recover
everything but should limit the loss). The same goes for disk structures
(filesystems), of course.
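
For instance (a sketch only; the marker and frame layout are invented),
per-record sync markers plus a CRC let a reader skip damage and resync:

import struct
import zlib

MAGIC = b"RECo"  # 4-byte sync marker, invented for this sketch

def frame(record):
    # marker | length | CRC32 | payload
    return MAGIC + struct.pack("<II", len(record), zlib.crc32(record)) + record

def scavenge(blob):
    """Yield intact records, resyncing on the next marker after damage."""
    i = 0
    while True:
        i = blob.find(MAGIC, i)
        if i < 0:
            return
        if i + 12 <= len(blob):
            length, crc = struct.unpack_from("<II", blob, i + 4)
            rec = blob[i + 12:i + 12 + length]
            if len(rec) == length and zlib.crc32(rec) == crc:
                yield rec
                i += 12 + length
                continue
        i += 1  # corrupt or false frame: hunt for the next marker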
Unfortunately, simply taking a raw uncompressed file and using an "open
wrapper archive" around it does little to address this issue. The
internal file can be checksummed as a whole to detect corruption, and
maybe allow the wrapper to skip it en bloc and resync, but (being "as
raw as possible") there's not necessarily any internal structure to aid
recovery (such as short signature strings marking individual structures).
Metadata that exists as file pointers into a blob (e.g. the page or row
pointers you suggest above) can't deal with, say, a missing sector; but
it can of course be helpful when the file is damaged and its length is
unchanged.
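
One mitigation, again only a sketch with a hypothetical per-page digest
list in the descriptor: checksum each unit separately, so a damaged
archive surrenders only the pages that fail rather than the whole blob:

import hashlib

def verify_pages(raw, page_bytes, expected_digests):
    """Return the 1-based numbers of pages whose SHA-256 still matches."""
    good = []
    for n, digest in enumerate(expected_digests):
        page = raw[n * page_bytes:(n + 1) * page_bytes]
        if hashlib.sha256(page).hexdigest() == digest:
            good.append(n + 1)
    return good

# Note: fixed-offset slicing still assumes no bytes were lost; a missing
# sector shifts every later page, per the pointer caveat above.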
--Toby