Digital archiving tools - test-drb@ccmp.vtda.org

8 Mar 2011

Many of us maintain large collections of bits that we'd like to preserve for a long
time, and to distribute, replicate, and migrate over unreliable storage media and networks.
As disk sizes (and archive sizes) have increased, the probability that corruption goes
undetected or uncorrected by the mechanisms normally built into disk drives, network
protocols, and filesystems has risen to a level that warrants great concern.

I would be interested to know if there exists an archive format that has the following
desirable properties:

1) It is well documented and relatively simple, to facilitate implementation on many
platforms, present and future.

2) It supports some degree of incremental updating, but need not be particularly efficient
about it.  An explicit compaction operation is preferable to an overly complex format.  It
is adequate to use append-only strategies appropriate for write-once media.

3) Insertion and extraction of files, copying of archives, and other archive-manipulation
operations support end-to-end verification that identical bits have been stably recorded
to the media, bypassing or defeating platform-level or hardware-level caching mechanisms.
Where this is not possible, the limits must be carefully delineated, with some basis for
determining the properties of the platform and certifying its reliability.

4) The format should provide superior error-detection capability, designed to avoid the
failure modes common to the mechanisms typically used in hardware.  For example, use a
document-level cryptographic checksum rather than a block-level CRC.

5) The format should include a high degree of internal redundancy and recoverability, say,
along the lines of a virtual RAID array.
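
To make requirements 4 and 5 concrete, here is a minimal sketch in Python of the kind of
thing I have in mind: a document-level SHA-256 digest for detection, plus a single XOR
parity block (in the spirit of RAID 5) for recovery.  The block size and the function
names split_blocks, xor_parity, and recover are purely illustrative assumptions, not part
of any existing format.

```python
import hashlib
from functools import reduce

BLOCK = 4  # illustrative block size; a real format would use something far larger


def split_blocks(data: bytes, size: int = BLOCK) -> list[bytes]:
    """Split data into fixed-size blocks, zero-padding the final block."""
    padded = data + b"\x00" * (-len(data) % size)
    return [padded[i:i + size] for i in range(0, len(padded), size)]


def xor_parity(blocks: list[bytes]) -> bytes:
    """XOR all blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover(blocks: list[bytes], parity: bytes, missing_index: int) -> bytes:
    """Rebuild one block known to be lost from the survivors and the parity."""
    survivors = [b for i, b in enumerate(blocks) if i != missing_index]
    return xor_parity(survivors + [parity])


# Usage: record a digest and parity, "lose" block 2, rebuild it, and re-verify.
data = b"long-lived archive payload"
digest = hashlib.sha256(data).hexdigest()   # document-level checksum (item 4)
blocks = split_blocks(data)
parity = xor_parity(blocks)                 # redundancy (item 5)

rebuilt = recover(blocks, parity, 2)
restored = b"".join(blocks[:2] + [rebuilt] + blocks[3:])[:len(data)]
assert hashlib.sha256(restored).hexdigest() == digest
```

A single parity block can only rebuild one block whose position is already known, so this
sketch would also need per-block digests to locate damage.  A serious format would
presumably use a stronger erasure code, such as Reed-Solomon, to survive multiple losses.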

Just as biological organisms constantly correct DNA replication errors,
the idea is to have a format that is robust across long-term exposure to
imperfect copying and transmission channels.
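
As a rough illustration of the end-to-end verification in item 3, a copy operation might
hash the source, write the destination, force it to stable storage, and then hash what
was written.  This is only a sketch under stated assumptions: the names sha256_file and
verified_copy are hypothetical, and the read-back may well be served from the operating
system's cache rather than from the media, which is exactly the kind of limit that would
need to be delineated per platform.

```python
import hashlib
import os

CHUNK = 1 << 20  # read and write in 1 MiB pieces


def sha256_file(path: str) -> str:
    """Return the hex SHA-256 digest of a file's full contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for piece in iter(lambda: f.read(CHUNK), b""):
            h.update(piece)
    return h.hexdigest()


def verified_copy(src: str, dst: str) -> str:
    """Copy src to dst, flush to disk, then confirm the digests match."""
    expected = sha256_file(src)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for piece in iter(lambda: fin.read(CHUNK), b""):
            fout.write(piece)
        fout.flush()              # push Python's buffer to the OS
        os.fsync(fout.fileno())   # ask the OS to push its buffers to the device
    actual = sha256_file(dst)     # caveat: may be read from cache, not the media
    if actual != expected:
        raise OSError(f"verification failed for {dst}: {actual} != {expected}")
    return actual
```

Calling verified_copy("a.bin", "b.bin") returns the common digest on success and raises
OSError on a mismatch.  Note that os.fsync only requests a flush; whether the drive's own
write cache honors it depends on the hardware.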

Does anything like this exist?

--Bill