On 3/8/11 5:33 PM, William Maddox wrote:
Does anything like this exist?
As Tim already said, the short answer is 'no', and this is a huge
problem for me/CHM.
I made a decision a long time ago that the primary mode of storage would
be either media images or uncompressed archive files (tar, or
uncompressed zip, mostly).
I have thousands of container files, containing millions of files. Not
all of them were imaged by me; many have unverified or even unknown
contents. Not all of the provenance is known. I have been generating
them much faster than they can be verified or indexed, since I have been
racing against the clock to get data off of media while it can still be
read.
I have two problems: making sure that the containers (currently in the
terabytes) don't suffer bit rot, and getting them cataloged and verified.
There are two very different data sets involved: one that MUST remain
invariant, and a second which changes constantly as information about
the first is added.
The preservation of the first is somewhat simpler. Since it is known not
to be changing, you can use an MD5 or some other mechanism over it, and
take advantage of the fact that bits are easy to copy to make redundant
images of it and spread the images over geographic locations. The
problem then is not to lose the connection between the invariant
containers of data and the description of them, stored either as flat
files or in a database. One key can be the MD5 of the container, but
this doesn't take care of the case where you have redundant images from
different sources (although if they are logically 'the same', does it
matter?).
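To make the checksum idea concrete, here is a minimal sketch in Python,
assuming the containers all live under one directory. The function names
(file_digest, build_manifest, verify_manifest) are made up for
illustration and are not from any existing tool. It only detects
changes; repairing a damaged container still depends on having a good
redundant copy somewhere else.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file in chunks so large containers are never loaded whole."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root, algorithm="md5"):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p, algorithm)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest, algorithm="md5"):
    """Return files that are missing or whose digest no longer matches."""
    root = Path(root)
    problems = {}
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file():
            problems[rel] = "missing"
        elif file_digest(p, algorithm) != expected:
            problems[rel] = "digest mismatch"
    return problems
```

Build the manifest once when a container is created, store it away from
the containers themselves, and rerun verify_manifest against every copy
on a schedule. Passing algorithm="sha256" swaps in a different hash
without changing anything else.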
Getting back to your problem, the first thing to think about is
separating the 'archive' from the data that describes it. As Richard has
discovered, there are lots of catalog systems, some open, some not, for
describing the contents of an archive. Archivists are good at this; they
have been dealing with archives on paper for a long, long time. What
they aren't very good at (yet) is managing a resource which can't just
be put into archival boxes, placed into a climate controlled environment
and 'forgotten'. One of the few good things about our ephemeral digital
blobs of data, though, is that they are easy to copy (though the time
and CPU resources to do so are a non-trivial problem) and we can take
advantage of the fact that writable mass-storage prices keep going down.
So, the first thing to think about is really getting a handle on what
you have in your collection of bits. Do you know their provenance, and
what is in the containers (format type, validity, etc.)? Then come up
with some identification scheme where you can link these invariant
containers to the descriptive system used to locate and describe them.
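One possible shape for that link, sketched in Python under the
assumption that the container's digest is the key and the descriptions
live in a separate, freely changing JSON file. The record fields
(source, format, verified, notes) and the function names are
hypothetical, not taken from any catalog standard. Keeping a list of
records per digest means identical images from different sources end up
under one key, each with its own provenance.

```python
import json


def add_record(catalog, digest, source, fmt=None, verified=False, notes=""):
    """Attach one descriptive record to a container, keyed by its digest."""
    catalog.setdefault(digest, []).append(
        {"source": source, "format": fmt, "verified": verified, "notes": notes}
    )
    return catalog


def unverified(catalog):
    """Digests for which no record has been marked verified yet."""
    return sorted(
        digest
        for digest, records in catalog.items()
        if not any(r["verified"] for r in records)
    )


def save_catalog(catalog, path):
    """Write the catalog as plain JSON so it stays readable without tools."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(catalog, f, indent=2, sort_keys=True)


def load_catalog(path):
    """Read a catalog previously written by save_catalog."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

The containers never change, so the digest stays a stable key while the
records attached to it keep growing. unverified() gives a work list of
what still has to be looked at.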
-- hope that helps
And, if someone does know of a system that is available that can do
this, I'd love to hear about it.
I keep thinking this is a business opportunity for someone, except for
the fact that archives never have any money :-(