There's plenty of file system meta-data out there to confound
this process. Blindly archiving and un-archiving will destroy
data that's not inside the file. There's a lot to be said for
archiving images of entire filesystems. What, timestamps
aren't important? Creation dates as well as last-modified dates?
Archive bits? At least 'tar' preserves Unix's groups and
permissions to a reasonable degree.
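To make that concrete, here's a quick sketch of my own (Python, not from
any tool mentioned here) of the kind of per-file metadata a blind
copy-and-rearchive pass throws away; a filesystem image keeps all of it
for free:

# A minimal sketch: record the per-file metadata that lives outside the
# file's own bytes before a naive re-archive loses it.
import json
import os
import stat
import sys
import time

def capture_metadata(path):
    st = os.stat(path)
    return {
        "path": path,
        "mode": stat.filemode(st.st_mode),   # Unix permission bits
        "uid": st.st_uid,                    # owner
        "gid": st.st_gid,                    # group
        "mtime": time.ctime(st.st_mtime),    # last-modified date
        "ctime": time.ctime(st.st_ctime),    # inode change time (not a true creation date on most Unixes)
        "size": st.st_size,
    }

if __name__ == "__main__":
    print(json.dumps([capture_metadata(p) for p in sys.argv[1:]], indent=2))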
--
There is also 'hidden' but valuable data you may not know about.
I recovered the sources to some stand-alone machine utilities off
the end of a 9-track data set on a tape that had been reused.
It's also been mentioned that DEC 'distributed' unsupported code
on distribution discs where the unsupported code was 'deleted'
(marked as deleted in the directory).
--
Another problem with bits in the wild is that you have no idea if they've
been patched or corrupted in some way before you get them.
This is why it's necessary to read as many copies of the same program
as you can find, even if the program has already been 'archived'.
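One simple way to put several independently read copies to work, sketched
here on my own assumption that the dumps are all the same length, is a
byte-wise majority vote, with every disagreement flagged for hand
inspection:

# Byte-wise majority vote across several dumps of the same program.
# Positions where the copies disagree are reported for manual review.
from collections import Counter

def majority_merge(dumps):
    merged = bytearray()
    disputed = []
    for offset, column in enumerate(zip(*dumps)):
        counts = Counter(column)
        value, votes = counts.most_common(1)[0]
        if votes < len(dumps):
            disputed.append(offset)          # the copies disagree here
        merged.append(value)
    return bytes(merged), disputed

# usage:
# images = [open(name, "rb").read() for name in ("copy1.img", "copy2.img", "copy3.img")]
# best, disputed = majority_merge(images)

With three or more reads, a single bad copy gets outvoted; with only two,
the disputed list at least tells you where to look.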
Running when shut down 5 years ago:
AlphaServer 2100
monitor VRT19-HA
tape drive TS05
monitor VR297-DA
also several DEC terminals
Mike
c m c f a d d e n 5 a t c o m c a s t d o t net
>From: "Randy McLaughlin" <cctalk at randy482.com>
---snip---
>
>Some of the original DRI archives I sent to Gaby are examples especially of
>the working disks that were never distributed outside of DRI. On these
>disks that contain up to three versions (*.A86, *.BAK, and the deleted file)
>of the source code being developed I included teledisk images so people
>could look at these unique disks byte by byte.
---snip---
Hi
On one of the disk images I was recovering, someone had deleted
the wrong copy of the program. The broken one was in the normally
readable directory while the good one was marked deleted.
I was able to recover this from a complete image.
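For anyone curious how that looks in practice, here is a rough sketch of
scanning a raw image for directory entries, deleted ones included. It
assumes a CP/M-style directory, where erasing a file just writes 0xE5 over
the user-number byte; the directory offset below is only an example and
depends on the disk format:

# Scan a raw CP/M-format disk image for directory entries, including
# ones marked deleted (user-number byte == 0xE5).
DIR_ENTRY = 32          # CP/M directory entries are 32 bytes
DELETED   = 0xE5        # value written to byte 0 when a file is erased

def scan_directory(image, dir_offset, dir_size):
    for off in range(dir_offset, dir_offset + dir_size, DIR_ENTRY):
        entry = image[off:off + DIR_ENTRY]
        user = entry[0]
        if user > 15 and user != DELETED:
            continue                        # not a valid entry
        name = entry[1:9].decode("ascii", "replace").strip()
        ext  = bytes(b & 0x7F for b in entry[9:12]).decode("ascii", "replace").strip()
        state = "DELETED" if user == DELETED else "present"
        # note: never-used slots on a freshly formatted disk are also
        # 0xE5-filled and will show up here as junk names
        print(f"{name:<8}.{ext:<3}  {state}")

# usage (directory location depends on the disk format):
# image = open("disk.img", "rb").read()
# scan_directory(image, dir_offset=0x2000, dir_size=64 * DIR_ENTRY)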
Of course, there are always the bits of confidential information
that get onto a disk and were never intended to be archived
forever (though still of interest to some future generation).
Dwight
>From: "John Foust" <jfoust at threedee.com>
>
>There's plenty of file system meta-data out there to confound
>this process. Blindly archiving and un-archiving will destroy
>data that's not inside the file. There's a lot to be said for
>archiving images of entire filesystems. What, timestamps
>aren't important? Creation dates as well as last-modified dates?
>Archive bits? At least 'tar' preserves Unix's groups and
>permissions to a reasonable degree.
>
Hi
In one file system I'm looking at, the directory itself records
what type of file it is (it's not in the name of the file,
either). Things like 'system-only executable' and 'word processor
format' are not encoded in the file name or in the file itself
on the one I'm looking at.
Also missing from the file are things like the start address
and load address.
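A rough illustration of the idea (the entry layout below is made up, since
I haven't named the actual file system): if the type code, load address,
and start address live only in the directory entry, an archive that keeps
just the file body silently loses all three.

# Hypothetical directory-entry layout: the type, load address, and start
# (entry) address live only here, not in the file body or its name.
import struct

ENTRY_FMT = "<8s3sBHH"     # name, ext, type code, load addr, start addr (little-endian)

FILE_TYPES = {0x01: "system-only executable",
              0x02: "word processor document",
              0x03: "plain data"}

def parse_entry(raw):
    name, ext, ftype, load, start = struct.unpack(ENTRY_FMT, raw[:struct.calcsize(ENTRY_FMT)])
    return {
        "name":  name.decode("ascii", "replace").strip(),
        "ext":   ext.decode("ascii", "replace").strip(),
        "type":  FILE_TYPES.get(ftype, f"unknown (0x{ftype:02X})"),
        "load":  f"0x{load:04X}",
        "start": f"0x{start:04X}",
    }

# example (made-up entry):
# parse_entry(b"SYSGEN  CMD" + bytes([0x01, 0x00, 0x01, 0x03, 0x01]))
# -> name SYSGEN.CMD, system-only executable, load 0x0100, start 0x0103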
Dwight
>
>Distribution can be an archival tool, I recommend people download as much of
>my site as possible. God forbid if later today I get hit by a car I don't
>want the information I've gathered to be lost.
>
Hi
A great point not discussed so far in this bout!
Dwight
>From: "Vintage Computer Festival" <vcf at siconic.com>
>
>On Thu, 19 May 2005, Dwight K. Elvey wrote:
>
>> In any case, these are all academic in comparison to the problems
>> of indexing. I don't even have the beginnings of how to deal
>> with that problem.
>
>Google :)
>
Hi
It works surprisingly well but it still misses a lot.
Like when I was looking for the data sheets of the WD1100V-01.
The information was out there, it just wasn't indexed.
Most document-writing programs today have automatic indexing:
you mark things as you go along to place them in the index.
It requires that someone actually realizes
what needs to be indexed. Then comes the problem of cross
references. Add to that synonyms.
I was looking through the directories of one of the images
I'd captured from the Polymorphic stuff and found that
a disk labeled "GAMES" contained a version of Forth.
That may have been the person's personal feeling
about it, but it was not good indexing.
My guess is that Google is missing 90 to 95% of the
relevant information out there. If you include site links
on individual pages, that improves to about 85% at best.
Now, add to that the problem of something that exists
but somehow gets placed in the wrong place.
Indexing will be the biggest challenge!
Dwight
>From: "Jim Leonard" <trixter at oldskool.org>
>
>On Thu, May 19, 2005 at 10:55:37AM -0700, Dwight K. Elvey wrote:
>> We then have a library with the one key.
>
>There are billions of .ZIP files. I don't think we're going to "lose
>the key" any time soon.
>
>> Maybe I'm just thinking a little beyond where you are at. Extra
>> levels are not good. I can't find a way to make it any clearer
>> than that.
>> If you've ever spent some time actually recovering corrupted
>> data you'd understand.
>
>You're assuming I haven't. And besides, in my RAR example,
>recovering corrupted data is actually easier than raw data due to
>the use of ECC embedded in the archive. So while you're struggling
>for days trying to make sense of mangled flux reversals on some
>disk/tape somewhere, I will simply read what I can and ECC the rest
>in 30 seconds.
Hi
You are assuming that the ECC itself was not part of the corruption.
Again, I don't think you have really done a recovery project;
you've just allowed the tool to do it for you. You need a
little more experience under your belt before you can truly
understand the issues. Do you fully understand the limitations
of ECCs? Have you actually worked with some of the algorithms?
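To put one of those limitations in concrete terms, here is a small sketch
of my own using a Hamming(7,4) code: it fixes any single flipped bit in a
7-bit codeword, but two flipped bits get 'corrected' into the wrong data
with no warning. The ECCs used in archive formats (Reed-Solomon and
friends) are stronger, but every one of them has a hard limit past which
corruption slips through or gets made worse.

# Hamming(7,4): corrects any single-bit error in a 7-bit codeword, but a
# two-bit error is silently "corrected" to the wrong data.

def encode(d):                      # d: list of 4 data bits
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]   # codeword positions 1..7

def decode(c):                      # c: 7-bit codeword
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3            # 1-based position of the presumed flipped bit
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1                   # "correct" that position
    return [c[2], c[4], c[5], c[6]], syndrome

data = [1, 0, 1, 1]
code = encode(data)

one_err = code[:]; one_err[2] ^= 1                   # single-bit error
two_err = code[:]; two_err[1] ^= 1; two_err[5] ^= 1  # two-bit error

print(decode(one_err)[0] == data)   # True  -- recovered correctly
print(decode(two_err)[0] == data)   # False -- yet the decoder believes it fixed a single-bit error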
And yes, I can see a day when even ZIP might be lost.
How many Mickey Mouse watches were made, and how many exist
today? Sometimes, being more common makes an item more
likely to be lost.
Again, I'm looking at how things have been treated in the
past and what mistakes were made. These are the types of things
that are most likely to trip us up in the future. If
one doesn't learn from mistakes ......
Sorry for talking about such a volatile subject. I'm just
glad that the person actually working on the project at the
museum has researched the subject enough to recognize the
many pitfalls.
Dwight
>From: "Randy McLaughlin" <cctalk at randy482.com>
---snip---
>
>It works for the largest base, even for those that are just now ready to
>learn.
>
>I am not advocating winzip as a perfect system, just a system that fits my
>personal desire to spread archives to as many as want to download them.
>
>
>Randy
>www.s100-manuals.com
Hi Randy
For what you are doing, providing data to users, you are
doing the right thing. For an archive, it is the wrong thing.
An archive should keep the least and simplest amount of
encoding, and stay as close to the original as possible.
Feeding the information into a form such as a zip
does not meet that criterion.
Providing data to the users in the most universal and current
format is the proper thing for an archivist to do in many
cases (as you do). Still, you have shut out many Mac
users and some of those who are so far in the past that
they are still running an 8080 CP/M machine.
Imagine that some future archivist digs up Sellam's library.
He must first realize that he needs something that interprets
x86 instructions to unwind the encoding. He then has to figure
out what the purpose of the data was.
Many here have made the assumption that the information would
smoothly be moved along to the next current medium and retranslated
to the next handy compaction or encapsulation tool.
The real world isn't like that. There will be gaps in the
maintenance, for many reasons, like budgets or just apathy.
A good archivist needs to do their best to anticipate this
and consider it as part of their strategy.
Dwight
I have some archives that are archives of archives, and I intend to go
back and extract the original archives into un-packaged, treed directories,
then repackage them into just a zip format so that future users don't have
to figure out the zip format and then also have to figure out the myriad of
CP/M packaging/squeeze formats.
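The repackaging half of that is the easy part once the nested archives
have been unpacked by hand; here's a sketch of my own for rolling an
already-extracted directory tree into one plain zip:

# Repack an already-extracted directory tree into a single plain zip,
# preserving relative paths, so only one very common format stands
# between a future user and the files.
import os
import zipfile

def repack(tree_root, zip_path):
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _dirnames, filenames in os.walk(tree_root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                zf.write(full, os.path.relpath(full, tree_root))

# usage (paths are just examples):
# repack("extracted/altair_disks", "altair_disks.zip")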
--
Verifying that all of these files are 'correct', of course...
One thing to be VERY careful of is reinjecting files that were bad,
and were fixed later, into reencoded archives.
When I was more active in arcade game ROM archiving, we ran into this
constantly: someone would find corrupt ROM data and declare it a 'new'
version of the game. A CRC (now SHA1) checking scheme eventually evolved
to index what had already been shown to be correct (either through
simulation or by finding duplicates of the physical ROMs), as well as a
utility to detect when ROM dumps had things like stuck address or data
bits.
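For what it's worth, the stuck-bit check can be surprisingly simple; this
is a sketch of my own, not the actual utility. A data bit that never
changes across the whole dump is suspect, and a stuck address bit makes
the two halves of the dump selected by that bit come out identical
(assuming a power-of-two ROM size):

# Heuristics for bad ROM dumps: data bits that never change, and address
# bits whose two halves of the address space read back identical.
def stuck_data_bits(dump):
    and_acc, or_acc = 0xFF, 0x00
    for byte in dump:
        and_acc &= byte
        or_acc  |= byte
    stuck_high = and_acc            # bits that were 1 in every byte
    stuck_low  = 0xFF & ~or_acc     # bits that were 0 in every byte
    return stuck_high, stuck_low

def stuck_address_bits(dump):       # assumes len(dump) is a power of two
    suspects = []
    bit = 1
    while bit < len(dump):
        if all(dump[a] == dump[a ^ bit] for a in range(len(dump))):
            suspects.append(bit)    # flipping this address bit changes nothing
        bit <<= 1
    return suspects

# usage:
# dump = open("game.rom", "rb").read()
# print(stuck_data_bits(dump), stuck_address_bits(dump))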