On Fri, 2005-05-20 11:36:24 +0000, Jules Richardson <julesrichardsonuk at yahoo.co.uk> wrote:
> On Fri, 2005-05-20 at 09:31 +0200, Jan-Benedict Glaw wrote:
>> On Thu, 2005-05-19 22:20:53 +0000, Jules Richardson <julesrichardsonuk at yahoo.co.uk> wrote:
>>> On Thu, 2005-05-19 at 23:46 +0200, Jan-Benedict Glaw wrote:
>>>> I'm still thinking about how paper-based documentation can be packaged
>>>> up cleverly enough to capture text as well as images, with meta-data
>>>> mixed into that. Maybe I'd do some C programming and hack something
>>>> nice that produces PDF files holding everything? But first, I'd need
>>>> to understand PDF (whose specification is actually about 8cm thick...)
>>> Doesn't this sort of imply that PDF is the wrong choice of format for
>>> jobs like these? (Plus I'm pissed at Adobe because their current reader
>>> for Linux eats close on 100MB of disk space just to let me read a PDF
>>> file :-)
>> There are alternatives, like:
>>  - A tarball containing all the TIFF (or whatever) images as well as
>>    some (generated)
> See my other post; that's my preference and what I tend to do with all
> image-based PDF content I download from anywhere anyway...
For the record (and my education), how do you extract these?
>>    HTML page (containing some kind of slide show) as well as a small
>>    description file (use this with some program (to be written) to
>>    generate the HTML file(s)).
>>    This gives you the chance to make the description file quite clever,
>>    so you'll get e.g. a clickable index for the TIFF files (though that
>>    needs to be done manually, but now this workload can actually be
>>    *distributed*).
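
Just to make that description-file idea a bit more concrete, here's a
rough (untested) Python sketch; the input format, file names and the HTML
layout are all made up, it's only meant to show how little code that
"program (to be written)" would actually need:

  #!/usr/bin/env python
  # Sketch only: turn a hand-written description file into a clickable
  # HTML index for the TIFFs. Assumed input format, one scanned page
  # per line:  <tiff-file> | <caption>
  # e.g.:      page-003.tiff | Chapter 1: Installation
  import sys, html

  def make_index(desc_path, out_path):
      items = []
      for line in open(desc_path):
          line = line.strip()
          if not line or line.startswith("#") or "|" not in line:
              continue                     # skip blanks and comments
          image, caption = [p.strip() for p in line.split("|", 1)]
          items.append('<li><a href="%s">%s</a></li>'
                       % (html.escape(image, quote=True),
                          html.escape(caption)))
      with open(out_path, "w") as out:
          out.write("<html><body><ul>\n%s\n</ul></body></html>\n"
                    % "\n".join(items))

  if __name__ == "__main__":
      make_index(sys.argv[1], sys.argv[2])  # e.g. manual.desc index.html

The point being that all the hard work is writing the captions, and that
part can be farmed out to volunteers, one document at a time.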
> One of the things that I was working on a few years back was layering
> multiple delivery mechanisms over one form of content (where the dataset
> was sufficiently large that storage in multiple formats wasn't
> justified).
>
> Data was kept in the "purest" form on the server side, and a client
> could ask for content in whatever format they wanted (in this case raw
> images, PDF, HTML etc.) and over whatever interface mechanism they
> wanted (HTTP, FTP, WAP, email, network filesystem etc.).
Actually, I was working on something like that as well, but with a
different ulterior motive: build something like this as a redundant,
peer-to-peer-capable database and many of the archiving-old-data
problems just vanish. (Indeed, it would make a nice P2P system as well.)
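
The core of it really is just "keep one master copy, convert on
request". A quick sketch of that part (the converters are placeholders,
and the transport side, HTTP/FTP/mail/whatever, would sit on top of it):

  # Sketch: one "pure" master copy, converted to whatever the client
  # asks for at request time. Real converters would call an image
  # library, a PDF writer and so on.

  def to_raw(pages):
      # pages: the master data, e.g. a list of raw scan blobs
      return pages

  def to_html(pages):
      imgs = "\n".join('<img src="page-%04d.png">' % i
                       for i in range(len(pages)))
      return "<html><body>\n%s\n</body></html>" % imgs

  CONVERTERS = {
      "raw":  to_raw,
      "html": to_html,
      # "pdf": to_pdf, ...  added as somebody writes them
  }

  def fetch(pages, fmt):
      # hand out the document in the client's requested format
      if fmt not in CONVERTERS:
          raise ValueError("no converter for format %r" % fmt)
      return CONVERTERS[fmt](pages)

Only the master data ever needs to be backed up; everything else is
derived and can be thrown away or regenerated at will.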
> I could see some of the big archives around the planet (regardless of
> content) going this way in the future; user base is maximised through
> offering different formats whilst the "pure" dataset is all that's
> backed up and actually kept on disk.
That's what I dream about in long nights, though not as a centralized
database but as a distributed one. Imagine you feed in a raw-encoded
audio CD (with all the interleaved stuff intact, even including all the
mischievous, intentional errors), and a front-end producing WAV files
from that (which another front-end could use to produce ogg/mp3/wma/you
name it).
Concepts like that can be applied to nearly all media types. Just record
what the underlying recording/reading machinery gets and write filters
for that. These filters may get as complex as filesystem drivers. In
fact, this *is* a layered filesystem. Remember the thread(s) about how
to rescue tapes? ...or HDD images? Apply these general concepts and
things may get easier :)
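
To sketch the filter stacking (names invented, and the actual
de-interleaving/decoding left out; the only rule is that each layer
talks only to the layer below it, so the raw capture never gets touched):

  class RawCDImage:
      # the master copy: the bits exactly as the drive delivered them
      def __init__(self, path):
          self.path = path
      def read_sectors(self):
          with open(self.path, "rb") as f:
              return f.read()

  class WavFrontend:
      # turns raw sectors into PCM tracks (error handling etc. omitted)
      def __init__(self, lower):
          self.lower = lower
      def tracks(self):
          raw = self.lower.read_sectors()
          return [raw]          # real code would de-interleave here

  class OggFrontend:
      # yet another layer, stacked on top of the WAV one
      def __init__(self, lower):
          self.lower = lower
      def encode_all(self):
          return [("track-%d.ogg" % n, pcm)
                  for n, pcm in enumerate(self.lower.tracks(), 1)]

  # usage:
  # OggFrontend(WavFrontend(RawCDImage("disc.raw"))).encode_all()

A tape or HDD image would just swap in a different bottom layer and a
different set of filters on top.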
For now (as we're not (yet) there), providing space isn't my main
problem. It's about having (time to write) the software.
Regards, JBG
--
Jan-Benedict Glaw jbglaw at lug-owl.de . +49-172-7608481 _ O _
"Eine Freie Meinung in einem Freien Kopf | Gegen Zensur | Gegen Krieg _ _ O
fuer einen Freien Staat voll Freier B?rger" | im Internet! | im Irak! O O O
ret = do_actions((curr | FREE_SPEECH) & ~(NEW_COPYRIGHT_LAW | DRM | TCPA));