On Fri, 2005-05-20 at 09:31 +0200, Jan-Benedict Glaw wrote:
> On Thu, 2005-05-19 22:20:53 +0000, Jules Richardson
> <julesrichardsonuk at yahoo.co.uk> wrote:
> > On Thu, 2005-05-19 at 23:46 +0200, Jan-Benedict Glaw wrote:
> > > I'm still thinking about how paper-based documentation can be made up
> > > cleverly enough to gain text as well as images and mixing meta-data into
> > > that. Maybe I'd do some C programming and hack something nice producing
> > > PDF files holding everything? But first, I'd need to understand PDF
> > > (whose specification actually is about 8cm thick...)
> > Doesn't this sort of imply that PDF is the wrong choice of format for
> > jobs like these? (Plus I'm pissed at Adobe because their current reader
> > for Linux eats close on 100MB of disk space just to let me read a PDF
> > file :-)
> There are alternatives, like:
>  - A tarball containing all the TIFF (or whatever) images as well
>    as some (generated)
See my other post; that's my preference and what I tend to do with all
image-based PDF content I download from anywhere anyway...
>    HTML page (containing some kind of slide show) as well as a small
>    description file (use this with some program (to be written) to
>    generate the HTML file(s)).
>    This gives the chance that the description file can be done quite
>    cleverly, so you'll get e.g. a clickable index for the TIFF files
>    (though it needs to be done manually, but now this workload can
>    actually be *distributed*)
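Something along these lines would probably do for the generator side. This
is only a rough sketch, and the one-line-per-image description format (and
all the names in it) are purely my own invention:

#!/usr/bin/env python
# Sketch only: assumes a description file with one line per scanned page,
# of the form  <tiff filename>|<page title>
# (the format and all names here are made up for illustration)
import sys

def read_description(path):
    """Yield (filename, title) pairs from the description file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            filename, title = line.split("|", 1)
            yield filename.strip(), title.strip()

def write_index(pages, out="index.html"):
    """Write a clickable index pointing at the TIFF images."""
    with open(out, "w") as html:
        html.write("<html><body><h1>Contents</h1>\n<ul>\n")
        for filename, title in pages:
            html.write('<li><a href="%s">%s</a></li>\n' % (filename, title))
        html.write("</ul>\n</body></html>\n")

if __name__ == "__main__":
    write_index(read_description(sys.argv[1]))

Run that against the description file sitting alongside the TIFFs and you
get a clickable index.html out of it; the slide-show pages could be
generated the same way, and the description files themselves are small
enough that writing them is the sort of job that can be farmed out.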
One of the things that I was working on a few years back was layering
multiple delivery mechanisms over one form of content (where the dataset
was sufficiently large that storage in multiple formats wasn't
justified).
Data was kept in the "purest" form on the server side, and a client
could ask for content in whatever format they wanted (in this case raw
images, PDF, HTML etc.) and over whatever interface mechanism they
wanted (HTTP, FTP, WAP, email, network filesystem etc.).
Conversion of imagery tends to be memory-expensive, but not
computationally so (unless you're also scaling / cropping images);
caching can be added in where necessary to save on *some* resources.
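Very roughly, the conversion end looks something like this (just a sketch;
the directory layout and the use of PIL for the actual conversion are
assumptions on my part for illustration):

#!/usr/bin/env python
# Sketch of on-demand format conversion with a simple disk cache.
# Assumes PIL/Pillow for the conversion step; the paths and function
# name are made up for illustration.
import os
from PIL import Image

MASTER_DIR = "/archive/masters"   # the "purest" form, the only copy backed up
CACHE_DIR = "/var/cache/archive"  # derived formats, safe to throw away

def fetch(image_name, fmt="PNG"):
    """Return a path to image_name in the requested format, converting on demand."""
    cached = os.path.join(CACHE_DIR, "%s.%s" % (image_name, fmt.lower()))
    if not os.path.exists(cached):
        master = os.path.join(MASTER_DIR, image_name + ".tif")
        os.makedirs(CACHE_DIR, exist_ok=True)
        # Memory-hungry but cheap on CPU, as long as we don't scale or crop here.
        Image.open(master).save(cached, fmt)
    return cached

The cache is purely derived data, so it never needs backing up and can be
thrown away whenever disk gets tight.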
The other advantage is that, as a database is keeping track of what's in
the system, there are potential hooks there for indexing and maintaining
access stats.
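For instance (again only a sketch, with the table and column names
invented), the fetch routine above could bump a counter on each retrieval:

import sqlite3

def record_access(db_path, image_name, fmt):
    """Log one retrieval so the archive can report per-image access stats."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS accesses
                   (image TEXT, fmt TEXT, hits INTEGER DEFAULT 0,
                    PRIMARY KEY (image, fmt))""")
    con.execute("""INSERT INTO accesses (image, fmt, hits) VALUES (?, ?, 1)
                   ON CONFLICT (image, fmt) DO UPDATE SET hits = hits + 1""",
                (image_name, fmt))
    con.commit()
    con.close()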
I could see some of the big archives around the planet (regardless of
content) going this way in the future; user base is maximised through
offering different formats whilst the "pure" dataset is all that's
backed up and actually kept on disk.
cheers
Jules