Better indexing on bitsavers

20 May 2005

On Thu, 2005-05-19 22:20:53 +0000, Jules Richardson <julesrichardsonuk at
yahoo.co.uk> wrote:
...
> On Thu, 2005-05-19 at 23:46 +0200, Jan-Benedict Glaw wrote:
> > I'm still thinking about how paper-based documentation can be made up
> > cleverly enough to gain text as well as images and mix meta-data into
> > that. Maybe I'd do some C programming and hack something nice producing
> > PDF files holding everything? But first, I'd need to understand PDF
> > (whose specification actually is about 8cm thick...)
>
> Doesn't this sort of imply that PDF is the wrong choice of format for
> jobs like these? (plus I'm pissed at Adobe because their current reader
> for Linux eats close on 100MB of disk space just to let me read a PDF
> file :-)

There are alternatives, like:
        - A tarball containing all the TIFF (or whatever) images as well
          as some (generated) HTML page (containing some kind of slide
          show) as well as a small description file (use this with some
          program (to be written) to generate the HTML file(s)).
          This gives the chance that the description file can be done
          quite cleverly, so you'll get e.g. a clickable index for the
          TIFF files (this still needs to be done manually, but now the
          workload can actually be *distributed*)
        - PDF isn't all that wrong. As far as I understood it, it's
          possible to embed any binary sequence into a PDF file. With a
          program (like an extended tumble) you can produce a readable
          PDF file that also acts as some kind of tarball (though needs
          a self-written generator/extractor).
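
A minimal sketch of the first alternative, assuming a tab-separated
description file with one line per scanned page (page number, image file
name, title). The file format and the names parse_description and
render_index are hypothetical, made up for illustration; the program
mentioned above is still to be written.

```python
import html


def parse_description(text):
    """Parse lines of the form 'page<TAB>image-file<TAB>title'.

    Blank lines and lines starting with '#' are skipped. This layout is
    an assumption for the sketch, not an existing format.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        page, image, title = line.split("\t", 2)
        entries.append((int(page), image, title))
    return entries


def render_index(entries):
    """Return an HTML list with one link per page, sorted by page number."""
    items = []
    for page, image, title in sorted(entries):
        items.append(
            '<li><a href="%s">%s (p. %d)</a></li>'
            % (html.escape(image, quote=True), html.escape(title), page)
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

Because the description file is plain text, several people could each
type in the entries for a few chapters and the pieces could simply be
concatenated before generating the HTML.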
I actually *really* like the PDF approach, just because it's so easy and
hassle-free to view the file. Also, if done right, you won't lose
access to your actual image files. But in the long term, we need to work
on the tools. (That's why I started to play a bit with the TeX
approach.)
That's my wishlist (to be addressed after the vax-linux port has matured :-)
        - PDF files with really cool bookmarks, containing the chapter
          numbers, headings or page numbers (or any mixture of those).
          That means that the backing store might either be a tarball or
          a PDF file (used as a tarball).
        - I'm *really* missing a clickable index.  People may invest the
          time to hand-type the original book's index and the pages into
          some description file. From this, I expect a new index
          (clickable) to be generated.
        - If graphical content is OCRed+verified, there shall be a way
          to generate the final PDF with the OCRed data (except for
          those pages where this hasn't been done--there, the original
          image file should show up).
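
Two of the wishlist items can be sketched in a few lines, as an
illustration only. The first function writes bookmark entries as pdfmark
operators, the PostScript extension that PDF producers such as
Ghostscript's pdfwrite device turn into PDF outlines. The second picks,
page by page, the verified OCR text where it exists and the scanned
image otherwise. The function names and the page%04d.tif naming scheme
are assumptions made up for this sketch.

```python
def ps_string(text):
    """Return text as a PostScript string literal, escaping \\, ( and )."""
    escaped = (
        text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    )
    return "(" + escaped + ")"


def bookmarks_to_pdfmark(bookmarks):
    """bookmarks: list of (title, page) pairs, pages counted from 1.

    Emits one flat (non-nested) outline entry per bookmark.
    """
    return "\n".join(
        "[ /Title %s /Page %d /OUT pdfmark" % (ps_string(title), page)
        for title, page in bookmarks
    )


def page_sources(page_count, verified_ocr):
    """Choose the source for every page of the final document.

    verified_ocr maps page numbers to proofread text. Pages without an
    entry fall back to the original image (hypothetical file naming).
    """
    plan = []
    for page in range(1, page_count + 1):
        if page in verified_ocr:
            plan.append((page, "text", verified_ocr[page]))
        else:
            plan.append((page, "image", "page%04d.tif" % page))
    return plan
```

Nested bookmarks (chapters containing sections) would additionally need
a /Count key on the parent entries, and titles outside plain ASCII would
need a different string encoding; both are left out here.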
Comments?
MfG, JBG
--
Jan-Benedict Glaw       jbglaw at lug-owl.de    . +49-172-7608481             _ O _
"Eine Freie Meinung in  einem Freien Kopf    | Gegen Zensur | Gegen Krieg  _ _ O
 fuer einen Freien Staat voll Freier Bürger" | im Internet! |   im Irak!   O O O
ret = do_actions((curr | FREE_SPEECH) & ~(NEW_COPYRIGHT_LAW | DRM | TCPA));
