Better indexing on bitsavers

20 May 2005

On Fri, 2005-05-20 at 15:50 +0200, Jan-Benedict Glaw wrote:
...
  On Fri, 2005-05-20 13:37:54 +0000, Jules Richardson <julesrichardsonuk at yahoo.co.uk> wrote:
  On Fri, 2005-05-20 at 14:05 +0200, Jan-Benedict Glaw wrote:
  On Fri, 2005-05-20 11:36:24 +0000, Jules Richardson <julesrichardsonuk at yahoo.co.uk> wrote:
  >   - A tarball containing all the TIFF (or whatever) images as well
  >     as some (generated)
 See my other post; that's my preference and what I tend to do with all
 image-based PDF content I download from anywhere anyway...

 For the record (and my education), how do you extract these?

 I've always used ImageMagick's 'convert' util to do it -
 'convert foo.pdf foo.tif' will produce a multi-page TIFF image
 corresponding to the input PDF file.
 'convert foo.pdf foo%02d.tif' will give you separate TIFF images for
 each page (which is the more useful flavour).

 Will these extract the _original_ TIFF files (e.g. with embedded comments
 etc.) or would it only produce images looking like the original ones?
Hmm, I'm not actually sure - if I get a chance later I'll do some tests...
as far as I know it won't result in any quality loss, but whether it
preserves comments I don't know. Possibly it depends on the PDF spec too
(i.e. whether the PDF spec explicitly mentions non-image fields within
image data embedded in a PDF file, and whether the spec says that they
must be preserved when transformations are done).
...
 Oh, TIFF, *Tagged* Image File Format. TIFF could even be used to hold
 metadata :)
It *could* - the problem I've always found is that tools to handle
multi-image TIFF files are pretty thin on the ground; they either don't
bother at all (and assume one image per file) or they try to buffer the
whole lot into memory (for 100 A4 page scans in one file, that's Not A
Good Thing). Only about 20% of the tools I've come across over the years
actually try to handle TIFFs properly (as I've said in the past, even
ImageMagick isn't great, because it buffers data to a 32-bit colourspace
in memory regardless of the actual images' bit depth).
That then means that for TIFF to be a viable archive option, it needs to
be one image per page - which also means duplicating metadata across all
pages (images) too.
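
As an aside on the memory problem above: a multi-image TIFF is a chain of
image file directories (IFDs), each pointing at the next, so in principle a
tool can count or locate pages by following offsets without touching any
pixel data. A rough Python sketch of that idea follows; it assumes classic
(non-BigTIFF) files, and the function name count_tiff_pages is made up for
illustration rather than taken from any existing tool.

```python
import struct


def count_tiff_pages(path):
    """Count images in a classic TIFF by walking its IFD chain.

    Only the directory headers are read, never the pixel data.
    """
    with open(path, "rb") as f:
        header = f.read(8)
        if len(header) < 8:
            raise ValueError("file too short to be a TIFF")
        if header[:2] == b"II":
            endian = "<"  # little-endian ("Intel") byte order
        elif header[:2] == b"MM":
            endian = ">"  # big-endian ("Motorola") byte order
        else:
            raise ValueError("not a TIFF file")
        magic, offset = struct.unpack(endian + "HI", header[2:8])
        if magic != 42:
            raise ValueError("not a classic TIFF (BigTIFF is not handled)")

        pages = 0
        seen = set()  # guards against a corrupt, looping chain
        while offset != 0:
            if offset in seen:
                raise ValueError("IFD chain loops back on itself")
            seen.add(offset)
            f.seek(offset)
            (entries,) = struct.unpack(endian + "H", f.read(2))
            f.seek(entries * 12, 1)  # skip the 12-byte directory entries
            (offset,) = struct.unpack(endian + "I", f.read(4))
            pages += 1
        return pages
```

Memory use here stays constant however many pages the file holds, which is
the behaviour I'd want from any tool that claims multi-image support.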
In some ways it'd be handy having metadata in each image saying what
document it relates to, but duplicating a copy of the index,
description, search keywords or whatever across every image making up a
scanned document seems a bit excessive.
I think I'd rather keep the image data as separate TIFF files within a
common directory / archive making up a document, and a single ASCII file
that provides the metadata for that document (TIFF's a pretty darn good
format, but it's let down by some pretty bad tools out there).
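
To make that layout concrete, here is a minimal Python sketch that packs
per-page image files and one plain-text metadata file into a single tarball.
Everything named here is an assumption for illustration only: the function
build_document_archive, the file name metadata.txt, and the "key: value"
line format are not an agreed convention for bitsavers or anything else.

```python
import io
import tarfile


def build_document_archive(archive_path, doc_name, pages, metadata):
    """Write page images plus one ASCII metadata file into a tarball.

    pages:    list of (filename, bytes) pairs, one per scanned page
    metadata: dict of simple key/value strings describing the document
    """
    # One "key: value" line per field, sorted so the output is stable.
    text = "".join(f"{key}: {value}\n" for key, value in sorted(metadata.items()))
    members = [("metadata.txt", text.encode("ascii"))] + list(pages)

    with tarfile.open(archive_path, "w") as tar:
        for name, data in members:
            # Everything lives under one directory named after the document.
            info = tarfile.TarInfo(name=f"{doc_name}/{name}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
```

Called with a list such as [("page00.tif", ...), ("page01.tif", ...)] and a
small dict of title / keywords, this gives one directory per document with
the metadata stored once rather than duplicated into every image.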
Open question to the list - would PNG files be a better option than
TIFF? In terms of what I've said above, there's probably not a lot in it
(and TIFF probably has support on classic hardware where PNG doesn't).
But PNG possibly has the advantage that most (and certainly all modern)
web browsers support it, which possibly has implications for a web-
accessible archive...
cheers
Jules
