Inventory for handling scanned documents (was: Better indexing on bitsavers)

23 May 2005

On Fri, 2005-05-20 19:29:18 +0200, Jan-Benedict Glaw <jbglaw at lug-owl.de> wrote:
...
  On Fri, 2005-05-20 17:08:34 +0000, Jules Richardson <julesrichardsonuk at yahoo.co.uk> wrote:
...
  We've now named quite a lot of applications and concepts about how to
  handle scanned documents. I'd like to get the big picture:

Thanks a lot for all your input. What I read out of that:

	- All of you just want to read the contents.
	- It would be nice if specific make-up could be applied, but
	  because that mostly means a lot of work, most of you wouldn't
	  consider doing it.
	- Getting it right into a web browser might be nice to have.
	- Some of you also care about each individual image for
	  later[tm] post-processing (like OCRing it with The Next
	  Generation Never-Failing HyperOCR[tm](C)(REG) )
	- Getting a PDF file out of those images is a must because it's
	  so nice'n'easy to print and copy around.

The point is: most of this can actually be done with a little scripting
or programming! Most of you (especially those who scan, but also those
who try to dismantle a PDF file) stated that there are home-brewed
scripts doing most of the dirty work. But nobody actually told me
something like 'Here's this URL, you'll find my stuff right there.'

...and this is what brings most of the complexity into the big picture.

Using libtiff, I guess it's not all that complicated to write a little
helper application capable of merging up to 64K single TIFF images into
e.g. a multi-page TIFF with additional arrays of type-7 tags (field type
7 is UNDEFINED in TIFF 6.0, i.e. a run of opaque bytes). That's kind of
magic: with this trick, you can embed any data inside a (multi-page)
TIFF, pegged to one of the sub-files (read as: one of the embedded page
images).
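
To make the idea concrete, here is a minimal sketch in Python. It does
not use libtiff; it writes the file structure by hand with the standard
struct module so the layout is visible. Everything named here is an
assumption for illustration: the tag number 65000 (picked from the
private range), the function name build_tiff, and the restriction to
uncompressed 8-bit grayscale pages in a little-endian file. Resolution
tags are left out for brevity, so treat the result as a demonstration of
the tagging trick rather than a polished baseline TIFF.

```python
import struct

META_TAG = 65000                     # hypothetical tag number, private range
SHORT, LONG, UNDEFINED = 3, 4, 7     # TIFF 6.0 field types


def _entry(tag, ftype, count, value):
    """Pack one 12-byte IFD entry. An int is stored inline; bytes must be 4 long."""
    if isinstance(value, int):
        fmt = "<H" if ftype == SHORT else "<I"
        value = struct.pack(fmt, value).ljust(4, b"\0")
    return struct.pack("<HHI", tag, ftype, count) + value


def build_tiff(pages):
    """pages: iterable of (width, height, pixels, meta); pixels are 8-bit gray bytes."""
    out = bytearray(struct.pack("<2sHI", b"II", 42, 0))
    link = 4                              # where the next IFD offset gets patched in
    for width, height, pixels, meta in pages:
        if len(pixels) != width * height:
            raise ValueError("pixel data does not match page size")
        pix_off = len(out)
        out += pixels
        out += b"\0" * (len(out) % 2)     # keep offsets on word boundaries
        if len(meta) <= 4:                # small values live inside the entry itself
            meta_field = meta.ljust(4, b"\0")
        else:
            meta_field = struct.pack("<I", len(out))
            out += meta
            out += b"\0" * (len(out) % 2)
        entries = [                       # entries must be sorted by tag number
            _entry(256, LONG, 1, width),            # ImageWidth
            _entry(257, LONG, 1, height),           # ImageLength
            _entry(258, SHORT, 1, 8),               # BitsPerSample
            _entry(259, SHORT, 1, 1),               # Compression: none
            _entry(262, SHORT, 1, 1),               # Photometric: black is zero
            _entry(273, LONG, 1, pix_off),          # StripOffsets
            _entry(278, LONG, 1, height),           # RowsPerStrip: one strip
            _entry(279, LONG, 1, len(pixels)),      # StripByteCounts
            _entry(META_TAG, UNDEFINED, len(meta), meta_field),
        ]
        out[link:link + 4] = struct.pack("<I", len(out))   # chain this IFD in
        out += struct.pack("<H", len(entries)) + b"".join(entries)
        link = len(out)
        out += struct.pack("<I", 0)       # 0 marks the last IFD unless patched
    return bytes(out)
```

A call such as build_tiff([(w, h, pixels, b"page=iv;chapter=Preface")])
returns the file contents as bytes, ready to be written out in binary
mode. The "key=value" payload is just one possible convention; the tag
holds whatever bytes you give it.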

Armed with this, you can have /n/ TIFFs for a book's /n/ pages or one
huge multi-page TIFF containing them all, *plus* the bonus of added
keywords, chapter captions, printed page numbers and the like.  All
these goodies won't show up during TIFF->PDF conversion (for printing),
but they'd be put to good use by some little scripts to generate a nice
web-based slide show.  Even an X11-based reader is within reach (there
are probably already nice readers for things like eBooks or books for
Palm Pilots, though I've never used anything like that).

So we'd push the data into single TIFF files, we could easily extract it
into text files (e.g. for indexing), and we could even import auto-OCRed
text (it won't be perfect, but it's probably better than nothing, just
for pure Google indexing).
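
The extraction step could look roughly like the following Python sketch,
which walks the chain of sub-files in a TIFF and dumps each page's
embedded data as one line of text. It assumes the data sits in a private
type-7 tag; the tag number 65000, the function names page_metadata and
index_lines, and the idea that the payload is plain text are all
hypothetical choices for illustration. Only little-endian files are
handled.

```python
import struct

META_TAG = 65000   # hypothetical private tag number holding per-page data
UNDEFINED = 7      # TIFF field type 7: opaque bytes


def page_metadata(data):
    """Return one item per page: the META_TAG payload as bytes, or None."""
    order, magic, ifd = struct.unpack_from("<2sHI", data, 0)
    if order != b"II" or magic != 42:
        raise ValueError("only little-endian classic TIFF is handled here")
    found, seen = [], set()
    while ifd and ifd not in seen:        # stop at end of chain or on a loop
        seen.add(ifd)
        (num_entries,) = struct.unpack_from("<H", data, ifd)
        payload = None
        for i in range(num_entries):
            pos = ifd + 2 + 12 * i
            tag, ftype, count = struct.unpack_from("<HHI", data, pos)
            if tag == META_TAG and ftype == UNDEFINED:
                if count <= 4:            # short values are stored in the entry
                    start = pos + 8
                else:                     # longer ones are stored at an offset
                    (start,) = struct.unpack_from("<I", data, pos + 8)
                payload = bytes(data[start:start + count])
        found.append(payload)
        (ifd,) = struct.unpack_from("<I", data, ifd + 2 + 12 * num_entries)
    return found


def index_lines(data):
    """One 'sequence-number<TAB>payload' line per page that carries data."""
    return [f"{n}\t{meta.decode('utf-8', errors='replace')}"
            for n, meta in enumerate(page_metadata(data), start=1)
            if meta is not None]
```

The number in front of each line is the position of the page within the
file, not the printed page number; the latter would have to come from
the payload itself. The resulting lines can be written to a text file
and handed to whatever indexer is at hand.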

To cut a long story short:

	- Could you live with using single-page TIFFs for long-term
	  storage of the data?
	- These could be used to create multi-page TIFFs, PDFs, web
	  pages, ...
	- Would you like to see something like an X11-based reader which
	  could support searching and equal-or-better navigation
	  (compared to Acrobat Reader)?

I currently don't have anything scanned lying around; could somebody
send me (preferably a URL for) some complete scanned doc? I'd prefer
something in the range of 20..120 pages.

Eric, I don't know how well your bookmark generation code works. Can it
already handle truly tree-structured bookmarks if the data were
available in tumble's input files?

MfG, JBG

-- 
Jan-Benedict Glaw       jbglaw at lug-owl.de    . +49-172-7608481             _ O _
"Eine Freie Meinung in  einem Freien Kopf    | Gegen Zensur | Gegen Krieg  _ _ O
 fuer einen Freien Staat voll Freier Bürger" | im Internet! |   im Irak!   O O O
ret = do_actions((curr | FREE_SPEECH) & ~(NEW_COPYRIGHT_LAW | DRM | TCPA));
