Inventory for handling scanned documents (was: Better indexing on bitsavers)

20 May 2005

On Fri, 2005-05-20 at 19:29 +0200, Jan-Benedict Glaw wrote:
...
  On Fri, 2005-05-20 17:08:34 +0000, Jules Richardson
<julesrichardsonuk at yahoo.co.uk> wrote:

  I can't say I've found many bad TIFF viewers for single images though
 (on any platform); it's only when multiple images are put into the same
 file that a lot of tools start falling over.

 We've now named quite a lot of applications and concepts about how to
 handle scanned documents. I'd like to get the big picture:

 - How do you scan a paper document? Page by page? Two pages at once
   (with a sufficiently large scanner)? Do you use a script or something
   like that? ...or are there well-working applications out there that
   aid in scanning some 100 pages? 
Single page here for A4 docs; for A5 I'll do two pages at once and have
written scripts before to automate the rotating / cropping process (and
chop out some of the 'noise' in the white background to reduce file
size). I tend to use 300dpi.

(Lots of Windows scanner software seems to try and be clever and adjust
the contrast on the fly btw - which isn't so good when you're trying to
get consistency across multiple pages!)
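
For the two-pages-at-once case, the rotate / crop step might look roughly
like the sketch below. This is not the actual script mentioned above, just
an illustration in plain Python with no imaging library: a page is assumed
to be a list of rows of 8-bit grey values (0 = black, 255 = white), and the
function names and the `white=230` cut-off are made up for the example.

```python
# Hypothetical helpers, not the original scripts. A page is a list of
# rows; each row is a list of 8-bit grey values (0 = black, 255 = white).

def split_spread(page):
    """Cut a two-pages-at-once scan down the middle into (left, right)."""
    mid = len(page[0]) // 2
    left = [row[:mid] for row in page]
    right = [row[mid:] for row in page]
    return left, right


def crop_white_border(page, white=230):
    """Trim outer rows and columns whose pixels are all >= `white`.

    The 230 cut-off is an assumed value, not one taken from the post.
    """
    rows = [y for y, row in enumerate(page) if min(row) < white]
    if not rows:
        return page  # blank page: keep it as scanned
    cols = [x for x in range(len(page[0]))
            if min(row[x] for row in page) < white]
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in page[top:bottom + 1]]


def rotate_90_clockwise(page):
    """Rotate a page by 90 degrees clockwise."""
    return [list(col) for col in zip(*page[::-1])]


if __name__ == "__main__":
    spread = [
        [255, 255, 0, 255, 255, 10, 255, 255],
        [255, 255, 255, 255, 255, 255, 255, 255],
    ]
    left, right = split_spread(spread)
    print(crop_white_border(left), crop_white_border(right))
```

Blank pages are deliberately returned untouched rather than cropped to
nothing, so they still come out as a full page later on.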

...
   Do you directly scan b/w, or first use grayscale/colour and then
   degrade that to b/w?
I use 8 bit greyscale for all pages, 24bit colour for covers. The former
because I don't want my scanning process to harm future OCR attempts;
I'd rather leave as much info in the images as possible rather than chop
to b/w at a certain threshold and find much later down the line that
information had been lost. If the assumption is that the docs will
never be OCR'd then that's not a problem though. I could probably get
away with 16 grey levels actually rather than 256 as a reasonable trade-
off between flexibility and storage requirements.
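
To put rough numbers on the 16-versus-256 trade-off, here's a small sketch.
The function names are made up for the example, dropping the low four bits
is only one possible way of getting to 16 levels, and the sizes are raw
uncompressed figures (real TIFFs will normally be compressed, so treat them
as an upper bound).

```python
def quantise_to_16_levels(pixels):
    """Map 8-bit grey values (0-255) to 4-bit values (0-15).

    Dropping the low four bits is an assumed method; other mappings exist.
    """
    return [p >> 4 for p in pixels]


def pack_nibbles(levels):
    """Pack two 4-bit values into each byte, padding odd lengths with 0."""
    if len(levels) % 2:
        levels = levels + [0]
    return bytes((hi << 4) | lo
                 for hi, lo in zip(levels[0::2], levels[1::2]))


def raw_size_bytes(width_mm, height_mm, dpi, bits_per_pixel):
    """Uncompressed image size for a page scanned at the given resolution."""
    width_px = round(width_mm / 25.4 * dpi)
    height_px = round(height_mm / 25.4 * dpi)
    return width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    # A4 is 210 x 297 mm, i.e. 2480 x 3508 pixels at 300dpi.
    print(raw_size_bytes(210, 297, 300, 8))  # 8699840 bytes at 256 levels
    print(raw_size_bytes(210, 297, 300, 4))  # 4349920 bytes at 16 levels
```

So before compression an A4 page at 300dpi is about 8.7 MB with 256 grey
levels and about 4.3 MB with 16.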

...
  - How do you work on the scanned images: Do you cut off the white rim as
   much as possible? How do you deal with images that are a tad rotated?
   Accept that? Re-scan to hopefully get a better image? Revert rotation
   in software?  
I've been known to clean up scans and sort out rotation (in particular
photocopied docs tend to be not very straight). It's a time-consuming
job though; I'm almost tempted to say it's not relevant at the scanning
stage and can be deferred (much like the OCR process). 

...
  How do you deal with single black dots in white areas or the other way
  around?
I don't. Lots of storage space can be saved by altering the black and
white threshold of the image though (without checking, I think I found
treating the bottom 10% as black and the top 20% as white worked
particularly well).
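
As a sketch of that thresholding, assuming the percentages refer to the
8-bit grey range (so values up to 25 become pure black and values from 204
upwards become pure white), something like the following would do it. The
function name and parameters are invented for the example.

```python
def clamp_extremes(pixels, black_pct=10, white_pct=20, max_value=255):
    """Force near-black pixels to 0 and near-white pixels to max_value.

    Assumes the percentages are fractions of the grey range, which is one
    reading of the figures quoted above. Mid-range values are untouched.
    """
    black_cut = max_value * black_pct // 100          # 25 for 8-bit
    white_cut = max_value * (100 - white_pct) // 100  # 204 for 8-bit
    out = []
    for p in pixels:
        if p <= black_cut:
            out.append(0)
        elif p >= white_cut:
            out.append(max_value)
        else:
            out.append(p)
    return out


if __name__ == "__main__":
    print(clamp_extremes([0, 25, 26, 128, 203, 204, 255]))
    # prints [0, 0, 26, 128, 203, 255, 255]
```

The likely reason it saves space is that a noisy "white" background turns
into long runs of identical values, which most lossless compression
schemes handle far better than scanner noise.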

...
  - What digital format do you like to get when it's all finished? Plain
   PDF? PDF with some bookmarks? PDF with all headings as bookmarks? A
   new PDF-hyperref based index? Multiple TIFF/PNG/whatever images?
   Something like a web-based slide-show? ...or multiple formats
   (web-based for viewing, PDF for printing, ...)? 
Separate TIFF images, one image per page in the original doc. I scan
blank pages too (!) just so that things will work out at any printing
stage. I suppose I want to capture a bit of the "feel" of the original
document as well as the raw content.

...
  - What do you currently use as your software:

 	Operating system: 
Linux for processing / scripting images; the scanner's currently an awful
USB thing hooked up to a Windows box though (urgh).

...
  	PDF viewer: 
Try to avoid it where possible. I think I'll end up settling on Acrobat
5.1 under Linux; 7.0 is way too bloated. XPDF and KDE's offering don't
handle PDFs of document scans at all well. Ghostscript seems to have
issues with rendering quality (that could just be a misconfig / RTFM
issue on my part though)

...
  	TIFF viewer: 
GQView's my current favourite general image viewer; it seems fast at
decoding and does a good job as a browser / thumbnail handler. I've not
tried it on multi-page TIFFs yet though to see what it does...

...
  	Browser/other viewers you'd love to use: 
Well, I'd still like a native Linux version of Paint Shop Pro... :)

cheers

Jules
