Linearizing PDF scans

J. David Bryan jdbryan at acm.org
Fri Aug 13 17:15:21 CDT 2021


On Friday, August 13, 2021 at 17:23, Alexandre Souza wrote:

> Is there any kind of standard, recommendation, group, or mailing list to
> discuss the subject?

I am not aware of any.  I started with Al Kossow's basic recommendations, 
modified slightly:

  - scan at 600 dpi
  - use TIFF G4 where feasible
  - use tumble to convert to PDF

I then wrote, and now use, a couple of simple image-processing utilities 
based on the Leptonica image library:

  http://www.leptonica.org/

...to clean up the scans (the library makes the programs pretty trivial).  
They start with the raw scans and:

  - mask the edges to remove hole punches, etc.
  - size to exactly 8.5" x 11" (or larger, for fold-out pages)
  - remove random noise dots (despeckle)
  - rotate to straighten (deskew)
  - descreen photos on pages into continuous-tone images
  - quantize and solidify screened color areas into solid areas
  - assign page numbers and bookmarks in the PDF
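
As a rough illustration of two of the steps above (sizing and despeckling), 
here is a minimal pure-Python sketch.  It is not the Leptonica-based code 
described in this message: the function names, the list-of-rows image 
representation (1 = black, 0 = white), and the 4-pixel speckle threshold are 
all assumptions made for the example.

```python
from collections import deque

DPI = 600
PAGE_W = int(8.5 * DPI)   # 5100 pixels across for an 8.5" page
PAGE_H = 11 * DPI         # 6600 pixels down for an 11" page


def fit_to_canvas(img, width, height):
    """Center-crop or white-pad a bilevel image to exactly width x height."""
    src_h = len(img)
    src_w = len(img[0]) if img else 0
    # Offset of the canvas origin within the source; negative means padding.
    off_y = (src_h - height) // 2
    off_x = (src_w - width) // 2
    out = []
    for y in range(height):
        sy = y + off_y
        row = []
        for x in range(width):
            sx = x + off_x
            inside = 0 <= sy < src_h and 0 <= sx < src_w
            row.append(img[sy][sx] if inside else 0)
        out.append(row)
    return out


def despeckle(img, max_pixels=4):
    """Clear 8-connected black components of at most max_pixels pixels."""
    h = len(img)
    w = len(img[0]) if img else 0
    out = [row[:] for row in img]
    seen = [[False] * w for _ in range(h)]
    for y0 in range(h):
        for x0 in range(w):
            if not img[y0][x0] or seen[y0][x0]:
                continue
            # Flood-fill to collect one connected component.
            component = []
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny in range(y - 1, y + 2):
                    for nx in range(x - 1, x + 2):
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(component) <= max_pixels:
                for y, x in component:
                    out[y][x] = 0
    return out
```

On a real 600 dpi page a library such as Leptonica would do this far faster; 
the sketch is only meant to show what "size to exactly 8.5 x 11" and "remove 
random noise dots" amount to on a bilevel raster.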

A good example PDF produced by these programs is:

  http://www.bitsavers.org/pdf/hp/64000/software/64500-90912_Mar-1986.pdf

The cover is a "solidified" black/gray/white image, manual pages 1-2 and 
1-4 are continuous-tone JPEG images overlaying bilevel text images, and the 
rest of the pages are masked, deskewed, bilevel text images.  The PDF 
bookmarks and logical page numbers are auto-generated from the original 
scan filenames.
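
The filename convention itself is not given here.  Purely as an illustration, 
the sketch below assumes a hypothetical scheme of 
"<sequence>_<page label>[_<bookmark title>].tif" and shows how logical page 
labels and bookmarks could be derived from such names.  The pattern, the 
function names, and the sample filenames are all invented for the example.

```python
import re
from pathlib import PurePath

# Hypothetical naming scheme (an assumption, not the author's):
#   "<seq>_<label>[_<bookmark>].tif"
# e.g. "002_1-1_Introduction.tif" -> physical page 2, logical page "1-1",
# bookmark "Introduction".  Hyphens in the bookmark part stand for spaces.
NAME_RE = re.compile(r"^(?P<seq>\d+)_(?P<label>[^_]+)(?:_(?P<title>.+))?$")


def parse_scan_name(filename):
    """Split one scan filename into sequence number, page label, bookmark."""
    stem = PurePath(filename).stem
    m = NAME_RE.match(stem)
    if m is None:
        raise ValueError(f"unrecognized scan filename: {filename!r}")
    title = m.group("title")
    return {
        "seq": int(m.group("seq")),
        "label": m.group("label"),
        "bookmark": title.replace("-", " ") if title else None,
    }


def build_outline(filenames):
    """Return (page labels in scan order, [(bookmark title, page index)])."""
    pages = sorted((parse_scan_name(f) for f in filenames),
                   key=lambda p: p["seq"])
    labels = [p["label"] for p in pages]
    bookmarks = [(p["bookmark"], i)
                 for i, p in enumerate(pages) if p["bookmark"]]
    return labels, bookmarks
```

The resulting label and bookmark lists would then be handed to whatever tool 
writes the PDF; that step is omitted because it depends on the tool used.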

The final step is linearizing the PDFs (optimizing them for "fast web 
view"), but I'm wondering whether this is still useful.
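
For anyone wanting to see whether an existing PDF has been linearized, a 
quick heuristic is to look for the /Linearized key, which a linearized file 
carries in a parameter dictionary near the very start of the file.  The 
function below is a minimal sketch with an assumed name; it only checks for 
the marker and does not validate the hint tables, so a tool such as qpdf 
(--check-linearization) is the better choice for a real check.

```python
def looks_linearized(data):
    """Heuristic: True if /Linearized appears in the first 1024 bytes."""
    return data.startswith(b"%PDF-") and b"/Linearized" in data[:1024]


def file_looks_linearized(path):
    """Read only the head of the file at `path` and apply the heuristic."""
    with open(path, "rb") as f:
        return looks_linearized(f.read(1024))
```

A file that was linearized and later modified by an incremental update will 
still pass this test even though the linearization no longer applies, which 
is one reason to treat the result as a hint only.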

                                      -- Dave



More information about the cctalk mailing list