Linearizing PDF scans

Alexander Schreiber als at thangorodrim.ch
Sat Aug 14 18:20:05 CDT 2021


On Fri, Aug 13, 2021 at 06:15:21PM -0400, J. David Bryan via cctech wrote:
> On Friday, August 13, 2021 at 17:23, Alexandre Souza wrote:
> 
> > Is there any kind of standard, recommendation, group, or mailing
> > list to discuss the subject?
> 
> I am not aware of any.  I started with Al Kossow's basic recommendations, 
> modified slightly:
> 
>   - scan at 600 dpi
>   - use TIFF G4 where feasible
>   - use tumble to convert to PDF

My current toolchain for that (rough script sketch after the list):
 - scans at 600 dpi grayscale
 - compresses the raw scans with zip for archival and possible reruns
   (yes, I got bit by overly aggressive compression optimization in
    djvu, ask me about it, *grr*)
 - runs them through
     gm convert input.tiff -normalize -despeckle +dither -type bilevel output.tiff
 - uses tiff_findskew and pnmrotate to deskew them
 - compresses the tiff files with G4 for feeding into tesseract
 - uses tesseract to create per-page PDFs with an overlaid OCR text layer
   and separately dumps the OCR text into a .txt file for later indexing
 - bundles the per-page PDFs into a single PDF with pdfunite
 - finally archives, as a single commit into a git repo
   - the zip compressed raw scans
   - the OCR overlaid PDF
   - the OCRed txt
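
Roughly, the per-document driver looks like the sketch below. This is a
simplified version from memory - the file naming and the exact
tiff_findskew invocation are illustrative, not my actual scripts:

  #!/bin/sh
  # simplified sketch, not the real scripts: file names and the
  # tiff_findskew call are illustrative
  doc=manual
  zip -9 "${doc}-raw.zip" page-*.tiff   # keep the raw scans for reruns
  for page in page-*.tiff; do
      base="${page%.tiff}"
      # clean up: normalize, despeckle, threshold to bilevel
      gm convert "$page" -normalize -despeckle +dither \
          -type bilevel "${base}-bw.tiff"
      # deskew: measure the angle, rotate, recompress as G4 for tesseract
      angle=$(tiff_findskew "${base}-bw.tiff")
      tifftopnm "${base}-bw.tiff" | pnmrotate -noantialias "$angle" \
          | pnmtotiff -g4 > "${base}-g4.tiff"
      # OCR: searchable PDF plus plain text, per page
      tesseract "${base}-g4.tiff" "$base" pdf txt
  done
  cat page-*.txt > "${doc}.txt"         # combined text for indexing
  pdfunite page-*.pdf "${doc}.pdf"      # merge the per-page PDFs
  git add "${doc}-raw.zip" "${doc}.pdf" "${doc}.txt"
  git commit -m "archive ${doc}"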

While tesseract isn't perfect, it does a pretty good job. Copy-pasting OCRed
text from one of those PDFs opened in evince works remarkably well.
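
For a quick sanity check of the OCR layer without opening a viewer,
pdftotext from poppler-utils works too, e.g.:

  pdftotext manual.pdf - | less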

I mostly use it to avoid piling up mountains of paper from stuff like
invoices and tax bills, as well as the occasional "I'm not sure if I'm
ever going to look at this manual again, let's archive it just in case".

I should probably bundle the whole mess of scripts up and put it on
GitHub some day.

> 
> I then wrote and use a couple of simple image-processing utilities based on 
> the Leptonica image library:
> 
>   http://www.leptonica.org/

Thanks for the pointer, I'm going to take a look - apparently tesseract
uses Leptonica for some of its image processing work.
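
That's easy enough to confirm, by the way - the version banner lists the
Leptonica it was built against:

  tesseract --version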

> 
> ...to clean up the scans (the library makes the programs pretty trivial).  
> They start with the raw scans and:
> 
>   - mask the edges to remove hole punches, etc.
>   - size to exactly 8.5" x 11" (or larger, for fold-out pages)
>   - remove random noise dots (despeckle)
>   - rotate to straighten (deskew)
>   - descreen photos on pages into continuous-tone images
>   - quantize and solidify screened color areas into solid areas
>   - assign page numbers and bookmarks in the PDF
> 
> A good example PDF produced by these programs is:
> 
>   http://www.bitsavers.org/pdf/hp/64000/software/64500-90912_Mar-1986.pdf

That is a very nice and clean scan!

> 
> The cover is a "solidified" black/gray/white image, manual pages 1-2 and 
> 1-4 are continuous-tone JPEG images overlaying bilevel text images, and the 
> rest of the pages are masked, deskewed, bilevel text images.  The PDF 
> bookmarks and logical page numbers are auto-generated from the original 
> scan filenames.
> 
> The final step is linearizing the PDFs, but I'm wondering whether this is 
> still useful.

What is that? I hadn't heard of linearizing PDFs before. So far my plan
has been to eventually adjust my pipeline to properly support PDF/A (the
archival profile), but I haven't gotten around to looking into it.
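
From a quick look, qpdf seems to be able to handle it as a simple
post-processing step, should it turn out to still be worthwhile:

  qpdf --linearize input.pdf output.pdf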

Kind regards,
           Alex.
-- 
"Opportunity is missed by most people because it is dressed in overalls and
 looks like work."                                      -- Thomas A. Edison

