Linearizing PDF scans
Tony Aiuto
tony.aiuto at gmail.com
Fri Aug 13 17:23:29 CDT 2021
On Fri, Aug 13, 2021, 6:15 PM J. David Bryan via cctech <
cctech at classiccmp.org> wrote:
> On Friday, August 13, 2021 at 17:23, Alexandre Souza wrote:
>
> > Is there any kind of standard, recommendation, group, or mailing list to
> > discuss the subject?
>
> I am not aware of any. I started with Al Kossow's basic recommendations,
> modified slightly:
>
> - scan at 600 dpi
> - use TIFF G4 where feasible
> - use tumble to convert to PDF
>
> I then wrote and use a couple of simple image-processing utilities based on
> the Leptonica image library:
>
> http://www.leptonica.org/
>
> ...to clean up the scans (the library makes the programs pretty trivial).
> They start with the raw scans and:
>
> - mask the edges to remove hole punches, etc.
> - size to exactly 8.5" x 11" (or larger, for fold-out pages)
> - remove random noise dots (despeckle)
> - rotate to straighten (deskew)
> - descreen photos on pages into continuous-tone images
> - quantize and solidify screened color areas into solid areas
> - assign page numbers and bookmarks in the PDF
>
> A good example PDF produced by these programs is:
>
> http://www.bitsavers.org/pdf/hp/64000/software/64500-90912_Mar-1986.pdf
>
> The cover is a "solidified" black/gray/white image, manual pages 1-2 and
> 1-4 are continuous-tone JPEG images overlaying bilevel text images, and the
> rest of the pages are masked, deskewed, bilevel text images. The PDF
> bookmarks and logical page numbers are auto-generated from the original
> scan filenames.
>
> The final step is linearizing the PDFs, but I'm wondering whether this is
> still useful.
>
> -- Dave
It is of negative value. Any single container for a document makes it
easier to handle than a bunch of discrete page files that must be managed
as a unit. Bandwidth is cheaper than human labor. Don't optimize the wrong
thing.
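
For anyone who wants to see whether an existing PDF is already linearized
before deciding if the step matters, here is a minimal Python sketch. It
relies on the PDF specification's rule that a linearized file carries its
linearization parameter dictionary (with a /Linearized key) within the first
1024 bytes, so it is only a heuristic and does not validate the hint tables.
The function names are made up for illustration, and qpdf is my assumption
for the tool; nothing in this thread says which linearizer is actually used.

```python
import sys

# Per the PDF spec, a linearized file keeps its linearization parameter
# dictionary entirely within the first 1024 bytes of the file.
HEADER_WINDOW = 1024


def looks_linearized(data: bytes) -> bool:
    """Heuristic check: PDF header plus a /Linearized key near the start."""
    head = data[:HEADER_WINDOW]
    return b"%PDF-" in head and b"/Linearized" in head


def qpdf_linearize_command(src: str, dst: str) -> list:
    """Build (but do not run) a qpdf command that writes a linearized copy.

    Assumption: qpdf is the tool in use; the thread does not name one.
    """
    return ["qpdf", "--linearize", src, dst]


if __name__ == "__main__":
    # Usage: python check_linearized.py file1.pdf file2.pdf ...
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            head = f.read(HEADER_WINDOW)
        state = "linearized" if looks_linearized(head) else "not linearized"
        print(path, state)
```

Only the head of each file is read, so this is cheap to run over a whole
directory of scans. A file that passes the check may still have broken hint
tables; a real validator would have to parse the file.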