Converting Documents

Thu Apr 9 09:35:20 CDT 2020

On 2020-04-09 10:16 AM, emanuel stiebler via cctalk wrote:
> Hi All,
> somebody scanned documents for me in .pdfs.
> Looking into them, they are pages of jpgs embedded in .pdf ..
> (100 pages resulting in 350MBytes ...)
> 
> Any easy way to convert them into some b/w .pdf file?
> It is all text, no drawings ...
> 
> Pointers?
> 
> Thanks
> 

Typically I extract the embedded page images using pdfimages:

$ pdfimages
pdfimages version 4.00
Copyright 1996-2017 Glyph & Cog, LLC
Usage: pdfimages [options] <PDF-file> <image-root>

You can then use GraphicsMagick to threshold each image to bilevel (a
suitable threshold can be found by inspecting the image or its
histogram, e.g. in Photoshop).

  gm mogrify -threshold XX% -monochrome <image-files>

(mogrify overwrites the files in place; alternatively `gm convert` can
write each page out as TIFF for the next step)

Then I'd go via TIFF: combine all the pages into one multipage file
with G4 (CCITT Group 4) compression using `tiffcp -c g4`. If you want a
PDF instead of a multipage TIFF, you can then transcode to PDF with
`tiff2pdf`.

tiffcp and tiff2pdf are libtiff utilities.
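
Put together, the steps above might look roughly like the sketch below.
It is untested against real scans and makes several assumptions: the
function name `scan_to_g4_pdf` and the `page` image root are made up
for illustration, the 50% default threshold is only a placeholder to be
tuned per scan, and `pdfimages -j` is used so that embedded JPEGs are
written out as .jpg files.

```shell
# Sketch only: JPEG-in-PDF scan -> bilevel G4 PDF.
# Assumes pdfimages, gm (GraphicsMagick), tiffcp and tiff2pdf are on PATH.
# Usage: scan_to_g4_pdf input.pdf output.pdf [threshold]
scan_to_g4_pdf() (
    in_pdf=$1
    out_pdf=$2
    threshold=${3:-50%}          # placeholder default, tune per scan

    work=$(mktemp -d) || exit 1
    trap 'rm -rf "$work"' EXIT   # remove the scratch directory on exit

    # 1. Extract the embedded page images (-j keeps JPEGs as .jpg).
    pdfimages -j "$in_pdf" "$work/page" || exit 1

    # 2. Threshold each page to bilevel and write it as TIFF.
    for img in "$work"/page-*; do
        gm convert "$img" -threshold "$threshold" -monochrome \
            "${img%.*}.tif" || exit 1
    done

    # 3. Combine all pages into one multipage TIFF with G4 compression.
    tiffcp -c g4 "$work"/page-*.tif "$work/all.tif" || exit 1

    # 4. Wrap the multipage TIFF in a PDF.
    tiff2pdf -o "$out_pdf" "$work/all.tif"
)
```

For example, `scan_to_g4_pdf scan.pdf scan-bw.pdf 60%` would process
scan.pdf with a 60% threshold. The body runs in a subshell so that the
scratch directory is cleaned up without touching the caller's shell.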

There might be a shortcut using different tools but those are the tools
I use.

--Toby