On 2020-04-09 10:16 AM, emanuel stiebler via cctalk wrote:
> Hi All,
> somebody scanned documents for me in .pdfs.
> Looking into them, they are pages of jpgs embedded in .pdf ..
> (100 pages resulting in 350MBytes ...)
> Any easy way to convert them into some b/w .pdf file?
> It is all text, no drawings ...
> Pointers?
> Thanks
Typically I extract the embedded images using pdfimages:
$ pdfimages
pdfimages version 4.00
Copyright 1996-2017 Glyph & Cog, LLC
Usage: pdfimages [options] <PDF-file> <image-root>
You can then use GraphicsMagick to threshold to bilevel (a suitable
threshold can be found by inspecting or histogramming the image e.g. in
Photoshop).
gm mogrify -threshold XX% -monochrome <image-files>
(mogrify overwrites the files in place; alternatively `gm convert` can
write each page out as TIFF for the next step)
Then I'd go via TIFF, combining all pages into one multipage file with
G4 (CCITT Group 4) compression using `tiffcp -c g4`. If you want a PDF
instead of a multipage TIFF, you can then transcode to PDF with `tiff2pdf`.
tiffcp and tiff2pdf are libtiff utilities.
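Strung together, the steps above might look roughly like the following. This is only a sketch: the function names (`run`, `pdf_to_g4`), the work directory, the `DRY_RUN` switch and the 50% default threshold are placeholders of mine, and it assumes pdfimages, gm, tiffcp and tiff2pdf are all on the PATH. Check the threshold against your own scans first.

```shell
# Sketch only: names, work directory and the 50% default are placeholders.

# Print a command, then run it unless DRY_RUN=1 (lets you preview the steps).
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-0}" != 1 ]; then
        "$@"
    fi
}

# Usage: pdf_to_g4 INPUT.pdf OUTPUT.pdf [THRESHOLD] [WORKDIR]
pdf_to_g4() {
    in=$1
    out=$2
    thr=${3:-50%}          # assumed default; pick a value from the histogram
    work=${4:-./g4work}    # assumed scratch directory

    mkdir -p "$work/tif" || return 1

    # 1. Extract the embedded page images as $work/page-<number>.<ext>
    run pdfimages "$in" "$work/page" || return 1

    # 2. Threshold each image to bilevel and write it out as TIFF.
    for img in "$work"/page-*; do
        [ -e "$img" ] || continue   # glob matched nothing (e.g. dry run)
        name=$(basename "$img")
        run gm convert "$img" -threshold "$thr" -monochrome \
            "$work/tif/${name%.*}.tif" || return 1
    done

    # 3. Combine all pages into one multipage TIFF with G4 compression.
    run tiffcp -c g4 "$work"/tif/page-*.tif "$work/all.tif" || return 1

    # 4. Wrap the multipage TIFF in a PDF.
    run tiff2pdf -o "$out" "$work/all.tif"
}
```

For example, `DRY_RUN=1 pdf_to_g4 scan.pdf scan-bw.pdf 60%` just prints the commands; drop `DRY_RUN=1` to actually run them. The TIFFs go into a separate subdirectory so that a second run does not pick them up as input images.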
There might be a shortcut using different tools but those are the tools
I use.
--Toby