toby at telegraphics.com.au
Thu Apr 9 09:35:20 CDT 2020
On 2020-04-09 10:16 AM, emanuel stiebler via cctalk wrote:
> Hi All,
> somebody scanned documents for me in .pdfs.
> Looking into them, they are pages of jpgs embedded in .pdf ..
> (100 pages resulting in 350MBytes ...)
> Any easy way to convert them into some b/w .pdf file?
> It is all text, no drawings ...
Typically I extract the embedded images using pdfimages:
pdfimages version 4.00
Copyright 1996-2017 Glyph & Cog, LLC
Usage: pdfimages [options] <PDF-file> <image-root>
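A minimal sketch of that extraction step, assuming the scan is called scan.pdf (a made-up name). The function name `extract_pages` and the default prefix `page` are only for illustration. `-j` asks pdfimages to write embedded JPEG data out as .jpg files instead of converting it to PPM/PGM; how the output files are numbered depends on your pdfimages build.

```shell
# Sketch only: extract_pages and its default prefix are illustrative names.
extract_pages() {
    pdf=$1            # input PDF, e.g. scan.pdf
    root=${2:-page}   # prefix for the extracted image files
    # -j: save JPEG (DCT) images as .jpg files rather than PPM/PGM
    pdfimages -j "$pdf" "$root"
}

# Usage: extract_pages scan.pdf page
```

After this you should have one image file per scanned page, all starting with the chosen prefix.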
You can then use GraphicsMagick to threshold to bilevel (a suitable
threshold can be found by inspecting or histogramming the image e.g. in
gm mogrify -threshold XX% -monochrome
(or `gm convert` can convert each page to TIFF for the next step)
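Roughly, the `gm convert` variant could look like the sketch below. The function name `to_bilevel_tiff` is invented for illustration, and the threshold is passed in because the right value depends on the scans. The assumption here is that the extracted pages are JPEGs; each one is written next to the original with a .tif extension.

```shell
# Sketch only: to_bilevel_tiff is an illustrative name.
# First argument: threshold (e.g. 60%); remaining arguments: image files.
to_bilevel_tiff() {
    threshold=$1
    shift
    for img in "$@"; do
        # Threshold to black/white and write a TIFF with the same base name
        gm convert "$img" -threshold "$threshold" -monochrome "${img%.*}.tif"
    done
}

# Usage: to_bilevel_tiff 60% page-*.jpg
```

It is worth checking one output file (for example with `tiffinfo`) to confirm it really is 1 bit per sample, since G4 compression only applies to bilevel images.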
Then I'd go via TIFF, combining all pages into one file with G4
compression using `tiffcp -c g4`. If you want a PDF instead of a
multipage TIFF, you can then transcode to PDF with `tiff2pdf`.
tiffcp and tiff2pdf are libtiff utilities.
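The last two steps might be sketched as follows, assuming the bilevel pages already exist as single-page TIFFs. The function name `combine_to_pdf` and the file names are placeholders; the intermediate multipage TIFF is simply named after the PDF.

```shell
# Sketch only: combine_to_pdf is an illustrative name.
# First argument: output PDF; remaining arguments: single-page bilevel TIFFs.
combine_to_pdf() {
    out=$1
    shift
    multi=${out%.pdf}.tif   # intermediate multipage TIFF
    # Concatenate all pages into one TIFF, compressed as CCITT Group 4
    tiffcp -c g4 "$@" "$multi" || return 1
    # Wrap the multipage TIFF in a PDF
    tiff2pdf -o "$out" "$multi"
}

# Usage: combine_to_pdf scan-bw.pdf page-*.tif
```

The shell expands `page-*.tif` in sorted order, so zero-padded page numbers keep the pages in sequence.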
There might be a shortcut using different tools but those are the tools