Converting Documents - test-drb@ccmp.vtda.org - classiccmp.org

Converting Documents

emu＠e-bbes.com

9 Apr 2020

7:16 a.m.

Hi All, somebody scanned documents for me in .pdfs. Looking into them, they are pages of jpgs embedded in .pdf .. (100 pages resulting in 350MBytes ...) Any easy way to convert them into some b/w .pdf file? It is all text, no drawings ... Pointers? Thanks

paulkoning＠comcast.net

9 Apr 2020

7:24 a.m.

On Apr 9, 2020, at 10:16 AM, emanuel stiebler via cctalk <cctalk at classiccmp.org> wrote:

Hi All, somebody scanned documents for me in .pdfs. Looking into them, they are pages of jpgs embedded in .pdf .. (100 pages resulting in 350MBytes ...) Any easy way to convert them into some b/w .pdf file? It is all text, no drawings ... Pointers? Thanks

A good source of information is Al Kossow's Bitsavers archive, the section where he describes the tools he uses.

It's very unfortunate your original scan files are JPG; those are the wrong format for text or line art -- JPG is ONLY for photographs and similar continuous tone images. TIFF or PNG or B/W FAX formats are all superior, and often more compact.

If by "convert to b/w" you mean to b/w images, Al's tools will help. If you mean extracting the actual text, that's a different matter, now you need an OCR tool. There are good commercial OCR programs around. No open source ones that I know of; I've seen one but it didn't work well enough to be worth the trouble.

OCR may be extremely effective or not at all depending on the quality of the material. In really extreme cases you may have to type things in by hand; I've done that with 600 pages of blurry listings because there was a good reason to go to that effort.

paul

toby＠telegraphics.com.au

7:35 a.m.

On 2020-04-09 10:16 AM, emanuel stiebler via cctalk wrote:

Hi All, somebody scanned documents for me in .pdfs. Looking into them, they are pages of jpgs embedded in .pdf .. (100 pages resulting in 350MBytes ...) Any easy way to convert them into some b/w .pdf file? It is all text, no drawings ... Pointers? Thanks

Typically I extract using pdfimages

    $ pdfimages
    pdfimages version 4.00
    Copyright 1996-2017 Glyph & Cog, LLC
    Usage: pdfimages [options] <PDF-file> <image-root>

You can then use GraphicsMagick to threshold to bilevel (a suitable threshold can be found by inspecting or histogramming the image e.g. in Photoshop).

    gm mogrify -threshold XX% -monochrome

(or `gm convert` can convert each page to TIF for the next step)

Then I'd go via TIFF, combining and compressing all pages as G4 compression using `tiffcp -c g4`, then if you want a PDF instead of multipage tiff, you can transcode to PDF with `tiff2pdf`. tiffcp and tiff2pdf are libtiff utilities.

There might be a shortcut using different tools but those are the tools I use.

--Toby
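The steps in the reply above can be strung together as one small shell function. This is only a sketch under stated assumptions, not the poster's actual script: the function name `scan_to_g4_pdf`, the `run` helper, the `DRY_RUN` switch, the file names, and the 50% default threshold are all made up for illustration. It also assumes the embedded page images really are JPEGs, so that `pdfimages -j` writes them out as `.jpg` files.

```shell
#!/bin/sh
# Sketch: scanned PDF -> bilevel pages -> G4 multipage TIFF -> PDF.
# Hypothetical names: scan_to_g4_pdf, run, DRY_RUN, and all file names.
set -eu

# Run a command, or just print it when DRY_RUN=1 (handy for checking first).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

scan_to_g4_pdf() {
    in_pdf=$1               # scanned PDF with one embedded JPEG per page
    out_pdf=$2              # bilevel, G4-compressed result
    threshold=${3:-50%}     # assumed default; choose by inspecting a page
    root=${4:-page}         # image root passed to pdfimages

    # 1. Extract the embedded images. -j writes JPEG (DCT) images out as
    #    .jpg files instead of converting them to PBM/PGM/PPM.
    run pdfimages -j "$in_pdf" "$root"

    # 2. Threshold each page to a bilevel TIFF with GraphicsMagick.
    for img in "$root"-*.jpg; do
        run gm convert "$img" -threshold "$threshold" -monochrome "${img%.jpg}.tif"
    done

    # 3. Combine all pages into one multipage TIFF with G4 compression.
    #    The zero-padded page numbers keep the glob in page order.
    run tiffcp -c g4 "$root"-*.tif "$root-all.tif"

    # 4. Transcode the multipage TIFF to PDF.
    run tiff2pdf -o "$out_pdf" "$root-all.tif"
}
```

A call might look like `scan_to_g4_pdf scan.pdf scan-bw.pdf 55%`. Setting `DRY_RUN=1` first prints the four commands instead of running them, which is a cheap way to check the arguments before committing to 100 pages. Going through separate `.tif` files rather than `gm mogrify` in place leaves the extracted JPEGs untouched, so a different threshold can be tried without re-extracting.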
