On Apr 9, 2020, at 10:16 AM, emanuel stiebler via
cctalk <cctalk at classiccmp.org> wrote:
Hi All,
Somebody scanned documents for me into .pdfs.
Looking into them, they are pages of jpgs embedded in .pdf
(100 pages resulting in 350MBytes ...)
Any easy way to convert them into some b/w .pdf file?
It is all text, no drawings ...
Pointers?
Thanks
A good source of information is Al Kossow's Bitsavers archive, the section where he
describes the tools he uses.
It's very unfortunate your original scan files are JPG; that is the wrong format for
text or line art -- JPG is lossy and meant ONLY for photographs and similar continuous-tone
images, and it smears the sharp edges of text. TIFF, PNG, or B/W FAX formats are all
superior for this, and often more compact.
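As a rough illustration of "often more compact", here is the arithmetic in Python. The scan geometry (US Letter at 300 dpi) is my assumption, as is reading "350MBytes" as decimal megabytes; neither is stated in the question.

```python
import math

# Figures from the original question.
total_bytes = 350 * 10**6      # assumes decimal megabytes
pages = 100
jpeg_per_page = total_bytes / pages   # 3,500,000 bytes per page

# Assumed scan geometry (not stated in the thread): US Letter at 300 dpi.
width_px = int(8.5 * 300)      # 2550 pixels
height_px = 11 * 300           # 3300 pixels

# 1 bit per pixel, with each row padded to a whole number of bytes.
row_bytes = math.ceil(width_px / 8)
bilevel_per_page = row_bytes * height_px

ratio = jpeg_per_page / bilevel_per_page
print(f"JPG per page:     {jpeg_per_page:,.0f} bytes")
print(f"1-bit per page:   {bilevel_per_page:,} bytes (uncompressed)")
print(f"JPG is about {ratio:.1f}x larger")
```

Under those assumptions, a 1-bit page is already several times smaller than the JPG page before any compression; FAX-style (CCITT Group 4) or PNG compression would normally shrink it further.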
If by "convert to b/w" you mean to b/w images, Al's tools will help. If you
mean extracting the actual text, that's a different matter; for that you need an OCR tool.
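For the b/w image case, the core step is thresholding each grayscale pixel to black or white. Below is a minimal sketch in plain Python; it is not what Al's tools do. The function names, the fixed threshold of 128, and the tiny sample "page" are made up for illustration. A real pipeline would first extract the JPG pages from the PDF and would probably choose the threshold per page. The output is a binary PBM (Netpbm "P4") image, where a 1 bit means black.

```python
def to_bilevel(gray_rows, threshold=128):
    """Map 8-bit grayscale rows to 1-bit rows: 1 = black ink, 0 = white paper."""
    return [[1 if px < threshold else 0 for px in row] for row in gray_rows]


def pack_pbm(bits):
    """Pack 1-bit rows into a binary PBM (P4) image, 8 pixels per byte, MSB first."""
    height = len(bits)
    width = len(bits[0]) if height else 0
    out = bytearray(f"P4\n{width} {height}\n".encode("ascii"))
    for row in bits:
        for start in range(0, width, 8):
            chunk = row[start:start + 8]
            byte = 0
            for bit in chunk:
                byte = (byte << 1) | bit
            byte <<= 8 - len(chunk)  # zero-pad the last byte of each row
            out.append(byte)
    return bytes(out)


# Hypothetical 10x2 sample: dark strokes (low values) on light paper (high values).
page = [
    [250, 240, 30, 20, 245, 250, 25, 248, 251, 35],
    [249, 35, 246, 250, 28, 247, 250, 30, 22, 249],
]
bits = to_bilevel(page)
pbm = pack_pbm(bits)
print(bits[0])
print(len(pbm), "bytes including header")
```

A fixed threshold is the simplest choice and tends to work on clean scans; faded or unevenly lit pages usually need something adaptive.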
There are good commercial OCR programs around. I don't know of any good open source ones;
I've seen one but it didn't work well enough to be worth the trouble. OCR may be
extremely effective or not effective at all, depending on the quality of the material. In really
extreme cases you may have to type things in by hand; I've done that with 600 pages of
blurry listings because there was a good reason to go to that effort.
paul