Converting Documents

Thu Apr 9 09:24:00 CDT 2020

> On Apr 9, 2020, at 10:16 AM, emanuel stiebler via cctalk <cctalk at classiccmp.org> wrote:
> 
> Hi All,
> somebody scanned documents for me in .pdfs.
> Looking into them, they are pages of jpgs embedded in .pdf ..
> (100 pages resulting in 350MBytes ...)
> 
> Any easy way to convert them into some b/w .pdf file?
> It is all text, no drawings ...
> 
> Pointers?
> 
> Thanks

A good source of information is Al Kossow's Bitsavers archive, the section where he describes the tools he uses.

It's very unfortunate your original scan files are JPG; that is the wrong format for text or line art.  JPG is ONLY for photographs and similar continuous tone images, because its lossy compression leaves artifacts around sharp edges such as text.  TIFF, PNG, or B/W FAX formats are all superior, and often more compact.
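To give a feel for why a bilevel format is compact, here is a minimal sketch, not anyone's actual tool.  It assumes the page has already been decoded into rows of 8-bit grayscale values (0 = black, 255 = white), and the function name gray_to_pbm and the fixed threshold of 128 are purely illustrative.  It thresholds each pixel and packs eight pixels per byte into a raw PBM (P4) image:

```python
def gray_to_pbm(rows, threshold=128):
    """Threshold 8-bit grayscale rows and return raw PBM (P4) bytes.

    rows is a list of equal-length lists of ints in 0..255.
    The threshold value is an assumption; real scans may need tuning.
    """
    height = len(rows)
    width = len(rows[0]) if height else 0
    out = bytearray(f"P4\n{width} {height}\n".encode("ascii"))
    for row in rows:
        byte = 0
        nbits = 0
        for pixel in row:
            # PBM convention: a 1 bit is black, a 0 bit is white.
            byte = (byte << 1) | (1 if pixel < threshold else 0)
            nbits += 1
            if nbits == 8:
                out.append(byte)
                byte = 0
                nbits = 0
        if nbits:
            # Each row is padded on the right to a whole number of bytes.
            out.append(byte << (8 - nbits))
    return bytes(out)


if __name__ == "__main__":
    # Tiny 10x1 test strip alternating black and white pixels.
    strip = [[0, 255] * 5]
    print(gray_to_pbm(strip))
```

Assuming a letter-size page scanned at 300 dpi (2550 x 3300 pixels), the raw 8-bit grayscale data is about 8.4 MB, while the packed 1-bit data is about 1.05 MB, and that is before any compression is applied.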

If by "convert to b/w" you mean to b/w images, Al's tools will help.  If you mean extracting the actual text, that's a different matter: now you need an OCR tool.  There are good commercial OCR programs around.  No open source ones that I know of; I've seen one but it didn't work well enough to be worth the trouble.  OCR may be extremely effective or not effective at all, depending on the quality of the material.  In really extreme cases you may have to type things in by hand; I've done that with 600 pages of blurry listings because there was a good reason to go to that effort.

	paul