Philip Pemberton wrote:
Hi guys,
I'm after a program that can convert TIFF files into PDFs. I've seen
Eric Smith's "Tumble" app, which works great... but only for B&W TIFFs.
While I can use ImageMagick to convert the images to B&W, that defeats
the point: there are photos on the scanned pages, and I'd rather like to
keep them as photos, not black splodges.
Hmm, what are you using for the ImageMagick command line? I've thrown TIFFs at
it before (at varying levels of bpp) and got sensible PDFs out the back end.
It's possible that it's a versioning/library issue, though - doesn't IM make a
call into someone else's PDF library to do the actual assembly? Maybe it's
this part that's "broken" on your particular setup.
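
For comparison, here is a minimal sketch of the sort of invocation I mean, wrapped in Python so the command line can be inspected before it is run. It assumes the classic ImageMagick `convert` binary is on the PATH. The helper names, the file names and the choice of JPEG compression are illustrative assumptions, not anything taken from your setup.

```python
import subprocess


def build_convert_command(tiff_paths, pdf_path, compress="jpeg"):
    """Build an ImageMagick command that turns TIFF pages into one PDF.

    No colour-reducing options are passed, so greyscale and colour
    pages are not deliberately flattened to black and white.
    """
    if not tiff_paths:
        raise ValueError("need at least one input TIFF")
    return ["convert", *tiff_paths, "-compress", compress, pdf_path]


def run_convert(tiff_paths, pdf_path):
    """Run the command; raises CalledProcessError if convert fails."""
    subprocess.run(build_convert_command(tiff_paths, pdf_path), check=True)


if __name__ == "__main__":
    # Hypothetical file names, shown for illustration only.
    print(" ".join(build_convert_command(["page1.tif", "page2.tif"], "manual.pdf")))
```

If a command along these lines still gives black splodges, that would point at the install rather than the options.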
Also, has anyone come up with a "best practice guide" for manual
scanning? At the moment I'm scanning like this:
B&W text only: 600dpi, black and white, threshold=50%.
I'd still do those at a few bpp unless the pages were totally free of
contaminants, creases etc. - I've seen cases where 1bpp introduces noise into
the image which can screw up a later OCR attempt. I'd question the threshold
too, unless you've got time to proof-read everything (things like scans from
dot-matrix printouts can vary quite a lot in tone, I've found, so I think it's
better to keep things "as-is" and consider things like threshold tweaks as
part of a subsequent "post-process" or OCR phase).
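
To illustrate why I'd leave the threshold until later, here is a rough sketch of thresholding as a separate post-process step on a greyscale scan. It works on a plain list of 8-bit pixel rows rather than a real image file, and the function name and the sample pixel values are made up for the example.

```python
def threshold(gray_rows, fraction=0.5):
    """Return a 1bpp copy (0 = black, 1 = white) of an 8-bit greyscale image.

    gray_rows is a list of rows, each a list of values 0..255.
    Pixels at or above fraction * 255 become white.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    cutoff = fraction * 255
    return [[1 if value >= cutoff else 0 for value in row] for row in gray_rows]


# Hypothetical pixels: paper (230), a faint dot-matrix stroke (140), solid ink (20).
scan = [[230, 140, 20]]
print(threshold(scan, 0.5))  # the faint stroke comes out white, i.e. it is lost
print(threshold(scan, 0.6))  # a higher cutoff keeps it as black
```

With the greyscale original kept, the cutoff can be retried per document. With a 1bpp scan, the 50% decision is baked in at scan time.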
I've usually done things at 300 or 400dpi (depending on the content) just to
keep the sizes down a bit - but with storage getting ever-cheaper there's
perhaps less incentive to do that now, and 600dpi is fine (more is probably
overkill unless you're trying to do things like fiche).
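
As a back-of-envelope check on the size trade-off, this sketch estimates the uncompressed size of one scanned page. It assumes a US Letter page (8.5 x 11 inches) and ignores row padding and file headers, so real TIFFs (especially compressed ones) will differ. The function name is just for illustration.

```python
def raw_page_bytes(dpi, bits_per_pixel, width_in=8.5, height_in=11.0):
    """Approximate uncompressed size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


for dpi in (300, 400, 600):
    for bpp in (1, 8):
        size_mb = raw_page_bytes(dpi, bpp) / 1_000_000
        print(f"{dpi}dpi, {bpp}bpp: {size_mb:.1f} MB")
```

Doubling the resolution quadruples the pixel count, so an 8-bit greyscale page at 600dpi is about 33.7 MB raw against about 8.4 MB at 300dpi.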
Obviously if there are better ways (in terms of quality and/or speed)
I'd like to know before I scan a ton of testgear manuals...
I didn't have a sheet-feeder, but a lot of my stuff was comb-bound (and/or I
had lots of data spread across manuals with low page counts). I did thousands
of pages by hand, and it was somewhat soul-destroying. :/
Also, does anyone know of an app that can take the PDF file, OCR it and
then insert the text as a background layer while leaving the image
alone?
Not me. I chose to delegate the OCR step to future generations (by which time
OCR will hopefully be a little better anyway) :-)
I couldn't handle scanning all the above content *and* proof-reading the
subsequent OCR (and personally I like physical printouts, so whether it's OCR
or images makes no difference to me)
cheers
Jules