John Foust wrote:
At 07:15 AM 10/12/2010, Steven Hirsch wrote:
I have one of these here. Would be glad to split
scanning duties with someone else.
I mentioned it here once before, but I adore my Fujitsu ScanSnap.
This is a ~$400 dedicated self-feeding scanner that comes bundled with
Acrobat, a simple doc organizing program, and the ability to OCR its PDFs
to make them searchable and copy-and-paste-able. Lots of options for
resolution, color or not, auto or not, rotation, saves sets of presets,
etc. Professional and useful.
I have the S1500M (Mac OS X flavor) and have been able to convince it
to produce files that I can make good enough for my purposes. What
follows are some notes on what I've found and how I deal with it.
Having been through the file boxes of old account statements, I turned
my attention to books of which I had multiple copies so that I could
have one at home and one at the office, for reference. One in PDF on
the MacBook would be somewhat handier.
I took Al's standards as a guideline. (See <http://bitsavers.org/>,
look for the heading "The PDF Document Format" and start reading
there, but I also got some verbal comments from Al and watched him do
some scans with his setup.) 400dpi black-and-white scans, maybe
600dpi if there is fine-pitch text. These resolutions are said to
support better OCR quality. Pages with primarily photographic
content may be scanned as "grayscale" or "color" but this leads to
them being saved through a DCT filter (i.e. JPEG-flavor lossy
compression) in the PDF. Sometimes you have to choose what sucks less.
I scan covers as 150dpi color documents (yielding JPEGs) with a
flatbed scanner. Yeah, a bit chunky, but good enough for me.
It can be important to create a ScanSnap scanning profile (preset)
that forces black-and-white scans. In my experience letting the
ScanSnap software make the decision how to scan a page will have it
scanning yellowed or other non-white-paper pages in color with
resulting use of the lossy DCT filter (and lower dpi too).
One of the books I scanned early on was the Turbo C 2.0 Reference
Guide. Yeah, I know, it's already up on bitsavers.org. I wanted to
see how the output from the ScanSnap Mac software compared with the
output from Al's process. The answer was, it produces PDF files that
are two to three times as large. This is because it encodes
black-and-white page scans with the Deflate filter, which is all very
well and good and not lossy, but the CCITTFax filter simply yields
better compression.
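
One rough way to see which filters a given PDF uses is to count the
filter names in the raw file. What follows is a minimal sketch in
Python, not anything the ScanSnap software or Acrobat provides. The
helper name count_filters is made up for illustration. It is only a
heuristic: it looks for the name tokens in the file's bytes, so it
will miss anything tucked away inside compressed object streams, and
it does not tell image streams apart from other streams.

```python
import re
import sys

# Stream filter names as defined by the PDF specification.
FILTERS = (b"FlateDecode", b"CCITTFaxDecode", b"DCTDecode", b"JBIG2Decode")


def count_filters(data: bytes) -> dict:
    """Count occurrences of each /FilterName token in raw PDF bytes.

    Hypothetical helper: a byte-level heuristic, not a real PDF parser.
    """
    counts = {}
    for name in FILTERS:
        # Match "/Name" only when it is not the prefix of a longer name.
        pattern = re.compile(rb"/" + name + rb"(?![A-Za-z0-9])")
        counts[name.decode("ascii")] = len(pattern.findall(data))
    return counts


if __name__ == "__main__":
    # Usage: python count_filters.py scan1.pdf scan2.pdf ...
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            print(path, count_filters(f.read()))
```

A file of black-and-white page scans that reports mostly FlateDecode
and no CCITTFaxDecode is probably a candidate for re-compression.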
Acrobat 8 Pro (bundled with the ScanSnap) can re-compress for you: you
set up a Preflight profile that includes the fixup "Compress all
monochrome images using CCITT Group 4", and then open the PDF and
execute that Preflight profile to re-compress the black-and-white page
scans.
I believe use of the Deflate filter is a (mis)feature of OS X Core
Graphics; I observe that if I edit a PDF with Preview and save it,
CCITTFax compressed monochrome images are re-compressed with the
Deflate filter and the file grows much larger.
Scanning is fast and easy and reasonably reliable. The ScanSnap is
pretty good about detecting paper jams and misfeeds and multiple-page
feeds; in 53000 pages put through it I think I've found one case
where it didn't detect a multi-page feed.
How do I detect that? I use Acrobat to number pages in the PDF with
numbers that correspond to the page numbers in the scanned book. If
they don't match up at the end I know I have a problem. Mind, for
some books this can take considerable time.
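
The arithmetic behind that check can be sketched as follows. This is
an illustration in Python, not the Acrobat procedure itself. The
function name first_mismatch and the idea of a list of spot checks
are assumptions: you note the printed page number at a few positions
in the scan and compare each with the number the page ought to carry
if nothing went missing.

```python
def first_mismatch(first_numbered_index, first_number, spot_checks):
    """Find the first spot check whose printed number is not the expected one.

    first_numbered_index: 0-based position in the PDF of the first numbered page
    first_number:         the number printed on that page
    spot_checks:          iterable of (pdf_index, printed_number) pairs

    Returns (pdf_index, printed_number, expected_number), or None if all agree.
    Hypothetical helper for illustration only.
    """
    for index, printed in sorted(spot_checks):
        expected = first_number + (index - first_numbered_index)
        if printed != expected:
            return index, printed, expected
    return None


# Page 1 of the book sits at PDF index 10; the check at index 109 shows
# printed page 102 where page 100 was expected, so two pages (one sheet,
# both sides) went missing somewhere between index 59 and index 109.
checks = [(10, 1), (59, 50), (109, 102)]
print(first_mismatch(10, 1, checks))
```

The difference between the printed and expected numbers tells you how
many pages are missing, and the last good spot check brackets where
to look.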
OCR is slow (a few pages per minute) but does sometimes come in handy
for searchability.
I'd like a version with an 11"-wide paper path: not only could I scan
some more stuff, I could feed more originals through sideways for a
shorter scan.
Numbering the pages in the PDF effectively makes two sets of page
numbers: physical page numbers (1, 2, 3, ..., n) and logical page
numbers (the page numbers you put down in the PDF). This works OK
when the PDF is viewed with Acrobat Reader or Preview: you see the
page numbers you put down. It's not so useful when the PDF is viewed
on an iOS or Android device; those readers show the physical page
numbers.
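
The relation between the two sets of numbers can be sketched like
this. It is a minimal Python illustration under assumptions: the
range layout (start index, style, first number) is loosely modelled
on how PDF page labels are described, with style "r" for lowercase
roman and "D" for decimal, and the names to_roman and page_label are
made up. It does not read or write a PDF.

```python
def to_roman(n):
    """Convert a positive integer to a lowercase roman numeral."""
    pairs = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    out = []
    for value, numeral in pairs:
        while n >= value:
            out.append(numeral)
            n -= value
    return "".join(out)


def page_label(physical_index, ranges):
    """Return the logical label for a 0-based physical page index.

    ranges: list of (start_index, style, first_number) tuples, where
    style is "r" (lowercase roman) or "D" (decimal).
    """
    current = None
    for start, style, first in sorted(ranges):
        if start <= physical_index:
            current = (start, style, first)
    if current is None:
        # No range covers this page: fall back to the physical number.
        return str(physical_index + 1)
    start, style, first = current
    number = first + (physical_index - start)
    return to_roman(number) if style == "r" else str(number)


# Eight pages of front matter (i..viii), then the body starting at page 1.
ranges = [(0, "r", 1), (8, "D", 1)]
print(page_label(3, ranges))    # iv
print(page_label(107, ranges))  # 100
```

A reader that ignores the labels would show those two pages as 4 and
108 instead.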
-Frank McConnell