Scanning question
Frank McConnell
fmc at reanimators.org
Fri Jul 19 17:13:09 CDT 2019
On Jul 18, 2019, at 14:50, Warner Losh via cctalk <cctalk at classiccmp.org> wrote:
> So, I have a bunch of old DEC Rainbow docs that aren't online. I also have
> a snapscan scanner that I use for bills and such.
I do this kind of thing, with a ScanSnap S1500M (M means Mac),
but mostly don't mind that the process is destructive to the
books. Really, the only things that survive this process are
the things that start out looseleaf, and as I’m trying to get
some silverfish food out of my life, most of them get recycled
too.
Really, the setup is a ScanSnap for the text block and a flatbed for
the covers. I scan covers and/or dust jackets first, on the
flatbed, usually 300dpi color. Sometimes front cover with spine
if I think the design is interesting.
Books come apart. Yes, glue bound books get crumbly bindings
after 30 years or so and come apart easily. Newer glue bound
books come apart less easily because the binding is still gummy
and gooey. You will still want a paper cutter or shear to clean
up the gutter by about 1/8 inch (would perhaps use 4mm if my paper
cutter had a metric scale) and make its edge less ragged and less
gooey.
The ScanSnap wants to scan a bunch of sheets/pages and make a PDF
for you. It can do an automatic post-scan OCR if you let it, and
that works well for account statements and other short documents.
Its OCR (which I think is a version of ABBYY that Fujitsu/PFUCA
licensed for use with the ScanSnap software) is not very good
at recognizing multiple columns or tables; it gets the characters
but not the layout.
The ScanSnap can also try to figure out whether a page image should
be scanned as black-and-white, as grayscale, or as color. There
are ways to control this if you’re not happy with its choices
through defining scanning profiles that influence and limit its
choices. So I scan black-and-white text at 400dpi or 600dpi (judgment
call). You may find you want to scan one book piecewise so you
can use black-and-white for the text-only parts and grayscale or
color for the photo plates.
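
To put rough numbers on why it can be worth splitting a book by scan mode, here is a minimal Python sketch of uncompressed page sizes. The function name, the mode labels and the US Letter page size are assumptions for illustration, not anything the ScanSnap software exposes, and real PDFs will be much smaller once compressed.

```python
# Bits per pixel for each scan mode (labels are illustrative only).
BITS_PER_PIXEL = {"bw": 1, "gray": 8, "color": 24}


def raw_page_bytes(width_in, height_in, dpi, mode):
    """Return the uncompressed size in bytes of one scanned page image.

    Ignores per-row padding and file headers; this is only an estimate.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bits = width_px * height_px * BITS_PER_PIXEL[mode]
    return (bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    # Assumed page size: US Letter, 8.5 x 11 inches.
    for mode in ("bw", "gray", "color"):
        for dpi in (300, 400, 600):
            size = raw_page_bytes(8.5, 11, dpi, mode)
            print(f"{mode:5s} {dpi} dpi: {size / 1e6:7.1f} MB")
```

At the same resolution a grayscale page is 8 times the raw size of a black-and-white one and a color page 24 times, which is the arithmetic behind keeping text-only sections in black-and-white.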
http://bitsavers.org/pdf/hp/portablePlus/45559-90001_Portable_PLUS_Technical_Reference_Manual_Aug1985.pdf
is an example of one (a looseleaf manual) that I did with a
ScanSnap, and I think I did it all in black-and-white at 400dpi.
You can see the holes punched for the three-ring binder.
Al would put white over them to hide them, but that's because
his scanner yields per-page TIFFs where he can get at that.
I have some shell/Perl/netpbm code that does that sort of cleanup
on the per-page file piles his sort of scanner produces, but haven’t
got round to writing something to turn a ScanSnap-produced PDF into a
bunch of per-page TIFFs.
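
For a sense of what "putting white over" punch holes amounts to on a per-page bitmap, here is a small Python sketch. It is not the shell/Perl/netpbm code mentioned above; it assumes the page is already in netpbm's plain PBM (P1) format, where 1 is black and 0 is white, and the function names and box coordinates are made up for the example.

```python
def parse_plain_pbm(text):
    """Parse a plain (P1) PBM into (width, height, rows); 1 = black, 0 = white."""
    tokens = []
    for line in text.splitlines():
        # Drop '#' comments, then split what is left on whitespace.
        tokens.extend(line.split("#", 1)[0].split())
    if not tokens or tokens[0] != "P1":
        raise ValueError("not a plain PBM (P1) file")
    width, height = int(tokens[1]), int(tokens[2])
    # Pixels may be written with or without whitespace between them.
    bits = [int(ch) for ch in "".join(tokens[3:])]
    if len(bits) != width * height:
        raise ValueError("pixel count does not match header")
    rows = [bits[r * width:(r + 1) * width] for r in range(height)]
    return width, height, rows


def white_out(rows, left, top, right, bottom):
    """Set pixels in the box [left, right) x [top, bottom) to white, in place."""
    for y in range(max(top, 0), min(bottom, len(rows))):
        row = rows[y]
        for x in range(max(left, 0), min(right, len(row))):
            row[x] = 0
    return rows


def format_plain_pbm(rows):
    """Serialize rows back to plain PBM, keeping lines under 70 characters."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    lines = ["P1", f"{width} {height}"]
    for row in rows:
        for i in range(0, len(row), 35):
            lines.append(" ".join(str(b) for b in row[i:i + 35]))
    return "\n".join(lines) + "\n"
```

Usage would be along the lines of parsing a page, calling white_out once per hole with a box measured from a sample page, and writing the result back out. Plain PBM is bulky for full-size scans, so this is only meant to show the idea on small images.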
You can use Adobe Acrobat Pro to gather a bunch of PDFs (and PNGs
and TIFFs) into a single PDF, put down page numbers, put down
bookmarks that mirror the table of contents. Eric Smith's tumble
can do some of these things but I also use Acrobat Pro 8 (which was
bundled with the S1500M) for OCR. Its OCR is based on something
other than ABBYY (I.R.I.S. I think) and does better at multiple
columns of text.
I do not expect OCR to be perfect, ever. I hope it will be good
enough for me to find things I remember reading, and thus far
it has worked reasonably well at that. (This via macOS Spotlight.)
What Preview displays is the page image as scanned, so it remains
possible to re-OCR the PDF with newer software.
ScanSnap software looked much the same on Windows and macOS, and
may yet; haven’t seen recent versions of the Windows software.
There are differences in how they encode page images in PDFs, e.g.
on macOS the software will encode a black-and-white scanned page
image using a compression that is lossless but doesn't actually
compress very well, and I think this is because macOS code is
used to construct the PDF. I use an Acrobat Pro “preflight”
configuration to convert these to what is basically TIFF G4
(CCITT Group 4 fax) encoding, a lossless run-length-based
compression that is better at reducing PDF file size. On Windows,
the generated PDF also uses this run-length compression.
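
One way to check which compression a given PDF uses for its page images is to look at the /Filter entry of each image object. The Python sketch below is a rough heuristic under stated assumptions: it scans the raw bytes for image dictionaries rather than parsing the PDF properly, the function name is made up, and it reports only the first filter listed. The filter names it can report (CCITTFaxDecode for Group 3/4 fax coding, FlateDecode, DCTDecode, JBIG2Decode and so on) come from the PDF specification; which one each ScanSnap version writes is not something this post establishes.

```python
import re
from collections import Counter

# An image XObject dictionary contains "/Subtype /Image".
_IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")
# "/Filter /Name" or "/Filter [ /Name ... ]"; captures the first filter name.
_FILTER_RE = re.compile(rb"/Filter\s*(?:\[\s*)?/([A-Za-z0-9]+)")


def image_filters(pdf_bytes):
    """Count the first /Filter name of each image object found in raw PDF bytes.

    Heuristic only: it does not parse the PDF, so unusual files may be
    miscounted. Images with no /Filter entry are counted under "none".
    """
    counts = Counter()
    for match in _IMAGE_RE.finditer(pdf_bytes):
        # Approximate the object's dictionary as the bytes between the
        # preceding "obj" keyword and the following "stream" keyword.
        start = pdf_bytes.rfind(b"obj", 0, match.start())
        end = pdf_bytes.find(b"stream", match.end())
        if start == -1:
            start = 0
        if end == -1:
            end = len(pdf_bytes)
        found = _FILTER_RE.search(pdf_bytes[start:end])
        name = found.group(1).decode("ascii") if found else "none"
        counts[name] += 1
    return counts
```

Running image_filters on the bytes of a scan before and after the preflight conversion should show the image filter change, for example from FlateDecode to CCITTFaxDecode, if the conversion did what was intended.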
-Frank McConnell