John Lawson <jpl15(a)netcom.com> wrote:
> I have tried scanning manuals a few times, sans much luck. Perhaps
> you could e-mail me privately with some knowledge on getting text
> scanned properly... or maybe my sorry software is braindead in the
> text dept. I have tried making jpegs, but by the time they're
> composited, tweaked, and compressed, they're mostly illegible. :(

Since I'd like to share my philosophy of printed document preparation
with a larger audience, I'm not emailing it privately.
The secret (IMNSHO) is to recognize four things:
1) *NONE* of the available OCR is anywhere near good enough.
2) This lifetime is too short to manually fix up the output of the OCR
   process.
3) Despite #1 and #2, it is still worthwhile making scanned images
   available in some form.
4) Even if OCR isn't good enough for document preservation, it's still
   worthwhile as a supplement, since you can't grep images.
Once you've resigned yourself to that, the solution is to scan the stuff
at a reasonable resolution (typically 300 DPI), save it as TIFF Class F
files using ITU-T Group 4 (T.6) compression, and run the stuff through
Adobe Acrobat Exchange's "Capture" module in "invisible text"
mode.
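
For a rough sense of why the Group 4 compression step matters, here is a back-of-the-envelope sketch of how big one uncompressed bilevel page is at 300 DPI. The 8.5 x 11 inch page size is an assumption for illustration; the manuals in question may well be a different size.

```python
# Raw (uncompressed) size of one bilevel page scan, 1 bit per pixel.
# The page dimensions are an assumption, not taken from the post.

def raw_bilevel_bytes(width_in, height_in, dpi):
    """Bytes needed for a 1-bit-per-pixel bitmap, rows padded to whole bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = (width_px + 7) // 8
    return bytes_per_row * height_px

raw = raw_bilevel_bytes(8.5, 11, 300)
print(f"{raw:,} bytes per page before compression")
```

Under that assumption a raw page is a little over a million bytes, so anything that gets a page down to tens of kilobytes is doing most of the work of making the files distributable.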
The capture module will OCR the text to the best of its ability, but
it will save the entire scanned image in the PDF file, so the document
can be displayed or printed in all its original glory (and with all the
original coffee stains, etc.). And since the capture module does
a fair job of OCR and saves the text in the PDF file with the "invisible"
attribute, the reader can still search the document.
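
The idea can be pictured with a toy model: every page carries its scanned image, which is what gets displayed, plus the OCR text, which is only ever used for searching. This is purely an illustration and not how a PDF file actually stores either one; the class, the filenames, and the sample text are all made up.

```python
# Toy model of "image on top, invisible OCR text underneath".
# Not the PDF format; names and sample text are hypothetical.
from dataclasses import dataclass

@dataclass
class Page:
    image_file: str  # what the reader sees and prints
    ocr_text: str    # possibly flawed OCR output, used only for search

def find_pages(pages, term):
    """Return 1-based numbers of pages whose hidden text contains term."""
    needle = term.lower()
    return [n for n, p in enumerate(pages, start=1)
            if needle in p.ocr_text.lower()]

pages = [
    Page("page001.tif", "DECsystem-10 Monitor Calls"),
    Page("page002.tif", "The M0NITOR call returns..."),  # OCR read O as 0
]
print(find_pages(pages, "monitor"))
```

The second page is missed by the search because of the OCR error, but it still displays and prints correctly, since the image is what the reader actually sees. That is the trade-off: imperfect search, perfect pages.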
The resulting document sizes are of course somewhat large, but not
so huge as to be completely unmanageable.
As an example, I have two DECsystem-10 manuals and a portion of a third
currently available from one of my web sites:
http://www.36bit.org/dec/manual/
Printed Pages   Document Size   Average Bytes Per Page
-------------   -------------   ----------------------
          162          11.9 M                   76,959
          514          36.2 M                   73,935
           50           2.5 M                   53,375
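
The per-page averages can be roughly reproduced from the other two columns. The sketch below assumes "M" means 2**20 bytes; the sizes are only given to one decimal place, so the results agree with the table to within a couple of percent rather than exactly.

```python
# Re-derive average bytes per page from the table's page counts and sizes.
# Assumption: "M" is 2**20 bytes.
MIB = 1024 * 1024

# (printed pages, document size in M, stated average bytes per page)
rows = [(162, 11.9, 76959), (514, 36.2, 73935), (50, 2.5, 53375)]

def bytes_per_page(pages, size_m):
    return size_m * MIB / pages

for pages, size_m, stated in rows:
    est = bytes_per_page(pages, size_m)
    print(f"{pages:4d} pages: about {est:,.0f} bytes/page (table says {stated:,})")
```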
A 36-megabyte file admittedly takes a fair bit of time to download over
a modem link. But the alternative was for people not to be able to get
it from me at all: I don't mind spending hours to scan it once, but I'm
not willing to spend hours making a photocopy EVERY TIME someone wants
a copy.
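
To put a number on "a fair bit of time", here is a quick estimate for the 514-page manual. The modem speeds (28.8 and 56 kbit/s) are assumptions, and the calculation ignores protocol overhead and any modem-level compression, so treat the results as ballpark figures only.

```python
# Ballpark download time for the largest manual over a modem link.
# Link speeds are assumed; overhead and modem compression are ignored.

def download_seconds(size_bytes, link_bits_per_second):
    return size_bytes * 8 / link_bits_per_second

size_bytes = 514 * 73935  # pages times average bytes per page, from the table

for bps in (28_800, 56_000):
    hours = download_seconds(size_bytes, bps) / 3600
    print(f"{bps:>6} bit/s: about {hours:.1f} hours")
```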
My web server supports byte-serving, so people running Netscape or IE
with the Acrobat plug-in can browse the documents without having to
download them in their entirety.
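
Byte-serving rests on HTTP range requests: the client asks for a span of bytes with a Range header and the server answers with status 206 and a Content-Range header. The sketch below only builds such a request and parses such a reply header, without touching the network. The URL is a placeholder and the helper names are made up for illustration; this is not a description of what the Acrobat plug-in does internally.

```python
# Minimal sketch of the two headers behind byte-serving.
# The URL is a placeholder; nothing is sent over the network.
import re
import urllib.request

def make_range_request(url, first_byte, last_byte):
    """Build (but do not send) a request for an inclusive byte range."""
    req = urllib.request.Request(url)
    req.add_header("Range", f"bytes={first_byte}-{last_byte}")
    return req

def parse_content_range(value):
    """Parse 'bytes first-last/total'; total is None if given as '*'."""
    m = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", value.strip())
    if m is None:
        raise ValueError(f"unrecognized Content-Range: {value!r}")
    total = None if m.group(3) == "*" else int(m.group(3))
    return int(m.group(1)), int(m.group(2)), total

req = make_range_request("http://example.com/manual.pdf", 0, 65535)
print(req.get_header("Range"))
print(parse_content_range("bytes 0-65535/38002590"))
```

A viewer that works this way can fetch just the bytes it needs for the page being looked at, which is why browsing a 36-megabyte file page by page stays tolerable.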
So far the only people who have complained about it are people who didn't
have any reason to need the files anyhow. I don't really understand it, but
a lot of people seem to download everything they can get their hands on for
no particular reason. I used to offer some very large PostScript documents
on my web site, either ZIP'd or tar'd, and I specifically stated on the web
page that the contents were the same, so please don't download both. An
amazing number of people downloaded both anyhow. So now I only provide
things like that in tar files.
Eric