On Apr 24, 2015, at 9:48 AM, Noel Chiappa <jnc at mercury.lcs.mit.edu> wrote:
From: shadoooo
I'm scanning at 600dpi grayscale, lossless
compression.
I've been scanning a few things too, and I found that 600dpi grayscale
produced absolutely enormous files (many, many MB's per page, for prints), no
matter what I tried to do, compression-wise.
600dpi black and white, followed by saving as TIFF's with CCITT Group 4
compression, produced immensely smaller files (small 100's of KB's for the
same pages), and they are quite readable (even the fine lettering seems to be
readable - b/6 is quite distinguishable, etc).
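For anyone doing that conversion in software rather than in the scanner
driver, here is a rough sketch of the same recipe in Python with Pillow
(the filenames are just placeholders):

    from PIL import Image

    # Hard-threshold a grayscale scan at 50% and save it as a bilevel
    # TIFF with CCITT Group 4 compression; Pillow's "group4" setting
    # requires a mode "1" (1-bit) image.
    gray = Image.open("page-600dpi.png").convert("L")
    bw = gray.point(lambda p: 255 if p >= 128 else 0, mode="1")
    bw.save("page-600dpi.tif", compression="group4")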
If you're looking to scan for human consumption, bitmap works ok. But I've found that
OCR programs seem to want grayscale. Why that is, I don't know; they do seem to convert
it to bitmap at some point. Possibly the threshold logic is more complex.
That brings up thresholds. When scanning, or converting to, bitmap, you have to set the
gray threshold that is the cutoff between white and black. The default would typically be
128 (50%). Depending on the scanner and the condition of the originals, that threshold
may be fine, or it may be far off the optimal. A good approach is to scan a number of
representative pages in grayscale, and experiment with different threshold settings to see
which one is best. Basically, you're looking for the compromise between filled-in
loops and broken thin lines. For printed originals, this is probably not all that
critical; for typewritten material, it is far more so.
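Something like the following (again Python with Pillow, filenames
hypothetical) makes that experiment quick - binarize one representative
page at several cutoffs and eyeball the results:

    from PIL import Image

    # Sweep cutoffs around the usual 128 default; inspect the output
    # files for the compromise between filled-in loops and broken
    # thin strokes.
    gray = Image.open("sample-page.png").convert("L")
    for t in (96, 112, 128, 144, 160):
        bw = gray.point(lambda p, t=t: 255 if p >= t else 0, mode="1")
        bw.save(f"sample-t{t}.tif", compression="group4")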
Speaking of which, anyone have a good suggestion for OCR nowadays? I
would really like to throw all the current PDF scans of manuals that
sit out there at OCR, to get the documents back to sane sizes, and also
make it possible to update the documents over time when needed...
Johnny