DEC scanned documents for Bitsavers (message for Al Kossow)

Fri Apr 24 09:59:06 CDT 2015

On 2015-04-24 16:07, Paul Koning wrote:
>
>> On Apr 24, 2015, at 9:48 AM, Noel Chiappa <jnc at mercury.lcs.mit.edu> wrote:
>>
>>> From: shadoooo
>>
>>> I'm scanning at 600dpi grayscale, lossless compression.
>>
>> I've been scanning a few things too, and I found that 600dpi grayscale
>> produced absolutely enormous files (many, many MB per page, for prints), no
>> matter what I tried to do, compression-wise.
>>
>> 600dpi black and white, followed by saving as TIFFs with CCITT Group 4
>> compression, produced immensely smaller files (low hundreds of KB for the
>> same pages), and they are quite readable (even the fine lettering seems to be
>> readable - b/6 is quite distinguishable, etc).
>
> If you’re looking to scan for human consumption, bitmap works OK.  But I’ve found that OCR programs seem to want grayscale.  Why that is, I don’t know; they do seem to convert it to bitmap at some point.  Possibly the threshold logic is more complex.
>
> That brings up thresholds.  When scanning, or converting to, bitmap, you have to set the gray threshold that is the cutoff between white and black.  The default would typically be 128 (50%).  Depending on the scanner and the condition of the originals, that threshold may be fine, or it may be far off the optimal.  A good approach is to scan a number of representative pages in grayscale, and experiment with different threshold settings to see which one is the best.  Basically, you’re looking for the compromise between filled in loops, and broken thin lines.  For printed originals, this is probably not all that critical; for typewritten material, it is far more so.
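
For anyone who wants to try the threshold experiment described above, here
is a minimal sketch in plain Python. It is only an illustration, not what
any particular scanner or OCR package does: the function names and the
small PAGE sample are made up for this example, and a real page would be
loaded from the grayscale scan with an imaging library. Pixels are 0-255,
where 0 is black, and anything below the threshold becomes black.

```python
def to_bilevel(gray, threshold=128):
    """Convert rows of 0-255 gray values to 0 (black) / 1 (white).

    Pixels strictly below the threshold become black.
    """
    return [[0 if px < threshold else 1 for px in row] for row in gray]


def black_fraction(bitmap):
    """Fraction of pixels that ended up black in a bilevel image."""
    total = sum(len(row) for row in bitmap)
    black = sum(row.count(0) for row in bitmap)
    return black / total


def sweep(gray, thresholds=(96, 128, 160)):
    """Try several thresholds and report the black fraction for each."""
    return {t: black_fraction(to_bilevel(gray, t)) for t in thresholds}


# Hypothetical sample patch (made-up values): light paper (220), a faint
# thin stroke (140), a dark stroke (40), and one smudged pixel (100).
PAGE = [
    [220, 140, 220, 40, 100, 220],
    [220, 140, 220, 40, 220, 220],
]

if __name__ == "__main__":
    for t, frac in sweep(PAGE).items():
        print(f"threshold {t}: {frac:.0%} black")
```

On this sample, a low threshold drops the faint stroke (broken thin lines),
while a high threshold keeps it but also turns the smudge black (the start
of filled-in loops). The black fraction alone will not pick the best
setting; it is just a quick way to see how sensitive a page is before
comparing the results by eye.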

Speaking of which, does anyone have a good suggestion for OCR nowadays? I
would really like to run OCR over all the current PDF scans of manuals
that sit out there, to get the documents back to sane sizes and also
make it possible to update the documents over time when needed...

	Johnny