Scanning docs for bitsavers

Tue Dec 3 04:18:24 CST 2019

Actually, we scan to PDF with OCR behind the page image, and also to text, TIFF and JPEG.
With the slooowww HP 11x17 scan/fax/print thing I can scan the entire document, then save 1, save 2, save 3, save 4 without rescanning each time.   Ed at SMECC
In a message dated 12/3/2019 2:16:01 AM US Mountain Standard Time, cctalk at classiccmp.org writes:

Hi!
On Tue, 2019-12-03 11:34:06 +1100, Guy Dunphy via cctalk <cctalk at classiccmp.org> wrote:
> At 01:57 PM 2/12/2019 -0700, you wrote:
> >On Tue, Nov 26, 2019 at 8:51 PM Jay Jaeger via cctalk <cctalk at classiccmp.org>
> >wrote:
> >
> > > When I corresponded with Al Kossow about format several years ago, he
> > > indicated that CCITT Group 4 lossless compression was their standard.
> As for G4 bilevel encoding, the only reasons it isn't treated with the same
> disdain as JBIG2 are:
> 1. Bandwagon effect - "It must be OK because so many people use it."
> 2. People with little or zero awareness of typography, the visual quality of
>    text, and anything to do with preservation of historical character of
>    printed works. For them "I can read it OK" is the sole requirement.
> 
> G4 compression was invented for fax machines. No one cared much about visual
> quality of faxes, they just had to be readable. Also the technology of fax
> machines was only capable of two-tone B&W reproduction, so that's what G4
> encoding provided.

So it boils down to two distinct tasks:

  * Scan old paper documentation with a proven file format (i.e. no
    compression artifacts; b/w or 16 gray levels for black-and-white
    text, tables and the like).

  * Make these images accessible as useable documentation.

The first step is the work-intensive one; the second step can probably
be easily redone every time we "learn" something about how to make the
documents more useful.
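
To make that split concrete, here is a minimal sketch of how the master
scans could be pinned down once so that every later rebuild starts from
known-good input. The directory layout, the *.tif naming and the function
names are all assumptions for illustration, not anything bitsavers or
anyone in this thread actually uses.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(master_dir: Path) -> dict[str, str]:
    """Map each master scan (hypothetical *.tif files) to its checksum."""
    return {p.name: sha256_of(p) for p in sorted(master_dir.glob("*.tif"))}


def changed_masters(master_dir: Path, manifest: dict[str, str]) -> list[str]:
    """List masters that were added, removed or altered since the manifest."""
    current = build_manifest(master_dir)
    names = manifest.keys() | current.keys()
    return sorted(n for n in names if manifest.get(n) != current.get(n))
```

As long as changed_masters() returns an empty list, any PDF, OCR layer or
downsampled copy can be thrown away and regenerated from the same masters.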

  For accessibility, PDF seems to be quite a nice choice, as long as
we see that as a representation only (and not as the information
source.) Convert the images to TIFF for example, possibly downsample,
possibly OCR and overlay it.
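
As a rough illustration of the "possibly downsample" step, the sketch
below derives a smaller, 16-gray-level page from an 8-bit grayscale
master held as a plain list of rows. It is only an assumption about what
such a derivation might look like (a 2x2 box average followed by a
4-bit quantization); a real pipeline would use an imaging tool, and the
function names here are made up. The master itself is never modified.

```python
def downsample_2x(pixels: list[list[int]]) -> list[list[int]]:
    """Halve width and height by averaging each 2x2 block of gray values.

    A trailing odd row or column is dropped.
    """
    height = len(pixels) // 2 * 2
    width = len(pixels[0]) // 2 * 2 if pixels else 0
    out = []
    for y in range(0, height, 2):
        row = []
        for x in range(0, width, 2):
            total = (pixels[y][x] + pixels[y][x + 1]
                     + pixels[y + 1][x] + pixels[y + 1][x + 1])
            row.append((total + 2) // 4)  # rounded average of the block
        out.append(row)
    return out


def to_16_levels(pixels: list[list[int]]) -> list[list[int]]:
    """Quantize 8-bit gray values (0..255) to 16 levels (0..15)."""
    return [[value >> 4 for value in row] for row in pixels]


# Tiny 4x4 "master": 0 is black, 255 is white.
master = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 255, 128, 128],
    [255, 255, 128, 128],
]
derived = to_16_levels(downsample_2x(master))
```

Because the derived page is computed from the master every time, the
block size or the number of gray levels can be changed later without
touching a scanner again.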

> But PDF literally cannot be used as a wrapper for the results, since
> it doesn't incorporate the required image compression formats. 
> This is why I use things like html structuring, wrapped as either a zip
> file or RARbook format. Because there is no other option at present.
> There will be eventually. Just not yet. PDF has to be either greatly
> extended, or replaced.

I think that PDF actually works quite well as an output format, but
we should see it as a compilation product of our actual source (the
images), not as the final (and only) product.

> And that's why I get upset when people physically destroy rare old documents
> during or after scanning them currently. It happens so frequently, that by
> the time we have a technically adequate document coding scheme, a lot of old
> documents won't have any surviving paper copies.
> They'll be gone forever, with only really crap quality scans surviving.

:-(  Too bad, but that happens all the time.

Thanks,
  Jan-Benedict

--