Scanning docs for bitsavers
jbglaw at lug-owl.de
Tue Dec 3 03:15:52 CST 2019
On Tue, 2019-12-03 11:34:06 +1100, Guy Dunphy via cctalk <cctalk at classiccmp.org> wrote:
> At 01:57 PM 2/12/2019 -0700, you wrote:
> >On Tue, Nov 26, 2019 at 8:51 PM Jay Jaeger via cctalk <cctalk at classiccmp.org>
> > > When I corresponded with Al Kossow about format several years ago, he
> > > indicated that CCITT Group 4 lossless compression was their standard.
> As for G4 bilevel encoding, the only reasons it isn't treated with the same
> disdain as JBIG2, are:
> 1. Bandwagon effect - "It must be OK because so many people use it."
> 2. People with little or zero awareness of typography, the visual quality of
> text, and anything to do with preservation of historical character of
> printed works. For them "I can read it OK" is the sole requirement.
> G4 compression was invented for fax machines. No one cared much about visual
> quality of faxes, they just had to be readable. Also the technology of fax
> machines was only capable of two-tone B&W reproduction, so that's what G4
> encoding provided.
So it boils down to two distinct tasks:
* Scan old paper documentation with a proven file format (i.e. no
compression artifacts; b/w or 16 gray levels for black-and-white
text, tables and the like).
* Make these images accessible as useable documentation.
The first step is the work-intensive one; the second step can probably
be easily redone every time we "learn" something about how to make the
documents more useful.
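
To illustrate the two pixel formats named in the first task, here is a
minimal sketch in Python (standard library only). The function names
and the sample scan line are made up for illustration; a real workflow
would do this in the scanner software or an imaging library.

```python
def to_bilevel(row, threshold=128):
    """Reduce 8-bit gray samples (0..255) to two tones: 0 = black, 1 = white."""
    return [1 if v >= threshold else 0 for v in row]


def to_gray16(row):
    """Reduce 8-bit gray samples (0..255) to 16 levels (0..15) by dropping the low 4 bits."""
    return [v >> 4 for v in row]


# Hypothetical scan line crossing the soft edge of a printed stroke:
# white paper, a gradual transition, then solid ink.
edge = [255, 230, 180, 120, 60, 20, 0]

print(to_bilevel(edge))  # [1, 1, 1, 0, 0, 0, 0]
print(to_gray16(edge))   # [15, 14, 11, 7, 3, 1, 0]
```

The bilevel row turns the gradual edge into a hard step, while the
16-level row still records the transition, at 4 bits per pixel instead
of 1.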
For accessibility, PDF seems to be quite a nice choice, as long as
we see that as a representation only (and not as the information
source.) For example: convert the images to TIFF, possibly downsample,
possibly OCR them and overlay the recognized text.
> But PDF literally cannot be used as a wrapper for the results, since
> it doesn't incorporate the required image compression formats.
> This is why I use things like html structuring, wrapped as either a zip
> file or RARbook format. Because there is no other option at present.
> There will be eventually. Just not yet. PDF has to be either greatly
> extended, or replaced.
I think that PDF actually works quite well as an output format, but
we should see it as a compilation product of our actual source (the
images), not as the final (and only) product.
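
Treating the PDF as a build product of the images could look roughly
like a Makefile-style staleness check. This is only a sketch under my
own assumptions: the function names are hypothetical, and file
modification times stand in for whatever change tracking an archive
would really use.

```python
import os


def is_stale(pdf_mtime, image_mtimes):
    """True if any source image is newer than the PDF built from it."""
    return any(m > pdf_mtime for m in image_mtimes)


def needs_rebuild(pdf_path, image_paths):
    """True if the PDF is missing or older than any of its source images."""
    if not os.path.exists(pdf_path):
        return True
    pdf_mtime = os.path.getmtime(pdf_path)
    return is_stale(pdf_mtime, [os.path.getmtime(p) for p in image_paths])
```

The point of the design is that the scans are never touched by this
step: the PDF can be thrown away and regenerated whenever a better
conversion comes along.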
> And that's why I get upset when people physically destroy rare old documents
> during or after scanning them currently. It happens so frequently, that by
> the time we have a technically adequate document coding scheme, a lot of old
> documents won't have any surviving paper copies.
> They'll be gone forever, with only really crap quality scans surviving.
:-( Too bad, but that happens all the time.