Scanning docs for bitsavers

Mon Dec 2 18:34:06 CST 2019

At 01:57 PM 2/12/2019 -0700, you wrote:
>On Tue, Nov 26, 2019 at 8:51 PM Jay Jaeger via cctalk <cctalk at classiccmp.org>
>wrote:
>
>> When I corresponded with Al Kossow about format several years ago, he
>> indicated that CCITT Group 4 lossless compression was their standard.
>>
>
>There are newer bilevel encodings that are somewhat more efficient than G4
>(ITU-T T.6), such as JBIG (T.82) and JBIG2 (T.88), but they are not as
>widely supported, and AFAIK JBIG2 is still patent encumbered. As a result,
>G4 is still arguably the best bilevel encoding for general-purpose use. PDF
>has natively supported G4 for ages, though it gained JBIG and JBIG2 support
>in more recent versions.
>
>Back in 2001, support for G4 encoding in open source software was really
>awful; where it existed at all, it was horribly slow. There was no good
>reason for G4 encoding to be slow, which was part of my motivation in
>writing my own G4 encoder for tumble (an image-to-PDF utility). However, G4
>support is generally much better now.

Mentioning JBIG2 (or any of its predecessors) without noting that it is
completely unacceptable as a scanned document compression scheme demonstrates
a lack of awareness of the defects it introduces in encoded documents.
See http://everist.org/NobLog/20131122_an_actual_knob.htm#jbig2
JBIG2 typically produces visually appalling results, and also introduces so
many actual factual errors (typically substituted letters and numbers) that
documents encoded with it have been ruled inadmissible as evidence in court.
Sucks to be an engineering or financial institution, which scanned all its
archives with JBIG2 then shredded the paper originals to save space.
The fuzziness of JBIG2 is adjustable, but fundamentally there will always
be some degree of visible patchiness and risk of incorrect substitution.

As for G4 bilevel encoding, the only reasons it isn't treated with the same
disdain as JBIG2 are:
1. Bandwagon effect - "It must be OK because so many people use it."
2. People with little or zero awareness of typography, the visual quality of
   text, and anything to do with preservation of historical character of
   printed works. For them "I can read it OK" is the sole requirement.

G4 compression was invented for fax machines. No one cared much about the
visual quality of faxes; they just had to be readable. Also, the technology
of fax machines was only capable of two-tone B&W reproduction, so that's what
G4 encoding provided.

Thinking that this kind of visual degradation is acceptable when scanning
documents for long term preservation is both short-sighted and ignorant of
what can already be achieved with better technique.

For example, B&W text and line diagram material can be presented very nicely
using 16-level gray shading. That's enough to visually preserve all the
line and edge quality. The PNG format provides a color indexed 4 bits/pixel
mode, combined with PNG's lossless DEFLATE compression. When documents are
scanned with sensible thresholds and post-processed to ensure all white
paper is actually #FFFFFF and solid blacks are actually #000000, while edges
retain adequate gray shading, PNG achieves an excellent level of filesize
compression.
The visual results are _far_ superior to G4 and JBIG2 coding, and surprisingly
the file sizes can actually be smaller. It's easy to achieve on-screen results
that are visually indistinguishable from looking at the paper original, with
quite acceptable filesizes.
And that's the way it should be.
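
A rough sketch of the post-processing described above, in Python using only
the standard library. This is illustrative only, not the actual workflow used
for any particular scan: the function names, the white/black thresholds
(200 and 60) and the input format (a list of rows of 8-bit gray values) are
all assumptions made for the example. It clamps paper to pure white and ink
to pure black, keeps 16 gray levels for the edges, and writes a 4 bits/pixel
indexed PNG.

```python
import struct
import zlib


def quantize_levels(pixels, white_point=200, black_point=60):
    """Map 8-bit gray values to 16 levels (0..15).

    Values at or above white_point become pure white (15), values at or
    below black_point become pure black (0); the thresholds are
    illustrative assumptions, not recommended settings.
    """
    span = white_point - black_point
    out = []
    for row in pixels:
        new_row = []
        for v in row:
            if v >= white_point:
                new_row.append(15)
            elif v <= black_point:
                new_row.append(0)
            else:
                new_row.append(round((v - black_point) * 15 / span))
        out.append(new_row)
    return out


def _chunk(tag, data):
    """Build one PNG chunk: length, type, data, CRC over type+data."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def encode_png_4bit(levels):
    """Encode rows of 0..15 values as an indexed 4 bits/pixel PNG."""
    height = len(levels)
    width = len(levels[0])
    # bit depth 4, color type 3 (indexed), default compression/filter, no interlace
    ihdr = struct.pack(">IIBBBBB", width, height, 4, 3, 0, 0, 0)
    # 16-entry gray palette: index 0 = #000000 ... index 15 = #FFFFFF
    palette = b"".join(bytes((i * 17,) * 3) for i in range(16))
    raw = bytearray()
    for row in levels:
        raw.append(0)  # per-scanline filter type 0 (None)
        padded = row + [0] * (len(row) % 2)  # pad odd widths to a whole byte
        for i in range(0, len(padded), 2):
            raw.append((padded[i] << 4) | padded[i + 1])
    idat = zlib.compress(bytes(raw), 9)
    return (b"\x89PNG\r\n\x1a\n" + _chunk(b"IHDR", ihdr) + _chunk(b"PLTE", palette)
            + _chunk(b"IDAT", idat) + _chunk(b"IEND", b""))
```

For example, `encode_png_4bit(quantize_levels(gray_rows))` returns the bytes
of a PNG file. Large areas of clean white paper become long runs of identical
bytes, which is why the clamping step helps the file size so much.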

Which brings us to PDF, which most people love because they use it all the
time, have never looked into the details of its internals, and can't imagine
anything better.
Just one point here. PDF does not support PNG image encoding. *All* the
image compression schemes PDF does support are flawed in various cases.
But because PDF structuring is opaque to users, very few are aware of
this and its other problems. That is why PDF isn't acceptable as a
container for long term archiving of _scanned_ documents for historical
purposes, even though PDF was at least extended to include an 'archival'
form in which all the font definitions must be included.

When I scan things I'm generally doing it in an experimental sense,
still exploring solutions to various issues such as the best way to deal
with screened print images and cases where ink screening for tonal images
has been overlaid with fine detail line art and text, which makes processing
to a high quality digital image quite difficult.

But PDF literally cannot be used as a wrapper for the results, since
it doesn't incorporate the required image compression formats. 
This is why I use things like html structuring, wrapped as either a zip
file or RARbook format. Because there is no other option at present.
There will be eventually. Just not yet. PDF has to be either greatly
extended, or replaced.
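
As a minimal sketch of the html-in-a-zip wrapping mentioned above (not the
actual layout of any existing archive), here is one way to do it with
Python's standard zipfile module. The function name, the index.html file name
and the one-image-per-page markup are assumptions made for the example. The
PNG pages are stored uncompressed inside the zip, since they are already
compressed.

```python
import io
import zipfile


def build_scan_archive(pages):
    """Wrap scanned pages as a zip holding index.html plus the page images.

    pages is a list of (filename, png_bytes) tuples; returns the zip
    archive as bytes. The layout is a hypothetical example.
    """
    imgs = "\n".join(
        '<p><img src="%s" alt="page %d"></p>' % (name, n)
        for n, (name, _) in enumerate(pages, 1)
    )
    html = (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        "<title>Scanned document</title></head>\n"
        "<body>\n%s\n</body></html>\n" % imgs
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The HTML index is plain text, so deflate it.
        zf.writestr("index.html", html, compress_type=zipfile.ZIP_DEFLATED)
        for name, data in pages:
            # PNG data is already compressed; store it as-is.
            zf.writestr(name, data, compress_type=zipfile.ZIP_STORED)
    return buf.getvalue()
```

The result can be written to disk and opened with any zip tool, and the
index.html inside displays the pages in order in any browser.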

And that's why I get upset when people physically destroy rare old documents
during or after scanning them. It currently happens so frequently that by
the time we have a technically adequate document coding scheme, a lot of old
documents won't have any surviving paper copies.
They'll be gone forever, with only really crap quality scans surviving.

Guy