Hi!
On Tue, 2019-12-03 11:34:06 +1100, Guy Dunphy via cctalk <cctalk at classiccmp.org> wrote:
At 01:57 PM 2/12/2019 -0700, you wrote:
On Tue, Nov 26, 2019 at 8:51 PM Jay Jaeger via cctalk <cctalk at classiccmp.org> wrote:
> When I corresponded with Al Kossow about format several years ago, he
> indicated that CCITT Group 4 lossless compression was their standard.
As for G4 bilevel encoding, the only reasons it isn't treated with the
same disdain as JBIG2 are:
1. Bandwagon effect - "It must be OK because so many people use it."
2. People with little or zero awareness of typography, the visual quality of
text, and anything to do with preservation of historical character of
printed works. For them "I can read it OK" is the sole requirement.
G4 compression was invented for fax machines. No one cared much about the
visual quality of faxes; they just had to be readable. Also, the technology
of fax machines was only capable of two-tone B&W reproduction, so that's
what G4 encoding provided.
So it boils down to two distinct tasks:
* Scan old paper documentation into a proven file format (i.e. no
compression artifacts; b/w or 16 gray levels for black-and-white
text, tables and the like).
* Make these images accessible as useable documentation.
The first step is the work-intensive one; the second step can probably
be easily redone every time we "learn" something about how to make the
documents more useful.
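
To put rough numbers on the "b/w or 16 gray-level" choice, here is a
minimal sketch of the uncompressed size of one page at each bit depth.
The page size (US Letter) and the 600 dpi resolution are my own
assumptions for illustration, not anything agreed on in this thread, and
raw_page_bytes is a made-up helper name.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one scanned page, each pixel row padded to a whole byte."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Round each row up to a full byte, as packed raster formats usually do.
    bytes_per_row = (width_px * bits_per_pixel + 7) // 8
    return bytes_per_row * height_px


if __name__ == "__main__":
    # Assumed example: US Letter page (8.5 x 11 in) scanned at 600 dpi.
    for label, bits in (("b/w, 1 bit/pixel", 1), ("16 gray levels, 4 bits/pixel", 4)):
        size = raw_page_bytes(8.5, 11, 600, bits)
        print(f"{label}: {size / 1e6:.1f} MB uncompressed")
```

Under those assumptions that is about 4.2 MB per page for b/w and about
16.8 MB for 16 gray levels, before any lossless compression is applied.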
For accessibility, PDF seems to be quite a nice choice, as long as
we see that as a representation only (and not as the information
source.) Convert the images (to TIFF, for example), possibly downsample,
possibly OCR them and overlay the recognized text.
But PDF literally cannot be used as a wrapper for the results, since
it doesn't incorporate the required image compression formats.
This is why I use things like html structuring, wrapped as either a zip
file or RARbook format. Because there is no other option at present.
There will be eventually. Just not yet. PDF has to be either greatly
extended, or replaced.
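
A minimal sketch of what such an "HTML wrapped in a zip" bundle could
look like, using only the Python standard library. The layout
(index.html plus a pages/ directory) and the function name
build_html_bundle are assumptions made for illustration; this is not
necessarily how the bundles described above are built, and it does not
cover the RARbook variant.

```python
import html
import zipfile
from pathlib import Path


def build_html_bundle(image_paths, bundle_path, title="Scanned document"):
    """Write a zip holding the page images plus an index.html listing them in order."""
    image_paths = [Path(p) for p in image_paths]
    # One <img> per page, in the order given; file names are escaped for HTML.
    body = "\n".join(
        f'<p><img src="pages/{html.escape(p.name, quote=True)}" alt="Page {n}"></p>'
        for n, p in enumerate(image_paths, start=1)
    )
    index = (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body>\n{body}\n</body></html>\n"
    )
    with zipfile.ZipFile(bundle_path, "w") as bundle:
        # The index is plain text, so deflate it.
        bundle.writestr("index.html", index, compress_type=zipfile.ZIP_DEFLATED)
        for p in image_paths:
            # Image files are stored byte-for-byte, so the zip adds no recompression.
            bundle.write(p, arcname=f"pages/{p.name}", compress_type=zipfile.ZIP_STORED)
    return bundle_path
```

Calling build_html_bundle(["page001.png", "page002.png"], "manual.zip")
would give a zip that any browser can read once unpacked, with the
images kept in whatever format they were scanned to.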
I think that PDF actually works quite well as an output format, but
we should see it as a compilation product of our actual source (images),
not as the final (and only) product.
And that's why I get upset when people physically destroy rare old
documents during or after scanning them currently. It happens so
frequently that by the time we have a technically adequate document
coding scheme, a lot of old documents won't have any surviving paper copies.
They'll be gone forever, with only really crap quality scans surviving.
:-( Too bad, but that happens all the time.
Thanks,
Jan-Benedict