Scanning docs for bitsavers - test-drb@ccmp.vtda.org

3 Dec 2019

At 01:20 AM 3/12/2019 -0200, you wrote:
...
> I cannot understand your problems with PDF files.
> I've created lots and lots of PDFs, with treated and untreated scanned
> material. All of them are very readable and have been in use for years. Of course,
> garbage in, garbage out. I take the utmost care in my scans to have good
> enough source files, so I can create great PDFs.
> Of course, Guy's comments are very informative and I'll learn more from them.
> But I still believe in good preservation using PDF files. FOR ME it is the
> best we have in encapsulating info. Forget HTMLs.
I don't propose html as a viable alternative. It has massive inadequacies
for representing physical documents. I just use it for experimenting and
as a temporary wrapper, because it's entirely transparent and malleable,
i.e. I have total control over the result (within the bounds of what html
can do.)
...
> Please, take a look at this PDF, and tell me: Isn't that good enough for
> preservation/use?
> https://drive.google.com/file/d/0B7yahi4JC3juSVVkOEhwRWdUR1E/view
OK, not too bad in comparison to many others. But a few comments:
* The images are fax-mode, and although the resolution is high enough for there to be
  no ambiguities, it still looks bad and differs greatly in style from the original.
  Pity I don't have a copy of the original, to make demonstration scans of a few
  illustrations to show what it could be like, for similar file size.
* The text is OCR, with a font I expect approximates the original fairly well.
  Though I'd like to see the original. I suspect the PDF font is a bit 'thick' due to
  an incorrect gray threshold.
  Also it's searchable, except that the OCR process included paper blemishes as
  'characters', so if you copy-paste the text elsewhere you have to carefully vet it.
  And not all searches will work.
  This is an illustration of the point that until we achieve human-level AI, it's never
  going to be possible to go from images to abstracted OCR text automatically without
  considerable human oversight and proof-reading. And... human-level AI won't _want_ to
  do drudgery like that.
* Your automated PDF generation process did a lot of silly things, like chaotic attempts
  to OCR 'elements' of diagrams. Just try moving a text selection box over the diagrams,
  you'll see what I mean. Try several diagrams, it's very random.
* The PCB layouts, e.g. PDF page #s 28, 29 - I bet the original used light shading to
  represent copper, and details over the copper were clearly visible. But when you
  scanned it in bi-level all that is lost. These _have_ to be in gray scale, and
  preferably post-processed to posterize the flat shading areas (for better compression
  as well as visual accuracy.)
* Why are all the diagram pages variously different widths? I expect the original pages
  (foldouts?) had common sizes. This variation is because either you didn't use a fixed
  recipe for scanning and processing, or your PDF generation utility 'handled' that
  automatically (and messed up.)
* You don't have control of what was OCR'd and what wasn't. For instance, why OCR table
  contents, if the text selection results are garbage? E.g., select the entire block at
  the bottom of PDF page 48. Does the highlighting create a sense of confidence this is
  going to work? Now copy and paste into a text editor. Is the result useful? (No.)
  OCR can be over-used.
* 'Ownership'. As well as your introduction page, you put your tag on every single page.
  Pretty much everyone does something like this. As if by transcribing the source
  material you acquired some kind of ownership or bragging rights. But no, others put a
  very great deal of effort into creating that work, and you just made a digital copy.
  One that the originators probably would consider an aesthetic insult to their efforts.
  So, why the proud tags everywhere?
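
To illustrate the bi-level versus gray scale point about the PCB pages, here is a
minimal Python sketch. It is not the process used for any real scan: the pixel values,
the cutoff of 128 and the three gray levels are made-up assumptions, and `threshold`
and `posterize` are hypothetical helper names that work on a plain list of rows rather
than on a real image library.

```python
def threshold(pixels, cutoff=128):
    """Bi-level ('fax mode') conversion: every pixel becomes 0 or 255."""
    return [[0 if p < cutoff else 255 for p in row] for row in pixels]


def posterize(pixels, levels):
    """Snap each pixel to the nearest of a small set of chosen gray levels."""
    return [[min(levels, key=lambda v: abs(v - p)) for p in row] for row in pixels]


# Hypothetical gray scale scan of a PCB detail (values are assumptions):
# roughly 250 = paper, roughly 190 = light copper shading, roughly 20 = ink.
scan = [
    [251, 249, 188, 192, 190],
    [250, 193,  22,  18, 189],
    [248, 191, 187, 194, 252],
]

bilevel = threshold(scan)                    # copper merges with the paper
flat = posterize(scan, levels=(0, 192, 255)) # paper, copper and ink stay distinct

for name, image in (("bi-level", bilevel), ("posterized", flat)):
    print(name)
    for row in image:
        print("  ", row)
```

With these assumed values the bi-level version turns the copper shading into the same
white as the paper, so only the ink survives. The posterized version keeps paper, copper
and ink as three flat tones, and the scanner noise within each tone is gone, which is
what helps compression.
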

Summary: It's fine as a working copy for practical use. Better to have made it than not,
so long as you didn't destroy the paper original in the process. But if you're talking
about an archival historical record, that someone can look at in 500 years (or 5000) and
know what the original actually looked like, how much effort went into making that ink
crisp and accurate, then no. It's not good enough.

To be fair, I've never yet seen any PDF scan of any document that I'd consider good
enough. Works created originally in PDF as line art are a different class, and typically
OK. Though some other flaws of PDF do come into play. Difficulty of content export,
problems with global page parameters, font failures, sequential vs content page numbers,
etc.

With scanning there are multiple points of failure right through the whole process at
present, ranging from misunderstandings of the technology among people doing scanning,
through problems with scanners (why are edge scanners so rare!?), lack of critical
capabilities in post-processing utilities (line art on top of ink screening, it's a
nightmare, also most people can't use Photoshop well, and it's necessary) and failings
built unavoidably into PDF, to not so great PDF viewer utilities. And that's apart from
the intrinsic issues (aside from a few advantages) with on-screen display and controls
compared to paper.

I hope I have not offended you. Btw my pickiness comes from growing up in a family with
commercial art, typography, printing and technical art involvement. And having in later
years assisted a little with such things. So at least I know how much effort goes into
such things.

Keep the original. Methods and utilities will improve, and in 10 or 20 years it may be
possible to make a visually perfect digital copy (with minimal effort), worthy of
becoming a sole record of that thing (if history goes that way.)

Guy