Scanning docs for bitsavers

Tue Dec 3 05:08:33 CST 2019

At 01:20 AM 3/12/2019 -0200, you wrote:
>I cannot understand your problems with PDF files.
>I've created lots and lots of PDFs, with treated and untreated scanned
>material. All of them are very readable and in use for years. Of course,
>garbage in, garbage out. I take the utmost care in my scans to have good
>enough source files, so I can create great PDFs.
>
>Of course, Guy's comments are very informative and I'll learn more from it.
>But I still believe in good preservation using PDF files. FOR ME it is the
>best we have in encapsulating info. Forget HTMLs.

I don't propose HTML as a viable alternative. It has massive inadequacies
for representing physical documents. I just use it for experimenting and
as a temporary wrapper, because it's entirely transparent and malleable,
i.e. I have total control over the result (within the bounds of what HTML
can do.)

>Please, take a look at this PDF, and tell me: Isn't that good enough for
>preservation/use?
>https://drive.google.com/file/d/0B7yahi4JC3juSVVkOEhwRWdUR1E/view

OK, not too bad in comparison to many others. But a few comments:
* The images are fax-mode, and although the resolution is high enough for there to be
  no ambiguities, they still look bad and differ greatly in style from the original.
  Pity I don't have a copy of the original, to make demonstration scans of a few
  illustrations to show what it could be like, for similar file size.

* The text is OCR, with a font I expect likely approximates the original fairly well.
  Though I'd like to see the original. I suspect the PDF font is a bit 'thick' due to
  an incorrect gray threshold.
  Also it's searchable, except that the OCR process included paper blemishes as 'characters'
  so if you copy-paste the text elsewhere you have to carefully vet it. And not all searches
  will work.

  This is an illustration of the point that until we achieve human-level AI, it's never
  going to be possible to go from images to abstracted OCR text automatically without considerable
  human oversight and proof-reading. And... human-level AI won't _want_ to do drudgery like that.

* Your automated PDF generation process did a lot of silly things, like chaotic attempts to
  OCR 'elements' of diagrams. Just try moving a text selection box over the diagrams, you'll
  see what I mean. Try several diagrams, it's very random.

* The PCB layouts, e.g. PDF pages 28 and 29: I bet the original used light shading to represent
  copper, with details over the copper clearly visible. But when you scanned it in bi-level
  all that was lost. These _have_ to be in gray scale, and preferably post-processed to posterize
  the flat shading areas (for better compression as well as visual accuracy.)

* Why are all the diagram pages variously different widths? I expect the original pages (foldouts?)
  had common sizes. This variation is because either you didn't use a fixed recipe for scanning
  and processing, or your PDF generation utility 'handled' that automatically (and messed up.)

* You don't have control of what was OCR'd and what wasn't. For instance, why OCR table contents,
  if the text selection results are garbage? For example, select the entire block at the bottom of
  PDF page 48. Does the highlighting create a sense of confidence this is going to work?
  Now copy and paste into a text editor. Is the result useful? (No.)
  OCR can be over-used.

* 'Ownership'. As well as your introduction page, you put your tag on every single page.
  Pretty much everyone does something like this. As if by transcribing the source material you
  acquired some kind of ownership or bragging rights. But no, others put a very great deal of
  effort into creating that work, and you just made a digital copy, one that the originators
  would probably consider an aesthetic insult to their efforts. So, why the proud tags everywhere?
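
To illustrate the posterizing remark in the PCB bullet above, here is a minimal Python sketch.
Everything in it is an assumption made for demonstration: the synthetic 'scan', the noise
level, the helper name posterize(), and the use of equal-width gray bands. A real scan would
need levels picked from its own histogram. It only shows why flattening noisy shading
helps a lossless compressor.

```python
import random
import zlib

def posterize(pixels, levels):
    """Snap each 0-255 gray value to the centre of one of `levels` equal bands."""
    step = 256 / levels
    return bytes(int((int(p / step) + 0.5) * step) for p in pixels)

# Synthetic stand-in for a gray-scale scan (not real data): light shading
# for copper, a dark vertical line every 50 pixels, plus scanner noise.
rng = random.Random(0)
width, height = 200, 100
raw = bytearray()
for y in range(height):
    for x in range(width):
        base = 32 if x % 50 == 0 else 224
        raw.append(base + rng.randint(-8, 8))
raw = bytes(raw)

flat = posterize(raw, 4)

# Lossless compression of the noisy original versus the posterized copy.
raw_size = len(zlib.compress(raw))
flat_size = len(zlib.compress(flat))
print("noisy:", raw_size, "bytes; posterized:", flat_size, "bytes")
```

The posterized copy holds only two gray values, so it compresses far better than the noisy
one, while the shading and the dark lines over it both survive. Bi-level scanning would
instead force the shading to pure black or pure white.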

Summary: It's fine as a working copy for practical use. Better to have made it than not, so long
as you didn't destroy the paper original in the process. But if you're talking about an archival
historical record, that someone can look at in 500 years (or 5000) and know what the original 
actually looked like, how much effort went into making that ink crisp and accurate, then no. 
It's not good enough. 

To be fair, I've never yet seen any PDF scan of any document that I'd consider good enough.
Works created originally in PDF as line art are a different class, and typically OK. Though
some other flaws of PDF do come into play: difficulty of content export, problems with global
page parameters, font failures, sequential vs content page numbers, etc.

With scanning there are multiple points of failure right through the whole process at present,
ranging from misunderstandings of the technology among people doing scanning, problems with
scanners (why are edge scanners so rare!?), lack of critical capabilities in post-processing
utilities (line art on top of ink screening, it's a nightmare, also most people can't use
Photoshop well, and it's necessary), failings built unavoidably into PDF, and not so great
PDF viewer utilities. That is apart from the intrinsic issues (aside from a few advantages) with
on-screen display and controls compared to paper.
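
One of those technology misunderstandings is the gray threshold mentioned earlier. A small
Python sketch of the effect follows. The scanline values and the helper stroke_width() are
hypothetical, chosen only to mimic the soft edges a scanner produces across one character
stroke. It is not a measurement of the PDF under discussion.

```python
def stroke_width(scanline, threshold):
    """Count pixels that turn black when values below `threshold` become ink."""
    return sum(1 for v in scanline if v < threshold)

# Hypothetical gray values across one stroke (0 = black ink, 255 = white paper).
scanline = [250, 245, 220, 170, 90, 30, 20, 30, 90, 170, 220, 245, 250]

for t in (64, 128, 200):
    print(t, stroke_width(scanline, t))
# prints:
# 64 3
# 128 5
# 200 7
```

The same stroke comes out 3, 5 or 7 pixels wide depending only on where the cut is made,
which is how a badly chosen threshold makes bi-level text look thick.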

I hope I have not offended you. Btw my pickiness comes from growing up in a family with
commercial art, typography, printing and technical art involvement. And having in later years
assisted a little with such things. So at least I know how much effort goes into such things.

Keep the original. Methods and utilities will improve, and in 10 or 20 years it may be possible
to make a visually perfect digital copy (with minimal effort), worthy of becoming a sole record
of that thing (if history goes that way.)

Guy