Scanning docs for bitsavers
guykd at optusnet.com.au
Tue Dec 3 05:08:33 CST 2019
At 01:20 AM 3/12/2019 -0200, you wrote:
>I cannot understand your problems with PDF files.
>I've created lots and lots of PDFs, with treated and untreated scanned
>material. All of them are very readable and in use for years. Of course,
>garbage in, garbage out. I take the utmost care in my scans to have good
>enough source files, so I can create great PDFs.
>Of course, Guy's comments are very informative and I'll learn more from them.
>But I still believe in good preservation using PDF files. FOR ME it is the
>best we have in encapsulating info. Forget HTMLs.
I don't propose html as a viable alternative. It has massive inadequacies
for representing physical documents. I just use it for experimenting and
as a temporary wrapper, because it's entirely transparent and malleable.
ie I have total control over the result (within the bounds of what html
>Please, take a look at this PDF, and tell me: Isn't that good enough for
OK, not too bad in comparison to many others. But a few comments:
* The images are fax-mode, and although the resolution is high enough for there to be
no ambiguities, it still looks bad and stylistically greatly differs from the original.
Pity I don't have a copy of the original, to make demonstration scans of a few
illustrations to show what it could be like, for similar file size.
* The text is OCR, with a font I expect likely approximates the original fairly well.
Though I'd like to see the original. I suspect the PDF font is a bit 'thick' due to
incorrect gray threshold.
Also it's searchable, except that the OCR process included paper blemishes as 'characters'
so if you copy-paste the text elsewhere you have to carefully vet it. And not all searches
This is an illustration of the point that until we achieve human-level AI, it's never
going to be possible to go from images to abstracted OCR text automatically without considerable
human oversight and proof-reading. And... human-level AI won't _want_ to do drudgery like that.
* Your automated PDF generation process did a lot of silly things, like chaotic attempts to
OCR 'elements' of diagrams. Just try moving a text selection box over the diagrams, you'll
see what I mean. Try several diagrams, it's very random.
* The PCB layouts, e.g. PDF pages 28 and 29: I bet the original used light shading to represent
copper, and details over the copper were clearly visible. But when you scanned it in bi-level
all that is lost. These _have_ to be in gray scale, and preferably post-processed to posterize
the flat shading areas (for better compression as well as visual accuracy.)
* Why are all the diagram pages variously different widths? I expect the original pages (foldouts?)
had common sizes. This variation is because either you didn't use a fixed recipe for scanning
and processing, or your PDF generation utility 'handled' that automatically (and messed up.)
* You don't have control of what was OCR'd and what wasn't. For instance, why OCR table contents,
if the text selection results are garbage? E.g., select the entire block at the bottom of
PDF page 48. Does the highlighting create a sense of confidence this is going to work?
Now copy and paste into a text editor. Is the result useful? (No.)
OCR can be over-used.
* 'Ownership': As well as your introduction page, you put your tag on every single page.
Pretty much everyone does something like this. As if by transcribing the source material you
acquired some kind of ownership or bragging rights. But no, others put a very great deal of
effort into creating that work, and you just made a digital copy, one that the originators probably
would consider an aesthetic insult to their efforts. So, why the proud tags everywhere?
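Going back to the PCB layout point above, here is a rough Python sketch of what I mean by
bi-level losing the shading and posterizing keeping it. It is only an illustration, not
anyone's actual workflow: the gray levels, page size and noise amount are made-up assumptions,
the 'scan' is synthetic, and zlib is just a stand-in for whatever compression the final file
format would really use.

```python
import random
import zlib

# Hypothetical gray levels for a scanned PCB layout page (0 = black, 255 = white).
PAPER, COPPER, INK = 235, 180, 30


def make_scan(width=200, height=100, noise=4, seed=1):
    """Synthetic 8-bit grayscale scan: paper, a lightly shaded copper area,
    and a dark ink line drawn over the copper, all with scanner noise."""
    rng = random.Random(seed)
    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            if 50 <= x < 150:
                level = INK if 48 <= y < 52 else COPPER
            else:
                level = PAPER
            row.append(max(0, min(255, level + rng.randint(-noise, noise))))
        rows.append(row)
    return rows


def bilevel(rows, threshold=128):
    """Fax-style result: every pixel becomes pure black or pure white."""
    return [[0 if p < threshold else 255 for p in row] for row in rows]


def posterize(rows, levels=(INK, COPPER, PAPER)):
    """Snap each pixel to the nearest of a few chosen gray levels."""
    return [[min(levels, key=lambda g: abs(g - p)) for p in row] for row in rows]


def packed_size(rows):
    """Bytes after zlib compression, as a rough stand-in for file size."""
    return len(zlib.compress(bytes(p for row in rows for p in row), 9))


scan = make_scan()
fax = bilevel(scan)                     # copper shading turns into plain white
too_dark = bilevel(scan, threshold=200)  # copper turns black and swallows the ink line
poster = posterize(scan)                # paper, copper and ink stay distinct

print("copper vs paper, bi-level  :", fax[10][100], fax[10][10])
print("copper vs ink, threshold 200:", too_dark[10][100], too_dark[50][100])
print("copper vs paper, posterized:", poster[10][100], poster[10][10])
print("compressed bytes, raw gray  :", packed_size(scan))
print("compressed bytes, posterized:", packed_size(poster))
```

With a low threshold the copper area vanishes into the paper, with a high one it goes solid
black and hides the detail drawn over it. The posterized version keeps all three tones, and
because the flat areas really are flat afterwards, it packs down far smaller than the noisy
raw gray scan.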
Summary: It's fine as a working copy for practical use. Better to have made it than not, so long
as you didn't destroy the paper original in the process. But if you're talking about an archival
historical record, that someone can look at in 500 years (or 5000) and know what the original
actually looked like, how much effort went into making that ink crisp and accurate, then no.
It's not good enough.
To be fair, I've never yet seen any PDF scan of any document that I'd consider good enough.
Works created originally in PDF as line art are a different class, and typically OK. Though
some other flaws of PDF do come into play. Difficulty of content export, problems with global
page parameters, font failures, sequential vs content page numbers, etc.
With scanning there are multiple points of failure right through the whole process at present,
ranging from misunderstandings of the technology among people doing scanning, problems with
scanners (why are edge scanners so rare!?), lack of critical capabilities in post-processing
utilities (line art on top of ink screening, it's a nightmare, also most people can't use
Photoshop well, and it's necessary), failings built unavoidably into PDF, and not so great
PDF viewer utilities. And that's apart from the intrinsic issues (aside from a few advantages) with
on-screen display and controls compared to paper.
I hope I have not offended you. Btw my pickiness comes from growing up in a family with
commercial art, typography, printing and technical art involvement. And having in later years
assisted a little with such things. So at least I know how much effort goes into such things.
Keep the original. Methods and utilities will improve, and in 10 or 20 years it may be possible
to make a visually perfect digital copy (with minimal effort), worthy of becoming a sole record
of that thing (if history goes that way.)