Scanning docs for bitsavers

Paul Koning paulkoning at
Tue Dec 3 11:46:52 CST 2019

> On Dec 2, 2019, at 11:12 PM, Grant Taylor via cctalk <cctalk at> wrote:
> On 12/2/19 9:06 PM, Grant Taylor via cctalk wrote:
>> In my opinion, PDFs are the last place that computer usable data goes. Because getting anything out of a PDF as a data source is next to impossible.
>> Sure, you, a human, can read it and consume the data.
>> Try importing a simple table from a PDF and working with the data in something like a spreadsheet.  You can't do it.  The raw data is there.  But you can't readily use it.
>> This is why I say that a PDF is the end of the line for data.
>> I view it as effectively impossible to take data out of a PDF and do anything with it without first needing to reconstitute it before I can use it.
> I'll add this:
> PDF is a decent page layout format.  But trying to view the contents in any different layout is problematic (at best).
> Trying to use the result of a page layout as a data source is ... problematic.

That's hardly surprising.  These properties are precisely the intent of PDF.  It's basically a portable variant of PostScript, with some cleanups (relatively sane Unicode support, transparency, hyperlinks, a few other things).  Its specific purpose is to encode page images, just as they appear on actual paper.  Indeed, PDF is often used as a "camera ready copy" format for material going to a print shop.  It works quite well for that.

For scanned documents, where each page is just an image, PDF is a decent container format.  For documents with actual text, it's far more problematic.
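
To make that concrete, here is a minimal sketch in Python of what a consumer is left with when it tries to recover a table from page-layout output. It assumes, for illustration only, that a page boils down to text fragments with coordinates in drawing order; the `fragments` list, the `guess_rows` helper, and the `y_tolerance` parameter are hypothetical and do not come from any real PDF library.

```python
# Hypothetical page model: (x, y, text) fragments in the order they were drawn.
# Nothing here says "row" or "column"; only positions on the page survive.
fragments = [
    (300.0, 700.0, "Qty"),
    (72.0, 700.0, "Part"),
    (72.0, 684.2, "DL11"),
    (300.0, 684.0, "2"),
    (300.0, 668.0, "1"),
    (72.0, 668.1, "RK05"),
]

def guess_rows(frags, y_tolerance=2.0):
    """Heuristically rebuild rows: group by similar y, then sort each row by x."""
    rows = []  # list of (reference_y, [(x, text), ...])
    for x, y, text in sorted(frags, key=lambda f: -f[1]):  # top of page first
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))  # close enough to the current row
        else:
            rows.append((y, [(x, text)]))  # start a new row
    return [[text for _, text in sorted(cells)] for _, cells in rows]

table = guess_rows(fragments)
for row in table:
    print(row)
```

The table only comes back because the tolerance happens to suit this page. Tighten `y_tolerance` and slightly misaligned baselines split into separate rows; loosen it and adjacent rows merge. That guesswork is the "reconstitute it" step described in the quoted message.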

Using PDF as an intermediate form is every bit as inappropriate as using JPEG for line art or any other application where artefacts are impermissible.  The trouble in both cases is that many users don't know the limitations and blindly use the wrong tool.

