Scanning docs for bitsavers

3 Dec 2019

...
  On Dec 2, 2019, at 11:12 PM, Grant Taylor via cctalk
<cctalk at classiccmp.org> wrote:
 On 12/2/19 9:06 PM, Grant Taylor via cctalk wrote:
  In my opinion, PDFs are the last place that computer usable data goes. Because getting
anything out of a PDF as a data source is next to impossible.
 Sure, you, a human, can read it and consume the data.
 Try importing a simple table from a PDF and working with the data in something like a
spreadsheet.  You can't do it.  The raw data is there.  But you can't readily use it.
 This is why I say that a PDF is the end of the line for data.
 I view it as effectively impossible to take data out of a PDF and do anything with it
without first needing to reconstitute it before I can use it. 
 I'll add this:
 PDF is a decent page layout format.  But trying to view the contents in any different
layout is problematic (at best).
 Trying to use the result of a page layout as a data source is ... problematic. 
That's hardly surprising.  These properties are precisely the intent of PDF.  It's
basically a portable variant of PostScript, with some cleanups (relatively sane Unicode
support, transparency, hyperlinks, a few other things).  Its specific purpose is to encode
page images, just as they appear on actual paper.  Indeed, PDF is often used as a
"camera ready copy" format for material going to a print shop.  It works quite
well for that.
For scanned documents, where each page is just an image, PDF is a decent container format.
For documents with actual text, it's far more problematic.
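To illustrate why text in a page layout has to be reconstituted before it is usable as data, here is a minimal sketch in Python. It assumes the text has already been pulled out of the page as positioned fragments (x, y, string), which is roughly all a page description records; the fragment list, the function name rebuild_rows, and the tolerance value are hypothetical and chosen only for illustration. Nothing in the input says which fragments belong to the same row or column, so the table has to be guessed from coordinates.

```python
# Hypothetical positioned text fragments, in the order they might be drawn.
# Each is (x, y, text); y grows upward, as in default PDF user space.
# There is no notion of "row", "column", or "cell" in this data.
fragments = [
    (200.0, 700.2, "Qty"),
    (72.0, 700.0, "Part"),
    (72.0, 685.1, "7400"),
    (200.0, 685.0, "12"),
    (200.0, 670.0, "3"),
    (72.0, 669.8, "2114"),
]


def rebuild_rows(fragments, y_tolerance=2.0):
    """Guess table rows: group fragments with similar y, then order by x."""
    rows = []  # each entry is (reference_y, [(x, text), ...])
    # Walk the fragments from the top of the page to the bottom.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            # Close enough vertically: assume it is on the same row.
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, read left to right.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


for row in rebuild_rows(fragments):
    print("\t".join(row))
```

Even this toy case depends on a hand-picked tolerance, and it would fall apart with wrapped cells, empty cells, or multi-column pages, which is the kind of reconstitution work described above.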
Using PDF as an intermediate form is every bit as inappropriate as using JPEG for line art
or any other application where artefacts are impermissible.  The trouble (for both of
these) is that many of the users don't know the limitations and blindly use the wrong
tools.
        paul
