Question about PDF manipulation

8 Jun 2005

Apologies to the list for spending yet more bandwidth on this, but Bjørn
actually presented an alternative in a reasonable way (thanks!), and I think
it is important to address that...
From: "Bjørn Vermo" <bv at norbionics.com>
...
  I have used Acrobat since the first beta. ...
...
  I am writing this with the help of Display PDF, the engine used for the
 display in Mac OS X. It is much superior to the inelegant kludges that
 make up other graphical display systems. Adobe is a company that
 understands about nice font rendering and the presentation of things.
 That is also what Acrobat is about: it is a way to make sure that the
 print shop is printing your document the way you intended. That is what
 a PDF file is good for, and in every other respect it is an inferior
 format. ...
...
  If you scan a book, you end up with bitmaps of the pages. If you stuff
 these bitmaps into a PDF container, the only value you add is that they
 are kept together in sequence. The value you subtract is that they are
 no longer readily available for everybody, and anybody who wants to OCR
 them to make any kind of index or cross reference will have to use
 proprietary software or get them extracted from the container they are
 put in. 
Now, by your own admission, I have also added value to the way it will be
displayed and printed by Acrobat users.  This is also consistent with my
experience.  Acrobat really does understand scaling and viewability, to a
degree generally not matched elsewhere.
There is also an argument whether PDF format actually makes the document
unavailable to more people, or available to more people.
...
  Now, if you do it the simple way, you use a suitably named directory
 instead of the PDF file. In that directory, you can keep individual
 PNGs for each page you scanned, named P000 and up (use whatever
 starting value is suitable to reflect the page numbers in the book). If
 the book is divided into chapters, you make a subdirectory for each
 chapter and
 appendix, and place the page scans in there instead. Thus you retain
 the structure of the original without using any proprietary format, and
 everybody with a graphical display will be able to use the scanned
 book. There are numerous image viewers available for all platforms;
 many are graphical browsers which will navigate your pages and
 directories rather better than an Acrobat reader. 
Here I completely disagree.  I have a graphical display, and I can't view a
TIFF file or a PNG file easily at all.  All the viewers that I have have
non-uniform, kludgy interfaces.  They also require me to start a separate
application or plug-in, just as PDF does.  None of them render the pages as
well as Acrobat, and the interfaces for browsing directories are laughable.
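For reference, the per-page layout proposed in the quoted text above could be sketched roughly as follows. This is only an illustration: the function names, the three-digit zero padding (taken from the "P000" example), the ".png" extension, and the chapter directory names are assumptions, not anything specified in the thread.

```python
from pathlib import Path
from typing import List, Optional


def page_path(root: Path, page_number: int, chapter: Optional[str] = None) -> Path:
    """Path of one scanned page, e.g. root/chapter03/P052.png.

    Assumes three-digit zero padding, following the "P000" naming example.
    """
    directory = root / chapter if chapter else root
    return directory / f"P{page_number:03d}.png"


def list_pages(root: Path) -> List[Path]:
    """All page scans under root, sorted by directory and then by file name."""
    return sorted(root.rglob("P*.png"))
```

Because the page numbers are zero padded, a plain sort returns the pages in reading order within each chapter directory, which is what any generic image browser would rely on.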
...
  Many suitable viewers are available from open source projects, so you
 can build one even on platforms which have been neglected for years. 
Yeah, I guess.  If I wanted to download, install, and maintain all that
infrastructure, just to read documents on my screen.
...
  The most elegant solution, though, is to use an ordinary web browser to
 access the pages. It is trivially simple to make a website out of this
 directory structure, and there are many free server-side products
 available to make access user friendly without any effort by the
 maintainer. 
Here, I'd point out that my web browser is awful at viewing image data.  It
wants to present the image pixel-for-pixel the way it is stored, or the way
your web page told it to.  Which knows nothing of my screen resolution, let
alone my preferred work style.  With the Acrobat plug-in, at least I can set
the zoom and still read the work in question.
...
  It is in the web scenario we most clearly see the "value subtracted"
 nature of PDF. If I want to look at the information of page 52, I have
 to get the whole document. That will waste bandwidth, and it makes the
 server more expensive to operate. 
Well, sorta.  What I actually do, and I doubt I am alone in this, is
download the thing, and read it locally (performs better), which means I
don't have to come back to you the next time I refer to the document.  Since
the usual case is that I will read most or all of the document, the PDF
actually saves you significant server bandwidth.
Plus, it automatically creates a back-up of the document, with a nearly
trivial effort on my part.
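The two bandwidth arguments above can be compared with a little arithmetic. The sketch below is a rough model with made-up numbers, not measurements: it assumes the PDF is about as large as the sum of its page images, that a downloaded PDF is read locally afterwards, and that per-page fetches are never cached by the browser. All names and figures are hypothetical.

```python
def bytes_served_as_pdf(page_count: int, page_bytes: int) -> int:
    """Whole document downloaded once; later reads use the local copy."""
    return page_count * page_bytes


def bytes_served_as_pages(pages_per_visit: int, visits: int, page_bytes: int) -> int:
    """Each visit fetches its pages from the server again (no caching assumed)."""
    return pages_per_visit * visits * page_bytes


# Hypothetical document: 200 scanned pages of 100,000 bytes each.
PAGES, PAGE_BYTES = 200, 100_000

# Looking up a single page once: per-page serving is far cheaper.
single_lookup = bytes_served_as_pages(1, 1, PAGE_BYTES)      # 100,000
# Reading 180 pages on each of three visits: the one-time PDF is cheaper.
repeat_reader = bytes_served_as_pages(180, 3, PAGE_BYTES)    # 54,000,000
whole_pdf = bytes_served_as_pdf(PAGES, PAGE_BYTES)           # 20,000,000
```

Under these assumptions, which format costs the server less depends entirely on how many pages a typical reader fetches and how often they come back.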
...
  Besides, I get thrown out of the normal working mode for my web browser
 and into the different mode of the Acrobat plugin (if that is supported
 on the platform I use,
...
  otherwise it will be the standalone reader or GhostView or something). 
True, but this happens for essentially everything except GIF and JPEG.
(Most of the formats you are talking about will invoke the hideous Quicktime
plugin by default.)
...
  The next important feature of the open solution is that it encourages a
 collaborative effort to add useful things like indexes, cross references
 and even full text versions. Take a look at Wikipedia to see what it is
 possible to accomplish when things are kept in open, universally
 accessible form. A repository of technical information could be set up
 the same way, and it would become gradually more useful as people added
 their comments and index hints for the scanned pages as
 metainformation. To OCR the pages just for use as aids to searching and
 indexing would be simple, the raw OCR output could be given the same
 name as the scanned page just with a .txt extension. If somebody later
 on were to proofread and mark it up, that would lead to a .xml
 document. These kinds of possibilities are only available if the
 documents are kept in a simple, logical structure that is accessible to
 as many as possible, not just for reading but for further refinement. 
I concur that working on the document is easier with the pages split up.
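The file-naming convention described in the quoted text (raw OCR output next to each scan as .txt, a proofread version as .xml) might look like this in practice. This is a minimal sketch; the function names are hypothetical, and it assumes the new extension replaces ".png" rather than being appended to it, which the original message does not spell out.

```python
from pathlib import Path
from typing import List


def ocr_text_path(scan: Path) -> Path:
    """Raw OCR output: same name as the scan, with a .txt extension."""
    return scan.with_suffix(".txt")


def markup_path(scan: Path) -> Path:
    """Proofread, marked-up version: same name, with a .xml extension."""
    return scan.with_suffix(".xml")


def pages_missing_ocr(root: Path) -> List[Path]:
    """Scanned pages under root that have no .txt file beside them yet."""
    return sorted(p for p in root.rglob("*.png") if not ocr_text_path(p).exists())
```

A volunteer could run something like `pages_missing_ocr` to find pages that still need work, which is the kind of incremental contribution that per-page files make easy.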
...
  In order to avoid technical lock-in today, my preferred document format
 is XML with CSS styling, either with an XHTML DTD or, ideally, a DTD
 tailored to the usage area and reflected by the stylesheets. Bitmap
 images are ideally PNGs, photographs JPEG 2000, and vector images are
 SVG.  There exists a plethora of free tools to work with, transform and
 generate this kind of document. 
Here you are mistaken, on several fronts.  XML with CSS fails to "keep it
simple" or make it easy for me to contribute.  PNG is only renderable on my
system with the nearly useless Quicktime plugin.  I don't know what SVG is,
but I doubt I can render it at all.  Free tools are well and good until you
count up the time it will cost me to bring up a PC-based development
environment and keep it and the "free" tools working.
...
  The most harmful things for anybody who wants a useable, syntactical
 web, are lock-in formats. The worst by far is Flash, with Microsoft
 Office formats closely following (even when the output is supposed to
 be HTML), but PDF is a good third. 
I agree on the hideousness of Flash and Office.  (What is a "useable,
syntactical web", and why do I want one?)
    Vince
