Question about PDF manipulation

3 Jun 2005

On 2 Jun, 2005, at 19:40, Barry Watzman wrote:
...
  ...
...
  I really don't think that you understand the nature and capabilities
  of the product (Acrobat) that you are criticizing.
I have used Acrobat since the first beta.
Your comment just proves to me that you seem to be totally clueless 
about the important issues in modern electronic dissemination of 
information, so I guess I have to explain it in more detail.

I am writing this with the help of Display PDF, the engine used for the 
display in Mac OS X. It is much superior to the inelegant kludges that 
make up other graphical display systems. Adobe is a company that 
understands about nice font rendering and the presentation of things. 
That is also what Acrobat is about: it is a way to make sure that the 
print shop is printing your document the way you intended. That is what 
a PDF file is good for, and in every other respect it is an inferior 
format.

I have been working with communication and transfer of information 
between dissimilar systems for the last 30 years, so I have some idea 
about what is beneficial and what is harmful. Doing it the simple way 
is often more useful than trying to add features. Using a proprietary 
solution is usually bad, even if it is supposed to be the most 
widespread and "everybody" is using it. There are exceptions, but they 
need to be researched and documented in each individual case.

If you scan a book, you end up with bitmaps of the pages. If you stuff 
these bitmaps into a PDF container, the only value you add is that they 
are kept together in sequence. The value you subtract is that they are 
no longer readily available for everybody, and anybody who wants to OCR 
them to make any kind of index or cross reference will have to use 
proprietary software or get them extracted from the container they are 
put in.

Now, if you do it the simple way, you use a suitably named directory 
instead of the PDF file. In that directory, you can keep individual 
PNGs for each page you scanned, named P000 and up (use whatever 
starting value is suitable to reflect the page numbers in the book). If 
the book is divided into chapters, you make a subdirectory for each chapter and 
appendix, and place the page scans in there instead. Thus you retain 
the structure of the original without using any proprietary format, and 
everybody with a graphical display will be able to use the scanned 
book. There are numerous image viewers available for all platforms, 
many will be graphical browsers which will navigate your pages and 
directories rather better than an Acrobat reader. Many suitable viewers 
are available from open source projects, so you can build one even on 
platforms which have been neglected for years.
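
A minimal sketch of that layout in Python, assuming three-digit page 
names with a .png extension and chapter directories such as 
"chapter03" (the function names and the chapter naming are only 
illustrative, not part of any existing tool):

```python
from pathlib import Path


def page_path(book_dir, page_number, chapter=None):
    """Build the path of one page scan, e.g. manual/chapter03/P052.png.

    The P000-style name follows the scheme described above; the chapter
    directory name is whatever the maintainer chooses (an assumption here).
    """
    base = Path(book_dir)
    if chapter is not None:
        base = base / chapter
    return base / f"P{page_number:03d}.png"


def list_pages(book_dir):
    """Return every page scan below book_dir, ordered by directory, then page name."""
    return sorted(Path(book_dir).rglob("P*.png"))
```

Because the page number is zero padded, a plain alphabetical sort keeps 
the pages in reading order, which is all most image viewers need.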

The most elegant solution, though, is to use an ordinary web browser to 
access the pages. It is trivially simple to make a website out of this 
directory structure, and there are many free server-side products 
available to make access user friendly without any effort by the 
maintainer.
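
As a rough illustration of how little is needed, the following Python 
sketch turns a list of page file names into a static index page that 
any web server can deliver. The function name and the HTML layout are 
my own assumptions; a real server-side product would do more.

```python
from html import escape


def build_index_html(title, page_files):
    """Return a plain HTML index with one link per page scan.

    page_files is a list of relative file names such as "P052.png".
    """
    items = "\n".join(
        f'<li><a href="{escape(name, quote=True)}">{escape(name)}</a></li>'
        for name in page_files
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{escape(title)}</title></head>\n"
        f"<body><h1>{escape(title)}</h1>\n<ul>\n{items}\n</ul>\n</body></html>\n"
    )
```

Writing the returned string to index.html in each directory is enough 
for a reader to reach any single page without fetching the others.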

It is in the web scenario that we most clearly see the "value subtracted" 
nature of PDF. If I want to look at the information of page 52, I have 
to get the whole document. That will waste bandwidth, and it makes the 
server more expensive to operate. Besides, I get thrown out of the 
normal working mode for my web browser and into the different mode of 
the Acrobat plugin (if that is supported on the platform I use, 
otherwise it will be the standalone reader or GhostView or something).

The next important feature of the open solution is that it encourages a 
collaborative effort to add useful things like indexes, cross references 
and even full text versions. Take a look at Wikipedia to see what it is 
possible to accomplish when things are kept in open, universally 
accessible form. A repository of technical information could be set up 
the same way, and it would become gradually more useful as people added 
their comments and index hints for the scanned pages as 
metainformation. To OCR the pages just for use as aids to searching and 
indexing would be simple: the raw OCR output could be given the same 
name as the scanned page, just with a .txt extension. If somebody later 
on were to proofread and mark it up, that would lead to a .xml 
document. These kinds of possibilities are only available if the 
documents are kept in a simple, logical structure that is accessible to 
as many as possible, not just for reading but for further refinement.
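
The naming convention can be sketched in a few lines of Python. The 
helper names below are hypothetical, and the sketch only computes file 
names and finds pages that still lack raw OCR text; it does not run any 
OCR itself.

```python
from pathlib import Path


def sidecar_paths(page_scan):
    """Map a page scan to its companion files: same name, other extension."""
    page = Path(page_scan)
    return {
        "ocr_text": page.with_suffix(".txt"),  # raw OCR output
        "markup": page.with_suffix(".xml"),    # proofread, marked-up version
    }


def pages_missing_ocr(book_dir):
    """List page scans below book_dir that have no .txt file beside them yet."""
    return [
        page
        for page in sorted(Path(book_dir).rglob("P*.png"))
        if not page.with_suffix(".txt").exists()
    ]
```

A volunteer could run something like pages_missing_ocr over the 
repository to see where help is still needed.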

In order to avoid technical lock-in today, my preferred document format 
is XML with CSS styling, either with an XHTML DTD or, ideally, a DTD 
tailored to the usage area and reflected by the stylesheets. Bitmap 
images are ideally PNGs, photographs JPEG 2000, and vector images are 
SVG.  There exists a plethora of free tools to work with, transform and 
generate this kind of document.

The most harmful things for anybody who wants a useable, syntactical 
web, are lock-in formats. The worst by far is Flash, with Microsoft 
Office formats closely following (even when the output is supposed to 
be HTML), but PDF is a good third.

-- 
-bv
