Linearizing PDF scans

Wayne Sudol Wayne.Sudol at hotmail.com
Mon Aug 16 16:20:54 CDT 2021


Out of curiosity, is there a reason you do not use Acrobat for creating
PDFs?

-----Original Message-----
From: cctech [mailto:cctech-bounces at classiccmp.org] On Behalf Of J. David
Bryan via cctech
Sent: Monday, August 16, 2021 12:40 PM
To: Classic Computing List
Subject: Re: Linearizing PDF scans

On Sunday, August 15, 2021 at 1:20, Alexander Schreiber wrote:

> My current toolchain for that:

Thanks; that was quite helpful.

One aspect that I find of great assistance in navigating large PDF manuals
is original page numbers.  Often a manual will contain references, e.g., to
"page 4-13" or "Appendix B-23".  Having just a set of ascending integers for
PDF page numbers and having to guess where Section 4 page 13 might be in
that list is difficult, especially when PDF page 1 doesn't correspond to
manual page 1-1 and the sections are very large.  Being able to enter a
referenced page number directly into a PDF reader's "go to page" dialog is
very convenient.
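
For reference, PDF stores such page numbers as "page labels": a /PageLabels
number tree in the document catalog that maps zero-based page indices to
labeling ranges, each with an optional style (/S), prefix (/P), and starting
number (/St).  What follows is only a minimal sketch of generating that
entry, not the method used for the scans discussed here.  The function name
page_labels_nums and the example layout (six roman-numbered front-matter
pages, then sections "1-" and "2-") are assumptions made up for
illustration, and prefixes are assumed to contain no parentheses or
backslashes, which would need escaping.

```python
def page_labels_nums(ranges):
    """Build the text of a PDF /PageLabels entry from labeling ranges.

    Each range is a tuple (first_page_index, style, prefix, start):
      first_page_index  zero-based PDF page where the range begins
      style             PDF numbering style name, e.g. "D" (decimal) or
                        "r" (lowercase roman); None for prefix-only labels
      prefix            literal text placed before the number, e.g. "4-"
      start             first number used in the range
    """
    parts = []
    # Number-tree keys must appear in ascending order of page index.
    for index, style, prefix, start in sorted(ranges, key=lambda r: r[0]):
        entries = []
        if style:
            entries.append(f"/S /{style}")
        if prefix:
            entries.append(f"/P ({prefix})")
        if start != 1:  # /St defaults to 1, so it is omitted in that case
            entries.append(f"/St {start}")
        parts.append(f"{index} << {' '.join(entries)} >>")
    return "/PageLabels << /Nums [ " + " ".join(parts) + " ] >>"


# Hypothetical manual: pages i-vi, then section 1 at PDF page 7 and
# section 2 at PDF page 31 (indices 6 and 30).
example = page_labels_nums([
    (0, "r", "", 1),
    (6, "D", "1-", 1),
    (30, "D", "2-", 1),
])
print(example)
```

With labels like these in place, a reader that honors page labels can jump
to "2-5" directly instead of requiring the absolute page number.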


> >   http://www.leptonica.org/
> 
> Thanks for the pointer, I'm going to take a look - apparently 
> tesseract uses leptonica for some image processing work.

You're welcome.  Yes, tesseract is one of the major users of Leptonica.

When I first started using the library about ten years ago, I found the
documentation very reminiscent of those school mathematics textbooks that
said, "The proof is left as an exercise for the reader."  There were a
couple of examples on the host site but no comprehensive index of the 2500+
library routines.  The approach was, "read the source," which was fine if
one was familiar with image processing terms, such as affine
transformations, morphology, convolution, and octcube-based color
quantization.

It may be better now, but it was something of an intellectual challenge at
the time.


> What is that? Never heard of linearizing PDF before....

It's documented in the PDF Reference Manual from Adobe.  Apparently, it's 
been around since PDF 1.2.  The introduction to the chapter says:

   A linearized PDF file is one that has been organized in a special way
   to enable efficient incremental access in a network environment. The
   file is valid PDF in all respects, and it is compatible with all
   existing viewers and other PDF applications. Enhanced viewers can
   recognize that a PDF file has been linearized and can take advantage
   of that organization to enhance viewing performance.

...which, as others have mentioned, essentially allows page-at-a-time
access via a browser without having to download the entire file first.
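
A quick way to see whether a given file claims to be linearized: the
specification requires the linearization parameter dictionary, which carries
a /Linearized key, to sit within the first 1024 bytes of the file.  The
sketch below checks only for that marker.  It is a heuristic under that
assumption, not a validation: it does not inspect the hint tables, and a
file that was linearized and later incrementally updated will still pass.
The function names are made up for illustration.

```python
import re


def looks_linearized(data: bytes) -> bool:
    """Return True if the leading bytes of a PDF carry a /Linearized key.

    Only the first 1024 bytes are examined, since that is where the
    linearization parameter dictionary is required to be.
    """
    head = data[:1024]
    if b"%PDF-" not in head:
        return False
    # The key is followed by a version number, e.g. "/Linearized 1".
    return re.search(rb"/Linearized\s+[\d.]", head) is not None


def file_looks_linearized(path: str) -> bool:
    """Apply the same check to a file on disk, reading just its head."""
    with open(path, "rb") as f:
        return looks_linearized(f.read(1024))
```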

                                      -- Dave



