Linearizing PDF scans

Mon Aug 16 14:39:44 CDT 2021

On Sunday, August 15, 2021 at 1:20, Alexander Schreiber wrote:

> My current toolchain for that:

Thanks; that was quite helpful.

One feature that I find very helpful in navigating large PDF manuals is
having the original page numbers.  Often a manual will contain references,
e.g., to "page 4-13" or "Appendix B-23".  Having just a set of ascending
integers for PDF page numbers and having to guess where Section 4 page 13
might be in that list is difficult, especially when PDF page 1 doesn't
correspond to manual page 1-1 and the sections are very large.  Being able
to enter a referenced page number directly into a PDF reader's "go to page"
dialog is very convenient.
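
To illustrate the idea, here is a rough Python sketch of how a viewer might
resolve such a reference.  PDF stores page labels as ranges, each with a
starting page index, a prefix (/P), and a starting number (/St).  The
class and function names, the section layout, and the page count below are
all made up for the example, and only decimal numbering is handled:

```python
from dataclasses import dataclass


@dataclass
class LabelRange:
    first_index: int  # 0-based PDF page index where this range begins
    prefix: str       # label prefix, e.g. "4-" (the /P entry)
    start: int = 1    # number given to the first page of the range (/St)


def page_labels(ranges, page_count):
    """Return the label of every page, decimal numbering only."""
    ordered = sorted(ranges, key=lambda r: r.first_index)
    labels = []
    for index in range(page_count):
        # The governing range is the last one starting at or before index.
        current = None
        for r in ordered:
            if r.first_index <= index:
                current = r
        if current is None:
            # Assumption: pages before the first range get plain numbers.
            labels.append(str(index + 1))
        else:
            number = current.start + index - current.first_index
            labels.append(f"{current.prefix}{number}")
    return labels


def find_page(ranges, page_count, wanted):
    """Return the 0-based PDF page index for a label, or None."""
    labels = page_labels(ranges, page_count)
    return labels.index(wanted) if wanted in labels else None


# Hypothetical manual: section 1 starts at PDF page index 0, section 4 at
# index 10, and Appendix B at index 60, in a 100-page file.
ranges = [LabelRange(0, "1-"), LabelRange(10, "4-"), LabelRange(60, "B-")]
print(find_page(ranges, 100, "4-13"))  # 22, i.e. the 23rd page of the file
```

A scanned manual without such labels leaves the reader to do this
arithmetic by hand.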

> >   http://www.leptonica.org/
> 
> Thanks for the pointer, I'm going to take a look - apparently
> tesseract uses leptonica for some image processing work. 

You're welcome.  Yes, tesseract is one of the major users of Leptonica.

When I first started using the library about ten years ago, I found the 
documentation very reminiscent of those school mathematics textbooks that 
said, "The proof is left as an exercise for the reader."  There were a 
couple of examples on the host site but no comprehensive index of the 2500+ 
library routines.  The approach was, "read the source," which was fine if 
one was familiar with image processing terms, such as affine 
transformations, morphology, convolution, and octcube-based color 
quantization.

It may be better now, but it was something of an intellectual challenge at 
the time.

> What is that? Never heard of linearizing PDF before....

It's documented in the PDF Reference Manual from Adobe.  Apparently, it's 
been around since PDF 1.2.  The introduction to the chapter says:

   A linearized PDF file is one that has been organized in a special way
   to enable efficient incremental access in a network environment. The
   file is valid PDF in all respects, and it is compatible with all
   existing viewers and other PDF applications. Enhanced viewers can
   recognize that a PDF file has been linearized and can take advantage
   of that organization to enhance viewing performance.

...which, as others have mentioned, essentially allows page-at-a-time
access via a browser without having to download the entire file first.
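
For the curious, a linearized file announces itself with a parameter
dictionary containing a /Linearized key, which must appear within the
first 1024 bytes.  A minimal Python sketch of a check based on that follows.
The function name is my own, and this is only a heuristic: it does not
verify that the rest of the file really follows the linearized layout.

```python
import re


def looks_linearized(data: bytes) -> bool:
    """Heuristic: is there a /Linearized key near the start of the file?"""
    head = data[:1024]
    # Assumption: the %PDF- header sits at the very start of the file.
    if not head.startswith(b"%PDF-"):
        return False
    # The key is followed by a version number, e.g. "/Linearized 1".
    return re.search(rb"/Linearized\s+\d", head) is not None


# Tiny hand-made fragments for illustration; these are not complete PDFs.
sample_linearized = b"%PDF-1.4\n1 0 obj\n<< /Linearized 1 /L 12345 >>\nendobj\n"
sample_plain = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
print(looks_linearized(sample_linearized), looks_linearized(sample_plain))
```

For a real file, one would read the first kilobyte in binary mode and pass
it to the function.  A proper tool, such as qpdf with its
--check-linearization option, examines far more than this.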

                                      -- Dave