Linearizing PDF scans
J. David Bryan
jdbryan at acm.org
Mon Aug 16 14:39:44 CDT 2021
On Sunday, August 15, 2021 at 1:20, Alexander Schreiber wrote:
> My current toolchain for that:
Thanks; that was quite helpful.
One aspect that I find of great assistance in navigating large PDF manuals
is original page numbers. Often a manual will contain references, e.g., to
"page 4-13" or "Appendix B-23". Having just a set of ascending integers
for PDF page numbers and having to guess where Section 4 page 13 might be
in that list is difficult, especially when PDF page 1 doesn't correspond to
manual page 1-1 and the sections are very large. Being able to enter a
referenced page number directly into a PDF reader's "go to page" dialog is
> > http://www.leptonica.org/
> Thanks for the pointer, I'm going to take a look - apparently
> tesseract uses leptonica for some image processing work.
You're welcome. Yes, tesseract is one of the major users of Leptonica.
When I first started using the library about ten years ago, I found the
documentation very reminiscent of those school mathematics textbooks that
said, "The proof is left as an exercise for the reader." There were a
couple of examples on the host site but no comprehensive index of the 2500+
library routines. The approach was, "read the source," which was fine if
one was familiar with image processing terms, such as affine
transformations, morphology, convolution, and octcube-based color
It may be better now, but it was something of an intellectual challenge at
> What is that? Never heard of linearizing PDF before....
It's documented in the PDF Reference Manual from Adobe. Apparently, it's
been around since PDF 1.2. The introduction to the chapter says:
  A linearized PDF file is one that has been organized in a special way
  to enable efficient incremental access in a network environment. The
  file is valid PDF in all respects, and it is compatible with all
  existing viewers and other PDF applications. Enhanced viewers can
  recognize that a PDF file has been linearized and can take advantage
  of that organization to enhance viewing performance.
...which, as others have mentioned, essentially allows page-at-a-time
access via a browser without having to download the entire file first.
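For anyone who wants to check whether a given scan has already been
linearized, here is a rough sketch in Python. It relies on the reference
manual's requirement that the linearization parameter dictionary (the one
containing the /Linearized key) lie entirely within the first 1024 bytes of
the file. The function names are my own, and this is only a heuristic: it
does not validate the hint tables, and a file that was modified after being
linearized may still carry the dictionary.

```python
def looks_linearized(header: bytes) -> bool:
    """Heuristic test on the leading bytes of a PDF file.

    Only the first 1024 bytes are examined, since that is where the
    linearization parameter dictionary is required to appear.
    """
    window = header[:1024]
    return window.startswith(b"%PDF-") and b"/Linearized" in window


def file_looks_linearized(path: str) -> bool:
    """Apply the same heuristic to a file on disk (hypothetical helper)."""
    with open(path, "rb") as f:
        return looks_linearized(f.read(1024))
```

Calling file_looks_linearized("manual.pdf") on a scan (the file name is just
an example) gives a quick yes/no before deciding whether to run it through a
linearizing tool.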