Linearizing PDF scans


jdbryan＠acm.org

13 Aug 2021

4:20 p.m.

Is it still useful to linearize PDFs? I've been scanning and PDFing manuals for 16 years, and I've been linearizing them regularly. My understanding is that this made them accessible on a page-by-page basis in Web browsers without requiring a complete file download first. But given the increase in typical bandwidth in 16 years, I wonder if this is still useful. It is an extra step, and it does make the files somewhat larger. Recommendations? Does linearizing confer any advantage locally once the entire file is downloaded? Thanks. -- Dave


alexandre.tabajara＠gmail.com

13 Aug

4:23 p.m.

Hey Dave,

I've been digitizing docs for free for years. Is there any kind of standard, recommendation, group, or mailing list for discussing the subject?

All of the PDFs at www.tabalabs.com.br/esquemateca were scanned by me. I'm always open to criticism and suggestions.

Thanks,
Alexandre, PU2SEX

---8<---Cut here---8<---
http://www.tabajara-labs.blogspot.com
http://www.tabalabs.com.br
---8<---Cut here---8<---

On Fri, 13 Aug 2021 at 17:20, J. David Bryan via cctech <cctech at classiccmp.org> wrote:

...

jdbryan＠acm.org

6:15 p.m.

On Friday, August 13, 2021 at 17:23, Alexandre Souza wrote:

...

Is there any kind of standard, recommendation, group, or mailing list for discussing the subject?

I am not aware of any. I started with Al Kossow's basic recommendations, modified slightly:

- scan at 600 dpi
- use TIFF G4 where feasible
- use tumble to convert to PDF

I then wrote and use a couple of simple image-processing utilities based on the Leptonica image library:

http://www.leptonica.org/

...to clean up the scans (the library makes the programs pretty trivial). They start with the raw scans and:

- mask the edges to remove hole punches, etc.
- size to exactly 8.5" x 11" (or larger, for fold-out pages)
- remove random noise dots (despeckle)
- rotate to straighten (deskew)
- descreen photos on pages into continuous-tone images
- quantize and solidify screened color areas into solid areas
- assign page numbers and bookmarks in the PDF

A good example PDF produced by these programs is:

http://www.bitsavers.org/pdf/hp/64000/software/64500-90912_Mar-1986.pdf

The cover is a "solidified" black/gray/white image, manual pages 1-2 and 1-4 are continuous-tone JPEG images overlaying bilevel text images, and the rest of the pages are masked, deskewed, bilevel text images. The PDF bookmarks and logical page numbers are auto-generated from the original scan filenames.

The final step is linearizing the PDFs, but I'm wondering whether this is still useful.

-- Dave
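As a rough illustration of the despeckle step listed above, here is a minimal pure-Python sketch. It is not the Leptonica-based code described in the message. The function name `despeckle` and the representation of a bilevel page as a list of rows of 0/1 integers are assumptions made for the example. It clears any black pixel that has no black pixel among its eight neighbors, which is only the simplest possible form of noise removal.

```python
def despeckle(image):
    """Return a copy of a bilevel image with isolated black pixels removed.

    `image` is a list of equal-length rows; 1 is black (ink), 0 is white.
    A black pixel is kept only if at least one of its 8 neighbors is black.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]  # leave the input untouched
    for y in range(height):
        for x in range(width):
            if not image[y][x]:
                continue
            has_neighbor = False
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width and image[ny][nx]:
                        has_neighbor = True
            if not has_neighbor:
                cleaned[y][x] = 0
    return cleaned
```

A real scan at 600 dpi would need a size threshold larger than one pixel, so treat this as a picture of the idea rather than a usable filter.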

tony.aiuto＠gmail.com

6:23 p.m.

On Fri, Aug 13, 2021, 6:15 PM J. David Bryan via cctech < cctech at classiccmp.org> wrote:

...

On Friday, August 13, 2021 at 17:23, Alexandre Souza wrote:

Is there any kind of standard, recommendation, group, or mailing list for discussing the subject?

It is of negative value. Any single container for a document makes it easier to handle than a bunch of discrete per-page files that must be managed as a unit. Bandwidth is cheaper than human labor. Don't optimize the wrong thing.

...

aek＠bitsavers.org

14 Aug

2:04 p.m.

On 8/13/21 3:15 PM, J. David Bryan via cctech wrote:

...

On Friday, August 13, 2021 at 17:23, Alexandre Souza wrote:

Is there any kind of standard, recommendation, group, or mailing list for discussing the subject?

Jay Jager is trying to deal with scanning manuals with colored text and backgrounds. Is your workflow for dealing with this around somewhere?

bfranchuk＠jetnet.ab.ca

2:22 p.m.

On 2021-08-14 12:04 p.m., Al Kossow via cctalk wrote:

...


I tend to keep my PDFs on a portable device, so PDFs need to be easy to use on those devices.

Ben.

jdbryan＠acm.org

6:54 p.m.

On Saturday, August 14, 2021 at 12:22, ben via cctech wrote:

...

I tend to keep my PDFs on a portable device, so PDFs need to be easy to use on those devices.

Linearizing keeps PDFs as complete files but rearranges (and expands) them internally so that individual pages can be rendered from a subrange of bytes sent by the server. Before linearization, the information necessary to render a given page is scattered around the PDF file, so typically the whole file must be downloaded before displaying the page. Once the whole file is present locally, there's no difference in access or display whether linearized or not. I guess the operative question is whether people tend to view PDF pages on a server or download files and view them locally. Thanks. -- Dave

cmhanson＠eschatologist.net

17 Aug

7:51 p.m.

On Aug 14, 2021, at 3:54 PM, J. David Bryan via cctalk <cctalk at classiccmp.org> wrote:

...

I guess the operative question is whether people tend to view PDF pages on a server or download files and view them locally.

Even when viewing files locally, linearization is important for not having to load too much of the file for basic operations like navigating and scrolling. PDF requires page independence so linearization isn't what puts the information for each page together; instead, linearization is what ensures the content in the PDF is in presentation order. You can jump around and scroll in a PDF by doing byte-range fetches even without linearization, but you may have to do more work (e.g. you can't load "the next few pages" for scrolling in just a single fetch, it may require one fetch per page). -- Chris
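The point about one fetch versus one fetch per page can be pictured with a small sketch. The byte offsets below are invented for illustration, and the model is deliberately crude: it treats each page as a single byte range and ignores shared resources such as fonts. The function name `count_fetches` is an assumption, not part of any PDF library.

```python
def count_fetches(page_ranges):
    """Count the byte-range requests needed to fetch the given pages.

    `page_ranges` is a list of (start, end) byte offsets, end exclusive.
    Ranges that touch or overlap are merged, since a single request
    can cover them.
    """
    merged = []
    for start, end in sorted(page_ranges):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return len(merged)


# Hypothetical layouts for the same three consecutive pages.
in_order = [(1000, 5000), (5000, 9000), (9000, 12000)]       # presentation order
scattered = [(1000, 5000), (40000, 44000), (20000, 23000)]   # arbitrary order

print(count_fetches(in_order))   # 1
print(count_fetches(scattered))  # 3
```

With the pages laid out in presentation order, "the next few pages" collapse into one contiguous range; with an arbitrary layout each page costs its own request.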

jdbryan＠acm.org

14 Aug

6:36 p.m.

On Saturday, August 14, 2021 at 11:04, Al Kossow via cctech wrote:

...

Jay Jager is trying to deal with scanning manuals with colored text and backgrounds. Is your workflow for dealing with this around somewhere?

I'm still using the same set of programs that I sent to you in April 2016 (as image-utilities.zip). Have Jay contact me directly, and I'll see what I can do to help.

Spot color (colored text being a specific case) is the most difficult issue to handle, in my experience, especially as the colors are usually screened from CMYK. Solidifying screened colored areas and quantizing into a small number of solid colors that can be rendered, e.g., as a four-color PNG page in a PDF, is not trivial, especially when text or line drawings touch or overlay the screened areas.

-- Dave
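For readers unfamiliar with the quantization half of that problem, the following is a minimal sketch of mapping pixels to a small fixed palette by nearest color. It is not the method used by the programs mentioned above, and it does nothing about the hard part described in the message: a screened (halftoned) area would first have to be smoothed or "solidified", otherwise the individual dots simply get mapped to different palette entries. The palette values and the function name `quantize_to_palette` are assumptions for the example.

```python
def quantize_to_palette(pixels, palette):
    """Map each RGB pixel to the index of the nearest palette color.

    `pixels` is a list of rows of (r, g, b) tuples. Distance is plain
    squared Euclidean distance in RGB space.
    """
    def nearest(pixel):
        return min(
            range(len(palette)),
            key=lambda i: sum((p - c) ** 2 for p, c in zip(pixel, palette[i])),
        )

    return [[nearest(px) for px in row] for row in pixels]


# Hypothetical four-color palette: white, black, red, blue.
palette = [(255, 255, 255), (0, 0, 0), (200, 30, 30), (30, 60, 200)]
row = [[(250, 248, 245), (20, 15, 10), (180, 50, 40), (60, 80, 190)]]
print(quantize_to_palette(row, palette))  # [[0, 1, 2, 3]]
```

The resulting index array is the kind of data a four-color indexed image would store.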

als＠thangorodrim.ch

7:20 p.m.

On Fri, Aug 13, 2021 at 06:15:21PM -0400, J. David Bryan via cctech wrote:

...

On Friday, August 13, 2021 at 17:23, Alexandre Souza wrote:

Is there any kind of standard, recommendation, group, or mailing list for discussing the subject?

I am not aware of any. I started with Al Kossow's basic recommendations, modified slightly:

- scan at 600 dpi
- use TIFF G4 where feasible
- use tumble to convert to PDF

My current toolchain for that:

- scans at 600 dpi grayscale
- compresses the raw scans with zip for archival and possible reruns (yes, I got bitten by overly aggressive compression optimization in djvu, ask me about it, *grr*)
- runs them through
    gm convert input.tiff -normalize -despeckle +dither -type bilevel output.tiff
- uses tiff_findskew and pnmrotate to deskew them
- compresses the tiff files with G4 for feeding into tesseract
- uses tesseract to create PDFs with overlaid OCRed text per page and separately dump the OCR text into a txt for later indexing
- bundles the per-page PDFs into a single PDF with pdfunite
- finally archives, as a single commit into a git repo:
  - the zip compressed raw scans
  - the OCR overlaid PDF
  - the OCRed txt

While tesseract isn't perfect, it does a pretty good job. Copy-pasting OCRed text from one of those PDFs opened in evince works remarkably well.

It is mostly used to avoid piling up mountains of paper from stuff like invoices and tax bills, as well as the occasional "I'm not sure if I'm ever going to look at this manual again, let's archive it just in case".

I probably should bundle the whole mess of scripts up and put it on github some day.
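Part of the toolchain above can be sketched as a small command builder. This is not the poster's script, which was not published in the thread. The file-naming scheme and the function name `build_commands` are assumptions, and the deskew and G4 compression steps are left out because the message does not give their exact invocations. The `gm convert` line is copied from the message; the `tesseract` and `pdfunite` calls use their standard argument order.

```python
def build_commands(page_stems, combined_pdf):
    """Return argv lists for cleaning, OCRing, and bundling scanned pages.

    `page_stems` are file names without extension, e.g. "page-001".
    Nothing is executed here; pass each list to subprocess.run() to run it.
    """
    commands = []
    for stem in page_stems:
        # Cleanup command as quoted in the message above.
        commands.append([
            "gm", "convert", f"{stem}.tiff",
            "-normalize", "-despeckle", "+dither", "-type", "bilevel",
            f"{stem}-clean.tiff",
        ])
        # Given both the "pdf" and "txt" output configs, tesseract writes
        # <stem>.pdf (image with text overlay) and <stem>.txt.
        commands.append(["tesseract", f"{stem}-clean.tiff", stem, "pdf", "txt"])
    # pdfunite takes the input PDFs followed by the output file.
    commands.append(
        ["pdfunite", *[f"{stem}.pdf" for stem in page_stems], combined_pdf]
    )
    return commands


for argv in build_commands(["page-001", "page-002"], "manual.pdf"):
    print(" ".join(argv))
```

Building the argument lists separately from running them makes it easy to inspect or log the exact commands before anything touches the scans.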

...

I then wrote and use a couple of simple image-processing utilities based on the Leptonica image library: http://www.leptonica.org/

Thanks for the pointer, I'm going to take a look - apparently tesseract uses leptonica for some image processing work.

...

...to clean up the scans (the library makes the programs pretty trivial). They start with the raw scans and:

- mask the edges to remove hole punches, etc.
- size to exactly 8.5" x 11" (or larger, for fold-out pages)
- remove random noise dots (despeckle)
- rotate to straighten (deskew)
- descreen photos on pages into continuous-tone images
- quantize and solidify screened color areas into solid areas
- assign page numbers and bookmarks in the PDF

A good example PDF produced by these programs is:

http://www.bitsavers.org/pdf/hp/64000/software/64500-90912_Mar-1986.pdf

That is a very nice and clean scan!

...

The cover is a "solidified" black/gray/white image, manual pages 1-2 and 1-4 are continuous-tone JPEG images overlaying bilevel text images, and the rest of the pages are masked, deskewed, bilevel text images. The PDF bookmarks and logical page numbers are auto-generated from the original scan filenames. The final step is linearizing the PDFs, but I'm wondering whether this is still useful.

What is that? I've never heard of linearizing PDFs before. So far I've been more concerned with eventually adjusting my pipeline to properly support PDF/A (the archival version), but I haven't gotten around to looking into it.

Kind regards,
Alex.
-- "Opportunity is missed by most people because it is dressed in overalls and looks like work." -- Thomas A. Edison

jdbryan＠acm.org

16 Aug

3:39 p.m.

On Sunday, August 15, 2021 at 1:20, Alexander Schreiber wrote:

...

My current toolchain for that:

Thanks; that was quite helpful. One aspect that I find of great assistance in navigating large PDF manuals is original page numbers. Often a manual will contain references, e.g., to "page 4-13" or "Appendix B-23". Having just a set of ascending integers for PDF page numbers and having to guess where Section 4 page 13 might be in that list is difficult, especially when PDF page 1 doesn't correspond to manual page 1-1 and the sections are very large. Being able to enter a referenced page number directly into a PDF reader's "go to page" dialog is very convenient.
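PDF supports this through page labels, where each run of pages is described by a starting page index, a prefix, and a starting number. The sketch below shows one way to derive those runs from a list of original page numbers. It assumes labels of the form "4-13" or "B-23" (any prefix followed by digits) and does not write anything into a PDF. The function name `page_label_ranges` is hypothetical and is not taken from the utilities discussed in this thread.

```python
import re


def page_label_ranges(labels):
    """Group page labels such as "4-13" into (index, prefix, start) ranges.

    Each range says: starting at 0-based PDF page `index`, pages are
    labelled `prefix` followed by consecutive numbers beginning at `start`.
    """
    ranges = []
    expected = None  # (prefix, number) that would continue the current range
    for index, label in enumerate(labels):
        match = re.fullmatch(r"(.*?)(\d+)", label)
        if match is None:
            raise ValueError(f"label has no trailing number: {label!r}")
        prefix, number = match.group(1), int(match.group(2))
        if (prefix, number) != expected:
            ranges.append((index, prefix, number))
        expected = (prefix, number + 1)
    return ranges


labels = ["1-1", "1-2", "1-3", "2-1", "2-2", "B-23", "B-24"]
print(page_label_ranges(labels))
# [(0, '1-', 1), (3, '2-', 1), (5, 'B-', 23)]
```

Each tuple corresponds to one entry of a PDF page-label number tree, so a manual with a dozen sections needs only about a dozen entries regardless of page count.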

...

http://www.leptonica.org/

Thanks for the pointer, I'm going to take a look - apparently tesseract uses leptonica for some image processing work.

You're welcome. Yes, tesseract is one of the major users of Leptonica. When I first started using the library about ten years ago, I found the documentation very reminiscent of those school mathematics textbooks that said, "The proof is left as an exercise for the reader." There were a couple of examples on the host site but no comprehensive index of the 2500+ library routines. The approach was, "read the source," which was fine if one was familiar with image processing terms, such as affine transformations, morphology, convolution, and octcube-based color quantization. It may be better now, but it was something of an intellectual challenge at the time.

...

What is that? Never heard of linearizing PDF before....

It's documented in the PDF Reference Manual from Adobe. Apparently, it's been around since PDF 1.2. The introduction to the chapter says:

  A linearized PDF file is one that has been organized in a special way to enable efficient incremental access in a network environment. The file is valid PDF in all respects, and it is compatible with all existing viewers and other PDF applications. Enhanced viewers can recognize that a PDF file has been linearized and can take advantage of that organization to enhance viewing performance.

...which, as others have mentioned, essentially is to allow page-at-a-time access via a browser without having to download the entire file first.

-- Dave
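One way a viewer or a script can recognize a linearized file is to look for the linearization parameter dictionary, which the PDF specification requires to sit within the first 1024 bytes and which contains the key /Linearized. The following is a heuristic sketch only: the function name `looks_linearized` is an assumption, and a file that was linearized and later modified by an incremental update would still pass this check even though it no longer behaves as a linearized file.

```python
def looks_linearized(data):
    """Return True if PDF bytes appear to begin with a linearization dictionary.

    Only the first 1024 bytes are examined, since that is where the
    specification requires the /Linearized dictionary to appear.
    """
    head = data[:1024]
    return head.startswith(b"%PDF-") and b"/Linearized" in head


# Usage with a file on disk (the path is hypothetical):
# with open("64500-90912_Mar-1986.pdf", "rb") as f:
#     print(looks_linearized(f.read(1024)))
```

Reading only the first kilobyte keeps the check cheap enough to run over a whole archive of scans.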

trash80＠internode.on.net

14 Aug

10:55 p.m.

Hi Dave,

I think it used to be called Byte Range Serving, i.e. it would only serve up the page requested, so URLs like somewebsite.com/myfile.pdf#page=4 would only send page 4 to the browser. I think this is what you are talking about.

Back in my earlier PDF and web days we were quite a**l about setting PDFs up for this because of much tighter bandwidth constraints, but on my limited understanding it required support from the web server to actually give effect to this.

I don't know if PDFs are optimised out of the box these days for this, but if you optimise a PDF for web delivery it should have the markers in it for byte range serving. While the markers may add a bit to the file size, which I suspect would be negligible, the action of optimising it for web delivery should reduce file size quite noticeably anyway. If you have an option for optimising for web delivery in your PDF tool, then try pumping out a PDF file both ways and compare the file sizes.

Is it useful these days? Probably not so much, in my view, because of better bandwidth (although directing a browser to open to a specific page can still be useful), but that is conditional on having well prepared PDF files. I'm not aware of any local benefit in having files prepared this way other than having good quality PDFs that are well presented and therefore easier to use.

Kevin Parker

jdbryan＠acm.org

15 Aug

1:29 a.m.

On Sunday, August 15, 2021 at 12:55, Kevin Parker wrote:

...

I think it used to be called Byte Range Serving, i.e. it would only serve up the page requested, so URLs like somewebsite.com/myfile.pdf#page=4 would only send page 4 to the browser. I think this is what you are talking about.

That's it exactly.

...

...but on my limited understanding it required support from the web server to actually give effect to this.

I believe that's right. At least all of the servers I used seemed to support this option.

...

I don't know if PDFs are optimised out of the box these days for this, but if you optimise a PDF for web delivery it should have the markers in it for byte range serving. While the markers may add a bit to the file size, which I suspect would be negligible, the action of optimising it for web delivery should reduce file size quite noticeably anyway.

I use Ghostscript to perform the linearization as a post-process of the tumble-produced PDFs. It seems to add about 5-10% in size to the dozen or so files I've produced in both formats. Assuming one only looks at a few pages, it would certainly reduce the amount of data served, though, of course, if one requested the entire file, it would actually be a slight disadvantage.
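The message does not say which Ghostscript options are used, so the following is only a sketch of one plausible invocation: the pdfwrite device with its `-dFastWebView=true` switch, which asks Ghostscript to emit a linearized file. The helper names and file names are assumptions. Note that pdfwrite rewrites the whole document rather than just reordering it, so the output size depends on more than linearization alone. The second function reproduces the size comparison described above.

```python
def gs_linearize_command(source_pdf, linearized_pdf):
    """Return an argv list asking Ghostscript's pdfwrite device for linearized output."""
    return [
        "gs",
        "-o", linearized_pdf,      # output file; also implies batch mode
        "-sDEVICE=pdfwrite",
        "-dFastWebView=true",      # request linearized ("fast web view") output
        source_pdf,
    ]


def size_overhead_percent(original_bytes, linearized_bytes):
    """Return how much larger the linearized file is, as a percentage."""
    return 100.0 * (linearized_bytes - original_bytes) / original_bytes


print(" ".join(gs_linearize_command("manual.pdf", "manual-linear.pdf")))
print(size_overhead_percent(1_000_000, 1_070_000))  # 7.0
```

Running the comparison over a batch of files is a quick way to see whether the 5-10% figure holds for a particular collection.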

...

Is it useful these days - probably not so much because of better bandwidth in my view (although directing a browser to open to a specific page can still be useful) but that is conditional on having well prepared PDF files.

It's an extra, albeit automated, step in my process, so it requires only limited effort on my part. But for a version or two, GS linearization was broken, so I wound up with a mix of linearized and non-linearized files, which had me wondering whether it was worth going back and linearizing the ones that weren't.

I could see it being most useful for something like IC databooks, where one might only want a one-time lookup of a couple of pages out of a several-hundred-page PDF. For something like a service manual, though, I'd anticipate that folks would want to download the whole manual rather than a page here and a page there.

As you say, it requires server support, and to be honest I've not checked recently to see if servers bother byte-serving anymore. Maybe pipes are too big to worry about it. Anyway, I was wondering if I was a dinosaur to keep linearizing these things if no one else was.

Thanks for your thoughts.

-- Dave

abuse＠cabal.org.uk

2:36 p.m.

On Sun, Aug 15, 2021 at 01:29:37AM -0400, J. David Bryan via cctalk wrote:

...

On Sunday, August 15, 2021 at 12:55, Kevin Parker wrote:

[...]

...

...but on my limited understanding it required support from the web server to actually give effect to this.

I believe that's right. At least all of the servers I used seemed to support this option.

The option in question is called "range requests" and is documented in the original HTTP/1.1 standard from way back in 1997. Any web server worth its salt should support it automatically when serving static files. It's used for resuming downloads, for example. [...]
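For reference, a range request is just an ordinary GET with a `Range` header, and a server that honors it answers with status 206 and a `Content-Range` header. The sketch below builds the request header and parses the response header; it makes no network calls, and the helper names are made up for the example.

```python
def range_header(first_byte, last_byte):
    """Return the request header asking for bytes first..last, inclusive."""
    return {"Range": f"bytes={first_byte}-{last_byte}"}


def parse_content_range(value):
    """Parse a Content-Range value such as 'bytes 0-1023/146515'.

    Returns (first, last, total); total is None when the server sends '*'.
    """
    unit, _, rest = value.partition(" ")
    if unit != "bytes":
        raise ValueError(f"unsupported range unit: {unit!r}")
    span, _, total = rest.partition("/")
    first, _, last = span.partition("-")
    return int(first), int(last), None if total == "*" else int(total)


print(range_header(0, 1023))                       # {'Range': 'bytes=0-1023'}
print(parse_content_range("bytes 0-1023/146515"))  # (0, 1023, 146515)
```

A server that ignores the header simply returns status 200 with the whole file, which is how a client can tell that range requests are not supported.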

...

Assuming one only looks at a few pages, it would certainly reduce the amount of data served, though, of course, if one requested the entire file, it would actually be a slight disadvantage.

A larger disadvantage is the pause to download the next page when leafing through a PDF, which can be quite distracting to people who can read without moving their lips. [...]

...

As you say, it requires server support, and to be honest I've not checked recently to see if servers bother byte-serving anymore. [...]

If a server lacks support for range requests, it is either very old or a small hobby project, and shouldn't be let anywhere near the public Internet.
