Scanning Suggestions (Bookmarks & Colour)

Antonio Carlini a.carlini at ntlworld.com
Fri Aug 27 15:50:14 CDT 2021


I have a few manuals to scan and I'm looking for suggestions about how
to add bookmarks and how to handle colour.

Bookmarks should be easier, so let's start with that. I want to add
bookmarks (or whatever they are called) so that it is easy to navigate 
to page "2-48" or "C-17" in a document. Many of the PDFs on bitsavers 
have that and I've found it very useful so I'd like to do that for my 
future scans. I've tried with pdftk (the Java port as the original is no 
longer available on my distro) but that failed. So I tried Ghostscript
and that also failed, while also rewriting the PDF to be considerably
larger. Is there a simple way to achieve this (ideally from the CLI)?
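
One possible approach to the bookmark question, offered as a sketch
rather than a tested recipe: pdftk's dump_data operation prints
bookmarks as BookmarkBegin / BookmarkTitle / BookmarkLevel /
BookmarkPageNumber records, and its update_info operation reads the same
format back in. The Python below only generates such a record file for
manual-style page labels like "2-48" or "C-17". The function names, the
section list and the starting page are illustrative assumptions, and
whether a particular pdftk build accepts the result has not been
verified here.

```python
def page_label_entries(sections, first_page=1):
    """Build (title, physical_page, level) tuples for manual-style labels.

    sections: list of (prefix, page_count), e.g. ("2", 48) covers 2-1..2-48.
    first_page: physical PDF page on which the first section starts
                (hypothetical; depends on how much front matter precedes it).
    """
    entries = []
    page = first_page
    for prefix, count in sections:
        # Level-1 bookmark for the section itself, pointing at its first page.
        entries.append((prefix, page, 1))
        for n in range(1, count + 1):
            # Level-2 bookmark for each labelled page, e.g. "2-48".
            entries.append((f"{prefix}-{n}", page, 2))
            page += 1
    return entries


def bookmark_records(entries):
    """Render entries in the record format printed by pdftk dump_data."""
    lines = []
    for title, page, level in entries:
        lines.append("BookmarkBegin")
        lines.append(f"BookmarkTitle: {title}")
        lines.append(f"BookmarkLevel: {level}")
        lines.append(f"BookmarkPageNumber: {page}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical manual: chapter 1 has 12 pages, chapter 2 has 48,
    # appendix C has 17, and chapter 1 starts on physical page 9.
    sections = [("1", 12), ("2", 48), ("C", 17)]
    print(bookmark_records(page_label_entries(sections, first_page=9)), end="")
```

With the output saved as bookmarks.txt (a hypothetical name), the
command to try would be something along the lines of
"pdftk in.pdf update_info bookmarks.txt output out.pdf".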


Now for the scanning itself.

For manuals that are simple monochrome, I plan to scan at 600dpi bilevel 
G4 encoded, wrapped in PDF.
For photographs or shaded areas that don't necessarily come out well 
under those settings, I plan to use 8-bit greyscale. I'd prefer to use 
600dpi but I may have to fall back to 300dpi if the per-page file size
shoots up too much.
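
To put rough numbers on the file-size worry, here is a back-of-envelope
calculation of uncompressed raster sizes. It assumes a US Letter page
(8.5 x 11 inches), which is only a guess about the manuals in question,
and it ignores compression entirely, so real G4 or compressed greyscale
files should be smaller, often by a large factor. The function name is
just for illustration.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page.

    width_in, height_in: page size in inches (assumed, not measured).
    dpi: scan resolution in dots per inch.
    bits_per_pixel: 1 for bilevel, 8 for greyscale, 24 for RGB colour.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    # Round up to whole bytes.
    return (total_bits + 7) // 8


if __name__ == "__main__":
    # Assumed page size: US Letter.
    for label, dpi, bpp in [
        ("600 dpi bilevel", 600, 1),
        ("600 dpi 8-bit grey", 600, 8),
        ("300 dpi 8-bit grey", 300, 8),
    ]:
        size = raw_page_bytes(8.5, 11, dpi, bpp)
        print(f"{label}: {size:,} bytes before compression")
```

Under those assumptions the 600 dpi greyscale raster is 33,660,000 bytes
against 8,415,000 bytes at 300 dpi (a factor of four), while the 600 dpi
bilevel raster is 4,207,500 bytes.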

The real issue is colour. I know that various people have looked at the 
issue of how to efficiently scan pages that are mostly black and white 
but have some coloured text (RSX-11 manuals and early VMS manuals did 
this to highlight terminal input, for example). I don't think this is a
solved problem and I'm not expecting a solution. What I'm really looking
to do is check that what I'm about to produce will have all the
information that a future efficient algorithm is likely to need.

I'm going to start by scanning the whole manual as though it had no 
colour (so 600 dpi bilevel G4 encoded, except for pages with photos and 
shading and so on). Then I'm going to go back and rescan the pages that
have colour at 600 dpi, saving each one as a JPEG. Then I'll
produce a final PDF with the colour pages inserted. I'll also produce a 
PDF with the B&W pages that were replaced by colour pages (I assume OCR 
will be better served by non-jaggy scans).

So the final outputs will be:
manual.pdf  - the whole manual, including whole pages scanned as colour 
if any colour is present on them
manual_BW.pdf  - the G4-encoded bilevel pages that were replaced by 
colour pages
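
The assembly step described above can be sketched as a simple page plan.
This is only an illustration of the bookkeeping, not of any particular
PDF tool: the function name and the "bw"/"colour" tags are made up, and
turning the plan into actual PDFs is left to whatever tool ends up being
used.

```python
def plan_outputs(total_pages, colour_pages):
    """Decide which scan feeds each page of the two output PDFs.

    total_pages: number of pages in the manual.
    colour_pages: physical page numbers (1-based) that were rescanned in colour.
    Returns (manual, manual_bw), each a list of (source, page) tuples where
    source is "bw" (bilevel G4 scan) or "colour" (JPEG rescan).
    """
    colour = set(colour_pages)
    out_of_range = sorted(p for p in colour if not 1 <= p <= total_pages)
    if out_of_range:
        raise ValueError(f"colour pages outside the manual: {out_of_range}")

    # manual.pdf: every page, taking the colour rescan where one exists.
    manual = [
        ("colour" if page in colour else "bw", page)
        for page in range(1, total_pages + 1)
    ]
    # manual_BW.pdf: only the bilevel pages that the colour rescans replaced.
    manual_bw = [("bw", page) for page in sorted(colour)]
    return manual, manual_bw


if __name__ == "__main__":
    # Hypothetical 6-page manual with colour on pages 2 and 5.
    manual, manual_bw = plan_outputs(6, [5, 2])
    print("manual.pdf   :", manual)
    print("manual_BW.pdf:", manual_bw)
```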

Thanks


Antonio


-- 

Antonio Carlini
antonio at acarlini.com


