Holger Veit wrote:
...
2. Scanning: You find a sample issue at
http://www.ais.fraunhofer.de/~veit/v2n7.pdf (2MB). This was scanned B&W
400dpi, stored as TIF and converted with Acrobat.
...
Regards
Holger
Holger, others have pointed out rightly that for low contrast pages, you have to
spend some time playing with thresholding to get it to digitize OK.
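For anyone who wants to experiment with thresholding outside the scanner driver, here is a minimal Python sketch. It assumes the page is already available as rows of 8-bit gray values (0 = black, 255 = white); the function name, the cutoff values, and the sample scanline are all made up for illustration and are not taken from any scanner software.

```python
def threshold_page(gray_rows, cutoff=128):
    """Convert 8-bit grayscale rows to 1 bpp: 1 = black ink, 0 = white paper."""
    return [[1 if pixel < cutoff else 0 for pixel in row] for row in gray_rows]


# Hypothetical low-contrast scanline: faded ink is around 140, paper around 200.
faded = [[200, 198, 140, 138, 142, 201, 199]]

# With the default cutoff the faded ink is lost entirely.
print(threshold_page(faded))
# Raising the cutoff (an assumed value, found by trial) recovers the stroke.
print(threshold_page(faded, cutoff=170))
```

On a low-contrast page the right cutoff sits somewhere between the ink and paper levels, which is why it usually takes a few tries.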
What I want to address here is the file size. If you assemble a PDF from a
collection of page images (JPEG, TIFF), Acrobat appears to simply put a wrapper
around them and that is that.
If you read in a PDF and use the File/Reduce File Size... menu, it often doesn't
have much effect (I think if you have high DPI images it may resample them,
which may not be what you want). I tried it on your file and there was barely
any reduction in size.
However, if you scan the files from within acrobat (Create PDF/From scanner...),
it applies a lot more intelligence to the task. It is also affected by a few of
the preferences you can set.
Something that has worked for me for reprocessing existing files is this
procedure (I've only used it for 1 bpp images). Read in the PDF. File/Save As
TIFF. This produces one TIFF page for each source page. Then use the Create
PDF/From Multiple Files... and read back in all the TIFF images. It will
recompress them. Now if you are using G4 compression, this will help only if
the page images were encoded with something worse (like LZW). The important
step is to set the preference for reading in TIFF files to allow JBIG2
compression -- this saves about 20% for typical pages, and can be dramatically
better for images with halftoning. The real savings come when you select JBIG2
(lossy). Yes, it does change the image in imperceptible ways. For some, this
is heresy, but I'd point out that you are scanning at 1 bpp, so why be a
stickler about what you get?
To make this concrete, your original document is 28 pages and is 2266 KB. After
my preprocessing step, it is 849 KB and looks every bit as good to my eyes. See
for yourself:
http://home.pacbell.net/frustum/v2n7-repacked.pdf
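As a quick check of the arithmetic, the figures quoted above (28 pages, 2266 KB before, 849 KB after) can be turned into per-page sizes and a percentage saving. This is only a sketch; the variable names are arbitrary.

```python
pages = 28
original_kb = 2266
repacked_kb = 849

# Fraction of the original size that was removed by repacking.
saving = 1 - repacked_kb / original_kb

print(f"original: {original_kb / pages:.1f} KB/page")
print(f"repacked: {repacked_kb / pages:.1f} KB/page")
print(f"saving:   {saving:.1%}")
```

That works out to roughly 81 KB per page before and 30 KB per page after, a reduction of a bit over 60%.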
Looking closely (like 800% magnification) your scan shows a LOT of dithering on
all characters. Perhaps this is the result of the low contrast source, but more
likely your scanner is doing dithering. I've seen this on the high speed office
scanner at my work -- the dithering is nice visually when making copies, but it
introduces a lot of edge deltas that G4 compression spends a lot of bits
encoding. My cheap home scanner doesn't dither so aggressively and produces
smaller scans.
Then zoom in on the repacked pdf that I made. The vertical edges of characters
have a lot less dithering. No doubt this leads to the smaller file size.
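To illustrate why ragged edges cost bits, here is a rough Python sketch. G4 codes each black/white change relative to the line above, so it is cheapest when the edges stay put from one scanline to the next. The code below does not implement G4; it only counts edges and how often they differ from the previous line, as a crude proxy. The two sample bitmaps are invented (1 = black), not taken from either PDF.

```python
def edges(row):
    """Return the columns where a 0/1 scanline changes color."""
    return [i for i in range(1, len(row)) if row[i] != row[i - 1]]


def edge_stats(rows):
    """Return (total edge count, number of rows whose edges differ from the row above)."""
    total = sum(len(edges(r)) for r in rows)
    moved = sum(1 for prev, cur in zip(rows, rows[1:]) if edges(prev) != edges(cur))
    return total, moved


# Hypothetical scanlines crossing the same vertical stroke of a character.
clean_rows = [
    [0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
]
dithered_rows = [
    [0, 0, 1, 0, 1, 1, 1, 0, 1, 0],
    [0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
    [0, 1, 0, 1, 1, 1, 0, 1, 0, 0],
]

print("clean:   ", edge_stats(clean_rows))
print("dithered:", edge_stats(dithered_rows))
```

The dithered stroke has more than twice as many edges, and they shift on every line, which is the situation a line-to-line coder handles worst.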
One problem that I'm wrestling with is this. If you scan via acrobat and use a
grayscale or color option, acrobat tries to identify regions of pages, perhaps
the whole page, that can be quantized down to 1 bpp B&W for the best
compression. Sometimes it works brilliantly, other times it decides sections of
a page are best encoded as jpeg images, resulting in barfalicious artifacts.