On Sat, Sep 13, 2003 at 09:50:22AM -0700, Eric Smith wrote:
"Antonio Carlini" <arcarlini(a)iee.org> wrote:
As long as you scan the stuff now while you have it, you can OCR at your
leisure when the technology improves (and requires far less
proof-reading).
Note that you should NEVER save scans of text and line art in a lossy
form such as JPEG. JPEG works for continuous-tone images such as
photographs by deliberately throwing away high-frequency components.
Text and line art contain sharp black-to-white transitions (and vice
versa, of course) which get smeared by this compression, resulting in
a blurry image.
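As a rough illustration of that smearing, here is a minimal Python sketch. It is a simplification, not the real JPEG pipeline: JPEG works on 8x8 blocks and quantizes the DCT coefficients, while this sketch takes a single 8-pixel scanline and simply zeroes the higher-frequency coefficients. The function names and the `keep` parameter are made up for the example.

```python
import math

def dct(block):
    """Orthonormal DCT-II of a list of samples."""
    n = len(block)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                for i, x in enumerate(block))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct(coeffs):
    """Inverse of dct() (DCT-III with the matching scaling)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        s = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            s += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(s)
    return out

def lowpass(block, keep):
    """Keep only the `keep` lowest-frequency coefficients, zero the rest."""
    coeffs = [c if k < keep else 0.0 for k, c in enumerate(dct(block))]
    return idct(coeffs)

# One 8-pixel scanline crossing a sharp black-to-white edge.
edge = [0, 0, 0, 0, 255, 255, 255, 255]

# Throwing away the high frequencies turns the clean step into a ramp
# with intermediate grey values (and some overshoot past 0 and 255).
smeared = lowpass(edge, keep=3)
print([round(v) for v in smeared])
```

With all eight coefficients kept the edge comes back unchanged. With only the lowest three, the pixels next to the edge are no longer pure black or white.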
I can only second that. I've cursed time and again at some fools who
decided to scan some paper documents (fine so far) and use JPEG (lossy
compression intended for continuous-tone stuff like photo images) on
black and white scans. The results are ugly, sometimes hard to read and
a bitch to print properly. Oh, and it makes OCRing them a _lot_
harder.
For text and line art, use a lossless bilevel compression such as G3 or
G4 fax format (used in some TIFF files), JBIG, JBIG2, or Flate (used in
some PNG files). You can't assume that saving as TIFF or PNG gets you
a specific form of compression, since they are very broad standards
that support multiple compression types.
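One way to check what a given TIFF file actually uses is to read its Compression tag (tag 259) straight from the first image directory. The following is a minimal Python sketch using only the standard library. It assumes a classic (non-BigTIFF) file, looks only at the first page, and the function name and the file name in the usage comment are made up for the example.

```python
import struct

# Common values of TIFF tag 259 (Compression).
COMPRESSION_NAMES = {
    1: "none",
    2: "CCITT modified Huffman RLE",
    3: "CCITT Group 3 fax (T.4)",
    4: "CCITT Group 4 fax (T.6)",
    5: "LZW",
    6: "JPEG (old style)",
    7: "JPEG",
    8: "Deflate (Adobe)",
    32773: "PackBits",
    32946: "Deflate (legacy code)",
}

def tiff_compression(data):
    """Return the Compression tag value of the first IFD in TIFF bytes."""
    if data[:2] == b"II":
        endian = "<"   # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"   # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    for n in range(count):
        start = ifd_offset + 2 + 12 * n   # each IFD entry is 12 bytes
        tag, ftype, fcount = struct.unpack(endian + "HHI",
                                           data[start:start + 8])
        if tag == 259:
            # Compression is a SHORT stored at the start of the value field.
            (value,) = struct.unpack(endian + "H",
                                     data[start + 8:start + 10])
            return value
    return 1  # The TIFF spec defaults to "no compression" if the tag is absent.

# Usage (hypothetical file name):
#   with open("page001.tif", "rb") as f:
#       code = tiff_compression(f.read())
#   print(COMPRESSION_NAMES.get(code, "unknown (%d)" % code))
```

A value of 3 or 4 means the file really is fax-compressed. A value of 6 or 7 means somebody put JPEG inside the TIFF wrapper.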
Sometimes people tell me that JPEG is alright if you only compress
slightly. The edges still get blurry, and the resulting file size
is generally *MUCH* larger than if you use G4 or JBIG.
Of course the files are bigger. The lossy algorithm for JPEG was
designed to work on continuous-tone images (where it works fine) and
just runs into the wall with black and white stuff. Where the algorithm
expects to find lots of low/middle frequency and some high frequency
data, it suddenly is faced with high frequency data alone. No smooth
color value curves that can be nicely compressed. Using JPEG for
compressing black and white is like using a Ferrari for pulling a
trailer full of grain - it gets the stuff moving, but you really, really
should use a proper truck for this job.
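To put a rough number on the frequency argument, the Python sketch below measures what share of the non-DC energy of an 8-pixel scanline ends up in the upper half of its DCT coefficients. It is only an illustration under simplifying assumptions: one dimension instead of JPEG's 8x8 blocks, no quantization or entropy coding, and sample rows invented for the example.

```python
import math

def dct(block):
    """Orthonormal DCT-II of a list of samples."""
    n = len(block)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                for i, x in enumerate(block))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def high_frequency_share(block):
    """Fraction of the AC energy that sits in the upper half of the spectrum.

    Assumes the block is not constant (otherwise there is no AC energy).
    """
    coeffs = dct(block)
    ac = sum(c * c for c in coeffs[1:])
    high = sum(c * c for c in coeffs[len(coeffs) // 2:])
    return high / ac

ramp = [0, 36, 73, 109, 146, 182, 219, 255]    # smooth gradient, photo-like
edge = [0, 0, 0, 0, 255, 255, 255, 255]        # one sharp transition
strokes = [0, 255, 0, 255, 0, 255, 0, 255]     # 1-pixel strokes, fine print

for name, row in (("ramp", ramp), ("edge", edge), ("strokes", strokes)):
    print("%-8s %.3f" % (name, high_frequency_share(row)))
```

The smooth ramp keeps almost nothing in the high coefficients, which is the case JPEG is built for. The edge and especially the fine strokes push a much larger share up there, where JPEG either spends a lot of bits or throws detail away.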
I've written a program to take B&W TIFF files and color or B&W JPEG
files and produce a PDF file:
http://tumble.brouhaha.com/
Thanks for writing this program. I'm in the process of archiving the
interesting articles from a stack of computer magazines and am currently
experimenting with the best way to convert dead trees to PDF files.
So far, scanning the paper as lineart at 600 dpi, saving as G4 fax
compressed TIFF and using tumble to combine those into PDF files yields
the best results (best quality, smallest files).
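For a sense of scale, here is a small Python calculation of how big one such 600 dpi lineart page is before any compression. The page size is an assumption (US Letter, 8.5 x 11 inches), since the actual magazine format isn't given here, and the function name is made up for the example.

```python
def raw_bilevel_bytes(width_in, height_in, dpi):
    """Uncompressed size of a 1-bit-per-pixel scan, rows padded to whole bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = (width_px + 7) // 8
    return bytes_per_row * height_px

# Assumed page size: US Letter at 600 dpi -> 5100 x 6600 pixels.
bilevel = raw_bilevel_bytes(8.5, 11, 600)
greyscale = 5100 * 6600   # same page at 8 bits per pixel, one byte each

print(bilevel)     # 4210800 bytes, about 4 MB
print(greyscale)   # 33660000 bytes, about 32 MB
```

So the lineart scan starts out at one eighth of the greyscale data before G4 even gets to work on it.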
Regards,
Alex.
--
"Opportunity is missed by most people because it is dressed in overalls and
looks like work." -- Thomas A. Edison