Please permit me to relate my experience with this, albeit not with
character documents.
I use an OCR program to process printed music into a format that my
transcription programs can understand (don't get me started on incompatible
file formats!). Now, this is typically a low-accuracy process--musical
notation is essentially handwritten notation. Prior to the advent of
computer transcription, music was published either as a photographic
reproduction of the handwritten score or from copper plates engraved by
hand. Engraving is a real art, and computer transcription programs rarely
approach the quality that a hand-engraved score can have, but they're
getting better and might actually rival real hand-engraved scores
someday.
So, you're fortunate if you can get 80% accuracy with an OCR engine.
Obviously, the manual cleanup process is very time-intensive. But it's all
we've got in the music world.
TIFF is the standard for all musical OCR. I've tried starting from what
appears to be a very clean JPEG and converting it to TIFF, and the results
have always been very disappointing. Perhaps it's a bug in the image
conversion software.
I get much better results with a B&W GIF converted to TIFF. Note that much
of musical notation consists of lines.
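
One possible explanation for the JPEG results, offered only as a guess and
not verified against any particular OCR engine: lossy compression tends to
leave gray halos around sharp edges, and when the image is later reduced to
black and white those halos can thicken or break thin lines. The sketch
below illustrates the idea in plain Python. The pixel values, the threshold
of 128, and the helper names binarize and line_thickness are all made up
for illustration; they do not come from any real scan or conversion tool.

```python
def binarize(pixels, threshold=128):
    """Map 8-bit grayscale rows to 1-bit rows: 0 = black ink, 1 = white paper."""
    return [[0 if value < threshold else 1 for value in row] for row in pixels]


def line_thickness(column):
    """Count the black (0) pixels in one vertical slice through a staff line."""
    return sum(1 for value in column if value == 0)


# Hypothetical vertical slice through one staff line, top to bottom.
# A clean bilevel scan contains only pure black (0) and pure white (255).
clean_slice = [255, 255, 0, 0, 255, 255]

# The same slice with invented gray values standing in for the halo that
# lossy compression can leave around a sharp edge.
lossy_slice = [250, 140, 30, 20, 110, 245]

clean_bits = binarize([clean_slice])[0]
lossy_bits = binarize([lossy_slice])[0]

print(clean_bits, line_thickness(clean_bits))
print(lossy_bits, line_thickness(lossy_bits))
```

In this toy case the staff line comes out one pixel thicker after
thresholding the lossy slice. A source that is already black and white,
such as a B&W GIF, has no gray values for the threshold to misjudge.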
The Library of Congress publishes its online music libraries using TIFF and
it works very well. JPEG is used for photographic images.
Cheers,
Chuck