On Mon, 2009-12-14 14:15:30 +0000, Philip Pemberton <classiccmp at philpem.me.uk>
wrote:
[...]
> Also, does anyone know of an app that can take the PDF file, OCR it
> and then insert the text as a background layer while leaving the
> image alone? I'm pretty sure Acrobat can do this, but like most
> Adobe software, the price tag is somewhat... eye-watering. "If you
> have to ask how much it costs, you can't afford it."

Here's what I once used to process scanned book pages. My A4 scanner
always captured two A5 pages on each A4 scan and wrote multi-page
TIFFs. Due to memory constraints, each TIFF only holds about 12 to 16
of those A5 pages.
The attached script first explodes the multi-page TIFFs into
single-page files, then cuts each of those into two pages, fixes up
the DPI values, OCRs the single pages and finally puts everything back
together, forming a PDF file containing the scanned pages.
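
The cutting and DPI steps come down to a little arithmetic. Here is a
minimal sketch in Python. It is not the attached script: the function
names are made up for illustration, and it assumes the two A5 pages sit
side by side on a landscape A4 scan.

```python
# Hypothetical helpers for illustration; not taken from the attached script.
A5_MM = (148.0, 210.0)   # ISO 216 A5 size: width, height in millimetres
MM_PER_INCH = 25.4


def split_boxes(width, height):
    """Return two (left, top, right, bottom) crop boxes, one per half of a
    scan that holds two pages side by side (assumed layout)."""
    middle = width // 2
    return (0, 0, middle, height), (middle, 0, width, height)


def dpi_for(pixels, millimetres):
    """Resolution implied by a pixel extent that covers a known length."""
    return pixels / (millimetres / MM_PER_INCH)


if __name__ == "__main__":
    # An A4 landscape scan at 300 DPI is roughly 3508 x 2480 pixels.
    left, right = split_boxes(3508, 2480)
    print("left page box: ", left)
    print("right page box:", right)
    # DPI to store in each half so that it comes out as an A5 page.
    print("horizontal DPI:", round(dpi_for(left[2] - left[0], A5_MM[0]), 1))
    print("vertical DPI:  ", round(dpi_for(left[3] - left[1], A5_MM[1]), 1))
```

The boxes use the usual (left, top, right, bottom) pixel convention, so
they could be handed to whatever imaging tool does the actual cropping.
The computed DPI is what would have to be written into each half for it
to be treated as an A5 page.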
With some more polish:
* automatically detect bw/greyscale/colour
* configurable source page size
* configurable destination page size
* configurable cutting when two pages are scanned together into one
TIFF page
this could be real fun. Maybe I'll continue to work on it once my
move to a new flat is done. If anybody works on it and improves it
in any way, I'd like to see the patches!
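
For the first polish item, detecting bw/greyscale/colour automatically,
one possible approach is to look at the pixel values themselves. A rough
sketch in Python, assuming the page is already available as (r, g, b)
tuples with 8-bit channels; the function name is hypothetical, and a
real scan is noisy enough that exact comparisons like these would need
to be replaced by thresholds:

```python
def detect_mode(pixels):
    """Classify an iterable of 8-bit (r, g, b) tuples as 'bw', 'grey'
    or 'colour'. Exact comparisons only; no noise tolerance."""
    mode = "bw"
    for r, g, b in pixels:
        if not (r == g == b):
            # Any pixel with differing channels makes the page colour.
            return "colour"
        if r not in (0, 255):
            # Neutral but neither pure black nor pure white.
            mode = "grey"
    return mode


if __name__ == "__main__":
    print(detect_mode([(0, 0, 0), (255, 255, 255)]))
    print(detect_mode([(0, 0, 0), (128, 128, 128)]))
    print(detect_mode([(200, 30, 30), (0, 0, 0)]))
```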
The script may be used under the terms of the GNU General Public
License Version 3.
MfG, JBG
--
Jan-Benedict Glaw jbglaw at lug-owl.de +49-172-7608481
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan