On Sun, 2009-10-04 21:20:46 +0200, Jan-Benedict Glaw <jbglaw at lug-owl.de> wrote:
On Wed, 2009-09-30 13:12:25 -0700, Al Kossow <aek at bitsavers.org> wrote:
Jan-Benedict Glaw wrote:
They surely could OCR scanned PDFs, but I'm not sure if they
will do that.
I can assure you that they HAVE been doing that on bitsavers
content for over a year. That was one of the reasons I decided
to start converting the content. People were contacting me wondering
why the OCRed version wasn't on line.
Maybe the current workflow for creating the PDFs on bitsavers.org
could be a bit better documented? The docs over there only mention the
simple conversion via tumble. Its latest NEWS update is dated
20031209, so it is quite aged, too. I guess that parts of the workflow
are different these days?
To put it up for public discussion: my li'l script (used to cut
two-page scans out of multi-page TIFFs and make a nice PDF book of them)
basically uses `tiffsplit' to create one-page TIFFs, `convert' to
convert them to .pbm format, `unpaper' to cut out (and straighten) the
two book pages per TIFF/PBM page, then `ocroscript rec-tess
--tesslanguage=en ...' to OCR each single page, and finally
`HocrConverter.py' [1] to assemble the single-page straightened TIFF
pages and the hOCR scan results into a PDF.
All parts of that software stack are in Debian, except the hOCR stuff
(for scanning the pages and generating a file that also contains the
position of the scanned text on the page).
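
For discussion, here is a rough sketch of how the first stages of such a
script might be planned. It only builds the command lines and does not
run anything. Everything beyond the tool names is an assumption: the
file names, the "-%d" output pattern, and the unpaper options
`--layout double' and `--output-pages 2' are illustrative guesses, not
taken from the actual script. The ocroscript arguments after
`--tesslanguage=en' are elided above, so the sketch leaves them to the
caller, and the HocrConverter.py step is left out because its invocation
is not given here.

```python
from pathlib import Path


def split_command(scan: Path, prefix: str) -> list[str]:
    # tiffsplit writes one TIFF per page: <prefix>aaa.tif, <prefix>aab.tif, ...
    return ["tiffsplit", str(scan), prefix]


def spread_commands(spread: Path) -> list[list[str]]:
    """Commands turning one two-page spread TIFF into two straightened PBMs."""
    pbm = spread.with_suffix(".pbm")
    # Hypothetical output pattern; unpaper fills in the page number for %d.
    pages = spread.with_name(spread.stem + "-%d.pbm")
    return [
        # ImageMagick: convert the one-page TIFF to PBM.
        ["convert", str(spread), str(pbm)],
        # Assumed options: input is a double page, write two output pages.
        ["unpaper", "--layout", "double", "--output-pages", "2",
         str(pbm), str(pages)],
    ]


def ocr_command(extra_args: list[str]) -> list[str]:
    # Only this prefix appears in the message; the rest is elided there,
    # so the caller has to supply the remaining arguments.
    return ["ocroscript", "rec-tess", "--tesslanguage=en", *extra_args]


if __name__ == "__main__":
    print(" ".join(split_command(Path("book.tif"), "page-")))
    for cmd in spread_commands(Path("page-aaa.tif")):
        print(" ".join(cmd))
```

Building argument lists rather than shell strings keeps file names with
spaces intact if the lists are later handed to subprocess.run().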
MfG, JBG
[1] http://xplus3.net/downloads/HocrConverter.gz, linked from
    http://xplus3.net/2009/04/02/convert-hocr-to-pdf/
--
Jan-Benedict Glaw jbglaw at lug-owl.de +49-172-7608481
Signature of: 23:53 <@jbglaw> Right, I'm going to climb into bed now.
the second : 23:57 <@jever2> .oO( climb ..., does he still have bars around his bed,
like my kids used to?)
00:00 <@jbglaw> jever2: *slap*
00:01 <@jever2> *ouch*, what for, thoughts are free!
00:02 <@jbglaw> Nope, free thoughts have been sold out since 1984!
00:03 <@jever2> 1984? I've only been married since 1985!