On Apr 24, 2015, at 10:59 AM, Johnny Billquist <bqt
at update.uu.se> wrote:
...
Speaking of which, anyone have a good suggestion for OCR nowadays? I would really like to
throw all the current PDF scans of manuals that sit out there at OCR, to get the
documents back to sane sizes, and also make it possible to update the documents over time
when needed?
Acrobat had (has?) an OCR feature which I've used at times. It's amazingly primitive and
works to a "just barely acceptable" level.
I once found an open source OCR program, tried it, concluded it wasn't worth the trouble
of downloading it and forgot the name.
At this point, I use Abbyy FineReader, which is a commercial product (not really cheap,
but not outrageously expensive either). Windows based, unfortunately. Among other
things, it has a "training" mode where you can teach it what the letters in your source
material look like. If you're dealing with stuff that's at all marginal (like line
printer listings or typewriter material, never mind dot matrix), spending an hour or two
in training mode makes an incredible difference. I've been using this OCR to read CDC
6600 wire lists, which are a challenge (low quality typewriter text). I also tried it the
other day on old lineprinter listings of the THE operating system; those were too far gone
to be useable, partly because they are upper case only printouts with . overprinting on
letters that are upper case in the original mixed case source files. Getting OCR to tell
an O from an O with . overprint, reliably, just wasn't doable.
So OCR will work within reason, but there is still material that can only be handled by
human eyes.
paul