DEC scanned documents for Bitsavers (message for Al Kossow)

Paul Koning paulkoning at comcast.net
Fri Apr 24 10:09:02 CDT 2015


> On Apr 24, 2015, at 10:59 AM, Johnny Billquist <bqt at update.uu.se> wrote:
> ...
> Speaking of which, anyone have a good suggestion for OCR nowadays? I would really like to throw all the current PDF scans of manuals that sit out there at OCR, to get the documents back to sane sizes, and also make it possible to update the documents over time when needed…

Acrobat had (has?) an OCR feature which I’ve used at times.  It’s amazingly primitive and works to a “just barely acceptable” level.

I once found an open source OCR program, tried it, concluded it wasn’t worth the trouble of downloading, and forgot the name.

At this point, I use ABBYY FineReader, which is a commercial product (not really cheap, but not outrageously expensive either).  Windows-based, unfortunately.  Among other things, it has a “training” mode where you can teach it what the letters in your source material look like.  If you’re dealing with stuff that’s at all marginal (like line printer listings or typewriter material, never mind dot matrix), spending an hour or two in training mode makes an incredible difference.  I’ve been using this OCR to read CDC 6600 wire lists, which are a challenge (low quality typewriter text).  I also tried it the other day on old line printer listings of the THE operating system; those were too far gone to be usable, partly because they are upper-case-only printouts with . overprinting on letters that are upper case in the original mixed-case source files.  Getting OCR to tell an O from an O with . overprint, reliably, just wasn’t doable.

So OCR will work within reason, but there is still material that can only be handled by human eyes.

	paul



