Al Kossow wrote:
HP developed an OCR engine called Tesseract that is supposed to be
pretty good. They released it to the open-source world, and Google has
picked it up and started working on it.
classiccmp list member James Markevitch has been working on an OCR program
as well, optimized for column-formatted input, like listings.
Cross-platform, or one specific OS?
I started putting some stuff together to allow a user to graphically describe
a scanned page (so you'd roughly mark out what were images, what were columns
of text etc.) prior to feeding to an OCR engine, as experience of commercial
products has been that they tend to get it wrong too much to be left to run
without user input. Unfortunately the Linux OCR engines available proved to be
just too poor in quality to make it worthwhile, so I shelved it until
something better came along - maybe Tesseract will do the job.
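
A minimal sketch of that kind of page description, purely as an illustration:
the Region class, the "text"/"image" labels and the ocr_regions helper are all
hypothetical names (not taken from the shelved tool), and nothing here calls a
real OCR engine. It assumes the user has already marked out rectangles in pixel
coordinates and that regions in the same column share a left edge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One user-marked rectangle on a scanned page (pixel coordinates)."""
    kind: str    # hypothetical labels: "text" or "image"
    left: int
    top: int
    right: int
    bottom: int


def ocr_regions(regions):
    """Return only the text regions, ordered for reading.

    Columns are taken left to right; within a column (same left edge),
    regions are taken top to bottom. Image regions are skipped so they
    would never be handed to the OCR engine.
    """
    text = [r for r in regions if r.kind == "text"]
    return sorted(text, key=lambda r: (r.left, r.top))


# Example page: one picture across the top, two text columns below it.
page = [
    Region("image", 50, 40, 550, 300),
    Region("text", 320, 320, 550, 800),   # right-hand column
    Region("text", 50, 320, 280, 800),    # left-hand column
]

for r in ocr_regions(page):
    print((r.left, r.top, r.right, r.bottom))
```

Each rectangle returned would then be cropped out and passed to whatever OCR
engine is in use, one region at a time.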
I was just talking to Doron Swade (the person responsible for the Difference
Engine at the British Science Museum) and he is interested in OCR of
mathematical tables (also column-oriented like listings).
I've never actually met Doron, although his name tends to crop up an awful
lot. I think he's possibly up at our museum next Friday, but I'll be on a
plane at that point...
cheers
Jules