On Sep 1, 2006, at 4:05 PM, Jules Richardson wrote:
HP developed an OCR engine called Tesseract that is supposed to be
pretty good. They released it to the open-source world, and Google has
picked it up and started working on it.
classiccmp list member James Markevitch has been working on an OCR
program as well, optimized for column-formatted input, like listings.
Cross-platform, or one specific OS?
At first glance, it appears to be Linux-specific, but that's
generally pretty easy to un-do. The important part is it's not Windoze
software.
I started putting some stuff together to allow a user to graphically
describe a scanned page (so you'd roughly mark out what were images,
what were columns of text etc.) prior to feeding to an OCR engine, as
experience of commercial products has been that they tend to get it
wrong too much to be left to run without user input. Unfortunately the
Linux OCR engines available proved to be just too poor in quality to
make it worthwhile, so I shelved it until something better came along
- maybe Tesseract will do the job.
It's possible...might be worth looking into.
-Dave
--
Dave McGuire
Cape Coral, FL