On Fri, 24 Jun 2011, Dan Gahlinger wrote:
I have a few rather thick text printouts from the mid-1970's on 132
column paper, standard fanfold stuff printed out from DEC teletypes and
line printers. I'm wondering what the best way to scan this in would be,
to get actual text output that's readable and usable? In most cases
there is no way even the best OCR could tell the difference between an
"L", "l", "1" or "I", and "O" or "0" is just as bad. Hand-typing over 6"
thick printout is not my idea of fun. Any bright ideas? There's one in
particular I want to scan in and get documented, as there's an old-wives'
tale about the code I want to verify if it's true (it's an original
1970's printout of Zork in Fortran that is supposedly "auto-correcting"
after a fashion), not that I buy it...
It's an interesting task. The good news is that a LOT of disambiguation
can be done by context: a letter between numerals, or a numeral between
letters, is less probable than a character matching its adjacent type.
If they are FORTRAN listings, then a lot of repairs are available
algorithmically. For example, what characters can be in columns 1-5?
If the previous card ^H^H^H^H line does not have a 'C' in column 1 nor any
character in column 6, then what characters can be in column 7?
Who cares what characters are in 73-80?
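Those column rules can be coded up directly. Here's a rough Python sketch
of the idea: the fixed-form layout (labels in columns 1-5, continuation
mark in column 6, statement in 7-72, sequence numbers in 73-80) is
standard FORTRAN, but the confusion table and the `repair_line` name are
just illustrative:

```python
# Sketch of column-based repair for a scanned fixed-form FORTRAN line.
# Columns 1-5 may only hold a numeric statement label (or blanks), so any
# look-alike letter there must really be a digit. Columns 73-80 are
# sequence numbers and can simply be discarded.

# A few common OCR confusions, mapped toward digits (illustrative only).
TO_DIGIT = {"O": "0", "l": "1", "I": "1", "S": "5", "B": "8"}

def repair_line(line: str) -> str:
    """Apply positional knowledge to one 80-column card image."""
    line = line.ljust(80)[:80]
    # A 'C' (or '*') in column 1 marks a comment card: leave it alone.
    if line[0] in "C*c":
        return line.rstrip()
    chars = list(line)
    # Columns 1-5: statement label, digits only.
    for i in range(5):
        if chars[i] != " ":
            chars[i] = TO_DIGIT.get(chars[i], chars[i])
    # Drop columns 73-80 entirely ("who cares what's in 73-80?").
    return "".join(chars[:72]).rstrip()

print(repair_line("l00   GO TO l00"))   # -> '100   GO TO l00'
```

Note that the label gets fixed but the `l00` in the statement field does
not; inside columns 7-72 you need the statistical context tricks instead.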
The ideal OCR software would have a probabilistic ranking, and would
start by querying the operator (with a graphics image of the page!)
about those ambiguous characters with the lowest certainty.
A heuristic enhancement would then increase or decrease probability
rankings for subsequent identical confusions based on what the operator's
response had been.
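That query loop is simple to sketch. In this hypothetical Python version
(the data shapes and names are my own, not anything the real packages
use), each ambiguous glyph carries the engine's confidence in its top
guess; the operator is asked about the least certain ones first, and any
later glyph with the same shape inherits the answer without another
question:

```python
# Least-certain-first operator querying, with the "heuristic
# enhancement": once the operator resolves a shape, identical shapes
# are resolved automatically from then on.
import heapq

def resolve(ambiguous, ask):
    """ambiguous: list of (probability, glyph_id, shape_key, guess).
    ask(glyph_id, guess): shows the operator the page image for that
    glyph and returns the true character."""
    learned = {}                      # shape_key -> confirmed character
    heap = list(ambiguous)
    heapq.heapify(heap)               # lowest probability pops first
    result = {}
    while heap:
        prob, gid, key, guess = heapq.heappop(heap)
        if key in learned:            # seen this confusion before
            result[gid] = learned[key]
        else:
            truth = ask(gid, guess)   # one human answer per new shape
            learned[key] = truth
            result[gid] = truth
    return result
```

With three glyphs sharing one shape, the operator is asked exactly once.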
Through use of the heuristic enhancement capability, the OCR software
could start with a reasonable font, but could even be started with NO
prior font knowledge! Hire the neighbor kid to type whatever shows up in
the graphics image on the screen; soon many characters would be matched
successfully; eventually, characters requiring operator intervention would
be extremely rare.
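The no-prior-font bootstrap amounts to clustering glyph bitmaps and
labeling each cluster once. A minimal sketch, assuming a raw pixel
Hamming distance with a fixed threshold as the similarity test (a real
system would use something more robust):

```python
# Font-free bootstrap OCR: group glyph bitmaps by similarity, have the
# operator (the neighbor kid) label each new shape once, and let every
# later match inherit that label.

def hamming(a, b):
    """Pixel-wise distance between two equal-size bitmaps (tuples of 0/1)."""
    return sum(x != y for x, y in zip(a, b))

class BootstrapOCR:
    def __init__(self, ask, threshold=2):
        self.templates = []        # (bitmap, label) pairs learned so far
        self.ask = ask             # operator callback for unknown shapes
        self.threshold = threshold # max pixel distance for a match

    def classify(self, bitmap):
        for tmpl, label in self.templates:
            if hamming(bitmap, tmpl) <= self.threshold:
                return label       # seen before: no human needed
        label = self.ask(bitmap)   # new shape: one question, then learned
        self.templates.append((bitmap, label))
        return label
```

Each distinct shape costs one keystroke of human effort; after enough
pages, almost everything matches an existing template.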
Probabilistic ranking can do quite a bit if set up properly. For example,
what characters would be most likely after a 'Q'? ('U', period, comma,
or space) What are the most likely characters following a space?
(Hint: AFTER A SPACE, it is NOT ETAOINSHRLDU)
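Those rankings fall straight out of bigram counts over any sample text in
the target language. A quick Python sketch (the training string here is a
toy stand-in; a real run would feed in a large corpus, or the program
text itself as it gets corrected):

```python
# Build a table of which characters follow which, then rank the likely
# successors of a given character. After 'Q', 'U' dominates; after a
# space, the ranking is word-INITIAL letter frequency, which is quite
# different from the overall ETAOIN SHRDLU order.
from collections import Counter, defaultdict

def bigram_table(text):
    follows = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        follows[prev][cur] += 1
    return follows

def most_likely_after(follows, ch, n=3):
    return [c for c, _ in follows[ch].most_common(n)]

table = bigram_table("THE QUICK QUIET QUEEN ASKED THE QUESTION")
print(most_likely_after(table, "Q"))   # -> ['U']
```

An ambiguous "O"/"0" or "l"/"1" glyph can then be settled by whichever
reading the table scores higher in its context.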
After sufficient use, the only times operator intervention would be
needed would be for damaged characters, ligatures, etc.
These algorithms have been implemented on an experimental basis.
Is there any commercial software that does an adequate job?
--
Grumpy Ol' Fred          cisin at xenosoft.com