Overdue context sensitivity in OCR (Was: Best way to scan 132 column fanfold mid-1970s text printout

24 Jun 2011

But what of "COMMON" statements in Fortran?
Is it COMMON "III" or "LLL" or "L1L" or "LIL" or
"ILL" or... well, you see what I mean.
And stuff like if (A = O): was that the variable "O" or the number 0? Yes, I know
that's not a proper Fortran IF, but you get the idea.
Even typing by hand, reading by eye, can be really hard, as some others would attest.
Check the SourceForge project TREK7 for some ideas.
PS - if ANYONE can help me finish trek7 by writing up terminal assignments in Fortran for
Linux, please let me know. I want to get the multiplayer going, but you can't really do
assign(file/read '/dev/pts/7') or something to read/write another user's
STDIN/STDOUT in Linux... or can you?
I forgot more Fortran than I ever learned... I can't code it any more :(
Dan.
...
  Date: Fri, 24 Jun 2011 13:12:41 -0700
 From: cisin at xenosoft.com
 To: cctalk at classiccmp.org
 Subject: Overdue context sensitivity in OCR (Was: Best way to scan 132 column fanfold mid-1970s text printout

 On Fri, 24 Jun 2011, Dan Gahlinger wrote:
  I have a few rather thick text printouts from the mid-1970s on 132
  column paper, standard fanfold stuff printed out from DEC teletypes and
  line printers. I'm wondering what the best way to scan this in would be,
  to get actual text output that's readable and usable? In most cases
  there is no way even the best OCR could tell the difference between an
  "L", "l", "1" or "I", and "O" or "0" is just as bad. Hand-typing over 6"
  thick printout is not my idea of fun. Any bright ideas? There's one in
  particular I want to scan in and get documented, as there's an old wives'
  tale about the code that I want to verify (it's an original
  1970s printout of Zork in Fortran that is supposedly "auto-correcting"
  after a fashion). Not that I buy it...

 It's an interesting task. The good news is that a LOT of disambiguation
 can be done by context.  For example, a letter between numerals, or a
 numeral between letters, is less probable than a character that matches
 the type of its neighbors.
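
As a rough illustration of the neighbor rule, here is a minimal Python sketch. The two confusion tables and the function name `fix_by_neighbors` are assumptions made up for this example, not part of any OCR package, and the rule only fires when both neighbors agree with each other.

```python
# Hypothetical confusion tables: look-alike glyphs and their counterparts.
TO_DIGIT = {"O": "0", "I": "1", "l": "1"}
TO_LETTER = {"0": "O", "1": "I"}


def fix_by_neighbors(token: str) -> str:
    """Make an ambiguous interior character agree with both of its neighbors."""
    chars = list(token)
    for i in range(1, len(token) - 1):
        left, right, c = token[i - 1], token[i + 1], token[i]
        if left.isdigit() and right.isdigit() and c in TO_DIGIT:
            chars[i] = TO_DIGIT[c]      # letter between numerals
        elif left.isalpha() and right.isalpha() and c in TO_LETTER:
            chars[i] = TO_LETTER[c]     # numeral between letters
    return "".join(chars)


print(fix_by_neighbors("1O5"))
print(fix_by_neighbors("C0MMON"))
```

Characters at the start or end of a token, and runs of two or more ambiguous characters, are left alone here; those cases need more context than one neighbor on each side.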

 If it is FORTRAN listings, then there are a lot of algorithmically
 available repairs.  For example, what characters can be in columns 1 - 5?
 If the previous card ^H^H^H^H line does not have a 'C' in column 1 nor any
 character in column 6, then what characters can be in column 7?
 Who cares what characters are in 73-80?
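
A small Python sketch of two such repairs, assuming the standard fixed-form layout (comment marker in column 1, statement label in columns 1-5, continuation mark in column 6, statement text in columns 7-72, sequence field in 73-80). The function name `repair_fixed_form` and the translation table are hypothetical, chosen only for illustration.

```python
# Hypothetical table: letters that can only be misread digits in a label field.
LABEL_FIX = str.maketrans({"O": "0", "I": "1", "l": "1"})


def repair_fixed_form(line: str) -> str:
    """Apply column-based repairs to one line of a fixed-form FORTRAN listing."""
    line = line.rstrip("\n")[:72]       # drop columns 73-80 (sequence field)
    if not line or line[0] in "Cc*":    # comment line: leave the text alone
        return line
    label = line[:5].translate(LABEL_FIX)   # columns 1-5 hold digits or blanks
    return label + line[5:]


print(repair_fixed_form("  1O0 CONTINUE"))
```

The sketch trusts the OCR reading of column 1; a real pass would also have to decide whether a "C" there is itself a misread character.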

 The ideal OCR software would rank its guesses by probability, and start
 by querying the operator (with a graphics image of the page!) about the
 ambiguous characters it is least certain of.

 A heuristic enhancement would then increase or decrease probability
 rankings for subsequent identical confusions based on what the operator's
 response had been.
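
One way the ranking and the feedback loop might fit together, as a Python sketch. Everything here is an assumption for illustration: the class `ConfusionMemory`, the function `review_order`, the blending formula, and the smoothing constant `k` are not taken from any existing OCR program.

```python
from collections import Counter, defaultdict


def review_order(cells):
    """cells: (position, guess, confidence) tuples; least certain come first."""
    return sorted(cells, key=lambda cell: cell[2])


class ConfusionMemory:
    """Remembers what the operator typed each time the OCR guessed a glyph."""

    def __init__(self, k: float = 4.0):
        self.k = k                          # hypothetical smoothing constant
        self.seen = defaultdict(Counter)    # guess -> Counter of operator answers

    def record(self, guess: str, answer: str) -> None:
        self.seen[guess][answer] += 1

    def adjust(self, guess: str, confidence: float) -> float:
        """Blend the raw confidence with how often this guess was confirmed."""
        answers = self.seen[guess]
        total = sum(answers.values())
        if total == 0:
            return confidence
        agree = answers[guess] / total
        weight = total / (total + self.k)   # more history, more influence
        return (1 - weight) * confidence + weight * agree


memory = ConfusionMemory()
for answer in ["0", "0", "0", "O"]:
    memory.record("O", answer)              # operator usually corrected O to 0
print(memory.adjust("O", 0.8))
```

After a few corrections of the same confusion, later occurrences of that guess sink toward the front of the review queue, which is the behavior described above.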

 Through use of the heuristic enhancement capability, the OCR software
 could start with a reasonable font, but could even be started with NO
 prior font knowledge!  Hire the neighbor kid to type whatever shows up in
 the graphics image on the screen; soon many characters would be matched
 successfully; eventually, characters requiring operator intervention would
 be extremely rare.

 Probabilistic ranking can do quite a bit if set up properly.  For example,
 what characters would be most likely after a 'Q'?  ('U', period, comma, or
 space)  What are the most likely characters following a space? (Hint:
 AFTER A SPACE, it is NOT ETAOINSHRLDU)
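
A minimal Python sketch of that kind of ranking, using counts of which character follows which in some sample text. The function names `train_bigrams` and `pick` and the tiny training string are made up for this example; a real system would train on a large body of text of the same kind as the printout.

```python
from collections import Counter, defaultdict


def train_bigrams(text: str):
    """Count, for each character, which characters follow it in the sample."""
    follow = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        follow[a][b] += 1
    return follow


def pick(prev: str, candidates, follow) -> str:
    """Among look-alike candidates, choose the one seen most often after prev."""
    return max(candidates, key=lambda c: follow[prev][c])


follow = train_bigrams("QUICK QUIET QUOTA EQUAL")
print(pick("Q", ["O", "U", "0"], follow))
```

Ties go to whichever candidate is listed first, so the caller should list candidates in the OCR engine's own order of preference.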

 The OCR software can start with a substantial DB of fonts.

 But, with heuristic enhancement, it could even start with NO font DB!
 Hire the neighbor kid to type whatever shows up in the graphics image on
 the screen; soon it will recognize a lot of the characters; eventually, it
 will recognize damn near all of them.  After sufficient use, the only
 times operator intervention would be needed would be for damaged
 characters, ligatures, etc.

 These algorithms have been implemented on an experimental basis.
 Is there any commercial software that does an adequate job?

 --
 Grumpy Ol' Fred     		cisin at xenosoft.com
