Overdue context sensitivity in OCR (Was: Best way to scan 132 column fanfold mid-1970s text printout

28 Jun 2011

...
 > Probabilistic ranking can do quite a bit if set up properly.  For
 > example, what characters would be most likely after a 'Q'?  ('U',
 > period, comma, or space)  What are the most likely characters
 > following a space? (Hint: AFTER A SPACE, it is NOT ETAOINSHRLDU)
On Tue, 28 Jun 2011, Stan Sieler wrote:
...
  Your last question, about the space, got me curious...
 I took your email (including the embedded quote of Dan's email) and counted:
 I took the liberty of fixing a few typos (e.g.: paper,standard was
 changed to "paper, standard" (otherwise that "s" wouldn't be counted)),
 and considered the first letter of a word after a paragraph or sentence
 break still counted as being after a space.
A LOT can be learned from cryptography (of which I am almost totally
ignorant) for OCR.  For example, if we assume (RISKY??) that whitespace is
known, a single-letter word is PROBABLY one of a very few choices.
An important principle to remember is that we can derive a lot of
information from the data by paying attention to, and USING, the
probabilities and their combinations.
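As a rough illustration of USING the probabilities in combination, here is
a minimal Python sketch for the single-letter-word case.  Everything in it
is an assumption made up for the example: the function name rerank, the
prior values, and the OCR scores are hypothetical, not output from any real
OCR engine or corpus.

```python
def rerank(ocr_scores, prior, floor=1e-6):
    """Multiply each OCR score by a language prior, then normalize to sum to 1.

    Candidates missing from the prior get a tiny floor value instead of zero,
    so that they stay possible but become very unlikely.
    """
    combined = {ch: score * prior.get(ch, floor) for ch, score in ocr_scores.items()}
    total = sum(combined.values())
    return {ch: value / total for ch, value in combined.items()}


# Hypothetical prior for a one-letter English word (illustrative numbers only).
SINGLE_LETTER_PRIOR = {"a": 0.60, "I": 0.38, "O": 0.02}

# Hypothetical OCR scores for an isolated glyph: the engine slightly prefers 'l'.
ocr_scores = {"l": 0.5, "I": 0.4, "1": 0.1}

posterior = rerank(ocr_scores, SINGLE_LETTER_PRIOR)
best = max(posterior, key=posterior.get)
print(best, posterior)
```

With these made-up numbers the context flips the decision from 'l' to 'I',
which is the whole point: the glyph alone is ambiguous, but its position
as a one-letter word is not.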
Bill Cooper:  But isn't "fuzzy logic" just combinatorial probability?
Lotfi Zadeh:  Probably.
Using the probabilities may require a LOT of data to work from, not just a
few amateur wordy posts.
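Collecting the raw numbers is the easy part.  A minimal sketch in Python of
counting which characters follow which (the function name follower_counts
and the sample sentence are my own inventions for illustration; a real
table would need a large corpus from the right subject area):

```python
from collections import Counter, defaultdict


def follower_counts(text):
    """Map each character to a Counter of the characters that follow it.

    Case is folded, so 'Q' and 'q' share one entry.
    """
    table = defaultdict(Counter)
    folded = text.lower()
    for current, following in zip(folded, folded[1:]):
        table[current][following] += 1
    return table


# Tiny made-up sample; far too small to mean anything statistically.
sample = "The quick quiz about Iraq. Quite a question, FAQ included."
after_q = follower_counts(sample)["q"]
print(after_q.most_common())
```

The same table, indexed by the space character, gives the first-letter
counts discussed in this thread.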
To use another of the most gross and obvious examples: in general English
text, the letter following a 'Q' is generally a 'U', except within
acronyms or part numbers.  In political text, if a word ends in 'q', the
word is so likely to be Iraq that much effort can be saved by just looking
for confirmation of that, rather than struggling with uncertain characters
before looking at the word.
Likewise, in THIS context, a word ending in 'q' is most likely Compaq.
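The "look for confirmation" idea can be sketched as matching a partly-read
word against a subject-specific word list.  This is only a sketch in
Python under stated assumptions: the candidates function, the '?'
convention for uncertain characters, and both word lists with their counts
are hypothetical.

```python
import re


def candidates(pattern, lexicon):
    """Return (word, count) pairs matching pattern, most frequent first.

    In pattern, '?' stands for one uncertain character; anything else
    must match exactly (ignoring case).
    """
    body = "".join("." if ch == "?" else re.escape(ch) for ch in pattern)
    regex = re.compile("^" + body + "$", re.IGNORECASE)
    hits = [(word, count) for word, count in lexicon.items() if regex.match(word)]
    return sorted(hits, key=lambda pair: -pair[1])


# Hypothetical word counts for two subject areas (made-up numbers).
POLITICS = {"Iraq": 120, "Iran": 90, "Tariq": 3}
COMPUTING = {"Compaq": 80, "compat": 5, "Iraq": 1}

print(candidates("???q", POLITICS))
print(candidates("?????q", COMPUTING))
```

With only the final 'q' and the word length known, each list leaves a
single candidate, so the remaining work is confirming it against the
uncertain characters rather than decoding them one at a time.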
The content can also cause ENORMOUS variation.  Different subject
vocabularies have different frequency statistics.
Different scientific disciplines have different vocabularies.
And, of course, if somebody had the unenviable task of trying to decode
"text-speak" ("C U 4 lunch @ 12?", etc.), there would be different
frequencies.  (Or the email illiteracy of there/their/they're, two/to/too.)
...
 I counted uppercase letters with their lowercase counterparts.
 The most popular four letters were four of the five of "ETAOI":
 t  (# instances: 61)
 i  (# instances: 49)
 a  (# instances: 47)
 o  (# instances: 37)
 c  (# instances: 35)
 w  (# instances: 32)
 b  (# instances: 28)
 s  (# instances: 28)
 f  (# instances: 20)
 p  (# instances: 17)
 d  (# instances: 14)
 e  (# instances: 14)
 . . .
 But, I'm curious...what were the letters you thought would appear more
 frequently? 
Not much more than just that letter frequencies of first letters of words
should not necessarily be assumed to match the overall frequency, whereas
maybe the second or third letter might be a close match.
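For anybody who wants to check that against their own text, here is a
minimal Python sketch that counts letters by position within the word and
compares them with the overall counts.  The function names and the sample
sentence are hypothetical, and a single sentence proves nothing; it just
shows the mechanics.

```python
import re
from collections import Counter


def letter_counts_by_position(text, position):
    """Count the letter at a 0-based position within each word (case-folded)."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word[position] for word in words if len(word) > position)


def overall_letter_counts(text):
    """Count every ASCII letter in the text (case-folded)."""
    return Counter(ch for ch in text.lower() if "a" <= ch <= "z")


# Made-up sample sentence, chosen only to keep the example short.
sample = "The tree there shelters three eager wrens"
first = letter_counts_by_position(sample, 0)
overall = overall_letter_counts(sample)
print("first letters:", first.most_common(3))
print("all letters:  ", overall.most_common(3))
```

Calling letter_counts_by_position with position 1 or 2 gives the second-
and third-letter counts for the same comparison.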
'E' is not "rare" as the start of a word, but certainly is NOT the most
common first letter.   Would it be relevant to glance at the thickness of
the sections for the different letters in the dictionary?  Well, yes and
no.  The section length in the dictionary may provide a clue as to how
many words start with a given letter (but inaccurately, due to the
variations in lengths of the definitions), and certainly not how
frequently said words are used  ("the" is generally used more often than
"thespian").
Some portions of the relative frequencies of letter use are very stable,
and some other portions vary wildly.
I'm curious about the effects of typos and mispellinqs.  You altered the
text by correcting our errors.  What effect on decoding will the frequency
of the words "hte" and "teh" have?
There has been some serious research on these subjects, but, other than
in cryptography, inadequate application of that knowledge in the fields
where it would be useful, such as OCR, error correction, etc.
--
Grumpy Ol' Fred                     cisin at xenosoft.com
