If you OCR, always archive the bitmaps too - Re: Regarding Manuals

Fred Cisin cisin at xenosoft.com
Sun Sep 27 17:27:50 CDT 2015


On Sun, 27 Sep 2015, Johnny Billquist wrote:
> That would be possible, I guess. But I would so like to remember, refind what 
> I used back then. The results it produced was pretty much identical to the 
> original. Manuals, in comparison, would be pretty straight forward. (Less 
> fonts, and less strange layouts than books, in my eye. Figures still needs to 
> be bitmaps, though.)

While I have no way of knowing what you were using back in the day, 
something else to keep in mind about books of text vs. manuals: 
with a book of text, there is significantly more context for every item. 
As a trivial example, in English language text, a 'Q' is virtually never 
followed by anything other than 'u', space, or punctuation.  Therefore, if 
there is a letter following a 'Q', it can be assumed to be 'u' unless 
proven otherwise.  Not so for part numbers, variable names, etc.  Applying 
"spell-checking" to a document gives a very high initial set of 
probabilities for letters that might otherwise be unclear.  If a given 
font has a highly stylized 'e', then its identity can be checked just 
with letter frequency: if it is extremely common and surrounded by other 
letters, it is quite unlikely to be a slashed '0'.  '0', 'O', '1', 'l', 
and 'I' can generally be differentiated by context, such as whether they 
are surrounded by numbers or letters, but NOT as reliably by shape, or 
even by pixel matching.
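
A rough sketch of that kind of context rule, in Python.  Everything here 
is made up for illustration (the tables, the function name, the majority 
vote); it is not how any particular OCR package does it.  Mapping '1' to 
'l' rather than 'I' in letter context is an arbitrary simplification.

```python
# Look-alike characters and what they become in each kind of context.
# Both tables are illustrative assumptions, not taken from a real OCR engine.
AS_DIGIT = {"O": "0", "o": "0", "l": "1", "I": "1"}
AS_LETTER = {"0": "O", "1": "l"}


def disambiguate(token: str) -> str:
    """Resolve look-alike characters in one token by majority context."""
    # Only characters that are NOT look-alikes count as evidence.
    clear = [c for c in token if c not in AS_DIGIT and c not in AS_LETTER]
    digits = sum(c.isdigit() for c in clear)
    letters = sum(c.isalpha() for c in clear)
    if digits > letters:
        table = AS_DIGIT
    elif letters > digits:
        table = AS_LETTER
    else:
        return token  # no usable context, leave the token alone
    return "".join(table.get(c, c) for c in token)


print(disambiguate("2O15"))    # mostly digits, so 'O' is read as '0'
print(disambiguate("he1lo"))   # mostly letters, so '1' is read as 'l'
print(disambiguate("PDP-11"))  # a part number: the rule gets it wrong
```

On ordinary words and numbers this does the right thing, but "PDP-11" 
comes back as "PDP-ll", which is exactly the problem with part numbers 
and variable names.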

Therefore, OCR programs that rely on those kinds of techniques might do 
great on text, but be bordering on unusable for tech documents.


My idea was to make human-assisted OCR, by displaying the OCR in progress, 
with color coding of characters based on their probability of accuracy. 
Then, cheap labor could manually enter characters, starting with those 
that had the lowest probability of accuracy.  Minor heuristic algorithms 
could then use the incoming character/pixel pattern pairs to improve the 
guesses of subsequent characters.  The accumulated pairs would, in 
effect, teach the program additional fonts.
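
A minimal sketch of that loop, in Python, under some loud assumptions: 
the Glyph class, the 0.9 threshold, and the ask_human callback are all 
hypothetical, and a short string stands in for the pixel pattern.  Exact 
pattern matching stands in for the "minor heuristic algorithms"; a real 
one would need some kind of nearest-match comparison on bitmaps.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    pattern: str       # hypothetical stand-in for a pixel bitmap
    guess: str         # the OCR program's current best guess
    confidence: float  # 0.0 (no idea) to 1.0 (certain)


def assisted_pass(glyphs, ask_human, threshold=0.9):
    """Ask about the least certain glyphs first and reuse every answer.

    Returns the number of characters the human actually had to enter.
    """
    learned = {}   # pattern -> character confirmed by the human
    questions = 0
    for g in sorted(glyphs, key=lambda g: g.confidence):
        if g.pattern in learned:
            # Same pattern was already answered: no need to ask again.
            g.guess, g.confidence = learned[g.pattern], 1.0
        elif g.confidence < threshold:
            learned[g.pattern] = ask_human(g)
            g.guess, g.confidence = learned[g.pattern], 1.0
            questions += 1
    return questions


page = [
    Glyph("A1", "0", 0.4),
    Glyph("B2", "e", 0.95),
    Glyph("A1", "O", 0.5),
    Glyph("C3", "l", 0.3),
]
answers = {"A1": "O", "C3": "1"}  # what the human would type
asked = assisted_pass(page, lambda g: answers[g.pattern])
print(asked, "".join(g.guess for g in page))
```

The confidence value is also what would drive the color coding on the 
display; the point of sorting by it is that each answer from the human 
gets applied to every later glyph with the same pattern.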

The cheap labor could be neighborhood kids, offshore outsourcing, or 
even grad students, depending on how much you care about their quality of 
life and cost of living.  For premium quality, use workers who have at 
least some knowledge of the material.

BUT, one must never pick a worker who was brought up to interchange '0' 
and 'O', '1' and 'l', etc.  (Remember when some typewriters didn't HAVE 
both characters?)


