If you OCR, always archive the bitmaps too - Re: Regarding Manuals
Fred Cisin
cisin at xenosoft.com
Sun Sep 27 17:27:50 CDT 2015
On Sun, 27 Sep 2015, Johnny Billquist wrote:
> That would be possible, I guess. But I would so like to remember, refind what
> I used back then. The results it produced was pretty much identical to the
> original. Manuals, in comparison, would be pretty straight forward. (Less
> fonts, and less strange layouts than books, in my eye. Figures still needs to
> be bitmaps, though.)
While I have no way of knowing what you were using back in the day,
something else to keep in mind about books of text vs. manuals:
with a book of text, there is significantly more context for every item.
As a trivial example, in English language text, a 'Q' is virtually never
followed by anything other than 'u', space, or punctuation. Therefore, if
there is a letter following a 'Q', it can be assumed to be 'u' unless
proven otherwise. Not so for part numbers, variable names, etc. Applying
"spell-checking" to a document gives a very high initial set of
probabilities for letters that might otherwise be unclear. If a given
font has a very stylized 'e', then its identity can be checked just
with letter frequency: if the glyph is extremely common and surrounded by
other letters, it is quite unlikely to be a slashed '0'. Likewise, '0', 'O',
'1', 'l', 'I' can generally be differentiated by context, such as whether
they are surrounded by numbers or letters, but NOT as reliably by shape, or
even by pixel matching.
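A minimal sketch of that kind of context rule, in Python. Everything here
is an assumption made for illustration (the function name, the look-alike
tables, and the "majority of unambiguous neighbours" rule); it is not how
any particular OCR program works.

```python
# Hypothetical look-alike tables: which glyphs could be read either way.
DIGIT_FORM = {"O": "0", "o": "0", "l": "1", "I": "1"}
LETTER_FORM = {"0": "O", "1": "l"}


def disambiguate(token: str) -> str:
    """Re-read look-alike characters based on their immediate neighbours."""
    chars = list(token)
    for i, ch in enumerate(chars):
        if ch not in DIGIT_FORM and ch not in LETTER_FORM:
            continue  # not one of the confusable glyphs
        # Look at the original neighbours, ignoring ones that are
        # themselves ambiguous.
        neighbours = [token[j] for j in (i - 1, i + 1) if 0 <= j < len(token)]
        clear = [n for n in neighbours
                 if n not in DIGIT_FORM and n not in LETTER_FORM]
        digits = sum(n.isdigit() for n in clear)
        letters = sum(n.isalpha() for n in clear)
        if digits > letters and ch in DIGIT_FORM:
            chars[i] = DIGIT_FORM[ch]   # surrounded by numbers: read as digit
        elif letters > digits and ch in LETTER_FORM:
            chars[i] = LETTER_FORM[ch]  # surrounded by letters: read as letter
    return "".join(chars)


print(disambiguate("2O15"))    # 2015
print(disambiguate("w0rld"))   # wOrld
print(disambiguate("A0-1l7"))  # AO-117
```

The first two cases behave sensibly. The third is a made-up part number,
and the rule rewrites it with no way of knowing whether the original
really had a zero or an 'l' there, which is exactly the problem with
applying text heuristics to tech documents.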
Therefore, some OCR programs that rely on those kinds of techniques
might do great on ordinary text, but border on unusable for tech
documents.
My idea was to make human-assisted OCR, by displaying the OCR in progress,
with color coding of characters based on their probability of accuracy.
Then, cheap labor could manually enter characters, starting with those
that had the lowest probability of accuracy. Minor heuristic algorithms
could then use the incoming data of additional character/pixel pattern
pairs to improve the guesses of subsequent characters. From the cumulative
data pairs, the program would learn additional fonts.
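A rough sketch of that workflow, in Python. The names (Glyph, colour,
review_order, apply_correction), the colour thresholds, and the use of an
exact pattern string in place of real pixel data are all assumptions for
illustration; a real program would compare pixel patterns by similarity
rather than by exact match.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    pattern: str       # stand-in for the pixel pattern (hypothetical encoding)
    guess: str         # current best guess at the character
    confidence: float  # estimated probability that the guess is right


def colour(confidence: float, low: float = 0.5, high: float = 0.9) -> str:
    """Colour code for display; the thresholds are arbitrary assumptions."""
    if confidence < low:
        return "red"
    if confidence < high:
        return "yellow"
    return "green"


def review_order(glyphs: list[Glyph]) -> list[int]:
    """Indexes of glyphs, least certain first, for the human to work through."""
    return sorted(range(len(glyphs)), key=lambda i: glyphs[i].confidence)


def apply_correction(glyphs: list[Glyph], index: int, typed: str,
                     learned: dict[str, str]) -> None:
    """Record a human-entered character and reuse it for matching patterns."""
    pattern = glyphs[index].pattern
    learned[pattern] = typed  # the cumulative character/pattern pairs
    for g in glyphs:
        if g.pattern == pattern:
            g.guess = typed
            g.confidence = 1.0


page = [Glyph("p1", "O", 0.40), Glyph("p2", "e", 0.95),
        Glyph("p1", "O", 0.45), Glyph("p3", "l", 0.70)]
learned: dict[str, str] = {}
first = review_order(page)[0]
apply_correction(page, first, "0", learned)
print([(g.guess, colour(g.confidence)) for g in page])
```

One keystroke on the least certain glyph also settles the other glyph
with the same pattern, so the queue of characters needing a human shrinks
faster than one character per entry.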
The cheap labor could be neighborhood kids, off-shore out-sourcing, or
even grad students, depending on how much you care about their quality of
life and cost of living. For premium quality, use workers who even have
some knowledge of the material.
BUT, one must never pick a worker who was brought up to interchange '0'
and 'O', '1' and 'l', etc. (Remember when some typewriters didn't HAVE
both characters?)