I've been trying for years to get usable scans of old computer listings.
Look at the attached file. It was printed with a Diablo HyType and new
ribbon. There are only 64 unique characters (6 bit character set) in the
entire listing. Courier (non-proportional) font. And yet the OCR results are
miserable, even ChatGPT's feeble effort.
Marc
On Wed, Dec 3, 2025 at 2:55 PM Paul Koning via cctalk <cctalk(a)classiccmp.org>
wrote:
On Dec 3, 2025, at 10:55 AM, Adrian Godwin via cctalk <cctalk(a)classiccmp.org> wrote:
I don't think it's the general quality of the patent print that's poor,
it's the line-printer listing section from
https://www.hp9845.net/9845/downloads/patents/US4089059.pdf starting at
about page 213 of the PDF, possibly section 26 of the patent.
The print in that section is much paler than the rest - typical of a worn
line-printer ribbon. I doubt the printed copy is any better. I'm only
trying to OCR the listing, not the rest of the patent.
That's quite a clean listing, actually, cleaner than most I have worked
with and dramatically better than some. The sort of slightly-damaged
characters that appear should be no problem at all for the "training"
feature of ABBYY Fine Reader to deal with. What you'd have to do is run a
number of pages through it in training mode, so it sees a number of
variations of the individual characters. And as I mentioned, you'd do all
the scanning in the mode where it only accepts what it was trained with, no
"builtin" patterns. That way it won't make up stuff that isn't part of
the character set but happens to match something built-in, like a
pound-sterling sign.
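Whatever OCR tool is used, the same idea can be applied as a check on its output: flag anything that falls outside the listing's character set. A minimal sketch in Python, assuming (this is not stated anywhere in the thread) that the 64 characters are ASCII 0x20 through 0x5F; `ALLOWED` and `find_strays` are made-up names for illustration.

```python
# Assumption: the listing's 64-character set is ASCII 0x20..0x5F
# (space, digits, upper-case letters, common punctuation).
# Substitute the real set for the printer in question.
ALLOWED = {chr(code) for code in range(0x20, 0x60)}


def find_strays(text, allowed=ALLOWED):
    """Return (line_number, column, character) for each character outside the set."""
    strays = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch not in allowed:
                strays.append((line_no, col, ch))
    return strays


if __name__ == "__main__":
    sample = "LDA A,B\nST\u00a3 X"  # second line contains a stray pound-sterling sign
    for line_no, col, ch in find_strays(sample):
        print(f"line {line_no}, column {col}: unexpected {ch!r}")
```

This only reports where the OCR invented a character; deciding what the character should have been still takes a look at the scan.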
It may be that scanning the listing as a table (with the various columns
as table columns) will work well, and give you the layout explicitly. Or
it can be scanned as plain text, but in that case the spacing will mostly
turn into individual spaces and you'd need post-processing to insert tabs
etc. to make it look right again. Given the simple assembler syntax
involved that sort of post-processing would not be hard.
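As a rough illustration of that kind of post-processing, here is a minimal Python sketch. It assumes a hypothetical line layout of optional label, opcode, operand, then free-text comment, and a made-up opcode table; the real listing's columns (addresses, object code, and so on) would need their own rules. `OPCODES` and `realign` are illustrative names only.

```python
# Hypothetical opcode table; replace with the real instruction mnemonics.
OPCODES = {"LDA", "STA", "ADA", "JMP"}


def realign(line, opcodes=OPCODES):
    """Turn a space-collapsed line into tab-separated label/opcode/operand/comment."""
    tokens = line.split()
    if not tokens:
        return ""
    # If the first token is a known opcode, the line has no label.
    if tokens[0] in opcodes:
        label, rest = "", tokens
    else:
        label, rest = tokens[0], tokens[1:]
    opcode = rest[0] if rest else ""
    operand = rest[1] if len(rest) > 1 else ""
    comment = " ".join(rest[2:])
    # Drop trailing tabs left by empty fields; a leading tab (no label) is kept.
    return "\t".join([label, opcode, operand, comment]).rstrip()


if __name__ == "__main__":
    for raw in ["LOOP LDA COUNT GET THE COUNT", " JMP LOOP"]:
        print(realign(raw))
```

The sketch will misplace fields when a label happens to share a name with an opcode, or when an instruction takes no operand and is followed by a comment, so a real version would need the assembler's actual rules.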
paul