OCR old software listing.
Toby Thain
toby at telegraphics.com.au
Fri Dec 28 23:47:20 CST 2018
On 2018-12-26 4:29 PM, Mattis Lind via cctalk wrote:
> Finally I got hold of the sources for the PDP-11 SPACE WAR that was
> submitted to DECUS by Bill Seiler.
>
> The format is scans of the PAL-11S listing output. It is easy to crop the
> image so that it contains only the actual source and then run OCR on it. I
> tried a few online OCR services and Tesseract.
>
> The problem is that the paper that the listing is printed on has lines.
> Very black lines. They make the OCR go completely crazy. Source lines
> without black lines OCR fine. The others do not. The files need a massive
> amount of manual intervention.
>
> Does anyone have an idea how to process files like this?
>
> A good way to remove the black lines?
Hi Mattis
Here's a first cut. Can probably be improved slightly. Let me know how
much this still confuses Tesseract.
https://docs.telegraphics.com.au/mattis/spcwar_pdp11_edit.tif
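For anyone who wants to experiment with this kind of cleanup, here is a
minimal sketch of one simple approach. It is not necessarily how the image
above was produced. It assumes the page has already been binarized into rows
of 0 (white) and 1 (black) pixels and that the ruled lines are close to
horizontal. The function name and the threshold value are made up for
illustration. Rows that are mostly black are treated as a ruled line and
whitened, except in columns where there is ink directly above and below the
band, which is probably a character stroke crossing the line.

```python
def remove_ruled_lines(img, row_threshold=0.8):
    """Whiten horizontal ruled lines in a binary image.

    img is a list of rows; each row is a list of 0 (white) / 1 (black).
    row_threshold is the fraction of black pixels above which a row is
    assumed to belong to a ruled line (a guess; tune it per scan).
    Returns a new image and leaves the input untouched.
    """
    height = len(img)
    width = len(img[0]) if height else 0
    out = [row[:] for row in img]
    if width == 0:
        return out

    # Mark rows that are mostly black: these are candidate ruled lines.
    is_line = [sum(row) >= row_threshold * width for row in img]

    y = 0
    while y < height:
        if not is_line[y]:
            y += 1
            continue
        # Find the vertical extent of this band of consecutive line rows.
        top = y
        while y < height and is_line[y]:
            y += 1
        bottom = y  # exclusive

        for x in range(width):
            # Keep the column black only if ink continues on both sides
            # of the band, i.e. a stroke probably passes through the line.
            above = top > 0 and img[top - 1][x] == 1
            below = bottom < height and img[bottom][x] == 1
            value = 1 if (above and below) else 0
            for yy in range(top, bottom):
                out[yy][x] = value
    return out


if __name__ == "__main__":
    page = [
        [0, 0, 1, 0, 0, 0],
        [0, 0, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1],  # ruled line crossed by a vertical stroke
        [0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
    ]
    for row in remove_ruled_lines(page):
        print("".join("#" if p else "." for p in row))
```

A real scan would first have to be loaded and thresholded with an imaging
library, and skewed pages would need deskewing before a row-based test like
this has any chance of working.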
--Toby
>
> There are only 19 source files with three or four pages each so I don't
> think it makes sense to try to train tesseract to do it (training tesseract
> seems to be a huge undertaking).
>
> https://i.imgur.com/dvY973s.png
>
> /Mattis
>