On 2018-12-26 4:29 PM, Mattis Lind via cctalk wrote:
Finally I got hold of the sources for the PDP-11 SPACE WAR that was
submitted to DECUS by Bill Seiler.
The format is scans of the PAL-11S listing output. It is easy to crop the
image to contain only the actual source and then run OCR on it. I tried a few
online OCR tools and tesseract.
The problem is that the paper the listing is printed on has lines.
Very black lines. They make the OCR go completely crazy. Source lines
without black lines OCR OK. The others do not. The files need a massive
amount of manual intervention.
Does anyone have an idea how to process files like this?
A good way to remove the black lines?
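
One possible line of attack, sketched here purely as an illustration and not as the method used for any edit posted in this thread: treat the binarized scan as a grid of pixels, flag rows that are almost entirely black as ruled lines, and whiten them except where text ink clearly continues on both sides. The function names, the 0/1 grid representation, and the 0.8 fill threshold are all assumptions made for the example. It also assumes the scan is not skewed, so that each ruled line falls on whole pixel rows.

```python
# Hypothetical sketch: the page is a rectangular list of rows, each row a
# list of pixels where 0 is white and 1 is black. Loading and binarizing
# the real scan is deliberately left out.

def find_rule_rows(image, min_fill=0.8):
    """Return indices of rows whose black-pixel fraction is >= min_fill."""
    rows = []
    for y, row in enumerate(image):
        if row and sum(row) / len(row) >= min_fill:
            rows.append(y)
    return rows


def remove_rules(image, min_fill=0.8):
    """Return a copy with ruled lines whitened, keeping strokes that cross them."""
    rule_rows = set(find_rule_rows(image, min_fill))
    height = len(image)
    result = [list(row) for row in image]
    for y in sorted(rule_rows):
        # Locate the nearest non-rule rows above and below this rule band.
        above = y - 1
        while above >= 0 and above in rule_rows:
            above -= 1
        below = y + 1
        while below < height and below in rule_rows:
            below += 1
        for x in range(len(image[y])):
            ink_above = above >= 0 and image[above][x] == 1
            ink_below = below < height and image[below][x] == 1
            # Keep the pixel only if ink continues on both sides of the rule.
            result[y][x] = 1 if (ink_above and ink_below) else 0
    return result


# Tiny made-up page: a vertical stroke crossed by a two-pixel-thick rule.
page = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
cleaned = remove_rules(page)
```

In the toy page the rule rows are whitened while the vertical stroke passing through them survives. On a real listing the threshold would need tuning, and characters that merely touch a line from one side would still lose their pixels inside the rule, so some manual cleanup is likely to remain.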
Hi Mattis
Here's a first cut. Can probably be improved slightly. Let me know how
much this still confuses Tesseract.
https://docs.telegraphics.com.au/mattis/spcwar_pdp11_edit.tif
--Toby
There are only 19 source files with three or four pages each so I don't
think it makes sense to try to train tesseract to do it (training tesseract
seems to be a huge undertaking).
https://i.imgur.com/dvY973s.png
/Mattis