On 2018-12-29 12:47 AM, Toby Thain via cctalk wrote:
On 2018-12-26 4:29 PM, Mattis Lind via cctalk wrote:
Finally I got hold of the sources for the PDP-11 SPACE WAR that was
submitted to DECUS by Bill Seiler.
The format is scans of the PAL-11S listing output. It is easy to crop each
image so that it contains only the actual source and then run OCR on it.
I tried a few online OCR tools and Tesseract.
The problem is that the paper the listing is printed on has lines.
Very black lines. They make the OCR go completely crazy. Source lines
without black lines OCR OK; the others do not. The resulting files need a
massive amount of manual intervention.
Does anyone have an idea how to process files like this?
A good way to remove the black lines?
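One generic way to attack this is a row projection profile: a printed rule fills almost the whole width of a row, while a row of text does not. The sketch below is only an illustration under stated assumptions: the page is already deskewed and binarized, the rules span nearly the full width, and the image is a plain list of 0/1 rows (loading a real scan needs an imaging library, not shown). The names `remove_ruled_lines`, `from_text`, `to_text` and the `min_fill` threshold are made up for this example.

```python
def from_text(rows):
    """Convert rows of '#' (black) and '.' (white) into lists of 1/0."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def to_text(img):
    """Convert lists of 1/0 back into rows of '#' and '.'."""
    return ["".join("#" if px else "." for px in row) for row in img]


def remove_ruled_lines(img, min_fill=0.8):
    """Blank rows that are mostly black, keeping strokes that cross them.

    img is a list of equal-length rows of 0 (white) / 1 (black).
    min_fill is an assumed threshold: the fraction of black pixels
    above which a row is treated as part of a printed rule.
    """
    height = len(img)
    width = len(img[0])
    is_line = [sum(row) >= min_fill * width for row in img]
    out = [row[:] for row in img]
    for y in range(height):
        if not is_line[y]:
            continue
        # Find the nearest rows above and below that are not part of the rule.
        up = y - 1
        while up >= 0 and is_line[up]:
            up -= 1
        down = y + 1
        while down < height and is_line[down]:
            down += 1
        for x in range(width):
            above = img[up][x] if up >= 0 else 0
            below = img[down][x] if down < height else 0
            # Keep a pixel only where ink continues on both sides of the
            # rule, which suggests a character stroke crossing it.
            out[y][x] = 1 if (above and below) else 0
    return out


if __name__ == "__main__":
    page = from_text([
        "..#.......",
        "..#.......",
        "##########",
        "##########",
        "..#.......",
        "..........",
    ])
    print("\n".join(to_text(remove_ruled_lines(page))))
```

On a real scan the threshold would need tuning, and characters that only touch a rule from one side lose the part that overlaps it.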
Hi Mattis
Here's a first cut. Can probably be improved slightly. Let me know how
much this still confuses Tesseract.
https://docs.telegraphics.com.au/mattis/spcwar_pdp11_edit.tif
That is a multipage TIF, and the page order key is listed below.
I just noticed that a handful of pages seem to be missing, so I'll look
into that.
CHAR--0000
CHAR--0001
CHAR--0002
CHRTAB--0000
CHRTAB--0001
CHRTAB--0002
COMPAR--0000
COMPAR--0001
COMPAR--0002
COMPAR--0003
EXPLOD--0000
EXPLOD--0001
EXPLOD--0002
GRAVTY--0000
GRAVTY--0001
GRAVTY--0002
GRAVTY--0003
MULPLY--0000
MULPLY--0001
MULPLY--0002
PARM--0000
PARM--0001
PARM--0002
PARM--0003
PARM--0005
PARM--0006
PARM--0007
PARM--0008
PARM--0009
PWRUP--0000
PWRUP--0001
RESET--0000
RESET--0001
RKT1--0000
RKT1--0001
RKT2--0000
RKT2--0001
SCORE--0000
SCORE--0001
SINCOS--0000
SINCOS--0001
SINCOS--0002
SLINE--0000
SLINE--0001
SPCWAR--0000
SPCWAR--0001
SPCWAR--0002
SUN--0000
SUN--0001
SUN--0002
UPDAT1--0000
UPDAT1--0001
UPDAT1--0002
UPDAT2--0000
UPDAT2--0002
point--0000
point--0001
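The key above can also be checked for gaps mechanically. This is a minimal sketch, assuming page numbers within each file are meant to be consecutive starting at 0000; `PAGE_KEY` and `find_gaps` are names made up for this illustration. It cannot detect pages missing from the end of a file.

```python
from collections import defaultdict

# Page order key copied from the list above: NAME--NNNN, one entry per page.
PAGE_KEY = """
CHAR--0000 CHAR--0001 CHAR--0002
CHRTAB--0000 CHRTAB--0001 CHRTAB--0002
COMPAR--0000 COMPAR--0001 COMPAR--0002 COMPAR--0003
EXPLOD--0000 EXPLOD--0001 EXPLOD--0002
GRAVTY--0000 GRAVTY--0001 GRAVTY--0002 GRAVTY--0003
MULPLY--0000 MULPLY--0001 MULPLY--0002
PARM--0000 PARM--0001 PARM--0002 PARM--0003 PARM--0005
PARM--0006 PARM--0007 PARM--0008 PARM--0009
PWRUP--0000 PWRUP--0001
RESET--0000 RESET--0001
RKT1--0000 RKT1--0001
RKT2--0000 RKT2--0001
SCORE--0000 SCORE--0001
SINCOS--0000 SINCOS--0001 SINCOS--0002
SLINE--0000 SLINE--0001
SPCWAR--0000 SPCWAR--0001 SPCWAR--0002
SUN--0000 SUN--0001 SUN--0002
UPDAT1--0000 UPDAT1--0001 UPDAT1--0002
UPDAT2--0000 UPDAT2--0002
point--0000 point--0001
""".split()


def find_gaps(entries):
    """Return {name: [missing page numbers]} for gaps inside each file's run."""
    pages = defaultdict(set)
    for entry in entries:
        name, _, number = entry.partition("--")
        pages[name].add(int(number))
    gaps = {}
    for name, seen in pages.items():
        # Every number from 0 up to the highest one seen should be present.
        missing = sorted(set(range(max(seen) + 1)) - seen)
        if missing:
            gaps[name] = missing
    return gaps


if __name__ == "__main__":
    for name, missing in find_gaps(PAGE_KEY).items():
        print(name, "missing", ", ".join(f"{n:04d}" for n in missing))
```

Run against the key as listed, it reports PARM--0004 and UPDAT2--0001 as absent.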
--Toby
There are only 19 source files with three or four pages each so I don't
think it makes sense to try to train tesseract to do it (training tesseract
seems to be a huge undertaking).
https://i.imgur.com/dvY973s.png
/Mattis