OCR old software listing.

Toby Thain toby at telegraphics.com.au
Sat Dec 29 00:32:01 CST 2018


On 2018-12-29 12:47 AM, Toby Thain via cctalk wrote:
> On 2018-12-26 4:29 PM, Mattis Lind via cctalk wrote:
>> Finally I got hold of the sources for the PDP-11 SPACE WAR that was
>> submitted to DECUS by Bill Seiler.
>>
>> The format is scans of the PAL-11S listing output. It is easy to crop the
>> image to only contain actual source. Then running OCR on it. Tried a few
>> online versions and tesseract.
>>
>> The problem is that the paper that the listing is printed on has lines.
>> Very black lines. It makes the OCR go completely crazy. Source lines
>> without black lines OCR ok. The others do not. The files need massive
>> amount of manual intervention.
>>
>> Does anyone have an idea how to process files like this?
>>
>> A good way to remove the black lines?
> 
> Hi Mattis
> 
> Here's a first cut. Can probably be improved slightly. Let me know how
> much this still confuses Tesseract.
> 
> https://docs.telegraphics.com.au/mattis/spcwar_pdp11_edit.tif
> 

That is a multipage TIF, and the page order key is listed below.

I just noticed that a handful of pages seem to be missing, so I'll look
into that.

CHAR--0000
CHAR--0001
CHAR--0002
CHRTAB--0000
CHRTAB--0001
CHRTAB--0002
COMPAR--0000
COMPAR--0001
COMPAR--0002
COMPAR--0003
EXPLOD--0000
EXPLOD--0001
EXPLOD--0002
GRAVTY--0000
GRAVTY--0001
GRAVTY--0002
GRAVTY--0003
MULPLY--0000
MULPLY--0001
MULPLY--0002
PARM--0000
PARM--0001
PARM--0002
PARM--0003
PARM--0005
PARM--0006
PARM--0007
PARM--0008
PARM--0009
PWRUP--0000
PWRUP--0001
RESET--0000
RESET--0001
RKT1--0000
RKT1--0001
RKT2--0000
RKT2--0001
SCORE--0000
SCORE--0001
SINCOS--0000
SINCOS--0001
SINCOS--0002
SLINE--0000
SLINE--0001
SPCWAR--0000
SPCWAR--0001
SPCWAR--0002
SUN--0000
SUN--0001
SUN--0002
UPDAT1--0000
UPDAT1--0001
UPDAT1--0002
UPDAT2--0000
UPDAT2--0002
point--0000
point--0001
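
As a rough way to spot the missing pages from the key above, here is a minimal sketch. It assumes each entry has the form NAME--NNNN and that each file's pages are meant to be numbered contiguously from 0000; both are assumptions drawn from the list, not confirmed facts about the scans. The function name find_missing_pages is made up for illustration, and the sample key is only an excerpt of the list.

```python
import re
from collections import defaultdict

def find_missing_pages(names):
    """Return {file name: [missing page numbers]} for gaps in each file's run.

    Assumes entries look like NAME--NNNN and that pages should run
    contiguously from 0000 (an assumption, not stated in the thread).
    Missing trailing pages cannot be detected this way.
    """
    pages = defaultdict(set)
    for name in names:
        m = re.fullmatch(r"(.+)--(\d{4})", name.strip())
        if m:
            pages[m.group(1)].add(int(m.group(2)))
    gaps = {}
    for prefix, seen in pages.items():
        # Every number from 0 up to the highest page seen should be present.
        missing = sorted(set(range(max(seen) + 1)) - seen)
        if missing:
            gaps[prefix] = missing
    return gaps

# Excerpt of the page order key above.
key = [
    "PARM--0000", "PARM--0001", "PARM--0002", "PARM--0003", "PARM--0005",
    "PARM--0006", "PARM--0007", "PARM--0008", "PARM--0009",
    "PWRUP--0000", "PWRUP--0001",
    "UPDAT2--0000", "UPDAT2--0002",
]
print(find_missing_pages(key))  # {'PARM': [4], 'UPDAT2': [1]}
```

On this excerpt it flags PARM--0004 and UPDAT2--0001, which matches the jumps visible in the key. A file whose last pages are missing would not be flagged.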


> --Toby
> 
>>
>> There are only 19 source files with three or four pages each so I don't
>> think it makes sense to try to train tesseract to do it (training tesseract
>> seems to be a huge undertaking).
>>
>> https://i.imgur.com/dvY973s.png
>>
>> /Mattis
>>
> 
> 


