OCR old software listing.

Toby Thain toby at telegraphics.com.au
Sat Dec 29 13:16:49 CST 2018


On 2018-12-29 1:32 AM, Toby Thain via cctalk wrote:
> On 2018-12-29 12:47 AM, Toby Thain via cctalk wrote:
>> On 2018-12-26 4:29 PM, Mattis Lind via cctalk wrote:
>>> Finally I got hold of the sources for the PDP-11 SPACE WAR that was
>>> submitted to DECUS by Bill Seiler.
>>>
>>> The format is scans of the PAL-11S listing output. It is easy to crop the
>>> image to only contain actual source and then run OCR on it. I tried a few
>>> online versions and tesseract.
>>>
>>> The problem is that the paper that the listing is printed on has lines.
>>> Very black lines. It makes the OCR go completely crazy. Source lines
>>> without black lines OCR ok. The others do not. The files need a massive
>>> amount of manual intervention.
>>>
>>> Does anyone have an idea how to process files like this?
>>>
>>> A good way to remove the black lines?
>>
>> Hi Mattis
>>
>> Here's a first cut. Can probably be improved slightly. Let me know how
>> much this still confuses Tesseract.
>>
>> https://docs.telegraphics.com.au/mattis/spcwar_pdp11_edit.tif
>>
> 
> That is a multipage TIF, and the page order key is listed below.
> 
> I just noticed that a handful of pages seem to be missing, so I'll look
> into that.
> 

Fixed that. I was also able to improve the quality. Same link.
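
The cleanup steps behind the linked file are not described in this message. Purely as an illustration of the general idea, here is a minimal sketch of one simple way to suppress ruled lines before OCR: whiten every pixel row that is mostly dark. The function name, the thresholds, and the list-of-rows grayscale representation are all assumptions made for this example. A real scan is usually slightly skewed and would more likely be handled with an image library (for example a morphological opening with a wide horizontal kernel).

```python
def remove_ruled_lines(image, dark_threshold=128, row_fraction=0.6, white=255):
    """Return a copy of `image` with mostly-dark rows painted white.

    `image` is a list of rows, each row a list of grayscale values (0-255).
    This representation and both thresholds are assumptions for the sketch.
    """
    cleaned = []
    for row in image:
        # Count pixels darker than the threshold in this row.
        dark = sum(1 for px in row if px < dark_threshold)
        if row and dark / len(row) >= row_fraction:
            # Row is dominated by dark pixels: treat it as a ruled line.
            cleaned.append([white] * len(row))
        else:
            # Ordinary row (text or blank paper): keep it unchanged.
            cleaned.append(list(row))
    return cleaned


# Tiny hypothetical example: row 1 is a solid black rule, the rest is "text".
sample = [
    [255, 0, 255, 255],
    [0, 0, 0, 0],
    [255, 255, 0, 255],
]
cleaned_sample = remove_ruled_lines(sample)
```

Note the obvious limitation: any character strokes that overlap a ruled row are erased along with the rule, so the result may still need manual touch-up.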

The full page manifest is:

CHAR--0000
CHAR--0001
CHAR--0002
CHRTAB--0000
CHRTAB--0001
CHRTAB--0002
CHRTAB--0003
COMPAR--0000
COMPAR--0001
COMPAR--0002
COMPAR--0003
EXPLOD--0000
EXPLOD--0001
EXPLOD--0002
GRAVTY--0000
GRAVTY--0001
GRAVTY--0002
GRAVTY--0003
MULPLY--0000
MULPLY--0001
MULPLY--0002
PARM--0000
PARM--0001
PARM--0002
PARM--0003
PARM--0004
PARM--0005
PARM--0006
PARM--0007
PARM--0008
PARM--0009
PWRUP--0000
PWRUP--0001
RESET--0000
RESET--0001
RKT1--0000
RKT1--0001
RKT2--0000
RKT2--0001
SCORE--0000
SCORE--0001
SINCOS--0000
SINCOS--0001
SINCOS--0002
SINCOS--0003
SLINE--0000
SLINE--0001
SPCWAR--0000
SPCWAR--0001
SPCWAR--0002
SUN--0000
SUN--0001
SUN--0002
UPDAT1--0000
UPDAT1--0001
UPDAT1--0002
UPDAT2--0000
UPDAT2--0001
UPDAT2--0002
point--0000
point--0001
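
To use the list above as a key, each entry has to be matched to a page of the TIF. The sketch below assumes that the Nth entry corresponds to the Nth page of the file (counting from 0) and that each label has the form NAME--NNNN, with NAME being the source file and NNNN the page within it. Both are readings of the list, not something stated explicitly, and the function names are made up for this example.

```python
def parse_manifest(lines):
    """Turn manifest lines into (tiff_page_index, source_name, page_in_source).

    Assumes list order equals TIF page order and labels look like NAME--NNNN.
    """
    entries = []
    for line in lines:
        label = line.strip()
        if not label:
            continue  # skip blank lines
        name, sep, number = label.rpartition("--")
        if not sep or not name or not number.isdigit():
            raise ValueError("unexpected manifest entry: %r" % label)
        entries.append((len(entries), name, int(number)))
    return entries


def pages_by_source(entries):
    """Group TIF page indexes by source file name, preserving order."""
    grouped = {}
    for index, name, _page in entries:
        grouped.setdefault(name, []).append(index)
    return grouped


# Excerpt: the first four entries of the manifest above.
excerpt = ["CHAR--0000", "CHAR--0001", "CHAR--0002", "CHRTAB--0000"]
excerpt_entries = parse_manifest(excerpt)
excerpt_groups = pages_by_source(excerpt_entries)
```

With the grouping in hand, the pages for one source file can be pulled out of the TIF and fed to the OCR tool together, so each listing ends up in its own text file.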


> 
>> --Toby
>>
>>>
>>> There are only 19 source files with three or four pages each so I don't
>>> think it makes sense to try to train tesseract to do it (training tesseract
>>> seems to be a huge undertaking).
>>>
>>> https://i.imgur.com/dvY973s.png
>>>
>>> /Mattis
>>>
>>
>>
> 
> 