OCR old software listing.

26 Dec 2018

...
  On December 26, 2018 at 4:29 PM Mattis Lind via cctech
<cctech at classiccmp.org> wrote:
 Finally I got hold of the sources for the PDP-11 SPACE WAR that was
 submitted to DECUS by Bill Seiler.
 The format is scans of the PAL-11S listing output. It is easy to crop the
 image to contain only the actual source and then run OCR on it. I tried a few
 online versions and tesseract.
 The problem is that the paper the listing is printed on has lines.
 Very black lines. They make the OCR go completely crazy. Source lines
 without black lines OCR OK. The others do not. The files need a massive
 amount of manual intervention.
 Does anyone have an idea how to process files like this?
 A good way to remove the black lines?
 There are only 19 source files with three or four pages each so I don't
 think it makes sense to try to train tesseract to do it (training tesseract
 seems to be a huge undertaking).
 https://i.imgur.com/dvY973s.png
 /Mattis

One thing you might try is to pull the scan images into MATLAB/GNU Octave
and do a 2D FFT, remove the frequency band of the lines, inverse FFT, and save.  I've
had good luck removing regular patterns of noise from images that way.
Will
"He may look dumb but that's just a disguise."  -- Charlie Daniels
"The names of global variables should start with    // "  -- https://isocpp.org
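
The FFT suggestion above could be sketched roughly as follows. This is not
code from the thread: it uses Python with NumPy rather than MATLAB/Octave, and
the function name remove_horizontal_lines and the parameters keep_low and
half_width are made up for illustration. It assumes the ruled lines are close
to horizontal and evenly spaced, so that their energy sits on the vertical
frequency axis of the 2D spectrum.

```python
import numpy as np


def remove_horizontal_lines(img, keep_low=2, half_width=1):
    """Suppress patterns that are constant along x and vary along y.

    img        -- 2D array of gray levels (one scanned page)
    keep_low   -- hypothetical knob: vertical frequencies up to this index
                  are kept so overall brightness and slow shading survive
    half_width -- hypothetical knob: how many columns either side of the
                  vertical frequency axis are zeroed (widen for skewed lines)
    """
    img = np.asarray(img, dtype=float)
    h, w = img.shape

    # Forward 2D FFT, shifted so the zero frequency is at (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(img))
    cy, cx = h // 2, w // 2

    # Build a notch mask: drop the narrow band along the vertical
    # frequency axis, which is where horizontal stripes end up.
    mask = np.ones((h, w), dtype=bool)
    cols = slice(cx - half_width, cx + half_width + 1)
    mask[:, cols] = False
    # Restore the low-frequency part around the centre.
    mask[cy - keep_low:cy + keep_low + 1, cols] = True

    # Inverse FFT. The mask is symmetric about the centre, so the
    # imaginary part is only rounding noise and can be discarded.
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return filtered.real
```

Loading and saving the scan (for example with an imaging library) is left
out. On a real page the lines are rarely perfectly straight or evenly spaced,
so half_width and keep_low would need tuning, and the result will probably
want re-thresholding before it goes to the OCR.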
