CCCP - test-drb@ccmp.vtda.org - classiccmp.org

CCCP

nico＠farumdata.dk

11 Sep 2003

10:26 p.m.

I just found a paper, "The Soviet Bloc's Unified System of Computers", written by N. C. Davis and S. E. Goodman, published in "Computing Surveys", Vol. 10, No. 2, June 1978. Depending on how many people are interested in this piece of history, I can make up a Word document, including a scan of the tables and figures prevented. Nico

nico＠farumdata.dk

11 Sep

10:40 p.m.

Depending on how many people are interested in this piece of history, I can make up a Word document, including a scan of the tables and figures prevented. Nico

... figures presented :-(

julesrichardsonuk＠yahoo.co.uk

12 Sep

1:10 a.m.

New subject: OCR'ing old manuals

Hi,

Any recommendations for OCR software (preferably Windows 2000 - spit!) that'll allow the user to read from a image file rather than the software being incorporated with scanner software?

Pete kindly loaned me a whole pile of documentation, and I can get it back to him far quicker if I can just scan all the pages as TIFF images for now and worry about passing it through OCR software at a later date!

It would also be nice if said software was a) free :) and b) able to automatically recurse through a directory tree (or at least run from the command line without user input so that I can script that)

cheers

Jules

===== Backward conditioning: putting saliva in a dog's mouth in an attempt to make a bell ring.
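For the "recurse through a directory tree and script it" part, here is a minimal sketch of how such a batch run could be driven, whichever OCR package ends up doing the work. The command name `ocr-tool` and its "input file, output file" argument order are placeholders invented for illustration; no product mentioned in this thread is known to accept that exact command line.

```python
import os
from pathlib import Path

# Hypothetical command name: substitute the real command-line OCR tool.
OCR_COMMAND = "ocr-tool"


def find_scans(root, extensions=(".tif", ".tiff")):
    """Return every scan under root (recursively), sorted for stable page order."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extensions):
                found.append(Path(dirpath) / name)
    return sorted(found)


def build_jobs(root):
    """Build one command line per scan that has no .txt output next to it yet."""
    jobs = []
    for scan in find_scans(root):
        text_path = scan.with_suffix(".txt")
        if text_path.exists():
            # Already OCR'd on an earlier run, so the batch can be resumed.
            continue
        # Assumed argument order: input image, then output text file.
        jobs.append([OCR_COMMAND, str(scan), str(text_path)])
    return jobs
```

Each entry returned by `build_jobs("scans")` can be handed to `subprocess.run(job, check=True)` once the real tool and its arguments are known. Skipping pages that already have a `.txt` file means an unattended run can be restarted after a crash without redoing finished pages.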

jbmcb＠hotmail.com

4:11 a.m.

New subject: OCR'ing old manuals

Omnipage Pro works nicely, will read/scan just about anything ( I'm using it to slowly OCR stuff the PDFs from the SPIES archive ) Unfortunately it costs a bundle. You gets what you pay for.

paul＠frixxon.co.uk

8:05 a.m.

New subject: OCR'ing old manuals

Jason McBrien wrote:

Omnipage Pro works nicely, will read/scan just about anything

Seconded. It'll quite happily run from a watched directory and will work its way through almost any type of file, including PDFs, and dump the results to another directory in whatever form you want. I'm very pleased with the quality of the output text. It will cough and die if you accidentally pass it some large schematics and you may then have to go and manually delete its temporary files before it'll start up again, but other than that it'll quite happily run unattended on a spare PC.

( I'm using it to slowly OCR stuff the PDFs from the SPIES archive )

Snap! (Though only the DEC stuff for now.) If only Omnipage Pro was available for Linux, I'd be shot of Windows for good. - Paul

kth＠srv.net

7:34 a.m.

New subject: OCR'ing old manuals

Jules Richardson wrote:

Hi, Any recommendations for OCR software (preferably Windows 2000 - spit!) that'll allow the user to read from a image file rather than the software being incorporated with scanner software? Pete kindly loaned me a whole pile of documentation, and I can get it back to him far quicker if I can just scan all the pages as TIFF images for now and worry about passing it through OCR software at a later date! It would also be nice if said software was a) free :) and b) able to automatically recurse through a directory tree (or at least run from the command line without user input so that I can script that)

Distributed Proofers seems to prefer Abbey Finereader, but it also costs money. It apparently does very well with images that others choke on.

geneb＠deltasoft.com

8:05 a.m.

New subject: OCR'ing old manuals

Pete kindly loaned me a whole pile of documentation, and I can get it back to him far quicker if I can just scan all the pages as TIFF images for now and worry about passing it through OCR software at a later date! It would also be nice if said software was a) free :) and b) able to automatically recurse through a directory tree (or at least run from the command line without user input so that I can script that)

Distributed Proofers seems to prefer Abbey Finereader, but it also costs money. It apparently does very well with images that others choke on.

It's actually called Abbyy Finereader. It's amazingly good at what it does, but it does cost. g.

geoffr＠zipcon.net

3:41 p.m.

New subject: OCR'ing old manuals

Omnipage Pro

arcarlini＠iee.org

13 Sep

5:22 a.m.

New subject: OCR'ing old manuals

I used OmniPage Pro (11 I think, but there are newer versions now). I also tried FineReader V4 or V5 (whichever turned up on the front of PC Plus one month) and it was OK too (at that time V6 was current). Neither of them could cope with the Options/Module list that Eric Smith posted some time back so I've only made very limited use of them. As long as you scan the stuff now while you have it, you can OCR at your leisure when the technology improves (and requires far less proof-reading). Antonio -- --------------- Antonio Carlini arcarlini(a)iee.org

eric＠brouhaha.com

7:58 a.m.

New subject: OCR'ing old manuals

"Antonio Carlini" <arcarlini(a)iee.org> wrote:

As long as you scan the stuff now while you have it, you can OCR at your leisure when the technology improves (and requires far less proof-reading).

Note that you should NEVER save scans of text and line art in a lossy form such as JPEG. JPEG works for continuous-tone images such as photographs by deliberately throwing away high-frequency components. Text and line art contain sharp black-to-white transitions (and vice versa, of course) which get smeared by this compression, resulting in a blurry image.

For text and line art, use a lossless bilevel compression such as G3 or G4 fax format (used in some TIFF files), JBIG, JBIG2, or Flate (used in some PNG files). You can't assume that because you save in TIFF or PNG that you get a specific form of compression, since they are very broad standards that support multiple compression types.

Sometimes people tell me that JPEG is alright if you only compress slightly. The edges still get blurry, and the resulting file size is generally *MUCH* larger than if you use G4 or JBIG.

JBIG gets 10-20% better compression than G4, but it is patented, so I don't use it. Flate usually compresses somewhat better than G4, and is not patented, but I'm not sure how it compares to JBIG. I'm not using it because support is not widespread yet. G4 works well because it can be wrapped in PDF and used by any PDF viewer.

Bilevel compression doesn't work well on continuous-tone images, so JPEG should be used for those.

The main dilemma for scanning is pages that contain a mix of text/line art and continuous-tone images. My personal recommendation for these is to either scan the page twice, in B&W and color (or gray scale), or to scan it once to uncompressed color (or gray scale) then convert to both bilevel and JPEG in software.

Apparently some "best practice" policies for document archiving specifically state that a document should only be scanned once. I think they're just trying to minimize handling of fragile documents, so I don't think they really mean that taking two scans of a page (consecutively without manipulating the physical document) is bad.

These "best practice" policies also recommend a minimum of 600 DPI, which is reasonable for continuous-tone images but is normally overkill for text and line art. I typically use 300 or 400 DPI.

I've written a program to take B&W TIFF files and color or B&W JPEG files and produce a PDF file: http://tumble.brouhaha.com/

My future plans for tumble include compositing text and line art with continuous-tone images on a single page. I've got a script for GIMP to take an uncompressed color or gray scale scan (in PGM or PPM format), allow manual selection of the continuous-tone images, then save two separate files. I've been thinking about trying to automate this by having the filter use histograms and FFT to locate the images. I'm not sure when I'll have time to work on this further, though.

Eric
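To make the "lossless bilevel" point concrete, here is a small sketch using only Python's standard `zlib` module, which implements the Flate (Deflate) algorithm. The page is synthetic (regular black bars standing in for lines of text), so the ratio it reports says nothing about real scans, which are noisier and compress less well; it also feeds the packed bits straight to zlib rather than writing a real TIFF or PNG file. What it does show is that the round trip returns every pixel unchanged.

```python
import zlib

WIDTH, HEIGHT = 640, 480  # arbitrary size chosen for the illustration


def make_page(width, height):
    """Synthetic bilevel page: 0 = white paper, 1 = black ink, in regular bars."""
    rows = []
    for y in range(height):
        in_line = (y % 20) < 8  # an 8-pixel-tall "text line" every 20 rows
        row = [
            1 if in_line and (x // 6) % 2 == 0 and 40 <= x < width - 40 else 0
            for x in range(width)
        ]
        rows.append(row)
    return rows


def pack_1bpp(rows):
    """Pack pixels 8 per byte, most significant bit first, padding each row."""
    out = bytearray()
    for row in rows:
        for i in range(0, len(row), 8):
            chunk = row[i:i + 8]
            byte = 0
            for bit in chunk:
                byte = (byte << 1) | bit
            byte <<= 8 - len(chunk)  # pad a short final chunk with zero bits
            out.append(byte)
    return bytes(out)


def unpack_1bpp(data, width, height):
    """Inverse of pack_1bpp."""
    stride = (width + 7) // 8
    rows = []
    for y in range(height):
        line = data[y * stride:(y + 1) * stride]
        rows.append([(line[x // 8] >> (7 - x % 8)) & 1 for x in range(width)])
    return rows


page = make_page(WIDTH, HEIGHT)
raw = pack_1bpp(page)
packed = zlib.compress(raw, 9)
restored = unpack_1bpp(zlib.decompress(packed), WIDTH, HEIGHT)

print(len(raw), "bytes at 1 bit per pixel")
print(len(packed), "bytes after Flate")
print("pixels identical after round trip:", restored == page)
```

G4 and JBIG are different algorithms designed specifically for bilevel images; they are not shown here because the standard library has no encoder for them.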

als＠thangorodrim.de

9:08 a.m.

New subject: OCR'ing old manuals

On Sat, Sep 13, 2003 at 09:50:22AM -0700, Eric Smith wrote:

"Antonio Carlini" <arcarlini(a)iee.org> wrote:

As long as you scan the stuff now while you have it, you can OCR at your leisure when the technology improves (and requires far less proof-reading).

Note that you should NEVER save scans of text and line art in a lossy form such as JPEG. JPEG works for continuous-tone images such as photographs by deliberately throwing away high-frequency components. Text and line art contain sharp black-to-white transitions (and vice versa, of course) which get smeared by this compression, resulting in a blurry image.

I can only second that. I've cursed time and again at some fools who decided to scan some paper documents (fine so far) and use JPEG (lossy compression intended for continuous-tone stuff like photo images) on black and white scans. The results are ugly, sometimes hard to read and a bitch to print properly. Oh, and this just made the work of OCRing this a _lot_ harder.

For text and line art, use a lossless bilevel compression such as G3 or G4 fax format (used in some TIFF files), JBIG, JBIG2, or Flate (used in some PNG files). You can't assume that because you save in TIFF or PNG that you get a specific form of compression, since they are very broad standards that support multiple compression types. Sometimes people tell me that JPEG is alright if you only compress slightly. The edges still get blurry, and the resulting file size is generally *MUCH* larger than if you use G4 or JBIG.

Of course the files are bigger. The lossy algorithm for JPEG was designed to work on continuous-tone images (where it works fine) and just runs into the wall with black and white stuff. Where the algorithm expects to find lots of low/middle frequency and some high frequency data, it suddenly is faced with high frequency data alone. No smooth color value curves that can be nicely compressed. Using JPEG for compressing black and white is like using a Ferrari for pulling a trailer full of grain - it gets the stuff moving, but you really, really should use a proper truck for this job.
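The frequency argument can be checked on a toy case. The sketch below applies a one-dimensional DCT-II (the transform family JPEG uses on 8x8 blocks) to two eight-sample rows: a smooth ramp, standing in for a photograph, and a hard black-to-white edge, standing in for text. It is a simplified illustration and not a JPEG encoder: there is no quantisation, no 2-D transform and no entropy coding, and the cutoff between "low" and "high" frequency is an arbitrary choice.

```python
import math


def dct2(samples):
    """Unnormalised 1-D DCT-II of a list of samples."""
    n = len(samples)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(samples))
        for k in range(n)
    ]


def high_frequency_share(samples, cutoff=4):
    """Fraction of non-DC energy in coefficients with index >= cutoff."""
    energy = [c * c for c in dct2(samples)[1:]]  # drop the DC term (index 0)
    total = sum(energy)
    # energy[j] belongs to coefficient j + 1, hence the cutoff - 1 slice start.
    return sum(energy[cutoff - 1:]) / total if total else 0.0


ramp = [i * 255 / 7 for i in range(8)]    # smooth gradient
edge = [0, 0, 0, 0, 255, 255, 255, 255]   # sharp black-to-white transition

print(f"ramp: {high_frequency_share(ramp):.4f} of energy above the cutoff")
print(f"edge: {high_frequency_share(edge):.4f} of energy above the cutoff")
```

The edge puts a far larger share of its energy into the upper coefficients than the ramp does. Those are the coefficients a JPEG encoder quantises most coarsely, which is where the smeared edges come from.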

I've written a program to take B&W TIFF files and color or B&W JPEG files and produce a PDF file: http://tumble.brouhaha.com/

Thanks for writing this program. I'm in the process of archiving the interesting articles from a stack of computer magazines and am currently experimenting with the best way to convert dead trees to PDF files. So far, scanning the paper as lineart at 600 dpi, saving as fax G4 compressed tiff and using tumble to combine those into PDF files yields the best (best quality, smallest files) results. Regards, Alex. -- "Opportunity is missed by most people because it is dressed in overalls and looks like work." -- Thomas A. Edison

julesrichardsonuk＠yahoo.co.uk

14 Sep

2:50 a.m.

New subject: OCR'ing old manuals

--- Eric Smith <eric(a)brouhaha.com> wrote:

Note that you should NEVER save scans of text and line art in a lossy form such as JPEG.

Absolutely - I tend to save everything as TIFF format and in as high a resolution as seems practical (occasionally I'll use Paint Shop Pro's format if I need something with layer support). I'm not a big fan of JPEG images... I'm hoping to get away with scanning things at 300dpi (in this case it's all printed documentation with a few diagrams, rather than colour images). Not only will that save space but I can also use my older scanner (which won't do more than 300dpi I believe, but is at least SCSI and so should transfer data to the host a little quicker). I'm a little wary of saving things in mono as somebody else mentioned - I'm sure that could have a negative effect on the OCR process at a later date. Greyscale (8 bit) I expect is fine though.
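For a sense of the storage involved in the resolution and bit-depth choices discussed here, the following sketch works out uncompressed sizes for a single page. The US Letter page size (8.5 x 11 inches) is an assumption made only to get round numbers; the formula is simply rows times bytes per row, with each row padded to a whole byte.

```python
import math


def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed image size in bytes, each row padded to a whole byte."""
    cols = round(width_in * dpi)
    rows = round(height_in * dpi)
    return rows * math.ceil(cols * bits_per_pixel / 8)


# Assumed page size: US Letter, 8.5 x 11 inches.
for label, dpi, bits in [
    ("300 dpi, 8-bit grey", 300, 8),
    ("300 dpi, 1-bit", 300, 1),
    ("600 dpi, 1-bit", 600, 1),
]:
    size = raw_scan_bytes(8.5, 11, dpi, bits)
    print(f"{label:20s} {size:>10,d} bytes")
```

Before any compression, a 600 dpi bilevel page is about half the size of a 300 dpi 8-bit greyscale page, so the trade-off between resolution and bit depth is less one-sided than it might first appear.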

I've written a program to take B&W TIFF files and color or B&W JPEG files and produce a PDF file: http://tumble.brouhaha.com/

I'll have a look at that - might come in handy. :-) Image processing/manipulation I do find pretty interesting... cheers Jules

arcarlini＠iee.org

5:25 a.m.

New subject: OCR'ing old manuals

I'm a little wary of saving things in mono as somebody else mentioned - I'm sure that could have a negative effect on the OCR process at a later date. Greyscale (8 bit) I expect is fine though.

I've not found OCR to perform any better on greyscale than 1-bit per pixel. IIRC you also cannot use G4 with greyscale, although my memory may be wrong on this. Antonio -- --------------- Antonio Carlini arcarlini(a)iee.org

paul＠frixxon.co.uk

5:30 a.m.

New subject: OCR'ing old manuals

Antonio Carlini wrote:

IIRC you also cannot use G4 with greyscale although my memory may be wrong on this.

Correct. - Paul

teoz＠neo.rr.com

6:24 a.m.

New subject: OCR'ing old manuals

----- Original Message ----- From: "Paul Williams" <paul(a)frixxon.co.uk> To: <cctalk(a)classiccmp.org> Sent: Sunday, September 14, 2003 10:21 AM Subject: Re: OCR'ing old manuals

Antonio Carlini wrote:

IIRC you also cannot use G4 with greyscale although my memory may be wrong on this.

Correct. - Paul

Does somebody have an ftp or website with manuals scanned to pdfs? Anybody ever see the extended manuals for Apple A/UX 2 or 3 in pdf format?

allain＠panix.com

6:34 a.m.

New subject: OCR'ing old manuals

I'm a little wary of saving things in mono as somebody else mentioned - I'm sure that could have a negative effect on the OCR process at a later date. Greyscale (8 bit) I expect is fine though.

I usually use greyscale with a good interpolator to increase my dpi 2 to 3x linearly (increasing the dot count 4 to 9x overall) and then cast it to 1 bit. Works great. Just going directly from 8 bit to 1 bit at the same dpi is a waste of good information. John A.
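A rough sketch of that interpolate-then-threshold idea, in plain Python on a tiny hand-made patch. Bilinear interpolation is used here only because it is short to write; it is a stand-in, not necessarily the "good interpolator" meant above, and the fixed threshold of 128 is likewise an assumption.

```python
def upscale_bilinear(grey, factor):
    """Enlarge a greyscale image (rows of 0-255 values) by an integer factor."""
    src_h, src_w = len(grey), len(grey[0])
    out = []
    for y in range(src_h * factor):
        # Map the centre of each output pixel back into source coordinates.
        sy = min(max((y + 0.5) / factor - 0.5, 0.0), src_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, src_h - 1)
        fy = sy - y0
        row = []
        for x in range(src_w * factor):
            sx = min(max((x + 0.5) / factor - 0.5, 0.0), src_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, src_w - 1)
            fx = sx - x0
            top = grey[y0][x0] * (1 - fx) + grey[y0][x1] * fx
            bottom = grey[y1][x0] * (1 - fx) + grey[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def to_bilevel(grey, threshold=128):
    """1 = black ink, 0 = white paper."""
    return [[1 if value < threshold else 0 for value in row] for row in grey]


# A 4x4 patch with a diagonal edge whose boundary pixels are dark grey.
patch = [
    [0, 0, 64, 255],
    [0, 64, 255, 255],
    [64, 255, 255, 255],
    [255, 255, 255, 255],
]

direct = to_bilevel(patch)                        # 8 bit to 1 bit at the same dpi
refined = to_bilevel(upscale_bilinear(patch, 3))  # interpolate 3x, then threshold

for row in refined:
    print("".join("#" if bit else "." for bit in row))
```

Thresholding at the original resolution forces each grey boundary pixel to be entirely black or entirely white. After interpolation the same grey values decide where, within a 3x3 group of output dots, the edge falls, so some of the information in the 8-bit scan survives the conversion to 1 bit.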

allain＠panix.com

13 Sep

9:52 a.m.

New subject: OCR'ing old manuals

Neither of them could cope with the Options/Module list that Eric Smith posted some time back...

Somebody re-post this URL and I'll give it a try with what I've got here, a conjoined image editor/OCR suite. John A.

eric＠brouhaha.com

3:14 p.m.

New subject: OCR'ing old manuals

Neither of them could cope with the Options/Module list that Eric Smith posted some time back...

Somebody re-post this URL and I give a try at it with what I've got here, a conjoined image editor/OCR suite.

http://www.brouhaha.com/~eric/retrocomputing/dec/doc/oml/ Personally, I don't think OCR for this will be of even the slightest use, but you are of course welcome to try.

vance＠neurotica.com

12 Sep

noon

I'm interested, but Word won't be readable here. Peace... Sridhar On Fri, 12 Sep 2003, Nico de Jong wrote:

I just found a paper, "The Soviet Bloc's Unified System of Computers", written by N. C. Davis and S. E. Goodman, published in "Computing Surveys", Vol. 10, No. 2, June 1978. Depending on how many people are interested in this piece of history, I can make up a Word document, including a scan of the tables and figures prevented. Nico

hansp＠citem.org

4:39 p.m.

vance(a)neurotica.com wrote:

I'm interested, but Word won't be readable here.

I'm interested also. I'd be happy to post a PDF version of the word doc. -- hbp

esharpe＠uswest.net

15 Sep

7:41 a.m.

good project go for it! ----- Original Message ----- From: "Hans B Pufal" <hansp(a)citem.org> To: <cctalk(a)classiccmp.org> Sent: Friday, September 12, 2003 6:32 PM Subject: Re: CCCP

vance(a)neurotica.com wrote:

I'm interested, but Word won't be readable here.

I'm interested also. I'd be happy to post a PDF version of the word doc. -- hbp

participants (15)

allain＠panix.com
als＠thangorodrim.de
arcarlini＠iee.org
eric＠brouhaha.com
esharpe＠uswest.net
geneb＠deltasoft.com
geoffr＠zipcon.net
hansp＠citem.org
jbmcb＠hotmail.com
julesrichardsonuk＠yahoo.co.uk
kth＠srv.net
nico＠farumdata.dk
paul＠frixxon.co.uk
teoz＠neo.rr.com
vance＠neurotica.com