On Tue, 9 Mar 1999, Pete Turnbull wrote:
That's not what I'd call "high".
That means that on average, you have to
correct or interpret every tenth character. I'd call less than 99% "low",
not high. Our Department looked at this a few years ago, and rejected
anything less than 95%, I think. Even that means correcting (or as one
person put it, "clicking on") one character in every twenty.
That's not what I meant. I did not study the results closely and so I
wrote "high 90%" as a disclaimer to mean something like 98, 98.5, 99,
99.5, or 99.9. Perhaps I should have used the word "range". It seemed to
me that I was getting somewhere between fewer than one and two words per
hundred that needed correcting, and I don't remember any punctuation or
numerical errors.
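To make the arithmetic behind these percentages concrete, here is a rough
sketch in Python; the 2,500 characters-per-page figure is only an assumed
typical value, not something from this thread:

    # Rough sketch: convert OCR character accuracy into expected corrections.
    # The 2,500 characters-per-page figure is an assumed typical value.

    def corrections_per_page(accuracy_pct, chars_per_page=2500):
        """Expected number of characters needing correction on one page."""
        error_rate = (100.0 - accuracy_pct) / 100.0
        return error_rate * chars_per_page

    for acc in (90.0, 95.0, 99.0, 99.5, 99.9):
        print(f"{acc:5.1f}% accuracy -> about "
              f"{corrections_per_page(acc):.0f} corrections per page")

At 90% that is roughly 250 corrections per page, while 99.9% drops it to a
handful, which is why the difference between "90%" and "high 90s" matters.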
William Donzelli wrote:
The best solution for this is to keep the scans AND
the OCR'd text. That
way, with a simple database, one could do searches on the text, and get
most of the hits, yet actually read the images.
A good observation, which brings up the question of whether anyone has
database templates and, if so, what database they are using. How does one
deal with separate text like sidebars and captions? Should you save an image
of the page and individual images in the database along with the text? That
rules out much legacy db software. Perhaps keep individual files and use a
database to index the directory?
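As a minimal sketch of that "index the directory" idea, here is what it might
look like with Python's built-in sqlite3 module; the table and column names,
file paths, and sample search term are all made up for illustration:

    # Sketch: index per-page scan files plus their OCR text, so the text is
    # searchable while the original image stays the document of record.
    # Table/column names and paths are illustrative only.
    import sqlite3

    con = sqlite3.connect("manuals.db")
    con.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            doc_title   TEXT,
            page_number INTEGER,
            image_path  TEXT,   -- path to the scanned page image on disk
            ocr_text    TEXT    -- uncorrected OCR output, used for searching
        )
    """)
    con.execute(
        "INSERT INTO pages VALUES (?, ?, ?, ?)",
        ("PDP-11 Handbook", 1, "scans/pdp11/page001.tif", "raw OCR text here"),
    )
    con.commit()

    # Search the OCR text, then open the matching image to actually read it.
    for title, page, path in con.execute(
        "SELECT doc_title, page_number, image_path FROM pages "
        "WHERE ocr_text LIKE ?", ("%UNIBUS%",)
    ):
        print(f"{title} p.{page}: see {path}")

Sidebars and captions could simply be extra rows (or an extra text column)
pointing at the same image file, so imperfect OCR only hurts searching, not
reading.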
Anyone using document management software? There seem to be three or four
low-priced ones for Windows, a couple for the Mac, maybe something for
another platform (anything for Linux?), and everything else is
stratospherically priced.
Chuck McManis wrote:
300 DPI B&W is good for most printed manuals
_without_ graphics because it
is a 1:1 ratio with what most printers can print. 200 DPI gives you a 2:3
ratio of real pixels to printer pixels and I've seen that introduce banding
on the printed output.
300 dpi is ideal for print. Is that the best ultimate goal - scan at 300 or
at some whole-number multiple or fraction of it, in order to optimize for
eventual printout?
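A quick sketch of the pixel-ratio arithmetic behind the banding point,
assuming a 300 DPI printer as the target (the list of scan resolutions is
just for illustration):

    # Sketch: ratio of scan pixels to printer pixels for an assumed 300 DPI
    # printer. A whole-number ratio scales cleanly; a fractional one forces
    # resampling, which is what can introduce banding.
    from fractions import Fraction

    PRINTER_DPI = 300  # assumed target printer resolution

    for scan_dpi in (150, 200, 300, 400, 600):
        ratio = Fraction(scan_dpi, PRINTER_DPI)
        clean = (PRINTER_DPI % scan_dpi == 0) or (scan_dpi % PRINTER_DPI == 0)
        note = "integer scaling" if clean else "fractional scaling (may band)"
        print(f"{scan_dpi} DPI scan -> {ratio.numerator}:{ratio.denominator} "
              f"scan:print pixels ({note})")

By that reasoning 200 DPI gives the 2:3 ratio Chuck mentions, while 150, 300,
or 600 DPI all map onto a 300 DPI printer without fractional resampling.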
-- Stephen Dauphin