-----Original Message-----
From: cctalk [mailto:cctalk-bounces at classiccmp.org] On Behalf Of Johnny Billquist
Sent: 27 September 2015 13:18
To: cctalk at classiccmp.org
Subject: Re: If you OCR, always archive the bitmaps too - Re: Regarding Manuals

On 2015-09-27 03:41, Toby Thain wrote:
On 2015-09-26 5:51 PM, Johnny Billquist wrote:
On 2015-09-26 23:42, Toby Thain wrote:
> On 2015-09-26 4:28 PM, Johnny Billquist wrote:
>> On 2015-09-26 12:16, Johnny Billquist wrote:
>>> On 2015-09-25 22:35, Al Kossow wrote:
>>>> I have been going back and applying OCR to the ones on bitsavers.
>>>> Are there some in particular that you have a problem with?
>>>
>>> Aha. I wasn't aware of that. I downloaded copies many years ago
>>> that I've been keeping locally. I'll check out the current
>>> versions on bitsavers then.
>>
>> Al, exactly how have they been OCRed? Looking at them, it would
>> appear that what you see is still the bitmaps of all the pages, but
>> then you have the basic text also available for selection/searching.
>>
>> My issue with that is that the documents are huge, and the
>> experience just scrolling through them is pretty bad.
>
> Imho, though I am sure I am not alone:
>
> Software which "recreates" the typography of a document from OCR
> does not produce an acceptable substitute. I've yet to see a book
> that wasn't ruined by it.
>
> Just worth mentioning for anyone who might be tempted: for this
> reason and others, the bitmaps must NEVER be discarded (although of
> course the bitmaps can be archived in a separate file if people want
> to supply OCR as well).

Look at the results in the link I posted. I was more than happy with
that result.

I've seen plenty of technical books ruined by this technique, which is
why I beg anyone doing this to not divorce the bitmaps from the OCR'd
result.
I suppose some books might be relatively immune, but technical texts
seem to be quite sensitive to poor interpretation by OCR, logically enough.

I suppose it is the eternal argument between preservation and use. I use
these documents every day. I don't care about the pixels, but the content.
Museums and the like are obviously more interested in the preservation.

I get the feeling you didn't actually check the text that I OCRed from a book.
That text is an example of what I'm looking for.

I did. I found it hard to read, as it was OCR'd with mixed typefaces.