If you OCR, always archive the bitmaps too - Re: Regarding Manuals

Toby Thain toby at telegraphics.com.au
Sat Sep 26 16:42:26 CDT 2015


On 2015-09-26 4:28 PM, Johnny Billquist wrote:
> On 2015-09-26 12:16, Johnny Billquist wrote:
>> On 2015-09-25 22:35, Al Kossow wrote:
>>> I have been going back and applying OCR to the ones on bitsavers.
>>> Are there some in particular that you have a problem with?
>>
>> Aha. I wasn't aware of that. I've downloaded copies many years ago that
>> I've been keeping locally. I'll check out the current versions on
>> bitsavers then.
>
> Al, exactly how have they been OCRed? Looking at them, it would appear
> that what you see is still the bitmaps of all the pages, but then you
> have the basic text also available for selection/searching.
>
> My issue with that is that the documents are huge, and the experience
> just scrolling through them is pretty bad.

IMHO, though I am sure I am not alone:

Software which "recreates" the typography of a document from OCR does
not produce an acceptable substitute. I've yet to see a book that wasn't
ruined by it.

Just worth mentioning for anyone who might be tempted: for this reason
and others, the bitmaps must NEVER be discarded. (Of course, the
bitmaps can be archived in a separate file if people want to supply OCR
as well.)

--Toby

>
> Sadly I don't even remember what software I used for OCR about 10 years
> ago, but I had something for Windows back then, which actually figured
> out fonts and all, and created a plain Word document from the OCR
> process. That was a really nice piece of software, which preserved
> formatting, fonts and all. I have a short example of the results at
> http://www.update.uu.se/~bqt/Clarkson.pdf, which was just a scan of two
> pages from a book. I created the pdf from Word.
> A process like that is what I'd like, except for figures, which need to
> be kept as bitmaps, I suspect.
>
>      Johnny
>
