Subject: Re: Manuals being scanned - test-drb@ccmp.vtda.org

14 May 2007

...
   Google is particularly bad about fetching documents over and over
   again.
 mmm...  any evidence they are using OCR to index PDFs?
 of all the places I'd *like* them to OCR, it's bitsavers.
 in fact, mmmmm, I'd like to connect the two dots.  bitsavers
 + google (and, and/all mit, stanford, cmu, ... software archives)
 something to start mentioning at various fund raising
 cocktail parties :-) 
 To be clear - the problem is that Google consumes bandwidth by
 repeatedly downloading static documents, versus downloading dynamic
 content whose index status might be new or dirty? 
Guys, if you want bitsavers OCR'd, have you tried just *asking*
Google if they'd do it?  They are scanning and OCR'ing huge
amounts of paper all the time, working with various libraries.
As far as I know, they're doing this for free, because they want all
the world's data...
It wouldn't hurt to ask...  anyone have any contacts at Google?
Also, the Internet Archive will accept a DVD by mail of your archives
if spidering is too expensive.  I gave them a DVD of the Edinburgh
Computer History Project last year.  (Not sure if the contents are
visible online anywhere, but at least they're in the archives and
reasonably safe against me trashing my hard drive again...)
G