Cameron Kaiser wrote:
Doesn't mean they're OCRed. Run a 'strings' on the pdf -- the text is
there.
I suspect that the <stream> encoding in the PDFs needs to be decoded
before the text becomes visible. I also suspect there may be a variant
that stores plain text. Note that both TIFF and PostScript can be
encapsulated in PDFs with very little modification. In some PostScript
files the plain text will be visible and ASCII-readable, and in others
(such as EPS) it will not.
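As a rough illustration of what decoding the streams might look like,
here is a minimal Python sketch. It assumes the streams use the common
FlateDecode (zlib) filter, which I have not verified for the files in
this thread. The function names are made up for this example, and the
regular expression is a crude scan, not a real PDF parser.

```python
import re
import zlib

# Crude scan for "stream ... endstream" bodies; a real parser would use
# the /Length entry of each stream dictionary instead.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Yield the inflated bytes of every stream that zlib can decode.

    Streams using other filters (or no filter) are silently skipped.
    """
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates a clipped trailing checksum byte.
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            continue


def printable_runs(data, min_len=4):
    """Return runs of printable ASCII, roughly what 'strings' reports."""
    pattern = re.compile(rb"[\x20-\x7e]{" + str(min_len).encode() + rb",}")
    return pattern.findall(data)


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        contents = handle.read()
    for body in inflate_streams(contents):
        for run in printable_runs(body):
            print(run.decode("ascii"))
```

Run it with the PDF path as the only argument. If nothing readable
comes out, the text is either behind a different filter or simply not
in the file.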
The original documents in the thread seem to have been touched, and
perhaps OCR'ed, by Acrobat 6. I used Acrobat 6 to capture documents and
ran into a similar problem. I had to turn off the "document" capture
option and switch to the picture capture option. When I did that, the
file size increased dramatically, so I ended up abandoning that format.
After a lot of experimentation, I frankly have not found much to change
about how Bitsavers is encoded and stored. At some future time someone
may OCR the archive, but right now I don't think the resources exist to
scan and store the documents in a form that current OCR software can
work with. The black and white images do work for humans, though, so
sometime someone may get an OCR package, modify it, and make a run at
the database.
Jim