On Wed, 2009-09-30 11:53:22 -0600, Richard <legalize at xmission.com> wrote:
> There is a blog post and a comment discussion about
> the demise of manx
> here: <http://hisdeedsaredust.com/2009/09/manx-is-dead/>

I've already seen that, though I'm quite sceptical about the Google
stuff. They surely could OCR scanned PDFs, but I'm not sure whether they
will. (That would bring more copyrighted material into the
database that previously wasn't available as /text/.) OTOH, indexing
the PDFs after OCRing doesn't make them any fancier, e.g. by adding a
real TOC to the original document, creating bibliographic metadata or
putting the OCRed text nicely into the PDFs.
I like Bitsavers a lot (as well as the other scattered sources of scanned
material), but I see those PDFs only as a nice container that gives us a
usable format until we can do /better/ with the
scanned pages. Point is: We now can! The software is all out there,
and all of it is open source and free as well.
Until now, scanning documents was mainly about preserving the paper
and being able to share it without snail-mailing the stuff around,
which always carries the danger of losing it. But we can now
really polish the stuff. That's /not/ a substitute for Bitsavers et
al.; we need them. The PDFs over there are a perfect format for
archiving the scanned pages. But we'd place generated PDFs next to
them, containing a real table of contents, bibliography entries and
possibly even indices, along with the OCRed text.
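
One possible way to attach a table of contents and bibliographic
metadata to a generated PDF is to feed Ghostscript a small pdfmark
file alongside the scan. The sketch below is only an illustration of
that idea, not the scripts mentioned in this mail: the function names,
the example title and author, and the flat (non-nested) outline are
all assumptions, and it handles plain ASCII text only.

```python
def ps_string(text):
    """Escape text for use as a PostScript string literal (ASCII only)."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def build_pdfmarks(info, toc):
    """Return pdfmark source for document info and a flat table of contents.

    info: dict such as {"Title": ..., "Author": ...} (hypothetical values)
    toc:  list of (heading, page_number) tuples, pages counted from 1
    """
    lines = []
    # One DOCINFO pdfmark carries the bibliographic fields.
    fields = " ".join("/%s %s" % (key, ps_string(value))
                      for key, value in info.items())
    lines.append("[ %s /DOCINFO pdfmark" % fields)
    # One OUT pdfmark per heading creates a top-level bookmark.
    for heading, page in toc:
        lines.append("[ /Title %s /Page %d /OUT pdfmark"
                     % (ps_string(heading), page))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example data is made up for illustration.
    marks = build_pdfmarks(
        {"Title": "Example Handbook", "Author": "Example Corp."},
        [("Chapter 1 (Introduction)", 5), ("Chapter 2", 17)],
    )
    print(marks)
```

Saved to a file (say marks.txt, a hypothetical name), the output could
then be combined with a scan using something like
"gs -o out.pdf -sDEVICE=pdfwrite scan.pdf marks.txt". Note that
pdfwrite rewrites the whole PDF, so the result belongs next to the
archived scan, not in place of it.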
I'll probably just hack up my scripts (they're a bit specific to
one particular type of scan I did) and run some tests as a start for a
Manx2 :)
Regards, JBG
--
Jan-Benedict Glaw jbglaw at lug-owl.de +49-172-7608481
Signature of: Wenn ich wach bin, träume ich.