On Sun, 2009-09-27 22:28:03 +0200, Pontus <pontus at update.uu.se> wrote:
Al Kossow wrote:
Man... and I was thinking just the other week that I should do some
mirroring of it :/
I've long been thinking about scanned documents, and the manx database
specifically.
Some time ago, I got my hands on some scanned pages (multi-page TIFFs,
each page containing two book pages) and spent some time working on
them. The result is a script that cuts and straightens the pages,
reassembles them into a PDF and OCRs it, placing the text invisibly
over the images (so you can cut'n'paste it, depending on the OCR
quality of course :) )
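
To illustrate the cutting step only: a minimal sketch, not the actual
script. It assumes a scan is already available as rows of 0 (white) and
1 (ink) pixels, and that the gutter between the two book pages is the
least-inked column near the centre. The names find_gutter and
split_pages are made up for this example.

```python
# Illustrative sketch only; not the script described above.
# Assumption: a scan is a list of rows, each row a list of 0 (white) / 1 (ink).

def find_gutter(scan, search_fraction=0.2):
    """Return the column near the centre that carries the least ink."""
    width = len(scan[0])
    centre = width // 2
    # Only look in a band around the centre (search_fraction of the width).
    margin = max(1, int(width * search_fraction / 2))
    first = max(0, centre - margin)
    last = min(width - 1, centre + margin)

    def ink(col):
        return sum(row[col] for row in scan)

    # Least ink wins; ties go to the column closest to the centre.
    return min(range(first, last + 1),
               key=lambda col: (ink(col), abs(col - centre)))


def split_pages(scan):
    """Cut a double-page scan into (left, right), dropping the gutter column."""
    gutter = find_gutter(scan)
    left = [row[:gutter] for row in scan]
    right = [row[gutter + 1:] for row in scan]
    return left, right
```

Straightening and OCR are left out here; a real scan would also need
thresholding before a column count like this means anything.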
I've started working further on it, e.g. building a web frontend to
add tags (e.g. isbn=1234567, author=foo, ...). That's not really cool
yet (it's experimentation after all), but I think it could easily be
extended to:
* Do full-text search on the PDFs.
* Help build nice PDFs from scanned images.
* Maintain a usable metadata index (though I'm not too keen on the
tag concept for that).
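
As a rough idea of how full-text search and tags could fit together,
here is a minimal sketch, assuming the OCRed text of each PDF is
available as a plain string. The class DocumentIndex and its methods
are hypothetical and not part of the frontend mentioned above.

```python
# Illustrative sketch only; all names are hypothetical.
import re
from collections import defaultdict


class DocumentIndex:
    """Tiny in-memory index: OCR words plus key=value tags per document."""

    def __init__(self):
        self.tags = {}                 # doc_id -> {tag: value}
        self.words = defaultdict(set)  # word -> set of doc_ids

    def add(self, doc_id, ocr_text, **tags):
        """Register a document with its OCRed text and its tags."""
        self.tags[doc_id] = dict(tags)
        for word in re.findall(r"[a-z0-9]+", ocr_text.lower()):
            self.words[word].add(doc_id)

    def search(self, query, **tags):
        """Return sorted ids of documents containing every query word
        and matching every given tag exactly."""
        hits = set(self.tags)
        for term in re.findall(r"[a-z0-9]+", query.lower()):
            hits &= self.words.get(term, set())
        return sorted(
            doc for doc in hits
            if all(self.tags[doc].get(k) == v for k, v in tags.items())
        )
```

Usage would look like index.add("a.pdf", text, author="foo") followed
by index.search("processor", author="foo"). A real index would need
persistence and some tolerance for OCR errors.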
I'll /not/ publish it yet, nor give public access to my scanned book
mentioned above, but I think I'll polish it a bit and offer it.
Indexing the bitsavers.org material would be a nice first step. Do you
think that would/could be useful for searching all the stuff? (And
maybe for re-creating the PDFs with added OCRed text. At some point,
this might be particularly useful if scanned microfiche pages ever
show up.)
MfG, JBG
--
Jan-Benedict Glaw jbglaw at lug-owl.de +49-172-7608481
Signature of: They that give up essential liberty to obtain temporary safety,
the second : deserve neither liberty nor safety. (Ben Franklin)