On Sun, 2004-06-27 at 19:40, Antonio Carlini wrote:
(Well, xpdf on OpenVMS VAX is slow, but then I
guess my expectations are at fault there :-))
It seems to be bloody horrible on modern versions of Linux, too :-(
(well, at least on Red Hat 9). Slow as hell, plus the rendering quality
is pretty awful.
True that most systems have PDF viewers, but they're more likely to have
an image displayer and a text editor ;-)
I believe (but have not tried) that you can go
from PDF to text in this case without any great
difficulty (I don't recall what happens to images).
What's the current licensing for PDF tools? I've pretty much avoided it
since the days when the reader was free, but anything (at least from
Adobe) which created or manipulated PDF files cost $$$.
I believe the data format itself was copyrighted - but presumably isn't
these days what with all the 3rd-party viewers out there?
Actually, I suppose separate images can help here too as people can
navigate straight away to what they want, plus they don't need to
download the whole of a huge PDF file before they can start reading.
I prefer to grab the whole thing anyway. Today I might
just want the frobozz pinout, but tomorrow I'm almost
certain to need the lead engineer's middle initial,
by which time I'll have forgotten where I found the
docs in the first place.
Oh sure, me too. I make use of wget an awful lot to create local copies
of useful bits of websites for instance, but if I'm looking for
something then and there then it's nice to be able to at least look at
the navigation up-front (particularly to see if the whole thing's
actually relevant anyway!) and quickly start reading the most-useful
bits whilst downloading the whole lot as a background job.
problem is that you need to be *really* sure that your OCR versions are
good before you can risk taking the raw scans offline, which means
having a lot of
Once I've generated a raw scan (or picked up someone else's)
I expect to keep it around essentially forever. OCR has improved
immensely in the last few years, but not to the point where
I can throw a scan of a poor quality photocopy at it and expect
something that looks like the original with zero errors.
(The Module/Options list that Eric Smith scanned would be
an excellent torture test for any candidate "perfect" OCR program).
Another point is that if you have high quality scans, why
keep them to yourself? By all means have low-res versions
available for those who just need a page or two or just
need to look something up quickly and don't care about
the artefacts, but make the "masters" available too. If you
don't have the space yourself, there are people on this list
who seem to have no problem with online disk space.
Fair point. I'd never completely delete high-quality scans - but as you
say, there are quite a few people around who seem to be set up for
hosting huge amounts of data!
Hmm, how editable are PDF files by the way? On the OCR front, I'd expect
anyone OCRing anything to proofread it afterwards and correct mistakes
(which is of course vital for technical data anyway - technical data
with mistakes in is useless!). So unless wordprocessor-like tools exist
to edit PDF files then I wouldn't think they're much good as an
intermediate format, because people need to be able to go in there and
easily correct mistakes made by the OCR software.
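
As a rough illustration of why plain text is handy as an intermediate
format: if the OCR output is just text, proofreading can be narrowed
down with ordinary diff tools. Below is a minimal Python sketch, assuming
you have two OCR passes of the same page (say, from two different
programs or scan settings) as plain strings. The function name
ocr_disagreements and the sample pinout lines are made up for the
example; only the standard difflib module is used. It only flags lines
where the two passes disagree, so errors that both passes share still
need a human eye.

```python
import difflib


def ocr_disagreements(pass_a: str, pass_b: str):
    """Return (line_number, lines_a, lines_b) for each region where two
    OCR passes of the same page differ. line_number is 1-based and
    refers to pass_a. The function name is hypothetical."""
    lines_a = pass_a.splitlines()
    lines_b = pass_b.splitlines()
    matcher = difflib.SequenceMatcher(a=lines_a, b=lines_b, autojunk=False)
    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # both passes agree here, nothing to proofread
        flagged.append((i1 + 1, lines_a[i1:i2], lines_b[j1:j2]))
    return flagged


if __name__ == "__main__":
    # Made-up sample data: the second pass misreads "5" as "S".
    first = "PIN 1  GND\nPIN 2  +5V\nPIN 3  CLK"
    second = "PIN 1  GND\nPIN 2  +SV\nPIN 3  CLK"
    for line_no, a, b in ocr_disagreements(first, second):
        print(f"line {line_no}: {a} vs {b}")
```

Anything flagged still has to be checked against the raw scan, which is
another reason to keep the scans around.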
cheers
Jules