On Sun, 2004-06-27 at 17:59, Tom Owad wrote:
> The archival images will be 600dpi greyscale TIFFs. They will not be
> converted to pdf, but just stored as TIFFs.
I tend to do that too. I'd much rather use my favourite image viewer to
flick through images than deal with a pdf file, plus I can just put
everything in a tar or zip archive if I do need to distribute as a
single file.
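
For what it's worth, bundling a directory of scans into a single file needs very little code. The following is only a minimal sketch: the function name `bundle_scans`, the `.tif` extension and the flat directory layout are my assumptions, not anything specified above.

```python
import tarfile
from pathlib import Path


def bundle_scans(scan_dir, archive_path):
    """Pack every .tif file found directly in scan_dir into one tar archive.

    Returns the list of file names that were added, in sorted order.
    """
    pages = sorted(Path(scan_dir).glob("*.tif"))
    # Plain "w" writes an uncompressed tar; "w:gz" may be worth it
    # if the TIFFs themselves are stored uncompressed.
    with tarfile.open(archive_path, "w") as tar:
        for page in pages:
            # arcname keeps only the file name, so no directory
            # prefix ends up inside the archive.
            tar.add(page, arcname=page.name)
    return [page.name for page in pages]
```

Something like `bundle_scans("scans", "manual.tar")` would then give one file to hand out, while the individual images stay browsable on disk.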
I've not reached a conclusion for what the best format is for OCRed data
which may contain text and images. RTF is maybe the most portable, but
no doubt it doesn't handle embedded imagery. MS Word is just a nasty
format and none too portable either.
I never thought I'd say it, but maybe wrapping the data up in *simple*
HTML markup is the best way - at least then it is readable in a
plain-text editor, and finding a machine with a web browser is probably
easier than finding a machine with Word installed.
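
To show what I mean by *simple* markup, here is a rough sketch that wraps OCRed paragraphs and image references in bare-bones HTML. The function name `page_to_html` and the idea of listing image files after the text are purely illustrative assumptions; a real layout would need more thought.

```python
from html import escape


def page_to_html(title, paragraphs, image_files=()):
    """Wrap OCRed text and image references in minimal HTML.

    The output stays readable in a plain-text editor: one element per line,
    no styling and no scripts.
    """
    parts = [
        "<html>",
        "<head><title>{}</title></head>".format(escape(title)),
        "<body>",
        "<h1>{}</h1>".format(escape(title)),
    ]
    for text in paragraphs:
        # escape() turns &, < and > (and quotes) into entities, so stray
        # OCR characters cannot break the markup.
        parts.append("<p>{}</p>".format(escape(text)))
    for name in image_files:
        # Images stay as separate files and are only referenced here.
        safe = escape(name)
        parts.append('<img src="{0}" alt="{0}">'.format(safe))
    parts.append("</body>")
    parts.append("</html>")
    return "\n".join(parts)
```

Because the images are referenced rather than embedded, the same JPEG or TIFF files can still be browsed directly in an image viewer.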
> The images intended for download and distribution will be 200dpi
> greyscale JPEGs. Using these, I expect a 128-page download to be about
> 20 MB.
Actually, I suppose separate images can help here too as people can
navigate straight away to what they want, plus they don't need to
download the whole of a huge pdf file before they can start reading.
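
As a rough sanity check on the quoted figures (128 pages, about 20 MB), here is a back-of-envelope sketch. The page size (8.5 x 11 inches) and 8 bits per pixel are my assumptions, since neither is stated, and the helper names are made up for illustration.

```python
def per_page_kb(total_mb, pages):
    """Average size of one page image in KB (taking 1 MB = 1024 KB)."""
    return total_mb * 1024 / pages


def raw_greyscale_bytes(width_in, height_in, dpi):
    """Uncompressed size of an 8-bit greyscale scan: one byte per pixel."""
    return round(width_in * dpi) * round(height_in * dpi)


# Figures quoted above: 128 pages in about 20 MB.
avg_kb = per_page_kb(20, 128)

# ASSUMPTION: 8.5 x 11 inch pages scanned at 200 dpi, 8 bits per pixel.
raw = raw_greyscale_bytes(8.5, 11, 200)

# Implied JPEG compression ratio under those assumptions.
ratio = raw / (avg_kb * 1024)

print("average page: {:.0f} KB".format(avg_kb))
print("raw page: {} bytes".format(raw))
print("implied ratio: about {:.0f}:1".format(ratio))
```

Under those assumptions each page averages 160 KB against roughly 3.7 million bytes uncompressed, so with separate files a reader fetches only the pages they actually want.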
> I know a lot of you expressed concerns about JPEGs, but I haven't been
> able to get anywhere near the compression using other methods, for
> greyscale images. Am I overlooking any options?
Probably not. JPEG is lossy after all, so I expect it'll always do
better than a non-lossy format. It's a tradeoff between size and quality
- personally I'd always go for the quality for historical data.
Actually, the BBC Documentation Project for the old Acorn 8-bit machines
at http://www.bbcdocs.com is worth a look. They seem to do quite well in
terms of getting both scanning and OCR volunteers. Only problem is that
you need to be *really* sure that your OCR versions are good before you
can risk taking the raw scans offline, which means having a lot of
people doing a lot of proofreading. But OCR data will typically be a
fraction of the size of raw image scans, regardless of what resolution /
format you hold images in. So keep the raw image scans in an offline
archive, just in case.
cheers,
Jules