Second try
------- Weitergeleitete Nachricht / Forwarded message -------
From: Hans Franke <Hans.Franke(a)mch20.sbs.de>
To: "General Discussion: On-Topic and Off-Topic Posts"
<cctalk(a)classiccmp.org>
Subject: Re: Let's develop an open-source media archive standard
Date: Wed, 11 Aug 2004 15:16:36 +0200
On 11 Aug 2004 12:08, Jules Richardson wrote:
> On Wed, 2004-08-11 at 10:50, Steve Thatcher wrote:
> > Hi all, after reading all this morning's posts, I thought I would
> > throw out some thoughts.
> > XML as a readable format is a great idea.
> I haven't done any serious playing with XML in the last couple of
> years, but back when I did, my experience was that XML is not a good
> format for mixing human-readable and binary data within the XML
> structure itself.
Only if you intend to keep it 100% human readable.
> To make matters worse, the XML spec (at least at the time) did not
> define whether it was possible to pass several XML documents down the
> same data stream (or, as we'd likely need for this, XML documents
> mixed with raw binary). Typically, parsers of the day expected to take
> control of the data stream and expected it to contain one XML document
> only - often closing the stream themselves afterwards.
Now, that's a feature of the reading application. XML does not
state what happens next, since this is outside its scope. It is
perfectly OK to look for the next document, or the next start
tag of the same document type, or whatever.
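As a sketch of that "look for the next document" behaviour in a modern runtime (`split_documents` is a hypothetical helper written for this post, not part of any real tool): track element depth with a pull parser, and when the root element closes, hand the rest of the stream to a fresh parser.

```python
import xml.etree.ElementTree as ET

def split_documents(stream_text):
    """Return each complete XML document found in one text stream.

    Counts element depth via a pull parser; when the root element
    closes, one document is finished and a fresh parser takes over.
    (The stream is fed one character at a time for clarity, not speed.)
    """
    docs = []
    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0
    for ch in stream_text:
        parser.feed(ch)
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
            else:
                depth -= 1
                if depth == 0:      # root closed: document complete
                    docs.append(elem)
                    parser = ET.XMLPullParser(events=("start", "end"))
    return docs
```

An off-the-shelf one-shot parser would reject the concatenated input outright; the stream handling has to live in the application, exactly as said above.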
> I did end up writing my own parser in a couple of KB of code which was
> a little more flexible in data stream handling (so XML's certainly not
> a heavyweight format, and could likely be handled on pretty much any
> machine), but it would be nice to make use of off-the-shelf parsers
> for platforms that have them where possible.
Right, but especially when we come down to classic platforms,
such building blocks are not always usable, and in general way
oversized. On a 48K Apple (or a 64K 4 MHz CP/M machine) we don't
have the space to just port a C app that has 'only' 100K of code.
So reader/writer applications for the original environment
have to be small and specific to the type.
> As you've also said, my initial thought for a data format was to keep
> human-readable config separate from binary data. The human-readable
> config would contain a table of lengths/offsets for the binary data
> giving the actual definition. This does have the advantage that if the
> binary data happens to be a linear sequence of blocks (sectors in the
> case of a disk image) then the raw image can easily be extracted if
> need be (say, to allow conversion to a different format).
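The quoted offset-table idea can be shown with a toy container (the `offset length` header format here is invented purely for illustration): a readable table up front, a blank line, then the untouched payload, so extracting the linear image is just concatenating the listed slices.

```python
def extract_raw_image(archive: bytes) -> bytes:
    """Concatenate the payload slices listed in the text header.

    Toy layout: ASCII header of 'offset length' pairs (one per line),
    a blank line, then the binary payload the offsets index into.
    """
    header, _, payload = archive.partition(b"\n\n")
    image = b""
    for line in header.decode("ascii").splitlines():
        offset, length = (int(field) for field in line.split())
        image += payload[offset:offset + length]
    return image

# Three 3-byte 'sectors', listed out of storage order in the header:
archive = b"3 3\n0 3\n6 3\n\nAAABBBCCC"
```

A converter to some other format only ever has to parse the table; the binary part itself is never reinterpreted.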
Well, that is only true if you define binary data as 8-bit and
all means of transport as 100% transparent. But it hasn't
worked that way in the past, and I doubt that we will be safe
from changes in the future.
As for the character size: in the past we had everything from
6 to 12 bits (OK, I can't remember 11-bit characters/words) as
'binary' characters. Of course 6-, 7- and 8-bit bytes can easily
be stored in an 8-bit byte, but what about 9 bits (Bull) or 12 (DEC)?
At that point you already have to incorporate special
transformation rules which are not necessarily transparent.
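As an example of such a transformation rule, 12-bit words can be carried in 8-bit bytes by packing two words into three bytes. This is a sketch; the nibble layout chosen here is one arbitrary convention among several:

```python
def pack_12bit(words):
    """Pack 12-bit words into bytes, two words per three bytes."""
    out = bytearray()
    for i in range(0, len(words) - 1, 2):
        a, b = words[i], words[i + 1]
        out += bytes([a >> 4, ((a & 0xF) << 4) | (b >> 8), b & 0xFF])
    if len(words) % 2:              # odd count: pad the low nibble
        a = words[-1]
        out += bytes([a >> 4, (a & 0xF) << 4])
    return bytes(out)

def unpack_12bit(data, count):
    """Recover `count` 12-bit words from the packed byte stream."""
    words, i = [], 0
    while len(words) < count:
        words.append((data[i] << 4) | (data[i + 1] >> 4))
        if len(words) < count:
            words.append(((data[i + 1] & 0xF) << 8) | data[i + 2])
        i += 3
    return words
```

Note that the word count must travel alongside the data - a padded odd tail is indistinguishable from a real final word otherwise, which is exactly why such a rule is not transparent.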
Also for the requirement of transparent transport: when
transferring files between different architectures we usually
have code or even format conversions. The most notable code
conversion would be, for example, ISO 8859-1 <-> EBCDIC, which
totally destroys the 'binary' part. Or take format conversions
as done on the way between Unix-style files and (Win-)DOS, LF
vs CR/LF. Whenever we leave the A-Z and 0-9 range we are
likely to encounter such problems.
Sure, one could code an app capable of reading ASCII/binary on
an EBCDIC machine and vice versa, but in my experience (25 years
of programming in mixed environments) it's not only a boring job,
but also one of the most error-prone.
Any kind of standard format must be truly machine-independent,
and thus (at least when using the recommended representation) be
transferable across every platform imaginable.
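One well-known way to meet that requirement is to text-armour the binary part so it only uses characters that survive code-page and line-ending conversion. A sketch of both the failure and the fix, assuming Python's cp500 codec is an acceptable stand-in for a real EBCDIC hop:

```python
import base64

raw = bytes(range(256))                  # contains 0x0A, 0x0D, 0x1A, ...

# What a text-mode transfer does to supposedly transparent binary:
mangled = raw.replace(b"\n", b"\r\n")    # LF -> CR/LF conversion
assert mangled != raw                    # the payload is destroyed

# Base64 uses only A-Z, a-z, 0-9, '+', '/' and '=', all of which
# survive a round trip through an EBCDIC code page (cp500 here):
armoured = base64.b64encode(raw)
after_hop = armoured.decode("ascii").encode("cp500").decode("cp500")
assert base64.b64decode(after_hop.encode("ascii")) == raw
```

Armouring only solves transport, though; the 9-bit or 12-bit word layout still has to be declared separately, as above.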
> > I looked at the CAPS format and in part that would be okay. I would
> > like to throw in an idea of whatever we create as a standard
> > actually have three sections to it.
> So, first section is all the 'fuzzy' data (author, date, version info,
> description etc.), second section describes the layout of the binary
> data (offsets, surfaces, etc.), and the third section is the raw
> binary data itself? If so, I'm certainly happy with that :-)
I would rather go for an annotated format, where more detailed
information can be added at any point, and not necessarily
in certain sections. Especially since the 'fuzzy' data is
usually not needed for the job itself.
> One aside - what's the natural way of defining data on a GCR floppy?
> Do heads/sectors/tracks still make sense as an addressing mode, but
> it's just that the number of sectors per track varies according to
> the track number? Or isn't it that simple?
Well, that's already outside of what a standard definition
can define without doubt.
To my understanding, interpretation of data is always part
of a real application. As soon as it touches machine- or
format-specific implementation details, a standard may only
give guidelines on how to store them properly, but not how to
interpret them. That's part of an actual reader implementation.
And each reader will of course only understand the parts it's
made for - e.g. an Apple DOS 3.3 reader will have no idea
what a tape label for an IBM tape is, not to mention be able
to differentiate between the various header types.
Reader/writer apps will always be as specific as they are
right now, when handling a proprietary format. The big
advantage is that intermediate tools, like archiving,
indexing, etc., can be shared. Well, in fact that's the
only advantage, except that one doesn't have to
figure out a new format each time, and the simple format
does allow the ad hoc inclusion of new machines/systems.
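As one concrete data point for the GCR question above: on the Commodore 1541, track/sector addressing does still make sense, it is just that the sector count per track changes with the recording zone. A sketch using the 1541's zone table:

```python
# Commodore 1541 GCR zones: outer tracks hold more sectors.
ZONES = [(17, 21), (24, 19), (30, 18), (35, 17)]   # (last track, sectors)

def sectors_per_track(track: int) -> int:
    """Sectors on a given 1541 track (tracks are numbered 1-35)."""
    for last, sectors in ZONES:
        if track <= last:
            return sectors
    raise ValueError("track out of range (1-35)")

def block_index(track: int, sector: int) -> int:
    """Linear block number, as a flat .d64 image stores the disk."""
    return sum(sectors_per_track(t) for t in range(1, track)) + sector
```

The drive is single-sided, so a head number never enters the address at all - which rather supports the point that such interpretation belongs in the specific reader, not in the container standard.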
Regards
H.
--- Ende der weitergeleiteten Nachricht / End of forwarded message ---
--
VCF Europa 6.0 on 30 April and 01 May 2005 in Munich
http://www.vcfe.org/