It was thus said that the Great Hans Franke once stated:
Rather than restricting the encoding of the XML file to a
specific charset, we need to restrict the USAGE within the
standard to certain characters, regardless of the encoding.
Unless it declares otherwise, an XML file is assumed to be encoded in UTF-8
(or UTF-16, if it starts with a byte order mark), *but* an XML parser is
required to abort at the first error in the XML file.
If a parser is reading an XML file without an explicit character set
encoding scheme (which means it's assuming UTF-8) and it reads a character
that is illegal (say the file was encoded in ISO-8859-3) it gives up
(usually with an "illegal character at such-n-such position" error).
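To make that failure concrete, here is a minimal sketch using Python's
standard xml.etree.ElementTree (which sits on top of expat). The <note>
element and the try_parse helper are made up for illustration; the only
point is that a byte that is legal in ISO-8859-3 but not in UTF-8 stops the
parse when no encoding is declared.

```python
import xml.etree.ElementTree as ET

# "ħ" (h with stroke) is the single byte 0xB1 in ISO-8859-3.
# On its own that byte is not a valid UTF-8 sequence.
doc = "<note>ħ</note>".encode("iso-8859-3")

def try_parse(data: bytes) -> str:
    """Hypothetical helper: report whether the parser accepted the bytes."""
    try:
        ET.fromstring(data)
        return "ok"
    except ET.ParseError as err:
        # The parser gives up at the first offending byte.
        return f"parse error: {err}"

if __name__ == "__main__":
    print(try_parse(doc))                # no declaration, so UTF-8 is assumed
    print(try_parse(b"<note>h</note>"))  # plain ASCII is also valid UTF-8
```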
Right now, this is a real problem with XML deployment (it gets even
weirder when XML files are transported via HTTP, but I'm getting ahead of
myself), so when I suggested that (if we are using XML) each file *must*
start with:
<?xml version="1.0" encoding="US-ASCII"?>
it was a way of self-defense. Perhaps it can be relaxed some to require:
<?xml version="1.0" encoding="some XML defined character encoding scheme"?>
and if the encoding scheme isn't defined, it's an error and further
processing of the archive should stop.
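A rough sketch of such a check, assuming the file is handed over as raw
bytes. The ALLOWED set and the declared_encoding function are my own
placeholders (which encodings would actually be accepted is still open),
and the byte-level match only works for ASCII-compatible encodings, so a
UTF-16 file would be rejected here.

```python
import re

# Hypothetical allowlist; which encodings to accept has not been decided.
ALLOWED = {"US-ASCII", "UTF-8", "ISO-8859-1"}

# Matches an XML declaration that carries both a version and an encoding.
DECL = re.compile(
    rb'^<\?xml\s+version\s*=\s*(["\'])1\.[0-9]+\1'
    rb'\s+encoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\2'
)

def declared_encoding(data: bytes) -> str:
    """Return the declared encoding name, or raise ValueError to stop processing."""
    m = DECL.match(data)
    if m is None:
        raise ValueError("missing XML declaration with an encoding")
    name = m.group(3).decode("ascii").upper()
    if name not in ALLOWED:
        raise ValueError(f"unknown encoding: {name}")
    return name

if __name__ == "__main__":
    print(declared_encoding(b'<?xml version="1.0" encoding="US-ASCII"?><a/>'))
```

Anything that raises ValueError would be treated as an error before the
file ever reaches the real parser.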
I suggest restricting the characters used in tags, attribute
names and attributes to 'A-Z' (uppercase), '0-9' and '-'.
Unfortunately, XML is case sensitive, and all the XML I've seen uses
lowercase tags (the XML declaration itself must be written in lowercase),
so lowercase is pretty much a standard.
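For what it's worth, the proposed restriction is easy to check
mechanically whichever case wins. A small sketch follows; the violations
function and the DISK-IMAGE/SECTOR names are invented for the example, and
I'm reading "attributes" as attribute values. Names additionally have to
start with a letter, since XML does not allow a name to begin with a digit
or '-'.

```python
import re
import xml.etree.ElementTree as ET

# Proposed character set: 'A-Z', '0-9' and '-'.
NAME = re.compile(r"[A-Z][A-Z0-9-]*")   # tag and attribute names
VALUE = re.compile(r"[A-Z0-9-]*")       # attribute values (assumed reading)

def violations(xml_text: str) -> list:
    """Hypothetical checker: list every tag, attribute name or value outside the set."""
    bad = []
    for elem in ET.fromstring(xml_text).iter():
        if not NAME.fullmatch(elem.tag):
            bad.append(f"tag {elem.tag}")
        for key, value in elem.attrib.items():
            if not NAME.fullmatch(key):
                bad.append(f"attribute name {key}")
            if not VALUE.fullmatch(value):
                bad.append(f"attribute value {value}")
    return bad

if __name__ == "__main__":
    print(violations('<DISK-IMAGE TRACKS="80"><SECTOR ID="1"/></DISK-IMAGE>'))
    print(violations('<disk Size="80"/>'))
```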
-spc (hmmm ... )