On 11/26/18 7:21 AM, Guy Dunphy wrote:
I was speaking poetically. Perhaps "the mail software he uses was
written by morons" is clearer.
;-)
Oh yes, tell me about the html 'there is no such thing as hard
formatting and you can't have any even when you want it' concept.
Thank you Tim Berners Lee.
I've not delved too deeply into the lack of hard formatting in HTML.
I've also always considered HTML to be what you want displayed, with
minimal information about how you want it displayed. IMHO CSS helps
significantly with the latter part.
Intriguing. $readingList++.
Except that 'non-breaking space' is mostly about inhibiting line wrap
at that word gap.
I wouldn't have thought "mostly" or "inhibiting line wrap". I view the
non-breaking space as a way to glue two parts of text together and treat
them as one unit, particularly for display and partially for selection.
Granted, much of the breaking is done when the text cannot continue (in
its natural direction), frequently needing to start anew on the next line.
But anyway, there's little point trying to psychoanalyze the writers of
that software. Probably involved pointy-headed bosses.
I like to understand why things have been done the way they were.
Hopefully I can learn from the reasons.
Of course not. It was for American English only. This is one of the
major points of failure in the history of information processing.
Looking backwards, (I think) I can understand why you say that. But
based on my (possibly limited) understanding of the time, I think that
ASCII was one of the primordial building blocks that was necessary. It
was a standard (one of many emerging standards of the time) that allowed
computers from different manufacturers to interoperate and represent
characters with the same binary pattern. Something that we now (mostly)
take for granted and something that could not be assured at the time or
before.
Containing extended Unicode character sets via UTF-8 doesn't make it a
non-hard-formatted medium. In ASCII a space is a space, and multi-spaces
DON'T collapse. White space collapse is a feature of html, and whether
an email is html or not is determined by the sending utility.
Having read the rest of your email and now replying, I feel that we may
be talking about two different things. One being ASCII's standard
definition of how to represent different letters / glyphs in a
consistent binary pattern. The other being how information is stored in
an (un)structured sequence of ASCII characters.
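To illustrate the collapse behaviour in question, here's a rough Python
sketch; the regex is only a stand-in for what an HTML renderer does, not
a faithful model of one:

import re

ascii_text = "column one    column two\n\n    indented line"

# In raw ASCII every byte is preserved: three spaces stay three spaces.
# An HTML renderer, by contrast, collapses runs of whitespace into a
# single space before display (ignoring <pre>, &nbsp;, CSS, etc.).
rendered = re.sub(r"\s+", " ", ascii_text).strip()

print(repr(ascii_text))   # raw bytes keep the spacing
print(repr(rendered))     # 'column one column two indented line'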
As you see, this IS NOT HTML, since those extra spaces and your diagram
below would have collapsed if it was html. Also saving it as text and
opening in a plain text ed or hex editor absolutely reveals what it is.
I feel it is important to acknowledge your point and to state that I'm
moving on.
Hmm... the problem is it's intended to be serious, but is still far from
exposure-ready. So if I talk about it now, I risk having specific terms
I've coined in the doco (including the project name) getting meme-jammed
or trademarked by others. The plan is to release it all in one go,
eventually. Definitely will be years before that happens, if ever.
Fair enough.
However, here's a cut-n-paste (in plain text) of a section of the
Introduction (html with diags.)
ACK
----------
Almost always, a first attempt at some unfamiliar, complex task produces
a less than optimal result. Only with the knowledge gained from actually
doing a new thing, can one look back and see the mistakes made. It usually
takes at least one more cycle of doing it over from scratch to produce
something that is optimal for the needs of the situation. Sometimes,
especially where deep and subtle conceptual innovations are involved,
it takes many iterations.
Part way through the first large (for me at the time) project that I
worked on, I decided that the project (and likely others) needed three
versions before being production ready:
1) First whack at solving the problem. LOTS about the problem is
learned, including the true requirements and the unknown dependencies
along the way. This will not be the final shipping version. - Think
of this as the Alpha release.
2) This is a complete re-write of the project based on what was learned
in #1. - Think of this as the Beta release.
3) This is less of a re-write and more of a bug fix for version 2. -
Think of this as the shipping release.
Human development of computing science (including information coding
schemes) has been effectively a 'first time effort', since we kept on
developing new stuff built on top of earlier work. We almost never went
back to the roots and rebuilt everything, applying insights gained from
the many mistakes made.
With few notable (partial) exceptions, I largely agree.
In reviewing the evolution of information coding schemes since very
early stages such as the Morse code, telegraph signal recording,
typewriters, etc, through early computing systems, mass data storage and
file systems, computer languages from Assembler through Compilers and
Interpreters, and so on, several points can be identified at which early
(inadequate) concepts became embedded then used as foundations for further
developments. This made the original concepts seem like fundamentals,
difficult to question (because they are underlying principles for much
more complex later work), and virtually impossible to alter (due to the
vast amounts of code dependent on them.)
Agreed.
And yet, when viewed in hindsight many of the early concepts are seriously
flawed. They effectively hobbled all later work dependent on them.
Okay.
Examples of these pivotal conceptual errors:
Defects in the ASCII code table. This was a great improvement at the
time, but fails to implement several utterly essential concepts. The lack
of these concepts in the character coding scheme underlying virtually
all information processing since the 1960s, was unfortunate. Just one
(of many) bad consequences has been the proliferation of 'patch-up'
text coding schemes such as proprietary document formats (MS Word for eg),
postscript, pdf, html (and its even more nutty academia-gone-mad variants
like XML), UTF-8, unicode and so on.
Hum....
This is a scan from the 'Recommended USA Standard Code for Information
Interchange (USASCII) X3.4 - 1967'. The hex A-F on rows 10-15 were added
here; hexadecimal notation was not commonly in use in the 1960s.
Fig. ___  The original ASCII definition table.
ASCII's limitations were so severe that even the text (i.e. ASCII) program
code source files used by programmers to develop literally everything
else in computing science had major shortcomings and inconveniences.
I don't think I'm willing to accept that at face value.
A few specific examples of ASCII's flaws:
* Missing concept of control vs data channel separation. And so we
needed the "< >" syntax of html, etc.
I don't buy that, at all.
ASCII has control codes that I think could be (but aren't) used for
some of this. Start of Text (STX) & End of Text (ETX), or Shift Out
(SO) & Shift In (SI), or Device Control 1 - 4 (DC1 - DC4), or File /
Group / Record / Unit Separators (FS / GS / RS / US) all come to mind.
Either you're going to need two parallel byte streams, one for data and
another for control (I'm ignoring timing between them), -or- you're
going to need a way to indicate the need to switch between byte
(sub)streams in the overall byte (super)streams. Much of what I've seen
is the latter.
It just happens that different languages have decided to use different
(sequences of) characters / bytes to do this. HTML (and possibly all XML)
uses "<" and ">". ASP uses "<%" and "%>". PHP uses "<?php" and "?>".
Apache HTTPD SSI uses "<!--#" and "-->". I can't readily think of
others, but I know there are a plethora. These are all signals to
indicate the switch between data and control stream.
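As a rough Python sketch of that idea (purely illustrative, the field
values are made up): ASCII's own separator control codes could carry the
data / structure split without reserving printable characters the way
"<" and ">" do:

US = "\x1f"   # Unit Separator   - between fields within a record
RS = "\x1e"   # Record Separator - between records

records = [("alpha", "1"), ("bravo", "2")]

# Encode: fields joined by US, records joined by RS.
stream = RS.join(US.join(fields) for fields in records)

# Decode: no escaping needed as long as the data itself never contains
# US or RS (the same caveat "<" has in HTML).
decoded = [tuple(record.split(US)) for record in stream.split(RS)]

print(decoded)   # [('alpha', '1'), ('bravo', '2')]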
* Inability to embed meta-data about the text in standard
programmatically accessible form.
I'll agree that there's no distinction of data, meta, or otherwise, in a
string of ASCII bytes. But I don't expect there to be.
Is there any distinction in the Roman alphabet (or any other alphabet in
this thread) to differentiate the sequence of bytes that makes up the
quote versus the metadata that is the name of the person that said the
quote? Or what about the date that it was originally said?
This is about the time that I really started to feel that you were
talking more about a file format (for lack of a better description) than
about how the bytes were actually encoded, ASCII or EBCDIC or otherwise.
* Absence of anything related to text adornments, i.e. italics,
underline and bold. The most basic essentials of expressive text,
completely ignored.
Again, alphabets don't have italics or underline or bold or other. They
have to depend on people reading them, and inferring the metadata, and
using tonal inflection to convey that metadata.
* Absence of any provision for creative typography. No awareness of
fonts, type sizes, kerning, etc.
I don't believe that's anywhere close to ASCII's responsibility.
* Lack of logical 'new line', 'new paragraph' and 'new page' codes.
I personally have issues with the concept of what a line is, or when to
start a new one. (Aside: I'm a HUGE fan of format=flowed text.)
We do have conventions for indicating a new paragraph, specifically two
new lines.
Is there an opportunity to streamline that? Probably.
I also have unresolved issues of what a page is. (Think reactive web
pages that gracefully adjust themselves as you dynamically resize the
window.)
There is also Form Feed (FF), which is used in printers to advance to
the next page. (Where the page is defined as the physical size of
paper, independently of the amount of text that will / won't fit on a
given page.)
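A small Python sketch of those two conventions together (the sample text
is made up): blank lines marking paragraphs and Form Feed (0x0C) marking
page breaks:

FF = "\x0c"   # ASCII Form Feed - advance to the next page

document = (
    "First paragraph, first page.\n\n"
    "Second paragraph, first page.\n"
    + FF +
    "Only paragraph, second page.\n"
)

for number, page in enumerate(document.split(FF), start=1):
    paragraphs = [p.strip() for p in page.split("\n\n") if p.strip()]
    print(f"page {number}: {len(paragraphs)} paragraph(s)")

# page 1: 2 paragraph(s)
# page 2: 1 paragraph(s)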
* Inadequate support of basic formatting elements such as tabular
columns, text blocks, etc.
ASCII has well defined tab characters, both horizontal and vertical.
(Though I can't remember ever seeing the vertical tab being used.)
I think there is some use for File / Group / Record / Unit Separators
(FS / GS / RS / US) for some of these uses, particularly for columns and
text blocks.
* Even the extremely fundamental and essential concept of 'tab columns'
is improperly implemented in ASCII, hence almost completely dysfunctional.
Why do you say it's improperly implemented?
It sounds as if you are commenting about what programs do when
confronting a tab, not the actual binary pattern that represents the tab
character.
What would you like to see done differently?
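(For reference, here's what I mean about it being the displaying
program's choice; a quick Python illustration with made-up column data.
The HT byte only says "advance to the next tab stop"; ASCII is silent on
where the stops are, so the same bytes line up differently under
different settings.)

row = "name\tqty\tprice"

# The same bytes, rendered under three different tab-stop settings;
# note how the column alignment shifts each time.
for width in (8, 4, 2):
    print(repr(row.expandtabs(width)))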
* No concept of general extensible-typed functional blocks within text,
with the necessary opening and closing delimiters.
Now I think you're asking too much of a character encoding scheme.
I do think that you can ask that of file formats.
* Missing symmetry of quote characters. (A consequence of the absence
of typed functional blocks.)
I think that ASCII accurately represents what the general American
populace was taught in elementary school. Specifically that there is
functionally a single quote and a double quote. Sure, there are opening
and closing quotes, both single and double, but that is effectively
styling and doesn't change the semantic meaning of the text.
* No provision for code commenting. Hence the gaggle of comment
delimiting styles in every coding language since. (Another consequence
of the absence of typed functional blocks.)
How is that the responsibility of the standard used to encode characters
in a binary pattern?
That REALLY sounds like it's the responsibility of the thing that uses
the underlying standard characters.
* No awareness of programmatic operations such as Inclusion, Variable
substitution, Macros, Indirection, Introspection, Linking, Selection, etc.
I see zero way that is the binary encoding format's responsibility.
I see every way that is the responsibility of the higher layer that is
using the underlying binary encoding.
* No facility for embedding of multi-byte character and binary code
sequences.
I can see how ASCII doesn't (can't?) encode multi-byte characters. Some
can argue that ASCII can't even encode a full 8 bit byte character.
But from the standpoint of storing / sending / retrieving (multiples of
8-bit) bytes, how is this ASCII's problem?
IMHO this really jumps the shark (as if we hadn't already) from an
encoding scheme to a file format.
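For what it's worth, here's how UTF-8 ended up handling the multi-byte
case on top of ASCII, shown with a couple of Python one-liners (output
noted in the comments):

# Plain ASCII characters stay single bytes below 0x80; anything else
# becomes a multi-byte sequence whose bytes all have the high bit set.
for ch in ("A", "\u00e9", "\u20ac"):          # 'A', 'é', '€'
    print(ch, [hex(b) for b in ch.encode("utf-8")])

# A ['0x41']
# é ['0xc3', '0xa9']
# € ['0xe2', '0x82', '0xac']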
* Missing an informational equivalent to the pure 'zero' symbol of
number systems. A specific "There is no information here" symbol. (The
NUL symbol has other meanings.) This lack has very profound implications.
You're going to need to work to convince me of that.
Mathematics has had zero, 0, for a really long time. (Yes, there was a time
before we had 0.) But there is no numerical difference between 0 and 00
and 0000. So, why do we need the latter two?
How many grains of sand does it take to make a pile?
My opinion: None. You simply need to define that it is a pile. Then it
becomes a question of "how many grains of sand are in the pile". Zero
(0) is a perfectly acceptable number. I also feel like anything but a
number is not an answer to the question of how many.
* No facility to embed multiple data object types within text streams.
How is this ASCII's problem?
How do you represent other data object types if you aren't using ASCII?
Sure, there's raw binary, but that just means that you're using your own
encoding scheme which is even less of a common / well known standard
than ASCII.
We have all sorts of ways to encode other data objects in ASCII and then
include it in streams of bytes.
Again, encoding versus file format.
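Base64 is probably the most familiar example of the kind of thing I
mean; a tiny Python sketch:

import base64

binary_blob = bytes(range(8))            # arbitrary non-text bytes

# Arbitrary binary data rendered as plain ASCII so it can ride inside a
# text stream (email attachments, data: URLs, etc.), and back again.
ascii_form = base64.b64encode(binary_blob).decode("ascii")

print(ascii_form)                                    # AAECAwQFBgc=
print(base64.b64decode(ascii_form) == binary_blob)   # True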
* No facility to correlate coded text elements to associated visual
typographical elements within digital images, AV files, and other
representational constructs. This has crippled efforts to digitize the
cultural heritage of humankind.
Now I think you're lamenting the lack of computer friendly bytes
representing the text that is in the picture of a sign. Functionally
what the ALT attribute of HTML's <IMG> tag is.
IMHO this is so far beyond the scope of a standard meant to make sure
that people represent 'A' the same way on multiple computers.
* Non-configurable geometry of text flow, when representing the text
in 2D planes. (Or 3D space for that matter.)
What is a page ^W 2D plane? ;-)
I don't think oral text has the geometry of text flow or a page either.
Again, IMHO, not ASCII's fault, or even its wheelhouse.
* Many of the 32 'control codes' (characters 0x00 to 0x1F) were
allocated to hardware-specific uses that have since become obsolete and
fallen into disuse. Leaving those codes as a wasted resource.
Fair point.
I sometimes lament that the control codes aren't used more.
* ASCII defined only a 7-bit (128 codes) space, rather than the full
8-bit (256 codes) space available with byte sized architectures. This
left the 'upper' 128 code page open to multiple chaotic, conflicting
usage interpretations. For example the IBM PC code page symbol sets
(multiple languages and graphics symbols, in pre-Unicode days) and the
UTF-8 character bit-size extensions.
I wonder what character sets looked like for other computers with
different word lengths. How many more, or fewer, characters were encoded?
Did it really make a difference?
Would it make any real difference if words were 32-bits long?
What if we moved to dictionary words represented by encoding schemes
instead of individual characters?
Or maybe we should move to encoding concepts instead of words. That way
we might have some loose translation of the words for mother / father /
son / daughter between languages. Maybe. I'm sure there would still be
issues. Gender and tense notwithstanding.
Then there are the issues of possession and tense.
I feel like all of these are beyond the purpose and intent of ASCII, a
way to consistently encode characters in the hopes that they might be
able to be used across disparate computer systems.
* Inability to create files which encapsulate the entirety of the visual
appearance of the physical object or text which the file represents,
without dependence on any external information. Even plain ASCII text
files depend on the external definition of the character glyphs that the
character codes represent. This can be a problem if files are intended
to serve as long term historical records, potentially for geological
timescales. This problem became much worse with the advent of the vast
Unicode glyph set, and typeset formats such as PDF.
Now even more than ever, it sounds like you're talking about a file
format and not ASCII as a scheme meant to consistently encode characters.
The PDF 'archival' format (in which all referenced fonts must be
defined in the file) is a step in the right direction - except that
format standard is still proprietary and not available for free.
Don't get me started on PDF. IMHO PDF is where information goes to die.
Once data is in a PDF, the only reliable way to get the data back out to
be consumed by something else is through something like human eyes.
(Sure it may be possible to deconstruct the PDF, but it's fraught with
so many problems.)
----------
Sorry to be a tease.
Tease is not how I'd describe it. I feel like it was more of a bait
(talking about shortcomings of ASCII) and switch (talking about
shortcomings of file formats).
That being said, I do think you made some extremely salient points about
file formats.
Soon I'd like to have a discussion about the functional evolution of
the various ASCII control codes, and how they are used (or disused) now.
But am a bit too busy atm to give it adequate attention.
I think that would be an interesting discussion.
--
Grant. . . .
unix || die