On 11/26/18 7:21 AM, Guy Dunphy wrote:
I was speaking poetically. Perhaps "the mail software he uses was
written by morons" is clearer.
;-)
Oh yes, tell me about the html 'there is no such thing as hard
formatting and you can't have any even when you want it' concept.
Thank you Tim Berners Lee.
I've not delved too deeply into the lack of hard formatting in HTML.
I've also always considered HTML to be what you want displayed, with
minimal information about how you want it displayed. IMHO CSS helps
significantly with the latter part.
Intriguing. $readingList++.
Except that 'non-breaking space' is mostly about inhibiting line wrap
at that word gap.
I wouldn't have thought "mostly" or "inhibiting line wrap". I view the
non-breaking space as a way to glue two parts of text together and treat
them as one unit, particularly for display and partially for selection.
Granted, much of the breaking is done when the text cannot continue (in
its natural direction), frequently needing to start anew on the next line.
But anyway, there's little point trying to psychoanalyze the writers of
that software. Probably involved pointy-headed bosses.
I like to understand why things have been done the way they were.
Hopefully I can learn from the reasons.
Of course not. It was for American English only. This is one of the
major points of failure in the history of information processing.
Looking backwards, (I think) I can understand why you say that. But
based on my (possibly limited) understanding of the time, I think that
ASCII was one of the primordial building blocks that was necessary. It
was a standard (one of many emerging standards of the time) that allowed
computers from different manufacturers to interoperate and represent
characters with the same binary pattern. Something that we now (mostly)
take for granted and something that could not be assured at the time or
before.
Containing extended Unicode character sets via UTF-8 doesn't make it a
non-hard-formatted medium. In ASCII a space is a space, and multi-spaces
DON'T collapse. White space collapse is a feature of html, and whether
an email is html or not is determined by the sending utility.
Having read the rest of your email and now replying, I feel that we may
be talking about two different things. One being ASCII's standard
definition of how to represent different letters / glyphs in a
consistent binary pattern. The other being how information is stored in
an (un)structured sequence of ASCII characters.
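To illustrate the collapse behaviour in question, here's a rough Python
sketch; the regex is only a stand-in for what an HTML renderer does, not
a faithful model of one:

import re

ascii_text = "column one    column two\n\n    indented line"

# In raw ASCII every byte is preserved: three spaces stay three spaces.
# An HTML renderer, by contrast, collapses runs of whitespace into a
# single space before display (ignoring <pre>, &nbsp;, CSS, etc.).
rendered = re.sub(r"\s+", " ", ascii_text).strip()

print(repr(ascii_text))   # raw bytes keep the spacing
print(repr(rendered))     # 'column one column two indented line'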
As you see, this IS NOT HTML, since those extra spaces and your diagram
below would have collapsed if it was html. Also saving it as text and
opening in a plain text ed or hex editor absolutely reveals what it is.
I feel it is important to acknowledge your point and to state that I'm
moving on.
Hmm... the problem is it's intended to be serious, but is still far from
exposure-ready. So if I talk about it now, I risk having specific terms
I've coined in the doco (including the project name) getting meme-jammed
or trademarked by others. The plan is to release it all in one go,
eventually. Definitely will be years before that happens, if ever.
Fair enough.
However, here's a cut-n-paste (in plain text) of a section of the
Introduction (html with diags.)
ACK
----------
Almost always, a first attempt at some unfamiliar, complex task produces
a less than optimal result. Only with the knowledge gained from actually
doing a new thing, can one look back and see the mistakes made. It usually
takes at least one more cycle of doing it over from scratch to produce
something that is optimal for the needs of the situation. Sometimes,
especially where deep and subtle conceptual innovations are involved,
it takes many iterations.
Part way through the first large (for me at the time) project that I
worked on, I decided that the project (and likely others) needed three
versions before being production ready:
1) First whack at solving the problem. LOTS about the problem is
learned, including the true requirements and the unknown dependencies
along the way. This will not be the final shipping version. - Think
of this as the Alpha release.
2) This is a complete re-write of the project based on what was learned
in #1. - Think of this as the Beta release.
3) This is less of a re-write and more of a bug fix for version 2. -
Think of this as the shipping release.
Human development of computing science (including information coding
schemes) has been effectively a 'first time effort', since we kept on
developing new stuff built on top of earlier work. We almost never went
back to the roots and rebuilt everything, applying insights gained from
the many mistakes made.
With few notable (partial) exceptions, I largely agree.
In reviewing the evolution of information coding schemes since very
early stages such as the Morse code, telegraph signal recording,
typewriters, etc, through early computing systems, mass data storage and
file systems, computer languages from Assembler through Compilers and
Interpreters, and so on, several points can be identified at which early
(inadequate) concepts became embedded then used as foundations for further
developments. This made the original concepts seem like fundamentals,
difficult to question (because they are underlying principles for much
more complex later work), and virtually impossible to alter (due to the
vast amounts of code dependent on them.)
Agreed.
And yet, when viewed in hindsight many of the early concepts are seriously
flawed. They effectively hobbled all later work dependent on them.
Okay.
Examples of these pivotal conceptual errors:
Defects in the ASCII code table. This was a great improvement at the
time, but fails to implement several utterly essential concepts. The lack
of these concepts in the character coding scheme underlying virtually
all information processing since the 1960s, was unfortunate. Just one
(of many) bad consequences has been the proliferation of 'patch-up'
text coding schemes such as proprietary document formats (MS Word for eg),
postscript, pdf, html (and its even more nutty academia-gone-mad variants
like XML), UTF-8, unicode and so on.
Hum....
This is a scan from the 'Recommended USA Standard Code for Information
Interchange (USASCII) X3.4 - 1967'. The hex A-F on rows 10-15 were added
here; hexadecimal notation was not commonly in use in the 1960s.
Fig. ___  The original ASCII definition table.
ASCII's limitations were so severe that even the text (i.e. ASCII) program
code source files used by programmers to develop literally everything
else in computing science had major shortcomings and inconveniences.
I don't think I'm willing to accept that at face value.
A few specific examples of ASCII's flaws:
* Missing concept of control vs data channel separation. And so we
needed the "< >" syntax of html, etc.
I don't buy that, at all.
ASCII has control codes that I think could be (but aren't) used for
some of this. Start of Text (STX) & End of Text (ETX), or Shift Out
(SO) & Shift In (SI), or Device Control 1 - 4 (DC1 - DC4), or File /
Group / Record / Unit Separators (FS / GS / RS / US) all come to mind.
Either you're going to need two parallel byte streams, one for data and
another for control (I'm ignoring timing between them), -or- you're
going to need a way to indicate the need to switch between byte
(sub)streams in the overall byte (super)streams. Much of what I've seen
is the latter.
It just happens that different languages have decided to use different
(sequences of) characters / bytes to do this. HTML (and possibly all XML)
uses "<" and ">". ASP uses "<%" and "%>". PHP uses "<?php" and "?>".
Apache HTTPD SSI uses "<!--#" and "-->". I can't readily think of
others, but I know there are a plethora. These are all signals to
indicate the switch between data and control stream.
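As a rough Python sketch of that idea (purely illustrative, the field
values are made up): ASCII's own separator control codes could carry the
data / structure split without reserving printable characters the way
"<" and ">" do:

US = "\x1f"   # Unit Separator   - between fields within a record
RS = "\x1e"   # Record Separator - between records

records = [("alpha", "1"), ("bravo", "2")]

# Encode: fields joined by US, records joined by RS.
stream = RS.join(US.join(fields) for fields in records)

# Decode: no escaping needed as long as the data itself never contains
# US or RS (the same caveat "<" has in HTML).
decoded = [tuple(record.split(US)) for record in stream.split(RS)]

print(decoded)   # [('alpha', '1'), ('bravo', '2')]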
* Inability to embed meta-data about the text in standard
programmatically accessible form.
I'll agree that there's no distinction of data, meta, or otherwise, in a
string of ASCII bytes. But I don't expect there to be.
Is there any distinction in the Roman alphabet (or any other alphabet in
this thread) to differentiate the sequence of bytes that makes up the
quote versus the metadata that is the name of the person that said the
quote? Or what about the date that it was originally said?
This is about the time that I really started to feel that you were
talking more about a file format (for lack of a better description) than
about how the bytes were actually encoded, ASCII or EBCDIC or otherwise.
* Absence of anything related to text adornments, i.e. italics,
underline and bold. The most basic essentials of expressive text,
completely ignored.
Again, alphabets don't have italics or underline or bold or other. They
have to depend on people reading them, and inferring the metadata, and
using tonal inflection to convey that metadata.
* Absence of any provision for creative typography. No awareness of
fonts, type sizes, kerning, etc.
I don't believe that's anywhere close to ASCII's responsibility.
* Lack of logical 'new line', 'new paragraph' and 'new page' codes.
I personally have issues with the concept of what a line is, or when to
start a new one. (Aside: I'm a HUGE fan of format=flowed text.)
We do have conventions for indicating a new paragraph, specifically two
new lines.
Is there an opportunity to streamline that? Probably.
I also have unresolved issues of what a page is. (Think reactive web
pages that gracefully adjust themselves as you dynamically resize the
window.)
There is also Form Feed (FF), which is used in printers to advance to
the next page. (Where the page is defined as the physical size of
paper, independently of the amount of text that will / won't fit on a
given page.)
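A small Python sketch of those two conventions together (the sample text
is made up): blank lines marking paragraphs and Form Feed (0x0C) marking
page breaks:

FF = "\x0c"   # ASCII Form Feed - advance to the next page

document = (
    "First paragraph, first page.\n\n"
    "Second paragraph, first page.\n"
    + FF +
    "Only paragraph, second page.\n"
)

for number, page in enumerate(document.split(FF), start=1):
    paragraphs = [p.strip() for p in page.split("\n\n") if p.strip()]
    print(f"page {number}: {len(paragraphs)} paragraph(s)")

# page 1: 2 paragraph(s)
# page 2: 1 paragraph(s)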
* Inadequate support of basic formatting elements such as tabular
columns, text blocks, etc.
ASCII has well defined tab characters, both horizontal and vertical.
(Though I can't remember ever seeing the vertical tab being used.)
I think there is some use for File / Group / Record / Unit Separators
(FS / GS / RS / US) for some of these uses, particularly for columns and
text blocks.
* Even the extremely fundamental and essential concept of 'tab columns'
is improperly implemented in ASCII, hence almost completely dysfunctional.
Why do you say it's improperly implemented?
It sounds as if you are commenting about what programs do when
confronting a tab, not the actual binary pattern that represents the tab
character.
What would you like to see done differently?
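(For reference, here's what I mean about it being the displaying
program's choice; a quick Python illustration with made-up column data.
The HT byte only says "advance to the next tab stop"; ASCII is silent on
where the stops are, so the same bytes line up differently under
different settings.)

row = "name\tqty\tprice"

# The same bytes, rendered under three different tab-stop settings;
# note how the column alignment shifts each time.
for width in (8, 4, 2):
    print(repr(row.expandtabs(width)))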
* No concept of general extensible-typed functional blocks within text,
with the necessary opening and closing delimiters.
Now I think you're asking too much of a character encoding scheme.
I do think that you can ask that of file formats.
* Missing symmetry of quote characters. (A consequence of the absence
of typed functional blocks.)
I think that ASCII accurately represents what the general American
populace was taught in elementary school. Specifically that there is
functionally a single quote and a double quote. Sure, there are opening
and closing quotes, both single and double, but that is effectively
styling and doesn't change the semantic meaning of the text.
* No provision for code commenting. Hence the gaggle of comment
delimiting styles in every coding language since. (Another consequence
of the absence of typed functional blocks.)
How is that the responsibility of the standard used to encode characters
in a binary pattern?
That REALLY sounds like it's the responsibility of the thing that uses
the underlying standard characters.
* No awareness of programmatic operations such as Inclusion, Variable
substitution, Macros, Indirection, Introspection, Linking, Selection, etc.
I see zero way that is the binary encoding format's responsibility.
I see every way that is the responsibility of the higher layer that is
using the underlying binary encoding.
* No facility for embedding of multi-byte character and binary code
sequences.
I can see how ASCII doesn't (can't?) encode multi-byte characters. Some
can argue that ASCII can't even encode a full 8 bit byte character.
But from the standpoint of storing / sending / retrieving (multiples of
8-bit) bytes, how is this ASCII's problem?
IMHO this really jumps the shark (as if we hadn't already) from an
encoding scheme to a file format.
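For what it's worth, here's how UTF-8 ended up handling the multi-byte
case on top of ASCII, shown with a couple of Python one-liners (output
noted in the comments):

# Plain ASCII characters stay single bytes below 0x80; anything else
# becomes a multi-byte sequence whose bytes all have the high bit set.
for ch in ("A", "\u00e9", "\u20ac"):          # 'A', 'é', '€'
    print(ch, [hex(b) for b in ch.encode("utf-8")])

# A ['0x41']
# é ['0xc3', '0xa9']
# € ['0xe2', '0x82', '0xac']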
* Missing an informational equivalent to the pure 'zero' symbol of
number systems. A specific "There is no information here" symbol. (The
NUL symbol has other meanings.) This lack has very profound implications.
You're going to need to work to convince me of that.
Mathematics has had zero, 0, for a really long time. (Yes, there was a time
before we had 0.) But there is no numerical difference between 0 and 00
and 0000. So, why do we need the latter two?
How many grains of sand does it take to make a pile?
My opinion: None. You simply need to define that it is a pile. Then it
becomes a question of "how many grains of sand are in the pile". Zero
(0) is a perfectly acceptable number. I also feel like anything but a
number is not an answer to the question of how many.
* No facility to embed multiple data object types within text streams.
How is this ASCII's problem?
How do you represent other data object types if you aren't using ASCII?
Sure, there's raw binary, but that just means that you're using your own
encoding scheme which is even less of a common / well known standard
than ASCII.
We have all sorts of ways to encode other data objects in ASCII and then
include it in streams of bytes.
Again, encoding versus file format.
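Base64 is probably the most familiar example of the kind of thing I
mean; a tiny Python sketch:

import base64

binary_blob = bytes(range(8))            # arbitrary non-text bytes

# Arbitrary binary data rendered as plain ASCII so it can ride inside a
# text stream (email attachments, data: URLs, etc.), and back again.
ascii_form = base64.b64encode(binary_blob).decode("ascii")

print(ascii_form)                                    # AAECAwQFBgc=
print(base64.b64decode(ascii_form) == binary_blob)   # True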
* No facility to correlate coded text elements to associated visual
typographical elements within digital images, AV files, and other
representational constructs. This has crippled efforts to digitize the
cultural heritage of humankind.
Now I think you're lamenting the lack of computer friendly bytes
representing the text that is in the picture of a sign. Functionally
what the ALT attribute of HTML's <IMG> tag is.
IMHO this is so far beyond the scope of a standard meant to make sure
that people represent 'A' the same way on multiple computers.
* Non-configurable geometry of text flow, when representing the text
in 2D planes. (Or 3D space for that matter.)
What is a page ^W 2D plane? ;-)
I don't think oral text has the geometry of text flow or a page either.
Again, IMHO, not ASCII's fault, or even its wheelhouse.
* Many of the 32 'control codes' (characters 0x00 to 0x1F) were
allocated to hardware-specific uses that have since become obsolete and
fallen into disuse. Leaving those codes as a wasted resource.
Fair point.
I sometimes lament that the control codes aren't used more.
* ASCII defined only a 7-bit (128 codes) space, rather than the full
8-bit (256 codes) space available with byte sized architectures. This
left the 'upper' 128 code page open to multiple chaotic, conflicting
usage interpretations. For example the IBM PC code page symbol sets
(multiple languages and graphics symbols, in pre-Unicode days) and the
UTF-8 character bit-size extensions.
I wonder what character sets looked like for other computers with
different word lengths. How many more, or fewer, characters were encoded?
Did it really make a difference?
Would it make any real difference if words were 32-bits long?
What if we moved to dictionary words represented by encoding schemes
instead of individual characters?
Or maybe we should move to encoding concepts instead of words. That way
we might have some loose translation of the words for mother / father /
son / daughter between languages. Maybe. I'm sure there would still be
issues. Gender and tense notwithstanding.
Then there are the issues of possession and tense.
I feel like all of these are beyond the purpose and intent of ASCII, a
way to consistently encode characters in the hopes that they might be
able to be used across disparate computer systems.
* Inability to create files which encapsulate the entirety of the visual
appearance of the physical object or text which the file represents,
without dependence on any external information. Even plain ASCII text
files depend on the external definition of the character glyphs that the
character codes represent. This can be a problem if files are intended
to serve as long term historical records, potentially for geological
timescales. This problem became much worse with the advent of the vast
Unicode glyph set, and typeset formats such as PDF.
Now even more than ever, it sounds like you're talking about a file
format and not ASCII as a scheme meant to consistently encode characters.
The PDF 'archival' format (in which all referenced fonts must be
defined in the file) is a step in the right direction - except that
format standard is still proprietary and not available for free.
Don't get me started on PDF. IMHO PDF is where information goes to die.
Once data is in a PDF, the only reliable way to get the data back out to
be consumed by something else is through something like human eyes.
(Sure it may be possible to deconstruct the PDF, but it's fraught with
so many problems.)
----------
Sorry to be a tease.
Tease is not how I'd describe it. I feel like it was more of a bait
(talking about shortcomings of ASCII) and switch (talking about
shortcomings of file formats).
That being said, I do think you made some extremely salient points about
file formats.
Soon I'd like to have a discussion about the functional evolution of
the various ASCII control codes, and how they are used (or disused) now.
But am a bit too busy atm to give it adequate attention.
I think that would be an interesting discussion.
--
Grant. . . .
unix || die