Text encoding Babel. Was Re: George Keremedjiev

Guy Dunphy guykd at optusnet.com.au
Mon Nov 26 08:21:52 CST 2018

At 10:52 PM 25/11/2018 -0700, you wrote:

>> Then adds a plain ASCII space 0x20 just to be sure.
>I don't think it's adding a plain ASCII space 0x20 just to be sure. 
>Looking at the source of the message, I see =C2=A0, which is the UTF-8 
>representation followed by the space.  My MUA that understands UTF-8 
>shows that "=C2=A0 " translates to "  ".  Further, "=C2=A0 =C2=A0" 
>translates to "   ".

I was speaking poetically. Perhaps "the mail software he uses was
written by morons" is clearer.

>Some of the reading that I did indicates that many things, HTML 
>included, use white space compaction (by default), which means that 
>multiple white space characters are reduced to a single white space 

Oh yes, tell me about the html 'there is no such thing as hard formatting
and you can't have any even when you want it' concept. Thank you Tim Berners-Lee.

>  So, when Ed wants multiple white spaces, his MUA has to do 
>something to state that two consecutive spaces can't be compacted. 
>Hence the non-breaking space.

Except that 'non-breaking space' is mostly about inhibiting line wrap at
that word gap. But anyway, there's little point trying to psychoanalyze
the writers of that software. Probably involved pointy-headed bosses.
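
For anyone who wants to check the bytes themselves, here is a minimal sketch, assuming Python 3 and only its standard library. The helper name describe() is mine and has nothing to do with the mail software in question. It decodes the quoted-printable sequence quoted above and names each resulting character:

```python
import quopri
import unicodedata

def describe(qp_bytes):
    """Decode quoted-printable bytes as UTF-8 and name each character."""
    text = quopri.decodestring(qp_bytes).decode("utf-8")
    return text, [unicodedata.name(ch) for ch in text]

# The sequence seen in the message source: NBSP, plain space, NBSP.
text, names = describe(b"=C2=A0 =C2=A0")
for ch, name in zip(text, names):
    print(f"U+{ord(ch):04X} {name}")
```

This should list a NO-BREAK SPACE, an ordinary SPACE, then another NO-BREAK SPACE, i.e. three characters built from five bytes.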

>As stated in another reply, I don't think ASCII was ever trying to be 
>the Babel fish.  (Thank you Douglas Adams.)

Of course not. It was for American English only. This is one of the major
points of failure in the history of information processing.

>> Takeaway: Ed, one space is enough. I don't know how you got the idea 
>> people might miss seeing a single space, and so you need to type two or 
>> more.
>I wondered if it wasn't a typo or keyboard sensitivity issue.  I 
>remember I had to really slow down the double click speed for my grandpa 
>(R.I.P.) so that he could use the mouse.  Maybe some users actuate keys 
>slowly enough that the computer thinks that it's repeated keys.  ¯\_(ツ)_/¯

Well now he's flaunting it in his latest posts. Never mind. :)

>> And since plain ASCII is hard-formatted, extra spaces are NOT ignored 
>> and make for wider spacing between words.
>It seems as if you made an assumption.  Just because the underlying 
>character set is ASCII (per RFC 821 & 822, et al) does not mean that the 
>data that they are carrying is also ASCII.  As is evident by the 
>Content-Type: header stating the character set of UTF-8.

Containing extended Unicode character sets via UTF-8, doesn't make it a
non-hard-formatted medium. In ASCII a space is a space, and multi-spaces
DON'T collapse. White space collapse is a feature of html, and whether an
email is html or not is determined by the sending utility.
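
To make the difference concrete, here is a rough sketch in Python 3. The function collapse_like_html() is a hypothetical stand-in of mine that only approximates html's default rule (runs of ASCII white space become one space); it is not how any real browser or mail client is implemented. Plain text keeps every space; the html-style pass does not, and it leaves U+00A0 alone, which is presumably why mail software reaches for it:

```python
import re

def collapse_like_html(s):
    """Approximate html's default handling: each run of ASCII
    white space (space, tab, LF, CR, FF) becomes a single space.
    U+00A0 is deliberately not in the set, so it survives."""
    return re.sub(r"[ \t\n\r\f]+", " ", s)

plain = "Which  looks    very       odd"
collapsed = collapse_like_html(plain)

# NBSP alternating with ordinary spaces is left untouched.
padded = "a \u00a0 b"
padded_after = collapse_like_html(padded)

print(repr(plain))
print(repr(collapsed))
print(repr(padded_after))
```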

>Especially when textual white space compression does exactly that, 
>ignore extra white spaces.
>> Which  looks    very       odd, even if your mail utility didn't try to 
>> do something 'special' with your unusual user input.

As you see, this IS NOT HTML, since those extra spaces and your diagram below would
have collapsed if it was html. Also saving it as text and opening in a plain text editor
or hex editor absolutely reveals what it is.

>I frequently use multiple spaces with ASCII diagrams.
>| This |
>|  is  |
>|   a  |
>|  box |

>> Btw, I changed the subject line, because this is a wider topic. I've been 
>> meaning to start a conversation about the original evolution of ASCII, 
>> and various extensions. Related to a side project of mine.
>I'm curious to know more about your side project.

Hmm... the problem is it's intended to be serious, but is still far from exposure-ready.
So if I talk about it now, I risk having specific terms I've coined in the doco (including
the project name) getting meme-jammed or trademarked by others. The plan is to release it
all in one go, eventually. Definitely will be years before that happens, if ever.

However, here's a cut-n-paste (in plain text) of a section of the Introduction (html with diags.)
Almost always, a first attempt at some unfamiliar, complex task produces a less than optimal result. Only with the knowledge gained from actually doing a new thing, can one look back and see the mistakes made. It usually takes at least one more cycle of doing it over from scratch to produce something that is optimal for the needs of the situation. Sometimes, especially where deep and subtle conceptual innovations are involved, it takes many iterations.

Human development of computing science (including information coding schemes) has been effectively a 'first time effort', since we kept on developing new stuff built on top of earlier work. We almost never went back to the roots and rebuilt everything, applying insights gained from the many mistakes made.

In reviewing the evolution of information coding schemes since very early stages such as the Morse code, telegraph signal recording, typewriters, etc, through early computing systems, mass data storage and file systems, computer languages from Assembler through Compilers and Interpreters, and so on, several points can be identified at which early (inadequate) concepts became embedded then used as foundations for further developments. This made the original concepts seem like fundamentals, difficult to question (because they are underlying principles for much more complex later work), and virtually impossible to alter (due to the vast amounts of code dependent on them.)

And yet, when viewed in hindsight many of the early concepts are seriously flawed. They effectively hobbled all later work dependent on them.

Examples of these pivotal conceptual errors:

Defects in the ASCII code table. This was a great improvement at the time, but fails to implement several utterly essential concepts. The lack of these concepts in the character coding scheme underlying virtually all information processing since the 1960s, was unfortunate. Just one (of many) bad consequences has been the proliferation of 'patch-up' text coding schemes such as proprietary document formats (MS Word, for example), postscript, pdf, html (and its even more nutty academia-gone-mad variants like XML), UTF-8, Unicode and so on.


This is a scan from the 'Recommended USA Standard Code for Information Interchange (USASCII) X3.4 - 1967'
The hex digits A-F on rows 10-15 were added here. Hexadecimal notation was not commonly in use in the 1960s.
Fig. ___ The original ASCII definition table.

ASCII's limitations were so severe that even the text (ie ASCII) program code source files used by programmers to develop literally everything else in computing science, had major shortcomings and inconveniences.

A few specific examples of ASCII's flaws:

    Missing concept of control vs data channel separation. And so we needed the "< >" syntax of html, etc.
    Inability to embed meta-data about the text in standard programmatically accessible form.
    Absence of anything related to text adornments, ie italics, underline and bold. The most basic essentials of expressive text, completely ignored.
    Absence of any provision for creative typography. No awareness of fonts, type sizes, kerning, etc.
    Lack of logical 'new line', 'new paragraph' and 'new page' codes.
    Inadequate support of basic formatting elements such as tabular columns, text blocks, etc.
    Even the extremely fundamental and essential concept of 'tab columns' is improperly implemented in ASCII, hence almost completely dysfunctional.
    No concept of general extensible-typed functional blocks within text, with the necessary opening and closing delimiters.
    Missing symmetry of quote characters. (A consequence of the absence of typed functional blocks.)
    No provision for code commenting. Hence the gaggle of comment delimiting styles in every coding language since. (Another consequence of the absence of typed functional blocks.)
    No awareness of programmatic operations such as Inclusion, Variable substitution, Macros, Indirection, Introspection, Linking, Selection, etc.
    No facility for embedding of multi-byte character and binary code sequences.
    Missing an informational equivalent to the pure 'zero' symbol of number systems. A specific "There is no information here" symbol. (The NUL symbol has other meanings.) This lack has very profound implications.
    No facility to embed multiple data object types within text streams.
    No facility to correlate coded text elements to associated visual typographical elements within digital images, AV files, and other representational constructs. This has crippled efforts to digitize the cultural heritage of humankind.
    Non-configurable geometry of text flow, when representing the text in 2D planes. (Or 3D space for that matter.)
    Many of the 32 'control codes' (characters 0x00 to 0x1F) were allocated to hardware-specific uses that have since become obsolete and fallen into disuse. Leaving those codes as a wasted resource.
    ASCII defined only a 7-bit (128 codes) space, rather than the full 8-bit (256 codes) space available with byte sized architectures. This left the 'upper' 128 code page open to multiple chaotic, conflicting usage interpretations. For example the IBM PC code page symbol sets (multiple languages and graphics symbols, in pre-Unicode days) and the UTF-8 character bit-size extensions.
    Inability to create files which encapsulate the entirety of the visual appearance of the physical object or text which the file represents, without dependence on any external information. Even plain ASCII text files depend on the external definition of the character glyphs that the character codes represent. This can be a problem if files are intended to serve as long term historical records, potentially for geological timescales. This problem became much worse with the advent of the vast Unicode glyph set, and typeset formats such as PDF. The PDF 'archival' format (in which all referenced fonts must be defined in the file) is a step in the right direction, except that the format standard is still proprietary and not available for free.
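
The 'upper 128' point in the list above is easy to demonstrate. A minimal sketch, assuming Python 3 and its bundled codecs; the helper decode_or_none() is a name of mine, made up for illustration. The same single byte 0xE9 reads as a different character under the IBM PC code page 437 and under Latin-1, and is not valid at all on its own as ASCII or UTF-8:

```python
def decode_or_none(data, encoding):
    """Return the decoded text, or None if the bytes are invalid
    in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

byte = b"\xe9"  # one byte from the 'upper' half, above 0x7F

results = {
    enc: decode_or_none(byte, enc)
    for enc in ("ascii", "cp437", "latin-1", "utf-8")
}

for enc, ch in results.items():
    shown = "invalid" if ch is None else f"U+{ord(ch):04X} {ch}"
    print(f"{enc:8} {shown}")
```

Nothing in the byte itself says which reading was intended; that has to come from outside the text, which is the complaint.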

Sorry to be a tease.

Soon I'd like to have a discussion about the functional evolution of
the various ASCII control codes, and how they are used (or disused) now. 
But am a bit too busy atm to give it adequate attention.

