Text encoding Babel. Was Re: George Keremedjiev

Grant Taylor cctalk at gtaylor.tnetconsulting.net
Sun Nov 25 23:52:18 CST 2018


Not to beat a dead horse, but I ran across "Â Â Â  " in a text file when 
read via a web browser this evening and wanted to share my findings as 
they seemed timely.

On 11/22/18 5:55 PM, Guy Dunphy via cctalk wrote:
> Anyway, I was wondering how Ed's emails (and sometimes others elsewhere) 
> acquired that odd corruption.

IMHO it's not corruption as much as it is incompatibility.

> Answer: Ed's email util … interpret the user typing space twice in 
> succession, as meaning "I really, really want there to be a space here, 
> no matter what." So it inserts a 'no-break space' unicode character, 
> which of course requires a 2-byte UTF-8 encoding.

What I'm not sure of is how the 0xC2 0xA0 translates to 0xC3 0x82 that 
is the Â character.

I think that the 0xC2 0xA0 pair is treated as two independent 
characters.  Thus 0xC2 is "Â", and 0xA0 is a non-breaking space.

I don't know what happens to the non-breaking space, but the Â and the 
space (0x20) that follows 0xC2 0xA0 (the three byte sequence being 0xC2 
0xA0 0x20) are kept and become "Â ", which is what we see in reply 
text.  (Encoded as 0xC3 0x82 0x20.)

So, arguably, improperly processed / translated text that results in 
0xC3 0x82 0x20 / "Â " should have been a non-breaking space followed by 
a space.
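
A minimal Python sketch of that misreading, in case it helps.  It 
assumes the offending software decodes the bytes as ISO-8859-1 
(Latin-1); I don't know which single-byte character set was actually 
involved, and the variable names are mine.

```python
# A non-breaking space (U+00A0) followed by a plain space, as UTF-8 bytes.
original = "\u00a0 ".encode("utf-8")          # b"\xc2\xa0\x20"

# Assumption: the reader treats every byte as one Latin-1 character.
misread = original.decode("latin-1")          # "Â", NBSP, space

# If that misread text is then saved as UTF-8 again, the "Â" becomes
# the two bytes 0xC3 0x82 and the NBSP becomes 0xC2 0xA0 once more.
reencoded = misread.encode("utf-8")

print([hex(b) for b in original])
print([f"U+{ord(c):04X}" for c in misread])
print([hex(b) for b in reencoded])
```

The NBSP is still in the misread text (U+00A0), it just isn't visible, 
which would explain why only the "Â" and a space show up.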

This jibes with both Ed's email and the document that I was reading that 
prompted this email.

> Then adds a plain ASCII space 0x20 just to be sure.

I don't think it's adding a plain ASCII space 0x20 just to be sure. 
Looking at the source of the message, I see =C2=A0, which is the UTF-8 
representation followed by the space.  My MUA that understands UTF-8 
shows that "=C2=A0 " translates to "  ".  Further, "=C2=A0 =C2=A0" 
translates to "   ".
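
For anyone who wants to check this without a UTF-8 aware MUA, here is 
a small Python sketch using the standard quopri module.  The sample 
string is the "=C2=A0 =C2=A0" from above, not a copy of Ed's actual 
message.

```python
import quopri

# Quoted-printable text as it appears in the raw message source.
raw = b"=C2=A0 =C2=A0"

# Undo the quoted-printable transfer encoding, then decode as UTF-8.
decoded_bytes = quopri.decodestring(raw)
text = decoded_bytes.decode("utf-8")

# Show each resulting character as a code point.
print([f"U+{ord(c):04X}" for c in text])
```

That gives three characters: NBSP, space, NBSP.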

Some of the reading that I did indicates that many things, HTML 
included, use white space compaction (by default), which means that 
multiple white space characters are reduced to a single white space 
character.  So, when Ed wants multiple white spaces, his MUA has to do 
something to state that two consecutive spaces can't be compacted. 
Hence the non-breaking space.

=C2=A0 quite literally translates to a space character that can't be 
compacted.  Thus "=C2=A0 =C2=A0" is really "<space> <space>" or "   ".

Multiple successive spaces will need to be a mixture of space and 
non-breaking space characters.

So, the plain ASCII space 0x20 after (or before) =C2=A0 is not there 
just to be sure.
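
A rough Python sketch of the compaction, as I understand it.  The 
function name is made up, and the regular expression is only an 
approximation of what a browser does with default white space 
handling, not a faithful implementation of it.

```python
import re

def compact_whitespace(text: str) -> str:
    """Collapse runs of space, tab, LF, FF and CR into a single space.

    NBSP (U+00A0) is deliberately not in the set, so it never collapses.
    """
    return re.sub(r"[ \t\n\f\r]+", " ", text)

plain = "a   b"              # three ordinary spaces
mixed = "a\u00a0 \u00a0b"    # NBSP, space, NBSP

print(repr(compact_whitespace(plain)))   # the three spaces become one
print(repr(compact_whitespace(mixed)))   # all three characters survive
```

Note that Python's own str.split() treats NBSP as white space, so it 
is not a good stand-in for this kind of compaction.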

> Personally I find it more interesting than annoying. Just another example 
> of the gradual chaotic devolution of ASCII, into a Babel of incompatible 
> encodings. Not that ASCII was all that great in the first place.

As stated in another reply, I don't think ASCII was ever trying to be 
the Babel fish.  (Thank you Douglas Adams.)

> Takeaway: Ed, one space is enough. I don't know how you got the idea 
> people might miss seeing a single space, and so you need to type two or 
> more.

I wondered if it wasn't a typo or keyboard sensitivity issue.  I 
remember I had to really slow down the double click speed for my grandpa 
(R.I.P.) so that he could use the mouse.  Maybe some users actuate keys 
slowly enough that the computer thinks that it's repeated keys.  ¯\_(ツ)_/¯

> But it isn't so. The normal convention in plain text is one space 
> character between each word.

The operative word is "convention", as in commonly accepted behavior 
that is not always the case.  ;-)

> And since plain ASCII is hard-formatted, extra spaces are NOT ignored 
> and make for wider spacing between words.

It seems as if you made an assumption.  Just because the underlying 
character set is ASCII (per RFC 821 & 822, et al.) does not mean that 
the data being carried is also ASCII, as is evident from the 
Content-Type: header stating a character set of UTF-8.

Especially when textual white space compression does exactly that, 
ignore extra white spaces.
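
To illustrate, a short Python sketch with the standard email package. 
The message below is a made-up two-header example, not Ed's actual 
email: every byte on the wire is ASCII, yet the body decodes to text 
that is not.

```python
from email import message_from_string

# Hypothetical message: pure ASCII on the wire, UTF-8 once decoded.
raw_message = (
    "Content-Type: text/plain; charset=UTF-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "two=C2=A0 spaces\n"
)

msg = message_from_string(raw_message)
charset = msg.get_content_charset()                  # from Content-Type:
body = msg.get_payload(decode=True).decode(charset)  # undo QP, then UTF-8

print(raw_message.isascii(), charset, body.isascii())
print([f"U+{ord(c):04X}" for c in body.rstrip("\n")])
```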

> Which  looks    very       odd, even if your mail utility didn't try to 
> do something 'special' with your unusual user input.

I frequently use multiple spaces with ASCII diagrams.

+------+
| This |
|  is  |
|   a  |
|  box |
+------+

That will not look like I intended it with white space compression.
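
Here is roughly what I mean, as a Python sketch.  It collapses runs of 
spaces and tabs one line at a time, which is a simplification (a 
browser would fold the line breaks too), and the names are mine.

```python
import re

box = [
    "+------+",
    "| This |",
    "|  is  |",
    "|   a  |",
    "|  box |",
    "+------+",
]

# Collapse each run of spaces / tabs to one space, line by line.
compacted = [re.sub(r"[ \t]+", " ", line) for line in box]

for line in compacted:
    print(line)
```

The middle rows come out at different widths, so the right-hand edge 
of the box no longer lines up.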

> Btw, I changed the subject line, because this is a wider topic. I've been 
> meaning to start a conversation about the original evolution of ASCII, 
> and various extensions. Related to a side project of mine.

I'm curious to know more about your side project.



-- 
Grant. . . .
unix || die

