Text encoding Babel. Was Re: George Keremedjiev

Sean Conner spc at conman.org
Fri Nov 30 16:57:19 CST 2018


It was thus said that the Great Keelan Lightfoot via cctalk once stated:
> > I see no reason that we can't have new control codes to convey new
> > concepts if they are needed.
> 
> I disagree with this; from a usability standpoint, control codes are
> problematic. Either the user needs to memorize them, or software needs
> to inject them at the appropriate times. There's technical problems
> too; when it comes to playing back a stream of characters, control
> characters mean that it is impossible to just start listening. It is
> difficult to fast forward and rewind in a file, because the only way
> to determine the current state is to replay the file up to that point.

  [ and further down the message ... ]

> I'm going to lavish on the unicode for this example, so those of you
> properly unequipped may not see this example:
> 
> foo := 𝑡ℎ𝑖𝑠 𝑖𝑠 𝑎 𝑠𝑡𝑟𝑖𝑛𝑔 𝘁𝗵𝗶𝘀 𝗶𝘀 𝗮 𝗰𝗼𝗺𝗺𝗲𝗻𝘁
> printf(𝑡ℎ𝑒 𝑠𝑡𝑟𝑖𝑛𝑔 𝑖𝑠 ① 𝑖𝑠𝑛𝑡 𝑡ℎ𝑎𝑡 𝑒𝑥𝑐𝑖𝑡𝑖𝑛𝑔, foo)
> if 𝘁𝗵𝗶𝘀 𝗶𝘀 𝗮 𝗽𝗼𝗼𝗿𝗹𝘆 𝗽𝗹𝗮𝗰𝗲𝗱 𝗰𝗼𝗺𝗺𝗲𝗻𝘁 foo ==
> 𝑡ℎ𝑖𝑠 𝑖𝑠 𝑎𝑙𝑠𝑜 𝑎 𝑠𝑡𝑟𝑖𝑛𝑔, 𝑏𝑢𝑡 𝑛𝑜𝑡 𝑡ℎ𝑒 𝑠𝑎𝑚𝑒
> 𝑜𝑛𝑒 { 𝘁𝗵𝗶𝘀 𝗶𝘀 𝗮𝗹𝘀𝗼 𝗮 𝗰𝗼𝗺𝗺𝗲𝗻𝘁
> ...
> 
> An atrocious example, but a good demonstration of my point. If I had a
> toggle switch on my keyboard to switch between code, comment and
> string, it would have been much simpler to construct too!

  Somehow, the compiler will have to know that "𝑡ℎ𝑖𝑠 𝑖𝑠 𝑎 𝑠𝑡𝑟𝑖𝑛𝑔" is a
string while "𝘁𝗵𝗶𝘀 𝗶𝘀 𝗮 𝗰𝗼𝗺𝗺𝗲𝗻𝘁" is a comment to be ignored.  You lamented
the lack of a toggle switch for the two, but existing languages, like C,
already have them: '"' is the "toggle" for strings, while '/*' and '*/' are
the toggles for comments (and now '//' if you are using C99).  It's still
something you have to "type" (or "toggle" or "switch" or somehow indicate
the mode).

  The other issue is how such information is stored, and there, I only see
two solutions---in-band and out-of-band.  In-band would be included with the
text.  Something along the lines of (where <ESC> is the ASCII ESC character,
code 27, and this is an example only):

	foo := <ESC>_this is a string<ESC>\ <ESC>^this is a comment<ESC>\
	printf(<ESC>_the string is <ESC>[1p isn't that exciting<ESC>\,foo)
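
  As a rough sketch of what a reader of such a stream would have to do (the
function name, the mode names and the reading of `<ESC>[1p` as "replacement
number 1" are assumptions made for illustration, not an established format):

```python
ESC = "\x1b"  # ASCII 27

def parse_in_band(stream):
    """Split a stream into (mode, text) runs by tracking ESC toggles.

    Hypothetical convention taken from the example above:
      ESC _        start of a string
      ESC ^        start of a comment
      ESC \\       back to plain code
      ESC [ <n> p  replacement slot number <n>
    """
    runs, text, mode = [], [], "code"
    i, n = 0, len(stream)
    while i < n:
        if stream[i] != ESC or i + 1 >= n:
            text.append(stream[i])
            i += 1
            continue
        if text:  # close the run that was in progress
            runs.append((mode, "".join(text)))
            text = []
        intro = stream[i + 1]
        if intro == "[":
            j = i + 2
            while j < n and stream[j].isdigit():
                j += 1
            runs.append(("replacement", stream[i + 2:j]))
            i = j + 1  # skip the final 'p'
        else:
            mode = {"_": "string", "^": "comment", "\\": "code"}.get(intro, mode)
            i += 2
    if text:
        runs.append((mode, "".join(text)))
    return runs

line1 = ("foo := " + ESC + "_this is a string" + ESC + "\\ "
         + ESC + "^this is a comment" + ESC + "\\")
print(parse_in_band(line1))

# Starting in the middle loses the mode: the tail of the string reads as code.
middle = line1.index("a string")
print(parse_in_band(line1[middle:])[0])
```

  The second print is the interesting one: with no way to know that a string
was already open, the parser can only assume the default mode.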

  But this has a problem you noted above---it's a lot harder to seek through
the file to arbitrary positions.  Grant Taylor stated another way of doing
this:

> What if there were (functionally) additional bits that indicated various
> other (what I was calling) stylings?
> 
> I think that something along those lines could help avoid a concern I
> have.  Namely how do I search for an A, whatever "style" it's in.  I
> think I could hypothetically search for bytes ~> words (characters)
> containing (xxxxxxxx xxxxxxxx) (xxxxxxxx) 01x00001 (assuming that the
> proceeding don't cares are set appropriately) and find any format of A,
> upper case, lower case, bold, italic, underline, strike through, etc.

  There are several problems with this.  First, how many bits do you set
aside per character?  8?  16?  There is potentially an open-ended set of
stylings that one might use.  Second, where do you store such bits?  Not to
imply this is a bad idea, just that there are issues that need to be
resolved with how things are done today (how does this interact with UTF-8
for instance?  Or UCS-4?).
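
  To make the masked search concrete, here is a minimal sketch.  It assumes
an 8-bit ASCII character in the low byte of each unit with style flags in the
bits above it; the flag names, the bit positions and both helper functions
are invented for illustration and answer none of the questions above about
how many bits there should be or where they would live.

```python
# Hypothetical style flags stored above the 8 character bits.
BOLD = 1 << 8
ITALIC = 1 << 9
UNDERLINE = 1 << 10

CHAR_MASK = 0xFF   # low byte holds the character
CASE_BIT = 0x20    # the 'x' in 01x00001: upper vs lower case in ASCII

def styled(text, style=0):
    """Encode ASCII text as a list of units carrying the given style bits."""
    return [b | style for b in text.encode("ascii")]

def find_letter(units, letter):
    """Indexes of every unit holding `letter`, ignoring case and style.

    Only meaningful for ASCII letters: clearing CASE_BIT would also fold
    unrelated punctuation and control codes together.
    """
    target = ord(letter) & CHAR_MASK & ~CASE_BIT
    return [i for i, u in enumerate(units)
            if (u & CHAR_MASK & ~CASE_BIT) == target]

units = (styled("An ") + styled("a", BOLD) + styled("b", ITALIC)
         + styled("A", BOLD | UNDERLINE))
print(find_letter(units, "A"))
```

  The search itself is cheap; the open questions are the width of each unit
and how such units would coexist with a variable-length encoding like UTF-8.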

  Then there's out-of-band storage, which stores such information outside the
text (an example---I'm not saying this is the only way to store such
information out-of-band):

	foo := this is a string this is a comment
	printf(the string is 1 isn't that exciting,foo)

	---

	string 8-23
	string 50-63
	string 65-84
	replacement 64
	comment 25-41
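
  A small sketch of how a tool might consume such a table.  It assumes the
offsets are 1-based and inclusive, that the newline between the two lines
counts as one character, and that the leading tabs are not part of the text;
the names `TEXT`, `SPANS`, `kind_at` and `extract` are hypothetical.

```python
TEXT = ("foo := this is a string this is a comment\n"
        "printf(the string is 1 isn't that exciting,foo)")

# (kind, first, last) with 1-based inclusive offsets, as in the table above.
SPANS = [
    ("string", 8, 23),
    ("string", 50, 63),
    ("string", 65, 84),
    ("replacement", 64, 64),
    ("comment", 25, 41),
]

def kind_at(pos, spans=SPANS):
    """Classify the character at 1-based offset `pos` without reading the text."""
    for kind, first, last in spans:
        if first <= pos <= last:
            return kind
    return "code"

def extract(text, spans):
    """Return (kind, substring) for each span, in order of appearance."""
    ordered = sorted(spans, key=lambda span: span[1])
    return [(kind, text[first - 1:last]) for kind, first, last in ordered]

print(kind_at(30))
for kind, chunk in extract(TEXT, SPANS):
    print(kind, repr(chunk))
```

  Here the mode at any offset is a table lookup, so seeking is no longer a
problem; the cost is that every edit to the text has to update the table too.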

  This has its own problems---namely, how do you keep the two together?  It
will either be a separate file, which could get separated, or part of the
text file, but then you run into the problem of reading Microsoft Word files
circa 1986 with today's tools.

  -spc (I like the ideas, but the implementations are harder than they first
	appear ... )

