Text encoding Babel. Was Re: George Keremedjiev

Liam Proven lproven at gmail.com
Tue Nov 27 06:58:26 CST 2018


On Mon, 26 Nov 2018 at 15:21, Guy Dunphy via cctalk
<cctalk at classiccmp.org> wrote:

> Defects in the ASCII code table. This was a great improvement at the time, but it fails to implement several utterly essential concepts. The lack of these concepts in the character coding scheme underlying virtually all information processing since the 1960s was unfortunate. Just one (of many) bad consequences has been the proliferation of 'patch-up' text coding schemes such as proprietary document formats (MS Word for example), PostScript, PDF, HTML (and its even more nutty academia-gone-mad variants like XML), UTF-8, Unicode and so on.

This is fascinating stuff and I am very interested to see how it comes
out, but I think there is a problem here which I wanted to highlight.

The thing is this. You seem to be discussing what you perceive as
_general_ defects in ASCII, but I think they are not _general_
defects. They are specific to your purpose, and while I don't know
exactly what that purpose is, I suspect it is not a general,
universal goal.

Just consider what "A.S.C.I.I." stands for.

[1] It's American. Yes, it has lots of issues internationally, but it
does the job well for American English. As a native English speaker I
rue the absence of £, and the fact that Americans are so unfamiliar
with the symbol that they appropriated its name for the unrelated #,
which already had a perfectly good name of its own. But ASCII is
American, and Americans don't use £. Fine.
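To make that concrete, a quick Python sketch (illustrative only,
using Latin-1 as the example of a later 8-bit code): ASCII is a 7-bit
code, so there is simply no slot for £. The UK national variant of
ISO 646 (BS 4730) went so far as to swap # for £ at the very same
code point, 0x23, which caused its own brand of confusion.

    # ASCII is 7-bit: byte values 0x00-0x7F only, and £ has no slot.
    assert b'#'[0] == 0x23                  # '#' sits at 0x23 in ASCII
    print(bytes([0xA3]).decode('latin-1'))  # -> '£' (Latin-1 put it at 0xA3)
    try:
        bytes([0xA3]).decode('ascii')
    except UnicodeDecodeError as err:
        print(err)                          # 0xA3 is not valid ASCII at all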

[2] The "I.I." bit. Historical accidents aside, vestigial traces of
specific obsolete hardware implementations, it's _not a markup
language_. Its function is unrelated to those of HTML or XML or
anything like that. It's for "information interchange". That means
from computer or program to other computer or program. It's an
encoding and that's all. We needed a standard one. We got it. It has
flaws, many flaws, but it worked.
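What "just an encoding" means in practice: the whole standard is a
single 128-entry table, one block of control codes for signalling
between machines and one block of printable characters, nothing more.
A quick Python sketch:

    # The whole of ASCII is one 128-entry table: 33 control codes for
    # machine-to-machine signalling, 95 printable characters for text.
    controls  = [c for c in range(128) if c < 0x20 or c == 0x7F]
    printable = [c for c in range(128) if 0x20 <= c <= 0x7E]
    print(len(controls), len(printable))   # -> 33 95
    print(hex(ord('A')), chr(0x41))        # -> 0x41 A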

No, it doesn't contain æ and å and ä and ø and ö. That's a problem
for Scandinavians.

It doesn't contain š and č and ṡ and ý (among others) and that's a
problem for Roman-alphabet-using Slavs.

Even broadening the discussion to 8-bit ANSI...

It does have a very poor way of encoding é and à and so on, which
indicates the relative importance of Latin-language users in the
Americas, compared to Slavs and so on.
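To make that asymmetry concrete, a Python sketch using Latin-1
(ISO 8859-1) as the stand-in for 8-bit "ANSI":

    # Latin-1 found room for the Western European accented letters...
    print('é à'.encode('latin-1'))      # -> b'\xe9 \xe0'
    try:
        'š č'.encode('latin-1')         # ...but has no š or č at all
    except UnicodeEncodeError as err:
        print(err)
    # The Slavic letters were pushed onto a separate code page entirely:
    print('š č'.encode('iso8859-2'))    # Latin-2 -> b'\xb9 \xe8'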

But markup languages, formatting, control signalling, all that sort
of stuff, is a separate discussion from encoding standards.

Try to bring them into the encoding layer and the problem explodes in
complexity and becomes insoluble.
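One way to see the layering (a Python sketch, not a claim about any
particular standard): markup rides on top of the encoding as ordinary
characters, so the two concerns never need to meet.

    # The encoding layer neither knows nor cares that this is markup;
    # the tags travel as ordinary characters in the same byte stream.
    plain  = 'hello'
    marked = '<b>hello</b>'
    print(plain.encode('ascii'))     # -> b'hello'
    print(marked.encode('ascii'))    # -> b'<b>hello</b>'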

It also makes a bit of a mockery of OSes focussed on raw text
streams, such as Unix; and while I am no great lover of Unix, it does
provide me with a job, and fewer headaches than Windows.

So, overall, all I wanted to say was: identify the problem domain
specifically, and how to separate it from other, *overlapping*
domains, before attacking ASCII for weaknesses that are not actually
weaknesses at all but are in fact strengths for a lot of its
use-cases.

That said, I'd really like to read more about this project. It looks
like it peripherally intersects with one of my own big ones.

-- 
Liam Proven - Profile: https://about.me/liamproven
Email: lproven at cix.co.uk - Google Mail/Hangouts/Plus: lproven at gmail.com
Twitter/Facebook/Flickr: lproven - Skype/LinkedIn: liamproven
UK: +44 7939-087884 - ČR (+ WhatsApp/Telegram/Signal): +420 702 829 053

