Text encoding Babel. Was Re: George Keremedjiev

Liam Proven lproven at gmail.com
Wed Nov 28 06:32:44 CST 2018


On Tue, 27 Nov 2018 at 20:47, Grant Taylor via cctalk
<cctalk at classiccmp.org> wrote:
>
> I don't think that HTML can reproduce fixed page layout like PostScript
> and PDF can.  It can make a close approximation.  But I don't think HTML
> can get there.  Nor do I think it should.

There is a wider panoply of options to consider.

For instance, Display PostScript and, come to that, arguably, NeWS.

Also, modern document-specific markups. I work in DocBook XML, which I
dislike intensely.

There are also, at another extreme, AsciiDoc, Markdown (in its
various "flavours"), reStructuredText, and similar "lightweight"
markup languages:

http://hyperpolyglot.org/lightweight-markup

But there are, of course, rivals. DITA is also widely used.

And of course there are things like LyX/LaTeX/TeX, which some find
readable. I am not one of them. But I get paid to do DocBook; I don't
get paid to do TeX.

Neal Stephenson's highly enjoyable novel /Seveneves/ contains some
interesting speculations on the future of the Roman alphabet and what
prolonged close contact with Cyrillic will do to it.

Aside:

[[

> I'm not personally aware of any cases where ASCII limits programming
> languages.  But my ignorance does not preclude that situation from existing.

APL and ColorForth, as others have pointed out.

> I have long wondered if there are computer languages that aren't rooted
> in English / ASCII.

https://en.wikipedia.org/wiki/Qalb_(programming_language)

More generally:

https://en.wikipedia.org/wiki/Non-English-based_programming_languages

Personally I am more interested in non-*textual* programming
languages. A trivial candidate is Scratch:

https://scratch.mit.edu/

But ones that entirely subvert the model of using linear files
containing characters that are sequentially interpreted are more
interesting to me. I blogged about one family I just discovered last
week:

https://liam-on-linux.livejournal.com/60054.html

The videos are more or less _necessary_ here, because trying to
describe this in text will fail _badly_. Well worth a couple of hours
of anyone's time.

]]

Anyway. To return to text encodings.

Again I wish to refer to a novel: Kim Stanley Robinson's "Mars
trilogy" of /Red Mars/, /Green Mars/ and /Blue Mars/. Or, as a friend
called them, "RGB Mars" or even "Technicolor Mars".

A character presents an argument that if you try to summarise things
on a single linear scale -- for text encodings, say, from simplicity
and readability at one end to complexity and capability at the other
-- you can't encapsulate any sophisticated system.

He urges a four-cornered system instead, using the example of the
"four humours": phlegm, black bile, choler and blood. The opposed
corners of the diagram are as important as the sides of the square;
characteristics form the corners, but the intersections between them
are what define us.

So. There is more than one scale here.

At one extreme, we could have the simplest possible text encoding.
Something like Morse code or Braille, which omit almost all "syntax":
almost no punctuation, no carriage returns or anything like that.
Those things are _metadata_ -- information about how to display the
content, not content itself. Not even case is encoded: no capitals,
no minuscule letters. But of course a number of alphabets don't have
that distinction anyway, and it's not essential even in the Roman
alphabet.
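
Just to make "omits almost all syntax" concrete, here is a minimal
Python sketch -- mine, not anything standardised: 26 Roman letters
only, case folded away, punctuation and layout simply dropped.

# A deliberately lossy encoding in the Morse spirit: 26 letters,
# no case, no punctuation, no layout metadata.
MORSE = {
    'A': '.-',   'B': '-...', 'C': '-.-.', 'D': '-..',  'E': '.',
    'F': '..-.', 'G': '--.',  'H': '....', 'I': '..',   'J': '.---',
    'K': '-.-',  'L': '.-..', 'M': '--',   'N': '-.',   'O': '---',
    'P': '.--.', 'Q': '--.-', 'R': '.-.',  'S': '...',  'T': '-',
    'U': '..-',  'V': '...-', 'W': '.--',  'X': '-..-', 'Y': '-.--',
    'Z': '--..',
}

def encode(text):
    # Fold case; silently drop anything the alphabet cannot express.
    return ' '.join(MORSE[c] for c in text.upper() if c in MORSE)

print(encode("Hello, World!"))
# .... . .-.. .-.. --- .-- --- .-. .-.. -..

The comma, the space and the exclamation mark vanish without trace:
they were never part of the encoding's universe.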

Slightly richer, but littered with historical baggage from its origins
in teletypes: ASCII.

Much richer, but still not rich enough for all the
Roman-alphabet-using languages: "ANSI" (in practice, 8-bit code pages
such as Windows-1252).

Insanely rich, but still not rich enough for all the written
languages: Unicode. (What plane? What encoding? What version, even?)
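
Here is that ladder as a few lines of Python. The sample string and
the choice of cp1252 to stand in for "ANSI" are mine:

text = "naïve café Δ"

# ASCII: 7 bits, Roman letters only; anything beyond fails outright.
try:
    text.encode("ascii")
except UnicodeEncodeError as e:
    print("ASCII:", e)

# "ANSI" (Windows-1252 here): one byte per character, enough for
# some Western European Latin, but the Greek delta is out of reach.
try:
    text.encode("cp1252")
except UnicodeEncodeError as e:
    print("cp1252:", e)

# Unicode covers it all, but "what encoding?" is a real question:
print(text.encode("utf-8"))   # one to four bytes per character
print(text.encode("utf-16"))  # two or four bytes, plus a BOM

# And "the same" text can still differ at the code-point level:
import unicodedata
nfc = unicodedata.normalize("NFC", "café")  # é as one code point
nfd = unicodedata.normalize("NFD", "café")  # e + combining acute
print(nfc == nfd, len(nfc), len(nfd))       # False 4 5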

At the other extreme are markup languages that either weren't really
intended for humans but often are written by them -- e.g. the
SGML/XML family -- or are usable only by relatively few humans --
e.g. the TeX family -- or are almost never written by humans at all,
e.g. PostScript or HP PCL.

And then there's what I find a fairly happy medium: AsciiDoc, say.
It's perfectly readable by untrained people as plain ASCII, it can be
written with mere hours of study, if that, and it can also be
processed and rendered into something much prettier.

The richer the encoding, the harder it is for *humans* to read, and
the more complex the software to handle it needs to be.

So, yes, ASCII is perhaps too minimal. ANSI is just a superset.

But I'd argue that there _should_ be a separation between at least 2,
maybe 3 levels, and arguably more.

#1 Plain text encoding. Ideally able to handle all the characters in
all forms of the Latin alphabet, and single-byte based. Drop ASCII
legacy baggage such as backspace, bell, etc. (there's a sketch of
this filtering after the list).

#2 Richer text, with simple markup, but human-readable and
human-writable without needing much skill or knowledge. Along the
lines of Markdown or *traditional* /email/ _formatting_ perhaps.

#3 Formatted text, with embedded control codes. The Oberon OS does this.

#4 Full 1980s word-processor-style document, with control codes,
formatting, font and page layout features, etc.

#5 Number 4, plus embedded objects, graphics. I'm thinking PDF or the
like as my model.
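
To make #1 concrete, as promised above, here is a minimal Python
sketch of what "drop the legacy baggage" might mean. The whitelist is
my own guess at a sane minimum, not a proposal:

# Hypothetical "level #1" filter: keep printable characters plus the
# bare minimum of structure (newline, tab); drop the teletype-era C0
# controls -- BEL, BS, VT, FF, CR and friends -- and DEL.
KEEP = {"\n", "\t"}

def to_level_1(text):
    return "".join(
        c for c in text
        if c in KEEP or not (ord(c) < 0x20 or ord(c) == 0x7F)
    )

print(repr(to_level_1("ding\x07ding\x08!\r\nnext line")))
# 'dingding!\nnext line'

Whether CR stays or goes is exactly the sort of argument such a spec
would have to settle.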

Try to collapse all these into one and you're doomed.

-- 
Liam Proven - Profile: https://about.me/liamproven
Email: lproven at cix.co.uk - Google Mail/Hangouts/Plus: lproven at gmail.com
Twitter/Facebook/Flickr: lproven - Skype/LinkedIn: liamproven
UK: +44 7939-087884 - ČR (+ WhatsApp/Telegram/Signal): +420 702 829 053

