Text encoding Babel. Was Re: George Keremedjiev

Guy Dunphy guykd at optusnet.com.au
Tue Nov 27 04:05:46 CST 2018


At 09:49 PM 26/11/2018 -0700, Grant wrote:
>On 11/26/18 7:21 AM, Guy Dunphy wrote:

>> Oh yes, tell me about the html 'there is no such thing 
>> as hard formatting and you can't have any even when 
>> you want it' concept. Thank you Tim Berners Lee.
>
>I've not delved too deeply into the lack of hard formatting in HTML.

It was a core part of the underlying philosophy: html would NOT allow
any kind of fixed formatting. The reasoning was that it could be displayed
on any kind of system, so it had to be free-format and quite abstract.
Which is great, until you actually want to represent a real printed page,
or book, the way Postscript can. Thus html was doomed to be inadequate for
capture of printed works. That was a disaster. There wasn't any real reason
it could not be both. Just an academic's insistence on enforcing his ideology.
Then of course, over time html has morphed to include SOME forms of absolute
layout, because there was real demand for that. But the result is a hodge-podge.


>
>I've also always considered HTML to be what you want displayed, with 
>minimal information about how you want it displayed.  IMHO CSS helps 
>significantly with the latter part.

Yes, it should be capable of that. But not enforce 'only that way'.
By 'html' I mean the kludge of html-css-js. The three-cat herd. (Ignoring all the _other_ web cats.)
Now it's way too late to fix it properly with patches.


>> Except that 'non-breaking space' is mostly about inhibiting line wrap at 
>> that word gap.
>
>I wouldn't have thought "mostly" or "inhibiting line wrap".  I view the 
>non-breaking space as a way to glue two parts of text together and treat 
>them as one unit, particularly for display and partially for selection.
>Granted, much of the breaking is done when the text can not continue (in 
>its natural direction), frequently needing to start anew on the next line.

And that's why in html that character is written "&nbsp;"
You just rephrased my 1.2 lines as 5 lines.
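
To be concrete (a quick Python sketch; the code point and entity name are the
standard ones, the rest is just illustration):

# Ordinary space vs. non-breaking space (U+00A0, written "&nbsp;" in html).
# A renderer may wrap or collapse runs of U+0020, but is expected to keep
# the two words joined by U+00A0 together on one line.
ordinary = "10 kg"          # U+0020 between the parts
glued    = "10\u00A0kg"     # U+00A0 (no-break space) between the parts

print(ordinary.encode("utf-8"))   # b'10 kg'
print(glued.encode("utf-8"))      # b'10\xc2\xa0kg' - two bytes in UTF-8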


>> But anyway, there's little point trying to psychoanalyze the writers of 
>> that software. Probably involved pointy-headed bosses.
>
>I like to understand why things have been done the way they were. 
>Hopefully I can learn from the reasons.

We already established that they thought it a good idea to insert fancy 'no-break'
coding if the user typed two spaces. They thought they were adding a useful feature.
I meant there's no point trying to determine why they were so deluded, and failed to
recognise that maybe some users (Ed) would want to just type two spaces.

>
>> Of course not. It was for American English only. This is one of the 
>> major points of failure in the history of information processing.
>
>Looking backwards, (I think) I can understand why you say that.  But 
>based on my (possibly limited) understanding of the time, I think that 
>ASCII was one of the primordial building blocks that was necessary.

YES! I'm not arguing ASCII was _bad_. It was a great advance. There was
no way they could have included the experience of 50 more years of comp-sci.
And now 'we' (the world) are stuck with it for legacy compatibility reasons.
Any extensions have to be retro-compatible.

 [snip]

>> Containing extended Unicode character sets via UTF-8, doesn't make it a 
>> non-hard-formatted medium. In ASCII a space is a space, and multi-spaces 
>> DON'T collapse. White space collapse is a feature of html, and whether 
>> an email is html or not is determined by the sending utility.
>
>Having read the rest of your email and now replying, I feel that we may 
>be talking about two different things.  One being ASCII's standard 
>definition of how to represent different letters / glyphs in a 
>consistent binary pattern.

That's what you are talking about.

> The other being how information is stored in an (un)structured sequence
> of ASCII characters.

What I'm talking about is not that. It's about how to create a coding scheme
that serves ALL the needs we are now aware of. (Just one of which is for old
ASCII files to still make sense.) This involves both re-definition of some
of the ASCII control codes, AND defining sequential structure standards.
For example, UTF-8 is a sequential structure. So are all the html and css codings,
all programming languages, etc. There's a continuum of encoding...structure...syntax.
The ASCII standard didn't really consider that continuum.

 [snip]  ACK - ACK.

>> ----------
 [snip]
>> Human development of computing science (including information coding 
>> schemes) has been effectively a 'first time effort', since we kept on 
>> developing new stuff built on top of earlier work. We almost never went 
>> back to the roots and rebuilt everything, applying insights gained from 
>> the many mistakes made.
>
>With few notable (partial) exceptions, I largely agree.

Which exceptions would those be? (That weren't built on top of ASCII!)

  [big snip]

>> This is a scan from the 'Recommended USA Standard Code for Information 
>> Interchange (USASCII) X3.4 - 1967' The Hex A-F on rows 10-15, added 
>> here. Hexadecimal notation was not commonly in use in the 1960s.  Fig. ___ 
>> The original ASCII definition table.
>> 
>> ASCII's limitations were so severe that even the text (ie ASCII) program 
>> code source files used by programmers to develop literally everything 
>> else in computing science, had major shortcomings and inconveniences.
>
>I don't think I'm willing to accept that at face value.

I assume you're thinking that ASCII serves just fine for program source code?
This is a bandwagon/normalcy bias effect. "Everyone does it that way and always has,
so it must be good."
Sigh. Well, I can't go into that without revealing more than I wish to atm.


>> A few specific examples of ASCII's flaws:
>> 
>> · Missing concept of control vs data channel separation. And so we 
>> needed the "< >" syntax of html, etc.
>
>I don't buy that, at all.
>
>ASCII has control codes that I think could be (but aren't) used for 
>some of this.  Start of Text (STX) & End of Text (ETX), or Shift Out 
>(SO) & Shift In (SI), or Device Control 1 - 4 (DC1 - DC4), or File / 
>Group / Record / Unit Separators (FS / GS / RS / US) all come to mind.

You're making my point for me. Of course there are many ways to interpret
existing codes to achieve this effect. Some use control codes, others
overload functionality on printable characters. eg html with < and >.
My point is the base coding scheme doesn't allocate a SPECIFIC mechanism
for doing this. The result is a briar-patch of competing ad-hoc methods.
Hence the 'babel' I'm referring to, in every matter where ASCII didn't
define needed functionality.
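
For what it's worth, here's a minimal sketch (Python, purely illustrative) of
how the ASCII separator control codes *could* have carried simple record
structure, if any single interpretation had ever been mandated:

# RS (0x1E) between records, US (0x1F) between fields within a record.
# Nothing in the ASCII standard requires this reading - which is the point.
RS, US = "\x1e", "\x1f"

rows = [["name", "year"], ["ASCII", "1963"], ["UTF-8", "1993"]]

encoded = RS.join(US.join(fields) for fields in rows)
decoded = [record.split(US) for record in encoded.split(RS)]

assert decoded == rows
print(repr(encoded))   # the control codes are invisible on most displays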

>Either you're going to need two parallel byte streams, one for data and 
>another for control (I'm ignoring timing between them), -or- you're 
>going to need a way to indicate the need to switch between byte 
>(sub)streams in the overall byte (super)streams.  Much of what I've seen 
>is the latter.

By definition, in a single baseband data stream it's ALWAYS the case that
time-interleaving is the only way to achieve command/data separation.

>It just happens that different languages have decided to use different 
>(sequences of) characters / bytes to do this.  HTML (possibly all XML) 
>use "<" and ">".  ASP uses "<%" and "%>".  PHP uses "<?php" and "?>". 
>Apache HTTPD SSI uses "<!--#" and "-->".  I can't readily think of 
>others, but I know there are a plethora.  These are all signals to 
>indicate the switch between data and control stream.

Exactly. Because ASCII does not provide a specific coding. It didn't
occur to those drafting the standard. Same as with all the other...

>
>> · Inability to embed meta-data about the text in standard programmatically 
>> accessible form.
>
>I'll agree that there's no distinction of data, meta, or otherwise, in a 
>string of ASCII bytes.  But I don't expect there to be.

And so every different devel project that needed it, added some kludge on top.
This is what I'm saying: ASCII has no facility for this, but we need a basic
coding scheme that does (and is still ASCII-compatible.)

>Is there any distinction in the Roman alphabet (or any other alphabet in 
>this thread) to differentiate the sequence of bytes that makes up the 
>quote versus the metadata that is the name of the person that said the 
>quote?  Or what about the date that it was originally said?

Doesn't matter. The English alphabet (like any other human writing system)
naturally has no protocols to concisely represent data types. That's no reason
not to build such things into the character coding scheme used in computational
machinery, in a way we can read.
Like, for instance written decimal numbers, sci-notation, units, etc.
The written form is much more compact than the spoken forms.

>This is about the time that I really started to feel that you were 
>talking about a file format (for lack of a better description) than how 
>the bytes were actually encoded, ASCII or EBCDIC or otherwise.

The project consists of several parts. One is to define an extension of ASCII
(with a different name, that I'm not going to mention for fear of pre-emptive
copyright bullshit.) Other parts relate to other areas in comp-sci, in the same
manner of 'see what happens if one starts from scratch.'

It's a fun hobby project. That text I quoted is a small part of one chapter of the docs.
Atm the whole thing is undergoing _another_ major refactoring, due to seeing a better way
to do some parts of it.

>> · Absence of anything related to text adornments, ie italics, underline 
>> and bold. The most basic essentials of expressive text, completely 
>> ignored.
>
>Again, alphabets don't have italics or underline or bold or other.  They 
>have to depend on people reading them, and inferring the metadata, and 
>using tonal inflection to convey that metadata.

And yet written texts do have adornments (which can be of different forms
in different languages.) So, you're saying a text encoding scheme should not have
any way to represent such things? Why not?
The ASCII printable character set does not have adornments, BECAUSE it is purely a
representation of the alphabet and other symbols. That's one of its failings, since
all 'extras' have to be implemented by ad-hoc improvisations.


>> · Absence of any provision for creative typography. No awareness of 
>> fonts, type sizes, kerning, etc.
>
>I don't believe that's anywhere close to ASCII's responsibility.

I'm pretty sure you've missed the whole point. The ASCII definition 'avoided responsibility',
thus making itself inadequate. Html, postscript, and other typographic conventions layer
that stuff on top, messily and often in proprietary ways.


>
>> · Lack of logical 'new line', 'new paragraph' and 'new page' codes.
>
>I personally have issues with the concept of what a line is, or when to 
>start a new one.  (Aside: I'm a HUGE fan of format=flowed text.)

Then you never tried to represent a series of printed pages in html. 
Can be sort-of done but is a pain.
ASCII doesn't understand 'lines' either. It understands physical print heads.
Hence 'carriage return' and 'line feed'. Resulting in the CR/CR-LF/LF wars for
text files where a 'new line' was needed.
Even in format-flowed text there is a typographic need for 'new line'.
It means 'no matter where the current line ends, drop down one line and start
at the left.'
Like I'm typing here.

A paragraph otoh is like that, but with extra vertical space separating it from the text above.
Because ASCII does not have these _absolutely_fundamental_ codes, html
has to have <br> and <p>. Not to get into the whole </p> argument.

Note that including facility for real newline and paragraph symbols in the basic
coding scheme, doesn't _force_ the text to be hard formatted. That's a display mode
option.
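
A small illustration of the CR/CR-LF/LF mess mentioned above, for anyone who
hasn't fought it (Python, just showing the three legacy conventions):

# The three historical 'new line' conventions, all built from ASCII's
# printer-motion codes CR (0x0D) and LF (0x0A).
samples = {
    "Unix (LF)":           "first line\nsecond line",
    "DOS/Windows (CR LF)": "first line\r\nsecond line",
    "Old Mac (CR)":        "first line\rsecond line",
}
for name, text in samples.items():
    print(name, text.encode("ascii"))

# splitlines() has to recognise all three variants, precisely because ASCII
# never defined a single logical 'new line' code.
print("a\r\nb\rc\nd".splitlines())   # ['a', 'b', 'c', 'd']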


>
>We do have conventions for indicating a new paragraph, specifically two 
>new lines.

Sigh. Like two spaces in succession being interpreted to do something special?
You know in type layout there are typically special things that happen for
paragraphs but not for newlines? You don't see any problem with overloading
a pair of codes of one type, to mean something else?

>Is there an opportunity to streamline that?  Probably.

Factors to consider:
 - Ergonomics of typing. It _should_ be possible to directly type reasonably typographically
   formatted text, with minimal keystrokes. One can type html, but it's far from optimal.
   There are many other conventions. None arising from ASCII, because it lacks _everything_ necessary.
 - Efficiency of the file/stream encoding. Allowing for infinitely extensible character sets,
   embedded specifications of glyph appearances (fonts), layout, and dynamic elements.
 - Efficiency and complexity of code to deal with constructing, processing and displaying texts.

>
>I also have unresolved issues of what a page is.  (Think reactive web 
>pages that gracefully adjust themselves as you dynamically resize the 
>window.)

Sure. Now think of trying to construct a digital representation of a
printed work with historical significance. So it MUST NOT dynamically reformat.
Otoh it might be a total simulation of a physical object/book, page turn physics and all.

 [snip]

>> · Inadequate support of basic formatting elements such as tabular 
>> columns, text blocks, etc.
>
>ASCII has a very well defined tab character.  Both for horizontal and 
>vertical.  (Though I can't remember ever seeing vertical tab being used.)

Ha ha... consider how the Tab function works in typewriters. What does
pressing a Tab key actually do?
ASCII has a Tab code, yes. It does NOT have the other things required for actual use
of tabular columns. So, the Tab functionality is completely broken in ASCII.
That was actually a really bad error on their part. They didn't need foresight,
they just goofed. Typewriters had a working Tab function since 1897.

>I think there is some use for File / Group / Record / Unit Separators 
>(FS / GS / RS / US) for some of these uses, particularly for columns and 
>text blocks.

Not the same thing.

>> · Even the extremely fundamental and essential concept of 'tab 
>> columns' is improperly implemented in ASCII, hence almost completely 
>> dysfunctional.
>
>Why do you say it's improperly implemented?

Specifically, ASCII does not provide any explicit means to set and clear an array of
tabular positions (whether absolute or proportional.)
Hence html has to implement tables, grid systems, etc. But it SHOULD be possible to
type columnar text (with tabs) exactly and as ergonomically as one would on a typewriter.
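
A quick sketch of that point (Python; the stop positions are arbitrary): the HT
code in the stream carries no tab-stop settings at all, so identical bytes lay
out differently depending on what stops the displaying program happens to assume.

# ASCII HT (0x09) means 'advance to the next tab stop', but the stops
# themselves live in the receiving program, not in the data stream.
row = "Name\tYear\tNotes"

print(row.expandtabs(8))    # one program assumes stops every 8 columns
print(row.expandtabs(16))   # another assumes every 16 - different layout,
                            # same bytes. ASCII has no code to SET stops.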

>It sounds as if you are commenting about what programs do when 
>confronting a tab, not the actual binary pattern that represents the tab 
>character.

Why would I be talking of the binary code of the tab character? 

>What would you like to see done differently?

Sigh. You'll have to wait.

>> · No concept of general extensible-typed functional blocks within text, 
>> with the necessary opening and closing delimiters.
>
>Now I think you're asking too much of a character encoding scheme.

ASCII is not solely a 'character encoding scheme', since it also has the control codes.
But those implement far less functionality than we need. 

>I do think that you can ask that of file formats.

Now tell me why you think the fundamental coding standard should not be the same as the one
used in file formats. You're used to those being different things (since ASCII is missing so much),
but it doesn't have to be so.


>> · Missing symmetry of quote characters. (A consequence of the absence 
>> of typed functional blocks.)
>
>I think that ASCII accurately represents what the general American 
>populace was taught in elementary school.  Specifically that there is 
>functionally a single quote and a double quote.  Sure, there are opening 
>and closing quotes, both single and double, but that is effectively 
>styling and doesn't change the semantic meaning of the text.

There you go again, assuming 'styling' has no place in the base coding scheme.


>> · No provision for code commenting. Hence the gaggle of comment 
>> delimiting styles in every coding language since. (Another consequence 
>> of the absence of typed functional blocks.)
>
>How is that the responsibility of the standard used to encode characters 
>in a binary pattern?

You keep assuming that a basic coding scheme should contain nothing but the
common printable characters. Despite ASCII already containing more than that.
Also tell me why there should not be a printable character specifically meaning
"Start of comment" (and variants, line or block comments, terminators, etc.)
You are just used to doing it a traditional way, and not wondering if there
might be better ways.


>That REALLY sounds like it's the responsibility of the thing that uses 
>the underlying standard characters.

You think that, because all your life you've been typing /* comment */ or whatever.
In truth, the ASCII committee just forgot.

>> · No awareness of programatic operations such as Inclusion, Variable 
>> substitution, Macros, Indirection, Introspection, Linking, Selection, etc.
>
>I see zero way that is the binary encoding format's responsibility.

Oh well.

>I see every way that is the responsibility of the higher layer that is 
>using the underlying binary encoding.
>
>> · No facility for embedding of multi-byte character and binary code 
>> sequences.
>
>I can see how ASCII doesn't (can't?) encode multi-byte characters.  Some 
>can argue that ASCII can't even encode a full 8 bit byte character.

a) ASCII is 7 bits.
b) UTF-8
This is getting a bit pointless.
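
Still, to make the relationship concrete (Python; this is just standard
behaviour): every ASCII code has the high bit clear, and UTF-8 exploits exactly
that to layer multi-byte characters on top without ever colliding with ASCII text.

# ASCII occupies 0x00-0x7F; UTF-8 uses bytes >= 0x80 for everything else.
for ch in ["A", "é", "€"]:
    print(ch, [hex(b) for b in ch.encode("utf-8")])
# A ['0x41']                  - plain ASCII, unchanged
# é ['0xc3', '0xa9']          - two bytes, both with the high bit set
# € ['0xe2', '0x82', '0xac']  - three bytes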


>But from the standpoint of storing / sending / retrieving (multiples of 
>8-bit) bytes, how is this ASCII's problem?
>
>IMHO this really jumps the shark (as if we hadn't already) from an 
>encoding scheme to a file format.
>
>> · Missing an informational equivalent to the pure 'zero' symbol of 
>> number systems. A specific "There is no information here" symbol. (The 
>> NUL symbol has other meanings.) This lack has very profound implications.
>
>You're going to need to work to convince me of that.

You're going to need to wait a few years, till you see the end product.
That bit of text I quoted is a very, very brief points list. Detailed discussion
of all this stuff is elsewhere, and I _can't_ post it now, since that would
seriously damage the project's practical potential. (Economic reasons.) 

>Mathematics has had zero, 0, for a really long time.  (Yes, there was a time 
>before we had 0.)  But there is no numerical difference between 0 and 00 
>and 0000.  So, why do we need the latter two?

Column multiplier significance. That's a different thing from the nature of '0'
as a symbol. At present there is no symbol meaning 'this is not information.'
Never mind, it's difficult to grasp without a discussion of the implications for
very large mass storage device structure. And I'm not going there now.


>> · No facility to embed multiple data object types within text streams.
>
>How is this ASCII's problem?

It wasn't then, but the lack of it is our problem now.

>How do you represent other data object types if you aren't using ASCII? 
>Sure, there's raw binary, but that just means that you're using your own 
>encoding scheme which is even less of a common / well known standard 
>than ASCII.

UTF-8 is multi-byte binary, of a specific type. Just ONE type. No extensibility.

>We have all sorts of ways to encode other data objects in ASCII and then 
>include it in streams of bytes.

??? Are you deliberately being obtuse? The point is to attempt to formulate
a new standard that allows all this, in one well defined, extensible way that
permits all future potential cases. We do know how to do this now.


>Again, encoding verses file format.
>
>> · No facility to correlate coded text elements to associated visual 
>> typographical elements within digital images, AV files, and other 
>> representational constructs. This has crippled efforts to digitize the 
>> cultural heritage of humankind.
>
>Now I think you're lamenting the lack of computer friendly bytes 
>representing the text that is in the picture of a sign.  Functionally 
>what the ALT attribute of HTML's <IMG> tag is.

No. People who do scan captures of documents will understand that. They face the
choice: keep the document as page images (can't text search), or OCR'd text
(losing the page's visual soul.) But it should be possible to do BOTH, in
one file structure - if there was a defined way to link elements in the symbolic
text to words and characters in the images.
You'll say 'this is file format territory.' True at the moment, but only because
the basic coding scheme lacks any such capability.

>IMHO this is so far beyond a standard meant to make sure that people 
>represent A the same way on multiple computers.

You realise ASCII doesn't do that? 


>> · Non-configurable geometry of text flow, when representing the text 
>> in 2D planes. (Or 3D space for that matter.)
>
>What is a page ^W 2D plane?  ;-)

Something got lost there. "^W" ??
Surely you understand that point. English: left to right, secondary flow: downwards.
Many other cultural variants exist.


>I don't think oral text has the geometry of text flow or a page either.
>Again, IMHO, not ASCII's fault, or even it's wheelhouse.

Huh? This is pretty random.
It's a common response syndrome when someone discusses deviating from the common paradigm.
If I'm being silly enough to try discussing this in fragmentary form, I expect a lot of it.


>> · Many of the 32 'control codes' (characters 0x00 to 0x1F) were allocated 
>> to hardware-specific uses that have since become obsolete and fallen 
>> into disuse. Leaving those codes as a wasted resource.
>
>Fair point.
>
>I sometimes lament that the control codes aren't used more.
>
>> · ASCII defined only a 7-bit (128 codes) space, rather than the full 
>> 8-bit (256 codes) space available with byte sized architectures. This 
>> left the 'upper' 128 code page open to multiple chaotic, conflicting 
>> usage interpretations. For example the IBM PC code page symbol sets 
>> (multiple languages and graphics symbols, in pre-Unicode days) and the 
>> UTF-8 character bit-size extensions.
>
>I wonder what character sets looked like for other computers with 
>different word lengths.  How many more, or fewer, characters were encoded?

There are many old codings.


>Did it really make a difference?

Not after ASCII became a standard - unless you were using a language that needed more
or different characters. ie most of the world's population. 

>Would it make any real difference if words were 32-bits long?

Hah. In fact, the ability to represent unlimited-length numeric objects
is one of the essentials of an adequate coding scheme. ASCII doesn't have it. (See the sketch below.)
The whole 'x-bits long words' is one of the hangups of computing architectures too.
But that's another story.
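
(As an aside on that first point: one well-known way to get unlimited-length
numbers into a byte-oriented coding is a continuation-bit scheme, similar in
spirit to UTF-8's multi-byte sequences. A Python sketch for illustration only;
it is NOT the project's actual encoding.)

# 'varint'-style encoding: 7 data bits per byte, high bit set on every byte
# except the last. Any size of integer fits, with no fixed word length.
def encode_varint(n: int) -> bytes:
    out = []
    while True:
        n, low = divmod(n, 128)
        out.append(low | (0x80 if n else 0))
        if n == 0:
            return bytes(out)

def decode_varint(data: bytes) -> int:
    n = 0
    for shift, b in enumerate(data):
        n |= (b & 0x7F) << (7 * shift)
    return n

big = 2 ** 200 + 12345
assert decode_varint(encode_varint(big)) == big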


>What if we moved to dictionary words represented by encoding schemes 
>instead of individual characters?

You're describing Chinese-language programming, though you didn't realise it.
And yes... :)  A capable encoding scheme, and a computing architecture built
on it, would allow such a thing.


>Or maybe we should move to encoding concepts instead of words.  That way 
>we might have some loose translation of the words for mother / father / 
>son / daughter between languages.  Maybe.  I'm sure there would still be 
>issues.  Gender and tense not withstanding.

Point? Not practical.
The coding scheme has to be compatible with the existing cultural schemes
and existing literature. (All of them.)

 [snip]

>> · Inability to create files which encapsulate the entirety of the visual 
>> appearance of the physical object or text which the file represents, 
>> without dependence on any external information. Even plain ASCII text 
>> files depend on the external definition of the character glyphs that the 
>> character codes represent. This can be a problem if files are intended 
>> to serve as long term historical records, potentially for geological 
>> timescales. This problem became much worse with the advent of the vast 
>> Unicode glyph set, and typeset formats such as PDF.
>
>Now even more than ever, it sounds like you're talking about a file 
>format and not ASCII as a scheme meant to consistently encode characters.

Hmmm... well this is what happens when I post a short snippet from a larger text.
Short because I have to carefully read anything I cut-n-paste post, to be sure I didn't
include stuff I don't want to expose yet. Anyway, here's a bit more, that may
make things clearer.

----------------
Starting Over
What began as my general interest in the evolution of information encoding schemes gained focus as more and more instances of early mistakes became apparent. Eventually it spawned a deliberate project to evaluate 'starting over.' What would be the result of trying?

Like this:

 *  Revisit the development history of computing science, identifying points at which, in hindsight, major conceptual shortcomings became cemented into foundations upon which today's practices rest.
 *  Evaluate how those conceptual pitfalls could have been avoided, given understandings arrived at later in computing science.
 *  Integrate all those improvements holistically, creating a virtual 'alternate timeline' of computing evolution, as if Computing Science had evolved with prescience of future conceptual advances and practical needs. Aiming to arrive at an information processing and computing architecture that is what we'd already have now, if we'd known what we were doing from the start. 

The resulting computing environment's major components are the ****** coding scheme, the ***** operating system and hardware platform, the ***** scripting language, and the ***** file system. 
----------------


>> The PDF 'archival' format (in which all referenced fonts must be defined 
>> in the file) is a step in the right direction — except that format 
>> standard is still proprietary and not available for free.
>
>Don't get me started on PDF.  IMHO PDF is where information goes to die.

Hey, we totally agree on something! I *HATE* PDF, and the Adobe DRM-flyblown horse it rode in on.
When I scan tech documents, for lack of anything more acceptable I structure the
page images in html and wrap as a RAR-book.
Unfortunately few know of this method.

>Once data is in a PDF, the only reliable way to get the data back out to 
>be consumed by something else is through something like human eyes. 
>(Sure it may be possible to deconstruct the PDF, but it's fraught with 
>so many problems.)

There *was* at one point a freeware utility for deconstructing PDF files and analysing their structure.
I forget the name just now. It apparently was Borged by the forces of evil, and no longer can be found.
Anyone have a copy?

Photoshop is able to extract original images from PDFs, but it's a nightmare process.


>> ----------
>> 
>> Sorry to be a tease.
>
>Tease is not how I'd describe it.  I feel like it was more of a bait 
>(talking about shortcomings with ASCII) and switch (talking about 
>shortcomings with file formats).

No, they are not intrinsically different things. It just seems that way from the viewpoint of convention
because ASCII lacks so many structural features that file (and stream) formats have to implement on their own.
(And so everyone does them differently.)


>That being said, I do think you made some extremely salient points about 
>file formats.

Ha, wait till (eventually - if ever) you see the real thing.
I'm having lots of fun with it. Result is like 'alien tech.'


>> Soon I'd like to have a discussion about the functional evolution of 
>> the various ASCII control codes, and how they are used (or disused) now. 
>> But am a bit too busy atm to give it adequate attention.
>
>I think that would be an interesting discussion.

Soon. Few weeks. Got to get some stuff out of the way first. I have way too many projects.

Guy

