Text encoding Babel. Was Re: George Keremedjiev

Grant Taylor cctalk at gtaylor.tnetconsulting.net
Mon Nov 26 22:49:51 CST 2018


On 11/26/18 7:21 AM, Guy Dunphy wrote:
> I was speaking poetically. Perhaps "the mail software he uses was written 
> by morons" is clearer.

;-)

> Oh yes, tell me about the html 'there is no such thing 
> as hard formatting and you can't have any even when 
> you want it' concept. Thank you Tim Berners Lee.

I've not delved too deeply into the lack of hard formatting in HTML.

I've also always considered HTML to describe what you want displayed, 
with minimal information about how you want it displayed.  IMHO CSS 
helps significantly with the latter part.

> http://everist.org/NobLog/20130904_Retarded_ideas_in_comp_sci.htm
> http://everist.org/NobLog/20140427_p-term_is_retarded.htm

Intriguing.  $readingList++.

> Except that 'non-breaking space' is mostly about inhibiting line wrap at 
> that word gap.

I wouldn't have thought "mostly" or "inhibiting line wrap".  I view the 
non-breaking space as a way to glue two parts of text together and treat 
them as one unit, particularly for display and partially for selection.

Granted, much of the breaking is done when the text cannot continue (in 
its natural direction), frequently needing to start anew on the next line.

> But anyway, there's little point trying to psychoanalyze the writers of 
> that software. Probably involved pointy-headed bosses.

I like to understand why things have been done the way they were. 
Hopefully I can learn from the reasons.

> Of course not. It was for American English only. This is one of the 
> major points of failure in the history of information processing.

Looking backwards, (I think) I can understand why you say that.  But 
based on my (possibly limited) understanding of the time, I think that 
ASCII was one of the primordial building blocks that was necessary.  It 
was a standard (one of many emerging standards of the time) that allowed 
computers from different manufacturers to interoperate and represent 
characters with the same binary pattern.  That's something we now 
(mostly) take for granted, and something that could not be assured at 
the time or before.

> Containing extended Unicode character sets via UTF-8 doesn't make it a 
> non-hard-formatted medium. In ASCII a space is a space, and multi-spaces 
> DON'T collapse. White space collapse is a feature of html, and whether 
> an email is html or not is determined by the sending utility.

Having read the rest of your email and now replying, I feel that we may 
be talking about two different things.  One being ASCII's standard 
definition of how to represent different letters / glyphs in a 
consistent binary pattern.  The other being how information is stored in 
an (un)structured sequence of ASCII characters.

> As you see, this IS NOT HTML, since those extra spaces and your diagram 
> below would have collapsed if it was html. Also saving it as text and 
> opening in a plain text ed or hex editor absolutely reveals what it is.

I feel it is important to acknowledge your point and to state that I'm 
moving on.

> Hmm... the problem is it's intended to be serious, but is still far from 
> exposure-ready.  So if I talk about it now, I risk having specific terms 
> I've coined in the doco (including the project name) getting meme-jammed 
> or trademarked by others. The plan is to release it all in one go, 
> eventually. Definitely will be years before that happens, if ever.

Fair enough.

> However, here's a cut-n-paste (in plain text) of a section of the 
> Introduction (html with diags.)

ACK

> ----------
> 
> Almost always, a first attempt at some unfamiliar, complex task produces 
> a less than optimal result. Only with the knowledge gained from actually 
> doing a new thing, can one look back and see the mistakes made. It usually 
> takes at least one more cycle of doing it over from scratch to produce 
> something that is optimal for the needs of the situation. Sometimes, 
> especially where deep and subtle conceptual innovations are involved, 
> it takes many iterations.

Part way through the first large (for me at the time) project that I 
worked on, I decided that the project (and likely others) needed three 
versions before being production ready:

1)  First whack at solving the problem.  LOTS about the problem is 
learned, including the true requirements and the unknown dependencies 
along the way.  This will not be the final shipping version.  -  Think 
of this as the Alpha release.
2)  This is a complete re-write of the project based on what was learned 
in #1.  -  Think of this as the Beta release.
3)  This is less of a re-write and more of a bug fix for version 2.  - 
Think of this as the shipping release.

> Human development of computing science (including information coding 
> schemes) has been effectively a 'first time effort', since we kept on 
> developing new stuff built on top of earlier work. We almost never went 
> back to the roots and rebuilt everything, applying insights gained from 
> the many mistakes made.

With few notable (partial) exceptions, I largely agree.

> In reviewing the evolution of information coding schemes since very 
> early stages such as the Morse code, telegraph signal recording, 
> typewriters, etc, through early computing systems, mass data storage and 
> file systems, computer languages from Assembler through Compilers and 
> Interpreters, and so on, several points can be identified at which early 
> (inadequate) concepts became embedded then used as foundations for further 
> developments. This made the original concepts seem like fundamentals, 
> difficult to question (because they are underlying principles for much 
> more complex later work), and virtually impossible to alter (due to the 
> vast amounts of code dependent on them.)

Agreed.

> And yet, when viewed in hindsight many of the early concepts are seriously 
> flawed. They effectively hobbled all later work dependent on them.

Okay.

> Examples of these pivotal conceptual errors:
> 
> Defects in the ASCII code table. This was a great improvement at the 
> time, but fails to implement several utterly essential concepts. The lack 
> of these concepts in the character coding scheme underlying virtually 
> all information processing since the 1960s, was unfortunate. Just one 
> (of many) bad consequences has been the proliferation of 'patch-up' 
> text coding schemes such as proprietary document formats (MS Word for eg), 
> postscript, pdf, html (and its even more nutty academia-gone-mad variants 
> like XML), UTF-8, unicode and so on.

Hum....

> This is a scan from the 'Recommended USA Standard Code for Information 
> Interchange (USASCII) X3.4 - 1967'. The Hex A-F on rows 10-15 added 
> here. Hexadecimal notation was not commonly in use in the 1960s.  Fig. ___ 
> The original ASCII definition table.
> 
> ASCII's limitations were so severe that even the text (ie ASCII) program 
> code source files used by programmers to develop literally everything 
> else in computing science, had major shortcomings and inconveniences.

I don't think I'm willing to accept that at face value.

> A few specific examples of ASCII's flaws:
> 
> · Missing concept of control vs data channel separation. And so we 
> needed the "< >" syntax of html, etc.

I don't buy that, at all.

ASCII has control codes that I think could be (but aren't) used for 
some of this.  Start of Text (STX) & End of Text (ETX), or Shift Out 
(SO) & Shift In (SI), or Device Control 1 - 4 (DC1 - DC4), or File / 
Group / Record / Unit Separators (FS / GS / RS / US) all come to mind.

Either you're going to need two parallel byte streams, one for data and 
another for control (I'm ignoring timing between them), -or- you're 
going to need a way to indicate the need to switch between byte 
(sub)streams in the overall byte (super)stream.  Much of what I've seen 
is the latter.

It just happens that different languages have decided to use different 
(sequences of) characters / bytes to do this.  HTML (possibly all XML) 
uses "<" and ">".  ASP uses "<%" and "%>".  PHP uses "<?php" and "?>". 
Apache HTTPD SSI uses "<!--#" and "-->".  I can't readily think of 
others, but I know there are a plethora.  These are all signals to 
indicate the switch between data and control stream.
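
One consequence of that in-band switching is that the switch characters 
themselves have to be escaped whenever they show up in the data.  A 
minimal sketch of what HTML has to do (the entity names are standard; 
the sample text is made up):

def escape_html(data: str) -> str:
    # "&" must be escaped first, or we would re-escape our own output.
    return (data.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;"))

print(escape_html("if a < b then a->next"))
# if a &lt; b then a-&gt;next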

> · Inability to embed meta-data about the text in standard programmatically 
> accessible form.

I'll agree that there's no distinction of data, meta, or otherwise, in a 
string of ASCII bytes.  But I don't expect there to be.

Is there any distinction in the Roman alphabet (or any other alphabet in 
this thread) to differentiate the sequence of bytes that makes up the 
quote versus the metadata that is the name of the person that said the 
quote?  Or what about the date that it was originally said?

This is about the time that I really started to feel that you were 
talking about a file format (for lack of a better description) rather 
than how the bytes were actually encoded, ASCII or EBCDIC or otherwise.
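
For what it's worth, here's a minimal sketch of metadata layered on top 
of plain ASCII by a file format rather than by the encoding itself; RFC 
822 style headers, with made-up field values:

from email.parser import Parser

# The quote and its metadata are all just ASCII bytes; the
# header / body structure comes from the message format, not
# from the character encoding.
raw = (
    "Author: Anonymous\n"
    "Date: 1967-01-01\n"
    "\n"
    "Nothing in the bytes themselves marks which part is metadata.\n"
)
msg = Parser().parsestr(raw)
print(msg["Author"], "/", msg["Date"])
print(msg.get_payload().strip())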

> · Absence of anything related to text adornments, ie italics, underline 
> and bold. The most basic essentials of expressive text, completely 
> ignored.

Again, alphabets don't have italics or underline or bold or anything 
else.  They depend on people reading the text, inferring that metadata, 
and using tonal inflection to convey it.

> · Absence of any provision for creative typography. No awareness of 
> fonts, type sizes, kerning, etc.

I don't believe that's anywhere close to ASCII's responsibility.

> · Lack of logical 'new line', 'new paragraph' and 'new page' codes.

I personally have issues with the concept of what a line is, or when to 
start a new one.  (Aside: I'm a HUGE fan of format=flowed text.)

We do have conventions for indicating a new paragraph, specifically two 
new lines.

Is there an opportunity to streamline that?  Probably.

I also have unresolved issues of what a page is.  (Think responsive web 
pages that gracefully adjust themselves as you dynamically resize the 
window.)

There is also Form Feed (FF), which is used in printers to advance to 
the next page.  (Where the page is defined as the physical size of 
paper, independently of the amount of text that will / won't fit on a 
given page.)
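
For what it's worth, those two conventions are simple enough to act on 
in code.  A minimal sketch (the sample text is made up):

# Pages are conventionally separated by Form Feed (FF, 0x0C) and
# paragraphs by a blank line (two consecutive new lines).
text = "First paragraph.\n\nSecond paragraph.\fNext page starts here."

for page_num, page in enumerate(text.split("\f"), start=1):
    paragraphs = [p for p in page.split("\n\n") if p.strip()]
    print(f"page {page_num}: {len(paragraphs)} paragraph(s)")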

> · Inadequate support of basic formatting elements such as tabular 
> columns, text blocks, etc.

ASCII has very well defined tab characters, both horizontal and 
vertical.  (Though I can't remember ever seeing vertical tab being used.)

I think there is some use for File / Group / Record / Unit Separators 
(FS / GS / RS / US) for some of these uses, particularly for columns and 
text blocks.
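
A minimal sketch of what that could look like, carrying a small table 
in the separator control codes (the sample data is invented):

US = "\x1f"   # Unit Separator between fields
RS = "\x1e"   # Record Separator between rows

rows = [["name", "year"], ["ASCII", "1963"], ["USASCII X3.4", "1967"]]
encoded = RS.join(US.join(fields) for fields in rows)

# Decoding needs no quoting or escaping, because US and RS never
# legitimately occur inside the text itself.
for record in encoded.split(RS):
    print(record.split(US))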

> · Even the extremely fundamental and essential concept of 'tab 
> columns' is improperly implemented in ASCII, hence almost completely 
> dysfunctional.

Why do you say it's improperly implemented?

It sounds as if you are commenting about what programs do when 
confronting a tab, not the actual binary pattern that represents the tab 
character.

What would you like to see done differently?
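
To make the question concrete: the tab character carries no column 
information of its own, so where a field lands depends entirely on the 
tab stops each program happens to assume.  A minimal sketch:

rows = ["ab\tcd", "abcdefgh\tcd"]

# The same two lines, rendered with 8-column and then 4-column tab
# stops.  The second field drifts as soon as a field crosses a stop.
for tabsize in (8, 4):
    print(f"tab stops every {tabsize} columns:")
    for row in rows:
        print(" ", row.expandtabs(tabsize))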

> · No concept of general extensible-typed functional blocks within text, 
> with the necessary opening and closing delimiters.

Now I think you're asking too much of a character encoding scheme.

I do think that you can ask that of file formats.

> · Missing symmetry of quote characters. (A consequence of the absence 
> of typed functional blocks.)

I think that ASCII accurately represents what the general American 
populace was taught in elementary school.  Specifically that there is 
functionally a single quote and a double quote.  Sure, there are opening 
and closing quotes, both single and double, but that is effectively 
styling and doesn't change the semantic meaning of the text.

> · No provision for code commenting. Hence the gaggle of comment 
> delimiting styles in every coding language since. (Another consequence 
> of the absence of typed functional blocks.)

How is that the responsibility of the standard used to encode characters 
in a binary pattern?

That REALLY sounds like it's the responsibility of the thing that uses 
the underlying standard characters.

> · No awareness of programmatic operations such as Inclusion, Variable 
> substitution, Macros, Indirection, Introspection, Linking, Selection, etc.

I see zero way that is the binary encoding format's responsibility.

I see every way that is the responsibility of the higher layer that is 
using the underlying binary encoding.

> · No facility for embedding of multi-byte character and binary code 
> sequences.

I can see how ASCII doesn't (can't?) encode multi-byte characters.  Some 
can argue that ASCII can't even encode a full 8 bit byte character.

But from the standpoint of storing / sending / retrieving (multiples of 
8-bit) bytes, how is this ASCII's problem?

IMHO this really jumps the shark (as if we hadn't already) from an 
encoding scheme to a file format.
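
For what it's worth, UTF-8 shows how multi-byte characters were 
eventually layered on top of the 7-bit design without breaking it.  A 
minimal sketch:

# Pure ASCII survives UTF-8 encoding byte-for-byte...
print("plain text".encode("utf-8"))   # b'plain text'

# ...while anything outside the 7-bit range becomes a multi-byte
# sequence whose bytes all have the high bit set, so they can never
# be mistaken for ASCII characters.
print("\u00e9".encode("utf-8"))       # b'\xc3\xa9' (e with acute)
print("\u20ac".encode("utf-8"))       # b'\xe2\x82\xac' (euro sign)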

> · Missing an informational equivalent to the pure 'zero' symbol of 
> number systems. A specific "There is no information here" symbol. (The 
> NUL symbol has other meanings.) This lack has very profound implications.

You're going to need to work to convince me of that.

Mathematics has had zero, 0, for a really long time.  (Yes, there was a 
time before we had 0.)  But there is no numerical difference between 0 
and 00 and 0000.  So, why do we need the latter two?

How many grains of sand does it take to make a pile?

My opinion: None.  You simply need to define that it is a pile.  Then it 
becomes a question of "how many grains of sand are in the pile".  Zero 
(0) is a perfectly acceptable number.  I also feel like anything but a 
number is not an answer to the question of how many.
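
For what it's worth, programming languages did eventually grow a 
distinct "there is no information here" marker, separate from zero.  A 
minimal sketch:

# 0 answers "how many?" with a count; None answers "is there an
# answer at all?"
grains_in_pile = 0       # a pile exists, and it is empty
grains_unknown = None    # no information about any pile

for value in (grains_in_pile, grains_unknown):
    if value is None:
        print("no information")
    else:
        print(f"{value} grain(s)")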

> · No facility to embed multiple data object types within text streams.

How is this ASCII's problem?

How do you represent other data object types if you aren't using ASCII? 
Sure, there's raw binary, but that just means that you're using your own 
encoding scheme which is even less of a common / well known standard 
than ASCII.

We have all sorts of ways to encode other data objects in ASCII and then 
include it in streams of bytes.
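
Base64 is probably the most common example.  A minimal sketch:

import base64

# Arbitrary binary data, carried as a plain ASCII byte stream and
# recovered intact on the other side.
blob = bytes([0x00, 0xff, 0x10, 0x80])
text = base64.b64encode(blob)
print(text)                           # b'AP8QgA=='
assert base64.b64decode(text) == blob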

Again, encoding versus file format.

> · No facility to correlate coded text elements to associated visual 
> typographical elements within digital images, AV files, and other 
> representational constructs. This has crippled efforts to digitize the 
> cultural heritage of humankind.

Now I think you're lamenting the lack of computer friendly bytes 
representing the text that is in the picture of a sign.  That's 
functionally what the ALT attribute of HTML's <IMG> tag is for.

IMHO this is so far beyond the scope of a standard meant to make sure 
that people represent A the same way on multiple computers.

> · Non-configurable geometry of text flow, when representing the text 
> in 2D planes. (Or 3D space for that matter.)

What is a page ^W 2D plane?  ;-)

I don't think spoken text has the geometry of text flow or a page either.

Again, IMHO, not ASCII's fault, or even its wheelhouse.

> · Many of the 32 'control codes' (characters 0x00 to 0x1F) were allocated 
> to hardware-specific uses that have since become obsolete and fallen 
> into disuse. Leaving those codes as a wasted resource.

Fair point.

I sometimes lament that the control codes aren't used more.

> · ASCII defined only a 7-bit (128 codes) space, rather than the full 
> 8-bit (256 codes) space available with byte sized architectures. This 
> left the 'upper' 128 code page open to multiple chaotic, conflicting 
> usage interpretations. For example the IBM PC code page symbol sets 
> (multiple languages and graphics symbols, in pre-Unicode days) and the 
> UTF-8 character bit-size extensions.

I wonder what character sets looked like for other computers with 
different word lengths.  How many more, or fewer, characters were encoded?

Did it really make a difference?

Would it make any real difference if words were 32-bits long?

What if we moved to dictionary words represented by encoding schemes 
instead of individual characters?

Or maybe we should move to encoding concepts instead of words.  That way 
we might have some loose translation of the words for mother / father / 
son / daughter between languages.  Maybe.  I'm sure there would still be 
issues.  Gender and tense notwithstanding.

Then there are the issues of possession and tense.

I feel like all of these are beyond the purpose and intent of ASCII, a 
way to consistently encode characters in the hopes that they might be 
able to be used across disparate computer systems.
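
To see how chaotic those conflicting upper-128 interpretations got, 
here's a small demonstration: the same byte means three different 
things under three common schemes.

b = bytes([0xa9])

print(b.decode("cp437"))     # reversed-not sign (IBM PC code page 437)
print(b.decode("latin-1"))   # copyright sign (ISO 8859-1)

try:
    b.decode("utf-8")        # a lone 0xA9 is not valid UTF-8 at all
except UnicodeDecodeError as error:
    print("invalid UTF-8:", error.reason)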

> · Inability to create files which encapsulate the entirety of the visual 
> appearance of the physical object or text which the file represents, 
> without dependence on any external information. Even plain ASCII text 
> files depend on the external definition of the character glyphs that the 
> character codes represent. This can be a problem if files are intended 
> to serve as long term historical records, potentially for geological 
> timescales. This problem became much worse with the advent of the vast 
> Unicode glyph set, and typeset formats such as PDF.

Now even more than ever, it sounds like you're talking about a file 
format and not ASCII as a scheme meant to consistently encode characters.

> The PDF 'archival' format (in which all referenced fonts must be defined 
> in the file) is a step in the right direction — except that format 
> standard is still proprietary and not available for free.

Don't get me started on PDF.  IMHO PDF is where information goes to die.

Once data is in a PDF, the only reliable way to get the data back out to 
be consumed by something else is through something like human eyes. 
(Sure it may be possible to deconstruct the PDF, but it's fraught with 
so many problems.)

> ----------
> 
> Sorry to be a tease.

Tease is not how I'd describe it.  I feel like it was more of a bait 
(talking about shortcomings with ASCII) and switch (talking about 
shortcomings with file formats).

That being said, I do think you made some extremely salient points about 
file formats.

> Soon I'd like to have a discussion about the functional evolution of 
> the various ASCII control codes, and how they are used (or disused) now. 
> But am a bit too busy atm to give it adequate attention.

I think that would be an interesting discussion.



-- 
Grant. . . .
unix || die