Text encoding Babel. Was Re: George Keremedjiev

Tue Nov 27 13:47:05 CST 2018

On 11/27/2018 03:05 AM, Guy Dunphy wrote:
> It was a core of the underlying philosophy, that html would NOT allow any 
> kind of fixed formatting. The reasoning was that it could be displayed 
> on any kind of system, so had to be free-format and quite abstract.

That's one of the reasons that I like HTML as much as I do.

> Which is great, until you actually want to represent a real printed page, 
> or book. Like Postscript can. Thus html was doomed to be inadequate for 
> capture of printed works.

I feel like trying to accurately represent fixed page layout in HTML is 
a questionable idea.  I would think that it would be better to use a 
different type of file.

> That was a disaster. There wasn't any real reason it could not be 
> both. Just an academic's insistence on enforcing his ideology.  Then of 
> course, over time html has morphed to include SOME forms of absolute 
> layout, because there was a real demand for that. But the result is 
> a hodge-podge.

I don't think that HTML can reproduce fixed page layout like PostScript 
and PDF can.  It can make a close approximation.  But I don't think HTML 
can get there.  Nor do I think it should.

> Yes, it should be capable of that. But not enforce 'only that way'.

I question if people are choosing to use HTML to store documentation 
because it's so popular and then getting upset when they want to do 
things that HTML is not meant to do.  Or in some cases is actually meant
/not/ to do.

Use the tool for the job.  Don't alter the wrong tool for your 
particular job.

IMHO true page layout doesn't belong in HTML.  Loosely laying out the 
same content in approximately the same layout is okay.

> By 'html' I mean the kludge of html-css-js. The three-cat herd. (Ignoring 
> all the _other_ web cats.)  Now it's way too late to fix it properly 
> with patches.

I don't agree with that.  HTML (and XML) has markup that can be used, 
and changed, to define how the HTML is meant to be interpreted.

The fact that people don't do so correctly is mostly independent of the
fact that it has the ability.  I say mostly because there is some small
amount of wiggle room for discussion of whether the functionality
actually works or not.

> I meant there's no point trying to determine why they were so deluded, 
> and failed to recognise that maybe some users (Ed) would want to just 
> type two spaces.

I /do/ believe that there /is/ a point in trying to understand why 
someone did what they did.

> now 'we' (the world) are stuck with it for legacy compatibility reasons.

Our need to be able to read it does not translate to our need to 
continue to use it.

> Any extensions have to be retro-compatible.

I disagree.

I see zero reason why we couldn't come up with something new and 
completely different.

Granted, there should be ways to translate from one to the other.  Much 
like how ASCII and EBCDIC are still in use today.
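
A minimal sketch of that kind of translation in Python.  The choice of
cp037 (one EBCDIC code page among many) is an assumption made only for
illustration:

```python
# Translate the same text between ASCII and one EBCDIC code page.
# cp037 is an assumption; real mainframe shops use several variants.
text = "HELLO 123"

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")

# Same characters, entirely different byte values.
print(ascii_bytes.hex(" "))
print(ebcdic_bytes.hex(" "))

# Converting one encoding to the other is a decode followed by an encode.
round_trip = ebcdic_bytes.decode("cp037").encode("ascii")
```

Neither encoding has to be built on the other for this to work.  All
that is needed is a mapping in both directions.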

> What I'm talking about is not that. It's about how to create a coding 
> scheme that serves ALL the needs we are now aware of. (Just one of 
> which is for old ASCII files to still make sense.) This involves both 
> re-definition of some of the ASCII control codes, AND defining sequential 
> structure standards.  For eg UTF-8 is a sequential structure. So are 
> all the html and css codings, all programming languages, etc. There's a 
> continuum of encoding...structure...syntax.  The ASCII standard didn't 
> really consider that continuum.

I don't think that ASCII was even trying to answer / solve the problems 
that you're talking about.

ASCII was a solution for a different problem for a different time.

There is no reason we can't move on to something else.

> Which exceptions would those be? (That weren't built on top of ASCII!)

It is subject to the meaning of "back to the roots" and not worth taking
more time.

> I assume you're thinking that ASCII serves just fine for program source 
> code?

I'm not personally aware of any cases where ASCII limits programming 
languages.  But my ignorance does not preclude that situation from existing.

I do believe that there are a number of niche programming languages (if 
you will) that store things as binary data (I'm thinking PLCs and the 
likes) but occasionally have said data represented (as a hexadecimal 
dump) in ASCII.  But the fact that ASCII can or can't easily display the 
data is immaterial to the system being programmed.

I have long wondered if there are computer languages that aren't rooted 
in English / ASCII.  I feel like it's rather pompous to assume that all 
programming languages are rooted in English / ASCII.  I would hope that 
there are programming languages that are more specific to the region of 
the world they were developed in.  As such, I would expect that they 
would be stored in something other than ASCII.

Could the sequence of bytes be displayed as ASCII?  Sure.  Would it make 
much sense?  Not likely.

> This is a bandwagon/normalcy bias effect. "Everyone does it that way 
> and always has, so it must be good."

Nope, not for me.

It may be the case for some people.  But I actively try to avoid such 
biases.  Or if I do use them, I acknowledge that they are biases so that 
others can overcome them.

> Sigh. Well, I can't go into that without revealing more than I wish 
> to atm.

Fair.

I will say that I don't think there's any reason why English based 
programming languages can't be encoded in Morse code, either American or 
International.  Sure, it would be a variable bit length word, but it 
would work.  Nothing mandates that ASCII is used.  ASCII is used by 
convention.  But nothing states that that convention can't be changed. 
Hence why some embedded applications use something else that's more 
convenient for them.
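
As a rough sketch of the idea, here is International Morse used as a
variable-length encoding in Python.  The table covers letters and
digits only, which is an assumption for brevity (real source code would
also need punctuation), and the " / " word separator is just a common
written convention:

```python
# International Morse code, letters and digits only (punctuation omitted).
MORSE = {
    "A": ".-",    "B": "-...",  "C": "-.-.",  "D": "-..",   "E": ".",
    "F": "..-.",  "G": "--.",   "H": "....",  "I": "..",    "J": ".---",
    "K": "-.-",   "L": ".-..",  "M": "--",    "N": "-.",    "O": "---",
    "P": ".--.",  "Q": "--.-",  "R": ".-.",   "S": "...",   "T": "-",
    "U": "..-",   "V": "...-",  "W": ".--",   "X": "-..-",  "Y": "-.--",
    "Z": "--..",
    "0": "-----", "1": ".----", "2": "..---", "3": "...--", "4": "....-",
    "5": ".....", "6": "-....", "7": "--...", "8": "---..", "9": "----.",
}
DECODE = {code: char for char, code in MORSE.items()}


def to_morse(text):
    """Encode text; letters are separated by spaces, words by ' / '."""
    words = text.upper().split()
    return " / ".join(" ".join(MORSE[c] for c in word) for word in words)


def from_morse(signal):
    """Decode the output of to_morse back into upper-case text."""
    words = signal.split(" / ")
    return " ".join(
        "".join(DECODE[code] for code in word.split()) for word in words
    )


source = "PRINT 42"
encoded = to_morse(source)
print(encoded)
print(from_morse(encoded))
```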

> You're making my point for me. Of course there are many ways to interpret 
> existing codes to achieve this effect. Some use control codes, others 
> overload functionality on printable characters. eg html with < and >.

I disagree.

> My point is the base coding scheme doesn't allocate a SPECIFIC mechanism 
> for doing this.

I think there are ASCII codes that could, or should, have been used for 
that.

> The result is a briar-patch of competing ad-hoc methods.  Hence the 
> 'babel' I'm referring to, in every matter where ASCII didn't define 
> needed functionality.

I believe that the fact that people didn't use it, for one reason or
another, is not ASCII's fault.

> Exactly. Because ASCII does not provide a specific coding. It didn't 
> occur to those drafting the standard. Same as with all the other...

I believe that ASCII did provide control codes that could have been used.

I also question how much of the fact that the control codes weren't used 
was / is related to the fact that most people don't have keys on their 
keyboard for them.  Thus many people chose to use different keys ~> 
bytes to perform the needed function.

> And so every different devel project that needed it, added some kludge 
> on top.  This is what I'm saying: ASCII has no facility for this, but 
> we need a basic coding scheme that does (and is still ASCII-compatible.)

How would you encode the same string of characters, "Lincoln", used in 
both the name of the person speaking, Abraham Lincoln, and the phrase 
they said, "I drive a Lincoln Town Car".  Would you have completely 
different ways of encoding "Lincoln" in each of the contexts?  Or would 
you have a way to indicate which context is applied to the sequence of 
seven characters / words of memory?

If it's the latter, you must have some way to switch between the two 
contexts.  This could be one of the ASCII control codes, or it could be 
an overload of one (or sequence of) characters.

I believe that ASCII is a standard (one of many) that defines how to 
represent characters / control codes in a binary pattern.  I also 
believe that any semantics or meanings beyond the context of a single 
character / control code is outside of the scope of ASCII or comparable 
standards.

I believe such semantic meaning sits on top of the underlying character 
set.  Hence file format.
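
A sketch of what I mean, in Python.  The record layout (field name,
Unit Separator, value, with Record Separator between fields) is a
made-up format purely for illustration.  Only the two control codes
themselves come from ASCII:

```python
US = "\x1f"  # ASCII Unit Separator
RS = "\x1e"  # ASCII Record Separator


def pack(fields):
    """Join name/value pairs into one string (hypothetical layout)."""
    return RS.join(name + US + value for name, value in fields.items())


def unpack(data):
    """Split a packed string back into a dict of name -> value."""
    return dict(item.split(US, 1) for item in data.split(RS))


record = pack({
    "speaker": "Abraham Lincoln",
    "quote": "I drive a Lincoln Town Car",
})
fields = unpack(record)
print(fields["speaker"])
print(fields["quote"])
```

Both occurrences of "Lincoln" are the same seven bytes.  The context
comes from the structure wrapped around them, not from the characters.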

> Doesn't matter. The English alphabet (or any other human language) 
> naturally do not have protocols to concisely represent data types.

So if I apply (what I understand to be) your logic, I can argue that the 
English language is (similarly) flawed.

> That's no reason to not build such things into the character coding 
> scheme used in computational machinery.

I agree there is need for such data.  I do not agree that it belongs in 
the character encoding.  I believe that it belongs in file formats that 
sit on top of character encoding.

Does a file format need additional characters that are outside of the 
typical language so that the file format can contain typical characters 
without overloading?  Sure.  That's where the control codes come in.

> The project consists of several parts. One is to define an extension of 
> ASCII (with a different name, that I'm not going to mention for fear 
> of pre-emptive copyright bullshit.) Other parts relate to other areas 
> in comp-sci, in the same manner of 'see what happens if one starts 
> from scratch.'

Why do you need to define them as an extension of ASCII?  Rather, why
not define it completely anew and slough off the old?  I don't see any
reason why something brand new can't be completely its own.  The only
requirement I see is a way to convert between the old and the new.

> So, you're saying a text encoding scheme should not have any way to 
> represent such things? Why not?

I believe that the letter "A", be it bold and / or italic and / or
underlined, is still the letter "A".  The formatting and display of it
does not change the semantic meaning of the letter "A".  As such, I
don't think that the different forms of the letter "A" should be encoded
as different letters.

I do think that file formats should have a way to encode the different 
formats of the same semantic letter, "A".  If a control code is needed 
to indicate a formatting change, so be it.  ASCII has some codes that 
can be used for this.  Or other character sets can have more codes to do 
the same thing.
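
One existing example of that approach is the terminal escape sequences
from ECMA-48 / ANSI X3.64, which start with the ASCII ESC control code.
A small Python sketch (the helper names are mine, for illustration):

```python
import re

ESC = "\x1b"  # ASCII Escape control code
BOLD, ITALIC, UNDERLINE, RESET = "1", "3", "4", "0"  # SGR parameters


def styled(text, *codes):
    """Wrap text in a Select Graphic Rendition sequence and a reset."""
    return f"{ESC}[{';'.join(codes)}m{text}{ESC}[{RESET}m"


def strip_style(text):
    """Remove SGR sequences, leaving only the characters themselves."""
    return re.sub(r"\x1b\[[0-9;]*m", "", text)


variants = [
    "A",
    styled("A", BOLD),
    styled("A", ITALIC),
    styled("A", BOLD, ITALIC, UNDERLINE),
]
for v in variants:
    print(repr(v), "->", strip_style(v))
```

The formatting travels next to the letter, and the letter itself stays
the single byte 0x41 in every variant.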

> The ASCII printable character set does not have adornments, BECAUSE it 
> is purely a representation of the alphabet and other symbols. That's 
> one of its failings, since all 'extras' have to be implemented by ad-hoc 
> improvisations.

I think the fact that all four forms of "A" are the same ASCII byte is a
good thing.

It's both good and bad that programmers are free to implement their 
ideas how they see fit.  Requiring all programmers to use the same thing 
to represent Italic underlined text would be limiting.

> I'm pretty sure you've missed the whole point. The ASCII definition 
> 'avoided responsibility' thus making itself inadequate. Html, postscript, 
> and other typographic conventions layer that stuff on top, messily and 
> often in proprietary ways.

We can agree to disagree.

> Then you never tried to represent a series of printed pages in html. 
> Can be sort-of done but is a pain.

I would not choose to accurately represent printed pages in HTML. 
That's not what HTML is meant for.

I would choose a page layout language to represent printed pages.

> ASCII doesn't understand 'lines' either. It understands physical head 
> printers.  Hence 'carriage return' and 'line feed'. Resulting in the 
> CR/CR-LF/LF wars for text files where a 'new line' was needed.

I don't consider that to be a war.  I consider it to be three different
file formats (each with its own way of encoding a new line).  And LOTS
of ignorance about the fact.

It is trivial to convert between the formats.  Trivial enough that many 
editors automatically detect the format that's being used and behave 
appropriately for the detected format.
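
A minimal sketch of both halves in Python.  The detection rule is a
simplification I'm assuming here (a file with mixed endings is simply
reported as CRLF if it contains any):

```python
def detect_newline(data: bytes) -> str:
    """Guess the line-ending convention used in data."""
    if b"\r\n" in data:
        return "CRLF"
    if b"\r" in data:
        return "CR"
    if b"\n" in data:
        return "LF"
    return "none"


def to_lf(data: bytes) -> bytes:
    """Normalize CRLF and bare CR line endings to LF."""
    # CRLF must be handled first, or each pair would become two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


for sample in (b"a\r\nb\r\n", b"a\rb\r", b"a\nb\n"):
    print(detect_newline(sample), to_lf(sample))
```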

> Even in format-flowed text there is a typographic need for 'new line'. 
> It means 'no matter where the current line ends, drop down one line and 
> start at the left.'
> Like I'm typing here.

I'll agree that there is a new line in the underlying text that makes up 
the format=flowed line.

But I believe that format=flowed text is a collection of lines 
(paragraphs) that are stored using format=flowed encoding.  Each line 
(paragraph) is independent of the others.  As such, the "new line" that 
you speak of is outside the scope of format=flowed text.  Or rather, the 
"new line" that you speak of means the end of one format=flowed line 
(paragraph) and the start of another (assuming it also uses format=flowed).
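
A simplified Python sketch of how I read format=flowed (RFC 3676): a
line that ends in a space continues on the next line, and a line that
does not end in a space closes the paragraph.  Quote depth,
space-stuffing and the "-- " signature separator are all ignored here,
so treat it as an illustration rather than a conforming decoder:

```python
def unflow(text):
    """Join soft-broken lines (trailing space) into paragraphs."""
    paragraphs = []
    current = ""
    for line in text.splitlines():
        current += line
        if not line.endswith(" "):
            # A fixed line ends the paragraph being collected.
            paragraphs.append(current)
            current = ""
    if current:
        # Input ended in the middle of a flowed paragraph.
        paragraphs.append(current.rstrip(" "))
    return paragraphs


message = "one \ntwo \nthree\nfour\n"
print(unflow(message))
```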

> A paragraph otoh is like that, but with extra vertical space separating 
> from above.  Because ASCII does not have these _absolutely_fundamental_ 
> codes, is why html has to have <br> and <p>.

I suspect that even if ASCII did have a specific-purpose code, either
people wouldn't use it and / or HTML would ignore it with its white
space compaction philosophy.

> Not to get into the whole </p> argument.

I'm not going there.  I don't think it's germane to this discussion.

> Note that including facility for real newline and paragraph symbols in the 
> basic coding scheme, doesn't _force_ the text to be hard formatted. That's 
> a display mode option.

Much like HTML's philosophy to compact white space?

> Sigh. Like two spaces in succession being interpreted to do something 
> special?

I'm not aware of any special meaning for two spaces in succession.  I am 
aware of shenanigans that different applications do to preserve the 
multiple spaces.  Ergo this conversation.
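
A rough Python approximation of the compaction and of one such
shenanigan, the no-break space (U+00A0).  Treating only space, tab, CR
and LF as collapsible is my simplification of what browsers do in
normal text flow:

```python
import re


def collapse(text):
    """Approximate HTML white space compaction in normal flow."""
    # U+00A0 is deliberately not in the character class.
    return re.sub(r"[ \t\r\n]+", " ", text)


plain = "End of sentence.  Next sentence."
padded = "End of sentence.\u00a0 Next sentence."

print(repr(collapse(plain)))
print(repr(collapse(padded)))
```

The two ordinary spaces collapse to one.  The no-break space followed
by a space survives, which is why some applications emit it.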

I'm also aware that the two spaces after punctuation is a relatively
modern thing, introduced by (I believe) typewriters as a mechanism to
avoid a mechanical problem.  A convention that persisted into computers.
A convention that some modern devices actively thwart.  I.e. the iPad
/ iPhone turning two spaces into a period space ready for new sentences.

> You know in type layout there are typically special things that happen 
> for paragraphs but not for newlines?

Nope.  I've done very little with typography / layout.

> You don't see any problem with overloading a pair of codes of one type, 
> to mean something else?

It depends on what the file format is.  If the file format expects 
everything to be a discrete (8-bit) word / byte, then yes, there can be 
complications ~> problems with requiring use of two.  If the file format 
does not make such expectations, and instead relies on chords of (8-bit) 
words / bytes, then I don't see any overt problem.  The biggest issue 
will be ensuring that chords are interpreted properly.

> Factors to consider:
> 
> - Ergonomics of typing. It _should_ be possible to directly type 
> reasonably typographically formatted text, with minimal keystrokes. One 
> can type html, but it's far from optimal.  There are many other 
> conventions. None arising from ASCII, because it lacks _everything_ 
> necessary.

I don't believe that the way that text is typed must be directly related 
to how it's stored and / or presented.

I'm curious what you've had difficulty with over the years related to
new lines, paragraphs, page breaks, formatting, etc.

> - Efficiency of the file/stream encoding. Allowing for infinitely 
> extensible character sets, embedded specifications of glyph appearances 
> (fonts), layout, and dynamic elements.

Yes.  This is part of a file format and not part of the character
encoding.  Files have been storing binary data that is completely
independent of the encoding for years.  Granted, transferring said
files between systems can run into complications.

> - Efficiency and complexity of code to deal with constructing, processing 
> and displaying texts.

I've LONG been a fan of paying it forward to make things easier to use 
in the long run.  I'd much rather have more complex code to make my day 
to day life easier and more convenient.

> Sure. Now you think of trying to construct a digital representation of 
> a printed work with historical significance. So it MUST NOT dynamically 
> reformat.  Otoh it might be a total simulation of a physical object/book, 
> page turn physics and all.

I would never try to digitally reproduce such a work in HTML.  I would 
copy contents into HTML so that it is easily accessible via HTML.  But I 
would never expect that representation to accurately reproduce the 
original work.

> Ha ha... consider how does the Tab function work in typewriters? What 
> does pressing a Tab key actually do?

Based on memory, the tab key advances to the next tab stop that the type 
writer / operator has configured.

Note:  The tab key itself has no knowledge of where said tab stop is 
located.

> ASCII has a Tab code, yes. It does NOT have other things required for 
> actual use of tabular columns.

The typewriter's tab key "does NOT have other things required for 
actual use of tabular columns" either.  Other parts of the typewriter do.

Similarly, the text editor has "other things required for actual use of
tabular columns".

> So, the Tab functionality is completely broken in ASCII.  That was 
> actually a really bad error on their part. They didn't need foresight, 
> they just goofed.

I disagree.  See above for why.

> Typewriters had working Tabs function since 1897.

I've been able to make actual use of tabular columns for years, if not
decades, on computers without any problem.

> Specifically, ASCII does not provide any explicit means to set and 
> clear an array of tabular positions (whether absolute or proportional.)

I disagree.

Again, ASCII is a way to encode characters and is independent of how 
those characters are used.

I could easily use Device Control 1 to tell a device that I'm sending it
information that it should use to program / configure a tab stop distance.

Quite similar to how the tab key on the typewriter does not tell the 
typewriter how far to advance.  Instead I have to use other keys / 
buttons / knobs on the typewriter to define where the tab stop should 
be.  The tab simply advances to the next tab stop.
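
To make that concrete, here is a Python sketch of a pretend device.
The protocol (DC1 followed by one byte holding the new tab stop
distance) is entirely hypothetical.  ASCII only says that DC1 exists,
not what a device does with it:

```python
DC1 = 0x11  # ASCII Device Control 1
TAB = 0x09  # ASCII Horizontal Tab


def render(stream: bytes, tab_width: int = 8) -> str:
    """Render a byte stream on a pretend device with settable tab stops."""
    out = []
    column = 0
    data = iter(stream)
    for byte in data:
        if byte == DC1:
            # Hypothetical convention: the next byte is the new distance.
            tab_width = next(data, tab_width)
        elif byte == TAB:
            # The tab itself only advances to the next configured stop.
            pad = tab_width - (column % tab_width)
            out.append(" " * pad)
            column += pad
        else:
            out.append(chr(byte))
            column += 1
    return "".join(out)


print(render(b"bob\ted"))
print(render(bytes([DC1, 32]) + b"bob\ted"))
```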

> Hence html has to implement tables, grid systems, etc. But it SHOULD be 
> possible to type columnar text (with tabs) exactly and as ergonomically 
> as one would on a typewriter.

First, HTML's white space compaction will disagree.  (For better or worse).

Second, tabs are 8 characters by convention.  A convention that can very 
easily be redefined.

As such, it's somewhere between impractical and impossible to rely on 
the following content to appear the same on a computer or typewriter 
without specifying what the tab stop is.  A tab stop of 32 will align 
things.  A tab stop of 8 will not.

bob<tab>ed
abcdefghijklmnopqrstuvwxyz<tab>0123456789

IMHO this is a flaw with the concept of tab and not the aforementioned 
implementations.
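
The two lines above can be checked with Python's str.expandtabs, which
pads each tab out to the next multiple of the given width:

```python
lines = ["bob\ted", "abcdefghijklmnopqrstuvwxyz\t0123456789"]


def second_column(width):
    """Column where the text after the tab starts, for each line."""
    return [
        line.expandtabs(width).index(line.split("\t")[1])
        for line in lines
    ]


print(second_column(8))
print(second_column(32))
```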

> Why would I be talking of the binary code of the tab character?

Your comment was about what is done when the character is encountered 
when the overarching discussion is about ASCII, which is a standard for 
how to encode characters, not what is done when a character is encountered.

> Sigh. You'll have to wait.

Fair enough.

> ASCII is not solely a 'character encoding scheme', since it also has the 
> control codes.  But those implement far less functionality than we need.

Sorry, when I said "character encoding scheme", I was meaning characters
and control codes.  Thus asking too much of the {character,control
code} encoding scheme.

> Now tell me why you think the fundamental coding standard, should not be 
> the same as used in file formats. You're used to those being different 
> things (since ASCII is missing so much), but it doesn't have to be so.

I think of ASCII as being a way to encode characters / control codes. 
Conversely I think of file formats as being a way to define how the 
encoded characters / control codes are used in concert with each other.

The file format builds on top of the character / control code encoding.

> There you go again, assuming 'styling' has no place in the base coding 
> scheme.

Correct.  I believe that styling is part of a file format, not the 
underlying character / control code encoding.

> You keep assuming that a basic coding scheme should contain nothing but 
> the common printable characters. Despite ASCII already containing more 
> than that.

No, I do not.  I can understand why you may think that.  Please allow me 
to clarify.

I believe that a basic coding scheme contains printable characters and 
control codes.

Sorry for omitting the "control codes", which are part of the defined 
ASCII standard.

> Also tell me why there should not be a printable character specifically 
> meaning "Start of comment" (and variants, line or block comments, 
> terminators, etc.)

I don't perceive a need for a control code that means "start of 
comment".  Much less the aforementioned variants.

I do perceive a need for a file format that uses characters and / or 
control codes to represent that.

> You are just used to doing it a traditional way, and not wondering if 
> there might be better ways.

Nope.  That's a false statement.

I have long pondered a file format that made it easy to structure 
text (what I typically deal with) such that it's easy to reference 
(include) part of it in other files.  I've stared at DocBook and a few 
other file formats that do include meta-data about the text such that 
things can be referenced.  All of the file formats that I looked at 
re-used ASCII.  But nothing stopped them from using their own binary 
encoding.  Much the way that I believe that Microsoft Word documents do.

Suffice it to say that I'm not just using the traditional methods.  I'm 
actively looking for and evaluating alternatives to see if they will 
work better for me than what I'm currently doing.

> You think that, because all your life you've been typing /* comment */ 
> or whatever.

No I have not.  I've been using different comment characters / chords of 
characters for years.

> In truth, the ASCII committee just forgot.

I disagree.

> Oh well.

I am willing to entertain discussions of the need of additional control 
characters.  But I expect such discussions to include why the file 
format can't re-use a different control character and why it's necessary 
to define another one.  (Think Word document's binary format.)

> You're going to need to wait a few years, till you see the end product.

Okay.

> That bit of text I quoted is a very, very brief points list. Detailed 
> discussion of all this stuff is elsewhere, and I _can't_ post it 
> now, since that would seriously damage the project's practical 
> potential. (Economic reasons.)

Fair enough.

> Column multiplier significance. That's a different thing from the nature 
> of '0' as a symbol.  At present there is no symbol meaning 'this is not 
> information.'

Why would there be a "this is not information" in a field that is 
specifically meant to hold information?

I can see a use for "this information is unavailable".  But null is 
typically used for that (and other things).

> Nevermind, it's difficult to grasp without a discussion of the 
> implications for very large mass storage device structure. And I'm not 
> going there now.

Okay.

That sounds like it borderlines on file systems and / or database 
structures.  Which I consider as being a higher layer than file format, 
used to differentiate different files / records.

> It wasn't then, but the lack of it is our problem now.

I disagree.

I don't think that this is a character / control code encoding problem.

I think this is a file format problem.

> UTF-8 is multi-byte binary, of a specific type. Just ONE type. No 
> extensibility.

I find the lack of extensibility questionable.  Granted I don't know 
much about UTF<anything>.  But I do think that I routinely see new 
characters / glyphs added.  So either it's got some modicum of 
extensibility, or people are simply defining previously undefined / 
unused code points.
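
As far as I understand it, it's the latter: Unicode assigns code
points, and UTF-8 is a fixed rule that turns any code point into one to
four bytes.  A quick Python look at that rule (the sample characters
are arbitrary picks of mine):

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

lengths = [len(ch.encode("utf-8")) for ch in samples]
```

Note that "A" comes out as the single byte 0x41, the same as in ASCII.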

> ??? Are you deliberately being obtuse?

No.

I'm saying that we have multiple ways to encode binary data (pictures, 
sound, programs, you name it) such that it can safely be represented in 
printable ASCII characters:

  · Base 16
  · Base 32
  · Base 64
  · UUEncode
  · Quoted-Printable

I'm sure there are more, but that's just what comes to mind at the moment.
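
All five are available in Python's standard library, so here is a
small sketch that round-trips the same four arbitrary bytes through
each.  The sample data is made up, and binascii.b2a_uu only handles one
line (at most 45 bytes) per call:

```python
import base64
import binascii
import quopri

data = bytes([0x00, 0xFF, 0x41, 0xC8])  # arbitrary binary sample

encoded = {
    "base16": base64.b16encode(data),
    "base32": base64.b32encode(data),
    "base64": base64.b64encode(data),
    "uuencode": binascii.b2a_uu(data),
    "quoted-printable": quopri.encodestring(data),
}
decoders = {
    "base16": base64.b16decode,
    "base32": base64.b32decode,
    "base64": base64.b64decode,
    "uuencode": binascii.a2b_uu,
    "quoted-printable": quopri.decodestring,
}

for name, blob in encoded.items():
    # Every encoded form is plain ASCII and decodes to the original.
    print(name, blob, decoders[name](blob) == data)
```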

MIME structures allow us to include multiple different types of content 
in the same printable ASCII file.
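
A short Python sketch using the standard email package.  The subject,
file name and payload are placeholders I made up:

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

payload = bytes(range(256))  # every possible byte value

msg = EmailMessage()
msg["Subject"] = "Mixed content"
msg.set_content("Plain text part.\n")
msg.add_attachment(
    payload,
    maintype="application",
    subtype="octet-stream",
    filename="blob.bin",
)

# The serialized message is what would go over the wire or into a file.
wire = msg.as_bytes()
print(wire.isascii())

# Parse it back and recover the binary attachment unchanged.
parsed = message_from_bytes(wire, policy=policy.default)
attachment = next(parsed.iter_attachments())
print(attachment.get_content() == payload)
```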

I've worked with Lotus Notes files (I don't remember the extension) that 
easily stored complex data structures.  Things like a document with 
embedded video, sound, programs, pictures, links to other files, other 
files themselves, could all easily be put into a single file.

> The point is to attempt to formulate a new standard that allows all this, 
> in one well defined, extensible way that permits all future potential 
> cases. We do know how to do this now.

I feel like the Lotus Notes file was extremely extensible.

But that's a file format, not a character / control code encoding scheme.

> No. People who do scan captures of documents will understand that. They 
> face the choice: keep the document as page images (can't text search), 
> or OCR'd text (losing the page's visual soul.)

My understanding is that there are multiple file types (formats) that 
can already do that.

I believe it's possible to include the document as page image -and- 
include OCR'd text so that it is searchable.

I feel confident that an Epub can include an HTML page that includes the
image(s) and ALT value on IMG tags.

I bet there are a number of other options too.

> But it should be possible to do BOTH,

Agreed.  I think that it is possible to do both.

> in one file structure

That sounds dangerously close to a file format.  Or a LOT closer to a
file format than a character / control code encoding scheme.

> if there was a defined way to link elements in the symbolic text to 
> words and characters in the images.

I believe there likely is, or can be.

I wonder if an image map would work in an Epub or Microsoft's HTML 
Archive files.

> You'll say 'this is file format territory.'

Yep.

> True…

:-)

> …at the moment,

What will change so that the file formats that exist today won't exist,
or won't be able to continue to do this, in the future?

Or why will what works today stop working in the future?

> but only because the basic coding scheme lacks any such capability.

Even if the new encoding scheme that you're working on (which you can't
talk about) does include these capabilities, that does not preclude
the current file formats from continuing to work in the future.

> You realise ASCII doesn't do that?

Sorry, I was talking within the context of ASCII.

I believe that any computer that uses ASCII (and doesn't do a 
translation internally) does represent a capital A as binary 01000001.
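
For what it's worth, a quick check in Python:

```python
letter = "A"
code = letter.encode("ascii")[0]  # the single byte ASCII assigns to "A"
bits = format(code, "08b")        # zero-padded to eight binary digits

print(letter, code, hex(code), bits)
```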

If that is not the case, please enlighten me.

> Something got lost there. '^W' ??

Sorry, ^W, is how unix geeks represent control w, which is a readline 
key sequence to erase the last word.

So I was effectively asking "What is a 2D plane?".
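
For anyone unfamiliar with the caret notation, the control code is the
named character with bit 0x40 flipped.  A small Python sketch (the
function name is mine):

```python
def caret(name):
    """Return the control character written in caret notation."""
    # '^W' -> ord('W') is 0x57; flipping bit 0x40 gives 0x17.
    return chr(ord(name[1].upper()) ^ 0x40)


ctrl_w = caret("^W")
print(hex(ord(ctrl_w)))
```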

> Surely you understand that point. English: left to right, secondary 
> flow: downwards.  Many other cultural variants exist.

Yes, I understand that English is primarily left to right, and 
secondarily top to bottom.  However, that is within a defined 2D plane. 
(Or page as I was calling it.)

My real point was to ask what defines a 2D plane (page)?  Is it its
size?  Is it how much text can fit on it?  What point size is the text
on it?

A 2D plane (page) is rather nebulous without context to define it.

> Huh? This is pretty random.

I was making a comparison to a defined 2D plane (page) that can hold a 
finite amount of information / text at a given point size.

I was then wondering if there was a similar definition of a unit for
oral speech.

> Not after ASCII became a standard - unless you were using a language that 
> needed more or different characters. ie most of the world's population.

EBCDIC is still quite popular, even here in the US.  Well, at least in 
IBM shops.  I hear that it's also popular in other mainframe shops 
around the world that want to interoperate with IBM mainframes.

Unicode / UTF-* are also gaining traction.

Thus I think the other encoding methods are making a difference.  ;-)

> Hah. In fact, the ability to represent unlimited-length numeric objects, 
> is one of the essentials of an adequate coding scheme. ASCII doesn't.

I disagree.  ASCII does just as well as numbers taught to kindergartners 
in the English speaking world.  Where the numbers are a collection of 
individual characters, 0 - 9.

Granted, that's not the same thing as a single word of memory holding a 
64-bit number.  But, humans don't have tens / hundreds / thousands of 
different numbers representing different values that are their own 
discrete character.   Instead, humans use different sets of digits to 
represent different values in different places.

> The whole 'x-bits long words' is one of the hangups of computing 
> architectures too.

Sure.  I think doing something like humans do might be more scalable. 
But then we could get by with probably 4 or 5 bit representations of 
numbers.  Binary Coded Decimal comes to mind.  }:-)
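
A minimal Python sketch of packed BCD, four bits per decimal digit and
two digits per byte.  Padding odd-length numbers with a leading zero
and handling only non-negative integers are assumptions I'm making to
keep it short:

```python
def to_bcd(number: int) -> bytes:
    """Pack a non-negative integer as BCD, two digits per byte."""
    digits = str(number)
    if len(digits) % 2:
        digits = "0" + digits  # pad to a whole number of bytes
    return bytes(
        int(digits[i]) << 4 | int(digits[i + 1])
        for i in range(0, len(digits), 2)
    )


def from_bcd(packed: bytes) -> int:
    """Unpack BCD bytes back into an integer."""
    return int("".join(f"{b >> 4}{b & 0x0F}" for b in packed))


big = 10**30 + 7  # far wider than a 64-bit word
print(to_bcd(1234).hex())
print(from_bcd(to_bcd(big)) == big)
```

The length of the number is limited only by how many bytes you are
willing to string together, much like digits on paper.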

> But that's another story.

Agree.

> You're describing Chinese language programming. Though you didn't realise. 
> And yes... :)  A capable encoding scheme, and computing architecture 
> built on it, would allow such a thing.

Something that is most decidedly outside of the scope of what ASCII was 
meant to solve.

> Point? Not practical.

It might not be practical for most day-to-day computing.  But I do think 
that there are merits to it for specific use cases.

> The coding scheme has to be compatible with the existing cultural schemes 
> and existing literature. (All of them.)

Why does the coding scheme have to be compatible?  Why can't it be
completely different as long as there is a way to translate between them?

> What began as my general interest in the evolution of information 
> encoding schemes, gained focus as more and more instances of early 
> mistakes became apparent. Eventually it spawned a deliberate project to 
> evaluate 'starting over.'

Therein lies some critical meta-data.  You have a purpose behind what
you're doing, which happens to seem related to deriding ASCII.

> Like this:
> 
> *  Revisit the development history of computing science, identifying 
> points at which, in hindsight, major conceptual shortcomings became 
> cemented into foundations upon which today's practices rest.
> 
> *  Evaluate how those conceptual pitfalls could have been avoided, 
> given understandings arrived at later in computing science.
> 
> *  Integrate all those improvements holistically, creating a virtual 
> 'alternate timeline' of computing evolution, as if Computing Science 
> had evolved with prescience of future conceptual advances and practical 
> needs. Aiming to arrive at an information processing and computing 
> architecture, that is what we'd already have now if we knew what we were 
> doing from the start.

Learning from others' mistakes is usually a good thing.

> Hey, we totally agree on something! I *HATE* PDF, and the Adobe 
> DRM-flyblown horse it rode in on.  When I scan tech documents, for lack 
> of anything more acceptable I structure the page images in html and wrap 
> as a RAR-book.  Unfortunately few know of this method.

~chuckle~

> There *was* at one point a freeware utility for deconstructing PDF 
> files and analysing their structure.  I forget the name just now. It 
> apparently was Borged by the forces of evil, and no longer can be found. 
> Anyone have a copy?

I've been able to get raw data out of PDFs before.  But it's usually so
badly broken that it's difficult if not impossible to make it practical
to use.  I'm usually better off just retyping what I want.

> No, they are not intrinsically different things. It just seems that way 
> from the viewpoint of convention because ASCII lacks so many structural 
> features that file (and stream) formats have to implement on their own. 
> (And so everyone does them differently.)

I disagree.

ASCII is a common way of encoding characters and control codes in the 
same binary pattern.

File formats are what collections of ASCII characters / control codes 
mean / do.

> Ha, wait till (eventually - if ever) you see the real thing.  I'm having 
> lots of fun with it. Result is like 'alien tech.'

Please don't blame me for not holding my breath.

> Soon. Few weeks. Got to get some stuff out of the way first. I have way 
> too many projects.

:-)

-- 
Grant. . . .
unix || die