Text encoding Babel. Was Re: George Keremedjiev

Fred Cisin cisin at xenosoft.com
Sun Nov 25 19:34:36 CST 2018


On Mon, 26 Nov 2018, Tomasz Rola via cctalk wrote:
> To supply this train of thought with some numbers:
>
> - my copy of Common Lisp HyperSpec claims 978 symbols (i.e. words) on
>   its alphabetical index; many words have modifiers (a.k.a. keyword
>   options, with default values) which increases the number at least
>   twofold, IMHO, if one agrees that each combo should be counted as
>   different word, to which I would say yes
>
> - I have read somewhere that Japanese pupil after graduating from
>   elementary school is supposed to know 1000 kanjis by heart (there
>   is a standardised set, I have a book)

Would those "modifiers of words" qualify as ADJECTIVES?


The Japanese phonetic alphabets, Katakana and Hiragana, have 46 letters 
each, almost twice that with diacritics.
I have heard that Japanese Kanji has more than 50,000 words/characters 
(which would fit in the 65,536 values of 16 bits, but that would be a 
little risky), but that 1100 to 2000 of them cover most common usage.  
Wikipedia says that as of 2010, the student requirement is 2136.

Japanese Kanji and Chinese have substantial overlap, but there is no way 
that you could squeeze both into 16 bits, without leaving out important 
stuff.

Therefore, for use with current computers, 32 bits would be needed.
Some games can be played with mixing sizes, by doing things like using the 
high bit as a flag, for 128 7-bit characters, plus 32768 15-bit characters, 
and 2147483648 31-bit characters.
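
A rough Python sketch of that kind of size mixing, under my own 
assumptions rather than any existing standard: a single flag bit can only 
tell two sizes apart, so to distinguish three sizes in one byte stream 
this version spends one prefix bit on the 1-byte form (0) and two on the 
longer forms (10 and 11).  That leaves 7, 14, and 30 payload bits instead 
of 7, 15, and 31.  The names encode and decode are hypothetical.

```python
def encode(code):
    """Pack one non-negative character number into 1, 2, or 4 bytes."""
    if code < 0:
        raise ValueError("character number must be non-negative")
    if code < (1 << 7):
        # 0xxxxxxx : high bit clear, 7 payload bits
        return bytes([code])
    if code < (1 << 14):
        # 10xxxxxx xxxxxxxx : 14 payload bits (assumed layout)
        return (0x8000 | code).to_bytes(2, "big")
    if code < (1 << 30):
        # 11xxxxxx + 3 more bytes : 30 payload bits (assumed layout)
        return (0xC0000000 | code).to_bytes(4, "big")
    raise ValueError("character number does not fit in 30 bits")


def decode(data):
    """Unpack a byte string produced by encode() into a list of numbers."""
    out = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:        # prefix 0  -> 1 byte
            size, mask = 1, 0x7F
        elif first < 0xC0:      # prefix 10 -> 2 bytes
            size, mask = 2, 0x3FFF
        else:                   # prefix 11 -> 4 bytes
            size, mask = 4, 0x3FFFFFFF
        chunk = data[i:i + size]
        if len(chunk) < size:
            raise ValueError("truncated input")
        # Strip the prefix bits to recover the character number.
        out.append(int.from_bytes(chunk, "big") & mask)
        i += size
    return out
```

With this layout, plain ASCII stays at one byte per character, anything 
numbered below 16384 takes two bytes, and everything else takes four.  
The cost is that the text is no longer fixed-width, so finding the Nth 
character means scanning from the start.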


