Character Sets (was Re: trigraphs)

Steve Hosgood iiit-sh at cybaswan.UUCP
Sat Apr 29 09:02:51 AEST 1989


Several people have been talking about Trigraphs recently. Danes, Swedes,
Icelanders and others have discussed at length whether or not they (the
potential beneficiaries of such a scheme) actually *want* or *need* the damn
things anyway.

Now IMHO, we're seeing here the consequences of restricting the world's
computer users to a 7-bit coding system originally designed just for American
English. Surely it would be better for ANSI to formally scrap the concept of
7-bit coding and move to better things? As I understand it, the reason for
7 bits in the old days was so that the character and a parity bit would fit
into a byte. These days though, as far as I know, all ACIA chips will happily
send 8 bits and parity - though most people disable the parity anyway!


I've got an article in front of me from Scientific American in 1983(ish),
though I don't know the exact date as it's a photocopy. Anyway, it's pages
82 thru' 93 and written by Joseph D. Becker of Xerox Corporation, and is
entitled "Multilingual Word Processing". It seems a lot of work has been
done on beating the problems of handling the world's languages by means of
switching of character sets. Xerox seem chiefly interested in word processing,
but it's obvious that the same ideas could be used in E-mail, and presumably
in programming-language source code as well.

[ ** in case you didn't see the article **
The idea is that you define 8-bit alphabets, and reserve the character 0xFF
to indicate "next byte is an alphabet identifier". This allows you to switch
from one character set to another in mid-text very easily. I get the feeling
that the alphabets are designed to have shared sections, so that the codes
0x00 thru 0x7F print the same in the 'Roman/Hebrew' set as they do in the
'Roman/Esperanto' set for instance. Obviously the several alphabets needed for
Chinese will not have any commonality with the Roman stuff though.
]
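
A rough C sketch of the shift mechanism described above might make it more
concrete. It assumes only what the summary says (0xFF means "next byte is an
alphabet identifier"). The names decode_shifted and tagged_char, and the
alphabet numbers used, are invented for illustration and are not taken from
the Xerox scheme.

```c
#include <assert.h>
#include <stddef.h>

/* Reserved byte meaning "next byte is an alphabet identifier". */
#define ALPHABET_SHIFT 0xFF

/* One decoded character: the alphabet in force and the code within it.
 * (Hypothetical type, for illustration only.) */
struct tagged_char {
    unsigned char alphabet;
    unsigned char code;
};

/*
 * Decode len bytes from in into at most max_out tagged characters,
 * starting in initial_alphabet.  Returns the number of characters written.
 * A trailing shift byte with no identifier after it is dropped.
 */
size_t decode_shifted(const unsigned char *in, size_t len,
                      unsigned char initial_alphabet,
                      struct tagged_char *out, size_t max_out)
{
    unsigned char current = initial_alphabet;
    size_t i = 0;
    size_t n = 0;

    assert(in != NULL && out != NULL);

    while (i < len && n < max_out) {
        if (in[i] == ALPHABET_SHIFT) {
            if (i + 1 >= len)
                break;               /* truncated shift sequence */
            current = in[i + 1];     /* switch alphabet, emit nothing */
            i += 2;
        } else {
            out[n].alphabet = current;
            out[n].code = in[i];
            n++;
            i++;
        }
    }
    return n;
}
```

For example, the bytes 'a', 0xFF, 0x02, 0x41 would decode to 'a' in the
initial alphabet followed by code 0x41 in alphabet 2. One consequence of the
design is that 0xFF can never be an ordinary character in any alphabet.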

I don't think you'd have to go as far as switched character sets to solve the
problem of dealing with *most* of the Northern European and North American
languages. Just look at the IBM-PC character set, for instance. However, it
would be nice to think ahead a bit and allow for the Greeks, Russians, Chinese
and Japanese.

The result of moving in this direction would be that people with old Danish
terminals would see the unrepresentable characters on screen as trigraphs, and
would type them as such, but the trigraphs would be purely a local product of
the computer's TTY handler. What would appear in the source-code file would be
the 8-bit Northern Europe/USA code for '{' or whatever they wanted. If someone
in the USA wanted to use a 'yen' symbol, they'd have to type a trigraph for it,
which would cause an alphabet-shift code to appear in the source file to cater
for it. Someone in Japan reading that file would just see a 'yen' symbol.
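
The input half of that idea could look something like the following C sketch:
a filter of the sort a TTY handler might run over typed text, folding each
trigraph into a single character before it reaches the file. The function name
fold_trigraphs is hypothetical, the table is just the nine trigraphs from the
ANSI C draft, and plain ASCII codes stand in for whatever 8-bit set would
really be used. The alphabet-shift case (the 'yen' example) is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy len characters from in to out, replacing each three-character
 * trigraph (two question marks plus a selector) with the single character
 * it stands for.  out must have room for at least len characters.
 * Returns the number of characters written.
 */
size_t fold_trigraphs(const char *in, size_t len, char *out)
{
    /* Selector characters and their replacements, kept in two parallel
     * tables so that no literal trigraph appears in this source file. */
    static const char third[]  = "=(/)'<!>-";
    static const char single[] = "#[\\]^{|}~";
    size_t i = 0;
    size_t n = 0;

    assert(in != NULL && out != NULL);

    while (i < len) {
        const char *hit = NULL;

        if (i + 2 < len && in[i] == '?' && in[i + 1] == '?'
            && in[i + 2] != '\0')
            hit = strchr(third, in[i + 2]);

        if (hit != NULL) {
            out[n++] = single[hit - third];  /* fold three chars into one */
            i += 3;
        } else {
            out[n++] = in[i++];              /* ordinary character */
        }
    }
    return n;
}
```

The output direction (showing '{' as a trigraph on a terminal that cannot
display it) would be the same table used in reverse.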

OK, well it's *far* too late for such ideas to be submitted to X3J11 now, but
did anyone mention it in the early days, *before* it was too late?
Actually, it's not an X3J11 problem if you put responsibility for trigraphs
into the TTY handler. Whose problem would it be?

-----------------------------------------------+------------------------------
Steve Hosgood BSc,                             | Phone (+44) 792 295213
Image Processing and Systems Engineer,         | Fax (+44) 792 295532
Institute for Industrial Information Techology,| Telex 48149
Innovation Centre, University of Wales, +------+ JANET: iiit-sh at uk.ac.swan.pyr
Swansea SA2 8PP                         | UUCP: ..!ukc!cybaswan.UUCP!iiit-sh
----------------------------------------+-------------------------------------
            My views are not necessarily those of my employers!
	"Traditional Japanese Theatre? Just say Noh" - not Nancy Reagan


