Does C depend on ASCII? - (nf)

rpw3 at fortune.UUCP
Sun May 6 21:06:18 AEST 1984


#R:utcsstat:-187300:fortune:26900056:000:2297
fortune!rpw3    May  6 03:13:00 1984

[ After this, let's move this to "net.lang.c", shall we? ]

Many, many programs I have seen depend on certain characteristics of ASCII,
though how much of the total sequence is wired in varies from program to
program. This has GOT to be a major factor in the cost of porting UNIX to
a non-ASCII machine. Most of what I have seen includes at least the
following hard dependencies:

1. The digits are contiguous (no gaps).

	Kernighan & Ritchie [pp20-21]:
   	"This particular program relies heavily on the properties of the
	character representation of digits. For example, the test

		if (c >= '0' && c <= '9') ...

   	determines whether the character in "c" is a digit. If it is, the
	numeric value of that digit is

		c - '0'

   	This works only if '0', '1', etc., are positive and in increasing
	order, and if there is nothing but digits between '0' and '9'.
	Fortunately, this is true for all conventional character sets."

   Note particularly the word "all" in that last sentence. Again [page 39],
   in the sample "atoi(s)", the same assumption is made.

2. The lowercase letters (as a class) are contiguous, as are the uppers.
   Some programs know that 'A' + 040 == 'a', some don't. Some only depend
   on 'a' > 'A' (so that 'x' - 'X' is a positive number).
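
To make points 1 and 2 concrete, here is a minimal sketch (not taken from any
particular program) that contrasts the wired-in ASCII arithmetic with the
<ctype.h> equivalents. The names digit_value, ascii_tolower, and
portable_tolower are made up for illustration. Later ANSI/ISO C standards
guarantee that '0' through '9' are contiguous, but they make no such promise
for letters; in EBCDIC, for instance, the letters fall into several separate
runs and the lowercase letters sort below the uppercase ones.

```c
#include <ctype.h>

/* Hypothetical helper: numeric value of a digit character, or -1.
 * Relies on '0'..'9' being contiguous and in increasing order. */
int digit_value(int c)
{
    return isdigit(c) ? c - '0' : -1;
}

/* ASCII-only: assumes 'A' + 040 == 'a' and that 'A'..'Z' has no gaps. */
int ascii_tolower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + 040 : c;
}

/* Leaves the mapping to the C library instead of wiring in the offset. */
int portable_tolower(int c)
{
    return isupper(c) ? tolower(c) : c;
}
```

On an ASCII machine the last two functions agree for every letter; on
anything else, only the second can be expected to behave.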

Interestingly, most of the programs I have seen DON'T assume any fixed
distance between '9' and 'A'. When converting hexadecimal input, they adjust
for letters by subtracting 'A' - ('9' + 1) from the value of the letter, so
that 'A' lands just after '9' and can then be treated like a digit.
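
The hexadecimal adjustment described above might look something like the
first function below. It is only a sketch, and the names hex_value_offset and
hex_value_table are invented for this example. The first version still
assumes that 'A' through 'F' are contiguous; the second uses the position in
a string as the value, so it assumes nothing about ordering.

```c
#include <string.h>

/* The idiom described above: close the gap between '9' and 'A',
 * then treat the result like a digit. Assumes 'A'..'F' are contiguous. */
int hex_value_offset(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'A' && c <= 'F')
        return c - ('A' - ('9' + 1)) - '0';
    return -1;
}

/* Table lookup: the index in the string is the value, so no ordering
 * of the character set is assumed. Uppercase only, for brevity. */
int hex_value_table(int c)
{
    static const char digits[] = "0123456789ABCDEF";
    const char *p;

    if (c == '\0')
        return -1;      /* strchr would otherwise match the terminator */
    p = strchr(digits, c);
    return p ? (int)(p - digits) : -1;
}
```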

3. The ASCII control characters exist, and have values of 'X' - 0100 for
   any control character <^X> (where 'X' is the upper-case letter of similar
   appearance). It is known (for example) that newline == '\n' == 'J' - 0100,
   and that 'H' - 0100 is a backspace.
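
As a rough illustration of point 3, the sketch below checks whether the
'X' - 0100 arithmetic agrees with the C escape sequences on the machine at
hand. ASCII_CTRL and ctrl_matches_escapes are names invented for this
example. Writing '\n' or '\b' directly leaves the mapping to the compiler,
whereas the macro wires in the ASCII layout.

```c
/* ASCII-only: maps an uppercase letter to its control character. */
#define ASCII_CTRL(x) ((x) - 0100)

/* Returns 1 if the ASCII arithmetic agrees with the C escape
 * sequences on this machine, 0 otherwise. */
int ctrl_matches_escapes(void)
{
    return ASCII_CTRL('J') == '\n'
        && ASCII_CTRL('H') == '\b'
        && ASCII_CTRL('I') == '\t';
}
```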

In sum, many programs assume ASCII, or at least, certain properties of the
collating sequence. The ones mentioned above are certainly not a complete
list of what you may find when trying to use another character set, but they
are a few "biggies". The use of "_ctype[]" can help, but many programs do not
use it with consistency.
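
For what it is worth, here is a small sketch of what consistent use of the
<ctype.h> classification macros can look like, in place of hand-written range
tests such as c >= 'a' && c <= 'z'. The function ident_length is a made-up
example, not something from an existing program.

```c
#include <ctype.h>

/* Hypothetical example: length of the identifier at the start of s
 * (a letter or '_' followed by letters, digits, or '_'), else 0. */
int ident_length(const char *s)
{
    int n = 0;

    if (!isalpha((unsigned char)s[0]) && s[0] != '_')
        return 0;
    while (isalnum((unsigned char)s[n]) || s[n] == '_')
        n++;
    return n;
}
```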

Sorry 'bout that...

Rob Warnock

UUCP:	{ihnp4,ucbvax!amd70,hpda,harpo,sri-unix,allegra}!fortune!rpw3
DDD:	(415)595-8444
USPS:	Fortune Systems Corp, 101 Twin Dolphin Drive, Redwood City, CA 94065
