ANSI C -- trigraphs and character sets

Mon Dec 15 02:15:42 AEST 1986

This is one of a collection of comments on the Draft Standard, posted to
comp.lang.c for discussion before I mail a final draft to the ANSI C
committee.  Each message discusses one problem I have found with the Draft
Standard that I feel warrants a "no" vote.  Note that this message is my
personal opinion, and does not reflect on the opinions of my employer.

---- Problem:

Page 10, line 1ff. The Standard should recognize the primacy of the ISO
Latin 1 character set.

Page 10, line 34ff. Trigraphs should be deleted from the standard.

---- Motivation:

Page 10, line 1ff.  The character set should be defined in terms of ISO
Latin 1 (ISO 8859/1, ANSI X3.134.1, ECMA-94).  While other character sets
may be used, they should be defined with reference to this standard.
Latin 1 contains representations for the accented characters needed for
many European languages.  These representations do not conflict with the
characters, such as backslash, that are needed for C syntax.  The standard
should permit the use of accented characters (positions 12/0 through 15/15)
in variable names, while noting that such names may be non-portable and
without requiring a conforming compiler to accept them.  It should also
require acceptance
of all 255 characters in strings.  (Some existing compilers use the 0x80
bit to mark variable substitution in the preprocessor.)  A reasonable
extension, but not one that I would mandate, would be to accept the Latin 1
multiply and divide signs as equivalents to '*' and '/' and the raised
dot as equivalent to period in numeric quantities.
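
A minimal sketch, in C, of the position arithmetic used above.  Column/row
notation such as 12/0 maps to the byte value column*16 + row, so 12/0 is
0xC0 and 15/15 is 0xFF.  The function names are invented for illustration
and appear in no draft; the code assumes 8-bit bytes.  Note that the range
12/0 through 15/15 also holds the multiply sign (13/7) and divide sign
(15/7), which the sketch excludes from the letters.

```c
#include <assert.h>

/* Convert column/row notation (for example 12/0) to a byte value. */
unsigned latin1_position(unsigned column, unsigned row)
{
    assert(column < 16 && row < 16);
    return column * 16 + row;
}

/* Hypothetical helper: nonzero if c is a Latin 1 letter in positions
 * 12/0 through 15/15.  The multiply sign (0xD7) and the divide sign
 * (0xF7) sit in that range but are not letters. */
int is_latin1_letter(unsigned char c)
{
    return c >= 0xC0 && c != 0xD7 && c != 0xF7;
}

/* Hypothetical helper for the optional extension: map the Latin 1
 * multiply sign, divide sign and raised dot to the ASCII characters
 * they would stand for; every other byte is returned unchanged. */
int latin1_operator_equivalent(unsigned char c)
{
    switch (c) {
    case 0xD7: return '*';   /* 13/7, multiply sign */
    case 0xF7: return '/';   /* 15/7, divide sign   */
    case 0xB7: return '.';   /* 11/7, raised dot    */
    default:   return c;
    }
}
```

A compiler that accepted accented letters in names could consult something
like is_latin1_letter() alongside its usual test for letters, digits and
underscore.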

Page 10, line 34ff.  Trigraphs were added to the standard in order to
accommodate European users who currently use the character set positions
occupied by # [ \ ] ^ { | } ~.  A better solution is offered by the Latin 1
alphabet, which consists of the USASCII 7-bit alphabet augmented by a 128
byte character set containing the ``special'' letters used by most European
countries.  This standard was prepared jointly by ANSI, ISO, and ECMA
(the European Computer Manufacturers Association).
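
For readers who have not met trigraphs, here is a rough sketch in C of the
replacement the Draft asks a compiler to perform: `??=' `??(' `??/' `??)'
`??'' `??<' `??!' `??>' `??-' stand for # [ \ ] ^ { | } ~ respectively.
The function name and interface are my own invention, not anything in the
Draft.  The source avoids spelling a trigraph literally, since a compiler
that honors trigraphs would rewrite it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `in' to `out', replacing each trigraph by its single character.
 * `out' must have room for strlen(in) + 1 bytes.  Returns the length of
 * the result.  Hypothetical interface, for illustration only. */
size_t replace_trigraphs(const char *in, char *out)
{
    /* third[i] is the character following the two question marks;
     * single[i] is the character the trigraph stands for. */
    static const char third[]  = "=(/)'<!>-";
    static const char single[] = "#[\\]^{|}~";
    size_t n = 0;

    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        const char *hit = NULL;

        /* Guard against the terminator: strchr() would match it. */
        if (in[0] == '?' && in[1] == '?' && in[2] != '\0')
            hit = strchr(third, in[2]);

        if (hit != NULL) {
            out[n++] = single[hit - third];
            in += 3;
        } else {
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

The scan is strictly left to right, so three question marks followed by
`=' yield `?#': the first question mark does not begin a trigraph, the
next two do.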

During the transitional period, users of existing equipment that supports
national letters are better served by implementation-specific conversion
routines that are external to the C language. These would compose multi-byte
sequences into Latin 1 and display Latin 1 characters (using either the
representations available on the terminal or fallback composition sequences).

The composition process would be external to, and independent of, the C
language.  An implementation may provide it through a #pragma.
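
As a sketch of what such an external conversion routine might look like,
the C fragment below maps bytes from a 7-bit Swedish national character
set into Latin 1 before the compiler sees them.  The function names are
hypothetical.  The table assumes that the Swedish set places the capital
and small letters A-diaeresis, O-diaeresis and A-ring in the positions
USASCII gives to [ \ ] and { | }; treat that layout as an assumption to be
checked against the terminal's documentation.  Producing a real backslash
or brace would need a separate multi-byte composition sequence, which is
not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Map one byte of the (assumed) Swedish 7-bit set to Latin 1.
 * Bytes with no national meaning pass through unchanged. */
unsigned char swedish7_to_latin1(unsigned char c)
{
    switch (c) {
    case 0x5B: return 0xC4;  /* 5/11: capital A with diaeresis */
    case 0x5C: return 0xD6;  /* 5/12: capital O with diaeresis */
    case 0x5D: return 0xC5;  /* 5/13: capital A with ring      */
    case 0x7B: return 0xE4;  /* 7/11: small a with diaeresis   */
    case 0x7C: return 0xF6;  /* 7/12: small o with diaeresis   */
    case 0x7D: return 0xE5;  /* 7/13: small a with ring        */
    default:   return c;
    }
}

/* Convert a buffer in place; this would run as a filter ahead of the
 * compiler rather than inside it. */
void latin1_from_swedish7(unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++)
        buf[i] = swedish7_to_latin1(buf[i]);
}
```

Because the mapping is fixed per terminal type, it can live in a filter or
terminal driver and the compiler never needs to know which national set
was in use.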

Note that the standard does not offer the implementor guidance in handling
programs that mix trigraph sequences and national letters.  As stated, it
is clear that the sequence `??/' functions as a backslash.  However, it is
not clear how the compiler is to treat an input character (assuming 7-bit
ASCII) in position 5/12 (having decimal value 92).  Is this also a backslash,
or is it a national letter (such as the Swedish capital 'O' with two dots)?

----

Martin Minow
decvax!minow