How to write Trigraph like character sequences in a string

Doug Gwyn gwyn at smoke.brl.mil
Wed Jun 5 18:16:38 AEST 1991


In article <1991Jun5.005958.9597 at tkou02.enet.dec.com> diamond at jit533.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>Does this mean that in a national character set that doesn't have [|\ etc.,
>and a proposed implementation accepts the entire national character set
>plus trigraphs, failure to support [|\ etc. through direct means without
>trigraphs will make the implementation non-conforming?

The C source character set, in terms of the standard, has no necessary
relation to the code set used for "the national character set", although
in many current implementations there happens to be a close relationship.
The standard says the specified set of C source characters, which are the
result of conversion from external representations, must be supported,
but it doesn't specify details of the conversion.  This is independent
of the trigraph mapping that occurs at a slightly later phase of translation.

You might recall the "Software Tools" Ratfor translator implementation;
it converted external source file characters, which might for example be
coded in EBCDIC, to a universal internal form (which happened to be ASCII
in that example), to take advantage of the known properties of the internal
representation (e.g. contiguity of the alphabetic character codes) for fast
processing, and converted back to external characters upon output (it was a
text-to-text format translator, not a true compiler).  C compiler
implementations [that don't need to provide support for Japanese-style
multibyte character encodings in C source files] could readily map whatever
the site-specific conventions are for external representation of funny
characters (traditionally displayed as vertical-bar glyph, etc.) used in
C source code to internal, more convenient (probably 7-bit code) form for
subsequent processing.  Such conventions are not the business of any C
standard; they necessarily depend on highly site-dependent characteristics.

>I thought that the exact opposite of this was previously decided, that
>trigraphs were sufficient in such cases.

No, and trigraphs are a horrible invention that seems to mainly serve to
lead people in the wrong direction for coping with character set problems.
While trigraphs may at first appear to "solve" the code set issue by
permitting translation of any strictly conforming C source to a "lowest
common denominator" set of characters that even ISO 646 sites claim to
support, accomplishing this requires utilities for translating generic C
into fully trigraphed form, as well as the inter-code-set text file
transfer and translation facilities that are always necessary for data
interchange.  The latter have to solve the
differing-code-set issues anyway.  (Note that there are radically
different code sets among the sites that receive this newsgroup; to a
large extent similar problems have already been overcome in the
development of internetworking.)

If I may add a general observation about code set issues, particularly
multibyte encodings:  It seems to me that the people designing software
facilities, hardware, and standards concerning these issues generally fail
to appreciate a crucial design point:  The sooner you can map everything
into a uniform format with simple, clean properties, the better off you
are.  Instead, we keep seeing designs that require the users of the
services to face algorithmic complexity, because the data being operated
upon has been left in a complex encoded form instead of being turned into
the previously mentioned uniform format with nice properties.

Algorithms naturally reflect the underlying structure of the data.  If
you'd like to be able to code programs that deal with text in a simple
manner, as seen in early UNIX utilities such as "wc", you need to keep
the form in which text is seen by program code as simple as possible;
for example, all text characters must be handled as one "character"
type, a complete unit of which would be returned per call to getchar(),
obviating the need for wchar_t and the (rapidly growing) library of
functions for helping applications deal with nonunitized, fragmented,
and stateful characters.

This would mean that in some environments a 16-bit datum would be
required for representing a single character, but we ended up with that
anyway, in the form of wchar_t, without the benefits of a simple program
interface to text units.  Since the character problems have to do with
people, their details should be pushed as far out from the application
(thus as close to the users) as possible.  I think X3J11, prompted by
certain vendors who were already committed to complicated solutions,
made the wrong choice here.  I would hope that other computer engineers
learn from this example how not to "solve" such problems.



More information about the Comp.std.c mailing list