trigraphs in X3J11

Fri May 20 11:24:25 AEST 1988

I've talked to a few people about this, but I'd like to see if there's
more info floating around.

Background: "Trigraphs" in dpANS C are a way of avoiding the problems of
character-set restrictions, by introducing 3-character replacements for
those characters which are required for C but do not exist in the invariant
subset of the ISO 646 7-bit set.  For example, if your character set
doesn't have braces {}, you can use ??< and ??> to denote them.  The
behavior is as if trigraphs were replaced by the corresponding single
characters in a prepass to the compiler, *including* replacement within
strings.  All trigraphs begin with "??".
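
For concreteness, here is a rough sketch of what such a prepass amounts to.
It is an illustration only, not text from the draft: the function names
trigraph_value and replace_trigraphs are made up, and it works on one
in-memory string rather than on a source file.  The nine mappings are the
ones the draft defines.

```c
#include <assert.h>
#include <stddef.h>

/* Map the third character of a candidate trigraph to its replacement.
   Returns 0 when the character does not complete a trigraph. */
static char trigraph_value(char third)
{
    switch (third) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Hypothetical helper: copy src to dst, replacing trigraphs in a single
   left-to-right scan.  dst needs room for strlen(src) + 1 bytes, since
   the result is never longer than the input.  Returns the length of the
   result.  Note that it pays no attention to quotes: text inside a
   string literal is replaced like everything else. */
size_t replace_trigraphs(const char *src, char *dst)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        char r = 0;

        /* src[1] and src[2] are only examined after the preceding
           character is known to be a question mark, so the scan never
           reads past the terminating NUL. */
        if (src[0] == '?' && src[1] == '?')
            r = trigraph_value(src[2]);
        if (r != 0) {
            dst[n++] = r;
            src += 3;
        } else {
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

Anyone trying this out should spell question marks in their own test
strings as \? (an escape the draft provides), so that the compiler's own
trigraph handling does not get to the test data first.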

The draft standard seems to be written in such a way that a compiler MUST
accept these trigraph sequences.  I'm perplexed on a couple of points here.

1.  Replacement within strings:  This is a change to the existing language.
    It breaks existing programs.  I looked through existing source code
    that we have here and found several programs which get broken or
    significantly altered.  Here's an example--sanitized, but typical of
    what can happen.  Suppose you now have:
	printf("bad status ??<%x>??--device %n\n", st, dev);
    What you're going to get, according to the draft standard, is something
    that has the effect of:
	printf("bad status {%x>~-device %n\n", st, dev);
    Point:  The sequence "??" is not at all rare.  Why was it chosen as the
    introducer?  (I think people who start getting messages about using
    `/dev/tty^ where the source says `/dev/tty??' are going to be confused.)

    Note also that it is common practice to use "?" in initializing strings
    where the "?" positions will be replaced at execution time.  Pity the
    poor programmer who sets up something like:
	char	ta[] = "/tmp/d?????/a",   tb[] = "/tmp/d?????/b";
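    and discovers (eventually) that these strings are each shorter than
    they used to be:  two characters go when ??/ becomes a backslash, and a
    third when that backslash swallows the following letter as an escape.
    If he tries to replace the ?s, he'll write off the ends of the strings!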

    NOW, before you light 'em up and blast me, YES, I realize it's a hard
    problem.  There aren't many safe character sequences to use--and YES, I
    know that you can't use backslash because that's one of the possibly-
    missing characters.  What I don't understand is why it was decided to
    introduce a brand-new (I assume) mechanism which breaks existing code.
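
    To put numbers on the /tmp/d????? example above, here is a hand-worked
    sketch (not compiler output).  The first array is what the programmer
    meant; the second is what a trigraph-replacing compiler would store,
    on the assumption that replacement happens before escape sequences are
    interpreted, as the draft's prepass wording implies.  Question marks
    are written as \? so that the sketch itself contains no trigraphs.  The
    names intended, replaced, and fill_template are invented for the
    illustration.

```c
#include <assert.h>
#include <string.h>

/* What the programmer meant: five placeholder characters, 13 in all. */
static char intended[] = "/tmp/d\?\?\?\?\?/a";

/* What survives replacement: the last two question marks and the slash
   collapse into a backslash, which then joins the 'a' to form the alert
   escape.  Three placeholders and one control character remain, 10 in
   all. */
static char replaced[] = "/tmp/d\?\?\?\a";

/* Hypothetical helper: overwrite the five placeholders that start at
   offset 6, but only if doing so leaves the terminating NUL intact.
   Returns 1 on success, 0 if the buffer is too short. */
int fill_template(char *buf, size_t bufsize, const char *digits)
{
    assert(strlen(digits) == 5);
    if (bufsize < 6 + 5 + 1)
        return 0;
    memcpy(buf + 6, digits, 5);
    return 1;
}
```

    With these definitions sizeof intended is 14 and sizeof replaced is 11.
    Filling the intended array works; the replaced one is refused, because
    an unchecked five-character write at offset 6 would land on its
    terminating NUL.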

2.  Replacement in program text:  My philosophical objections to
    replacement of trigraphs within a program are much less...but I wonder
    who might ever use them.  Is there any precedent for these sequences?
    Is there any reason to think they'll be used?  Let's take another
    (slightly contrived but realistic) example here--I'll construct a
    piece of code which says, roughly, "If the first character of `line'
    is a sharp or percent, call function prepro to handle the rest of the
    line, then increment linect".  We would now write this as:

	if (line[0]=='#' || line[0]=='%') {
		prepro(&line[1]);
		linect++;
	}

    Replacing all the nasty characters with corresponding trigraphs gives:

	if (line??(0??)=='??=' ??!??! line??(0??)=='%') ??<
		prepro(&line??(1??));
		linect++;
	??>

    I submit that this will produce code which is so near to unreadable
    that there is virtually no prospect of the mechanism ever seeing
    significant use.  If you believe that, you have to wonder why every
    standard compiler should have to carry the extra baggage.  If you don't
    believe that, I'd like to see some real evidence to show that
    programmers might use it.

A general question:  Has the trigraph mechanism been tried out, in real
practice, anywhere prior to the introduction in X3J11?  If so, I'd like to
hear about how it's worked out.
-- 
Dick Dunn      UUCP: {ncar,cbosgd,nbires}!ico!rcd       (303)449-2870
   ...Never attribute to malice what can be adequately explained by stupidity.