Standard C Digest - V2 #2

Sun Jan 6 13:41:36 AEST 1985

ANSI Draft of Proposed C Language Std.

Mail your replies to the author(s) below or to cbosgd!std-c.
Cbosgd is reachable via most of the USENET nodes, including ihnp4,
ucbvax, decvax, hou3c....  Administrivia should be mailed to 
cbosgd!std-c-request.

ARPA -> mail to cbosgd!std-c at BERKELEY.ARPA (NOT INFO-C)

**************** mod.std.c  Vol. 2 No. 2  1/5/85 ********************

Today's Topics:

		More Comments on the X3J11/84-161 Draft (1)
----------------------------------------------------------------------

Date: Fri, 21 Dec 84 13:12:34 pst
From: cbosgd!ucbvax!ucsfcgl!arnold (Ken Arnold)
Subject: comments on the X3J11/84-161 draft

I have received my copy of X3J11/84-161 (Draft Proposed Standard for
C), and, unfortunately, am about to run off home for the holidays.
However, I have read sections A through C, and am halfway through D.  I
have several comments, but haven't yet the time to compose them into a
reasonable document.  However, I have found three things in the C
standard which break many pieces of code.  At least two of these have
been discussed in net.lang.c, but since this is the place to discuss
the standard (and we were begged to send comments), I thought I would
get these together here.  Trust me, this is not the last you will hear
from me on the standard.  However, at the very least, things which
break code should not be put in the standard.

*Problem:	Breaks Code
*Reference:	C.8.2 Macro Replacement; Semantics (p. 60)
*Description:

	"Character constants and string literals in the token sequence
	[of a macro] or in the rest of the program are not scanned for
	macro names or formal parameters."

Not scanning strings in the program for macro names is proper and
current.  However, not scanning strings in the token sequence in the
macro definition will break several current programs, including the
4.2bsd operating system (cf. "CTRL" in <sys/ttychars.h>).  I suggest,
as proper wording:

	"Character constants and string literals in the token sequence
	will be scanned for formal parameter substitution.  Character
	constants and string literals in the rest of the program will
	not be searched for macro names."

As part of this fix, C.8.2; Notes (p. 61) must also be modified.  It
currently reads

	"The following two capabilities will be added in subsequent
	drafts:  the ability to substitute for a macro parameter in a
	string literal; and the ability to concatenate two tokens into
	a single token after macro replacement.  Details of syntax and
	semantics have not yet been resolved."

After the correction above, it should read

	"The ability to concatenate two tokens into a single token
	after macro replacement will be added in a subsequent draft.
	Details of syntax and semantics have not yet been resolved."
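
To make the breakage concrete, here is a minimal sketch.  The legacy
macro body shown in the comment is an assumption modeled on the "CTRL"
macro cited above, not a verbatim copy of <sys/ttychars.h>; the
portable form and the function name ctrl_of_a are illustrative only,
and the arithmetic assumes an ASCII character set.

```c
/*
 * Legacy form (assumed): relies on the preprocessor substituting the
 * formal parameter inside a character constant.
 *
 *     #define CTRL(c) ('c' & 037)
 *     CTRL(a)     becomes ('a' & 037) on such preprocessors
 *
 * Under the draft wording the constant 'c' is not scanned, so every
 * use would yield ('c' & 037) no matter what argument is given.
 */

/* Portable form: the caller supplies the character constant itself. */
#define CTRL(c) ((c) & 037)

/* Example use: in ASCII, 'a' is 0141, so masking with 037 leaves 1. */
int ctrl_of_a(void)
{
    return CTRL('a');
}
```

Code written against the legacy form passes a bare letter, as in
CTRL(a), so every call site would have to change along with the macro
definition.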

*Problem:	Breaks Code
*Reference:	C.8.3 Conditional inclusions; Note (p. 62)
*Description:

	"As indicated by the syntax, a token must not follow a #else or
	#endif directive before the terminating new-line character.
	However, comments may appear anywhere on any source line,
	including on a preprocessor directive."

This breaks many existing programs, including rmail, deroff, diction,
efl, eqn, learn, lint, nroff, refer, struct, troff, uucp, and ingres.
As was suggested in net.lang.c, it would be nice to require a match of
the token list with the appropriate #if-type command, but this would be
a pain to implement.  I would recommend this as a standard extension,
and to change the proposed standard in the following way:

	Delete C.8.3; Note (p. 62)

	Change C.8.3; Syntax (p.61) from

		# else	*new-line*
		# endif	*new-line*

	to

		# else	*token-sequence(opt)* *new-line*
		# endif	*token-sequence(opt)* *new-line*

Note that this is completely general; it does not even require that the
parameters to #else and #endif be of the same type (*constant
expression* or *identifier*) as #if, #ifdef, and #ifndef lines.  Some
of the programs mentioned above would break if this were not true.
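
As an illustration of the two styles, the sketch below shows the
trailing-token form in a comment and the comment-only form, which the
draft does accept, in live code.  The DEBUG flag and the build_mode
function are hypothetical names chosen for the example.

```c
/*
 * Legacy style, accepted by many existing preprocessors but not by
 * the draft syntax, because a token follows the directive:
 *
 *     #ifdef DEBUG
 *     ...
 *     #else !DEBUG
 *     ...
 *     #endif DEBUG
 */

#define DEBUG 1   /* hypothetical flag, for illustration only */

/* Draft-conforming style: the annotation is wrapped in a comment. */
#ifdef DEBUG
const char *build_mode(void) { return "debug"; }
#else  /* !DEBUG */
const char *build_mode(void) { return "release"; }
#endif /* DEBUG */
```

Converting existing sources is mechanical, but it touches every
annotated #else and #endif in each of the programs listed above.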

*Problem:	Breaks Code, Confusing Syntax, Unclear
*Reference:	C.1.2 Identifiers; Implementation Limits (p. 14)
*Description:

	"The implementation must treat at least the first 31 characters
	of an *internal name* (an identifier that does not have
	external linkage -- described below) as significant.
	Corresponding lower-case and upper-case letters are different.
	The implementation may further restrict the significance of an
	*external name* (an identifier that has external linkage) to
	six characters and may ignore distinctions of alphabetical case
	for such names.  These characteristics are all implementation
	defined."

*Unclear:

Which characteristics are specifically referred to by the last
sentence?  All the ones described in the paragraph?  All the ones in
the preceding sentence?  I presume the latter, but it should be clear.

*Breaks Code:

The PDP-11/70 (Ritchie) compiler allowed seven significant characters.
I realize that this is not a standard, but much code has been written
with this as the minimal assumption.  (I'll be kind and not even
mention the code written under the many compilers which set no limit,
or a large (usually 31 or 32 character) limit.)  If one is asking
someone to rewrite a compiler (and many of the extensions would require
some extensive modifications to existing compilers), asking them to
modify a loader is not too much to add.  Having written a compiler,
assembler, and linker/loader, I think I am not speaking in the dark
here.  This is primarily an argument of relative effort.  To properly
implement "const" and "volatile", for example, including keeping the
peephole optimizers away from stuff they shouldn't touch, is probably
at a similar level of effort.

In any case, this will break existing programs written to run under
PDP-11's or above.  I suspect that case insignificance will do so too,
but solving this problem on non-ASCII machines is decidedly
non-trivial.  In a future letter I will make the case for the
antithesis of "common extensions", which is "allowable deletions", and
I would argue the lack of case distinction should be under the latter
category.
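
A small sketch may help show which names are at risk.  The helper
below is hypothetical (the name names_collide and its parameters are
not from the draft); it reports whether two identifiers would be
treated as the same name under a given significance length and
case-folding rule.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Return 1 if identifiers a and b are indistinguishable when only the
 * first sig_len characters are significant, optionally ignoring case;
 * return 0 otherwise.
 */
int names_collide(const char *a, const char *b, size_t sig_len, int fold_case)
{
    size_t i;

    for (i = 0; i < sig_len; i++) {
        unsigned char ca = (unsigned char)a[i];
        unsigned char cb = (unsigned char)b[i];

        if (fold_case) {
            ca = (unsigned char)tolower(ca);
            cb = (unsigned char)tolower(cb);
        }
        if (ca != cb)
            return 0;           /* differ within the significant part */
        if (ca == '\0')
            return 1;           /* both names ended: identical */
    }
    return 1;                   /* equal up to the significance limit */
}
```

For instance, names_collide("count_lines", "count_words", 6, 0) is 1,
while the same pair with a limit of 7 gives 0: two externals that are
distinct with seven significant characters become one name with six.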

*Confusing Syntax:

Generally, giving externally and internally linkaged (ick, what a
word!) identifiers different syntax is confusing and ugly, and appears
arbitrary (even if it isn't), thus making the language more difficult
to learn and debug.

A suggested wording:

	"The implementation must treat at least the first 31 characters
	of identifiers as significant.  Corresponding lower-case and
	upper-case letters are different.  For an *internal name* (an
	identifier that does not have external linkage -- described
	below), these characteristics must always hold.  For an
	*external name* (an identifier that has external linkage), it
	is an allowable deletion [see below -- Ken] not to have case
	distinctions on machines where this is sufficiently difficult."

Just to give you a taste (and to clear this up and thus avoid some
discussion), my concept of an allowable deletion is something which is
recognized to be very difficult or impossible on certain systems, and
which the implementor is therefore allowed to punt, as long as (a) this
is documented up front (i.e., not hidden in some implementation
document that no one considering buying the compiler would read); and
(b) a reasonable warning is generated by the compiler.  In this case,
for example, the compiler could say something like

	"warning: case not significant in external variables"

the first time it encountered a global variable with upper case letters
(or mixed case variables, or something) and then shut up for the rest
of the compilation.
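
The warn-once behavior could be sketched roughly as follows.  This is
only an illustration under stated assumptions: the struct case_warner
type and the note_external_name function are hypothetical, and a real
compiler would hook this into its own symbol-table and diagnostic
machinery.

```c
#include <ctype.h>
#include <stdio.h>

/* Per-compilation state: has the warning already been issued? */
struct case_warner {
    int warned;
};

/*
 * Called for each external name.  Emits the warning on the first name
 * containing an upper-case letter and stays silent afterwards.
 * Returns 1 if a warning was written by this call, 0 otherwise.
 */
int note_external_name(struct case_warner *w, const char *name, FILE *out)
{
    const char *p;

    if (w->warned)
        return 0;               /* already warned once: shut up */

    for (p = name; *p != '\0'; p++) {
        if (isupper((unsigned char)*p)) {
            fputs("warning: case not significant in external variables\n",
                  out);
            w->warned = 1;
            return 1;
        }
    }
    return 0;
}
```

Keeping the flag in a caller-owned structure rather than in a static
variable lets each compilation start with a fresh state.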

				General Comment
				===============

Now, I don't want to rant or flame about this, so I will say it simply,
professionally, and baldly.  The standard should not break existing
code, except where such code takes advantage of bugs, non-standard
extensions, or implementation or machine dependent code (such as asm
statements).  To do so will ensure that the standard is ignored in
these areas, and thus increase the probability that other sections of
the standard will be ignored ("after all, the standard was flawed in
the first place").  There are many things which could be fixed, but, as
Henry Spencer said about "case" statements, fixing them would break too
much code, and this is not the purpose of the standards committee.
That the third problem truly breaks code seems apparent to me, but that
the first two break code is clear.  The specific solutions offered here
are, I believe, the best solutions to the problems, but the problems
*must* be solved.

		Yours for Happy Holidays and Winter Solstice,

		Ken Arnold
		arnold at BERKELEY.ARPA
		ucbvax!arnold
--------------------------------------
End of Vol. 2, No. 2. Std-C  (Jan. 5, 1985  22:40:00)
-- 
Orlando Sotomayor-Diaz	/AT&T Bell Laboratories, Red Hill Road
			/Middletown, New Jersey, 07748 (HR 1B 316)
Tel: 201-949-9230	/UUCP: {ihnp4, houxm}!homxa!osd7