Understanding the Bourne Shell (was Re: Finding the last arg)

Tue Jan 8 07:11:20 AEST 1991

In article <443 at minya.UUCP> jc at minya.UUCP (John Chambers) writes:
>> What ALLWAYS works in the Bourne-Shell is this:
>> 
>> 	for last do :; done
>
>Wow! A one-liner that works for more than 9 args!  Of course, there's 
>the question as to whether this loop is actually faster than starting 
>a subprocess that just does puts(argv[artc-1]), but at least there's
>a way to do it that is portable.

I have compared the alternatives here on my 386 box and, as you might guess,
the difference in speed depends on the length of the argument list.

For ~25 arguments the for-loop is the fastest; above that, up to ~100
arguments, there's little difference, but the for-loop uses more usr-time
and the sub-process more sys-time. There seem to be only minor differences
depending on what is called as the sub-process, i.e. a specialized C program
(as the poster suggested) or another shell script (as Maarten Litmaath
posted earlier in this thread).

For the rather untypical size of 250 arguments there still isn't much
difference, but sometimes the sub-process is faster (the results vary over
some range and I didn't go to the effort of calculating the average).
My general experience with the 386 is that it starts sub-processes really
fast, so I think the for-do method will win even for more than 250
arguments on a lot of systems.
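
For reference, a minimal sketch of the for-do method being timed here.
The function name last_arg and the twelve sample arguments are made up
for illustration; backquotes are used instead of $(...) to stay close to
the Bourne shell.

```shell
# Hypothetical helper: print the last of its arguments.
# "for last" without "in" walks over the positional parameters;
# when the loop ends, $last holds the final one.
last_arg() {
    for last do :; done
    echo "$last"
}

# Twelve arguments, i.e. more than $1..$9 can address directly.
result=`last_arg a b c d e f g h i j k l`
echo "$result"
```

Note that with no arguments at all the loop body never runs, so $last
keeps whatever value it had before.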

(BTW: I've learned from my experiments that the shell internally limits
the number of arguments that can be passed to a sub-process to 254.
I always thought the only limit was the space supplied by the OS
to pass the stuff to the sub-process, which is typically several KByte
for the *contents* of arguments + environment. I never noticed the limit
on the *number* of arguments before.)

>That comment isn't worth wasting the bandwidth, of course; my motive
>for this followup is a bit of bizarreness that I discovered while
>testing this command.  The usual format of a for loop is 3 lines:
>	for last
>	do :
>	done
>Usually when I want to collapse such vertical code into a horizontal
>format, I follow the rule "Replace the newlines with semicolons", and
>it works.  For instance,
>	if [ <test> ]
>	then <stuff>
>	else <stuff>
>	fi
>reduces to
>	if [ <test> ];then <stuff>;else <stuff>;fi
>which I can do in vi via a series of "Jr;" commands.  With the above 
>for-loop, this gives
>	for last;do :;done
>which doesn't work.  The shell gives a syntax error, complaining about
>an unexpected ';' in the line.  Myself, I found this to be a somewhat 
>unexpected error message.  It appears my simple-minded algorithm for 
>condensing code doesn't work in this case.
>
>So what's going on here?  What the @#$^&#( is the shell's syntax that 
>makes the semicolon not only unneeded, but illegal in this case?

Funny, I stumbled over the same thing when I "invented" my for-do method
for accessing the last argument some years ago. The explanation is a bit
longer, so all who aren't interested in the details should leave at this
point.

The syntax for the "for" statement is more or less the following (I stick
to the "yacc" style here, but enclose keywords in single quotes even if
they are longer than one character, which is not allowed with "yacc"):

for_stmt : 'for' NAME 'in' word_list SEP 'do' cmd_list 'done'
	 | 'for' NAME 'do' cmd_list 'done'
	 ;

word_list: WORD
	 | word_list WORD
	 ;

cmd_list : cmd arg_list SEP
	 | cmd_list cmd arg_list SEP
	 ;

arg_list : /*empty*/
	 | arg_list WORD
	 ;

SEP	 : ';'
	 | '\n'
	 ;

(The meaning of NAME and WORD should be obvious - I don't want to go too
far into the syntactic details. I have also left out an undocumented
shell feature that allows you to replace "do" and "done" with "{" and "}";
note that this works only for for-do-done, not for while-do-done
and until-do-done!)
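
The two alternatives of for_stmt can be tried side by side. This is only
a sketch; the function names with_in and without_in are made up. It
assumes the usual behaviour that a "for" without "in" runs over the
positional parameters, just like an explicit in "$@".

```shell
# First alternative: 'for' NAME 'in' word_list SEP 'do' cmd_list 'done'
with_in() {
    for arg in "$@"; do echo "<$arg>"; done
}

# Second alternative: 'for' NAME 'do' cmd_list 'done'
# (no word_list, hence no SEP before "do")
without_in() {
    for arg do echo "<$arg>"; done
}

a=`with_in one "two words" three`
b=`without_in one "two words" three`

if [ "$a" = "$b" ]
then echo "both forms visit the same arguments"
fi
```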

Note that white space is allowed everywhere in between the tokens
and nonterminals. But SEP is a mandatory separator (which can be
a newline or a semicolon). The reason for requiring a separator in
some cases is simple: There is the possibility that some keywords of
the shell might also be used as regular arguments to commands or within
a word_list - we'll come back to this in a moment.

The shell distinguishes the two forms of the "for" statement simply by
looking at what follows the loop variable. If it is an "in", a word_list
must follow, which in turn must be terminated by a mandatory separator,
as explained above. If a "do" follows, there is no word_list. If a
semicolon follows the loop variable, this is against the syntax (this is
what puzzled the poster).
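
If you want to see what your own shell makes of the three spellings,
something like the following sketch will do. The helper name check is
made up, and "sh" is assumed to be the shell under test. The first two
forms should be accepted; the verdict on the third depends on the shell,
since some later shells have relaxed the rule described here, so no
result is claimed for it.

```shell
# Hypothetical helper: feed the script text in $1 to "sh" with three
# dummy arguments and report whether it was accepted.
check() {
    if sh -c "$1" probe a b c >/dev/null 2>&1
    then echo "accepted"
    else echo "rejected"
    fi
}

one_line=`check 'for last do :; done'`
three_lines=`check 'for last
do :
done'`
with_semicolon=`check 'for last;do :;done'`

echo "for last do :; done         $one_line"
echo "for last / do : / done      $three_lines"
echo "for last;do :;done          $with_semicolon"
```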

Of course, Mr. Bourne could have made the syntax allow for it by
changing the RHS of the rule for the "for" statement without "in" into

	'for' NAME SEP 'do' cmd_list 'done'

but IMHO the difficulties of the poster (and many more, me included)
have some other reason, which has to do with the difference
between
	- mandatory command separators (or terminators) and
	- optional white space before commands and keywords and
	- spaces as separators of command and argument list, with
	- the semicolon being allowed only in the first case,
	- the newline being allowed in the first and second case, and
	- space characters being allowed in the second and third.

In a simple command, i.e. a program name that is followed by some arguments,
there's not much of a problem, as it seems "natural" for most users to type
spaces to separate the arguments and newlines to terminate commands, and it
seems obvious that the two cannot be used interchangeably, as this would
either terminate the argument list prematurely (if you try to separate
arguments with a newline) or fail to properly end your command (if you
don't type a newline).

Now let's consider the more complex shell statements. Some very stupid
users might in fact expect that the shell can read their mind, but all the
others will understand that the shell must either treat ALL keywords (and
maybe even all the commands) specially, not allowing them as regular
arguments, or need some separator other than the one used between arguments
if a keyword is to follow a command (or there are to be two commands) on
the same line. This logic can be applied to most keywords, regardless of
whether they introduce some complex command or mark the beginning of the
next part of the command (like "then" or "else" in an "if" statement).
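
A small sketch of the point: the same words are plain arguments after a
command name, and keywords again once a separator has put them back in
command position. The variable names are made up for illustration.

```shell
# After "echo" these are ordinary arguments, not keywords.
words=`echo if then else fi do done`
echo "$words"

# The semicolons end each command, so "then", "else" and "fi" are
# found in command position and are recognized as keywords.
if true; then answer=yes; else answer=no; fi
echo "$answer"

# Without the separator, "then" is just another argument to echo.
plain=`echo hello then`
echo "$plain"
```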

More puzzling is that the shell also ALLOWS newlines in place of spaces
where it's clear that a complex command isn't complete%. One place where
this occurs is when you start a "for" statement and have not yet supplied
the matching "done".  For example

	for var in foo bar
		<some newlines here (1)>
	do	<some newlines here (2)>
		cmd
		<some newlines here (3)>
	done

is all allowed, though seldom used, except for exactly one newline in
the place marked (2). Note that the newlines before and after "cmd" here
can not simply be seen as "empty commands", because if they could, the
following would be legal:

	for var in foo bar
	do
	done

which it IS NOT, since at least ONE command is necessary between "do"
and "done" (please refer to the syntax given above). Note further that a
semicolon by itself is NOT an empty command, as

	for var in foo bar
	do ;
	done

does not work - you need at least the colon here:

	for var in foo bar
	do :
	done
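
The three loop bodies above can be checked mechanically. This is a
sketch only: the helper name parses is made up, and it assumes that "sh"
is a shell which, like the one described here, insists on at least one
command between "do" and "done".

```shell
# Hypothetical helper: succeeds if "sh" accepts the script text in $1.
parses() {
    sh -c "$1" >/dev/null 2>&1
}

# Nothing between "do" and "done".
if parses 'for var in foo bar
do
done'
then empty_body=accepted; else empty_body=rejected; fi

# A lone semicolon is not a command either.
if parses 'for var in foo bar
do ;
done'
then semi_body=accepted; else semi_body=rejected; fi

# The colon is a real (do-nothing) command.
if parses 'for var in foo bar
do :
done'
then colon_body=accepted; else colon_body=rejected; fi

echo "empty: $empty_body  semicolon: $semi_body  colon: $colon_body"
```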

------
%: More puzzling is that the shell allows this only in some places.
   E.g. "for <newline>" is a syntax error while "for i <newline>"
   patiently waits for the "in" or "do".
------

>One of the real hassles I keep finding with /bin/sh (and /bin/csh is
>even worse ;-) is that the actual syntax regarding things like white
>space, newlines, and semicolons seems to be a secret.  It often takes 
>a lot of experimenting to find a way to get these syntax characters 
>right.  Is there any actual documentation on sh's syntax?  Is it truly 
>as ad-hoc as the above example implies?

For all I know the C-shell is more or less "ad-hoc", but for the Bourne
shell (which, until now and for the rest of this article, I always mean
when I speak of "the shell") you can find a formal syntax already in a
very ancient document, the "Bell System Technical Journal" (BSTJ for short)
from July/August 1978, ISSN 0005-8580. The grammar starts on page 1987 as
Appendix A of an article written by S.R. Bourne himself. Though it fails
to mention some of the finer points (like the space/newline problems just
discussed) it may serve as a start for you, and I found that it could even
be fed to yacc without much trouble (I never tried to fill in the actions
to make it work as a "real" shell ...)

>Is there perhaps some logical 
>structure underlying it all that would explain why
> 	for last do :; done
>and
>	for last
>	do :
>	done
>both work but
>	for last;do :;done
>doesn't?

Well, "logic" is not so much an absolute value as many of us think, as it
often depends on what you expect. This is so because we may think we
have recognized something as a "rule" and tend to see all conflicting
observations as "illogical", when in fact the examples we studied were just
too limited for us to recognize that we had only seen a special case (in
this generality that may also be true for the things we consider to be the
"universal laws" or "laws of nature" - but this takes us away from the
topic.)

Now, what you observed was that newline and semicolon are interchangeable
in all the examples you looked at and tried before you came to that
"for" statement. (Remember I told you in the beginning that I had the same
problem with this - so it cannot be said that your expectations were
without reason.) A bit more experimentation could also have shown that in
general the two are not really interchangeable. E.g. if you type a single
newline nothing happens (except the shell prompts again), if you type two
newlines still nothing happens, but if you type a semicolon + a newline this
is a syntax error. Hence semicolon and newline are not as
interchangeable as it seemed at first glance.
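
The same experiment can be run non-interactively. A sketch, with the
made-up helper name try, again assuming that "sh" behaves as described
here and treats a lone semicolon as an error.

```shell
# Hypothetical helper: run the text in $1 as a complete script and
# report whether "sh" accepted it.
try() {
    if sh -c "$1" >/dev/null 2>&1
    then echo accepted
    else echo rejected
    fi
}

nl='
'
one_newline=`try "$nl"`
two_newlines=`try "$nl$nl"`
semicolon_newline=`try ";$nl"`

echo "newline:             $one_newline"
echo "two newlines:        $two_newlines"
echo "semicolon + newline: $semicolon_newline"
```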

Now, having a little more experience we can come up with some other
explanation:

	- commands can not be empty (they consist at least of
	  an external or builtin command; the ":" is the builtin
	  command which does nothing but evaluate its arguments)
	- a semicolon or a newline% terminates a command
	- a command list is a non-empty sequence of commands, all
	  of which must be properly terminated
	- a semicolon or a newline terminates the word list of
	  the "in" part of the "for" statement
	- space characters and newlines are allowed before commands
	- nearly all the keywords of the shell are only recognized if
	  they are found in the position of a command, i.e. if there is
	  a previous command or a word list of a "for" statement there
	  MUST be a separator and there CAN be some space characters or
	  newlines
	- the most important exceptions to the above are "in" (for the
	  "for" statement as well as for the "case" statement) and
	  "do". But as the word list in the "in" part of a "for" statement
	  (or the command list after the "while" or "until" in such a
	  statement) must be properly terminated, a "do" NOT in command
	  position can only occur in an "in"-less "for" statement.

-----
%: There are other valid command separators/terminators that are recognized
together with the semicolon, but this doesn't matter here.
-----

In some sense, these are the "laws of nature" as derived from observing
the shell's behaviour. As the shell is not really nature but the outcome
of the thoughts of some human being, we could of course complain now
that this is "illogical" (compared to our sense of logic!) or that there
are "too many exceptions" and that it could be simplified with fewer,
but more general rules.

But when thinking about how to smooth things out by using fewer rules, we
often do not recognize all the consequences that this would have.
Assume for a moment we treated both newline and semicolon as
statement terminators. Have you really considered what this would mean?
Typing a newline (at your terminal or as an empty line in a shell script)
would be a syntax error (sic!), as a single semicolon is. Quite simple,
I hear you say, then we allow an empty statement to be really empty,
which would allow for single newlines as well as single semicolons. But
be careful! We then must think about the exit status of such a statement.
Should it always be true, like the colon command? But then you must be very
careful inserting empty lines into a script, because the following two
would have different semantics

	if		|		if	cmd
		cmd	|
	then		|		then

and you must never separate command execution and accessing $? by a
newline, since the empty command "newline" destroys the value of any
previous command's exit status. Again I hear you say, we make the
empty statement special - it shall leave the status of the "real" command
that was executed last. But now the following will become dangerous

	while
	do
		<do something until exit or break>
	done

as it depends on the last command BEFORE the loop when the loop is
entered the first time, and after that on the last command executed
WITHIN the loop. So, step by step we may introduce more special casing
for something that looked like a trivial change in the first place!
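
For contrast, a sketch of how things stand with the syntax as it actually
is: blank lines are not statements at all, so they neither disturb $?
nor change the meaning of an "if". The variable names are made up.

```shell
false

status=$?        # the blank line above is not a command; $? is still 1
echo "status after blank line: $status"

# Newlines between "if" and its command do not count as a command of
# their own, so this is the same test as "if true".
if

    true

then
    branch=taken
else
    branch=skipped
fi
echo "branch: $branch"
```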

I hope you have gained a little more understanding of the syntax of the
shell now. It isn't really as strange as it might seem at first glance,
though I admit a few things are not so obvious and it's easy to come to
some wrong conclusions if you have insufficient experience. (If this
article hadn't become so long I could write a little more on it - maybe
some other time.)
-- 
Martin Weitzel, email: martin at mwtech.UUCP, voice: 49-(0)6151-6 56 83