How Shells handle errors, e.g. cmd1 `cmd2`.

Tue Dec 4 14:36:36 AEST 1984

Consider the nested commands:

    cmd1 `cmd2`

The problem posed is how and when to get the Shell to abandon CMD1 when CMD2
has an error.  Well, first, what is an error?  One wishes to distinguish:

    Type 1) "Error" statuses returned by an executing command, e.g. GREP.
    Type 2) Errors produced by the shell when trying to *start* a command,
            e.g. "No match", "Command not found", "Arglist too big",
	    "too many processes", "syntax error", etc.

There are three kinds of Type 1 return statuses on UNIX:

    1a) A legitimate non-zero return code, as in exit 1 ("pattern not
        found") from a successful execution of GREP.  This non-zero code
	isn't really an "error" at all.  The command is telling you something.
    1b) An error code, such as exit 2 ("file not found" or "pattern
        syntax error", etc.) from GREP.  (For GREP, this doesn't mean the
	pattern wasn't there, it means you messed up calling the command
	and it couldn't give you the 1 or 0 exit code you were asking for.)
    1c) The command terminates with a signal, such as Interrupt, or
        Bus Error, or Segmentation Fault.

When testing for the *success* (zero-return) of a command, it doesn't matter
what kind of Type 1 error occurs.  If the command isn't successful, whatever
depends on it is abandoned, no matter what the reason, as in:

    cat file2 file1 >file3  &&  mv file3 file1

One doesn't particularly care why the CAT failed, as long as its failure
causes the MV to be abandoned.  The "&&" operator is good for this.

However, when testing for the *failure* (non-zero-return) of a command, the
type of error being detected is usually very important.  Consider:

	 grep thing file  ||  echo thing >>file

    I.   If GREP fails to find the thing in the file, the ECHO adds the
         thing to the file.  This is the expected behaviour.
    II.  If GREP has a Segmentation Fault, the same thing happens.
    III. If GREP isn't even executable, or is mis-spelled GREQ, ECHO
         still adds the thing to the file.
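
The three cases can be tried out with a short Bourne-style script.  This
is only a sketch: the signal death is simulated with SIGKILL rather than
a real Segmentation Fault, the temporary file and the FIRED variable are
my own inventions, and GREQ is assumed not to exist on the machine.

```shell
#!/bin/sh
# Sketch: "||" fires for every kind of failure, not only "pattern not found".
tmp=$(mktemp) || exit 1
printf 'other\n' >"$tmp"

fired=""
# Case I: grep runs correctly and reports "pattern not found".
grep thing "$tmp" >/dev/null 2>&1 || fired="${fired}I "
# Case II: the command dies from a signal (simulated with SIGKILL here).
sh -c 'kill -KILL $$' >/dev/null 2>&1 || fired="${fired}II "
# Case III: the command name is mis-spelled and cannot be started at all.
greq thing "$tmp" >/dev/null 2>&1 || fired="${fired}III"

echo "cases where || fired: $fired"
rm -f "$tmp"
```

The right-hand side runs in all three cases, so the operator alone
cannot say which of them happened.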

Is it a feature or bug that the "||" operator has this broad a meaning?
I will argue that it is a bug, and that it can be fixed.

UNIX gives the shell a single piece of information about a command's
termination status: either its exit code (0-255) or the signal it died
with (1-31).  Type 1a and 1b errors are expressed in exit codes; Type
1c errors are termination signals.  Consider Exit Codes first.

Every command uses a different exit code convention to distinguish a
Type 1a error from a Type 1b.  (GREP exits with 1 for "pattern not
found", and with 2 for "file not found", but SORT exits with 1 for
"file not found".) The shell's conditional command operators (&& and
||) only detect non-zero exit codes, so they are no use in telling Type
1a from Type 1b.  If telling the difference matters, you have to write
an IF statement to check the actual value of the exit code, and you
have to know what the different codes mean for the specific command.
Consider the quick vs. better way to set up a background job to append
a thing to a very big file:

    % grep thing file  ||  echo thing >>file &  # WRONG
    % grep thing file   ;  if ( $status == 1 ) echo thing >>file &  # BETTER

The second form is better, but even it doesn't guarantee to do the correct
thing, as we shall see.  Just remember that you can't use the shell's
conditional command operators to tell a legitimate non-zero return
status from an error code from a command indicating it didn't work.
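
For Bourne Shell users, the same explicit test might be sketched with a
CASE statement.  The function name ADD_IF_MISSING is hypothetical, and
the sketch assumes GREP's convention of 0 for found, 1 for not found,
and anything else for trouble.  It shares the weakness of the C Shell
form above: it trusts that a status of 1 really came from GREP.

```shell
#!/bin/sh
# Hypothetical helper: append line "$1" to file "$2" only when grep
# itself reports "pattern not found" (exit code 1).
add_if_missing() {
    grep -- "$1" "$2" >/dev/null 2>&1
    case $? in
        0) ;;                             # already present: nothing to do
        1) printf '%s\n' "$1" >>"$2" ;;   # grep ran and found no match
        *) echo "grep failed; $2 left alone" >&2
           return 2 ;;                    # any other status: do not touch the file
    esac
}
```

Calling "add_if_missing thing file" twice adds the line once; calling it
on a file that doesn't exist leaves things alone, because GREP exits
with 2 rather than 1.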

Commands terminating because of signals are another matter.  The Bourne
Shell translates signals into exit codes by adding 128 to the signal
number.  A command terminating with SIGHUP appears, to the user
wanting to check the status, to have exited with code 129.  (The 4.2bsd
C Shell does the same conversion, except it *always* abandons
processing any further commands if a command terminates with an
interrupt, and it sets the exit code to 1 in this case.)  Because of
this translation of signals into fake exit codes, you can't use an IF
statement to tell a legitimate exit code in the range 129-159 from a
command that died from a signal.  It also means you can't use the
shell's conditional command operators to tell apart any of the Type 1
errors.  The conditional command operators behave the same way no
matter whether GREP returns 1, returns 2, or dies from a Segmentation Fault.
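
The translation is easy to see in a Bourne-style shell.  In this sketch
one child kills itself with signal 9 (SIGKILL, picked only because it
can't be ignored; SIGHUP would show up as 129 in the same way) and
another simply calls exit with 128 + 9.  The variable names are mine.

```shell
#!/bin/sh
# A child that dies from signal 9 ...
sh -c 'kill -KILL $$' >/dev/null 2>&1
by_signal=$?

# ... and a child that exits normally with code 128 + 9.
sh -c 'exit 137'
by_exit=$?

# In a shell that adds 128 to the signal number, the two look identical.
echo "died from signal: $by_signal   plain exit: $by_exit"
```

Nothing in the status value says which of the two children was the one
that died.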

Now, consider Type 2 errors, where the shell itself is detecting the
error condition.  These errors include: GLOB failure ("No match"),
command not found, too many processes, arglist too long, etc.  Type 2
errors should be treated like Type 1b errors; something went wrong with
the execution of the command.  But what exit code can the shell set to
indicate this?  For GREP, the consistent code is 2; for SORT, it is 1.
There is no universal exit code that means "the command didn't execute
correctly".  In practice, the shells just set exit code 1.  If you
carefully programmed your GREP with an IF statement, as shown above, but
GREP didn't have execute permissions, tough luck.  The shell can't
execute GREP, returns exit code 1, and the IF thinks that GREP worked but
didn't find the "thing" in the file.  You can't tell the errors apart.
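
A sketch of the same confusion, for anyone who wants to try it.  The
shells described above report 1; shells that follow the later POSIX
convention report 127 for a command that can't be found and 126 for one
that can't be executed.  Either way the number is an ordinary exit code
that a real command is free to return.  GREQ is assumed not to exist,
and the variable names are mine.

```shell
#!/bin/sh
# The shell fails to start the mis-spelled command and invents a status.
greq thing file >/dev/null 2>&1
shell_error=$?

# An ordinary command that chooses to exit with that same number.
sh -c 'exit "$1"' sh "$shell_error"
plain_exit=$?

echo "shell error: $shell_error   plain exit: $plain_exit"
```

A test on the status can't separate the two, whatever number the shell
happens to pick.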

To summarize the current semantics of exit codes:

    - In some cases you can't assume a non-zero return code means what
      you think it does:
	* You can't tell an exit status between 129 and 159 from a command
	  terminating with a signal.
	* You can't tell an exit status of 1 from "Command not found" or
	  "No match" or "too many processes" or other shell error.
    - The only thing the "||" operator can tell you is that
      *something* went wrong either executing or trying to execute the
      previous command.  "||" does not just test the return code of the
      previous command.  The non-zero return might be a shell error, a
      signal, or the legitimate non-zero code you were expecting.  This
      means all uses of the "||" operator to test a return code are wrong.

In the posed problem, we want to test for the failure of a child
command and stop the execution of the parent command in which it is
nested.  Given the current mixing of exit codes, some of which are
legitimate non-zero values and some of which are errors, there is no
exit code that the child command can return to tell the parent not to
go ahead.  Whatever exit code value we pick, some UNIX command will
exit with it under normal, successful, circumstances.  The "proper" fix
is to rewrite all the UNIX commands so that at least one value is reserved
for "you blew it", and teach the shell to obey that value.  This will
never be done.  Type 1a and 1b exit codes will be mixed up for the rest
of UNIX history.  However, we can do something about Type 2 errors.

If we can't use exit codes, some form of out-of-band signalling must be
employed instead.  The alternative is to make better use of signals.
Two things can be done.

    I.  Assume that no command terminates "normally" with a signal.
	If a command terminates with a signal, have the shell itself
	abandon processing of further dependent commands instead of
	returning exit code 128 + signal_number as it does now.
    II. If the shell detects a Type 2 error when trying to start a
	command, generate a signal instead of returning exit code 1.

The effect of these simple changes is two-fold.  First, it removes all
ambiguity from the meaning of an exit code.  If a dependent command
reads an exit code of 1, that is guaranteed to be the code with which
the command exited.  Second, shell-detected (Type 2) errors will generate
a signal and automatically cause dependent commands to be abandoned.

Consider this example under the new system:

    grep thing file  ||  if ( $status == 1 ) echo thing >>file

This is now guaranteed to echo THING into the file *only* if GREP
executes correctly and returns exit code 1.  If GREP isn't executable,
the EXEC fails, the child shell that is processing the GREP command line
kills itself with a signal, and the parent detects this and abandons the
dependent IF command.  If GREP dies of a Segmentation Fault, again, the
parent detects the signal and abandons the dependent command.  No fake
exit codes are returned to the dependent command.

This has changed the meaning of the "||" operator from "do this if anything
goes wrong when executing or trying to execute the command on the left"
to "do this only if the command on the left executes *successfully* and
returns non-zero".  The former broad behaviour is still available by changing
"||" to ";" in the above example (making the IF independent), and testing the
return status for non-zero (indicating any kind of Type 1 or 2 error).

Relating this to the original problem (cmd1 `cmd2`), note that CMD2 is
merely a dependent of CMD1.  If CMD2 gets off to a normal start, and
doesn't die of a signal, all is well.  If CMD2 fails because of a
shell-detected error, that error will cause a signal, and that signal
will be detected by the parent shell and CMD1 will be abandoned.
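
In a shell without these changes, about the nearest one can get is to
run CMD2 first and look at its status before starting CMD1.  This is a
workaround, not the change proposed here, and it is blunter: it
abandons CMD1 on *any* non-zero status from CMD2, including legitimate
Type 1a codes.  The function name RUN_NESTED is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run_nested cmd1 cmd2 [args for cmd2 ...]
# Behaves like   cmd1 `cmd2 args`   but abandons cmd1 if cmd2 fails.
run_nested() {
    cmd1=$1
    shift
    # The assignment takes on the exit status of the command substitution.
    args=$("$@") || return    # cmd2 failed: pass its status back, skip cmd1
    # $args is left unquoted on purpose so it splits into words,
    # just as the output of a backquoted command would.
    $cmd1 $args
}
```

For example, "run_nested echo printf 'a b'" prints "a b", while
"run_nested echo false" prints nothing and returns non-zero.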

This doesn't solve the unsolvable confusion of Type 1a and 1b exit
codes, but it fixes all Type 2 errors (errors detectable by the shell).

I installed these two changes in my version of C Shell here at
Waterloo.  We've been using it for over a year now.  It uses the
otherwise unused (4.2bsd) Signal 31 to tell the parent about a child
that has detected a Type 2 error when trying to start a command.
-- 
        -IAN!  (Ian! D. Allen)      University of Waterloo