more on error checking (was: Complexity of reallocating storage)

Sun Feb 10 18:11:20 AEST 1991

In article <1991Feb07.013637.6542 at convex.com>, Tom Christiansen writes:
>As was mentioned earlier, *every* system call should *always*
>be checked, even if you "know" it can't fail.

Indeed.  However, at the risk of confusing the issue further just
when this basic fact is starting to sink in, it's important to
remember, when you set out to check for every error everywhere,
that the thing to do upon finding one may not be to complain and
abort.  I'll provide several examples.

My own code is usually liberally sprinkled with assertions and
sanity checks.  One fruitful place to put them is in the default
case of a switch, when the default should "never" be reached.
Suppose I have a little routine which decodes the value of some
internal, numeric flag (presumably so it can be printed out
somewhere):

	#include "status.h"

	char *
	status_str(sts)
	int sts;
	{
	switch(sts)
		{
		case STS_GOLDEN:
			return "STS_GOLDEN";
			break;

		case STS_DEMENTED:
			return "STS_DEMENTED";
			break;

		case STS_ANHYDROUS:
			return "STS_ANHYDROUS";
			break;

		default:
			panic("status_str: bad status %d", sts);
			/* NOTREACHED: panic() is fatal and does not return */
		}
	}

The panic is there, of course, so that next week when I add
another status code but forget to update the status_str()
routine, my mistake will be very noticeable, and there will be a
strong reminder and incentive to fix it.

But does it really have to be a fatal error?  Probably not,
especially if there's any chance that the "impossible case" won't
be triggered by me, but by some (hopefully beta-test) user.  A
better way to handle this particular "impossible case" is
nonfatally: 

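		default:
			{
			/* big enough for "status " plus any int, even a */
			/* negative one, on machines with 32- or 64-bit ints */
			static char tmpbuf[32];
			oddity("status_str: bad status %d", sts);
			sprintf(tmpbuf, "status %d", sts);
			return tmpbuf;
			}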

In my code, "oddity" is just like panic (i.e. it prints debugging
messages that are to be construed as the developer's problem, not
the user's) but it's nonfatal.  There's still a noisy message,
there's still incentive and reminder to fix the bug, but at least
the program can continue to run.
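
For readers who want something concrete, here is a minimal sketch of
what a nonfatal reporter along these lines might look like.  It is
not the code described above: the stderr destination, the "oddity: "
prefix, the oddity_count counter, and the enum values standing in for
status.h are all assumptions made for illustration, and snprintf
assumes a C99 or later library.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical values; the real ones would come from status.h. */
enum { STS_GOLDEN, STS_DEMENTED, STS_ANHYDROUS };

/* Assumption: a counter so callers can tell that something odd was seen. */
int oddity_count = 0;

/* Report a "can't happen" condition on stderr, then return to the caller. */
void
oddity(const char *fmt, ...)
{
	va_list ap;

	oddity_count++;
	fputs("oddity: ", stderr);
	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	fputc('\n', stderr);
}

/* Decode a status code; unknown codes are reported but still printable. */
const char *
status_str(int sts)
{
	static char tmpbuf[32];	/* room for "status " plus any int */

	switch (sts) {
	case STS_GOLDEN:
		return "STS_GOLDEN";
	case STS_DEMENTED:
		return "STS_DEMENTED";
	case STS_ANHYDROUS:
		return "STS_ANHYDROUS";
	default:
		oddity("status_str: bad status %d", sts);
		snprintf(tmpbuf, sizeof tmpbuf, "status %d", sts);
		return tmpbuf;
	}
}
```

The caller gets a printable string either way, and the complaint goes
to stderr where the developer is likely to see it.  As in the snippet
above, the static buffer means the result for an unknown code is only
valid until the next such call.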

There was a case like this in the 4.1bsd tty driver.  (I don't
know if it's still there.)  The driver tries to do CRT erase
processing (i.e. the backspace-space-backspace stuff that wipes
characters off the screen as you press the delete or rubout or
backspace key or whatever) perfectly.  The tricky part is that
you might need 1, 2, or up to 8 backspaces, depending on whether
there is a normal character, a control character (echoed as "^X"),
or a tab on the screen being erased.

The code for figuring out how many backspaces to emit involved a
big array full of action codes.  There were a few impossible
cases.  Sure enough, they were handled with a panic.

I don't know about you, but I think that aesthetically pleasing
crterase processing, though nice, is a bit of a frill, and
probably one of the last things that it's vital for an operating
system kernel to do perfectly.  I would be very, very annoyed
(well, to tell the truth, I'd also be ridiculously amused) if a
machine crashed, dropping several users on the floor and losing
lots of their work, just because some poor schlep tried to hit
the delete key to get rid of some "impossible" character on his
command line.  (No, it never happened, but once I'd seen that
code in the tty driver, I was forever slightly haunted with the
fear that the system was going to crash because of it.  In fact,
hitting the impossible case would have represented a programming
error, not an impossible character.  Still, the panic seemed
needlessly draconian.)

Finally, consider write(1).  On the recipient's screen, it prints
something like

	message from scs on tty23 at 12:34 ...

Now, whatever method it uses to find out what terminal the sender
is logged into (probably ttyname(3)), it's not going to work if
the sender doesn't have one.  How could the sender not have one?
Actually, the burden of proof should be on the author of write(1)
to show why the program can't proceed if the tty is not known,
not on the user to suggest circumstances under which it might be
meaningful not to have one, but to satisfy the skeptics out there
I'll mention a real-world, non-fabricated case: at(1) jobs.
(Actually, this case, or another one involving write(1), came up
on the net just a week ago.)  I sometimes like to send myself
notification that one of my at jobs has finished by including

	echo "at job is done" | write scs

in the at script.  However, at scripts run with no controlling
tty, so write(1) aborts and refuses to send the message, just
because it won't be able to say where it's from.

Just now I tried write(1)ing to myself from within emacs, so that
I could see the exact "message from..." message on my screen.
But since I'm running the Gigantic Neanderthal Ubiquitous emacs,
which runs shell escapes in some kind of weird subshell with
pipes, all I got was the "write: can't find your tty" message in
the *Shell Command Output* window.

Some time ago I wrote my own version of write(1), mostly so that
I could put in the wrinkle that it leave out the "on ttyxx" part,
rather than falling back and punting, if it couldn't determine
it.  (It also prints to several ttys if the user is logged in
multiple times, rather than picking the wrong one.  Now, of
course, in these security-conscious days, private versions of
write(1) don't work well, because writing to other users' ttys
generally requires the program to be setgid.)
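
The "leave out the on ttyxx part" behavior can be sketched roughly as
follows.  This is not the private write(1) mentioned above, just an
illustration: sender_tty() and format_banner() are hypothetical
names, and trying descriptors 0 through 2 in turn is an assumption
about how one might look for a tty.  ttyname(3) itself returns NULL
when the descriptor is not a terminal, which is the case being
handled.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Return the sender's tty name without the "/dev/" prefix,
 * or NULL if none of stdin, stdout, stderr is a terminal.
 */
const char *
sender_tty(void)
{
	int fd;

	for (fd = 0; fd <= 2; fd++) {
		const char *name = ttyname(fd);

		if (name != NULL) {
			if (strncmp(name, "/dev/", 5) == 0)
				name += 5;
			return name;
		}
	}
	return NULL;
}

/*
 * Build the banner line.  A NULL tty is not an error here:
 * the "on ttyxx" part is simply omitted.
 */
int
format_banner(char *buf, size_t len, const char *user,
	const char *tty, const char *when)
{
	if (tty != NULL)
		return snprintf(buf, len, "message from %s on %s at %s ...",
			user, tty, when);
	return snprintf(buf, len, "message from %s at %s ...", user, when);
}
```

A caller would pass the result of sender_tty() straight through as
the tty argument, so that an at job or an emacs subshell produces
"message from scs at 12:34 ..." instead of a refusal.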

So yes, check for all error returns.  And yes, print noisy messages
(unless the error is truly benign, as in the write(1) example).
But think for a minute whether the error really has to be fatal,
whether there's any reason the program can't continue.  If it
can't (if the error means that vital data has been corrupted, such
that continuing would be futile or worse), then by all means abort.
But if you can safely continue with little or no effort, please do.

                                            Steve Summit
                                            scs at adam.mit.edu

P.S.  I am aware that there are better ways of decoding numeric
flags than by using a big switch statement as in the first
example.  (In fact, I usually use a lookup table.)
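
One possible shape for such a lookup table, as a sketch only: the
struct stsname type, the ststab array, the status_name() function,
and the enum values are all invented here for illustration and are
not taken from the code discussed above.  The unknown-code case is
still handled nonfatally.

```c
#include <stdio.h>

/* Hypothetical values; the real ones would come from status.h. */
enum { STS_GOLDEN, STS_DEMENTED, STS_ANHYDROUS };

struct stsname {
	int sts;
	const char *name;
};

/* Adding a status code means adding one line here. */
static const struct stsname ststab[] = {
	{ STS_GOLDEN,    "STS_GOLDEN" },
	{ STS_DEMENTED,  "STS_DEMENTED" },
	{ STS_ANHYDROUS, "STS_ANHYDROUS" },
};

/* Look the code up; fall back to a printable "status N" if it is unknown. */
const char *
status_name(int sts)
{
	static char tmpbuf[32];	/* room for "status " plus any int */
	size_t i;

	for (i = 0; i < sizeof ststab / sizeof ststab[0]; i++) {
		if (ststab[i].sts == sts)
			return ststab[i].name;
	}
	snprintf(tmpbuf, sizeof tmpbuf, "status %d", sts);
	return tmpbuf;
}
```

A linear search is plenty for a handful of codes, and it does not
depend on the codes being small or consecutive the way direct array
indexing would.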