make x + mv x z + rm z = crash

braun at drivax.UUCP
Wed Feb 4 11:12:07 AEST 1987



We managed to crash our Vax 11/780 this week.  I was wondering if anybody
could help me understand what went on.  I have some strong suspicions, but
if they are right, they point out some pretty weak places in Unix; the kind
that my co-workers point at and say "See, Unix isn't a REAL operating system!
Real Operating Systems wouldn't do that", then go back to the VMS machine,
snickering.

Anyway, the symptoms were as follows:  A user called and said that some of his
processes were hung, and although he had a prompt, he couldn't kill -9 any
of them.  Another user was having the same problem.  It turns out that no one
could kill any of these processes.  Other processes could be killed, though;
only those that ps showed in STAT 'D' (or 'DW') could NOT be killed.

This makes sense to me, as I assume that a process has to come off of an
event list before it can be killed.  If this assumption is correct, it
seems like a weak point, but I can appreciate the difficulty of killing 
processes waiting on events.

Anyway, the system finally ground to a halt, although not for some time (about
15 minutes after the first report came in).  It turns out that no one had been
writing to the file system that was being used by the processes in question.

One of the processes was a compile in the assembly phase.  This was not the
native Unix compiler, but a cross compiler for another architecture.  It is 
only slightly suspected as having been the culprit, and only because it is a
3rd party product with a relatively short history.

A stronger candidate was a combination of 'make' processes and a .logout
process which, by a strong coincidence, happened to be executing at the same
time.  The combination of processes produced the following tasks:

	1:	compile, creating file x.o
	2:	mv x.o /work/user/trashcan/x.o
	3:	rm /work/user/trashcan/x.o

(1) was the result of 'make'ing x.  (2) is the result of the user's "rm"
command being aliased to "mv \!* ~/trashcan".  (3) is the result of the
user's .logout containing "/bin/rm /work/user/trashcan/*", and the user
logging out while (1) and (2) were running.
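
For anyone who wants to poke at this on a scratch machine, the three tasks
can be sketched roughly as below.  This is only a minimal sketch in plain sh,
not what actually ran here: the scratch directory WORK, the loop standing in
for the compile, and the relative trashcan path are all my assumptions, and I
have no evidence that it will reproduce the hang.

```sh
#!/bin/sh
# Sketch only: stand-ins for tasks (1), (2) and (3), run in a throwaway
# directory.  WORK and the fake "compile" are assumptions, not the real
# commands from our system.
WORK=${TMPDIR:-/tmp}/crashtest.$$
mkdir -p "$WORK/trashcan"
cd "$WORK"

# (1) stand-in for the compile: keep writing x.o in the background
(
    i=0
    while [ "$i" -lt 200 ]; do
        echo "line $i"
        i=$((i + 1))
    done > x.o
) &

# (2) what the aliased "rm" does: move the file into the trashcan
#     (this may lose the race and find no x.o yet, hence the 2>/dev/null)
mv x.o trashcan/x.o 2>/dev/null

# (3) what the .logout does: really remove everything in the trashcan
/bin/rm -f trashcan/*

# let the background writer finish, then look at what is left
wait
ls -la "$WORK" "$WORK/trashcan"
```

The background writer keeps its file descriptor open across the mv and the
rm, which is the overlap I suspect.  Depending on who wins the race, x.o may
or may not still exist in the scratch directory afterwards; the trashcan
should be empty either way.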

This makes me think that the file system either bogged down or got confused
trying to chase its own tail.  And although I don't have a solution to the
problem off hand, I think that this type of thing shouldn't bring a system
to its knees.

Do you think I have made an adequate assessment of the problem?  Do you agree
or disagree with my opinion of it?  Mail or Followup as appropriate.


-- 
kral		408/647-6112		...!{amdahl,ihnp4}!drivax!braun


