<exiting> tip processes on the SUN

Steve Dyer dyer at spdcc.COM
Thu Dec 28 04:08:00 AEST 1989


In article <21875 at adm.BRL.MIL> swenson at nusc-wpn.arpa writes:
>	We are using the standard tip line to a remote VME cage (i.e. just
>another machine).  During (what appears to be) some relatively high bandwidth
>data transfers, the tip line loses its mind.  Do a ps and the tip shows up as
><exiting>.  Try to kill the process -- it won't die.  During fastboot we
>get a message like "Warning processes wouldn't die -- suggest using ps" 
>(we are truly afraid at this point).  The questions are, why is the tip
>line hanging up (difficult to answer with limited information, I know),
>and is there a way to kill the <exiting> process without rebooting the system?

Almost always, when a process is stuck in the <exiting> state, it's in the
middle of a device-specific close routine called from the exit code.
A process can invoke the exit code either explicitly through the exit
system call or in response to most signals that are left at SIG_DFL handling.
Here, the device-specific close routine would probably be for the
serial I/O hardware.

If the device-specific close routine (or a routine it calls) sleeps, and
for one reason or another no wakeup() is forthcoming, the process stays
stuck in this state.  Usually, the close routine in TTY drivers
attempts to flush the characters on the output clist to the hardware before
returning from the close.  Now, with hardware problems or bugs in the driver
itself, if the output interrupt never happens or it doesn't manage to issue
a wakeup, the process will be hung up on a sleep() inside the exit code.
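
To make the drain-on-close behaviour concrete, here is a rough user-space
model of it.  This is only a sketch: the names tty_model, tx_interrupt and
tty_close_model are made up for illustration and do not come from any real
driver, and a "sleep" is modelled as a plain function call rather than a
real kernel sleep()/wakeup() pair.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a tty line; not taken from any real driver. */
struct tty_model {
    int  outq;           /* characters still waiting on the output queue */
    bool tx_interrupts;  /* false models dead hardware or a driver bug   */
};

/* Models the transmit-complete interrupt.  When it fires, one character
 * leaves the queue and the handler would issue a wakeup().
 * Returns true if a wakeup was delivered, false if none ever arrives. */
static bool tx_interrupt(struct tty_model *tp)
{
    assert(tp->outq >= 0);
    if (!tp->tx_interrupts || tp->outq == 0)
        return false;
    tp->outq--;
    return true;
}

/* Models the close routine flushing pending output before it returns.
 * Each loop iteration stands for one sleep on the output queue.
 * Returns true once the queue has drained.  Returns false where a real
 * kernel would stay asleep forever because no wakeup comes. */
static bool tty_close_model(struct tty_model *tp)
{
    while (tp->outq > 0) {
        if (!tx_interrupt(tp))
            return false;   /* stuck: the process shows up as <exiting> */
    }
    return true;
}
```

With tx_interrupts set, the queue drains and the close returns.  With it
clear, the model reports the close as stuck and the queue length never
changes, which is the case described above.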

You can issue a "kill" as often as you want.  All it will do each time,
however, is interrupt the sleep and restart the exit code.  The exit
code will loop through all open files, call the device-specific close
routine again, and get stuck one more time.  Short of rewriting the device
driver to handle this pathological situation (or ingenious adb hacking
on an active kernel), the easiest way to recover from this is to reboot.
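
The restart behaviour can be sketched the same way.  The model below follows
the description in this post, not the source of any particular kernel.  The
names file_model, exit_pass and exit_with_kills, and the table size
NOFILE_MODEL, are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NOFILE_MODEL 4   /* size of the modelled open-file table */

/* Hypothetical model of one open-file slot. */
struct file_model {
    bool open;
    bool close_hangs;    /* true models a close routine that never wakes */
};

/* One pass of the exit code: close each open file in turn.
 * Returns the slot whose close hung, or -1 if the pass completed. */
static int exit_pass(struct file_model *files)
{
    assert(files != NULL);
    for (int fd = 0; fd < NOFILE_MODEL; fd++) {
        if (!files[fd].open)
            continue;
        if (files[fd].close_hangs)
            return fd;            /* asleep inside the driver close */
        files[fd].open = false;   /* this close completed normally  */
    }
    return -1;
}

/* Each kill interrupts the sleep and restarts the pass from the top.
 * Returns true if the process manages to finish exiting. */
static bool exit_with_kills(struct file_model *files, int kills)
{
    int stuck = exit_pass(files);
    for (int i = 0; i < kills && stuck >= 0; i++)
        stuck = exit_pass(files);
    return stuck < 0;
}
```

However many kills are sent, the pass stops at the same slot every time.
Only a change in the close routine's own behaviour (here, clearing
close_hangs) lets the exit finish.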

This is a general description of what can go wrong--it isn't Sun-specific.

-- 
Steve Dyer
dyer at ursa-major.spdcc.com aka {ima,harvard,rayssd,linus,m2c}!spdcc!dyer
dyer at arktouros.mit.edu, dyer at hstbme.mit.edu



More information about the Comp.unix.wizards mailing list