4/280S hangs with nfsd in INODE/KERNELMA wait

aco%math.tau.ac.il at cunyvm.cuny.edu
Sat Nov 11 01:06:44 AEST 1989


> On occasions, too numerous to mention, over the last couple of weeks our
> Sun 4/280S has had the load average skyrocket (i.e. >60), after which it
> is catatonic and must be rebooted.

Our Sun 3/160S has had the same problems since SunOS 4.0.3_Export was
installed.

> When this happened the other day we tried to look around before rebooting.
> Shutting down to single user mode we got the "Something won't die, ps axl
> advised."  Below is the output from the "ps axl".  We have also noticed
> one of the nfsds waiting on `kernelma' just as the load starts to climb.
> This machine is also subject to the infamous `Bus Error Reg 20<TIMEOUT>'
> and `BAD TRAP' panics that have been reported here by ourselves and
> others.

We get similar behaviour - a daily Bus Error Reg 20 <TIMEOUT> - and we
also have nfsd's waiting on kernelma.  Additional information: our server
serves 9 clients.  Most of the nfsd's (8) seem to be working hard (they
show as 'running' in the 'top' display most of the time).  Vmstat shows a
lot of interrupts (about 10 times more than on other servers with similar
configuration and load), and vmstat -i shows ie0 as the main source of
interrupts.  There is no CPU problem (the CPU is idle most of the time),
but I/O is horrible - there is almost no response to keyboard input.  There
are constantly 8 - 12 processes in the run queue and the number of context
switches is huge.  All this sometimes happens when most of the clients are
idle and nobody is logged in on the server (except root).  Disconnecting
the Ethernet cable reduces the number of interrupts to 'normal', empties
the run queue and seems to solve the problem.  Killing 7 out of 8 nfsd's
has the same effect.
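
In case anyone wants to compare against their own server, here is a rough
sketch of how we grab the numbers above before and after pulling the cable
or killing the extra nfsd's.  It assumes stock SunOS 4.x vmstat and ps and
plain Bourne shell; the intervals and the output file name are arbitrary:

    #!/bin/sh
    # Snapshot interrupt, context-switch and run-queue activity so that
    # "before" and "after" (cable pulled / nfsd's killed) can be compared.
    OUT=/tmp/nfsd-probe.$$          # arbitrary output file
    {
        date
        echo '--- per-device interrupt totals ---'
        vmstat -i                   # ie0 dominates on the sick server
        echo '--- 30 seconds of system activity ---'
        vmstat 5 6                  # watch the r (run queue), in and cs columns
        echo '--- nfsd wait channels ---'
        ps axl | egrep 'nfsd|WCHAN' # kernelma shows up in the WCHAN column
    } > $OUT 2>&1
    echo "snapshot written to $OUT"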

Here is a comparison (of possibly relevant devices) between our machine
and psuvax1:

psuvax1                                 our server
mem = 32768K (0x2000000)                mem = 12288k (0xc00000)
avail mem = 31481856                    avail mem = 11214848
xd0: <NEC D2363 ... >                   xd0: <Fujitsu-M2361 Eagle ...>
xd1: <NEC D2363 ... >                   xd1: <CDC-SABRE-1230 ...>
xdc1 at vme16d32 0xee90 vec 0x45        xyc0 at vme16d16 0xee40 vec 0x48
xd4: <NEC D2363 ... >                   xy0: <Fujitsu-M2351 Eagle ...>
si0 at vme24d16 0x200000 vec 0x40       sc0 at vme24d16 0x200000 vec 0x40
st1 at si0 slave 40                     st0 at sc0 slave 32
                                        sd0 at sc0 slave 0 <-- no disk attached
zs0 at obio 0xf1000000 pri 3            zs0 at obio 0x20000 pri 3
zs1 at obio 0xf0000000 pri 3
mcp0 at vme32d32 0x1000000 vec 0x8b     mcp0 at vme32d32 0x1000000 vec 0x8b
ie0 at obio 0xf6000000 pri 3            ie0 at obio 0xc0000 pri 3
                                        ie1 at vme24d16 0xe88000 vec 0x75

We have similar servers (3/160 and 3/180) with similar configurations,
except for the ALM board (mcp), which is installed only on the problematic
host.  Those servers show no similar problems.  We therefore suspect the
mcp, or the interaction between the two disk controllers (two xdc's on
psuvax1, one xdc and one xyc on our machine), their device drivers, or the
interaction between these drivers and other parts of the kernel.

A different (?) question: the manual states that "Four seems to be a good
number" for the number of nfsd's, yet the distributed /etc/rc.local starts
8 of them.  Any comments or ideas?
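
For reference, the relevant part of the distributed rc.local looks roughly
like the fragment below (quoted from memory, so treat the exact wording as
an approximation); dropping the argument from 8 to 4 is all it takes to
follow the manual's suggestion:

    # /etc/rc.local fragment (approximate) - number of NFS server daemons
    if [ -f /usr/etc/nfsd ]; then
            nfsd 8 & echo -n ' nfsd'        # change 8 to 4 to match the manual
    fi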

Ariel Cohen
System manager
Tel-Aviv University
CS lab, School of Math Sci.


