nfsd freezing: a solution

Steve Losen scl at sasha.acc.Virginia.EDU
Wed Mar 7 02:20:47 AEST 1990


Thanks so much to everyone who responded to my recent posting about the
nfsd daemons hanging forever in a "D" (disk wait) state on our sun3
server.  This turns out to be a known bug for which there is a kernel fix
(a new ufs_bmap.o file).  I installed the fix and our servers have been up
almost a whole week (probably a record!).  I had to get the fix by calling
Sun support.  To spare others that hassle, I have made it available via
anonymous ftp on virginia.EDU in pub/nfsd.tar.Z.

[[Ed's Note: Hopefully you verified that this was okay with Sun? -bdg]]

The following is the README file that comes with the patch.  The patch
contains a ufs_bmap.o for the sun2, sun3, and sun4.

README:

Problem description:

Occasionally on NFS server machines the nfsd daemons have been reported
to get into a disk wait ("DW") state, as shown in a "ps aux" listing.
This condition causes all client requests to the server to fail.  Sun
bugIds 1017518 and 1017893 identify at least two distinct causes of this
problem, described below:

  Case 1017518: 

On the server system, processes go into the DW state and never return.
This problem is related to VM and may happen even when NFS is not
involved.  The core dump will show _sleep, _cv_wait, _page_cv_wait, and
_page_wait at the top of the stack trace.  Basically the process is
blocked waiting for the keep count on the page it wants to drop to zero
(meaning that the page is available), but the count was somehow not
decremented correctly and will never reach zero.

  Case 1017893: 

This is a server problem similar to the client problem in bugId 1018954.
The process is blocked waiting for an mbuf structure to be released back
to NFS, but it is never being released.  The core dump for this problem
shows the hung process with a stack trace of _svc_sendreply,
_svckudp_send(0x7hexdigits,0x7hexdigits) + 2C, _sleep.  The routine
svckudp_send is trying to send a reply to the client, but is blocked
waiting for the mbuf structure pointed to by the first 0x7hexdigits
argument above.  Actually, the first 0x7hexdigits argument to svckudp_send
is a SVCXPRT pointer, not an mbuf.  However, it's possible to derive the
mbuf's address given this argument.
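
As a quick check for the symptom described above, the following sketch
filters a "ps aux" listing for nfsd processes whose state begins with D.
It is not part of the original README.  The function name find_hung_nfsd
is made up for illustration, and the sketch assumes that STAT is the 8th
whitespace-separated column of the listing, which may not hold for every
ps.  A process can pass through the D state briefly during normal disk
I/O, so only a process that stays there across repeated samples matches
the hang described here.

```shell
# Sketch only: report nfsd processes in a disk wait (D...) state.
# Assumption: "ps aux" prints USER PID %CPU %MEM SZ RSS TT STAT ...,
# so PID is field 2 and STAT is field 8.  Adjust if your ps differs.
find_hung_nfsd() {
    # Reads a "ps aux" listing on stdin and skips the header line.
    # Prints "PID STAT" for each nfsd line whose STAT starts with D.
    awk 'NR > 1 && $8 ~ /^D/ && $0 ~ /nfsd/ { print $2, $8 }'
}

# Usage (run it several times, a few seconds apart):
#   ps aux | find_hung_nfsd
```

If the same PIDs keep appearing, a core dump is still needed to tell the
two cases apart, since both show up the same way in "ps aux".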

Fix description: 

  Case 1017518:

There currently are two patches available for this case:  

	1) an adb patch which sets nfsreadmap to 0: 

        	# adb -w /vmunix -
        	nfsreadmap?W 0
        	$q

	   This eliminates most of the code that increments and 
	   decrements the keep count.  

	2) The included patched ufs_bmap.o files, which fix a bug in
	   bmap() where "softlocked" pages were never released after
	   a failure to extend the original block.

You may not need both patches.  It is recommended that you try the
ufs_bmap.o patch first and apply the adb patch as well only if needed.

  Case 1017893:

No patch is available for case 1017893 at this time.

If it is not clear which case is causing your NFS server hang, or if you
continue to experience the problem after applying the above patches, you
will need to obtain a system core dump and submit it to Sun Customer
Support so that your problem can be further diagnosed.

Install instructions:

After extracting the fix tape contents into /tmp, install the appropriate
sun2, sun3, or sun4 patch as root.  In the commands below, pick the one
name in braces that matches your architecture:

	cd /sys/{sun2,sun3,sun4}/OBJ
	mv ufs_bmap.o ufs_bmap.o_orig
	cp /tmp/ufs_bmap.o_{sun2,sun3,sun4} ufs_bmap.o
	chmod 444 ufs_bmap.o

Then build a new kernel and reboot the server with it.
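
The install steps above can be wrapped in a small shell function, sketched
here for a single architecture.  This is an illustration and not part of
the original README: the name install_bmap, the optional directory
arguments (handy for a dry run in a scratch directory), and the refusal
to overwrite an existing ufs_bmap.o_orig are all additions.  It assumes a
shell with function support.

```shell
# Sketch only: install_bmap ARCH [OBJDIR] [SRCDIR]
# ARCH is one of sun2, sun3, sun4.  OBJDIR and SRCDIR default to the
# locations named in the README (/sys/ARCH/OBJ and /tmp).
install_bmap() {
    arch=$1
    case $arch in
        sun2|sun3|sun4) ;;
        *) echo "unknown architecture: $arch" >&2; return 1 ;;
    esac
    objdir=${2:-/sys/$arch/OBJ}
    srcdir=${3:-/tmp}
    patch=$srcdir/ufs_bmap.o_$arch

    if [ ! -f "$patch" ]; then
        echo "missing patch file: $patch" >&2; return 1
    fi
    if [ ! -f "$objdir/ufs_bmap.o" ]; then
        echo "missing original: $objdir/ufs_bmap.o" >&2; return 1
    fi
    # Keep the first backup: a second run must not replace the saved
    # original with an already patched object file.
    if [ -f "$objdir/ufs_bmap.o_orig" ]; then
        echo "backup already exists: $objdir/ufs_bmap.o_orig" >&2; return 1
    fi

    # Same three steps as the README: save, copy, make read-only.
    mv "$objdir/ufs_bmap.o" "$objdir/ufs_bmap.o_orig" &&
    cp "$patch" "$objdir/ufs_bmap.o" &&
    chmod 444 "$objdir/ufs_bmap.o"
}
```

On a sun3 server this would be run as root as "install_bmap sun3".  The
kernel rebuild and reboot are deliberately left as manual steps.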

  Bug Id: 1017518 
  Release summary: 4.0, 4.0.1, 4.0.3
  Fixed in Release: 4.1

*******

Steve Losen
scl at virginia.edu     University of Virginia Academic Computing Center


