mounted machine down => df hangs

Sun Dec 3 19:06:09 AEST 1989

kae at ihlpm.att.com (Kenneth A Edwards) writes:

>In article <2652 at brazos.Rice.edu> rush at xanadu.llnl.gov (Alan Edwards) writes:
>>
>>When one of our disk servers goes down, doing a 'df' on a machine that has
>>one of the disk server's partitions mounted causes the 'df' process
>>to hang PERMANENTLY.  The df process cannot be killed by kill -9.  Is

>This is not likely to be fixed (and isn't fixed) in 4.0.3, since the
>problem is inherent in the definition of how "hard" (the default mount)
>NFS works.  There are a couple of things you can do:

Actually, release 4.0 introduced a new bug with NFS-mounted
filesystems.  Processes that are not accessing the downed system now
hang as well, waiting on the dead system.  Under 3.5 and previous releases it
was possible to work around this problem by mounting all NFS filesystems
in a directory under root and with a separate directory for each server.
For example: "/hosts/sun1/disk1" would be a mount point for an NFS
filesystem under this scheme.  Then, when a process whose working
directory is "/hosts/sun2/disk1" calls the library routine "getwd()", the
routine can safely step up the directory tree without touching the dead
NFS mount point.  This worked just fine for us until release 4.0.
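
For illustration, the step-up approach can be sketched roughly as follows.
This is a minimal Python sketch of the classic technique, not the SunOS
library source; the function name walk_up_getwd is hypothetical.  It climbs
through ".." and, at each level, examines only the entries of the parent
directory, looking for the one whose device and inode numbers match the
directory it just left.

```python
import os


def walk_up_getwd(start="."):
    """Compute an absolute path for `start` by stepping up through '..'.

    Illustrative only (hypothetical name, not the SunOS implementation).
    At each level only the entries of the parent directory are lstat()ed.
    """
    parts = []
    here = start
    here_st = os.lstat(here)
    while True:
        parent = os.path.join(here, "..")
        parent_st = os.stat(parent)
        # At the root, ".." refers to the directory itself.
        if (parent_st.st_dev, parent_st.st_ino) == (here_st.st_dev, here_st.st_ino):
            break
        for name in os.listdir(parent):
            try:
                st = os.lstat(os.path.join(parent, name))
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if (st.st_dev, st.st_ino) == (here_st.st_dev, here_st.st_ino):
                parts.append(name)
                break
        else:
            raise OSError("no entry in %r matches the current directory" % parent)
        here, here_st = parent, parent_st
    return "/" + "/".join(reversed(parts))
```

Under the "/hosts/<server>/<disk>" layout, a call made from
"/hosts/sun2/disk1" examines the entries of "/hosts/sun2", then "/hosts",
then "/".  Assuming the directories under "/hosts" are themselves local and
nothing is NFS-mounted directly under root, none of those lookups has to
reach a server other than sun2.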

With SunOS 4.0 and later releases, Sun introduced a "performance
improvement" to the "getwd()" library call.  The library function ends up
"stat()"ing virtually all your mounted filesystems every time your program
tries to compute its working directory.  This is nearly guaranteed to hang
any process making a "getwd()" call while the server of a hard-mounted NFS
filesystem is down.
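
To make the failure mode concrete, here is a rough Python model of a
lookup that identifies the current filesystem by stat()ing mount points in
turn.  It is an assumption about the general shape of such a lookup, not
Sun's actual code; the function name mount_point_for, the mount_points
list, and the stat_fn parameter are all hypothetical.

```python
import os


def mount_point_for(path, mount_points, stat_fn=os.stat):
    """Find the mount point that lives on the same device as `path`.

    Hypothetical model, not the SunOS implementation.  Returns the
    matching mount point (or None) together with the list of mount
    points that had to be stat()ed along the way.
    """
    target_dev = stat_fn(path).st_dev
    touched = []
    for mp in mount_points:
        touched.append(mp)
        # On a hard mount whose server is down, this stat() would block.
        if stat_fn(mp).st_dev == target_dev:
            return mp, touched
    return None, touched
```

In this model, a process sitting on sun2's disk still stat()s sun1's mount
point if it happens to come first in the list, which is exactly where a
dead sun1 would stall it.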

IMHO a process that is not accessing data on a dead NFS server should not
hang waiting on that server, but it does.  Can we have the slow "getwd()"
call back? :^(

Mikel Lechner			UUCP:  mikel at teraida.UUCP
Teradyne EDA, Inc.