Sys V/386/3.2 UNIX system getting hung (?)

M.BAKER mrb1 at homxc.ATT.COM
Fri Apr 7 01:43:27 AEST 1989


Hi ---

Since the net was so helpful on my last query, I'd like to
give it another try:

	We have an AT&T 6386E system running UNIX SysV/3.2.
	
	While running our application, it has been observed to
	'hang'.  Specifically, the application stops in the
	middle of things.  More importantly, all the terminal I/O
	stops, including the system console.  You can't log in
	on a free getty.  Anything you type gets echoed back to
	the screen, but nothing gets
	done with it.  If you hit "Ctrl-Alt-Del", the screen
	displays a message saying "You must run shutdown before
	using Ctrl-Alt-Del" or something very similar to that.
	There is no "Fatal Error - Parity Check at ...." message
	or anything abnormal on the console.
	The only thing to do then (that seems to work for me) is
	to hit RESET.

	Well, rebooting kind of destroys all the clues.  Since
	the kernel apparently never did a panic(), there's no
	dump available to look at with crash.
	If the hang occurred in the middle of the night, and
	time elapses before you reset the system, sar shows
	nothing past the last recorded 'checkpoint' before the
	system 'died'.

I will furnish more details of our hardware configuration and software
application upon request.  For now, I think that these basic clues
should be enough to get us aimed in the right direction.

My first suspicion:

The 3.1 & 3.2 software notes state that if you "run out of 
free clists, all input/output activity from/to terminal ports and
the console will cease.  No warning message is printed by the
system to show that it is out of clists".  Sounded good at first,
so we raised the NCLIST tunable parameter from 120 to 170 (the
recommended value for a 4M machine) and then to 200 (the max. in
mtune).  Still had the problem, though.  Which leads to a couple of
quick questions:

	1.) Can you check the number of free clists while the
		system is running?  sar doesn't seem to be any
		help here, and I'm sure crash can reveal it but
		I'm not sure how to get to it.  

	2.) Is there any circumstance in which clists can get slowly
		used up (i.e., occasionally not returned to the
		free pool)?
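
To make question 2 concrete, here is a toy user-space model in C of
what I have in mind.  It is NOT the kernel's clist code, just a sketch
under my own assumptions: a fixed pool of NCLIST buffers on a free
list, where every Nth operation forgets to give its buffer back.  All
of the names (pool_get, pool_put, ops_until_stall, and so on) are made
up for illustration.  The only number taken from our system is the
NCLIST value of 200.

```c
#include <assert.h>
#include <stddef.h>

#define NCLIST 200              /* pool size; the value we are running with */

/* One buffer in the toy pool (a stand-in for a clist buffer). */
struct buf {
    struct buf *next;
};

static struct buf pool[NCLIST];
static struct buf *free_head;   /* head of the free list */
static int free_count;          /* number of buffers currently free */

/* Put every buffer on the free list. */
void pool_init(void)
{
    int i;

    free_head = NULL;
    free_count = 0;
    for (i = 0; i < NCLIST; i++) {
        pool[i].next = free_head;
        free_head = &pool[i];
        free_count++;
    }
}

/* Take one buffer off the free list.  Returns NULL when the pool is
 * empty; like the behavior quoted from the software notes, nothing is
 * printed when that happens. */
struct buf *pool_get(void)
{
    struct buf *b = free_head;

    if (b != NULL) {
        free_head = b->next;
        free_count--;
    }
    return b;
}

/* Return a buffer to the free list. */
void pool_put(struct buf *b)
{
    assert(b != NULL);
    b->next = free_head;
    free_head = b;
    free_count++;
}

/* How many buffers are free right now. */
int pool_free(void)
{
    return free_count;
}

/* Run up to max_ops operations.  Each one takes a buffer and returns
 * it, except that every leak_every-th operation "forgets" to return
 * it (leak_every <= 0 means never leak).  Returns the number of
 * operations completed before a pool_get() failed, or -1 if all
 * max_ops operations completed. */
long ops_until_stall(long leak_every, long max_ops)
{
    long op;

    pool_init();
    for (op = 1; op <= max_ops; op++) {
        struct buf *b = pool_get();

        if (b == NULL)
            return op - 1;      /* pool exhausted: I/O would stall here */
        if (leak_every > 0 && op % leak_every == 0)
            continue;           /* the leak: buffer is never returned */
        pool_put(b);
    }
    return -1;
}
```

In this model, calling ops_until_stall(1000, 1000000) from a small
main() shows the pool running dry only after a long stretch of normal
operation, and a bigger NCLIST just postpones the stall.  That would
match what we saw after raising NCLIST, if (and only if) something like
this is really going on.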

Also, could this problem be symptomatic of the time slicer interrupt
going away (not being generated, or not being recognized), which
would rob UNIX of knowing that time is passing?  Or are we just in
some kind of major deadlock?

I think that the processor is still alive, since console characters
echo to the screen and it responds to the Ctrl-Alt-Del keyin.  Plus
this is a protected mode machine, so it's a little tougher for an
application to clobber the OS by writing in the wrong area, or
whatever.

Any clues/suggestions/tips/criticisms/flames/whatever would be
really appreciated.

Thanks
M. Baker
homxc!mrb1          201-949-3455


