What do panic messages mean?

Larry Dighera root at conexch.UUCP
Tue May 17 07:32:21 AEST 1988


In article <694 at applix.UUCP> jim at applix.UUCP (Jim Morton) writes:
>In article <308 at conexch.UUCP>, root at conexch.UUCP (Larry Dighera) writes:
>> 
>> Is the format of the panic message documented anywhere so that one
>> can interpret them?  I realize that the third through fifth lines are
>> register dumps.
>> 
>> TRAP 000D in SYSTEM
>> ax=5600, bx=00BC, cx=0000, dx=0020, si=002F, di=6936
>> bp=03A4, fl=0206, uds=0018, es=0020
>> pc=0038:9AA1,  ksp=0388
>> panic: general protection trap
>>
[two panic messages deleted]
>
>First, do a "nm /xenix | sort >/tmp/foo". [...]  Then, take the
>PC address given in your panic message (In the above case, 38:9AA1) and
>find the routine in the kernel (by looking at /tmp/foo) that is located
>at the next lower address. This is the routine that crashed the kernel.
>If you're lucky, the name of this routine will give you a clue as to why
>the system crashed - if the routine was, for example, "_ttioctl" it would
>point towards a problem doing an ioctl() call on a serial port line.
>
>You have to use the first "pc=" value given, if there is a second one
>printed it probably points to the trap routine itself. If you REALLY
>want to hack further, you can find the /usr/sys/* module the routine
>is located in and adb(CP) around in it to see what's going on. That way
>the register values (ax= bx=) may show you why the routine crashed.
>

Jim: 
 
Thank you for sharing your technique for determining what routine the 
kernel was executing at the time of the panic.  Your posting is most 
helpful.  Here is the relevant output from: 
	nm -vp /xenix | grep '0038:9' | sort
(nm complained about too many symbols to sort without the -p)

0038:98a2  T _nbmap
0038:9914  T _notincore
0038:9947  T _exrd
0038:9a59  S FIO_TEXT           <-- Segment Name
0038:9a59  T _getf
0038:9a91  T _closef            << --- The routine that contains 38:9aa1
0038:9c62  T _isinfile          <<< --- The routine that crashed the kernel?
0038:9cc2  T _openi
0038:9d4c  T _access
0038:9e2b  T _owner
0038:9ed7  T _suser
0038:9ef5  T _ufalloc
0038:9f2c  T _falloc
0038:9faf  S SYS4_TEXT          <-- Segment Name
0038:9faf  T _gtime
0038:9fc3  T _ftime

Is my notation, to the right of the nm output, a correct application of
the method you described for locating the routine that crashed the
kernel?
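To make sure I am applying the method mechanically, here is a small sketch
(in Python, written on another machine) of the lookup Jim describes: the
routine containing the faulting PC is the symbol with the greatest start
address that does not exceed the PC.  The addresses and names are copied
from the nm output above; everything else is illustrative only.

```python
# Symbol list transcribed from the sorted nm output above (offsets only,
# all within segment 0038).  The table must be sorted by address.
SYMBOLS = [
    (0x98a2, "_nbmap"),
    (0x9914, "_notincore"),
    (0x9947, "_exrd"),
    (0x9a59, "_getf"),
    (0x9a91, "_closef"),
    (0x9c62, "_isinfile"),
    (0x9cc2, "_openi"),
]

def routine_for(pc):
    """Return the routine at the next lower address, i.e. the one
    whose span contains pc (None if pc is below the first symbol)."""
    best = None
    for addr, name in SYMBOLS:
        if addr <= pc:
            best = name
        else:
            break
    return best

# pc=0038:9AA1 from the panic message in the quoted article.
print(routine_for(0x9AA1))   # -> _closef
```

By this rule the routine that crashed would be the one that *contains*
the PC, i.e. the one at the next lower address, not the symbol after it.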

I'd examine the routine with adb, but adb's syntax is a little too obscure
for me at this point.  If my interpretation of your intent is correct,
would you agree that the kernel was doing file I/O at the time of this
panic?  (Presuming so, I'll continue.)

So, now that I have a method of locating the kernel routine that was
running at the time of the panic, let's consider the causes of kernel
panics.

Quoting David F. Carlson's follow up message on this subject:
<Welcome to the Intel 80286!  We Microport users know *exactly* what that
<error message means.  In the Intel '286 reference manual there are several
<types of faults: TSS, protection, etc.  It seems that the stack region for
<any process is 64K (ie., one segment).  But so is the kernel stack!  When
<the kernel stack pointer rolls over its segment it causes the above panic.

Is this the only possible cause?

<There is really no efficient means to correct this as kernel stack is used
<very frequently and to have multiple stack segments would require nasty
<segment loads, etc.  In fact, even the Microsoft huge model compilers don't
<allow multiple stack segments.  However, the UNIX (Xenix) kernel is large
<and each process that runs will occupy area on the kernel stack.  In addition,
<interrupt handlers also use the kernel stack and can cause very large (albeit
<short term) stack usage.  Bottom line is that there are many circumstances
<of a kernel stack requiring more than 64K to live and no practical way for
<the '286 architecture to provide it.  Now just wait until the naive DOS
<losers get suckered into OS/2 with exactly the same '286 limitations and 
<no 386 version in sight!  (This architecture issue is exactly why I would
<never recommend less than a 32 bit address space for UNIX:  buy a '386.)

If this is true, the '286 kernel has an inherent limitation that can 
be a source of panics.  Failing hardware (RAM, CPU, ...), as well as
disruptive events (stray alpha particle, power glitch, ...) come to mind
as plausible causes too.  So, although I can tell what routine the kernel
was executing at the time of the panic, the complete origin is still 
somewhat obscure.
 
Fortunately, the panic messages usually end up in /lost+found after re-booting.
This is the result of fsck putting the /usr/adm/messages file there (named by
its inode number, of course).  Keeping a log of the PC value of each
panic should shed some more light on their cause.  If the PC
value is the same in all panics, I would suspect a single cause.  If the
PC value is inconsistent, I would suspect either multiple causes or
random events.
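A small sketch of that bookkeeping (Python again; the log lines below are
hypothetical stand-ins for saved panic messages, and the second PC value is
invented for illustration):

```python
import re
from collections import Counter

# Hypothetical log of the pc= lines saved from successive panics.
# A single repeated value suggests one cause; scattered values suggest
# several causes or random (hardware) events.
panic_log = [
    "pc=0038:9AA1,  ksp=0388",
    "pc=0038:9AA1,  ksp=0390",
    "pc=0038:B123,  ksp=0402",   # invented value for the _siowrite panic
]

counts = Counter(
    m.group(1)
    for line in panic_log
    if (m := re.search(r"pc=([0-9A-Fa-f]{4}:[0-9A-Fa-f]{4})", line))
)

for pc, n in counts.most_common():
    print(pc, n)
```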
 
In my case, of the three panics documented, two had the PC value shown
above, and the third was different (the kernel was apparently in the
_siowrite routine of the AMI LAMB 8 port driver).  From this limited
data I would infer that there are indeed multiple causes for the panics
I am experiencing.

Am I on the right track?  

Larry Dighera

-- 
USPS: The Consultants' Exchange, PO Box 12100, Santa Ana, CA  92712
TELE: (714) 842-6348: BBS (N81); (714) 842-5851: Xenix guest account (E71)
UUCP: conexch Any ACU 2400 17148425851 ogin:-""-ogin:-""-ogin: nuucp
UUCP: ...!uunet!turnkey!conexch!root || ...!trwrb!ucla-an!conexch!root


