Kernel trap type 0xE

Thu Jul 26 20:57:13 AEST 1990

In article <ERIC.90Jul23172008 at mks.mks.com> eric at mks.mks.com (Eric Gisin) writes:
>We have been getting a lot of kernel type 0xE traps
>on our Interactive 2.2 system. Under 2.0 the machine
>would crash a lot but there were usually no clues as to why.

You see, this is an improvement: now, instead of crashing and burning without
any clue, you get a trap type :-} :-}.

>I looked up interrupt 0xE in the Intel manual and it is a page fault.
>Several people have claimed they are hardware related (memory or bus),
>but I can't see how this can be the case. You only get this fault
>when an address mapping results in a page table entry with the
>page-present bit of zero or when page access is denied.

>Can anyone explain or speculate how a hardware problem
>could be causing page faults?

Yes, quite simply. First off, understand that the kernel, unlike a user
process, is not demand paged; it does not have pages stolen if they are not
referenced. All kernel pages should be in core at all times (note that this
is implementation specific). Therefore a page fault, which is common and
routine for a user process, indicates a crisis when it happens in the kernel.
Now, how can page-faulting in the kernel be a hardware problem? The case
most easily imagined is one where the memory subsystem just isn't quite up
to snuff, and when the CPU goes to read something on the data bus it gets
garbage instead of legitimate data. Case in point, your own words:

>Two crashes I did debug had a null-pointer dereference in namei(),
>one called from open() the other for exece().
>Two other crashes occurred during reboot in wdrintr().
>Could these be software-bug generated faults?

A pointer is a perfect example. We read memory to get an address, and we
expect a legitimate kernel address. In that one-in-a-million case, however,
the hardware burps and we read garbage on the data bus. We dereference or
branch to that spurious address and, guess what, page fault! Of course, this
is only one simple scenario; things could be much more complex, but I hope it
illustrates the point.
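
To make that scenario concrete, here is a toy sketch in C, not real kernel
code. It assumes a 32-bit address space and a made-up mapped kernel range
(KERNEL_BASE and KERNEL_LIMIT are hypothetical values picked for
illustration, as are all the function names). It models a one-bit read error
on a pointer and counts how many of the 32 possible errors leave the pointer
outside the mapped range, which is where a dereference would fault.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mapped kernel range, chosen only for illustration. */
#define KERNEL_BASE  0xD0000000u
#define KERNEL_LIMIT 0xD0400000u   /* 4 MB above the base */

/* True if addr falls inside the range this toy model treats as mapped. */
static bool is_mapped(uint32_t addr)
{
    return addr >= KERNEL_BASE && addr < KERNEL_LIMIT;
}

/* Model a one-bit read error: the value comes back with one bit inverted. */
static uint32_t flip_bit(uint32_t value, unsigned bit)
{
    assert(bit < 32);
    return value ^ (UINT32_C(1) << bit);
}

/* Count the single-bit errors that turn a good pointer into an unmapped one. */
static unsigned count_faulting_flips(uint32_t good_ptr)
{
    unsigned faults = 0;

    for (unsigned bit = 0; bit < 32; bit++) {
        if (!is_mapped(flip_bit(good_ptr, bit)))
            faults++;
    }
    return faults;
}
```

In this model, flips in the high-order bits push the pointer out of the
mapped range and would trap, while flips in the low-order bits stay inside
it. Those low-order cases would not fault at all; they would quietly read or
write the wrong kernel data, which may well surface as a crash somewhere
else later.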

In fact, the very fact that your crashes ranged all over the code should
have been a clue. Do you think the kernel code is mysteriously changing
from one minute to the next? The typical kernel software problem panics at
a fixed location, like in some third-party device driver :-}!
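
That rule of thumb can be written down as a rough sketch in C. The function
names and the "more than half" threshold are my own assumptions for
illustration, not an established diagnostic; the input is simply a list of
fault addresses collected from several crashes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return how many crashes share the most frequently seen fault address. */
static size_t most_common_count(const uint32_t *addrs, size_t n)
{
    size_t best = 0;

    assert(addrs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        size_t same = 0;

        for (size_t j = 0; j < n; j++) {
            if (addrs[j] == addrs[i])
                same++;
        }
        if (same > best)
            best = same;
    }
    return best;
}

/*
 * Heuristic only: return 1 if more than half of the crashes hit the same
 * address (suspect software), 0 if they are scattered or there are too few
 * samples to say (hardware stays on the suspect list).
 */
static int looks_like_software(const uint32_t *addrs, size_t n)
{
    return n > 1 && most_common_count(addrs, n) * 2 > n;
}
```

Treat the answer as a hint about where to look first and nothing more. A
flaky board can still happen to crash twice in the same place, and a software
bug that corrupts memory can show up at scattered addresses.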

Naturally, there are cases where it is a combination of hardware and software,
such as code that only gets executed when certain unique hardware is present
and that has a bug and/or doesn't handle the hardware properly (again, device
drivers being the most common). But far more common these days are scenarios
like the one I described, where the code is just fine but the hardware has
heartburn :-}.

Hope this sheds some light.

Disclaimer: Me speak for LCC or IBM ?? I don't wear suits and ties :-}!

-- 
Jack F. Vogel			jackv at locus.com
AIX370 Technical Support	       - or -
Locus Computing Corp.		jackv at turnkey.TCC.COM