Update on zero UNIBUS interrupt vector problem

v.wales at ucla-locus
Tue Sep 27 06:56:42 AEST 1983


From:            Rich Wales <v.wales at ucla-locus>

About a week ago, I sent a message to UNIX-Wizards asking for help in
tracking down a problem with zero UNIBUS interrupt vectors on our VAX
11/780 running 4.1BSD.  I received several replies, which I am
summarizing below.  Unfortunately, we are still having the original
problem.

Apparently I didn't explain the problem adequately in my first message.
I have been observing two different kinds of "zero vector" scenarios:

(1) A steady "trickle" of zero vectors (about 100-200 an hour) -- which
    I assume is due to the "grant stealing"/"passive release" behavior
    described below by Dave Martindale, and which therefore is probably
    nothing to worry about.

(2) An occasional "glitch", where the entire UBA seems to lock up for
    several seconds, registering hundreds of thousands of zero vectors
    in succession until the count finally exceeds 250K and the
    error-handling code in dev/uba.c clears things up via a UBA reset.
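
As a rough illustration of scenario (2), the threshold behavior can be
sketched in C as below.  This is not the actual dev/uba.c code: the
names ZVCNT_MAX, uba_state and note_zero_vector are hypothetical, and
zeroing the counter after a reset is an assumption made only to keep
the example self-contained.

```c
/* Hypothetical threshold; the message says only that the count
 * "exceeds 250K" before the UBA is reset. */
#define ZVCNT_MAX 250000L

struct uba_state {
    long zero_vectors;   /* running count of zero vectors seen */
    int  resets;         /* number of resets triggered so far  */
};

/* Call once per zero-vector interrupt.
 * Returns 1 if this call pushed the count past the threshold
 * (where a real driver would reset the UBA), otherwise 0. */
static int note_zero_vector(struct uba_state *u)
{
    if (++u->zero_vectors > ZVCNT_MAX) {
        u->zero_vectors = 0;   /* assumption: counting starts over */
        u->resets++;
        return 1;
    }
    return 0;
}
```

With a single cumulative counter like this, a slow "trickle" eats into
the same budget as a "glitch", so the number of zero vectors in the
burst itself is somewhat less than the threshold.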

We have, on the average, one of these "glitches" about every three days
-- though sometimes we have gone "clean" for two weeks, and other times
we have had three or four in a single day.  About a third of them are
associated with UBA error messages indicating UBSTO (UNIBUS Select
Timeout) conditions; the FUBAR register value in the error message has
generally pointed to one of our DZ's (but not the same DZ every time).

My specific questions right now are:

(1) Suppose I get a UBA error, like the following:

	    uba0: uba error sr=2<UBSTO> fmer=0 fubar=760104

    Does this mean that the device whose register space is cited in the
    FUBAR (e.g., 760104, which on our 780 is one of the registers on a
    DZ) is defective?  Or is this device simply the innocent victim of
    some other problem in the UNIBUS or the UBA?

(2) We do have some people around here who are experienced in 'scoping
    logic circuits, but none of us have ever tried to analyze a UNIBUS.
    Can anyone suggest (in reasonable detail) an approach which might
    locate the source of a "glitch" such as the ones we are having?

(3) Is there any way for the kernel to tell whether a zero vector is
    the result of a "passive release" (see Martindale's description
    below)?  Since VMS logs zero vectors (at least, our DEC CE says it
    does), I would think that the VMS error log would be swamped with
    zero vector reports unless there were some way of weeding out the
    red herrings.

    The BRRVR FULL bits in the UBA Status Register (see page 287 of the
    1982-83 VAX Hardware Handbook) sound somewhat like what I am looking
    for, but the handbook seems to say that these bits remain set only
    in case of an error.  (What's the story, Armando?)

(4) We have several 750's, and I have never seen any zero UNIBUS
    vectors on any of them (not even the "trickle" syndrome).  Is the
    750's UBA smarter than the 780's in this respect?
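
For question (1), matching a FUBAR value to a device is a simple range
lookup over the configured register blocks.  The C sketch below assumes
the printed value is an octal UNIBUS address; the table contents (names,
base addresses, sizes) are hypothetical and would have to come from the
site's own configuration.  It identifies only which device's registers
were being addressed, and says nothing about whether that device is at
fault.

```c
#include <stddef.h>

/* One block of device registers in the UNIBUS I/O page. */
struct ubdev {
    const char   *name;   /* label used for reporting        */
    unsigned long base;   /* address of the first register   */
    unsigned long size;   /* bytes of register space          */
};

/* HYPOTHETICAL configuration, for illustration only. */
static const struct ubdev ubdevs[] = {
    { "dz0", 0760100, 8 },
    { "dz1", 0760110, 8 },
};

/* Return the device whose register block contains addr, or NULL.
 * If offset is non-NULL, store the byte offset within the block. */
static const struct ubdev *fubar_owner(unsigned long addr,
                                       unsigned long *offset)
{
    size_t i;

    for (i = 0; i < sizeof ubdevs / sizeof ubdevs[0]; i++) {
        if (addr >= ubdevs[i].base &&
            addr < ubdevs[i].base + ubdevs[i].size) {
            if (offset != NULL)
                *offset = addr - ubdevs[i].base;
            return &ubdevs[i];
        }
    }
    return NULL;
}
```

Under this hypothetical table, fubar_owner(0760104, &off) would report
the first entry with an offset of 4 bytes.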

Here are summarized versions of the replies I have received to date,
with my comments.

-- Rich Wales <wales at UCLA-LOCUS>

-------
    >>  Zero UBA vector is caused by "grant stealing" in the 8647 chip
    >>  used in most DEC boards, including the DZ11.  If one of these
    >>  boards sees NPR and bus grant simultaneously, it seizes the
    >>  grant, and then gives it up (passive release) if it didn't want
    >>  it.  The UBA never receives a vector value, so it thinks it saw
    >>  a zero vector.  Zero vectors would thus be common on a machine
    >>  with both lots of UNIBUS DMA and devices which interrupt
    >>  frequently (such as DZ's).
    >>      Dave Martindale <decvax!utzoo!watmath!watcgl!dmmartindale>
    >>      (not a direct reply -- from an old UNIX-Wizards message I
    >>      found amongst my vast backlog of mail)

    This undoubtedly explains the "trickle", but not the "glitches".

-------
    >>  Pull out boards one at a time until the problem disappears.
    >>      Doug Gwyn <gwyn at BRL-VLD>
    >>      lacasse at RAND-UNIX
    >>      Peter Gross <hao!pag at SEISMO>

    Unfortunately, the 780 in question is a heavily used production
    machine, and the "glitches" occur at irregular and non-reproducible
    intervals -- so we cannot afford the luxury of running for days or
    weeks at a time without all the hardware in place.

-------
    >>  Find out what interrupt occurs right after a zero vector.
    >>      Greg Chesson <chesson%Shasta at SU-SCORE>

    I added a few lines to sys/locore.s to tally (in an array)
    interrupt vectors occurring right after a zero vector at the same
    IPL.  (There were far, far too many of them to use a console
    "printf".)  Just for fun, I am tallying vectors both before and
    after a zero vector -- in two 1-D arrays, not a 2-D array (at least
    for now).

    The "trickle" of vectors showed up distributed among all my UNIBUS
    devices, essentially in proportion to how heavily they were being
    used.  (I had previously installed other code to tally interrupts
    by device, so I knew how many inputs and outputs occurred on each
    DZ, DH/DM, etc.)  This behavior seems to agree quite well with
    Martindale's "grant stealing" analysis.

    A few non-zero vectors occurring right after zero vectors
    corresponded to DMA devices (SI disk and DH/DM output), but not
    many.

    After one of the "glitches", my "tallying" code showed me that I
    had had some 248,000 zero vectors followed by zero vectors.  In
    other words, the zero vectors must have come all in a row, without
    anything else intervening.  (The reason for 248,000 rather than
    250,000 is that the "trickle" had raised the total zero-vector
    count to about 2,000 before the "glitch".)
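
    The locore.s changes themselves are not shown in this message.  As
    a rough C sketch of the same idea (two 1-D tallies, tracked
    separately per bus request level), something like the following
    would do.  All names here are hypothetical, and the sizing assumes
    UNIBUS vectors are multiples of 4 below 01000 octal and that
    interrupts arrive on BR4 through BR7.

```c
#define NVEC 128   /* assumed: vectors 0, 4, ..., 0774 (octal) */
#define NBR  4     /* assumed: bus request levels BR4..BR7     */

static long before_zero[NVEC];  /* vector seen just before a zero vector */
static long after_zero[NVEC];   /* vector seen just after a zero vector  */
static int  prev_vec[NBR] = { -1, -1, -1, -1 };  /* -1: nothing seen yet */

/* Record one interrupt at bus request level br (4..7) with vector vec. */
static void tally_vector(int br, int vec)
{
    int lvl = br - 4;
    int prev;

    if (lvl < 0 || lvl >= NBR || vec < 0 || vec / 4 >= NVEC)
        return;                      /* ignore out-of-range input */

    prev = prev_vec[lvl];
    if (prev == 0)
        after_zero[vec / 4]++;       /* this vector followed a zero */
    if (vec == 0 && prev >= 0)
        before_zero[prev / 4]++;     /* prev preceded this zero     */
    prev_vec[lvl] = vec;
}
```

    In a scheme like this, a long unbroken run of zero vectors shows up
    as a large count in after_zero[0], which is the pattern described
    above for the "glitch".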

-------
    >>  Try moving the SI disk interface to the front of the UNIBUS.
    >>      Greg Chesson <chesson%Shasta at SU-SCORE>

    I did this, with no effect whatsoever on the system (i.e., I
    continued to have "glitches" after moving the SI).

-------
    >>  Make sure every DMA interface on the UNIBUS has the NPR jumper
    >>  removed.
    >>      Rick Adams <rlgvax!ra at SEISMO>

    I have checked this, and every one of them has the jumper removed.

    In any case, though, if I had a DMA interface with the NPR jumper
    still in place, I wouldn't think the device in question would work
    at all.

-------
    >>  DEC DZ's and ABLE DH/DM's are probably not at fault.
    >>      Bob Walsh <walsh at BBN-UNIX>

    >>  Culprit is probably the SI disk controller.
    >>      Sam Leffler <sam at BERKELEY>

    Assuming the SI controller is at fault, any ideas on how to prove
    it (keeping in mind that we depend on it and cannot run the system
    in production without it)?

-------
    >>  Put a scope on the UNIBUS.
    >>      Sam Leffler <sam at BERKELEY>

    We do have some people here who are experienced in 'scoping logic
    circuits, but none of our people have troubleshot a UNIBUS before.
    Can anyone suggest exactly what to monitor and how to do it?


