Possible corruption of Message Queues on XENIX 386

Mark Delany MDelany%hbapn1.prime.com at relay.cs.net
Tue May 8 13:02:49 AEST 1990


Has anyone else seen Message Queue corruption on XENIX 386 (SysV 2.3.1)
when under heavy load, particularly when the system is paging?

We're suspicious as ipcs gives strange values for CBYTES and QNUM.  To
wit:

--------------------
Standard IPC package status

IPC status from /dev/kmem as of Thu May 3 11:01:02 1990
T     ID     KEY        MODE       OWNER    GROUP  CREATOR   CGROUP CBYTES  QNUM QBYTES LSPID LRPID   STIME    RTIME    CTIME
Message Queues:
q     10 0x712806a1 SRrw-rw-rw-     cacs    group     cacs    group  65404 65535   1028   389   331 10:50:53 10:50:53 10:29:42
q     11 0x712806a2 -Rrw-rw-rw-     cacs    group     cacs    group     40     2   8192   331   389 10:50:53 10:50:53 10:29:42
...
--------------------

and on another occasion:


--------------------
Standard IPC package status

IPC status from /dev/kmem as of Thu May 3 16:12:47 1990
T     ID     KEY        MODE       OWNER    GROUP  CREATOR   CGROUP CBYTES  QNUM QBYTES LSPID LRPID   STIME    RTIME    CTIME
Message Queues:
q     20 0x712806a1 SRrw-rw-rw-     cacs    group     cacs    group  65445     0   1028   511   446 16:05:24 16:05:24 15:34:09
q     21 0x712806a2 -Rrw-rw-rw-     cacs    group     cacs    group     20     1   8192   446   511 16:05:24 16:05:24 15:34:09
...
--------------------


CBYTES and QNUM are 16-bit fields, so values this close to 65536 look
very much like an underflow problem to me...

It seems to occur only when the system is heavily loaded, and most likely
paging too.  Further, the programs in question make fairly extensive use
of message queues (as well as shared memory - if that's relevant) and it
is highly likely that more than one process is trying to access the same
queue at the same time.  In other words, if there are any flaws in the
locks protecting these structures, then these programs will find them real
soon!

Once this corruption occurs, all the programs wedge on message Qs.  In
addition, the system often hangs after this has happened. The only
solution we've found so far is to re-boot :-(

What I'd like to know is: Has anyone else come across this?  Were you able
to effect a work-around?

Naturally I've already called our supplier for help, but they're an
indirect supplier (i.e. not SCO) and, er, haven't been able to come up
with any solution or work-around for us thus far.



Thanks.


