IOS problems after upgrading to Unicos 6.0

gerben at news.sara.nl gerben at news.sara.nl
Thu Jun 20 07:10:57 AEST 1991


After upgrading our Cray from Unicos 5 to Unicos 6 we experienced
severe problems.

Our configuration:

Cray YMP-4/464, IOS-D, Unicos 6.0.11, IOS 6.0.?

IOS with 1 MIOP, 1 XIOP and 1 BIOP with channels to:
  - 3 DD49s and 4 DD41s

   DD49-1		80% for swap (20% backuproot)
   DD49-2		20% for swap, 80% inodes for the / and /usr filesystems
   DD49-3 + DD41s	/ + /usr filesystem data

SSD 128Mw		/bin + /usr/bin filesystems + 110Mw ldcache space
BMR			/usr/lib filesystem


Symptoms of the problems we had:

- the system appeared to be frozen for periods of up to 30 seconds

- these periods were often soon followed by a system crash

- Analysis of the dump showed that, in the message buffers for the OWS
  console, the system reported disk errors and asked the operator to
  choose between 'retry' and 'abort'. (This message actually reached the
  screen of the console only once.)

- The message buffers of the errorlog also indicated severe disk problems
  (the problem was that the errlog daemon wrote messages to disk about
   problems with that same disk, causing it to generate more messages, etc...)

- We could reproduce the crashes by running a set of 3 user-submitted
  batch jobs. Probably the combination of intensive vector computation
  and disk i/o caused the crashes.

Later, after receiving an ios mod, we also saw that the dump showed
excessive MCU interrupts.

This mod was written to solve problems KFA/Germany had after their 6.0
upleveling. We are not sure whether they saw the same symptoms as we
did.

The mod id is: 60iosys17577a.

For us it seems to prevent the crashes (the system has now been running
for 24 hours), but we now see some performance problems:

- It looks like the system suspends its activity for a few seconds every
  2 minutes. We found that this is caused by the ldsync calls the init
  process makes. Running ldsync from the command line causes the same
  suspends.

- Swapping seems to have the same effect: response time increases dramatically.

It looks like disk i/o initiated by Unicos itself (like swapping and
ldcache flushing) performs poorly. We are not sure about the
performance of user-initiated disk i/o.
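
One way to check whether user-initiated disk i/o shows the same periodic
suspends is to time small synchronous writes and flag the slow ones. The
following is a minimal sketch in Python, not part of the original report
and not something that was run on the Cray; the function names, the 64 KB
write size and the 1-second threshold are all assumptions chosen for
illustration.

```python
import os
import tempfile
import time


def find_stalls(latencies, threshold):
    """Return (index, latency) pairs for samples slower than threshold seconds."""
    return [(i, t) for i, t in enumerate(latencies) if t > threshold]


def sample_write_latency(directory, samples=10, size=65536, interval=0.0):
    """Time `samples` synchronous writes of `size` bytes to a temp file in `directory`."""
    payload = b"\0" * size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            start = time.monotonic()
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to push the data out to the device
            latencies.append(time.monotonic() - start)
            time.sleep(interval)
    finally:
        # Always clean up the temporary file, even if a write fails.
        os.close(fd)
        os.unlink(path)
    return latencies


if __name__ == "__main__":
    # Hypothetical run: one write per second for two minutes, which should
    # span at least one of the periodic suspends if user i/o is affected.
    lat = sample_write_latency(tempfile.gettempdir(), samples=120, interval=1.0)
    for i, t in find_stalls(lat, threshold=1.0):
        print("sample %d took %.2f s" % (i, t))
```

If the flagged samples recur at roughly 2-minute intervals, that would
suggest user-initiated i/o is held up by the same flushes; if nothing is
flagged, the slowdown is more likely limited to system-initiated i/o.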

During the next few days we will look into these performance problems
and let you know about our findings. At this moment we are not sure
where they originate.


So be sure to contact your local Cray analyst about this ios mod, which
should, in our view, have been a field alert.


Sincerely,

Jan Overweel and Gerben Jansen,
SARA, Amsterdam,
The Netherlands.

email address: gerben at sara.nl


