WRT Disk errors on 11/750 running 4.3 BSD

Thomas J Hacker hacker at egrunix.UUCP
Sat Jul 22 00:09:35 AEST 1989


As promised, here is a posting of the responses.
Thanks to the following people for responding:

Larry Parmelee
parmelee at cs.cornell.edu

Guy Harris
guy at bootme.auspex.com
(Sorry if I forgot anyone else's name)

Re: Disk Problems on a 11/750 running 4.3 BSD


In article <115 at egrunix.UUCP> you write:
> So, I thought I would wait a day or two to see if it would repeat,
> then this came up:
> 
> Jul 11 18:07:30 unix vmunix: mcr0: soft ecc addr 1a72 syn 73

"mcr0" is "Memory ContRoller 0".  It is most likely not related to
your disk problems.  As long as you only see "soft" errors, and
they don't occur "too often", you can just ignore them forever.
("too often": we had a 780 that would routinely report 10-12 of
those mcr0 errors per hour, and apart from wasting console paper,
they caused no apparent problems.  It was like this for years.)
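
If you want a number to compare against that "10-12 per hour" figure,
here is a minimal sketch in Python that tallies soft ecc messages per
hour from syslog text.  The function name soft_errors_per_hour is my
own, and it assumes every line starts with the usual "Mon DD HH:MM:SS"
syslog timestamp.  Only the first sample line comes from the original
article; the other two are invented purely to exercise the code.

```python
from collections import Counter

def soft_errors_per_hour(lines):
    """Count 'soft ecc' lines per hour, keyed by the 'Mon DD HH' prefix."""
    counts = Counter()
    for line in lines:
        if "soft ecc" in line:
            counts[line[:9]] += 1   # first 9 characters, e.g. "Jul 11 18"
    return counts

# Line 1 is from the original article; lines 2 and 3 are made up
# for illustration only.
log = [
    "Jul 11 18:07:30 unix vmunix: mcr0: soft ecc addr 1a72 syn 73",
    "Jul 11 18:41:02 unix vmunix: mcr0: soft ecc addr 1a72 syn 73",
    "Jul 11 19:03:15 unix vmunix: mcr0: soft ecc addr 1a72 syn 73",
]

for hour, n in sorted(soft_errors_per_hour(log).items()):
    print(hour, n)
```

What counts as "too often" is still a judgment call; the tally only
makes a change in the rate easier to spot.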

Soft/Hard: "soft" means the memory "ecc" (Error Check/Correction)
logic detected an error but was able to correct it (single bit error).
"hard" means the ecc detected an error but couldn't fix it (double
bit error).
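
For anyone scripting against the log, a minimal sketch in Python of
pulling the pieces out of a message like the one quoted earlier.  The
regular expression and the name parse_mcr are my own assumptions,
modeled on that single sample line.  In particular I am assuming that
a "hard" message has the same layout and that both "addr" and "syn"
are printed in hexadecimal; neither is confirmed here.

```python
import re

# Pattern modeled on the one sample line in this thread.  The "hard"
# variant is assumed (not confirmed) to use the same layout.
MCR_RE = re.compile(
    r"mcr(?P<unit>\d+): (?P<kind>soft|hard) ecc "
    r"addr (?P<addr>[0-9a-f]+) syn (?P<syn>[0-9a-f]+)"
)

def parse_mcr(line):
    """Return a dict of fields from an mcr ecc log line, or None."""
    m = MCR_RE.search(line)
    if m is None:
        return None
    return {
        "unit": int(m.group("unit")),      # memory controller number
        "kind": m.group("kind"),           # "soft" = corrected, "hard" = not
        "addr": int(m.group("addr"), 16),  # assumed hexadecimal
        "syn": int(m.group("syn"), 16),    # assumed hexadecimal
    }

sample = "Jul 11 18:07:30 unix vmunix: mcr0: soft ecc addr 1a72 syn 73"
fields = parse_mcr(sample)
print(fields)
```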

"addr" and the following number, "1a72" (hexadecimal), can be used to
figure out which board was failing.  You need to know how much memory
is on each board, and multiply the "1a72" number by 4, since the ecc
logic looks at memory in 4-byte chunks:  (1a72*4) / (bytes per board),
rounded down, gives you the board number which had the error, and
(1a72*4) mod (bytes per board) is the offset within that board.
Unfortunately I'm not sure how the boards are laid out in a 750.
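
As a worked example of that arithmetic, here is a short Python sketch.
It takes the integer quotient as the board index and the remainder as
the byte offset within the board.  The function name locate and the
1 MB board size are hypothetical, and the sketch assumes equal-sized
boards mapped contiguously from address 0 and numbered from 0, which
may not match how a real 750 is populated.

```python
def locate(addr_hex, bytes_per_board):
    """Map a logged ecc address to (board index, byte offset in board)."""
    byte_addr = int(addr_hex, 16) * 4   # ecc logic counts 4-byte chunks
    # Quotient selects the board, remainder is the offset within it.
    return divmod(byte_addr, bytes_per_board)

# Hypothetical board size of 1 MB; substitute the real figure.
board, offset = locate("1a72", 1024 * 1024)
print(board, hex(offset))
```

With those assumed numbers, 1a72 hex times 4 is 69c8 hex (27080
decimal), which lands on the first board.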

The "syn" - Syndrome and following number "73" can be used to
figure out which chip on the board failed.

One last note:  I say "failed" above, but be aware that this generally
only means that one single bit out of a large number happened to 
change state.  With high density memory chips, this sort of thing is
not entirely unexpected, hence they build the boards with ecc logic
to correct the occasional expected bit flip.  Mcrx soft errors can
be ignored almost indefinitely, unless they start occurring in such
numbers that you think a whole chip has failed.  Even if a whole chip
fails, you can probably "limp along" for quite a while, assuming there
are no other problems on that memory board.  

-- 
Thomas Hacker               ...Weave a circle round him thrice,
Systems Programmer             And close your eyes with holy dread, 
Oakland University	       For he on honeydew hath fed, --"Kubla Khan" 
hackertj at unix.secs.oakland.edu And drunk the milk of Paradise. -- ST Coleridge


