WRT Disk errors on 11/750 running 4.3 BSD

Mon Jul 24 16:08:51 AEST 1989

[re `mcr%d: soft ecc addr %x syn %x' errors]

In article <117 at egrunix.UUCP> hacker at egrunix.UUCP (Thomas J Hacker) writes:
>... As long as you only see "soft" errors, and they don't occur "too
>often", you can just ignore them forever.

This is ill-advised.  The purpose behind error-detecting-and-correcting
memory is to fix the errors *and* provide a report so that failing chips
can be replaced when it is convenient to halt the machine, rather than
immediately after losing whatever was in progress.

>("too often": we had a 780 that would routinely report 10-12 of
>those mcr0 errors per hour, and other than wasting console paper,
>caused no other apparent problems.  It was like this for years.)

4BSD shuts off further error reports for ten minutes after each error,
so a machine that reports six errors per hour probably has at least one
hard failure (by this I mean `one chip that is really, truly bad':
both `soft' and `hard' ECC errors can be due to either `soft' or `hard'
hardware errors; a soft hardware error is like the noise your car makes
whenever it is *not* in the shop).  In this case a single stray cosmic
ray or alpha particle can bring the machine down with an uncorrectable
double-bit error, or, worse, corrupt two or more bits undetectably.
Running with a known hard failure is rather like driving your Honda
around when one cylinder is out---it works, but you should fix it as
soon as you possibly can.
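
To make the arithmetic behind `six errors per hour' concrete, here is a
rough sketch of the reporting policy as described above: one report, then
ten minutes of silence.  This is not the 4BSD kernel code; the function
name `reported_errors`, the constant `QUIET_SECONDS`, and the sample
error times are all made up for illustration, and the real kernel may
re-enable reporting on a somewhat different schedule.

```python
# Hypothetical sketch: not the 4BSD kernel code, only the reporting policy
# described in the post (one report, then silence for ten minutes).

QUIET_SECONDS = 10 * 60  # assumed suppression window after each report


def reported_errors(error_times, quiet=QUIET_SECONDS):
    """Return the error timestamps (in seconds) that would be reported."""
    reported = []
    quiet_until = None  # errors before this time are silently dropped
    for t in sorted(error_times):
        if quiet_until is None or t >= quiet_until:
            reported.append(t)
            quiet_until = t + quiet
    return reported


# A truly bad chip that is hit once a second for a whole hour...
hard_failure = range(0, 3600)
# ...versus three isolated transient errors in the same hour.
transients = [100, 1500, 3000]

print(len(reported_errors(hard_failure)))  # 6
print(len(reported_errors(transients)))    # 3
```

Under this model six reports per hour is the ceiling, so a machine that
sits at six is reporting as fast as the policy allows, and the true error
rate could be anything from six per hour on up.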
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris at mimsy.umd.edu	Path:	uunet!mimsy!chris