uq0 being reset

George Robbins grr at cbmvax.commodore.com
Wed Feb 14 23:40:39 AEST 1990


In article <709 at shodha.dec.com> alan at shodha.dec.com ( Alan's Home for Wayward Notes File.) writes:
> In article <9648 at cbmvax.commodore.com>, grr at cbmvax.commodore.com (George Robbins) writes:
> > 
> > It is important to understand that these messages are basically *fatal* -
> > meaning that you need to take action as soon as you see them...
> 
> 	One of the features of the Digital Storage Architecture
> 	(DSA) is that it tries to provide applications a view
> 	of disks that make them appear to be error free...
> 	                                ...  The block is good, but the
> 	data is corrupted from what it should have been.  Rather than
> 	gloss over it, the drivers force an Input error when the block
> 	is accessed.  The bit gets cleared when it is written to.

Thanks to Alan for posting all the additional info.  Since we've escaped from
the original "yer drive/controller is dead" subject, I'll expand a bit also...

1) The "forced error" is essentially a "tombstone".  The damage was done some-
   time before, the bad block was replaced, but the "tombstone" marks that
   place where the "corpse", the un-recovered data, lies.

   Obviously, whether the corrupted data is important to recover or not is
   installation/case dependent.  HOWEVER, even though only a few bits or
   bytes may have gotten zapped, it is important to understand that the
   forced error condition does propagate up to the user level software.

   This means that if a program bothers to do any error checking, it's
   likely to toss its cookies or at least stop processing that file then
   and there.  For example, dump will print "shouldn't happen" errors and
   tar truncates the file and prints a "size changed" message.  What your
   pet application does is another question.

   If you're getting these messages, it is still important to act on them
   with due haste, though not necessarily panic.

2) Generally, you will tend to get the forced error message with the same
   block number repetitively - there was only one original error/replacement
   but each time you read the file you'll get the forced error again, for
   example as part of your daily backup run...

   If you start getting a variety of block numbers, then it's a pretty good
   indication that your drive is starting to go south or maybe (especially
   if it's a 3rd-party drive) you didn't run enough surface analysis to
   initially pick up all the bad spots.

3) You can use "/etc/uerf -o full -D" to pick up the gory history of the
   problem, though interpreting the error log is non-trivial.

   I can't remember off-hand whether a "BAD BLOCK REPLACEMENT FAILED"
   message actually gets logged to the console - I've always ended up seeing
   the "forced error" messages rather than the original error.

   The value of /etc/uerf is somewhat compromised by the amount of pro-forma
   crap that gets stuck in there, especially if you have something like 12K
   "fixed up unaligned access" messages (lps20 "lpscomm" thanks 8-) gumming
   up the works.

   I've attached a little shell script that I run nightly that mails me a
   summary of accumulated errors.  Periodically, I clear the logfile and
   let it start over.

   One of the other guys posted a program to do some log selection/analysis
   which might be a better starting point - I haven't messed with it yet...
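
Going back to points 1 and 2: since a forced error comes back to the program
as an ordinary read error, a crude way to find out which files are sitting
on the "tombstone" is to read each one end to end and see which reads fail.
This is only a sketch; "check_readable" is a name I made up, not a standard
utility, and a read can also fail for boring reasons (permissions, file
gone), so treat a hit as a hint rather than proof.

```sh
#! /bin/sh
# Sketch only: report files that cannot be read from start to finish.
# A forced-error block makes the read fail, so cat exits non-zero.
# "check_readable" is a made-up helper name, not a standard utility.

check_readable() {
	rc=0
	for f in "$@"
	do
		# Read the whole file and throw the data away; only the
		# exit status matters here.
		if cat "$f" > /dev/null 2>&1
		then
			:
		else
			echo "read failed: $f"
			rc=1
		fi
	done
	return $rc
}
```

Something like "check_readable /some/dir/*" prints one line per file that
could not be read and returns non-zero if there were any.  If the same file
keeps turning up, that fits the single-tombstone case; a growing list of
different files is more like the drive going south.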

Here's the error log analyzer...
--------------------------------------------------------------------------------
#! /bin/sh
# This is a shell archive, meaning:
# 1. Remove everything above the #! /bin/sh line.
# 2. Save the resulting text in a file.
# 3. Execute the file with /bin/sh (not csh) to create the files:
#	daily
# This archive created: Wed Feb 14 07:32:11 1990
export PATH; PATH=/bin:$PATH
echo shar: extracting "'daily'" '(360 characters)'
if test -f 'daily'
then
	echo shar: will not over-write existing file "'daily'"
else
sed 's/^	X//' << \SHAR_EOF > 'daily'
	X#! /bin/sh -
	X(
	X# extract summary/counts from uerf garbage
	X
	Xecho ""
	Xecho "Error Log Messages:"
	X/etc/uerf | \
	X	egrep '^MESSAGE|^ERROR' | \
	X	sed -e 's/.*MESSAGE *//' -e 's/.*ERROR SYNDROME *//' | \
	X	sort | \
	X	uniq -c
	X
	X# should do some kind of (crude) rotation...
	X
	X# LOG=/usr/adm/syserr/`hostname`
	X# cat $LOG >> $LOG.old
	X# cat /dev/null > $LOG
	X
	X) 2>&1 | mail root
SHAR_EOF
if test 360 -ne "`wc -c < 'daily'`"
then
	echo shar: error transmitting "'daily'" '(should have been 360 characters)'
fi
fi # end of overwriting check
#	End of shell archive
exit 0
-- 
George Robbins - now working for,     uucp:   {uunet|pyramid|rutgers}!cbmvax!grr
but no way officially representing:   domain: grr at cbmvax.commodore.com
Commodore, Engineering Department     phone:  215-431-9349 (only by moonlite)


