uq0 being reset
Alan's Home for Wayward Notes File.
alan at shodha.dec.com
Wed Feb 14 03:28:42 AEST 1990
In article <9648 at cbmvax.commodore.com>, grr at cbmvax.commodore.com (George Robbins) writes:
> In article <3888 at ucrmath.UCR.EDU> russ at mays.ucr.edu () writes:
> > We have a VAXstation II running Ultrix 2.0 with
> > a TK50, and two RD53 drives (one recently added).
> > From time to time console messages appear saying
> > Force Error Modifier set LBN ......
> > ra1g: hard error sn .....
>
> It is important to understand that these messages are basically *fatal* -
> meaning that you need to take action as soon as you see them. It is probably
> an indication that either your drive wasn't adequately formatted/tested
> initially or that it is picking up new errors.

One of the features of the Digital Storage Architecture
(DSA) is that it tries to provide applications a view
of disks that makes them appear to be error free. It does
this by mapping bad sectors to good ones. Any initially bad
sectors are mapped when the disk is formatted. For errors
that occur after formatting there are parts of the architecture
that describe what is to be done.

For this commentary I'll call the process Bad Block Replacement
(or BBR). There are two kinds of BBR, static and dynamic. Pre-V2.0
versions of ULTRIX and BSD 4.2 (and probably 4.3) do static BBR.
If a bad block appeared and had to be fixed, you booted a stand-alone
program (rabads, I think) that would let you scan the disk and
would do the BBR for you. Dynamic BBR has been supported by
every version of ULTRIX since V2.0, and by some disk controllers.

The UDA50, KDA50 and KDB50 disk controllers will report a bad
block to the host and expect the host to perform BBR. The
RQDX3 and HSC family will do the BBR themselves. Part of the
BBR process is to attempt to read the block many times in order
to get a good copy of the data. If the attempt fails then the
original copy of the data is written to a replacement block and
a bit is set in the block header. This is the "Forced Error"
referred to in the error message. The block is good, but the
data is corrupted from what it should have been. Rather than
gloss over it, the drivers force an input error when the block
is accessed. The bit gets cleared when the block is written to.

In V2.0 and later there is a program called radisk(8) that has options
to scan for bad blocks, clear forced errors and start the BBR
algorithm for a specific block or set of blocks (more on this
one later). The command to clear a forced error is:

    radisk -c LBN length special

LBN is the logical block number where the forced
error is. The length is generally 1, but if you have
a set of sequential forced errors you can get them all
at once. The last argument is the special device file
for the disk.

NOTE: Radisk should only be run with the system in single-user
mode. This is a documented restriction of the program.

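As a small illustration of the length argument: to clear a run of
sequential forced errors, the length is the last LBN minus the first,
plus one. The sketch below only prints the radisk command line for an
inclusive LBN range; it does not run radisk. The function name
clear_range_cmd, the LBNs and the device /dev/rra1c are made up for
the example, and the $(( )) arithmetic is POSIX shell (an older sh
would need expr instead).

```shell
#!/bin/sh
# Dry-run helper (hypothetical name): print, but do not execute, the
# radisk -c command covering the inclusive LBN range first..last.
clear_range_cmd() {
    first=$1
    last=$2
    special=$3
    if [ "$last" -lt "$first" ]; then
        echo "last LBN must not be below first LBN" >&2
        return 1
    fi
    # Inclusive range, so a single block gives a length of 1.
    length=$(( $last - $first + 1 ))
    echo "radisk -c $first $length $special"
}

# Hypothetical LBNs and device name, for illustration only.
clear_range_cmd 1234 1236 /dev/rra1c
# prints: radisk -c 1234 3 /dev/rra1c
```

Printing the command first gives you a chance to check the numbers
against the console messages before running anything in single-user
mode.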
The scan operation tells the controller to scan the disk and
doesn't transfer any of the data back to the host. This makes it
faster than doing something like a dd(1) to read every block.
The command is:

    radisk -s LBN length special

If you want to scan the entire disk you can use:

    radisk -s 0 -1 special

and radisk(8) will figure out the length. The command to
force BBR is:

    radisk -r LBN special

The algorithm doesn't automatically replace a block, but
exercises it to make sure that it is bad. If the block isn't
bad then it won't replace it.
>
> Typically, after getting a hard error, you want to do a backup, address
> the error condition and then restore the filesystem. You can use something
> like "tar cvf /dev/null /mount_point" to try to figure out which file(s)
> the bad spot(s) are in, if you care.

Once you've cleared a Forced Error on a replaced block you
need to determine if the block was important. George's
suggestion is ok, but if you know the block numbers and can
translate them into block numbers within the partition
there are simpler ways of finding the file.

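The translation itself is just a subtraction, sketched below under
some loud assumptions: you already know the partition's starting
sector from the disk's partition table (how to read it is not covered
here), and the numbers shown are invented. The optional third argument
divides by a sectors-per-block factor in case the tool you feed the
result to counts in units larger than one sector; check icheck(8) for
the units it expects rather than trusting the default of 1. The
function name is hypothetical.

```shell
#!/bin/sh
# lbn_to_partition_block LBN PART_START [SECTORS_PER_BLOCK]
# Hypothetical helper: convert a whole-disk LBN into a block number
# relative to the start of a partition.
lbn_to_partition_block() {
    lbn=$1
    start=$2
    spb=${3:-1}     # assumption: 1 sector per block unless told otherwise
    if [ "$lbn" -lt "$start" ]; then
        echo "LBN $lbn lies before the partition start $start" >&2
        return 1
    fi
    # Integer division: any remainder (offset within the block) is dropped.
    echo $(( ($lbn - $start) / $spb ))
}

# Invented values: partition starts at sector 200000, error at LBN 250017.
lbn_to_partition_block 250017 200000
# prints: 50017
```

If the LBN falls before the partition start, the error is in a
different partition and the function refuses to produce a number.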
First identify where the block is:

    icheck -b block-number special

Icheck(8) can take a list of block numbers and identify
where the blocks are. It will say whether the block is
part of the inode list (and which inodes), a data block
of a file, a free block, a superblock (or backup superblock),
etc. If the block belongs to a file you can track down the
file name by the inode number with:

    ncheck -i inode-number special

This can be slow, so if you can mount the file system another
method is to use ls and grep:

    ls -Rli | grep inode-number

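One caveat with the grep approach is that it matches the number
anywhere on the line (a file size, for instance), and ls -R output
doesn't show the full path next to each entry. A possible alternative,
assuming your find supports the -inum and -xdev options (current
versions do; I haven't confirmed it for every ULTRIX release), is to
match on the inode number exactly. The function name find_by_inode is
made up for this sketch.

```shell
#!/bin/sh
# find_by_inode DIR INODE
# Hypothetical helper: print the full path of every file under DIR
# whose inode number is exactly INODE.
# -xdev keeps the search on one file system, since inode numbers are
# only unique within a single file system.
find_by_inode() {
    find "$1" -xdev -inum "$2" -print
}

# Example use (mount point and inode number are placeholders):
#   find_by_inode /mount_point 1234
```

A file with several hard links will be listed once per link, which is
what you want when deciding what needs to be restored.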
Once you know the file you can replace it with a good copy
from a backup or the distribution (or another system). Sometimes
it will be a file that is easily recreated (an object file,
for example).

If a block of inodes is bad you'll have to determine if any
of them are used. Generally for this I use fsck so I can
repair any damage that there is. Sometimes the damage will
be bad enough that it's simpler to restore from a backup.
>
> If the bad block(s) are in inodes or other unpleasant spots, your system may
> crash when accessing the mounted filesystem or the filesystem may become
> more corrupt.
For this reason it's a good idea to avoid mounting the file
system until you know where the problem is.
> George Robbins - now working for, uucp: {uunet|pyramid|rutgers}!cbmvax!grr
> but no way officially representing: domain: grr at cbmvax.commodore.com
> Commodore, Engineering Department phone: 215-431-9349 (only by moonlite)
--
Alan Rollow alan at nabeth.enet.dec.com