4.3 Tahoe dump bug

Mon Dec 19 09:25:43 AEST 1988

While trying to get the 4.3 Tahoe dump running on a Sun 3 under
SunOS 3.X, I, along with others, have run into the bug (feature?)
shown below.

>Writing dump file 0 (/research)
>  DUMP: Date of this level 1 dump: Sat Dec 17 12:59:10 1988
>  DUMP: Date of last level 0 dump: Wed Dec 14 19:08:42 1988
>  DUMP: Dumping /dev/rxy1g (/research) to /dev/rmt1h on host houdini
>  DUMP: mapping (Pass I) [regular files]
>  DUMP: mapping (Pass II) [directories]
>  DUMP: (This should not happen)bread from /dev/rxy1g [block 58766]: count=24, got=512
>  DUMP: (This should not happen)bread from /dev/rxy1g [block 60802]: count=536, got=1024
>				.
>				.
>				.
>  DUMP: (This should not happen)bread from /dev/rxy1g [block 372316]: count=1040, got=1536
>  DUMP: (This should not happen)bread from /dev/rxy1g [block 378344]: count=24, got=512
>  DUMP: More than 32 block read errors from 152660
>  DUMP: This is an unrecoverable error.
>  DUMP: NEEDS ATTENTION: Do you want to attempt to continue?: ("yes" or "no") no
>  DUMP: The ENTIRE dump is aborted.

This error is produced by the routine bread in dumptraverse.c.  I am
having a difficult time figuring out what this routine is "supposed"
to be doing.  I believe there are several bugs in it, and that it
should look something like the following:

bread(da, ba, cnt)
        daddr_t da;
        char *ba;
        int     cnt;
{
        int n;

        if (lseek(fi, (long)(da * dev_bsize), 0) < 0){
                msg("bread: lseek fails\n");
        }
        while (cnt) {
                n = read(fi, ba, cnt);
                if (n == 0) {
                        msg("(This should not happen)bread from %s [block %d]: count=%d, got=%d\n",
                                disk, da, cnt, n);
                        broadcast("DUMP IS AILING!\n");
                        msg("This is an unrecoverable error.\n");
                        if (!query("Do you want to attempt to continue?")){
                                dumpabort();
                                /*NOTREACHED*/
                        }
                }
                cnt -= n;
                ba += n;
        }
}
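
For comparison, here is a minimal sketch of a read loop that separates
the three results read() can give: an error (n < 0), end of file
(n == 0), and a short read (0 < n < cnt).  The loop above tests only
n == 0, so a return of -1 would make cnt grow and move ba backwards,
and answering "yes" at end of file would spin forever.  The name
read_fully and its return convention are assumptions made for
illustration; they are not part of dump.  The lseek is left to the
caller, as bread does it before reading.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * read_fully (hypothetical helper, not part of dump): read up to cnt
 * bytes from fd into buf, retrying after short reads.
 * Returns the number of bytes actually read (less than cnt only if
 * end of file was reached), or -1 if read() failed.
 */
long
read_fully(int fd, char *buf, long cnt)
{
        long total = 0;
        ssize_t n;

        assert(buf != NULL && cnt >= 0);
        while (total < cnt) {
                n = read(fd, buf + total, (size_t)(cnt - total));
                if (n < 0) {
                        if (errno == EINTR)
                                continue;       /* interrupted, retry */
                        return -1;              /* real I/O error */
                }
                if (n == 0)
                        break;                  /* end of file */
                total += n;
        }
        return total;
}
```

A caller in the style of bread would compare the result with cnt and
report an error only when the two differ, rather than looping on a
read that can never make progress.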

The routine in the 4.3 Tahoe source currently looks like:

bread(da, ba, cnt)
        daddr_t da;
        char *ba;
        int     cnt;
{
        int n;

loop:
        if (lseek(fi, (long)(da * dev_bsize), 0) < 0){
                msg("bread: lseek fails\n");
        }
        n = read(fi, ba, cnt);
        if (n == cnt)
                return;
        if (da + (cnt / dev_bsize) > fsbtodb(sblock, sblock->fs_size)) {
                /*
                 * Trying to read the final fragment.
                 *
                 * NB - dump only works in TP_BSIZE blocks, hence
                 * rounds `dev_bsize' fragments up to TP_BSIZE pieces.
                 * It should be smarter about not actually trying to
                 * read more than it can get, but for the time being
                 * we punt and scale back the read only when it gets
                 * us into trouble. (mkm 9/25/83)
                 */
                cnt -= dev_bsize;
                goto loop;
        }
	msg("(This should not happen)bread from %s [block %d]: count=%d, got=%d\n",
		disk, da, cnt, n);
	if (++breaderrors > BREADEMAX){
		msg("More than %d block read errors from %d\n",
			BREADEMAX, disk);
		broadcast("DUMP IS AILING!\n");
		msg("This is an unrecoverable error.\n");
		if (!query("Do you want to attempt to continue?")){
			dumpabort();
			/*NOTREACHED*/
		} else
			breaderrors = 0;
	}
}
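
The final-fragment case in the routine above shrinks cnt by dev_bsize
one step at a time, and only after a read has already come up short.
The same bound can be worked out before reading; a minimal sketch of
that arithmetic follows.  clamp_count and fs_end_db are hypothetical
names: fs_end_db stands for the value of
fsbtodb(sblock, sblock->fs_size), that is, the file system size in
dev_bsize units.  This only illustrates the calculation and is not
what dump does.

```c
#include <assert.h>

/*
 * clamp_count (hypothetical): return the largest byte count <= cnt
 * that can be read starting at device block da without running past
 * fs_end_db, the first device block beyond the file system.
 * All block numbers are in units of dev_bsize bytes.
 */
long
clamp_count(long da, long cnt, long dev_bsize, long fs_end_db)
{
        long avail;

        assert(dev_bsize > 0 && cnt >= 0);
        if (da >= fs_end_db)
                return 0;                       /* nothing left to read */
        avail = (fs_end_db - da) * dev_bsize;   /* bytes up to the end */
        return cnt < avail ? cnt : avail;
}
```

With a dev_bsize of 512 and a file system of 1000 device blocks, a
1024-byte request at block 999 is cut to 512 bytes, while the same
request at block 998 is left alone.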

Am I misinterpreting what this routine is supposed to be doing?
Will my code work?  If not, why?

Thanks

---
W. Tait Cyrus   (505) 277-0806		e-mail: cyrus at pprg.unm.edu
University of New Mexico			
Dept of ECE - Parallel Processing Research Group
Albuquerque, New Mexico 87131