Weird problems caused by corrupted system and how I rebuilt it. (long)

Augustine Cano canoaf at ntvax.UUCP
Fri Jun 9 12:09:02 AEST 1989


Hi everyone!

The problem:  Programs that used to work stopped working.  At first I
attributed the problem to the latest (as far as I know) version of C-kermit,
which I got from Columbia U. hoping that the major problems would be solved.
Well, no luck.  Kermit (on a 3b1) still does not exit without help from ^C,
and, most disturbing, when the following was executed from a "take" file, it
locked up so badly that the only way out was to kill its parent shell from
another window.  The work-around that used to work in the previous version
did not work anymore.

set modem att
! phtoggle
set line /dev/ph0
set speed 1200
dial nnn-nnnn
connect

Well now, after rebuilding the system, it does not lock up anymore; it just
does not set the modem properly.
The funny thing is that the exact same sequence works fine when typed at
the prompt.  Am I overlooking something?  Is anybody having the same problem?
Is anybody using kermit on a UNIX PC?  I believe that this is a genuine kermit
problem.  Am I wrong?

Vtem (anyone out there using vtem, the VT100 emulator that uses the
pty's?) also acted very strangely.  I compiled it under install, and when I
ran it while logged in as install, it worked fine.  Under another login it
would lock up, and not even the echo appeared on the screen.  The only
solution was to kill its parent shell, just like kermit.  Vtem also sometimes
seemed to switch to the British character set, since '#' appeared as the
pound-sterling symbol.
This was solved by logging out.

Finally, one day, Lenny's sysinfo program stopped working after a brief power
outage.  At his suggestion, I checked whether the lipc driver was loaded and,
sure enough, it wasn't.

It turned out that quite a few files were in inconsistent states.  Many
libraries were different from the distribution ones, notably libc.a.  This
is probably the result of trying to have shcc and ccc installed at the same
time.  At least a couple of files related to loadable drivers were also
different and many ua files were inconsistent (2 entries for the same
package in installed software, pty entry remaining after it was removed, etc.)

The sysinfo problem was caused by the fact that loadavgd, when trying
to start, would dump core in /etc/lddrv (someone mentioned this symptom some
time ago) and therefore sysinfo could not communicate with it.

Rather than fix up individual files and risk missing some, I decided that
it was time for a major overhaul.  Not only will I end up with a guaranteed
consistent system, I thought, but the HD fragmentation would go way down.
The fragmentation did go down (from about 16.00 % to under 2 %) but I don't
want to have to do this again, EVER! (unless of course someone comes up with
an automated script to do it.)

At first, I thought: no big deal! just back up the whole thing, boot from
floppy and restore everything.  After 10 seconds though, it became obvious
that you either restore everything unconditionally, putting back the corrupted
files where they were, or you have to reconfigure the system manually when
you're done.  This would mean re-creating the groups, users, links and
configuration from scratch, as well as finding out about each and every file
that didn't come in a distribution set.

The solution:

1 -  Remove all installed packages from install (it is here that I found out
     about the inconsistent ua files.)
2 -  Back up /u (all users): find /u -print | cpio -oBcv > /dev/rfp021
3 -  MAKE SURE THAT THE CPIO SET IS READABLE: cpio -ictB < /dev/rfp021 for
     this and all future cpio sets.  This might save you a lot of grief.
     When I first tried to restore the whole HD, (a cpio set of 90+ floppies)
     cpio just quit at disk 74.  After trying a second time and failing at
     disk 71, I was getting pretty paranoid about losing irreplaceable data.
     It turned out that disks 71-91 were Kodak HD600 (96 TPI.)  I would have
     thought that better disks would have no problem at lower track densities.
     Is there something inherent in the magnetism of the (thinner?) magnetic
     coating or the sensitivity of a 48 TPI drive head that makes use of such
     floppies a hazard to your data? or did I just hit a few bad disks?
     In any case, using fc, I could copy the data from the 96 TPI disks
     (sometimes after many tries) to regular 48 TPI floppies.  From then on
     there was no problem.
4 -  Make one cpio set for each directory that does not exist in the
     distribution; in my case /usr/man, /usr/lbin, /usr/src, /usr/doc,
     /usr/local, /usr/games.
5 -  Log in as root.
6 -  Delete all /u files: rm -r /u (I felt really funny doing this...)
7 -  Delete all the directories backed up in 4.
8 -  Do a find / -newer /bin/cat -print > /tmp/modified.files.  This will
     make a list of all the remaining files that have been modified since
     the installation of the foundation set.
9 -  Print this file and go through it, deleting any files that you know are
     in a package that is backed up on floppy.  These files would still be 
     there because they were not removed in step 1, probably because they
     came from a non-installable package.
10-  Make a separate cpio set for each directory remaining on that list.  In
     my case: /bin, /etc, /lib, /usr/bin, /usr/lib, /usr/mail, /usr/spool.
     Mark these clearly to the effect that they will have to be reviewed
     before restoring.
11 - Reboot from the floppy UNIX and install a clean foundation set.  When
     asked if you want to wipe out the files on the HD, say yes.  (how often
     do you get to willingly destroy everything on your HD? :-)
12 - Log in as install and install the appropriate installable packages in
     the appropriate order: e.g. Telephone, ATE, Curses/Terminfo end user
     package, GSS Drivers, Dev. set, Enhanced editors, Encryption set (the
     order of this one is important), etc.
13 - Log in as root and restore the cpio sets made in step 4: cpio -iBdcv <
     /dev/rfp021 for each of /usr/man, /usr/doc, /usr/local, etc...  The
     idea of restoring these before /u is that, since these files are
     modified less than user files, they will stay packed and unfragmented
     closer to the beginning of the disk longer.  Is this reasoning correct?
14 - Make whatever links you had that were not standard: e.g.
     ln /bin/as /bin/mas, ln /bin/cc /bin/mcc, ln /usr/bin/compress
     /usr/bin/zcat, etc...
15 - cd /tmp
16 - One by one, restore the directories saved in step 10, REDIRECTING to
     the current directory: cpio -iBdcvR < /dev/rfp021
17 - For each of the directories in step 10, do: diff -r <name-of-directory>
     /<name-of-directory> > <name-of-directory>.diff.  This will give you
     a list of which files were present in your old directory and not in
     the clean one (these you want to copy to the new), which files are in
     the new and not in the old (ignore these), and which files are in both
     but differ.
18 - For each of the directories in step 10, edit the file
     /tmp/<name-of-directory>.diff.  Delete the lines: "Only in
     /<name-of-directory>".  Copy the files on the lines: "Only in
     <name-of-directory>" to the new one (/<name-of-directory>).  For those
     that exist in both the old and the new, you'll have to decide whether
     to copy them or not.  Unless you know what the file is for, and you're
     sure you want the old version, don't copy it.  It is better to have to
     do some minor configuration later on than to be left with a system that
     is still corrupt.  In the case of /etc, the only files
     I copied from the old directory (now /tmp/etc) were /etc/daemons/*,
     /etc/group and /etc/passwd.  It was in this step that I found out how
     many corrupted or inconsistent files I really had.
19 - After finishing each of the directories in step 10, clean up /tmp.  This
     will reduce external fragmentation (I think.)
20 - Install those packages that were in /usr/src.
21 - Do an unconditional restore of /u: cpio -iBdcvu < /dev/rfp021.  Before
     doing this I saved /u as it was laid out by the foundation set in a
     /tmp file and then applied step 17 to that file.  This, however, is not
     necessary, since no user files were modified during installation.
22 - Reboot the system, YOU'RE DONE !!!
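
The sorting done by hand in steps 17 and 18 could probably be mechanized.
What follows is only a sketch, not something that was run on the 3b1: it
assumes a diff that accepts -q and prints "Only in <dir>: <file>" and
"Files <a> and <b> differ" lines (GNU and BSD diff do; the 3b1's diff was not
checked).  The function name classify_tree and the COPY/SKIP/REVIEW labels
are made up for illustration.

```shell
#!/bin/sh
# Sketch only: label each line of `diff -rq OLD NEW` by what step 18 says
# to do with it.  classify_tree and its labels are hypothetical names.
classify_tree() {
    old=$1   # restored copy of the old directory, e.g. /tmp/etc
    new=$2   # freshly installed directory, e.g. /etc
    # LC_ALL=C keeps diff's messages in the English form matched below.
    LC_ALL=C diff -rq "$old" "$new" | while IFS= read -r line; do
        case $line in
            "Only in $old: "*|"Only in $old/"*)
                echo "COPY $line" ;;     # old only: candidate to copy over
            "Only in $new: "*|"Only in $new/"*)
                echo "SKIP $line" ;;     # new only: ignore
            "Files "*" differ")
                echo "REVIEW $line" ;;   # in both but different: decide by hand
        esac
    done
}
```

Run as, say, classify_tree /tmp/etc /etc > /tmp/etc.todo and then go through
the REVIEW lines by hand.  Nothing is copied automatically, which is
deliberate: the point of step 18 is that a person decides.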

One very minor problem is that links cannot be made across cpio sets.  Cpio
could not recreate the link of one /usr/src file to a /u file since /u was
not on the same set.
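
One way to spot such links before making the backups would be to list every
file that has more than one hard link.  This is a sketch under assumptions:
multi_link_files is a made-up name, and the find and ls options used are the
ones found on current systems, not verified on the 3b1.

```shell
#!/bin/sh
# Sketch only: list files under a directory that have more than one hard
# link, with their inode numbers, so that names sharing an inode sort
# together.  multi_link_files is a hypothetical helper name.
multi_link_files() {
    # -links +1 selects files whose link count is greater than one;
    # ls -i prints the inode number in front of each name.
    find "$1" -type f -links +1 -exec ls -i {} + | sort -n
}
```

If a file shows up when this is run on one backed-up tree but its partner
(the other name with the same inode number) is not in that same tree, the
link crosses cpio sets and will have to be remade by hand with ln after the
restore.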

The only re-configuration I had to do was to set up the printer, the phone
line and the screen blanking interval.  This was done in 5 minutes and could
have been avoided had I restored the files where this information is kept.  

Well, I hope this helps someone with a similar problem.  Of course, if
somebody decides to automate this procedure by putting it into a script, I
would definitely like to see it.  If someone has other ideas or comments on
how this process could be simplified, I would also like to hear them.
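
As a possible starting point for such a script, steps 8 and 10 could be
combined so that the list of modified files is reduced straight away to the
list of directories that need their own cpio set.  This is a sketch only;
modified_dirs is a made-up name, and the reference file plays the role that
/bin/cat plays in step 8.

```shell
#!/bin/sh
# Sketch only: print each directory under ROOT that contains at least one
# regular file modified more recently than the reference file REF.
# modified_dirs is a hypothetical helper name.
modified_dirs() {
    root=$1   # tree to scan, e.g. /
    ref=$2    # file dating from the installation, e.g. /bin/cat
    find "$root" -type f -newer "$ref" -print |
        sed 's|/[^/]*$||' |   # strip the file name, keep its directory
        sort -u               # one line per directory
}
```

Something like modified_dirs / /bin/cat > /tmp/modified.dirs would give the
candidates for step 10.  The list still has to be reviewed as in step 9,
since it cannot tell which files are already safe on floppy.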

Augustine Cano		canoaf at dept.csci.unt.edu


