Questions in Large Installation System Administration

Elizabeth Zwicky zwicky at erg.sri.com
Thu Feb 7 12:49:24 AEST 1991


This is my personal list of interesting questions in Large
Installation System Administration. The list was originally discussed
at the Winter 1991 USENIX in Dallas at the LISA BOF, and is posted at
the request of those attending that BOF. This is a reorganized,
lengthened, and cleaned up version, which includes the questions added
by people there. Asterisks mark leaf nodes, so I can count up how many
there are (I have a vague theory that every asterisk is approximately
a paper's worth of question). Where I know them, I have listed people
working on the problem. A future version will also give references to
existing work. I will happily update the list with extra references,
questions, names of people and so on if you send them to me.

As to why you might be interested in this; well, it'll give you
something to think about during those long sleepless nights. More
seriously, it might suggest problems you ought to worry about before
they bite you; it's a good place to look for paper topics if you think
you'd like to write a paper (which is good for you personally in that
it impresses people, and good for the world in general in that it
spreads information and minimizes redundant work); it may point you
towards information, or even just useful ways of stating questions,
to help with problems you already have. 

1) Storing data.

	*A) Partitioning disks. Little partitions separate out
different uses of disks; big partitions avoid some wasted space. How do
you decide where to draw the line? How do you balance loads between
disks and controllers? Where do you trade off between manageability and
efficiency? What are the issues you should consider when partitioning
disks? (Some seemingly obscure things, like putting high-traffic
partitions closer to the center of the disk, can have noticeable
effects on performance.)
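
For concreteness, here is one hypothetical layout for a single disk;
the device names, sizes, and placement are invented for illustration,
not as recommendations:

	sd0a  /      16 MB   root: small, quick to dump and restore
	sd0b  swap   64 MB
	sd0g  /usr  200 MB   mostly read-only, easy to clone
	sd0h  /home 320 MB   high-traffic, so placed on the middle cylinders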

	B) Migration systems. Some users have a lot of data that they
don't use much but insist on having around just in case. One way to
deal with this is to provide ways to silently transfer files away from
expensive, size-limited, but quickly accessible magnetic disk onto
slower but larger and more extensible media, and transfer them back
when they're looked at. Such systems exist, but usually require
either major investments of money or kernel modifications.

		*i) What options currently exist, and how do they
compare?

		*ii) Supposing infinite resources, what should such a
system really do? Can it be done without kernel modifications? How do
you decide when to move files off line? Are vendors making the right
assumptions about file usage patterns? Is there a single set of right
assumptions, and what can you do if there isn't? What do you do about
really long delays? How do you reclaim space on tertiary storage?
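
As a very crude illustration of the move-it-off-line half, with no
kernel support at all: a sweep that pushes files unread for 90 days
onto tape and leaves a marker file behind. The paths, the cutoff, and
the tape device are all assumptions here, and a real system would need
a catalog, locking, and an automatic way back.

	#!/bin/sh
	# Sweep long-unread files under /home onto tape, leaving a
	# marker behind.  Sketch only: no catalog, no error recovery,
	# no way back except reading the tape by hand.
	list=/tmp/migrate.$$
	find /home -type f -atime +90 -print > $list
	tar cf /dev/rmt0 `cat $list` || exit 1
	while read f; do
		rm "$f" && echo "migrated to tape `date`" > "$f.MIGRATED"
	done < $list
	rm $list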

	*C) For convenience, the rest of this section is divided
between backup systems (designed to restore data lost in the event of
a failure of some sort) and archive systems (designed to save data for
long periods of time). Current systems do not make this distinction
well, so most sites use a backup scheme to provide a historical record
of some sort, as well as for immediate recovery, and patch in a second
system (or non-system) to deal with files that they know will need to
be accessed in ways that the backup system doesn't support. For both
backups and archives, assuming that a system has only the one purpose
simplifies its design. Is this really a defensible assumption? Even if
it is, where do you put the current historical uses of backups?
Because we currently use our backup system to support some archive
purposes, we keep some tapes for a very long time. We have needed
those ancient tapes for purposes we did not foresee, and therefore have
pulled off data which we would not have explicitly transferred to
archives. 

	D) Backups

		*i) Almost everybody has a locally designed backup
system; there are also commercially available systems. There is no
available source of information about what to consider when designing
such a system, or about what the common pitfalls are.
			[Elizabeth Zwicky, zwicky at erg.sri.com]

		*ii) Any careful attempt to design a backup system
reveals that the available programs which transfer data from
filesystems into other forms have severe problems. Serious bugs are
now known in every program (including dump) used for this purpose.
We need a new one.
			[Steve Romig, romig at cis.ohio-state.edu]

		*iii) Techniques for reliably speeding up dump are
reasonably well known. Now what do we do about restore?

		*iv) How do you back up a terabyte? Suppose you have a
migration system with a terabyte or so of secondary storage; what do
you do if the building burns down?

	*E) Archive systems. If a user comes to me and says "I have
here 100M of data which I may or may not need to look at some time in
the next 20 years," what do I do with it? 

2) Security

	*A) What can you do when users have root? There are many
situations in which it is simply impossible to take all root
permission away from the users. What are the technical and personal
measures you can take to let the users do what they need to without
unduly compromising the security of your network?

	*B) How can you convince users to co-operate with security
precautions? You can force them to choose good passwords; if you try
hard enough, you can even manage not to drive them mad in the
process. But you can't forcibly prevent people from writing down their
passwords, or giving them out to other people. How do you make
security safeguards that are livable and comprehensible, and get
people not to turn around and destroy them for their own purposes?

	*C) Trust in confederations. In many situations, systems with
separate administrators are grouped together in loose confederations,
where administrators on different systems are roughly peers, but need
to work together. (For instance, two departments within a company may
each have separately administered machines.) In such a situation, you
can't force your confederates to be trustworthy, but you may
nevertheless have good reason for wanting to share resources in ways
that require trust. How do you negotiate that trust while remaining secure?

	*D) What tools are there for evaluating security, and how do
they compare?

	*E) How do you decide how secure you need to be?

3) Adding machines to your network

	*A) How do you keep users from adding random machines to the
network? 

	*B) Usually, you need to be willing to deal with all
reasonable requests to add things. How can you tell which machines are
reasonable to add to the network, and which aren't? Someone shows up
in your office with a hyper-intelligent coffee maker that runs Mr.
UNIX, and wants you to integrate it. Is it going to make the network
explode or not?

	*C) How do you figure out what it costs to add a machine?
Obviously, if you have 200 Suns, adding another Sun costs something
(adds network load, uses server space, etc.), adding a VAX costs
something more (now you have to support another architecture), adding
a hyper-intelligent coffee maker running Mr. UNIX costs something more
(now you have to support another architecture that nobody can help you
with), adding a VMS VAX costs yet more (a whole new operating system,
another networking protocol...) But just what are the costs? Some
increase linearly; it takes roughly twice as long to compile a program
for two different architectures as for one. Some of them are much
worse than linear; NFS may be a great thing, but it isn't always the
same, and you need to test every architecture as client against every
architecture as server, giving you an order N-squared problem.

	*D) Once you know you have to add a machine, what do you need
to do to integrate it?

	*E) How do you plan your network to make it easier to add
things to?

	*F) What do you do when you need to increase the number of
machines on your network by a factor of 2 or more? 

4) Buying software

	*A) How do you determine what the administrative cost of a
piece of software is? Programs differ in what they cost to administer,
sometimes in obvious ways (for instance, they require complex and
horrible printcap files, or a separate printcap for every user), and
sometimes in unobvious ways (using mh for mail makes for vast numbers
of small files changing every day, strongly biasing the pattern of
file system usage). When you are evaluating software, where do you
look for these costs?

	*B) How do you manage to install software, since vendor
installation scripts tend to break things or to fail?
		
		*i) What would a vendor install script that you didn't hate
look like? 

	*C) How do you select software to purchase? Given that users
and system administrators tend to have different agendas in selecting
software, what procedures and criteria can you use to make decisions
that everybody can live with?

5) Monitoring usage

	A) Statistics used for fairness and charging purposes

		*i) Disk quotas. Berkeley UNIX systems come with a
disk accounting system, but it isn't very effective. Many sites have
their own accounting systems. What are the appropriate abstractions in
such a system? Some quota systems look at file ownership; some look at
position of the file in the file hierarchy. What do you do about
tracking usage for multi-user groups or projects? How do you determine
where quotas are set? Do you set quotas so that they total to no more
than the available disk, and risk having users run out of quota before
you run out of disk, or do you set them higher, and risk having users
run out of disk before they run out of quota? Since disk space usage
changes over time, how do you charge people for it? Who do you charge
for it (users, projects, groups)? What do you do when they run out of
it? Who gets to decide how much space people need, and what criteria
do they use? How do you keep people from cheating?
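
As one low-tech starting point that ignores the kernel quota system
entirely: total each user's usage under their home directory and
compare it with a table. The table format and paths are assumptions,
and du misses files the user owns elsewhere in the tree.

	#!/bin/sh
	# Report users over quota.  Assumes /etc/quotas holds lines of
	# "username blocks-allowed" and homes live in /home/username;
	# both are assumptions.  du -s reports usage in disk blocks.
	while read user limit; do
		used=`du -s /home/$user | awk '{ print $1 }'`
		if [ "$used" -gt "$limit" ]; then
			echo "$user: $used blocks used, $limit allowed"
		fi
	done < /etc/quotas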

		*ii) Printer accounting. There is moderate
vendor-supplied support for tracking the number of pages printed on
some printers. This is not sufficient for most people who want to do
printer accounting. There are also other issues; on a PostScript printer,
you may spend many hours of printer time to produce a one-page image.
Do you start accounting for printer CPU? On a network-connected
printer, where random machines may send it jobs, how do you even do
page accounting?
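
For plain page counts, where lpd does write an accounting file, a
small awk pass goes a long way; the line format shown ("pages
host:user") is the traditional BSD one, and the file name is whatever
the af= field of your printcap says.

	#!/bin/sh
	# Sum pages per user from a BSD-style lpd accounting file whose
	# lines look like "  5.00 rose:zwicky".  Path is the af= entry.
	awk '{ split($2, part, ":"); pages[part[2]] += $1 }
	END { for (u in pages) print u, pages[u] }' /usr/adm/lpacct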

		*iii) Process accounting. Again, some versions of UNIX
support some process accounting. Unfortunately, businesses usually want
to charge projects, not people, and support for that is non-existent.
How do you hack it in?

		*iv) OK, so you've figured out how to track usage of
printers, usage of disk, usage of CPU. How do you provide a single
interface to all this data?

		*v) You have all the data you could possibly want. How
do you charge people? Do you charge them for connect time, or CPU
cycles, or something else completely? If you don't charge them based
on usage, do you charge them a flat fee, or a fee based on
availability, or what?

	*B) Statistics used for capacity estimation. It's easy to tell
that you don't have enough network bandwidth, once you run out. How do
you tell how much you have left before you run out? How do you know
how soon you're going to run out of disk space? How do you know how
many usable spare CPU cycles you have?
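
For disk space at least, the cheapest defensible answer is to start
recording now and extrapolate later; one crontab line (log location
invented) buys the history you will wish you had:

	# crontab entry: log disk usage nightly for later trend analysis
	0 2 * * * df >> /usr/adm/df.log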

	*C) Statistics used to track usage patterns for design and
optimization purposes. When you go out to design a backup system, or
speed up your network, or otherwise fiddle with things, you often need
to know exactly what it is that people do with your system, and when.
The sort of information you need tends to be different from the sort
of information you need for charging people; for instance, you may
need to know the number of files changed in a day, or the number of
NFS reads as opposed to NFS writes that occur. How do you figure out
all of these things? nfsstat will tell you about NFS traffic (once you
figure out what statistics you want and where you're going to keep
them and how you're going to analyse them), and if you happen to have
sources to it, your backup system can be instrumented to give you
information about what files are changing when. But you need to know
what statistics are important, how to gather them, and how to figure
out what they mean.
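
Since the nfsstat counters only ever climb (until a reboot or an
explicit zeroing), one workable trick is to snapshot them on a
schedule and difference adjacent snapshots; a minimal sketch, with the
log location invented:

	#!/bin/sh
	# Append a timestamped snapshot of NFS server statistics;
	# differences between snapshots give traffic per interval.
	log=/usr/adm/nfsstat.log
	echo "==== `date`" >> $log
	nfsstat -s >> $log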

	*D) Statistics for communicating to users. System
administrators generally have a vague idea what's going on with their
machines. Users rarely have that much of an idea. How do you make
available to them information that they can understand about what the
computers are doing?

	*E) Performance monitoring. How are your machines doing, and
is it getting better or worse? Are the users complaining because users
are like that, or because something is really wrong? And where do you
get cute graphs that management likes that show how you're supplying
marvelous facilities to people?

6) Clone wars. 100 identical machines are a lot easier to deal with
than 100 different machines, and so are 100 mostly identical machines.
But how do you get them that way, and how do you keep them that way,
and how do you use their identicality to help you?

	*A) Turning chaos into clones; how do you create a cloned site
out of individual machines?

	*B) Executing across multiple hosts. There are available
programs that take a command and execute it on multiple hosts (for
instance, gsh), but they tend to be highly site-specific. How do you
set one up for your site? Or, how about someone writing one that will
work for a lot of sites without too much fiddling?
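
The core of such a program is small; nearly all of the
site-specificity lives in the host list and the error handling. A
minimal sketch, with the host list location assumed:

	#!/bin/sh
	# Run the given command on every host in the list, labeling
	# output.  Assumes rsh works unprompted (.rhosts/hosts.equiv).
	for host in `cat /usr/local/lib/hosts`; do
		echo "==== $host"
		rsh $host "$@"
	done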

	C) A cloned facility needs tools to make machines look alike.

		*i) An overview of existing methods and philosophies
for distributing changes between machines.

		*ii) An improved version of rdist that would be widely
available and applicable while implementing useful features like
time-outs. [Slightly worked on by Tom Christiansen, tchrist at convex.com]
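
Short of rewriting rdist, a crude timeout can be bolted on in the
shell; this sketch kills the run after an hour, with the usual caveats
about process-id reuse and the orphaned sleep:

	#!/bin/sh
	# Run rdist with a one-hour timeout.  Crude: if rdist finishes
	# early, the background sleep lingers until its hour is up.
	rdist -f Distfile &
	pid=$!
	(sleep 3600; kill $pid 2>/dev/null) &
	wait $pid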

	D) When clones develop personality; how to deal with machines
that are alike in some ways, but different in others. Machines are
never quite perfect clones of each other, especially if they sit on
people's desks. In many cases, the changes are encapsulated in pieces
of files (for instance, printcap files that differ by default printer,
or that have one printer local but all the rest remote). These are
currently handled on a case-by-case basis at most places, with a
program to take care of printcaps, and one to build rc.local files,
and so forth.

		*i) Those case-by-case programs are in themselves of
interest.

		*ii) A more general solution to the problem is also
needed; how do you provide a flexible ability to customize files for
multiple hosts? [slightly worked on by Elizabeth Zwicky,
zwicky at erg.sri.com, separately by Steve Romig, romig at cis.ohio-state.edu] 
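
One common shape for the general solution is a per-host template with
substitution keywords; a sed-based sketch (the keyword syntax and the
default-printer naming scheme are inventions):

	#!/bin/sh
	# Expand a printcap template for this host: @HOST@ and
	# @PRINTER@ in the template are replaced with local values.
	host=`hostname`
	printer=lw-$host	# assumed default-printer naming scheme
	sed -e "s/@HOST@/$host/g" -e "s/@PRINTER@/$printer/g" \
		printcap.template > /etc/printcap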

7) Users as abstractions

	*A) Creating user accounts; how to make an add user program.
Unscientific surveys show that almost every site has its own add
user program. There are good reasons for this, which are unlikely to
change soon; what would be really nice is a comparative study of add
user programs, suggesting what one ought to do in order to be secure,
safe, effective, and flexible enough so that it won't have to be
rewritten too often.
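
To fix ideas, the irreducible core of an add user program is only a
few steps; all the interesting parts are the locking, validation, and
site policy this sketch omits (the skeleton directory and default
group are assumptions, and a real program would lock the passwd file,
the way vipw does):

	#!/bin/sh
	# Irreducible core of an add user program: passwd entry, home
	# directory, skeleton dotfiles.  No locking, no validation.
	user=$1; uid=$2
	echo "$user:*:$uid:100:$user:/home/$user:/bin/csh" >> /etc/passwd
	mkdir /home/$user
	cp /usr/local/lib/skel/.??* /home/$user
	chown $uid /home/$user /home/$user/.??*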

	*B) User information beyond the password file. Gecos field or
no gecos field, the password file doesn't hold all of the information
that you want about users. What other information are people using,
and how are they storing it and keeping it in sync?

	*C) Removing user accounts. Creating accounts is comparatively
easy. Removing accounts requires that you clean up all sorts of loose
ends, and doing it from programs exposes you to all sorts of
interesting problems (for instance, the operator who told the account
program to remove an account which had / for a home directory). What
are the technical and political pitfalls, and what can you do about
them?  [Steve Simmons, scs at iti.org]
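
The /-as-home-directory disaster above points at the one check no
removal program should skip; a fragment, with a refusal list that is
obviously incomplete:

	#!/bin/sh
	# Refuse to remove "home directories" that are really system
	# directories.  The case list is illustrative, not exhaustive.
	home=$1
	case "$home" in
	"" | / | /usr | /etc | /home)
		echo "refusing to remove '$home'" 1>&2
		exit 1 ;;
	esac
	rm -rf "$home"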

	*D) Making your users into clones. Users will persist in
having personalities, and in expressing them in their initialization
files. What ways exist to force them into some sort of regularity, and
what are their pros and cons? [One system is being worked on by J
Greely, jgreely at cis.ohio-state.edu]

8) Users as people

	*A) Training users. Answering questions is all very well, but
getting people to where they don't need to ask them is even better,
and leaves you with more free time. How do you do that?  [Bryan
MacDonald, bigmac at erg.sri.com]

	*B) Users as customers vs. users as pond scum. System
administrators are famous for a bad attitude about users (calling them
lusers, for instance), but the users are also the people who pay the
salaries. What attitude should we take towards the users, and how do
we manage to have it and spread it? [Kevin Smallwood,
kcs at houdini.cc.purdue.edu]

	*C) Making users happy without actually fixing anything.
You can't always fix everything, especially if you're hiding from user
lynch mobs. What are the non-technical tricks that allow you to make
the users happier, thereby disassembling the lynch mobs, so that you
can peacefully go about your work?

	*D) Should stupid users get stupid programs? People frequently
complain about the difficulty of common UNIX programs, and want to
replace them with easier ones for users who claim to be, or are
perceived by administrators as being, incapable of using the normal
tools. Is this a good idea? If it is, where do you find such programs?
Can you find programs that are easy to use that also lead into normal
UNIX tools?

9) Training system administrators

	*A) What is the career path for system administrators? Where
do they come from and where do they go? 

	*B) Once you've got new system administrators, what do you do
with them that gives them information without risking damage to your site?

	*C) What resources are out there for people to learn from,
particularly from other fields? (For instance, people have suggested a
reading list including such things as "In Search of Excellence" and
"The Mythical Man-Month".)

10) What do system administrators do?

	*A) Just what is the point of all this? Are we trying to make
existing machines run? Are we trying to provide some level of service?
To whom?

	*B) How are system administrators like and unlike user support
people, system programmers, and so on? 

	*C) How do you explain to managers what system administration
is like; why it can't be managed the same way that research
programming can, and why it is difficult and takes trained people who get
paid real money?

11) Centralization vs. decentralization

	*A) How do you figure out where to make the tradeoff between
the economies of scale and administrative advantages of centralizing
things like disk space and printer service, and the fault tolerance
and individual control of distributing them? 

	*B) What administrative functions must be centrally
controlled, and which ones can be safely handed out? How do you
provide central organization in a group with no center (for instance, trying
to share a network between projects, where somebody has to administer
network addresses, but nobody has authority over anybody else)?

12) Working together

	*A) How do you administer sites that are physically remote?

	*B) As a large site, how can you deal with associated tiny
sites? If you administer 300 machines in one place, and 3 in another,
how do you come up with a system that copes with both? 

	*C) How can you make a confederation of administrators within
an organization, and what can one do for you? [Mark Verber,
verber at pacific.mps.ohio-state.edu] 

13) Are those apples treated with Alar? Motherhood and apple pie
reconsidered.

	*A) Are policy-free tools possible, or even advisable? Is it
really better to give people the ability to make their own stupid
policies easily, or to give them tools that implement intelligent
possibilities with a few degrees of freedom? 

14) When things break

	*A) Hack it, or track it? When you run across a problem that
has a fix, do you apply the fix even if you don't yet understand the
problem, or do you attempt to track it down even if that means leaving
things broken? Obviously, rebooting the machine will fix a lot of
problems, but sometimes it will keep you from figuring out the bug and
reporting it to the manufacturer and getting it fixed forever.

	*B) 24 hour support; do you provide it and if so how? Are
beepers evil? Programmers are famous for working at all hours of the
night, which is all very well for them, but if you have to deal with a
whole bunch of them, they might want help at any hour of the day or
night. Most system administrators want a life of some sort; how do you
get one while keeping the users happy?

	*C) 20 questions to ask users when they report a problem. So a
user calls you up and says "Mail doesn't work." What do you do then?

	*D) You have found a problem. You know how to fix it. How do
you install the fix in such a way that you don't undo it later? What
should you do with the fix besides install it? Tell your vendor? Tell
other people? How?
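
On the not-undoing-it-later front, even plain RCS is a start: check
the fixed file in with a log message saying what it fixes, so the next
upgrade can tell local surgery from vendor defaults. (The file and the
message here are invented.)

	# record a local fix, keeping the working file in place
	ci -l -m"work around lpd hang; reported to vendor" /etc/printcap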

15) Making changes

	*A) How do you help users adjust to changes? You can't run V7
on a PDP-11 forever; at some point, the users are going to have to
change hardware and software platforms. How do you reduce the trauma?

	*B) When is it time to upgrade? Folk wisdom says "never install
an even release", and self-preservation suggests that switching your
entire site over to a beta release is not a good idea. But there is no
release without bugs, and at some point you're going to have to decide
to live with it. When? For that matter, when is the pain and expense
of upgrading your hardware platform outweighed by the pain and expense
of keeping the old one?

	*C) Beating swords into plowshares versus buying tractors.
Most system administrators are virtuosi at the UNIX philosophy of
combining old tools with baling wire and string, which is
cheap in some ways, and gives you that warm glow of accomplishment. On
the other hand, there's a lot to be said for throwing out the old and
doing something new. When do you decide that you should stop trying to
coerce the old operating system (name service, printer system) into
working and design a new one from the ground up? 

	*D) How do you manage to keep local "improvements" and still
be able to change with the rest of the world? So you rewrote the
printer system (or you wrote an adduser program, or you made talk(1) work
on everything). And then you bought 10 new machines, 2 each from 5
different hardware vendors, and all your old vendors released new OS
versions. What makes this not a really good time to become a
carpenter?

	*E) Justifying the expense. It may be obvious to you that life
would be much better if you had more than 4 M of real memory in the
Sun 3s that everyone wants to run SunOS 4.1, OpenWindows, and 5
copies of emacs on, but how do you make it obvious to the people who
spend the money?

16) Electronic communication

	*A) Usenet; how do you control your users and your disk usage
without being (too much of) a fascist? 

	*B) Mail

		*i) Compare and contrast the various methods of
getting all the mail for a site to deliver to one place. Among these
methods: NFS-mount /usr/spool; use aliases or .forward files everywhere
to deliver mail to one machine, or one machine per user; deliver mail
to home directories; automount /usr/spool/username for every user.
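
To illustrate the aliases variant: every client machine's alias file
forwards each user to the hub, so mail delivered anywhere lands in one
spool. The hub name here is invented, and you must run newaliases
after editing:

	# /usr/lib/aliases on every client machine
	zwicky: zwicky@mailhub
	romig: romig@mailhub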

17) Testing

	*A) You install, reinstall, or upgrade a machine. Without
using users as test suites, how do you know it works?

	*B) You have NFS, or NIS, or Kerberos, or X Windows
implementations from many vendors. How do you figure out where they do
and don't work with each other?
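
Mechanically, at least, the cross-product is easy to enumerate; a
sketch that exercises every client/server pairing with a trivial
write-and-compare (the host names, the test export, and unprompted
root rsh are all assumptions):

	#!/bin/sh
	# Try every client/server NFS pairing with a write/read/compare.
	# Assumes each host exports /export/test and has /mnt free.
	hosts="sun3a sun4a vaxa"
	for server in $hosts; do
		for client in $hosts; do
			[ "$client" = "$server" ] && continue
			echo "==== client $client, server $server"
			rsh $client "mount $server:/export/test /mnt && \
				cp /etc/motd /mnt/t.$$ && \
				cmp /etc/motd /mnt/t.$$ && \
				rm /mnt/t.$$; umount /mnt"
		done
	done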

18) Documentation [A & B both worked on by Elizabeth Zwicky,
zwicky at erg.sri.com, and Mark Verber, verber at pacific.mps.ohio-state.edu
jointly] 
	
	*A) What documentation should you produce for your site, aside
from that shipped by vendors? What available documentation is out
there to give to users?

	*B) What tools are there to make producing user documentation easier?

19) Little machines become big problems

	*A) How do you make the PCs talk to the world at large? You
need to connect them to big networks, and provide services to them
somehow. But it's a very uneasy alliance between PC programs designed
for little networks, and protocols designed for big networks. How do
you make the connections work smoothly? (And has anyone ever met a
Mac mail program they actually like? The users liking it doesn't count
if the administrators hate it...)

20) Parcelling out the CPU cycles

	A) How do you let users make use of spare CPU cycles anywhere
in the network, without giving them non-spare CPU cycles? 

		*i) In a workstation environment, users coming in over
a network, instead of logging in at consoles, need to be distributed
easily between machines. How do you do that? How do you mediate
between users coming in over the wire, and users at the console of the
machine? 
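
With rwhod running, ruptime already knows every machine's load
average, so picking the least-loaded host is nearly a one-liner; the
field positions below assume a typical BSD ruptime output format and
will need checking against yours:

	#!/bin/sh
	# Log in to the least-loaded host ruptime reports as up.
	# Assumes lines like "alpha up 9+12:34, 3 users, load 0.61, ..."
	# so field 7 is the one-minute load; $7+0 strips the comma.
	host=`ruptime | awk '$2 == "up" { print $7+0, $1 }' | \
		sort -n | head -1 | awk '{ print $2 }'`
	exec rlogin $host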

		ii) Single jobs may also want to be distributed,
either by *a) using an entire network of machines as an extremely
coarse-grained parallel processor or *b) using a more powerful or less
loaded machine elsewhere on the network. Facilities for doing the
latter exist, certainly; I'm not certain there is even any help for
doing the former. How do the available facilities compare? How do you
assist people in making programs work in these situations?

	*B) How do you deal with batch jobs under UNIX?

21) Watching users

	*A) Users with questions are often unable to adequately
describe what they are doing and what the machine is doing in
response. What facilities are available for connecting to a user's
session so that you can watch and assist?

	*B) How do you keep an eye on suspicious or malicious users?


