Half-done BSD job control (was: Is there a utility better than 'nice'?)

Mon Jan 9 14:46:58 AEST 1989

From article <1559 at ucsfcca.ucsf.edu>, by bxc at cca.ucsf.edu (Barbara Chapman):
> Is there a common utility for preventing large hungry programs from
> interfering with interactive users?  'Nice', even at its maximum, does
> not keep large simulations from making interactive jobs MUCH slower
> than they would otherwise be.

nice(1) only really affects CPU priority, not swap/disk priority.

> Ideally, a utility could stop (as with ^Z) such large jobs whenever
> there was ANY csh or sh that was not idle, and would reawaken them (as
> 'bg' does) when all shells were idle or when no shells were running.
> In that case, the stopped jobs would occupy only swap space,
> eventually, and cause no interference with otherwise moderately loaded
> machines.

Well, there is, but the mechanisms to use it properly aren't in BSD yet.

You can use killpg(2) to SIGSTOP/SIGCONT a process group.

killpg() can be invoked by using /bin/kill with a negative pid or
from csh using kill %n.  
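
As a rough sketch of the SIGSTOP/SIGCONT idea (written in Python purely
for illustration; os.killpg is its wrapper over killpg(2), and the helper
names stop_group, resume_group and demo_stop_resume are made up here),
something along these lines stops and restarts a whole process group:

```python
import os
import signal
import subprocess
import sys

def stop_group(pgid):
    """Send SIGSTOP to every process in the group (killpg semantics)."""
    os.killpg(pgid, signal.SIGSTOP)

def resume_group(pgid):
    """Send SIGCONT to every process in the group."""
    os.killpg(pgid, signal.SIGCONT)

def demo_stop_resume():
    # Stand-in for a "large hungry" job: a child that just sleeps.
    # start_new_session=True makes the child a session and group leader,
    # so its process-group-id equals its pid.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        start_new_session=True,
    )
    pgid = os.getpgid(child.pid)

    stop_group(pgid)
    # WUNTRACED asks waitpid to report a stopped (not just exited) child.
    _, status = os.waitpid(child.pid, os.WUNTRACED)
    was_stopped = os.WIFSTOPPED(status)

    resume_group(pgid)
    # Clean up the demo job; a real scheduler would leave it running.
    os.killpg(pgid, signal.SIGTERM)
    child.wait()
    return was_stopped, child.returncode
```

Deciding when to stop and resume (e.g. whenever some shell is not idle)
is the policy part, and is deliberately left out of the sketch.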

PROBLEM:
    The csh built-in "kill" doesn't allow negative pids to be interpreted
    as pgrps and is thus incompatible with /bin/kill.

PROBLEM:
    It should be possible to renice a job from csh without having
    to know its process-group-id.  i.e.:  %n should be a pgrp-id substitution
    rather than a built-in magic cookie for kill, and thus renice %3 would pass
    a pgrp-id to renice (or any other process such as a user's shell-script)
    as an argument.

PROBLEM:
    ps(1) doesn't print process-group-ids, so finding one takes much
    guesswork.

    sps(1) does print them, but isn't standardly distributed (it -should- be!).

    pstat(8) prints them too, but the user friendliness of pstat leaves
    a lot to be desired (e.g. knowing which process you're after in "pstat -p",
    since the command names and arguments aren't printed, and the display
    is far too wide for a terminal).
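
For illustration only, and assuming a system where a getpgid-style lookup
is available (Python's os.getpgid here; pgrp_table is a made-up helper
name), the pid-to-pgrp listing that ps lacks could be sketched as:

```python
import os

def pgrp_table(pids):
    """Return a small PID/PGID listing for the given pids."""
    lines = ["%7s %7s" % ("PID", "PGID")]
    for pid in pids:
        try:
            lines.append("%7d %7d" % (pid, os.getpgid(pid)))
        except OSError:
            # Process has gone away, or we may not inspect it; skip it.
            continue
    return "\n".join(lines)
```

Calling pgrp_table([os.getpid(), os.getppid()]) lists the caller and its
parent with their process groups.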

PROBLEM:
    Trouble is, typically only jobs started by an INTERACTIVE csh have
    their own process-group-id set.

    I have had to modify /usr/lib/atrun to setpgrp the jobs that it
    starts, so that killpg(2) is effective on at-started jobs.

    /usr/lib/atrun is not the only offender wrt not setting pgrp-ids.  I
    believe most other job starters, such as rsh(1) and on(1), fail to
    set them too.
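
A minimal sketch of what a job starter could do, assuming nothing about
the real atrun source: fork, put the child in a process group of its own,
then exec the job. It uses Python's os.setpgid (the POSIX-style call)
rather than the BSD setpgrp named above, and start_job_in_own_group is a
hypothetical name:

```python
import os

def start_job_in_own_group(argv):
    """Fork and exec argv with the child leading a new process group."""
    pid = os.fork()
    if pid == 0:
        # Child: become leader of a new group whose id is our own pid.
        try:
            os.setpgid(0, 0)
            os.execvp(argv[0], argv)
        finally:
            # Only reached if setpgid or exec failed.
            os._exit(127)
    # Parent: set the group as well, so the pgrp-id is valid as soon as
    # we return, even if the child has not been scheduled yet.
    try:
        os.setpgid(pid, pid)
    except OSError:
        # The child already exec'd or exited; its own setpgid call ran.
        pass
    return pid
```

The returned pid doubles as the pgrp-id, so the caller can later hand it
to killpg to stop or continue the whole job.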

MAJOR PROBLEM:
    The main problem is that you cannot reliably stop a SINGLE process in
    a process group if there are others waiting on the process.
    You have to stop the entire group atomically, using killpg(2), or
    determine the process hierarchy and stop the parents first.

    i.e.:
	a process that does wait(2) on another will get confused
	if the other is SIGSTOP'd, unless the process expects it,
	because wait(2) returns a status to the parent.  This status
	can be interrogated to determine what happened, but most
	processes don't do this (e.g. /bin/sh) and the job will
	typically be munged.
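
To make "interrogating the status" concrete, here is a small sketch using
Python's os.waitpid and its status macros as stand-ins for the C ones; the
parent asks for stop reports with WUNTRACED, and describe_status and
demo_status are illustrative names only:

```python
import os
import signal

def describe_status(status):
    """Say what a wait status means instead of assuming 'child exited'."""
    if os.WIFSTOPPED(status):
        return "stopped by signal %d" % os.WSTOPSIG(status)
    if os.WIFSIGNALED(status):
        return "killed by signal %d" % os.WTERMSIG(status)
    if os.WIFEXITED(status):
        return "exited with code %d" % os.WEXITSTATUS(status)
    return "unknown status"

def demo_status():
    pid = os.fork()
    if pid == 0:
        # Child: stop ourselves, then exit once someone continues us.
        os.kill(os.getpid(), signal.SIGSTOP)
        os._exit(3)
    # First report: the child is stopped, NOT finished.
    _, status = os.waitpid(pid, os.WUNTRACED)
    first = describe_status(status)
    os.kill(pid, signal.SIGCONT)
    # Second report: the real exit.
    _, status = os.waitpid(pid, os.WUNTRACED)
    second = describe_status(status)
    return first, second
```

A parent that skips the WIFSTOPPED check treats the first report as the
child's death, which is the confusion described above.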

    What should happen is that wait() should have more options:

	- a wait for specific-process-id option (useful for popen()'s etc)

	- a wait for a process-exit (the default, V7 compatible wait, except
	  that it shouldn't wait for ptrace/stopped processes either, not
	  that it will really matter since processes ptracing other
	  processes would expect wait() to return such statuses)

	- a wait for a process-state-change (eg: STOPPED or EXITED or PTRACE)
	  (used by csh and ksh, the current functionality of 4.[23]bsd wait)

    If these options were available to wait() and the default was to
    wait() for a process to exit ONLY, then you could STOP/start single
    processes without the side-effects on other processes that exist at
    the moment.
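
The options proposed above can be sketched with a waitpid-style interface.
This is an illustration via Python's os.waitpid, not the 4.[23]bsd wait()
under discussion, and the function names are invented for the sketch:

```python
import os
import signal
import time

def wait_for_exit(pid):
    """Wait for one specific pid, returning only when it terminates."""
    while True:
        # Without WUNTRACED, a stopped child is simply not reported.
        _, status = os.waitpid(pid, 0)
        if os.WIFEXITED(status) or os.WIFSIGNALED(status):
            return status

def wait_for_state_change():
    """Wait for any child to stop or terminate (what csh/ksh want)."""
    return os.waitpid(-1, os.WUNTRACED)

def demo_exit_only_wait():
    pid = os.fork()
    if pid == 0:
        time.sleep(0.2)
        os._exit(7)
    # Somebody else stops and restarts the child behind our back.
    os.kill(pid, signal.SIGSTOP)
    os.kill(pid, signal.SIGCONT)
    # The exit-only wait never sees the stop, only the final exit.
    return os.WEXITSTATUS(wait_for_exit(pid))
```

With an exit-only wait as the default, the parent in demo_exit_only_wait
is unaffected by the stop/continue pair, which is the behaviour being
asked for.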

Ian D