UNIX does *not* fully support asynchronous I/O

Dan Bernstein brnstnd at kramden.acf.nyu.edu
Sat Aug 25 13:18:37 AEST 1990


In article <1990Aug21.223350.7595 at esegue.segue.boston.ma.us> johnl at esegue.segue.boston.ma.us (John R. Levine) writes:
> In article <60345 at lanl.gov> jlg at lanl.gov (Jim Giles) writes:
> >From article <126800008 at .Prime.COM>, by EAF at .Prime.COM:
> >> If your language I/O library is intelligent and you are reading sequential
> >> data, the language library will call on the OS to read the next disk 
> >> block into memory, often before it is required.
> >Not on UNIX it won't.  There is no system call for the library to use ...
  [ John talks about simple caching schemes ]

I'm afraid Jim is right, though he drastically overestimates the effect
of this failure on small machines. Let me explain.

Say a program computes some numbers. Computes them optimally, in fact,
leaving them in an array. Now it wants to write the array to disk.

If the operating system weren't in the way, the program would simply
call upon the disk device to copy the data---through DMA, of course---to
the disk.

Under UNIX, there's at least one big extra step. write(fd,buf,n) first
*copies* the data to a buffer inside the kernel's space. This takes CPU
time. Do you see now what Jim is complaining about?
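
To make that copy visible, here is a minimal sketch in C, assuming a
POSIX system. The function name write_then_scribble is made up for this
illustration. It writes a message into a pipe, overwrites the caller's
buffer, and then reads the pipe back. The original bytes come out,
which is only possible because write() had already copied them into a
kernel buffer.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/*
 * Illustrative helper (hypothetical name): write msg into a pipe,
 * scribble over the user buffer, then read the pipe back into out.
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t write_then_scribble(const char *msg, char *out, size_t outsize)
{
    char buf[256];
    size_t len = strlen(msg);
    int fds[2];
    ssize_t got;

    /* Keep the message small so one write() to an empty pipe cannot block. */
    assert(len > 0 && len <= sizeof buf && len <= outsize);
    memcpy(buf, msg, len);

    if (pipe(fds) == -1)
        return -1;

    /* Copy 1: user buffer -> kernel pipe buffer. */
    if (write(fds[1], buf, len) != (ssize_t)len) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* The kernel holds its own copy now, so this cannot change the data. */
    memset(buf, 'X', len);

    /* Copy 2: kernel pipe buffer -> user buffer. */
    got = read(fds[0], out, outsize);

    close(fds[0]);
    close(fds[1]);
    return got;
}
```

Calling write_then_scribble("hello", out, sizeof out) should leave
"hello" in out, not "XXXXX". The two commented copies are the CPU work
under discussion.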

Of course, on most machines a disk transfer is much slower than a
memory-to-memory copy by the CPU, so once you've gotten rid of the disk
seek by caching, the extra copy is lost in the noise and any further
asynchronicity is silly. But Jim works with very fast disks, and a lot
of them at once, so for him the copy is a real cost.

mmap() is a partial solution: it does its job well and gets rid of the
extra step, but doesn't fit into the ``UNIX model'' as well as it could.
How do you use mmap() on a pipe, for example? If two programs are
communicating via a pipe, they should be able to write data and read it
with *zero* copies in the middle. Under standard UNIX, there are two
extra copies at least: one for read() and one for write().
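
As a rough illustration of the mmap() route for a regular file, here is
a sketch in C assuming the modern POSIX interface (open, ftruncate,
mmap with MAP_SHARED, msync). The function name store_via_mmap and the
values it computes are invented for the example. The array is computed
directly in the mapped pages, so no write() call and no user-to-kernel
copy of the array is involved.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Illustrative helper (hypothetical name): store n doubles, a[i] = i * 0.5,
 * in the file at path by computing them in a shared mapping.
 * Returns 0 on success, -1 on error.
 */
int store_via_mmap(const char *path, size_t n)
{
    size_t bytes = n * sizeof(double);
    double *a;
    size_t i;
    int fd, rc;

    assert(n > 0);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;

    /* The file must be at least as large as the mapping before use. */
    if (ftruncate(fd, (off_t)bytes) == -1) {
        close(fd);
        return -1;
    }

    a = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    if (a == MAP_FAILED)
        return -1;

    for (i = 0; i < n; i++)
        a[i] = (double)i * 0.5;     /* compute in place, in the file's pages */

    rc = msync(a, bytes, MS_SYNC);  /* ask for the pages to reach the disk */
    if (munmap(a, bytes) == -1)
        rc = -1;
    return rc;
}
```

This works because the descriptor names a regular file. Handing mmap()
a pipe descriptor instead typically just fails (ENODEV on Linux), which
is the gap described above.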

I've proposed a solution: make a call analogous to writev() that
transfers straight from the caller's iovecs, without copying them into
the kernel first, and returns before the transfer is finished.
Introduce another call that says whether a particular iovec has been
written yet. Also introduce a way to wait on this status, similar to
select(). Similarly for reading.
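
The calling pattern might look something like the following C sketch.
Every name in it (struct zw_req, zw_submit, zw_written, zw_wait) is
hypothetical; no such calls exist in UNIX. It is also only a user-space
emulation: the transfer here happens inside zw_wait() through ordinary
write(), so it is neither asynchronous nor zero-copy. It shows the
shape of the three calls, nothing more.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical request: the buffers in iov[] are used in place, not copied. */
struct zw_req {
    int fd;
    const struct iovec *iov;
    int n;
    int done;                   /* number of iovecs fully written so far */
};

/* Hypothetical submit call: record the iovecs and return at once. */
void zw_submit(struct zw_req *r, int fd, const struct iovec *iov, int n)
{
    assert(n > 0);
    r->fd = fd;
    r->iov = iov;
    r->n = n;
    r->done = 0;
}

/* Hypothetical status call: has iovec i been written yet? */
int zw_written(const struct zw_req *r, int i)
{
    assert(i >= 0 && i < r->n);
    return i < r->done;
}

/*
 * Hypothetical wait call: return once one more iovec has been written.
 * Emulation only: the write is performed here, with plain write().
 * Returns the number of completed iovecs, or -1 on error.
 */
int zw_wait(struct zw_req *r)
{
    const struct iovec *v;
    size_t off = 0;

    if (r->done == r->n)
        return r->done;

    v = &r->iov[r->done];
    while (off < v->iov_len) {
        ssize_t k = write(r->fd, (const char *)v->iov_base + off,
                          v->iov_len - off);
        if (k == -1)
            return -1;
        off += (size_t)k;
    }
    return ++r->done;
}
```

The point of the per-iovec status is that the caller may reuse a buffer
as soon as zw_written() reports it done, without waiting for the whole
request.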

---Dan


