SCSI & IPI rates

Glenn Herteg iapsd!hopi!glenn at uunet.uu.net
Tue May 29 19:47:54 AEST 1990


lm at sun.eng.sun.com (Larry McVoy) writes:
> Those drives have smart controllers and I believe they have zero
> latency write ability so they don't have the problem of blowing
> revs in between each write.

It seems to me that, for most file uses, you don't *want* zero-latency
writes.  You'd like the performance, of course, but the downside is
increased risk of damage should the machine crash.  As I recall, adding
"ordered writes" was a Feature of the first System V release, intended to
make the file system more robust against a crash in the middle of a
write.  Early UNIX file systems often had many, many problems dredged up
by fsck after a crash; these days they're fairly rare, and I think this
forced ordering has a lot to do with it.  Databases, in particular, need
some kind of write-ordering semantics to guarantee proper recording of
transactions (isn't this what they call two-phase commit?).  There was a
Bell System Technical Journal article some years ago that discussed this
issue and its relationship to UNIX file systems.
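
To make the point concrete, here's a minimal sketch of the ordering a
database-style application wants -- plain C, using only write() and
fsync().  The log record must be known to be on stable storage before the
data it describes is touched; the file names and record contents are made
up for illustration.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /*
     * Write-ahead ordering sketch: force the log record to stable
     * storage (fsync) *before* modifying the data file, so a crash
     * in between leaves a record we can replay or roll back.
     */
    int
    main(void)
    {
        const char logrec[]  = "begin txn 42: credit acct 7 by 100\n";
        const char newdata[] = "acct 7 balance = 600\n";
        int logfd, datafd;

        if ((logfd = open("journal.log", O_WRONLY|O_APPEND|O_CREAT, 0644)) < 0)
            { perror("open journal.log"); return 1; }
        if (write(logfd, logrec, sizeof logrec - 1) < 0)
            { perror("write log"); return 1; }
        if (fsync(logfd) < 0)           /* log must be on disk ... */
            { perror("fsync log"); return 1; }

        if ((datafd = open("accounts.dat", O_WRONLY|O_CREAT, 0644)) < 0)
            { perror("open accounts.dat"); return 1; }
        if (write(datafd, newdata, sizeof newdata - 1) < 0)
            { perror("write data"); return 1; }
        if (fsync(datafd) < 0)          /* ... before the data it covers */
            { perror("fsync data"); return 1; }

        return 0;
    }

Of course, this only buys you anything if fsync() really means the data
is on the media -- which is exactly the question I raise below.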

> It is a requirement for correct operation that the data be on the
> server's drive before the server says OK to the client.  If this were
> not so then you would be in serious trouble each time a server crashed.

Exactly my point, but it doesn't just apply to NFS file systems.  Frankly,
given the number of bugs in UNIX software (yes, even SunOS 4.1 has its
share [*]), system crashes (or hangs, with user-forced reboots) are still
rather too common to ignore this issue of file system repair.  It's
certainly no fun poring through a massive fsck output listing and
wondering how much ancillary damage you might incur by repairing the
broken file system data structures in the wrong order, especially when
you can't see the damage in enough detail to understand what *really*
went wrong (and which files you'll need to restore from backup).  And if
you're not already a UNIX guru at the moment the machine crashes, you
don't stand a chance of deciphering all that gobbledygook about inodes
anyway.

I suppose that, with file system interfaces becoming more flexible, you
might eventually be able to substitute your own kind of file system in a
particular disk partition for recording bulk data (say, from a fast A/D
converter), not caring about recovering such data in the event of a
crash.  Then you'd want ioctl()s (or mount options) telling the device
driver when to perform zero-latency writes and when not to, on a
per-partition basis.  Easy in theory, but you wouldn't want insufficient
testing of such a capability to endanger the rest of your file systems.
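
Purely as a sketch of the interface I'm imagining -- nothing here exists;
the ioctl name, request code, and device path are all invented -- it
might look like this from a user program:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>

    /*
     * HYPOTHETICAL interface -- no driver implements this ioctl.
     * The idea: a per-partition flag telling the disk driver whether
     * zero-latency (unordered) writes are acceptable on this device.
     */
    #define DKIOCSETZLW  _IOW('d', 200, int)    /* invented request code */

    int
    main(void)
    {
        int enable = 1;         /* 1 = allow zero-latency writes */
        int fd = open("/dev/rsd0g", O_RDWR);    /* a bulk-data partition */

        if (fd < 0)
            { perror("open"); return 1; }
        if (ioctl(fd, DKIOCSETZLW, &enable) < 0)
            { perror("ioctl"); return 1; }      /* today: rejected */
        return 0;
    }

A mount option would be cleaner for the whole-partition case; the ioctl
form would let a privileged program flip the policy on the fly.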

I'm curious about a related aspect of hard disk drives and device
drivers, especially for SCSI devices, where you have an extra
microcomputer embedded on the drive, interposed between you and your
media.  When fsync() or msync() returns, has the data merely been written
over the SCSI bus into the drive's cache, or has it actually been written
to the media?  Does this depend on the drive manufacturer, or is there
some standard SCSI command used on *all* drives that probes for this kind
of command completion?
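
I can partially answer my own question from the SCSI-2 draft standard: it
defines a SYNCHRONIZE CACHE command (operation code 0x35) that tells the
drive to force cached blocks out to the media.  Whether any given driver
actually issues it on fsync() is another matter, and older SCSI-1 drives
won't have it at all.  Here's the 10-byte command descriptor block,
annotated; actually sending it requires whatever pass-through your host
adapter driver offers, which I've left out:

    #include <stdio.h>

    /*
     * SCSI-2 SYNCHRONIZE CACHE command descriptor block (10 bytes).
     * With the logical block address and block count both zero, the
     * drive is asked to flush its entire cache to the media.
     */
    static unsigned char sync_cache_cdb[10] = {
        0x35,                       /* operation code */
        0x00,                       /* LUN, Immed, RelAdr bits */
        0x00, 0x00, 0x00, 0x00,     /* logical block address */
        0x00,                       /* reserved */
        0x00, 0x00,                 /* number of blocks (0 = to end) */
        0x00                        /* control byte */
    };

    int
    main(void)
    {
        int i;

        printf("SYNCHRONIZE CACHE CDB:");
        for (i = 0; i < 10; i++)
            printf(" %02x", sync_cache_cdb[i]);
        printf("\n");
        return 0;
    }

That still leaves the question of what each vendor's driver does with
fsync() today, which is really what I'm asking.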

[*] Sun is already reporting patches to SunOS 4.1 -- including one for a
problem in which file system blocks show up in files to which they do not
belong!  Just when will 4.1.1 be out? :-)


