Raw vs. block device. I'm confused.

Fri Jan 6 10:31:48 AEST 1984

From:  Steve Dyer <sdyer at bbn-unix>

Reads and writes on a disk block device go through the kernel's buffer
cache.  That is, data transfers occur between the user's address space and
the buffers in the buffer cache, so there may be no immediate I/O at all
(i.e. on a read the buffer might already be present in the cache, and on a
write the actual I/O request would be enqueued, but not yet performed).
Note that when the number of bytes to be transferred is greater than the
UNIX system's buffer size, BSIZE (usually 512 or 1024), the single request
given by the user program must be broken up into multiple requests, each
filling at most one system buffer.
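
To make that splitting concrete, here is a rough sketch in C.  It is not
the kernel's actual code: chunks_needed() is a made-up name, and BSIZE is
assumed to be 512.  It just counts how many buffer-sized pieces one
transfer turns into on the block device.

```c
#define BSIZE 512L   /* assumed buffer size; could as well be 1024 */

/*
 * Count the buffer-sized pieces needed to move `count` bytes starting
 * at byte `offset`.  Each pass through the loop stands for one trip
 * through the buffer cache.  (Hypothetical helper, for illustration.)
 */
long chunks_needed(long offset, long count)
{
    long n = 0;

    while (count > 0) {
        long in_block = offset % BSIZE;   /* position inside this buffer */
        long piece = BSIZE - in_block;    /* room left in this buffer */

        if (piece > count)
            piece = count;                /* last piece may be short */
        offset += piece;
        count -= piece;
        n++;
    }
    return n;
}
```

With these assumptions, a 2048-byte read starting at offset 0 takes four
pieces, and a 1024-byte read starting at offset 100 takes three (412, 512
and 100 bytes).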

"Raw" disk I/O occurs directly between the user program and the hardware
device, bypassing any buffering.  Raw I/O is faster than "cooked" I/O for
two reasons: first, since data is DMA'ed directly into the user's address
space, one avoids the CPU overhead of having to copy bytes to/from an
intermediate buffer.  More importantly, when performing disk operations
like "?check", "fsck" or a disk-to-disk copy, all of which need to read
multiple contiguous physical blocks, it is often possible (depending on the
controller) to read multiple sectors in a single DMA operation.  The same
I/O request on the block device would have to be split into several
operations, almost certainly losing revolutions between successive
requests.

Adb'ing the raw disk device doesn't work because of physio(), the mediator
of raw "dma-type" requests.  Physio() hands to the disk device strategy
routine the "block number" of the request.  The block number is derived
quite simply as u.u_offset>>BSHIFT, where u.u_offset is the current "lseek"
position of the open raw device file and BSHIFT is log2(BSIZE).  Thus, all
RAW I/O operations must occur on a BSIZE boundary.  (Not only MUST, but DO!
It's quite surprising the first time you attempt raw I/O on a non-BSIZE
boundary and find that you've trashed the beginning of the block!)
Adb, like most UNIX programs, simply lseeks to the desired spot and
starts writing.
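
The arithmetic is easy to try out.  The sketch below assumes BSHIFT is 9
(BSIZE of 512); raw_block() and raw_landing() are names invented here, not
kernel routines.  They only reproduce the shift described above and show
where a transfer really ends up.

```c
#define BSHIFT 9               /* assumed: log2(BSIZE) with BSIZE = 512 */

/* Block number derived from the seek position, as described above. */
long raw_block(long offset)
{
    return offset >> BSHIFT;   /* the low BSHIFT bits are simply dropped */
}

/* Byte position on the disk where the transfer actually starts. */
long raw_landing(long offset)
{
    return raw_block(offset) << BSHIFT;
}
```

Under these assumptions, seeking to byte 700 and writing gives block 1, so
the data lands at byte 512, the beginning of that block, rather than at
700.  An offset that is already a multiple of 512 is unchanged.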

Think about it.  The primitive writable object on the surface of a disk is
a sector, which is usually 512 bytes.  To write on a disk device at other
than a sector boundary would require reading the old sector into memory,
modifying it, and writing it out again, something the raw device cannot
do, but which the block device handles quite well, since its higher levels
have already taken care of that.  Now, you might ask why physio() truncates
at BSIZE rather than SECTORSIZE (they have not been one and the same since
V7).  I suspect it's merely a convenience, saving an extra manifest
constant that would have to be kept in step with reality.
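
For what it's worth, the read-modify-write cycle mentioned above looks
roughly like this.  It is only a sketch: the "disk" is an array in memory,
the sector size is assumed to be 512, and read_sector(), write_sector() and
rmw_write() are all hypothetical names.  The point is that the only
primitives are whole-sector transfers, so a partial write has to fetch the
old sector first.

```c
#include <stddef.h>
#include <string.h>

#define SECTOR 512L                /* assumed sector size */
#define NSECT  4L                  /* a very small pretend disk */

char disk[NSECT * SECTOR];         /* stand-in for the disk surface */

/* The only operations a disk offers: whole-sector transfers. */
void read_sector(long sn, char *buf)
{
    memcpy(buf, disk + sn * SECTOR, (size_t)SECTOR);
}

void write_sector(long sn, const char *buf)
{
    memcpy(disk + sn * SECTOR, buf, (size_t)SECTOR);
}

/*
 * Write `len` bytes at an arbitrary byte offset, staying inside one
 * sector.  Returns 0 on success, -1 if the range is off the disk or
 * would cross a sector boundary (kept out of the sketch for brevity).
 */
int rmw_write(long offset, const char *data, long len)
{
    char buf[SECTOR];
    long sn, in;

    if (offset < 0 || len < 0)
        return -1;
    sn = offset / SECTOR;          /* which sector */
    in = offset % SECTOR;          /* where inside it */
    if (sn >= NSECT || in + len > SECTOR)
        return -1;

    read_sector(sn, buf);                  /* read the old sector */
    memcpy(buf + in, data, (size_t)len);   /* modify part of it */
    write_sector(sn, buf);                 /* write it all back */
    return 0;
}
```

Calling rmw_write(700, "XYZ", 3) changes bytes 700 through 702 and leaves
the rest of the sector, including its beginning at byte 512, as it was.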

/Steve Dyer
sdyer at bbncca
decvax!bbncca!sdyer