Summary - rsize, wsize

Monty Mullig monty at delphi.bsd.uchicago.edu
Wed Jun 14 08:40:39 AEST 1989


The following is a summary of the responses I received to my posting about
rsize and wsize for NFS-mounted partitions.  The first entry is a summary of
my original posting.  Thanks to those who responded.

--monty

Summary of results:

  A wc and a cp were run on a 9.5MB file, with all activity on this file
  taking place on the test partition.

  trial 1: read/write sizes using default (8k)

	fstab entry for /u1 partition:
		delphi:/u1 /u1 nfs rw 0 0

	average cp: 1:33.6 (93.6s)
	average wc: 1:45.6 (105.6s)

  trial 2: read/write sizes of 2048, timeo=100

	fstab entry for /u1 partition:
		delphi:/u1 /u1 nfs rw,rsize=2048,wsize=2048,timeo=100 0 0

	average cp: 4:50.3 (290.3s) +210.2% over defaults ave
	average wc: 2:09.3 (129.3s) + 22.4% over defaults ave

  trial 3: read/write sizes of 1024, timeo=100

	fstab entry for /u1 partition:
		delphi:/u1 /u1 nfs rw,rsize=1024,wsize=1024,timeo=100 0 0

	average cp: 1:48.6 (108.6s) + 16.0% over defaults ave
	average wc: 1:45.0 (105.0s) -  0.6% over defaults ave
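
  For reference, a sketch of how such timings can be reproduced (the exact
  commands and file names used in the trials are not shown above, so these
  are illustrative only; remount /u1 with the options of each trial first):

	cd /u1/test
	/bin/time wc bigfile
	/bin/time cp bigfile bigfile.copy
	rm bigfile.copy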

>----------------------------------------------<

  Date: Thu, 25 May 89 22:28:46 EDT
  From: dan at flash.bellcore.com (Daniel Strick)

  The default rsize/wsize is 8k.  Reducing these parameters to 2k or 1k
  was originally recommended to preserve the functionality of old
  ethernet interfaces with only 2k of buffer space (beyond which packets
  must be dropped).  It turns out that in addition to the limitation
  in the old interfaces, there are kernel buffer resources
  that can be exceeded (this usually happens when a fast machine
  blasts away at a slower one), and therefore the rsize/wsize
  reduction recommendation is periodically repeated even
  though the old ethernet interface is history.

  If the destination of the nfs data is not overrun, the
  default 8k rsize/wsize should be marginally most
  efficient.  This is reflected by your 8k and 1k tests.
  I don't know what happened during your 2k tests (you win
  a cigar).  Perhaps the maximum ethernet packet size of
  roughly 1500 bytes is relevant.
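
  (For rough numbers, assuming a 1500-byte Ethernet MTU and about 1480
  data bytes per IP fragment once the 20-byte IP header is subtracted:

	8192 bytes + RPC/UDP headers  ->  about 6 fragments
	2048 bytes + RPC/UDP headers  ->  2 fragments
	1024 bytes + RPC/UDP headers  ->  1 packet, unfragmented

  so an 8k transfer is one burst of six back-to-back fragments, while a
  2k mount needs four times as many RPCs to move the same data.)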


>-----------------------------------------<

  Date: Thu, 25 May 89 22:28:40 EDT
  From: hedrick at geneva.rutgers.edu

  There's no reason to decrease rsize and wsize between Sun 3's and 4's
  on the same Ethernet.  Rsize and wsize are a hack, for use only with
  Ethernet controllers that don't have enough buffering to receive 6
  back to back packets.  The 3Com Ethernet cards used on most Sun 2's
  have this problem.  So you want to reduce wsize on a 3 that has
  mounted a 2 or rsize on a 2 that has mounted a 3.  Some gateways or
  bridges have trouble with large numbers of back to back packets also.
  This may be load-dependent.  The newest cisco hardware works fine with
  default settings, as they are now using controller cards with lots of
  on-board buffering.  Older cisco gateways (particularly those using
  3Com controller cards, but sometimes the Interlan cards have trouble
  also) need reduced rsize and wsize.  I assume the same may be true of
  other vendors.  Finally, if you have a link that tends to lose packets
  (e.g. a noisy serial line), reducing the sizes could help too.  If you
  lose one packet you have to resend the whole bunch, so reducing the
  size of the bunch could help.  But you'd need very high error rates
  before you'd see this.  If you don't have one of these special
  situations where you need a smaller size, then the defaults do better,
  since they decrease the RPC processing overhead needed to handle a
  given amount of data.
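
  (For illustration, fstab entries for the two cases above; the host and
  path names here are made up, and 2048 is just an example value.  On a
  Sun-3 client that mounts a Sun-2 server, reduce wsize:

	sun2srv:/export /export nfs rw,wsize=2048 0 0

  and on a Sun-2 client that mounts a Sun-3 server, reduce rsize:

	sun3srv:/export /export nfs rw,rsize=2048 0 0

  Both values are in bytes.)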

  Your test had a server that was faster than the client.  If you had
  the reverse, e.g. a Sun 4 client and a Sun 3 server, when the client
  writes data to the server, the server may get overrun.  Generally we
  suggest reducing the number of biod's rather than using rsize and
  wsize, but if you needed to throttle just one particular mount, wsize
  might be the way to do it.  We've never seen trouble due to the server
  being faster than the client.
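
  (A sketch of what reducing the number of biod's looks like in practice;
  the details are release-dependent, but on SunOS the block I/O daemons
  are normally started from /etc/rc.local, so the change is along the
  lines of editing

	biod 4

  down to

	biod 2

  and rebooting, or killing the extra biod processes by hand.)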

>-------------------------------------<

  Date: Fri, 26 May 89 11:05:19 EDT
  From: jas at proteon.com (John A. Shriver)

  The default rsize and wsize are 8192.

  The problem is that NFS sends one giant UDP packet of wsize, and lets
  IP fragmentation make it small enough to go across the Ethernet.  For
  8192, that's six packets.  These packets are sent as a *very* fast
  burst.  If any of the fragments gets lost, they are all useless
  because of the IP unique ID.  This message explains:

     Date: Sat, 28 Dec 85 19:00:04 est
     From: Larry Allen <apollo!lwa at uw-beaver.arpa>
     Subject: ip fragmentation follies

     I've been playing with IP fragmentation/reassembly and have discovered a
     major crock in the Berkeley way of doing things.  This may have been
     noticed by someone before, but I hadn't really thought about it.

     What caused me to notice this was claims by some people (namely Sun)
     that using very large IP packets and using IP-level fragmentation makes
     protocols like NFS run faster.  This makes some sense (less
     context-switching, etc), so we decided to try it.  We quickly noticed a
     problem, though: if a fragmented packet has to be retransmitted (eg
     because one of the fragments was dropped somewhere) the fragments of the
     retransmitted packet are not and can not be merged with those of the
     original packet!  Why?  Because the Berkeley code has no notion of
     IP-level retransmission, and hence assigns a new IP-level packet
     identifier to each and every IP packet it transmits!  And since the
     IP-level identifier is the only way the receiver can tell whether two
     fragments belong to the same packet, this means that the fragments of a
     retransmitted packet can never be combined with those of the original.

     What all this means in practice is this: for a fragmented IP packet to
     get through to its receiver, all the fragments resulting from a single
     transmission of that packet must get through.  If a single fragment is
     lost, all the other fragments resulting from that transmission of the
     packet are useless and will never be recombined with fragments from past
     or future transmissions of the same packet.

     This all explains (or at least partially explains) why people running
     4.2 TCP connections across the Arpanet using 1024-byte packets were
     losing so badly.  If the probability of fragment lossage is
     even moderately high, it will often take three or more tries to get a
     fragmented packet across the net.  Meanwhile, of course, the useless
     fragments from previous transmissions are sitting on reassembly queues
     in the receiver (for 15 seconds, I think?), tying up buffering resources
     and increasing the chances that fragments will be dropped in the future!

     In the current Berkeley code, it's possible to imagine workarounds for
     this problem for TCP: because TCP is in the kernel, it could have a side
     hook into the IP layer to tell it "this packet is a retransmission,
     don't give it a new IP identifier". For protocols like UDP, however, the
     acknowledgment and retransmission functions are done outside of the
     kernel, and the only kernel interface that's available is Berkeley's
     socket calls (sendto, recvfrom, etc).  Needless to say, the socket
     interface gives you 1) no way to find out what IP identifier a packet
     was sent with; 2) No way to specify the IP identifier to use on an
     outgoing packet.

     I don't really have any idea what to do about this problem.  And, it's
     not entirely Berkeley's fault; the BBN TCP/IP for 4.1bsd did the same
     thing...  In any case, until there's a fix I don't think using IP
     fragmentation/reassembly when talking to 4.2bsd systems is a very good
     idea. 

     						-Larry

  Well, the important thing is that this only matters when packets are
  being lost.  The least likely time for that to happen is on an idle
  network at midnight.  The net has to be busy.  Also, the problem arises
  when receiving data on the slow host (the 3/50 in your case).  Try reading
  two large files, from two different file servers, at the same time,
  with your 3/50.  That will start causing it to lose packets.
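
  (A sketch of such a test, with made-up mount points and file names;
  both reads run in the background and 'wait' holds the shell until they
  finish:

	/bin/time cat /server1fs/bigfile1 > /dev/null &
	/bin/time cat /server2fs/bigfile2 > /dev/null &
	wait

  Each cat should pull its file over NFS from a different server.)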

  For the files the 3/50 mounts, you may only need to set rsize lower;
  the wsize may be fine.
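
  For example (leaving wsize at the 8k default):

	delphi:/u1 /u1 nfs rw,rsize=1024 0 0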

  The case is not that smaller rsize/wsize improves performance.  The
  case is that if you are losing enough packets to blow performance to
  hell, lowering rsize/wsize will save your ass.

  Much of this should be greatly improved when SunOS 4.1 comes out, with
  adaptive retransmission in NFS.

>----------------------------------------------<

  Date: Fri, 26 May 1989 11:25-EDT 
  From: David.Maynard at K.GP.CS.CMU.EDU

  About 2 years ago I did some fairly extensive benchmarks on the rsize,
  wsize, and timeo options.  In addition to having machines of different
  speeds (Sun-2/120 vs. Sun-3/160), I had to deal with LANbridges and IP
  routers on a heavily loaded network.  About 6 months ago I did some
  more limited tests using a Sun-3/50 instead of the Sun-2 on a similarly
  convoluted network.  These tests were done under 3.X so things could be
  very different under 4.0.  In addition, the client machines had local
  disks so I was not affected by page/swap traffic that might change your
  results.

  First to answer your question, the default maximum transfer size is
  8192 unless the server is a Sun-2 with the 3Com ethernet board.  This
  corresponds to the page size on most of the newer Suns so you only need
  one transfer to get a whole page.  

  In most cases, the default rsize and wsize settings should work well.
  Problems generally arise if your combination of hardware and loading
  prevent one of the machines from handling a fairly steady stream of
  large packets.  Two possible sources of such problems are: 1) speed
  differences between the client and the server, and 2) limitations in
  the network itself.

  If the server machine is much faster than the client, then what the
  server considers a steady stream of packets may be an unmanageable
  flood to the client.  With Sun-2's this could be a real problem.  I've
  also heard of people having similar problems between Sun-3's and
  Sun-4's.  In this case, the load on the client plays a major role in
  how bad the problem is.

  The second source of problems is limitations in the network itself.  In
  Sun-2's with the 3Com controller, the network interface doesn't deal
  well with packets longer than 4K.  If your network has IP routers or
  bridges, these network links can greatly limit your ability to transfer
  streams of large packets.  Some IP routers are especially notorious for
  dropping things under heavy loads.

  The key to minimizing these problems for NFS is limiting the overhead
  of having large numbers of small packets while reducing the number of
  retransmissions due to dropped or late packets.  

  To get a feel for how your network behaves, try using 'spray' with
  various packet sizes.  It isn't as accurate as NFS tests, but is easier
  to do while others are working.  Be sure to spray both from client to
  server and from server to client.  By comparing the percentage of
  packets dropped in the two directions you can get an idea of how CPU
  speed differences might affect NFS (although only roughly since spray
  represents the extreme case of streaming packets).  Then, look at the
  bandwidth numbers for the different sizes.  Bandwidth should increase
  as packet size increases (reduced overhead).  This is why you want
  rsize and wsize to be as large as possible.  However, the number of
  dropped packets also tends to increase with size.  Unlike spray, NFS
  has to retransmit dropped packets, so dropped packets can greatly
  reduce NFS performance.  If your network has routers, you will also
  probably notice a drop-off point where performance degrades rapidly for
  larger packets.
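
  (A sketch of such a spray run; the counts and lengths are arbitrary,
  and the same pair should then be run on delphi pointing back at the
  3/50 so the two directions can be compared:

	spray -c 1000 -l 512  delphi
	spray -c 1000 -l 2048 delphi

  spray reports how many packets the receiver dropped and the resulting
  throughput.)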

  Once you have an idea of how the network behaves, start doing NFS tests
  with 'cp,' 'wc,' or your favorite command.  Adjust rsize and wsize from
  8192 down to 1024.  Also, adjust the timeo option from 7 (the default)
  up to 20 or so.  For each test look at the elapsed time for the
  commands AND the statistics reported by 'nfsstat' on the client.
  (Remember to zero the nfsstat statistics between tests.)  The 'Client
  rpc' data reported by nfsstat will tell you how many (if any) of the
  calls timed out (i.e., were dropped or were too late).  You want to
  keep the number of retransmissions low to get the best performance.
  One way of reducing retransmissions is to increase the timeo option.
  However, increasing the timeout introduces a delay before dropped
  packets are retried.  With timeo=100, it will be 10 seconds before a
  dropped packet is retried!  This delay can really, really hurt NFS
  performance.  Even on a bad network I have found that limiting timeo
  to 10 or less gives me the best overall performance.  On the other
  hand, that extra 3/10 of a second (timeo is in tenths of a second, so
  10 vs. the default of 7) greatly reduces the number of timeouts for
  our particular network.
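
  (A per-trial test cycle might look like this; the mount options are
  examples, the command syntax may differ slightly between SunOS
  releases, and nfsstat -z needs root:

	umount /u1
	mount -o rw,rsize=2048,wsize=2048,timeo=10 delphi:/u1 /u1
	nfsstat -z
	/bin/time cp /u1/test/bigfile /u1/test/bigfile.copy
	nfsstat -c

  The last command shows the 'Client rpc' calls, retrans, and timeout
  counters for just that trial.)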

  To comment on your specific results, I would suspect that either you
  don't have a problem and you should just use the defaults, or that your
  tests were skewed by the large timeo values.  One quick way to tell is
  to look at the nfsstat results on a client that has been running for
  awhile under normal load.  If the percentage of client rpc calls that
  have timed out is greater than 1/2% of the total, then you should
  probably do some more rsize and wsize tests.  Because of our heavily
  loaded network and routers, I get the best performance when around 1%
  of the packets time out.
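
  (A rough way to compute that percentage; the timeout counter is assumed
  here to be the fifth field of the numbers line under 'Client rpc', so
  check the header nfsstat -c prints on your system and adjust $5 if the
  layout differs:

	nfsstat -c | awk '/Client rpc/ { getline; getline;
		printf "%d of %d calls timed out (%.2f%%)\n", $5, $1, 100*$5/$1 }'

  Anything much over 1/2% of the total suggests more rsize/wsize tests.)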

  I hope you don't mind the long explanation.  I guess it might be more
  appropriate for Sun-Spots where it might help someone who isn't
  familiar with the background and hasn't already done a lot of tests.
  Anyway, I hope it helps.


