ANSI memmove(), optimized for GnuC/68010

Tue Sep 26 11:05:28 AEST 1989

/* 
  The style of this file has been inspired by several references I have
encountered in recent months to what has been called, I suspect
unfortunately, "literate programming".  One article was in JACM by D. E.
Knuth.  The general idea is to produce files which are both an English
language document and compilable source files.  Mr. Knuth is working on
some specialized tools with an eye toward research papers and
educational expositions.  I like the general idea a great deal and
thought I would try it out on the less formal vehicle of a UseNet
article, without using special tools.

Karl D. Botts  5480 Hyde Park Blvd.  Chicago IL, 60615
  {gargoyle|mcdchg|att}!chinet!kdb
*/

/*
  The result of compiling this file is an ANSI C memmove(), for systems
that don't have memmove() and have a memcpy() that doesn't handle
overlaps, i.e., SysV.2 ported by Convergent Technologies, as found on my
AT&T 3b1.  It is specially optimized for GnuC and an mc68k, as described
in detail in the internal comments.  It compiles with PCC but the result is
not as fast.  The internal comments also describe some peculiarities of the
Gnu C optimizer discovered while tinkering with this code.

  All cases are handled, including any of the "null copy operations" --
dest == src || n == 0.  Note that in the first case the block will indeed
get copied over itself, but this can do nothing except burn cpu unless, I
suppose, something in the block is volatile.

  Note also that n is expected to be unsigned; if size_t is signed (as it
is by default on my SysV) and you pass a negative number to this routine it
will interpret the number as unsigned and proceed.  This will cause a core
dump but I believe it is correct.  

  I wound up doing a bunch of timing tests on this code; some results and
some remarks are appended to this file.  
/*

/*
  By the way, I have seen usenet code -- pcomm for one -- which does

  #define memmove memcpy

when configured for the 3b1 and comments that memcpy() handles overlaps
and so can be used as memmove().  This is variant (as is stated by
memory(3C)) and is not true on my system (as determined the hard way),
although pcomm seems to work anyhow.  I don't use it regularly.
*/

#include <stdio.h>
/* The goal here is to get a typedef name "size_t" defined. */
#if __STDC__
# include <stddef.h>
#else
#if defined(SYSV) && defined(PCC)
# include <unistd.h>/* defines size_t as int; see below */
#endif
#endif

/* The goal here is to find a declaration or prototype for
memmove(), in the hope of generating an error message if it doesn't
match the definition in this file.  The main thing to worry about is
whether the third arg is signed or unsigned.  */
#ifdef KDB
# include <kdb/stdlib.h>
#else
#if __STDC__
# include <stdlib.h>
# include <memory.h>
#endif
#endif

#if __STDC__ || defined(PROTOS)
void *memmove(void *d, void *s, register size_t n)
#else
void *memmove(d, s, n)
void *d;
void *s;
register size_t n;
#endif
{
   register long *dl = (long *)d, *sl = (long *)s;
   int nstash;

  /* The rule for handling overlap is: Always start copying from the end
  of the source toward which the destination is offset.  If the offset is
  zero it doesn't matter.*/
  if ( dl > sl ) {
  /* This ain't ANSI C (a cast does not form an lvalue) but both SysV.2 PCC
  and GnuC accept it, and it's damn convenient.  Also optimizes very
  nicely.*/
  ((char *)dl) += n;
  ((char *)sl) += n;
  }

  if ( (int)dl & 1 ) {
  if ( (int)sl & 1 )
  goto byte_aligned;
  /* else drop thru */
  } else {
  if ( ! ((int)sl & 1) )
  goto word_aligned;
  /* else drop thru */
  }

mis_aligned:
  /* I don't think I can do much better than the simple bytewise copy.  As
  written here, Gnu C optimizes the loop into the two instruction
  sequence:

  L%12:
  mov.b -(%a2),-(%a3)
  dbra %d0,L%12

  which is the special and only copying loop which the 68010 can keep entirely
  in the prefetch queue, thus freeing the bus for full-time data and nearly
  doubling the speed of otherwise apparently equivalent loops.  Therefore I
  can't think of any fancy juggling to handle the odd-alignment that
  seems likely to turn a profit.

  The loop here (or actually, its longword analog, below), should still be
  the fastest way to copy a block of memory, but it won't be as much
  faster on a 68020 as on a 68010 because equivalent loops with a couple
  of extra instructions will still get kept in the prefetch queue.  The
  same should be true on a 68000 because none of the loops get kept in the
  prefetch queue.  I don't have a 68020 or a 68000 machine handy so this
  paragraph is sheer speculation.  */

  if ( dl > sl ) {
  if ( n-- )
  do
  *--((char *)dl) = *--((char *)sl);
  while ( n-- );
  } else {
  /* Here's the strange part.  If the following loop is written with the
  casts-to-lvalue like its complement just above, instead of the local
  char * vars, Gnu C doesn't optimize it to the complement of the
  seqeunce given above, which would be:

  L%12:
  mov.b (%a2)+,(%a3)+
  dbra %d0,L%12

  Instead it produces:

  L%16:
  mov.l %a3,%a0
  addq.w &1,%a3
  mov.l %a2,%a1
  addq.w &1,%a2
  mov.b (%a1),(%a0)
  dbra %d0,L%16

  I noticed this because it made enough difference to badly skew
  some timing tests I was running on a parsing algorithm.  Anyhow,
  adding the char *temps as here coaxes Gnu C into performing the
  desired optimization.

  It is worth noting that SysV.2 PCC, which I also tried, did not
  make this optimization despite my best attempts to devise
  suggestively convoluted loops.  The Gnu code is almost twice as fast
  as the best PCC could do (on a 68010 -- see above).

  The upshot of all this is that the inner loop of the resulting code
  (together with the longword loop below) is identical to the main
  loops of the standard library memcpy() routine on my 3b1, which I assume
  is coded in assembler (but doesn't handle overlap, which is why I had to
  write this routine in the first place.) This means that memory copying
  functions, one of the last remaining rationalizations for coding library
  routines in assembler, will no longer serve as an excuse.  It also means
  that inline C code to copy small blocks can be significantly faster than
  calling memcpy() (it seldom was before, in practice.)

  But see the caveats below.*/

  register char *dc = (char *)d, *sc = (char *)s;/* casts needed by PCC */
  if ( n-- )
  do
  *dc++ = *sc++;
  while ( n-- );
  }
  return(d);

byte_aligned:
  /* both byte aligned -- copy the odd byte */
  if ( n-- )
  if ( dl > sl )
  *--((char *)dl) = *--((char *)sl);
  else
  *((char *)dl)++ = *((char *)sl)++;
  /* drop thru... */
  else
  return(d);

word_aligned:
  nstash = n;
  n /= sizeof(long);
  if ( dl > sl ) {
  /* GnuC performs an optimization on this loop equivalent to the
  one described above, but copying longwords.  It really cooks along.
  However, the loop must be written exactly as here; any deviation,
  even to the extent of not performing the initial decrement of <n> and
  using predecrement in the loop control expression, results in the
  desired optimization not being done.  I tried nearly a dozen
  variations, including several array notation equivalents, as well as
  using an end pointer instead of a loop counter; none resulted in the
  desired optimization.

  This is a shame, because what's below is a non-intuitive way to write
  this loop -- I certainly wouldn't write it this way off the top of my
  head, and I doubt if it occurs in this precise form in much existing
  code.  It seems to me that the compiler should be able to figure out
  equivalent loops and perform the optimization.  The speed increase is,
  as noted above, close to doubling in some cases, and it is hard to think
  of a more useful construct to optimize than block-copy loops.*/

  if ( n-- )
  do /* main loop -- copy longwords */
  *--dl = *--sl;
  while ( n-- );
  n = nstash;
  if ( n & 2 )
  *--((short *)dl) = *--((short *)sl);
  if ( n & 1 )
  *--((char *)dl) = *--((char *)sl);
  } else {
  if ( n-- )
  do
  *dl++ = *sl++;
  while ( n-- );
  n = nstash;
  if ( n & 2 )
  *((short *)dl)++ = *((short *)sl)++;
  if ( n & 1 )
  *((char *)dl)++ = *((char *)sl)++;
  }
  return(d);
}

#ifdef TESTING
/*
  These are some functions and code fragments used in the timeing tests,
the results of which are appended.
*/

/*
  maxcpy() receives the GnuC/68010 optimization but doesn't
handle overlap, or any alignment except both destination and source
word-aligned.  It is about the fastest copy function you can write in C for
the 68010, and I don't think you could do much better in assembler.
*/
char *maxcpy(dl, sl, n)
register long *dl, *sl;
register int n;
{
  n /= sizeof(long);
  if ( n-- )
  do
  *dl++ = *sl++;
  while ( n-- );
}

/*
  null() is the null function.

  Incidentally, GnuC does significantly better than PCC at optimizing
function calls.  It is smarter about picking up arguments off the stack
into registers and automatic vars, and it pushes the args for and calls
several functions before it cleans up the stack.  That is, it does:

  push args
  call foo()
  push args
  call bar()
  pop args for both foo() and bar()

  GnuC sometimes calls four or more functions before cleaning up.  All
this adds up to a null function about 3/2 as fast as PCC produces.
*/

char *null(d, s, n)
register char *d, *s;
register int n;
{
  return(d);
}

/*
  "inline longwords" is the same code as maxcpy() but in-line; as you would
expect, the the times for null() and inline longwords add up pretty closely
to the time for maxcpy() in all tests.

  "inline bytes" is the same, but it copies bytes instead of longwords.

  Since maxcpy() and inline longwords would cause a core dump in the
non-word-aligned tests, the results are not here.
*/

#endif

/*
  Here are the results, in seconds of copying five sizes of blocks by
several methods with each of the basic alignment variations.  d is the
destination offset from longword alignment (here, always zero), s is the
source offset, n is the length of the block in bytes, and i is the number
of iterations tested.  The total bytes copied , n * i, is the same for
each test.

  This test sequence is mainly to test alignment differences.  Note that
for small blocks the alignment, which only effects the speed of the innermost
loop, is pretty much irrelevent; the function call and setup overhead
dominates completely.  For big blocks, mis-alignment slows both memcpy()
and memmove() by a factor of three -- mis-alignment forces byte by byte
copying.

d: 0  s: 0  n: 4  i: 262144  null:4.3
d: 0  s: 0  n: 4  i: 262144  memcpy:9.5
d: 0  s: 0  n: 4  i: 262144  maxcpy:8.0
d: 0  s: 0  n: 4  i: 262144  memmove:14.9
d: 0  s: 0  n: 4  i: 262144  inline bytes:4.1
d: 0  s: 0  n: 4  i: 262144  inline longwords:2.0

d: 0  s: 1  n: 4  i: 262144  null:4.2
d: 0  s: 1  n: 4  i: 262144  memcpy:9.5
d: 0  s: 1  n: 4  i: 262144  memmove:14.9
d: 0  s: 1  n: 4  i: 262144  inline bytes:4.1

d: 0  s: 2  n: 4  i: 262144  null:4.2
d: 0  s: 2  n: 4  i: 262144  memcpy:9.2
d: 0  s: 2  n: 4  i: 262144  maxcpy:7.9
d: 0  s: 2  n: 4  i: 262144  memmove:14.8
d: 0  s: 2  n: 4  i: 262144  inline bytes:4.1
d: 0  s: 2  n: 4  i: 262144  inline longwords:2.0

d: 0  s: 3  n: 4  i: 262144  null:4.2
d: 0  s: 3  n: 4  i: 262144  memcpy:9.5
d: 0  s: 3  n: 4  i: 262144  memmove:14.7
d: 0  s: 3  n: 4  i: 262144  inline bytes:4.2

d: 0  s: 0  n: 16  i: 65536  null:1.1
d: 0  s: 0  n: 16  i: 65536  memcpy:2.8
d: 0  s: 0  n: 16  i: 65536  maxcpy:2.4
d: 0  s: 0  n: 16  i: 65536  memmove:4.3
d: 0  s: 0  n: 16  i: 65536  inline bytes:2.2
d: 0  s: 0  n: 16  i: 65536  inline longwords:1.4

d: 0  s: 1  n: 16  i: 65536  null:1.0
d: 0  s: 1  n: 16  i: 65536  memcpy:3.6
d: 0  s: 1  n: 16  i: 65536  memmove:5.2
d: 0  s: 1  n: 16  i: 65536  inline bytes:2.2

d: 0  s: 2  n: 16  i: 65536  null:1.1
d: 0  s: 2  n: 16  i: 65536  memcpy:2.7
d: 0  s: 2  n: 16  i: 65536  maxcpy:2.4
d: 0  s: 2  n: 16  i: 65536  memmove:4.2
d: 0  s: 2  n: 16  i: 65536  inline bytes:2.2
d: 0  s: 2  n: 16  i: 65536  inline longwords:1.3

d: 0  s: 3  n: 16  i: 65536  null:1.1
d: 0  s: 3  n: 16  i: 65536  memcpy:3.6
d: 0  s: 3  n: 16  i: 65536  memmove:5.2
d: 0  s: 3  n: 16  i: 65536  inline bytes:2.2

d: 0  s: 0  n: 64  i: 16384  null:0.3
d: 0  s: 0  n: 64  i: 16384  memcpy:1.1
d: 0  s: 0  n: 64  i: 16384  maxcpy:1.1
d: 0  s: 0  n: 64  i: 16384  memmove:1.6
d: 0  s: 0  n: 64  i: 16384  inline bytes:1.7
d: 0  s: 0  n: 64  i: 16384  inline longwords:0.8

d: 0  s: 1  n: 64  i: 16384  null:0.3
d: 0  s: 1  n: 64  i: 16384  memcpy:2.1
d: 0  s: 1  n: 64  i: 16384  memmove:2.8
d: 0  s: 1  n: 64  i: 16384  inline bytes:1.7

d: 0  s: 2  n: 64  i: 16384  null:0.3
d: 0  s: 2  n: 64  i: 16384  memcpy:1.1
d: 0  s: 2  n: 64  i: 16384  maxcpy:1.1
d: 0  s: 2  n: 64  i: 16384  memmove:1.6
d: 0  s: 2  n: 64  i: 16384  inline bytes:1.7
d: 0  s: 2  n: 64  i: 16384  inline longwords:0.8

d: 0  s: 3  n: 64  i: 16384  null:0.3
d: 0  s: 3  n: 64  i: 16384  memcpy:2.1
d: 0  s: 3  n: 64  i: 16384  memmove:2.8
d: 0  s: 3  n: 64  i: 16384  inline bytes:1.7

d: 0  s: 0  n: 512  i: 2048  null:0.0
d: 0  s: 0  n: 512  i: 2048  memcpy:0.7
d: 0  s: 0  n: 512  i: 2048  maxcpy:0.7
d: 0  s: 0  n: 512  i: 2048  memmove:0.8
d: 0  s: 0  n: 512  i: 2048  inline bytes:1.7
d: 0  s: 0  n: 512  i: 2048  inline longwords:0.6

d: 0  s: 1  n: 512  i: 2048  null:0.0
d: 0  s: 1  n: 512  i: 2048  memcpy:1.7
d: 0  s: 1  n: 512  i: 2048  memmove:2.2
d: 0  s: 1  n: 512  i: 2048  inline bytes:1.6

d: 0  s: 2  n: 512  i: 2048  null:0.0
d: 0  s: 2  n: 512  i: 2048  memcpy:0.7
d: 0  s: 2  n: 512  i: 2048  maxcpy:0.7
d: 0  s: 2  n: 512  i: 2048  memmove:0.9
d: 0  s: 2  n: 512  i: 2048  inline bytes:1.7
d: 0  s: 2  n: 512  i: 2048  inline longwords:0.6

d: 0  s: 3  n: 512  i: 2048  null:0.0
d: 0  s: 3  n: 512  i: 2048  memcpy:1.6
d: 0  s: 3  n: 512  i: 2048  memmove:2.1
d: 0  s: 3  n: 512  i: 2048  inline bytes:1.6

d: 0  s: 0  n: 4096  i: 256  null:0.0
d: 0  s: 0  n: 4096  i: 256  memcpy:0.6
d: 0  s: 0  n: 4096  i: 256  maxcpy:0.6
d: 0  s: 0  n: 4096  i: 256  memmove:0.7
d: 0  s: 0  n: 4096  i: 256  inline bytes:1.6
d: 0  s: 0  n: 4096  i: 256  inline longwords:0.6

d: 0  s: 1  n: 4096  i: 256  null:0.0
d: 0  s: 1  n: 4096  i: 256  memcpy:1.6
d: 0  s: 1  n: 4096  i: 256  memmove:2.1
d: 0  s: 1  n: 4096  i: 256  inline bytes:1.6

d: 0  s: 2  n: 4096  i: 256  null:0.0
d: 0  s: 2  n: 4096  i: 256  memcpy:0.6
d: 0  s: 2  n: 4096  i: 256  maxcpy:0.6
d: 0  s: 2  n: 4096  i: 256  memmove:0.7
d: 0  s: 2  n: 4096  i: 256  inline bytes:1.6
d: 0  s: 2  n: 4096  i: 256  inline longwords:0.7

d: 0  s: 3  n: 4096  i: 256  null:0.0
d: 0  s: 3  n: 4096  i: 256  memcpy:1.6
d: 0  s: 3  n: 4096  i: 256  memmove:2.2
d: 0  s: 3  n: 4096  i: 256  inline bytes:1.6

  This sequence of tests does all longword copying, with more iterations
and more block sizes for better resolution, to test raw speed.

d: 0  s: 0  n: 4  i: 1048576  null:17.0
d: 0  s: 0  n: 4  i: 1048576  memcpy:36.8
d: 0  s: 0  n: 4  i: 1048576  maxcpy:31.6
d: 0  s: 0  n: 4  i: 1048576  memmove:60.0
d: 0  s: 0  n: 4  i: 1048576  inline bytes:16.5
d: 0  s: 0  n: 4  i: 1048576  inline longwords:7.9

d: 0  s: 0  n: 8  i: 524288  null:8.5
d: 0  s: 0  n: 8  i: 524288  memcpy:19.5
d: 0  s: 0  n: 8  i: 524288  maxcpy:16.9
d: 0  s: 0  n: 8  i: 524288  memmove:31.1
d: 0  s: 0  n: 8  i: 524288  inline bytes:11.5
d: 0  s: 0  n: 8  i: 524288  inline longwords:8.1

d: 0  s: 0  n: 16  i: 262144  null:4.2
d: 0  s: 0  n: 16  i: 262144  memcpy:10.9
d: 0  s: 0  n: 16  i: 262144  maxcpy:9.6
d: 0  s: 0  n: 16  i: 262144  memmove:17.0
d: 0  s: 0  n: 16  i: 262144  inline bytes:8.7
d: 0  s: 0  n: 16  i: 262144  inline longwords:5.3

d: 0  s: 0  n: 32  i: 131072  null:2.1
d: 0  s: 0  n: 32  i: 131072  memcpy:6.7
d: 0  s: 0  n: 32  i: 131072  maxcpy:5.9
d: 0  s: 0  n: 32  i: 131072  memmove:9.9
d: 0  s: 0  n: 32  i: 131072  inline bytes:7.6
d: 0  s: 0  n: 32  i: 131072  inline longwords:3.9

d: 0  s: 0  n: 64  i: 65536  null:1.0
d: 0  s: 0  n: 64  i: 65536  memcpy:4.5
d: 0  s: 0  n: 64  i: 65536  maxcpy:4.2
d: 0  s: 0  n: 64  i: 65536  memmove:6.4
d: 0  s: 0  n: 64  i: 65536  inline bytes:6.9
d: 0  s: 0  n: 64  i: 65536  inline longwords:3.2

d: 0  s: 0  n: 128  i: 32768  null:0.5
d: 0  s: 0  n: 128  i: 32768  memcpy:3.5
d: 0  s: 0  n: 128  i: 32768  maxcpy:3.3
d: 0  s: 0  n: 128  i: 32768  memmove:4.6
d: 0  s: 0  n: 128  i: 32768  inline bytes:6.6
d: 0  s: 0  n: 128  i: 32768  inline longwords:2.8

d: 0  s: 0  n: 512  i: 8192  null:0.1
d: 0  s: 0  n: 512  i: 8192  memcpy:2.7
d: 0  s: 0  n: 512  i: 8192  maxcpy:2.6
d: 0  s: 0  n: 512  i: 8192  memmove:3.3
d: 0  s: 0  n: 512  i: 8192  inline bytes:6.4
d: 0  s: 0  n: 512  i: 8192  inline longwords:2.5

*/