fastest way to copy hunks of memory

Dan Mercer mercer at ncrstp.StPaul.NCR.COM
Fri May 4 08:09:53 AEST 1990


In article <5531 at helios.ee.lbl.gov> tierney at ux1.lbl.gov (Brian Tierney) writes:
:
:Which method is fastest?
:
:1.  char *p1, *p2;
:
:    for (j = 0; j < size; j++)
:       *p1++ = *p2++;
:
:2.  bcopy(p2,p1,size);
:
:3.  memcpy(p1,p2,size);
:
:In general, system calls are slower, right? (ie, 1 faster that 2 and 3)
:
:BTW, what's the difference between bcopy and memcpy anyway??
:
:THANKS.  
:
:
:/---------------------------------------v-------------------------------------\
:| Brian Tierney, Computer Graphics Lab  | internet:   tierney at ux1.lbl.gov   |
:| Lawrence Berkeley Laboratory          |    or       BLTierney at lbl.gov      |
:|      "I am her hero, and she is my heroin": Rocky Ericson                  |


It's c language implementation independent.  Bcopy or memcpy may be
implemented as inline assembly (as strcpy is on the NCR Tower's SVr3
machines if <istring.h> is included).  They may also be implemented
in other fashions that are quicker than standard C function calls.
If the hardware architecture supports move instructions the savings can
be significant.  

for (j = 0; j < size; j++) *p1++ = *p2++; 

At the minimal,  this breaks down to:

    exclusive or register x against register x - 1 cycle
    load register a with size - 1 cycle
    label:
    move character based at value in x plus base y to position at x plus 
        base z - 1 cycle
    increment x - 1 cycle
    compare x to a - 1 cycle
    branch not equal to label - 1 cycle

Cost = 2 + (size * 4) cycles

    In fact,  if the architecture supports a branch and decrement
instruction,  the following c code would be faster -

for (j=size;j;j--) *p1++ = *p2++; 

    load register x with size - 1 cycle
    label:
    move character based at value in x plus base y to position at x plus 
        base z - 1 cycle
    decrement x and branch to label if not zero - 1 cycle

Cost = 2 + (size * 2) cycles

However,  if the bcopy or memcpy is expanded inline and the architecture
support move character instructions,  the savings may be far greater.

For instance,  we have a processor that does an MVC - move characters.
While the bytes being moved are not on full words,  or the length to
be moved is less that a full word,  one cycle is used for each byte.
But if full words can be moved,  only one cycle is required to move
each 4 byte chunk.

Socrates once came upon a bunch of philosophers arguing about how many
teeth a horse had.  One group argued for a certain number based on
the length of the horses snout,  a second upon the diet the horse
ate,  a third group based upon number theory.  Socrates disappeared
for a minute and came back and told them the exact number.  They were
incensed.  How could he be so sure of his number,  when he hadn't even
offered a theory to explain it.  Simple,  he replied,  while they were
arguing,  he had gone back to the stables and counted.

So,  if you want the real answer,  I suggest you do the same.  If
you can't decipher the answer from the assembly generated,  put
together a nice looping package and try that.

-- 

Dan Mercer
Reply-To: mercer at ncrstp.StPaul.NCR.COM (Dan Mercer)



More information about the Comp.unix.wizards mailing list