fastest way to copy hunks of memory

Tue May 8 01:29:49 AEST 1990

In article <1990May4.172145.4085 at agate.berkeley.edu>
c60c-3cf at e260-3f.berkeley.edu (Dan Kogai) writes:

| In article <1990May2.200732.11851 at eci386.uucp> clewis at eci386.UUCP (Chris Lewis) writes:
| >Perhaps 
| >
| >    while(size--)
| >	*p1++ = *p2++;
| 
| or even
| 
| void *memcpy(void *to, void *from, size_t size){
| 	register int 	size_l = size / 4,	/* or (size >> log2(sizeof int)) */
| 					tail = size % 4;	/* or (size & log2(sizeof int)) */
| 	void			*result = to;				
| 	while(size_l--) (int *)to++ = (int *)from++;
| 	while(tail--) (char *)p1++ = (char *)p2++;
| 	return result;
| }
| 
| 	This should work almost 4 times as fast compared to just incrementing
| by bytes--it uses the full length of a register.  The problem is that it
| doesn't work if either (void *to) or (void *from) is not aligned and the
| machine architecture doesn't allow unaligned assignment.  Such functions as
| memcpy() should be written in assembler, I think...

The above code will not work on machines with strict alignment
requirements (e.g., most RISC machines) if either the 'to' or 'from'
pointer is not aligned on input, since the user could certainly do
something like:

	memcpy (to+1, from, size);
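
A word-at-a-time copy therefore has to look at the pointers before it
casts them to (int *).  A minimal sketch of such a check follows.  The
name can_copy_by_words is made up for illustration, and the code assumes
that sizeof(int) is a power of two and that converting a pointer to
uintptr_t preserves the low address bits, which is implementation-defined
but holds on the usual flat-address machines:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when both pointers are aligned for
 * int access, 0 otherwise.  Assumes sizeof(int) is a power of two. */
static int can_copy_by_words(const void *to, const void *from)
{
    /* OR the addresses together so one mask test covers both. */
    uintptr_t bits = (uintptr_t)to | (uintptr_t)from;

    return (bits & (sizeof(int) - 1)) == 0;
}
```

With this check, the call shown above fails the test because of the
to+1, and the copy has to fall back to bytes for at least part of the
region.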

The quoted code also will not work under ANSI C compilers, since the construction:

	(int *)to++ = ...

is illegal ANSI C: the result of a cast is not an lvalue, so it cannot
be assigned to.  Finally, to get the most performance on RISC machines,
you have to know about the underlying machine characteristics.  For
example, on the 88k, there is a 2 cycle delay between the time a load
instruction is initiated and the time its value is available in a
register (there are hardware interlocks, so even naive code will work).
Thus on the 88k, after dealing with any initially unaligned pointers and
the like, the main loop, which issues both loads before either store to
cover that delay, would look like:

	...

	{
		register int word1, word2, *word_to, *word_from;

		word_to = (int *) to;
		word_from = (int *) from;

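		/* Test before copying so a short size never enters the loop;
		   >= lets a final full pair of words go through it too. */
		while ( size >= 2 * sizeof(int) ) {
			word1 = word_from[0];
			word2 = word_from[1];
			word_from += 2;
			size -= 2 * sizeof (int);
			word_to[0] = word1;
			word_to[1] = word2;
			word_to += 2;
		}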
	}
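
To show how the pieces might fit together, here is a rough sketch of a
complete routine built around that loop.  It is an illustration only,
not the 88k library code: the name word_copy is made up, the regions are
assumed not to overlap, sizeof(int) is assumed to be a power of two, and
the pointer-to-uintptr_t conversions are assumed to preserve the low
address bits.  Reading arbitrary objects through an (int *) also runs
afoul of the aliasing rules in later C standards, which is one more
reason real implementations tend to live in assembler or rely on
compiler support.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical routine; the source and destination must not overlap. */
void *word_copy(void *to, const void *from, size_t size)
{
    unsigned char *byte_to = to;
    const unsigned char *byte_from = from;
    const uintptr_t mask = sizeof(int) - 1;

    /* Word copies are only possible when both pointers are
     * misaligned by the same amount. */
    if ((((uintptr_t)byte_to ^ (uintptr_t)byte_from) & mask) == 0) {
        int *word_to;
        const int *word_from;

        /* Copy leading bytes until both pointers are word aligned. */
        while (size > 0 && ((uintptr_t)byte_to & mask) != 0) {
            *byte_to++ = *byte_from++;
            size--;
        }

        /* Convert once into typed pointers instead of assigning
         * through a cast expression. */
        word_to = (int *)byte_to;
        word_from = (const int *)byte_from;

        /* Issue two loads before the two stores. */
        while (size >= 2 * sizeof(int)) {
            int word1 = word_from[0];
            int word2 = word_from[1];

            word_from += 2;
            size -= 2 * sizeof(int);
            word_to[0] = word1;
            word_to[1] = word2;
            word_to += 2;
        }

        byte_to = (unsigned char *)word_to;
        byte_from = (const unsigned char *)word_from;
    }

    /* Trailing bytes, or the whole copy when the alignments differ. */
    while (size-- > 0)
        *byte_to++ = *byte_from++;

    return to;
}
```

When the two pointers are misaligned by different amounts, this sketch
simply gives up and copies bytes; a tuned version would instead load
aligned words and shift them into place, which is where much of the
extra work goes.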

Optimizing bcopy/memcpy/memmove is not as simple as it looks.  It
takes a lot of skull sweat, and worrying about unusual cases.
--
Michael Meissner	email: meissner at osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Catproof is an oxymoron, Childproof is nearly so