Optimal for loop on the 68020.

Tue Jun 6 06:15:12 AEST 1989

In article <11993 at well.UUCP> Jef Poskanzer <pokey at well.com> writes:
>I compiled the following different kinds of for loops:
>    for ( i = 0; i < COUNT; i++ )
>    for ( i = 0; i < COUNT; ++i )
>    for ( i = 0; ++i <= COUNT; )
>    for ( i = 0; i++ < COUNT; )
>    for ( i = COUNT; i > 0; i-- )
>    for ( i = COUNT; i > 0; --i )
>    for ( i = COUNT; --i >= 0; )
>    for ( i = COUNT; i-- > 0; )
>on a Sun 3/50 with both SunOS 3.5 cc and gcc 1.35, and looked at the
>generated code and the timings.  COUNT was a small (< 127) compile-time
>constant.

A huge amount of time ago, i did this for the PDP-11 using the
Ritchie compiler.  This compiler is not awesomely bright, but i
had the impression that it wasn't quite as bad as PCC (or even
less portable).  The optimal loop would use the "sob" (subtract
one and branch if not zero) instruction.  The code that caused it
to be emitted was
	register i;
	i = COUNT;
	do {
		loop body;
	} while (--i);

This is much clearer than anything with for(), since it tends to get
(even stupidly) compiled to:

	register i;
	i = COUNT;			mov $COUNT,r2
	do {			L1:
		loop body;		loop body...
	} while (--i);			sob r2,L1

The compiler seemed to be smart enough to know not to use it if
the loop was too long. Thus, at worst, the sob would be replaced by:
					sub $1,r2
					jne L1

The trouble with "for" loops is that it is defined as "test at
the top", when most single instructions loops are "test at bottom".

The VAX (PCC based) compiler's optimizer would convert many loops
that used multiple instructions to use a single complicated
instruction.  Unfortunately, the VAX 780 (ubiquitous at the time)
generally ran these slower.  One case was the acbl (add, compare
and branch longword).  The optimizer would replace the three
instructions (something like:)
	add $1,r2
	cmp $END,r2
	bne L1

with the "acbl" and all its arguments.  Both codings take the
same number of bytes (25!).  The multiple instructions are faster
(on the 780).  Newer VAXen have different relative timing.  I
recall this case coming from the prime number sieve benchmark.
The loop could not be easily converted to the do-while.

I haven't gotten a chance to play with gcc (real work to do).
I've heard it can do wonders, and thought it did better
multi-statement optimization.  Still, all the silly side-effects
semantics of C make it hard to tell a compiler how to take
mediocre source and produce good code from it.  It is best to
start with good source.  Though i hardly ever write assembler
anymore (and it seems never for speed or size), i've still yet to
bump into a C compiler that produces code that i can't improve on
at a glance.  This may change if/when i start working with
scoreboarded RISC processors (88K), in that instruction cycle
counting looks hard (requiring more than a glance).

Stephen.