Optimal for loop on the 68020.

Jef Poskanzer pokey at well.UUCP
Mon Jun 5 10:18:11 AEST 1989


SUMMARY
-------

I compiled the following different kinds of for loops:

    for ( i = 0; i < COUNT; i++ )
    for ( i = 0; i < COUNT; ++i )
    for ( i = 0; ++i <= COUNT; )
    for ( i = 0; i++ < COUNT; )
    for ( i = COUNT; i > 0; i-- )
    for ( i = COUNT; i > 0; --i )
    for ( i = COUNT; --i >= 0; )
    for ( i = COUNT; i-- > 0; )

on a Sun 3/50 with both SunOS 3.5 cc and gcc 1.35, and looked at the
generated code and the timings.  COUNT was a small (< 127) compile-time
constant.  The loop body did not reference i and had no side-effects.
In theory, all eight of these loops should have optimized to the most
efficient loop possible, ignoring the otherwise unreferenced variable i
and simply traversing the loop body the proper number of times.  On the
68020, this most efficient loop is a dbra instruction.  But in fact, cc
never generates dbra, and gcc generates it for only one of the above
loop types, and only when the -fstrength-reduce flag is specified.

In contrast, the BSD cc on a vax generates a single instruction loop in
four out of the eight cases, including the two most commonly-used
ones.  This is not because the vax compiler is any smarter -- the fact
that it wins only half the time, instead of all the time, shows that it
is also failing to recognize that it can freely optimize the loop.  The
only reason the vax's cc does better is that the vax has a richer
instruction set.

Since the loop overhead for dbra is 1.7 times less than for the code
from the common for loops, it is useful to know how to provoke your
compiler into generating it.  But it would be more useful for compilers
to be smarter.

And more generally, those of you who blithely assume (as I used to)
that compilers are smart enough to turn clearly-expressed C into
optimal assembler should perhaps look at your compiler's assembler
output more often.


FULL DATA
---------

Case 1) First, let's look at the generic loops, counting up from zero
to the limit:

    for ( i = 0; i < COUNT; i++ )
    for ( i = 0; i < COUNT; ++i )

       cc -O		gcc -O & gcc -O -fstrength-reduce

    moveq  #0,d6	    clrl   d0
tag:			tag:
    <loop body>		    <loop body>
    addql  #1,d6	    addql  #1,d0
    moveq  #COUNT,d1	    moveq  #COUNT-1,d3
    cmpl   d1,d6	    cmpl   d0,d3
    jlt    tag		    jge    tag

    0.97 usecs		    0.97 usecs

The two compilers generate essentially the same instructions.  Pre or
post increment doesn't make any difference.  Note that the moveq of
COUNT could be moved above the loop -- the loop body doesn't call any
subroutines that might smash it.  This looks like a missed
optimization.


Case 2a) Second, let's try doing the increment and test in one C
statement.  Seems to me this shouldn't make any difference to a smart
optimizer, but...

    for ( i = 0; ++i <= COUNT; )

       cc -O		gcc -O & gcc -O -fstrength-reduce

    moveq  #0,d6	    clrl   d0
    jra    tag2		    jra    tag2
tag1:			tag1:
    <loop body>		    <loop body>
tag2:			tag2:
    addql  #1,d6	    addql  #1,d0
    moveq  #COUNT,d1	    moveq  #COUNT,d3
    cmpl   d1,d6	    cmpl   d0,d3
    jle    tag1		    jge    tag1

    0.97 usecs		    0.97 usecs

No important difference from case 1.

Case 2b) But if we try the post increment version:

    for ( i = 0; i++ < COUNT; )

       cc -O		gcc -O & gcc -O -fstrength-reduce

    moveq  #0,d6	    clrl   d0
    jra    tag2		    jra    tag2
tag1:			tag1:
    <loop body>		    <loop body>
tag2:			tag2:
    movl   d6,d0	    addql  #1,d0
    addql  #1,d6	    moveq  #COUNT+1,d3
    moveq  #COUNT,d1	    cmpl   d0,d3
    cmpl   d1,d0	    jgt    tag1
    jlt    tag1

    1.11 usecs		    0.97 usecs

Gcc is smart enough to bias the loop count, but cc doesn't see that
optimization and so adds an extra instruction.


Case 3) Third, let's try a count-down loop:

    for ( i = COUNT; i > 0; i-- )
    for ( i = COUNT; i > 0; --i )

       cc -O		gcc -O & gcc -O -fstrength-reduce

    moveq  #COUNT,d6	    moveq  #COUNT,d0
tag:			tag:
    <loop body>		    <loop body>
    subql  #1,d6	    subql  #1,d0
    tstl   d6		    tstl   d0
    jgt    tag		    jgt    tag

    0.83 usecs		    0.83 usecs

Here the two compilers generate exactly the same instructions.  There
is one less instruction than in the count-up case, because the
compilers are smart enough to use the tstl instruction to compare
against zero, and so they don't need the moveq #COUNT instruction
(which shouldn't have been in the loop in the first place).


Case 4a) Fourth, let's try a count-down loop with the decrement
included in the test statement:

    for ( i = COUNT; --i >= 0; )

       cc -O		      gcc -O		gcc -O -fstrength-reduce
    
    moveq  #COUNT,d6	    moveq  #COUNT,d0	    moveq  #COUNT,d0
    jra    tag2		    jra    tag2		    jra    tag2
tag1:			tag1:			tag1:
    <loop body>		    <loop body>		    <loop body>
tag2:			tag2:			tag2:
    subql  #1,d6	    subql  #1,d0	    dbra   d0,tag1
    jge    tag1		    jpl    tag1		    clrw   d0
						    subql  #1,d0
						    jcc    tag1

    0.70 usecs		    0.70 usecs		    0.57 usecs

Cc and gcc generate code similar to case 3, except that they suddenly
decide that subql sets the condition codes just fine by itself, and a
separate tstl is not necessary.  So they shave off an instruction.
Seems like a peephole optimizer should have picked this up in the
previous cases too; it's another missed optimization.

But the big win here is with the -fstrength-reduce option in gcc.  It
actually generates the optimal one-instruction loop, for a factor of
1.7 reduction in loop overhead from the generic loop.  Not bad.  But
wait!  What's that chud after the loop?  Let's see, clear d0 to zero,
subtract one from it giving -1 and setting carry, and jump if carry is
clear.  Hmm, looks like a three-instruction no-op to me!  I guess gcc
wants to leave the loop index with the right value, and isn't smart
enough to notice it isn't used.  (But why not simply moveq the final
value?)  Oh well, since it's not inside the loop it doesn't make much
difference.

Case 4b) But if we try this with post decrement:

    for ( i = COUNT; i-- > 0; )

       cc -O		gcc -O & gcc -O -fstrength-reduce

    moveq  #COUNT,d6	    moveq  #COUNT,d0
    jra    tag2		    jra    tag2
tag1:			tag1:
    <loop body>		    <loop body>
tag2:			tag2:
    movl   d6,d0	    subql  #1,d0
    subql  #1,d6	    moveq  #-1,d3
    tstl   d0		    cmpl   d0,d3
    jgt    tag1		    jlt    tag1
    
    0.97 usecs		    0.97 usecs

Cc not only adds in the movl to save the unincremented value of i, it
also somehow fails to realize that subql sets the condition codes, so
the tstl is back.  And gcc gets totally confused and brings in a
literal -1.


CONCLUSION
----------

For both compilers and all levels of optimization, this loop:

    for ( i = COUNT; --i >= 0; )

gives the lowest overhead.  Using gcc -fstrength-reduce, this low
overhead means the optimal single-instruction loop; but even for cc and
for gcc without -fstrength-reduce, this loop is faster than the
others.  I don't show the results here, but this is also true for large
constant loops and for variable-length loops.

It is unfortunate that we have to use such a non-intuitive loop to get
the best results -- especially unfortunate since it is probably *not*
the best loop on some other architectures or compilers -- but that's
the facts.

I would be interested to see similar checks done on different
architectures; in particular RISC machines, which supposedly are
designed in cooperation with the compiler writers.
---
Jef

   Jef Poskanzer   {ucbvax, lll-crg, sun!pacbell, apple, hplabs}!well!pokey
 "Man invented language to satisfy his deep need to complain." -- Lily Tomlin


