Optimal for loop on the 68020.

Jim Adcock jima at hplsla.HP.COM
Tue Jun 13 05:12:09 AEST 1989


>Random thought: What are our "optimizing compilers" optimizing if they 
>leave loops alone?

It's not that they leave loops alone; they munge either side of a label
as well as they can.  Optimizing across a label is hard, since such
optimizations might break the code branching to that label.
See the example below.

Again, one cannot just try some trivial loop cases, say "form X is fastest",
and use it blindly from there on.  Life is not that simple.  Compilers do not
say: "aha, he's using loop form number 5, that calls for a dbra."  Typically,
a good optimizing compiler is going to start by naively generating a "tree"
representing one's program segment, said "tree" unfortunately containing
loops and branch statements which complicate a compiler's life tremendously.
Then the tree is simplified using simple rules of arithmetic:

"[add [const 4] [const 1]] --> [const 5]"

for example.  Consider what loops and branches can easily do to complicate
a compiler's life:

	[reg0 = [const 4]]
label401:
	[reg1 = [const 1]]
	[reg2 = [add reg0 reg1]]

Can the final statement be simplified to: "[reg2 = [const 5]]" ???

Maybe, but you'd have to have knowledge of all the areas that branch to 
label401.  Clearly, compilers don't like to do this!  And situations like
this arise all the time when your for loops get converted to a branch plus
a label.
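
To make the folding idea concrete, here is a minimal sketch of constant
folding over such a tree, in C.  The node layout and names are mine,
purely for illustration -- real compilers use far richer representations
-- but it shows both the [add [const 4] [const 1]] --> [const 5] rule and
how a value that might arrive through a label (marked UNKNOWN here)
blocks the fold:

#include <stdio.h>
#include <stdlib.h>

enum op { CONST, ADD, UNKNOWN };    /* UNKNOWN: set who-knows-where,   */
                                    /* e.g. on some path into a label  */
struct node {
    enum op op;
    int value;                      /* valid when op == CONST */
    struct node *left, *right;      /* valid when op == ADD   */
};

static struct node *mknode(enum op op, int v,
                           struct node *l, struct node *r)
{
    struct node *n = malloc(sizeof *n);
    n->op = op; n->value = v; n->left = l; n->right = r;
    return n;
}

/* Fold [add [const a] [const b]] into [const a+b].  An UNKNOWN child
   blocks the fold -- exactly what a label does, since the compiler
   can't see every branch that may reach it. */
static struct node *fold(struct node *n)
{
    if (n == NULL || n->op != ADD)
        return n;
    n->left = fold(n->left);
    n->right = fold(n->right);
    if (n->left->op == CONST && n->right->op == CONST) {
        n->value = n->left->value + n->right->value;
        n->op = CONST;
    }
    return n;
}

int main(void)
{
    /* [add [const 4] [const 1]] --> [const 5] ... */
    struct node *a = fold(mknode(ADD, 0,
                                 mknode(CONST, 4, NULL, NULL),
                                 mknode(CONST, 1, NULL, NULL)));
    printf("%d\n", a->value);       /* prints 5 */

    /* ... but reg0, possibly redefined by a branch into label401,
       shows up as UNKNOWN, and nothing folds. */
    struct node *b = fold(mknode(ADD, 0,
                                 mknode(UNKNOWN, 0, NULL, NULL),
                                 mknode(CONST, 1, NULL, NULL)));
    printf("%s\n", b->op == CONST ? "folded" : "not folded");
    return 0;
}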

Finally, after these arithmetic simplifications, attempts are made to map
the tree onto machine instructions, or combinations of machine instructions.
Weird but useful instructions like "dbra" are likely to be handled as
a peephole optimization -- where a final stage of the compiler tries
to substitute simpler instructions for specific combinations of one or
more expensive instructions.  So in real life programming, whether
your favorite "dbra" instruction gets used or not is close to blind luck!
It's just going to depend on exactly what you do in that loop, what registers
are needed at other points in the routine, and so on.  In real life
programming these things are just not very repeatable.
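
As an illustration of the flavor of such a pass, here is a toy peephole
optimizer in C.  It is purely a sketch: it matches text strings where a
real compiler would match its internal instruction records, and the
window contents are just a decrement-and-test idiom like the ones in the
listings below:

#include <stdio.h>
#include <string.h>

#define NINSN 5

int main(void)
{
    /* The "window" the peephole pass slides over. */
    static char code[NINSN][32] = {
        "movew #1000,d0",
        "movel d0,a0@",
        "subqw #1,d0",      /* decrement the counter ...      */
        "cmpw #-1,d0",      /* ... test it against -1 ...     */
        "jne L8"            /* ... and branch if not done yet */
    };
    int i;

    for (i = 0; i + 2 < NINSN; i++) {
        if (strcmp(code[i],     "subqw #1,d0") == 0 &&
            strcmp(code[i + 1], "cmpw #-1,d0") == 0 &&
            strncmp(code[i + 2], "jne ", 4) == 0) {
            /* Collapse the three-instruction idiom into one dbra. */
            sprintf(code[i], "dbra d0,%s", code[i + 2] + 4);
            code[i + 1][0] = '\0';
            code[i + 2][0] = '\0';
        }
    }
    for (i = 0; i < NINSN; i++)
        if (code[i][0] != '\0')
            printf("\t%s\n", code[i]);
    return 0;
}

Even the toy shows the catch: dbra decrements only the low word of its
counter and branches until the result is -1, so only a word-sized
decrement-and-test sequence matches -- and whether the earlier stages of
the compiler happen to emit exactly that sequence is out of your hands.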

Consider using gcc -O -S on:

#define OUTERLOOP 10000
#define COUNT 1000
main()
{
    int i;
    int j;
    int ary[COUNT];
    for (j=OUTERLOOP; j--;) 
    {
        for ( i = COUNT; i-- > 0; ) ary[i] = i; 
    } 
}

gives:

_main:
	link a6,#-4000
	movel d2,sp@-
	movel #10000,d1
	jra L2
L9:
	movel #1000,d0
	jra L5
L8:
	lea a6@(d0:l:4),a0
	movel d0,a0@(-4000)
L5:
	subql #1,d0
	moveq #-1,d2
	cmpl d0,d2
	jlt L8
L2:
	dbra d1,L9
	clrw d1
	subql #1,d1
	jcc L9
	movel a6@(-4004),d2
	unlk a6
	rts

which executes in 15 seconds on my mid-range 68020 system.  Using the
"magic trick" of "short s; for (s = COUNT; s--;) ..." to try to match
a dbra instruction:

#define OUTERLOOP 10000
#define COUNT 1000
main()
{
    short s;
    int j;
    int ary[COUNT];
    for (j=OUTERLOOP; j--;) 
    {
        for ( s = COUNT; s-- > 0; ) ary[s] = s; 
    }
}
generates:


_main:
	link a6,#-4000
	movel #10000,d1
	jra L2
L9:
	movew #1000,d0
	jra L5
L8:
	movew d0,a1
	lea a6@(a1:l:4),a0
	movel a1,a0@(-4000)
L5:
	subqw #1,d0
	cmpw #-1,d0
	jgt L8
L2:
	dbra d1,L9
	clrw d1
	subql #1,d1
	jcc L9
	unlk a6
	rts


which takes 16 seconds to execute!  Note that even with the short
counter, the inner loop still comes out as a subqw/cmpw/jgt sequence;
the only dbra in either listing is the one gcc found on its own, for
the outer counter in d1.  So if one has an optimizing compiler I
think one is fooling oneself if one thinks one can memorize a favorite
loop structure and use it blindly.  Write code using common sense, that
is easy to read, and not too tricky.  If you need more speed, find a
different algorithm, or profile and work on just those areas that are
too slow, rewriting in assembly if necessary.  More importantly, try
to get your hands on the best optimizing compiler available for your
machine.  Any damned compiler can be advertised as an "optimizing"
compiler -- the word means nothing.  Any damned compiler can be tweaked
to generate great looking code for a few do-nothing for loops -- it
means nothing.
Try to get a couple of the best compilers on your machine and then try
them on a variety of code indicative of the work you do.  They may vary
greatly in areas far more important to you than loop construction.  For
example, 680x0 compilers I have seen vary *greatly* in the approaches
they take to using the 6888x coprocessor -- with many having a
rudimentary or non-existent concept of what it means to optimize
floating point instructions.  Many only use registers 0 and 1 in the
6888x coprocessor!  Many shuffle floating point numbers to and from the
6888x just to convert from single to double precision and back, and so
on.  [I have yet to see a compiler which is good at both 680x0
instruction generation and 6888x instruction generation.]
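
When you do compare compilers, even a crude timing harness along these
lines beats staring at listings of do-nothing loops.  This is only a
sketch -- do_work() is a stand-in for code representative of your own
programs, and a real profiler is better still:

#include <stdio.h>
#include <time.h>

#define OUTERLOOP 10000
#define COUNT 1000

static int ary[COUNT];

/* Stand-in: replace with code indicative of the work you do. */
static void do_work(void)
{
    int i;
    for (i = COUNT; i-- > 0; )
        ary[i] = i;
}

int main(void)
{
    clock_t start, stop;
    long rep;

    start = clock();
    for (rep = 0; rep < OUTERLOOP; rep++)
        do_work();
    stop = clock();

    printf("%.2f seconds\n", (double)(stop - start) / CLOCKS_PER_SEC);
    return 0;
}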


