count of bits set in a long

Tue Oct 2 05:28:28 AEST 1990

I think I have finally figured out how the first part of that bit count routine
is working.  I describe the algorithm as follows:

	1.	Establish a count in each 3 bit field of the number of bits set
		in the corresponding field of the input number.
	2.	Successively "double up": combine pairs of fields while summing
		their contents.
	3.	Sum up the contents of all the fields using a modulo operation.

We have to start with a 3 bit field, because a 3 bit counter can count
the bits in a 6 bit field, but a 2 bit counter can't count the bits in
a 4 bit field.  We have to "double up" at least once, because a 3 bit
counter can't count the bits in a 32 bit word, while a 6 bit counter
can.  NOTE: with a 64 bit word, we would have to "double up" at least TWICE.

We have two implementation decisions - how many times to "double up",
and whether to do the modulo operation directly or with a loop as
proposed by tac at cs.brown.edu.  The loop is favored if '%' is expensive, as
suggested by chris at mimsy.umd.edu, or if high order bits are usually not
set.

If the loop is used, it could pay to double up an extra time so fewer
iterations are needed.  (Finding x%077 requires log to the base 63 of
x, or ln(x)/ln(63), iterations.  x%07777 requires ln(x)/ln(07777)
iterations, or about half as many).  

If we double up four times, no modulo operation is needed at all.

For example...

bitcount4(unsigned long mask)
{	unsigned long y;

	y = (mask >> 1) &033333333333L;
	y = mask - y - ((y >>1) & 033333333333L);
			/* each 3 bit field now contains the count of 
				bits set in the corresponding field of 'mask' */
	y = ((y >> 3) + y) & 030707070707L;		/* double up */
			/* each 6 bit field now contains the count of 
				bits set in the corresponding field of 'mask' */
	y = ((y >> 6) + y) & 007700770077L;		/* double up 2nd time */
			/* each 12 bit field now contains the count of 
				bits set in the corresponding field of 'mask' */
	return y % 07777;
}

I coded up six variations in all using Turbo C: 

bitcount3: naive routine using (y>>1) 32 times
bitcount1: doubling up once, finding y%077 using loop in separate function
bitcount0: doubling up once, finding y%077 using loop 
bitcount5: doubling up 4 times, no modulo operation needed
bitcount4: doubling up twice, finding y%07777 directly
bitcount2: doubling up once, finding y%077 directly

The relative execution times on a '386, as recorded by the Turbo Profiler, are:

_bitcount3         2.14 sec  22% |**************************
_bitcount1+_mod63  0.80 sec   7% |********
_bitcount0         0.64 sec   6% |*******
_bitcount5         0.38 sec   3% |****
_bitcount4         0.35 sec   3% |****
_bitcount2         0.15 sec   1% |*

For this architecture, neither extra doublings nor use of the loop
appear helpful.  I seem to remember that an 8088 needs many more cycles
for an integer division, suggesting that the loop might help there.

                         - Jim Van Zandt