Comparing strings... (long but maybe worth it)

Sun Oct 14 11:21:17 AEST 1990

In article <2205.271700c2 at cc.nu.oz.au>, v8902058 at cc.nu.oz.au writes:

> 	I'm sorry if this is a trival task, but I would like a function that
> compares two strings.  I would like to be able to find if a string is equal to,
> less than, or greater than another string.

Usually I don't answer questions that look like homework assignments.
I can't resist this one, though, seeing that someone else has already
posted an answer and also that it gives a nice opportunity to explore
some programming techniques that some people may find unfamiliar.

The first thing to do is understand the problem thoroughly.  As stated,
we do not know what a string is, or what it means for one string to be
less than another.

First let's define a string as a null-terminated character array
represented by a pointer to the initial element of the array.  Note
that this definition precludes treating a null pointer as a string;
the program we write will therefore fail if either of its arguments
is a null pointer or if either of the strings is not null-terminated.

Next, we define an ordering relation on strings.  Two strings must
be equal, of course, if all their characters are equal.  What if
they aren't?  Then there are two cases:

	(a) one string is a prefix of the other.  In this case
	we will define the shorter string as smaller.

	(b) neither string is a prefix of the other.  In this case
	we can scan the strings left to right until we find a
	character in one string that is not equal to the corresponding
	character in the other; comparing these two characters
	will determine which string is smaller.

This definition corresponds to the usual notion of lexical ordering.
For example, the null string is the smallest string of all, "a" is
less than "aa", which in turn is less than "ab", and so on.

This definition has an important property: if you pick a character
and append that same character to the BEGINNING of each of two
strings, you will not affect how the strings compare.  So, for
example, because "a" is less than "aa", "xa" is less than "xaa".

This fact is easy to prove.  If the strings started out equal,
they will still be equal after the transformation.  If not, you
can go through both cases (a) and (b) above and easily convince
yourself that whichever one applies, adding the extra character
will not change the result.

That suggests the following algorithm:

	(1) If both strings are null, they're equal.

	(2) Otherwise, if one string is null and the other one
	    isn't, the null string is shorter.

	(3) Otherwise, both strings are non-null; if their initial
	    characters are unequal, the result of that comparison
	    is the result of the algorithm.

	(4) Otherwise, we chop the initial character off both strings
	    and recursively apply the algorithm on the remainders.

This is easy to translate into a C program.  We assume the constants
LESS, EQUAL, and GREATER are defined elsewhere.  We return LESS if
the first argument is less than the second, GREATER if the first argument
is greater, and EQUAL if they're equal:

	int compare (char *p, char *q)
	{
		if (*p == '\0' && *q == '\0')
			return EQUAL;
		if (*p == '\0' && *q != '\0')
			return LESS;
		if (*p != '\0' && *q == '\0')
			return GREATER;
		if (*p < *q)
			return LESS;
		if (*p > *q)
			return GREATER;
		return compare(p+1, q+1);
	}

We now have a program that accurately reflects the definition, and
that may be good enough!  However, let's assume that we want the
program to run faster.  A dangerous assumption, but it's a wretched,
miserable, rainy night out and I want to have a little fun.

First, let's look at the first two comparisons.  We're comparing
p to '\0' twice.  That's silly; we already know the answer from
the first time.  Let's factor those two comparisons:

		if (*p == '\0') {
			if (*q == '\0')
				return EQUAL;
			return LESS;
		}
		/* ... */

Next we look at the third comparison and realize we know half the
answer to that two: we can only get there if *p != '\0'.  The first
three comparisons now look like this:

		if (*p == '\0') {
			if (*q == '\0')
				return EQUAL;
			return LESS;
		}
		if (*q == '\0')
			return GREATER;

The next trick is to eliminate the recursive function call.  Because
the function call is the last thing in our function, we can simply
replace it by a set of assignments to establish the right initial
conditions followed by a goto up to the top of the function!  Some
C compilers will do it for us, but it's more fun this way:

	int compare (char *p, char *q)
	{
	top:	if (*p == '\0') {
			if (*q == '\0')
				return EQUAL;
			return LESS;
		}
		if (*q == '\0')
			return GREATER;
		if (*p < *q)
			return LESS;
		if (*p > *q)
			return GREATER;
		p++;
		q++;
		goto top;
	}

Nobody likes goto statements these days, so let's replace this
one by a `while' loop.  It's easy to do this by moving the body
of the first `if' outside the loop and making *p == '\0' the
subject of the loop.  Of course, we have to remember to invert
the sense of this test.  While we're at it, we can swap the order
of the to comparisons of *p and *q to merge the two tests
that might result in returning GREATER:

	int compare (char *p, char *q)
	{
		while (*p != '\0') {
			if (*q == '\0' || *p > *q)
				return GREATER;
			if (*p < *q)
				return LESS;
			p++;
			q++;
		}
		if (*q == '\0')
			return EQUAL;
		return LESS;
	}

We can take advantage of the fact that if we use an integral
value by itself in the comparison, that implicitly compares it
to zero (which is identical to '\0').  That shortens the program
slightly.  We will also make p and q into register variables, and
finally note that the comparison `*p > *q' can be merged with the
increment operations on p and q (because if the `return' is
executed, the values of p and q no longer matter).  This makes the
program significantly faster on some machine architectures and
compilers:

	int compare (register char *p, register char *q)
	{
		while (*p) {
			if (!*q || *p > *q)
				return GREATER;
			if (*p++ < *q++)
				return LESS;
		}
		if (!*q)
			return EQUAL;
		return LESS;
	}

Finally, we can merge the two `return' statements at the end into
a single ?: operator:

	int compare (register char *p, register char *q)
	{
		while (*p) {
			if (!*q || *p > *q)
				return GREATER;
			if (*p++ < *q++)
				return LESS;
		}
		return *q? LESS: EQUAL;
	}

By this time, the program should be thoroughly incomprehensible,
even though it was obtained by a series of straightforward steps
from the original problem definition!  As an exercise, you might
try proving that the program works in its current form, or finding
out where I goofed if it doesn't work.

In `Structured Programming with Goto Statements,' Don Knuth said
he hoped for a system that he would tentatively name UTOPIA84
(because he thought someone might be able to get it done by
1984 -- this article was published in 1974) that would allow
people to perform these kinds of program transformations mechanically
while having the system check that correctness was being maintained.
He didn't want the system to do these things automatically --
he felt (and I agree) that it is only worth doing this kind of
optimization after you've measured a working program to see that
it is actually worthwhile.

I imagine that further optimizations are possible.  Anyone else
want to give it a try?
-- 
				--Andrew Koenig
				  ark at europa.att.com