compare strings, strcmp

Richard O'Keefe ok at mudla.cs.mu.OZ.AU
Sat Nov 18 18:08:30 AEST 1989


In article <214 at isgtec.UUCP>, robert at isgtec.UUCP (Robert Osborne) writes:
> In article <11605 at smoke.BRL.MIL> gwyn at brl.arpa (Doug Gwyn) writes:
> >In article <4463 at blake.acs.washington.edu>
> > jimli at blake.acs.washington.edu (Jimmy Li) writes:
> >>char array[10000][200];
> >>What is the fastest way to sort this array?
> >
> >Dump it into an external file and invoke the system sort utility on the file.
> This is a REALLY bad way of sorting an array this big.  ... He asked
> for the FASTEST way, this shouldn't even have been given as an option.

Oh ah, tried it, have you?  In fact it is not a REALLY bad way of sorting.
Depending on whether you have virtual memory, how much of it you have,
what the page replacement algorithm is, how much real memory you have,
&c &c sorting using an external file with a system sort utility may end
up doing FEWER I/O operations than an internal sort.  In fact in UNIX
systems sort(1) uses an algorithm (merging) which is intrinsically faster
than qsort(3)'s "quicker-sort" algorithm, but alas, the point at which it
switches over to merging depends on memory size &c &c &c.  

A general point about questions like this:  any answer we give is likely
to be an answer to the wrong question.  I find myself wondering why anyone
wants to sort char array[10000][200].  Why exactly those numbers?  Are the
records really fixed length?  How does anyone come by 200 characters
without any substructure?  If there is substructure, why not declare the
array more explicitly and exploit the substructure in the sorting
algorithm?  For example, suppose that the records contain a field which
is, for argument's sake, a date YYMMDD.  By setting up a table of 366
buckets (one for each day of the year) the records could be distributed
among the buckets in O(N) time -- where N is the number of records --
and then the buckets could be sorted using a list merge on the rest of
the fields.  This would reduce the cost from O(N.lgN) to O(N.lg(N/366)),
about a factor of 2.8 faster for N = 10000.  Without knowing what Jimmy
Li is really doing,
we might give bad advice.  Indeed, the very best way of ensuring that a
table is sorted is to generate it in sorted order in the first place;
without knowing what he is up to I can't tell whether the "null algorithm"
can be used for his problem or not.

There is another extremely important reason why Doug Gwyn's suggestion is
a REALLY good one even if it proves slower in some cases: the system sort
utility provides you with a mini-language for specifying comparisons,
which can make it a lot easier to get your sort right.
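For instance, in the modern POSIX spelling (the sort(1) of 1989 used the
older +pos -pos key notation for the same thing), a multi-key sort over
invented colon-separated records takes one line -- no comparison
function to write or debug:

```shell
# Hypothetical records: name, YYMMDD date, amount, colon-separated.
printf '%s\n' 'smith:891118:42' 'jones:890101:7' 'adams:891118:13' > recs
# Sort on the date field, then numerically on the amount:
sort -t: -k2,2 -k3,3n recs
```

Getting the same two-key, mixed text/numeric ordering out of a
hand-written comparator is exactly the sort of thing that is easy to
get subtly wrong.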



More information about the Comp.lang.c mailing list