Algorithm needed: reading/writing a large file

A. Lester Buck buck at siswat.UUCP
Fri Jul 14 01:41:33 AEST 1989


In article <207 at larry.sal.wisc.edu>, jwp at larry.sal.wisc.edu (Jeffrey W Percival) writes:
> In article <8137 at bsu-cs.bsu.edu> dhesi at bsu-cs.bsu.edu (Rahul Dhesi) writes:
> >In article <205 at larry.sal.wisc.edu> jwp at larry.sal.wisc.edu (Jeffrey W Percival) writes:
> >[how do I sort a large file that won't fit in memory?]
> >There are many variations on the merge sort.  Here is a simple one:
> 
> Please be careful with your paraphrasing.  I did not ask how to sort a
> large file that won't fit in memory.  I said I already had a fast and
> efficient sort and that the sorting was already known.  My question was
> about optimizing the process of rearranging a disk file according to a
> *given* mapping.
> -- 
> Jeff Percival (jwp at larry.sal.wisc.edu)

I was one who emailed about basic external sorting.  I don't understand what
you are saying above.  If your problem is to sort a large file that is on
external storage, then the classical methods are what you want.  If you
somehow have a problem where you are handed the already sorted list of
pointers to records and you want to go from that stage, that is _maybe_ a
different problem - I don't know, because that sounds extremely contrived.
It is pointed out in many places, including Knuth Vol. 3, that it gains you
next to *nothing* to get a sorted table of pointers to records when what you
almost always want is the records themselves sorted.  "Fast and efficient
sorts" that don't order the records themselves are not of much advantage.
You are just delaying the real pain of moving the records.
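
To make the distinction concrete, here is a minimal keysort sketch in C,
assuming fixed-length records with the key at the front (the sizes and
layout are made up for illustration).  Notice that after the qsort() all
you hold is a sorted table of (key, offset) pairs - the records
themselves have not moved an inch:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define RECSIZE 512     /* assumed fixed record length */
    #define KEYLEN  16      /* assumed key at front of each record */

    struct tag { char key[KEYLEN]; long off; };

    static int cmp(const void *a, const void *b)
    {
        return memcmp(((const struct tag *)a)->key,
                      ((const struct tag *)b)->key, KEYLEN);
    }

    /* Read up to max records sequentially, building and then
     * sorting the (key, offset) table entirely in core. */
    long keysort(FILE *fp, struct tag *tab, long max)
    {
        char rec[RECSIZE];
        long n = 0;

        while (n < max && fread(rec, RECSIZE, 1, fp) == 1) {
            memcpy(tab[n].key, rec, KEYLEN);
            tab[n].off = n * (long)RECSIZE;
            n++;
        }
        qsort(tab, (size_t)n, sizeof(struct tag), cmp);
        return n;
    }

The sequential read and the in-core sort are the cheap part; the file is
still in its original order when this returns.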

From Knuth Vol. 3, p. 374:

    The external rearrangement problem which remains after [key sorting]
    seems trivial, at first glance; but in fact it is rather difficult,
    and no really good algorithms (significantly better than sorting)
    have yet been found.  We could obviously do the rearrangement in
    N steps, moving one record at a time; for large enough N this is
    better than the N log N of a sorting method.  But N is never that
    large, and N seeks are unthinkable!
    [...]
    But it seems wasteful to do a key sort only to follow it by a
    full sort! 
    [Goes on to discuss a nontrivial lower bound on the number
    of seeks required to rearrange records on a disk file.]

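To see why the seeks dominate, here is what the obvious N-step
rearrangement looks like in C.  This is only a sketch: it assumes the
fixed-length layout above, that the whole permutation fits in core, that
fp was opened for update ("r+"), and that src[i] names the slot
currently holding the record that belongs at position i (i.e. the output
of the keysort).  It follows the cycles of the permutation, so each
record is read and written exactly once - but nearly every move costs
two random seeks:

    #include <stdio.h>

    #define RECSIZE 512

    void rearrange(FILE *fp, long *src, long n)
    {
        char hole[RECSIZE], rec[RECSIZE];
        long i, j, k;

        for (i = 0; i < n; i++) {
            if (src[i] == i)
                continue;                    /* already in place */
            fseek(fp, i * (long)RECSIZE, SEEK_SET);
            fread(hole, RECSIZE, 1, fp);     /* open a hole at slot i */
            for (j = i; src[j] != i; j = k) {
                k = src[j];
                fseek(fp, k * (long)RECSIZE, SEEK_SET);
                fread(rec, RECSIZE, 1, fp);  /* record that belongs at j */
                fseek(fp, j * (long)RECSIZE, SEEK_SET);
                fwrite(rec, RECSIZE, 1, fp);
                src[j] = j;                  /* slot j is settled */
            }
            fseek(fp, j * (long)RECSIZE, SEEK_SET);
            fwrite(hole, RECSIZE, 1, fp);    /* close the cycle */
            src[j] = j;
        }
    }

Roughly 2N random seeks is exactly the cost Knuth calls unthinkable; for
any realistic N you do better reading the file sequentially and sorting
it outright.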

The only real gain of sorting the keys and not the records themselves is
when you _don't_ have to move the data, as in an indexed file application.
You still have to physically sort the records once if you require reasonable
sequential access.

I remember reading once that an estimated 40% of all mainframe cycles in the
world were spent on sorting.  I am sure we would all be interested in
hearing about any special case you might have where your keysort is
completely free, but I doubt whether you will be able to do much better than
the "classical" sort/merge with replacement selection.  If you do end up
with a good method, you have a highly publishable (and patentable!) piece of
work.
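
For reference, here is the run-formation half of that classical method,
sketched in C on bare long keys.  next_key(), start_run(), and put() are
hypothetical I/O hooks, not any real library's interface.  The point of
replacement selection is that on random input the expected run length is
twice the heap size, so the merge phase sees few, long runs:

    #define HEAPSIZE 4096

    extern int  next_key(long *);   /* hypothetical: read next input key */
    extern void start_run(int);     /* hypothetical: begin output run r  */
    extern void put(long);          /* hypothetical: append key to run   */

    struct node { int run; long key; };

    static struct node heap[HEAPSIZE];
    static int nheap;

    /* Heap is ordered by (run, key), smallest at the root. */
    static int before(struct node *a, struct node *b)
    {
        return a->run < b->run || (a->run == b->run && a->key < b->key);
    }

    static void sift(int i)         /* restore heap order from i down */
    {
        struct node t;
        int c;

        while ((c = 2 * i + 1) < nheap) {
            if (c + 1 < nheap && before(&heap[c + 1], &heap[c]))
                c++;
            if (!before(&heap[c], &heap[i]))
                break;
            t = heap[i]; heap[i] = heap[c]; heap[c] = t;
            i = c;
        }
    }

    void form_runs(void)
    {
        long key, last;
        int run = -1, i;

        while (nheap < HEAPSIZE && next_key(&key)) {
            heap[nheap].run = 0;    /* prime the heap from the input */
            heap[nheap].key = key;
            nheap++;
        }
        for (i = nheap / 2 - 1; i >= 0; i--)
            sift(i);

        while (nheap > 0) {
            if (heap[0].run != run)          /* previous run exhausted */
                start_run(run = heap[0].run);
            last = heap[0].key;
            put(last);                       /* emit smallest key of run */
            if (next_key(&key)) {            /* replace the emitted root */
                heap[0].key = key;
                heap[0].run = (key >= last) ? run : run + 1;
            } else
                heap[0] = heap[--nheap];
            sift(0);
        }
    }

Merging the runs is the easy half; a k-way merge driven by another small
heap finishes the job with purely sequential I/O on each run.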


-- 
A. Lester Buck		...!texbell!moray!siswat!buck


