general hash functions for text: SUMMARY!

Fri Oct 19 14:02:34 AEST 1990

Hello!

  Well, once again I'd like to thank all who sent me suggestions.
I'll now post the results of my tests so far.  I have only tested
those functions which were easily adapted to my existing problem,
I'll be getting to the more specialized ones later.  Btw, my
problem involves storing (and later indexing) the vocabulary read
from a database of the Montreal Gazette, an average (:>) Montreal
daily.  I have tested about eight of the functions suggested on
roughly 10Mb of the database - about 30,000 distinct words.  The
size of ROOTLEN (the array of pointers to the structs I'm using)
is either 64997 (a prime!) for algorithms that require primes for
a modulus, or 65536 for routines needing a power of two.  I think
these are close enough to give a fair test.

Thank you to:

From: Larry Watanabe <watanabe at cs.uiuc.edu>
From: raymond at math.berkeley.edu (Raymond Chen)
From: dmorin at wpi.wpi.edu (Duane D Morin)
From: Mark Pledger <cti1!mpledger at uunet.UU.NET>
From: drh at cs.duke.edu (D. Richard Hipp)
Message-Id: <656143040.AA19603 at blekko.commodore.com>    (sorry! :)
From: mccaugh at suna0.cs.uiuc.edu
From: jeff at kfw.com (Jeff Henkels)
From: afoiani at NMSU.Edu
From: skrenta at cbmvax.cbm.commodore.com (Rich Skrenta)
From: mike at cosmos.acs.calpoly.edu (Mike Maughmer)
From: brnstnd at KRAMDEN.ACF.NYU.EDU (Dan Bernstein)
From: Paul John Falstad <pfalstad at phoenix.Princeton.EDU>
From: lfd at lcuxlq.att.com (Leland F Derbenwick)
From: Alan J Rosenthal <flaps at dgp.toronto.edu>
From: ath at prosys.se (Anders Thulin)
From: gaynor at paul.rutgers.edu (Silver)
From: chris at mimsy.umd.edu (Chris Torek)
From: macphee at convex.COM (Scott C. Mac Phee)

The functions (as adapted to my needs) are: 

/**** from _Aho, Sethi, Ullman_'s dragon book ****/

unsigned long hash(char *word) {

  unsigned long h, g;

  for (h = 0; *word; word++) {
    h = (h << 4) + (*word);
      if (g = (h & 0xf0000000)) {
	h ^= (g >> 24);
	h ^= g;
	}
    }
  return h % ROOTLEN;

} /* end of hash */

/**** from Sedgewick's _Algorithms_ ****/

unsigned long hash(char *word) {

  unsigned long h;

    h = *word++;
    while (*word)
      h = (h * 128 + *word++) % ROOTLEN;
    return h;

} /* end of hash */

/**** from Gosling's emaccs ****/

unsigned long hash(char *word) {

  unsigned long h = 0;

  while (*word)
    h = (h << 5) - h + *word++;

  return h & ROOTLEN;           /* ROOTLEN must be a power of 2, ie. 2^16 */

} /* end of hash */

The function from Gosling's emaccs nosed the others out by maybe one
percentage point on the ratio of collisions/probes.  Note that these
are general hashing functions for TEXT, more specialized ones (I'm
told) will likely have a better performance.  Also remember that I 
tested these on full newspaper articles, which may be different from
what you need...  I have also modified what I got for clarity, so no
doubt you can compress the way these functions are represented.

Anyway, thanks again, and I hope someone can make use of this summary!

				Cam

--
      Cameron Shelley        | "Armor, n.  The kind of clothing worn by a man
cpshelley at violet.waterloo.edu|  whose tailor is a blacksmith."
    Davis Centre Rm 2136     |
 Phone (519) 885-1211 x3390  |				Ambrose Bierce