Generating Hash Codes

To undestand how to best override the default implementation of the hashCode() method to generate hash codes for classes of our own design, it will be instructive to see how the developers of Java have done the same in some of its existing, well-known classes:

`Integer.hashCode(), Short.hashCode(), Byte.hashCode() & Character.hashCode()`

The hash code generated by the Integer class is trivial (it just returns the value of the integer). Similarly, the Short, Byte, and Character wrapper classes each simply return their underlying value cast to an int. 'Nuff said.

`Long.hashCode() & Double.hashCode()`

Recall that long values are stored with 64 bits. The hashCode() method for this class will thus need to reduce this by half, down to 32 bits, since it must produce something of type int.

We avoid simply outputing the first 32 bits as an integer, as if the context at hand doesn't deal with either negative or really large integers (i.e., over $2^{31}$) these bits may often contain several zeros (although not as many as you might think -- consult IEEE-754 encoding of floating point values for more details, and consider how and where the "exponents" are stored in this format). Still, disregarding any bits increases the chance for duplicated inputs to the hash function to occur and thus drives up the number of potential collisions.

To better distribute the keys in the array, it is best for this and all classes, to use all of the bits of the object in question in calculating the hash code. In this way, any change to the underlying object will more likely result in a change to the hash code produced. As such, we combine the first 32 bits with the last 32 bits of our 64-bit double with an bit-wise xor operator^†, as shown below.^††

public final class Long {

   private final long value;

   ...

   public int hashCode() {
      return Long.hashCode(value);
   }

   ...

   public static int hashCode(long value) {
      return (int) (value ^ (value >>> 32));
   }

   ...
}

To understand the last statement above, recall first that ">>>" is the "unsigned right-shift" operator in Java.

When applied to the long value in question above, 32 zeroes are pushed to the front of the bitstring for this value, with the same number of digits falling off the tail end, and the new bitstring that results is re-interpreted as a new long value.

The use of the bit-wise xor operator, "^", then combines the last 32 bits of the long value with the first 32 bits of the long value.

The operation of constructing the hash code is shown below for the long value x = 1234567890098765432L. The last 32 bits appear in dark red, while the first 32 bits are shown in light red. Notice how these line up vertically as a result of the aforementioned unsigned right-shift. The result of combining these right-most 32 bits via the bit-wise xor is shown in black.

  0001000100100010000100001111010001111100011100001011111001111000  \\ x 
^ 0000000000000000000000000000000000010001001000100001000011110100  \\ x >>> 32 
------------------------------------------------------------------
  0001000100100010000100001111010001101101010100101010111010001100  \\ x ^ (x >>> 32)

                                  01101101010100101010111010001100  \\ (int) (x ^ (x >>> 32))

The first 32 bits of $x$ are similarly xor'd with 32 zeros, as shown in gray above -- but the resulting bits are discarded as a result of the cast to an int.

Note, we use xor as opposed to or, as on average or produces more bits of 1 than bits of 0 (think about the 4 possible cases for "or-ing" two bits). Using or would then keep the distribution of keys from being uniform, which would cause more collisions. Using and would bias things in the opposite direction, towards hash codes that have fewer ones -- again causing more collisions.

In this way, any change to the long value will change at least one bit (e.g., either a light or dark red bit above), which in turn changes a (black) bit of the resulting hash code for the Long object due to the xor operator.

As double values are also stored with 64 bits, the Double wrapper class takes advantage of this exact same calculation to compute the output of its hashCode() method.

`Boolean.hashCode()`

With only two possible keys, true and false -- one wonders why the Boolean class even has a hashCode() method. The short answer is that when we turn our attention to creating hash codes for classes of our own design, we will cleverly combine the hash codes associated with all the instance variables associated with that class. As some of these instance variables might be boolean values, the Boolean class will need a hashCode() method to invoke.

Still -- as you might have guessed -- this method is not complicated at all. If the boolean value is true we return one number, if not we return another. This can be accomplished with a quick application of a conditional operator, as seen below.

public final class Boolean {

   private final boolean value;

   ...

   public int hashCode() {
      return hashCode(value);
   }

   ...

   public static int hashCode(boolean value) {
      return value ? 1231 : 1237;
   }

   ...
}

Note the unusual choice of the two possible outputs that could result. Why aren't more simple outputs like $1$ and $2$ used? Why use $1231$ and $1237$?

The key is that these two values are both large primes.

Why primes? Remember that the result of the hashCode() method (made positive by our previous and-based trick) is expected to be reduced down to an appropriate array index using the mod operator and the array size $m$. If the outputs were composite, say $1000$ and $2000$, note what happens when the array size is a common factor of these composite numbers:

if $m=8$, the resulting array indices, (1000 % 8) and (2000 % 8), collide
if $m=10$, the resulting array indices, (1000 % 10) and (2000 % 10), collide
if $m=20$, the resulting array indices, (1000 % 20) and (2000 % 20), collide

As can be seen, there can be a lot of collisions when the outputs are composite and can share factors. We avoid this undesirable result by keeping them prime.

Why large primes? Here again, when creating hash codes for classes of our own design, it is common to add the hash codes for the different instance variables together. If these values are small, their sum is small -- which risks an uneven distribution of places used in the hash table.

`String.hashCode()`

Before we discuss how hash codes are found for , let us take a moment to discuss something that -- on the surface -- appears completely unrelated...

Suppose one wishes to convert a base 16 value to its base 10 equivalent. Remembering that the digits represent powers of 16, we could go about this in the following, very direct, way:

$$(d_n d_{n-1} \cdots d_0)_{16} = d_n \cdot {16}^n + d_{n-1} + d_{n-1} \cdot {16}^{n-1} + \cdots d_0 \cdot {16}^0$$ A concrete example might be more illustrative. Consider the base 16 value 2AC3F. Recalling that in base 16 that $$A=10, B=11, C=12, D=13, E=14, \textrm{ and } F=15$$ we can find the corresponding base 10 value with the following calculation: $$2AC3F_{16} = 2 \cdot {16}^4 + 10 \cdot {16}^3 + 12 \cdot {16}^2 + 3 \cdot {16} + 15$$ Note how doing this calculation directly will require only $4$ additions, but $ 4 + 3 + 2 + 1 = 10$ multiplications. Having seen sums like this before, it should not be surprising that In general, an $n$ digit number will require $n(n-1)/2 \sim n^2/2$ multiplications.

We can improve on this, however. Notice how all but the last term on the right has a factor of $16$. Suppose we factor these out of the first 4 terms to get: $$16 \cdot (2 \cdot 16^3 + 10 \cdot 16^2 + 12 \cdot 16 + 3) + 15$$ This drops the number of multiplications required down to $1 + (3 + 2 + 1) = 7$ and doesn't alter the number of additions. Of course, seeing all of multiplications needed for the powers of $16$ that still remain, we could apply this strategy a second time to obtain $$16 \cdot (16 \cdot (2 \cdot 16^2 + 10 \cdot 16 + 12 \cdot) + 3) + 15$$ which reduces the number of multiplications down to $5$. Still, we can do this one more time to get which involves only $4$ multiplications and the same number of additions with which we began. $$16 \cdot (16 \cdot (16 \cdot (16 \cdot 2 + 10) + 12) + 3) + 15)$$ This is quite a bit better than the 10 multiplications with which we started. More generally, if we had started with a base 16 value of $n$ digits we can reduce the number of multiplications needed to convert it to its base 10 equivalent down to $n-1$ additions and $n-1$ multiplications!

Further, had the values $2$, $10$, $12$, $3$, and $15$ been stored in some array, the code to evaluate the value shown would be trivial:

int n = 0;
for (int i = 0; i < a.length(); i++) 
   n = 16 * n + a[i];

Of course, there is nothing special about base 16, we could have replace it with any base $b$ that we wish -- and the algorithm above will still produce the correct answer with a minimal number of calculations.

This highly efficient method of evaluation is known as Horner's Method, and lies at the root of how to calculate hash codes for strings in Java.

Recall every char stored in a string has some integer value associated with it (e.g. 'A'==65, 'B'=66, etc). In this way, a string -- a "list" of char values -- can be made to function just like the array of integer values in the example above. We can think of these characters as the "digits" of some (potentially giant) number in some base $b$, and then convert them to some base $10$ integer using Horner's method.

Below is code equivalent to that used in the String class:^‡

h = 0;
for (int i = 0; i < length; i++) 
    h = 31 * h + s.charAt(i);        // note we have used a "base" of 31 here instead of 16

Two things bear mentioning about the above:

One should be aware that for strings of even moderate length, the integer value that would result if you did the calculation by hand would be gigantic -- well beyond the maximum value that an int variable can hold. However, in practice this is not a difficulty. As was mentioned when the hashCode() was first introduced, Java's mechanism for storing integers (i.e., Two's Complement) effectively curves the number line into a circle -- connecting the maximum value to the minimum. Thus when $h$ exceeds Integer.MAX_VALUE things just "wrap around". $h$ is then likely very negative, but continues to accumulate normally from that point forward (until another "wrap" is required).
The choice of $31$ in the method above likely catches one's eye. Why use this value? Part of the reason is that this value is prime, which reduces the potential for many collisions -- as already mentioned in the discussion of the Boolean.hashCode() method above.

As for why we use $31$ and not some smaller prime comes down to performance. As Joshua Bloch, who led the design and implementation of the Java Collections Framework (among other things) points out:
If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting... A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

The mathematically-inclined may also shudder at the use of $31$ as the "base" when the "digits" (i.e., the integer values of the corresponding char values) are potentially so much larger. For example, in base 16, we certainly never have a single digit equivalent to 65 (i.e., the int value of 'A'). Indeed, the maximum digit in any base $b$ is $b-1$. This is so that every value written in base $b$ has a unique representations. However, uniqueness of representation is not a concern in this context. Remember, we are only using Horner's method to combine all of the information in the string at hand (i.e., its characters) to generate a hash code value.

As one final note about the String class -- one that might be useful to consider as we develop hashCode() methods for our own classes -- the hash code generated is cached upon creation. That is to say, the String class keeps an int instance variable called hash, initially set to $0$ -- which upon the first call to hashCode() is set to the value generated by the process described above. Then, in subsequent calls to hashCode(), the value stored in hash is immediately returned instead of doing the entire calculation again.

One can imagine for large strings, this saves a considerable amount of computation -- however, one should be aware the use of this strategy is intimately tied to the immutability of the class. If string objects were not immutable, we would need to recalculate the hash code every time their content changed.

User-Defined Types

With all the lessons the above classes afford, we now turn our attention to how to override hashCode() in classes of our own design.

In every one of the classes just discussed, even the tiniest change in the underlying value (e.g., changes to individual characters, or even individual bits) likely caused a change in the hash code generated. This helps uniformly distribute the keys and consequently reduce collisions.

We similarly wish to make the hash code generated in classes of our own design sensitive to any changes in the data they store -- that is to say, changes to the values of their instance variables.

Noting that different classes we design may have different numbers of instance variables, just as different strings have different numbers of characters, the hashCode() method of the String class provides a model to follow.

Upon finding hash codes for each instance variable value associated with a given object, we can use these to generate a hash code for entire object in the same way that we used the integers associated with the characters of a given string to generate a hash code for the entire string. In short, we can use Horner's method with a "base" of 31 to combine all of these integers into a single integer. The following provides a concrete example:

public final class Transaction {

  private final String    who;
  private final Date      when;
  private final double    howMuch;
  private final Notes     additionalDetails;
  private final Person[]  signatories;
  ...

  public int hashCode() {

    int h = 17;

    h = 31 * h + who.hashCode();
    h = 31 * h + when.hashCode();

    h = 31 * h + ((Double) howMuch).hashCode();  // We convert to the wrapper class Double
                                                 // here, as howMuch was of a primitive type
                                                 // and thus has no hashCode() method 
                                                 // by itself to draw upon
    
    if (additionalDetails != null)                  // The above calculations assume 
       h = 31 * h + additionalDetails.hashCode();   // the references for who and when  
                                                    // were not null. Perhaps this is a 
                                                    // reasonable assumption for this class
                                                    // But sometimes, reference variables
                                                    // can be null -- and trying to access
                                                    // the instance method (or any instance
                                                    // method for that matter) using a null
                                                    // reference will produce an error. 
                                                    // As such, you need to check first!

    for (int i = 0; i < signatories.length; i++)   // We need to use ALL of the information
       h = 31 * h + signatories[i].hashCode();     // in the object.  So when we have an  
                                                   // array, we use the same strategy 
    return h;                                      // applied to EVERY entry in the array.
}

Note, the values $31$ and $17$ used above can certainly be replaced with other values, as desire or context dictates. Although choosing odd prime values, as was seen in the Boolean and String classes, will generally be a good idea.

Lastly, remember that if the computation of the hash code is expensive in terms of time or memory space, we can always apply the same trick that the String class employs, and cache the hash code generated. As was mentioned in the last section though, one needs to be mindful of the mutability of the class when caching any hash codes computed.

^†As a reminder of how the xor operator works, for bits $x$ and $y$, $x \wedge y$ equals $1$ if exactly one of $x$ or $y$ is $1$, and equals $0$ otherwise.

^†† You can easily find this code for yourself, if desired. In Eclipse, just click on the "JRE System Library" in your Package Exploer, and then navigate to the Long class by expanding "java.base", followed by "java.lang", "Long.class", and finally "hashCode():int" or "hashCode(long):int". One can similarly find the source code for any other class in the Java libraries, as desired.

^‡ Technically, the String class uses the hashCode() method of the StringUTF16 class, where similar code is found -- except that this method works on an array of bytes, and consequently uses getChar(value,i) instead of charAt(i).