To undestand how to best override the default implementation of the `hashCode()`

method to generate hash codes for classes of our own design, it will be instructive to see how the developers of Java have done the same in some of its existing, well-known classes:

`Integer.hashCode(), Short.hashCode(), Byte.hashCode() & Character.hashCode()`

The hash code generated by the `Integer`

class is trivial (it just returns the value of the integer). Similarly, the `Short`

, `Byte`

, and `Character`

wrapper classes each simply return their underlying value cast to an `int`

. 'Nuff said.

`Long.hashCode() & Double.hashCode()`

Recall that `long`

values are stored with 64 bits. The `hashCode()`

method for this class will thus need to reduce this by half, down to 32 bits, since it must produce something of type `int`

.

We avoid simply outputing the first 32 bits as an integer, as if the context at hand doesn't deal with either negative or really large integers (i.e., over $2^{31}$) these bits will often all be zero -- resulting in a massive number of collisions.

To better distribute the keys in the array, it is best for this and all classes, to use all of the bits of the object in question in calculating the hash code. In this way, any change to the underlying object will more likely result in a change to the hash code produced. As such, we combine the first 32 bits with the last 32 bits of our 64-bit double with an *bit-wise xor* operator^{†}, as shown below.^{††}

public final class Long { private final long value; ... public int hashCode() { return Long.hashCode(value); } ... public static int hashCode(long value) { return (int) (value ^ (value >>> 32)); } ... }

To understand the last statement above, recall first that "`>>>`

" is the "unsigned right-shift" operator in Java.

When applied to the long value in question above, 32 zeroes are pushed to the front of the bitstring for this value, with the same number of digits falling off the tail end, and the new bitstring that results is re-interpreted as a new long value.

The use of the *bit-wise xor* operator, "`^`

", then combines the last 32 bits of the long value with the first 32 bits of the long value.

The operation of constructing the hash code is shown below for the long value `x = 1234567890098765432L`

. The last 32 bits appear in dark red, while the first 32 bits are shown in light red. Notice how these line up vertically as a result of the aforementioned unsigned right-shift. The result of combining these right-most 32 bits via the *bit-wise xor* is shown in black.

0001000100100010000100001111010001111100011100001011111001111000 \\ x ^ 0000000000000000000000000000000000010001001000100001000011110100 \\ x >>> 32 ------------------------------------------------------------------ 0001000100100010000100001111010001101101010100101010111010001100 \\ x ^ (x >>> 32) 01101101010100101010111010001100 \\ (int) (x ^ (x >>> 32))

The first 32 bits of $x$ are similarly *xor'd* with 32 zeros, as shown in gray above -- but the resulting bits are discarded as a result of the cast to an `int`

.

Note, we use *xor* as opposed to *or*, as on average *or* produces more bits of 1 than bits of 0 (think about the 4 possible cases for "*or-ing*" two bits). Using *or* would then keep the distribution of keys from being uniform, which would cause more collisions. Using *and* would bias things in the opposite direction, towards hash codes that have fewer ones -- again causing more collisions.

In this way, any change to the long value will change at least one bit (e.g., either a light or dark red bit above), which in turn changes a (black) bit of the resulting hash code for the `Long`

object due to the *xor* operator.

As `double`

values are also stored with 64 bits, the `Double`

wrapper class takes advantage of this exact same calculation to compute the output of its `hashCode()`

method.

`Boolean.hashCode()`

With only two possible keys, `true`

and `false`

-- one wonders why the `Boolean`

class even has a `hashCode()`

method. The short answer is that when we turn our attention to creating hash codes for classes of our own design, we will cleverly combine the hash codes associated with *all* the instance variables associated with that class. As some of these instance variables might be boolean values, the `Boolean`

class will need a `hashCode()`

method to invoke.

Still -- as you might have guessed -- this method is not complicated at all. If the boolean value is true we return one number, if not we return another. This can be accomplished with a quick application of a conditional operator, as seen below.

public final class Boolean { private final boolean value; ... public int hashCode() { return hashCode(value); } ... public static int hashCode(boolean value) { return value ? 1231 : 1237; } ... }

Note the unusual choice of the two possible outputs that could result. Why aren't more simple outputs like $1$ and $2$ used? Why use $1231$ and $1237$?

The key is that these two values are both large primes.

Why primes? Remember that the result of the `hashCode()`

method (made positive by our previous *and*-based trick) is expected to be reduced down to an appropriate array index using the mod operator and the array size $m$. If the outputs were composite, say $1000$ and $2000$, note what happens when the array size is a common factor of these composite numbers:

- if $m=8$, the resulting array indices,
`(1000 % 8)`

and`(2000 % 8)`

, collide - if $m=10$, the resulting array indices,
`(1000 % 10)`

and`(2000 % 10)`

, collide - if $m=20$, the resulting array indices,
`(1000 % 20)`

and`(2000 % 20)`

, collide

Why *large* primes? Here again, when creating hash codes for classes of our own design, it is common to add the hash codes for the different instance variables together. If these values are small, their sum is small -- which risks an uneven distribution of places used in the hash table.

`String.hashCode()`

Before we discuss how hash codes are found for , let us take a moment to discuss something that -- on the surface -- appears completely unrelated...

Suppose one wishes to convert a base 16 value to its base 10 equivalent. Remembering that the digits represent powers of 16, we could go about this in the following, very direct, way:

$$(d_n d_{n-1} \cdots d_0)_{16} = d_n \cdot {16}^n + d_{n-1} + d_{n-1} \cdot {16}^{n-1} + \cdots d_0 \cdot {16}^0$$ A concrete example might be more illustrative. Consider the base 16 value 2AC3F. Recalling that in base 16 that $$A=10, B=11, C=12, D=13, E=14, \textrm{ and } F=15$$ we can find the corresponding base 10 value with the following calculation: $$2AC3F_{16} = 2 \cdot {16}^4 + 10 \cdot {16}^3 + 12 \cdot {16}^2 + 3 \cdot {16} + 15$$ Note how doing this calculation directly will require only $4$ additions, but $ 4 + 3 + 2 + 1 = 10$ multiplications. Having seen sums like this before, it should not be surprising that In general, an $n$ digit number will require $n(n-1)/2 \sim n^2/2$ multiplications.We can improve on this, however. Notice how all but the last term on the right has a factor of $16$. Suppose we factor these out of the first 4 terms to get: $$16 \cdot (2 \cdot 16^3 + 10 \cdot 16^2 + 12 \cdot 16 + 3) + 15$$ This drops the number of multiplications required down to $1 + (3 + 2 + 1) = 7$ and doesn't alter the number of additions. Of course, seeing all of multiplications needed for the powers of $16$ that still remain, we could apply this strategy a second time to obtain $$16 \cdot (16 \cdot (2 \cdot 16^2 + 10 \cdot 16 + 12 \cdot) + 3) + 15$$ which reduces the number of multiplications down to $5$. Still, we can do this one more time to get which involves only $4$ multiplications and the same number of additions with which we began. $$16 \cdot (16 \cdot (16 \cdot (16 \cdot 2 + 10) + 12) + 3) + 15)$$ This is quite a bit better than the 10 multiplications with which we started. More generally, if we had started with a base 16 value of $n$ digits we can reduce the number of multiplications needed to convert it to its base 10 equivalent down to $n-1$ additions and $n-1$ multiplications!

Further, had the values $2$, $10$, $12$, $3$, and $15$ been stored in some array, the code to evaluate the value shown would be trivial:

int n = 0; for (int i = 0; i < a.length(); i++) n = 16 * n + a[i];

Of course, there is nothing special about base 16, we could have replace it with any base $b$ that we wish -- and the algorithm above will still produce the correct answer with a minimal number of calculations.

This highly efficient method of evaluation is known as **Horner's Method**, and lies at the root of how to calculate hash codes for strings in Java.

Recall every `char`

stored in a string has some integer value associated with it (e.g. `'A'==65`

, `'B'=66`

, etc). In this way, a string -- a "list" of `char`

values -- can be made to function just like the array of integer values in the example above. We can think of these characters as the "digits" of some (potentially giant) number in some base $b$, and then convert them to some base $10$ integer using Horner's method.

Below is code equivalent to that used in the `String`

class:^{‡}

h = 0; for (int i = 0; i < length; i++) h = 31 * h + s.charAt(i); // note we have used a "base" of 31 here instead of 16

Two things bear mentioning about the above:

One should be aware that for strings of even moderate length, the integer value that would result if you did the calculation by hand would be gigantic -- well beyond the maximum value that an

`int`

variable can hold. However, in practice this is not a difficulty. As was mentioned when the`hashCode()`

was first introduced, Java's mechanism for storing integers (i.e., Two's Complement) effectively curves the number line into a circle -- connecting the maximum value to the minimum. Thus when $h$ exceeds`Integer.MAX_VALUE`

things just "wrap around". $h$ is then likely very negative, but continues to accumulate normally from that point forward (until another "wrap" is required).The choice of $31$ in the method above likely catches one's eye. Why use this value? Part of the reason is that this value is prime, which reduces the potential for many collisions -- as already mentioned in the discussion of the

`Boolean.hashCode()`

method above.As for why we use $31$ and not some smaller prime comes down to performance. As Joshua Bloch, who led the design and implementation of the Java Collections Framework (among other things) points out:

*If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting... A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.*The mathematically-inclined may also shudder at the use of $31$ as the "base" when the "digits" (i.e., the integer values of the corresponding

`char`

values) are potentially so much larger. For example, in base 16, we certainly never have a single digit equivalent to 65 (i.e., the`int`

value of`'A'`

). Indeed, the maximum digit in any base $b$ is $b-1$. This is so that every value written in base $b$ has a unique representations. However, uniqueness of representation is not a concern in this context. Remember, we are only using Horner's method to combine all of the information in the string at hand (i.e., its characters) to generate a hash code value.

As one final note about the `String`

class -- one that might be useful to consider as we develop `hashCode()`

methods for our own classes -- the hash code generated is *cached* upon creation. That is to say, the `String`

class keeps an `int`

instance variable called `hash`

, initially set to $0$ -- which upon the first call to `hashCode()`

is set to the value generated by the process described above. Then, in subsequent calls to `hashCode()`

, the value stored in `hash`

is immediately returned instead of doing the entire calculation again.

One can imagine for large strings, this saves a considerable amount of computation -- however, one should be aware the use of this strategy is intimately tied to the immutability of the class. If string objects were not immutable, we would need to recalculate the hash code every time their content changed.

With all the lessons the above classes afford, we now turn our attention to how to override `hashCode()`

in classes of our own design.

In every one of the classes just discussed, even the tiniest change in the underlying value (e.g., changes to individual characters, or even individual bits) likely caused a change in the hash code generated. This helps uniformly distribute the keys and consequently reduce collisions.

We similarly wish to make the hash code generated in classes of our own design sensitive to any changes in the data they store -- that is to say, changes to the values of their instance variables.

Noting that different classes we design may have different numbers of instance variables, just as different strings have different numbers of characters, the `hashCode()`

method of the `String`

class provides a model to follow.

Upon finding hash codes for each instance variable value associated with a given object, we can use these to generate a hash code for entire object in the same way that we used the integers associated with the characters of a given string to generate a hash code for the entire string. In short, we can use Horner's method with a "base" of 31 to combine all of these integers into a single integer. The following provides a concrete example:

public final class Transaction { private final String who; private final Date when; private final double howMuch; private final Notes additionalDetails; private final Person[] signatories; ... public int hashCode() { int h = 17; h = 31 * h + who.hashCode(); h = 31 * h + when.hashCode(); h = 31 * h + ((Double) howMuch).hashCode(); // We convert to the wrapper class Double // here, as howMuch was of a primitive type // and thus has no hashCode() method // by itself to draw upon if (additionalDetails != null) // The above calculations assume h = 31 * h + additionalDetails.hashCode(); // the references for who and when // were not null. Perhaps this is a // reasonable assumption for this class // But sometimes, reference variables // can be null -- and trying to access // the instance method (or any instance // method for that matter) using a null // reference will produce an error. // As such, you need to check first! for (int i = 0; i < signatories.length; i++) // We need to use ALL of the information h = 31 * h + signatories[i].hashCode(); // in the object. So when we have an // array, we use the same strategy return h; // applied to EVERY entry in the array. }

Note, the values $31$ and $17$ used above can certainly be replaced with other values, as desire or context dictates. Although choosing odd prime values, as was seen in the `Boolean`

and `String`

classes, will generally be a good idea.

Lastly, remember that if the computation of the hash code is expensive in terms of time or memory space, we can always apply the same trick that the `String`

class employs, and cache the hash code generated. As was mentioned in the last section though, one needs to be mindful of the mutability of the class when caching any hash codes computed.

^{†}As a reminder of how the *xor* operator works, for bits $x$ and $y$, $x \wedge y$ equals $1$ if exactly one of $x$ or $y$ is $1$, and equals $0$ otherwise.

^{††} You can easily find this code for yourself, if desired. In Eclipse, just click on the "JRE System Library" in your Package Exploer, and then navigate to the `Long`

class by expanding "java.base", followed by "java.lang", "Long.class", and finally "hashCode():int" or "hashCode(long):int". One can similarly find the source code for any other class in the Java libraries, as desired.

^{‡} Technically, the `String`

class uses the `hashCode()`

method of the `StringUTF16`

class, where similar code is found -- except that this method works on an array of bytes, and consequently uses `getChar(value,i)`

instead of `charAt(i)`

.