The key thing to understand about hashCodes is that they need not be unique,
just as close to unique as practically possible.
HashCodes in a Nutshell
If you want to file something away for later retrieval,
it can be faster if you file it numerically rather than by a long alphabetic key.
A hashCode is a way of computing a small (32-bit) digest numeric key from a long
String or even an arbitrary clump of bytes. The numeric key itself is
meaningless and the hashCode functions for computing them can look a bit insane.
However, when you go to look for something, you can do the same digest
calculation on the long alphabetic key you are looking for, and no matter how
bizarre an algorithm you used, you will calculate the same hashCode, and will be
able to look up numerically with it. Of course there is always the possibility
two different Strings will have the same digest hashCode. However, even then,
all is not lost; it greatly narrows down the search, hence speeding it up. A Hashtable
goes a step further, scrunching down the hashCode even further to an even
smaller number that it can use to directly index an array, usually by dividing
it by some (ideally prime) number and taking the remainder.
String.hashCode
Implementation
In JDK 1.0.x and 1.1.x the hashCode function for
long Strings worked by sampling every nth character. This pretty well guaranteed
you would have many Strings hashing to the same value, thus slowing down Hashtable
lookup. In JDK 1.2 the function has been improved to multiply the result so far
by 31 then add the next character in sequence. This is a little slower, but is
much better at avoiding collisions.
Object.hashCode
Implementation
The default hashCode() method uses the 32-bit
internal JVM address of the Object as its hashCode.
However, if the Object is moved in memory during
garbage collection, the hashCode stays constant. This default hashCode
is not very useful, since to look up an Object in a HashMap,
you need the exact same key Object by which the key/value
pair was originally filed. Normally, when you go to look up, you don’t
have the original key Object itself, just some data
for a key. So, unless your key is a String, nearly always you will need to
implement a hashCode and equals
method on your key class.
The Gemini Twins: equals
and hashCode
Equal hashCodes in general are not sufficient to ensure Object
equality. However, if the hashCodes are not equal, you know the Objects
can’t possibly be equal. Consider how many 50-character Strings there are
(65535^50) and how many possible hashCodes there are (2^32). It should be
obvious there are way more Strings than
hashCodes. So the same hashCode has to be reused over and over for
different Strings.
The default hashCode just uses the address of the Object
and the default equals method just compares
addresses. If you override one of these two methods, you must override the other
to match. The rules are:
- It is ok, but not desirable, if two Objects that do
not compare equal have the same hashCode.
- Two Objects that compare equal must have the same
hashCode.
So if you had a Fruit Object
with a flavour and colour
field, and you decided that any two Objects with the
same flavour were for all intents and purposes equal,
you would define your equals and hashCode
methods like this:
As a rule of thumb, any time you use an Object as a
key in a Map or Set
(e.g. Hashtable, HashMap,
HashSet, TreeMap etc.)
you must redefine both equals and hashCode
in such a way both incorporate all the fields of the logical key. Fields in the
key Object irrelevant to lookup should not be
included in either method.
Calculating Aggregate hashCodes with XOR
The xor ^ operator is useful in computing hashing
functions. To create a hashCode based on two fields,
compute the hashCodes of the two fields separately
and xor them together with the ^ operator. To
create an hash on all the elements of an array you could xor all the values
together. The result independent of the order. If you want the order to matter,
use some digest function. xor also has the odd
properly that if you have a pair of identical hashCodes xored together, it is as
if they were not there. When you are expecting duplicates, you might want to use
some other combining technique.
XOR has the following nice properties:
- It does not depend on order of computation.
- It does not “waste” bits. If you change even one bit in one of the
components, the final value will change.
- It is quick, a single cycle on even the most primitive computer.
- It preserves uniform distribution. If the two pieces you combine are uniformly
distributed so will the combination be. In other words, it does not tend to
collapse the range of the digest into a narrower band.
The nasty properties of XOR are:
- It treats identical pairs of components as if they were not there.
- It ignores order. A ^ B is the same a B ^
A. If order matters, you want some sort of checksum/digest such as Adlerian. XOR
would not be suitable to compute the hashCode for a List
which depends on the order of the elements as part of its identity.
- If the values to be combined are only 8 bits wide, there will be effectively
only 8 bits in the xored result. In contrast, multiplying by a prime and adding
would scramble the entire 32 bits of the result, giving a much broader range of
possible values, hence greater chance that each hashcode will unique to a single Object.
In other words, it tends to expand the range of the digest into the widest
possible band (32 bits).
Here is another approach that would work better if you had two Strings
in your Object. It gives a different hash code for
your two Objects when:
o1.string1 = "apple";
o1.string2 = "orange";
o2.string1 = "orange";
o2.string2 = "apple";
It works like this to combine hash codes of the fields in your object:
Here is a faster, but not as universally applicable technique:
Here is roughly how String.hashCode
works inside:
Here is a fast hash algorithm you can apply to bytes, short, chars, ints, arrays
etc. I used an assembler version of it in my BBL Forth compiler hashCode where
it is essentially implemented in two instructions, ROL
and XOR.
Consider using an CRC-32 or an Adlerian
digest for your hashCode when you can reduce the
key part of your Object to a long string of bytes.
This give a nice even spread over the range of possible integers.
When Do You Need A Custom equals and hashCode?
The hashCode method only gets invoked when you use
the Object as the key to a Hashtable.
It is not used when the Object is merely a Hashtable
value. Most of the time your Hashtable keys are
simple Strings, so you rarely need to write custom equals
and hashCode methods. When you use a HashSet
to help you check for duplicate Objects, then you
likely will need a custom equals and hashCode
method. There your Objects act as both key and value.
If you know the key values in advance, it is possible to construct
a hashCode function that has no collisions.
The One Key Catch
You can define only one hashCode/equals
method for your HashSet Objects.
That limits you to one type of HashSet lookup for
your Objects. There is no equivalent to Comparator
for HashSets. You can look up your Objects
by only one key, though that key might contain several fields. You can’t
have several HashSets each accessing the same Objects
by different keys. You can of course have several HashSets
each accessing a different subset of the same group of Objects
using the same key.
In contrast, with HashMap you have more freedom.
Your Objects don’t have to implement a useful hashCode/equals,
but any keys you use do. Since you can define different hashCode/equals
for different types of key, you can have multiple HashMaps
on the same group of Objects looking up by different
keys.
Arrays are Objects and use the lame Object.
hashCode. To get a proper hashCode
that is based on the values in the array, you need Arrays.
hashCode.
Learning More
Sun’s Javadoc on
Object.
hashCode() : available:
Sun’s Javadoc on
Object.
equals( Object ) : available:
Sun’s Javadoc on
Arrays.
hashCode() : available:
Sun’s Javadoc on
Arrays.
deepHashCode() : available: