For Strings, Hashtables work pretty much as you would expect if you are familiar
with hashing. The only catch is remembering to import
java.util.Hashtable (with a small t in table). Here is how you might use one.
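A minimal sketch (the sample data are invented for illustration):

    import java.util.Hashtable;

    public class HashtableDemo {
        public static void main(String[] args) {
            Hashtable<String, String> capitals = new Hashtable<>();
            capitals.put("Canada", "Ottawa");
            capitals.put("France", "Paris");
            // get returns the matching value, or null if the key is absent.
            System.out.println(capitals.get("France")); // prints Paris
        }
    }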
The key point for novices to remember is that the capacity of the table has to be somewhat
larger than the number of elements you plan to look up. The more spare capacity you
give, the faster the lookup. There are, of course, diminishing returns. Typically, you
specify about a third more capacity than you have elements, and
a load factor of 0.75, which controls automatic growth of the lookup table when it
gets too full. More details on that later.
The capacity of a table is not the number of
elements you plan to add! If you specify a capacity of 100 and a load factor of .75,
Hashtable will allocate only 100 slots, not the 133 you
need for efficient access. The load factor kicks in only when deciding when it is
time to double the size of the table, in this case, just after you have added 75
elements, not in deciding its initial size. It is up to you to leave some breathing
room.
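To make the arithmetic concrete, here is a sketch of sizing the table yourself (the element counts and type are just examples):

    import java.util.Hashtable;

    public class Sizing {
        public static void main(String[] args) {
            // To hold 100 elements at a 0.75 load factor you need about
            // 100 / 0.75 = 133 slots; 137 is the next prime up
            // (see the tip on prime capacities below).
            Hashtable<String, Integer> counts = new Hashtable<>(137, 0.75f);
            // The table will not need to grow while holding your 100 elements.
            System.out.println(counts.size()); // 0 entries so far
        }
    }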
Tips
- Hashtables work fastest if the capacity is a prime number, since that reduces
collisions. Your hashCodes are unlikely to have a frequency distribution pattern
based on modulo a large prime number, whereas they easily could have one based on a
power of two. When you use a power of two, you are effectively masking off all but
the low order bits. In Java 1.4.1+, for HashMap there is no
longer a payback for specifying prime table lengths because HashMap rounds whatever capacity you specify up to a power of two,
to avoid the cost of a division (it can do an AND bit mask instead) and to help
keep memory from being fragmented. However, Hashtable,
which is synchronized, uses the value you give it, so giving it a prime is still
worthwhile.
Now it is more important than ever to have a high quality hashCode function with good distribution, especially in the
low order bits. It must not favour one bit pattern over another, e.g. not tend to
generate hashCodes all ending in binary 0000. In JDK (Java Development Kit)
1.4.1+ there is a rehash
function used inside HashMap to improve the quality of
your provided hashCode by redistributing the bits.
rehash has to be quick, or they might as well have
used tables with prime lengths and the modulus
operator.
In theory though, you should round your capacity up to a prime number. However,
avoid 31 or multiples of 31, since the hashCode algorithm for Strings in
Java version 1.2 or later uses 31 as a magic
number in the hashing formula. Each additional empty slot in a Hashtable costs
only 4 bytes, so rounding up is cheap. Here are some primes just larger than
various powers of two that would be good choices:
11, 17, 37, 67, 131, 257, 521, 1031, 2053, 4099, 8209, 16411, 32771, 65537,
131101, 262147, 524309, 1048583, 2097169, 4194319, 8388617
- You cannot have elements with duplicate keys. If you add a new element that
duplicates the key of an existing element, the new element replaces the old one.
- The initial capacity is not the number of elements you plan to
put in the table. If you specify a capacity of 101 and a load factor of .75f,
then the initial table can hold only 75 elements before it has to grow. This
means you must specify plenty of slop in your capacity setting. If you used a load
factor of 1.0, your performance would be slow because you would have many
collisions rather than direct lookups.
- Hashtables will automatically grow when you add too many elements. However,
growing requires copying, rehashing and rechaining, quite an involved process. That
is why you want an accurate estimate for the initial capacity, slightly on the high
side.
- A load factor of .75f means that the table will be expanded when it gets three
quarters full. Hashing works faster when the table is not full. A value of .25f
would use more RAM (Random Access Memory) but would work faster. A
value of 1.0f would use minimal RAM
but would run extremely slowly. The reason you want a decent supply of unused
slots is to reduce collisions, two entries that hash to the same slot. They have
to be resolved by chasing a linear chain. If there are very few empty slots left,
and new entries are assigned effectively random slots, the odds are they will be
assigned a slot already taken. If there is almost no room left, the chains of
elements that hash to the same slot rapidly grow longer. That is why you have a
load factor, to grow the entire structure when space gets tight and keep the
chains of elements that collide on the same slot as short as possible.
- HashCodes are usually not unique. If you think about it, how could they be?
Consider Strings of length 10 characters (20 bytes, 160
bits). How many different possible strings are there? 2^160. HashCodes are only
32 bits long. How many possible different hashCodes are
there? 2^32. You can see there are way more strings than hashCodes, so strings
would have to share hashCodes.
- If two keys hash to the same hashCode, or to the same hashCode modulo the size
of the table, all still works. It just takes a little longer to look up that key,
since get() has to chase a chain of clashing keys to
find the match. Keys are matched with the equals()
method, not ==, or simply by comparing hashCodes.
- Let me repeat this another way, since it is a point people seem to miss on first
reading. Collisions happen not only when two of the hashCodes are identical (quite infrequent), but when two of the
hashCodes modulo the current table size are equal (which almost always happens at
least a few times). Collisions are unavoidable. The Hashtable class deals with them automatically. They slow down
lookup, but do not derail it. The smaller the load factor, the more
RAM you waste,
but the fewer the collisions, so the faster the lookup. Using prime numbers for the
divisor/capacity also tends to reduce collisions.
- This is the same math that comes up when you ask how many people you need to
have in a room for it to be likely that two of them share a birthday.
There’s a > 50% chance with 23 people, which is a
bit more than the square root of 365.
- The source code for Hashtable is very short. It
takes an order of magnitude more English to describe how it works than does the
Java code itself. I suggest you peruse it if you are in the least curious how it
works inside. Hashing is something every programmer should
understand. You might also study String.hashCode().
- Hashtables use generic Objects for both keys and
values. You need to cast these back to Strings, or whatever you are using.
- If you want to use something other than Strings as your keys, make sure you
implement equals() and hashCode() methods for those key objects to override the default
Object equals() and hashCode() methods; see the first sketch after this list. The default equals() method just checks that the two objects are one and the
same (using ==), not that they consist of the same bit
patterns.
- There are two methods for enumerating the Hashtable,
keys() which gives you the keys and elements() which gives you the values. You could step
through both enumerations simultaneously instead of using get() as I showed. JDK
1.2 HashMap (the
unsynchronized replacement for Hashtable) lets you iterate over keys, values or
key-value pairs; see the iteration sketch after this list. Note that the enumeration is not in any
particular order, not even the order in which you added your elements. You need either
keySet().iterator(), values().iterator() or entrySet().iterator() for pairs. For sorted Hashtables, you need to
sort the output yourself with Arrays.sort(
hashtable.values().toArray( new X[hashtable.size()]) ) or use a TreeMap.
- The size method keeps track of how many unique keys
are in the Hashtable, not the size of the table including empty slots. Hashtables
never contain items with duplicate keys.
- Wherever possible use Vectors or ArrayLists in preference to Hashtables
when you can assign a unique dense int key.
Vectors are much faster to look up when you know the
index. The Vector.get index lookup process does not
scan through other objects (dragging them from backing store into real
RAM). Thus
Vectors are even better than Hashtables when your virtual RAM
is not fully backed by real RAM.
Vector.contains, in contrast, does a linear search,
comparing each object in turn. This is much slower than a Hashtable lookup, except for very small tables. If you are single
threaded, use ArrayList in preference to Vector. You can build a free slot chain through the embedded unused
Vector slots to rapidly recycle index numbers.
- Never disturb fields in the Hashtable/HashMap/ HashSet key objects that would
change the hashCode or the results of equals. Hashtable requires these to sit still
for it to work. You can change them, but only when the Objects are not actively
being used as keys.
- Hashtable and HashMap do
not cache a recent get. So if you frequently fetch the
same item over and over, you might consider keeping a copy of the previous
get result.
- When you serialize a Hashtable, it just outputs
key-entry pairs, none of the lookup mechanism. When you load it back again, it
rehashes and rebuilds the lookup structure.
- Serializing and generics do not mesh well. To get rid of all warning messages
and to ensure type safety, you are best off outputting an array and reconstituting the
Hashtable yourself.
- Hashtable is a Map,
not a Collection, though it is considered part of the
Java Collections Framework. However, the result of Hashtable.values and Hashtable.keySet is a Collection. You can use Collection.toArray to export either
to an array.
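To illustrate the equals()/hashCode() tip above, here is a minimal sketch of a custom key class (the class name and fields are invented for the example):

    import java.util.Hashtable;

    public final class PartKey {
        private final String vendor;
        private final int partNumber;

        public PartKey(String vendor, int partNumber) {
            this.vendor = vendor;
            this.partNumber = partNumber;
        }

        @Override
        public boolean equals(Object other) {
            if (this == other) return true;
            if (!(other instanceof PartKey)) return false;
            PartKey that = (PartKey) other;
            return partNumber == that.partNumber && vendor.equals(that.vendor);
        }

        @Override
        public int hashCode() {
            // Combine both fields so that equal keys get equal hashCodes.
            return vendor.hashCode() * 31 + partNumber;
        }

        public static void main(String[] args) {
            Hashtable<PartKey, String> stock = new Hashtable<>();
            stock.put(new PartKey("Acme", 42), "widget");
            // A freshly built but equal key finds the entry, because
            // equals() and hashCode() are overridden.
            System.out.println(stock.get(new PartKey("Acme", 42))); // prints widget
        }
    }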
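And here is a sketch of iterating over key-value pairs with the unsynchronized HashMap, as mentioned in the enumeration tip (sample data invented):

    import java.util.HashMap;
    import java.util.Map;

    public class IterationDemo {
        public static void main(String[] args) {
            HashMap<String, String> capitals = new HashMap<>();
            capitals.put("Canada", "Ottawa");
            capitals.put("France", "Paris");
            // entrySet() hands back the pairs in no particular order.
            for (Map.Entry<String, String> entry : capitals.entrySet()) {
                System.out.println(entry.getKey() + " -> " + entry.getValue());
            }
        }
    }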
Dealing With Duplicates
Hashtables don’t permit duplicates.
What do you do if some of your objects have duplicate keys? When you have duplicates,
what do you want to find when you do a lookup? Just the first? Just the last? Both?
If you are clever, you can get all those behaviours with a Hashtable.
To get the first duplicate, before you put, check
to see if the element is already in the Hashtable; if so,
do nothing.
To get the last duplicate, just put. It will replace
what was there before. To get both, store an ArrayList as
the value. Store each of the duplicates in one of the slots of the ArrayList. This is a bit inefficient for non-duplicates. For them, you
could store just the plain Object. And you might handle
the special case of just one duplicate with an array. Your logic might look a bit
like this:
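A simplified sketch of that logic: store a lone value bare, and upgrade to an ArrayList on the first collision (the two-element array special case is omitted here):

    import java.util.ArrayList;
    import java.util.Hashtable;

    public class DuplicateKeys {
        private final Hashtable<String, Object> table = new Hashtable<>();

        @SuppressWarnings("unchecked")
        public void add(String key, String value) {
            Object existing = table.get(key);
            if (existing == null) {
                // No duplicate yet: store the bare String to conserve RAM.
                table.put(key, value);
            } else if (existing instanceof ArrayList) {
                // Third or later value for this key: append to the list.
                ((ArrayList<String>) existing).add(value);
            } else {
                // First duplicate: replace the bare String with a list of both.
                ArrayList<String> values = new ArrayList<>();
                values.add((String) existing);
                values.add(value);
                table.put(key, values);
            }
        }
    }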
Of course, you are free to simplify the algorithm, going straight to ArrayLists even for singles. My complex recipe is designed to conserve
RAM on large
lists.
See also: Oracle’s Javadoc on the Hashtable class.