Interned Strings avoid duplicate Strings.
Interning saves RAM at the expense of more CPU time to detect and replace
duplicate Strings. There is only one copy of each String
that has been interned, no matter how many references point to it. Since Strings
are immutable, if two different methods "incidentally" use the same String
(even if they concocted it by totally independent means, e.g. one might use
the String "sin" in the context of Moses and another in the context of
trigonometry), they can
share a copy of the same String. The process of
converting duplicated Strings to shared ones is
called interning. String.intern()
gives you the address of the canonical master String.
You can compare interned Strings with simple == (which
compares pointers) instead of equals which compares
the characters of the String one by one. Because Strings
are immutable, the intern process is free to further save space, for example, by
not creating a separate String literal for "pot"
when it exists as a substring of some other literal such as "hippopotamus".
Why Intern?
There are two reasons for interning Strings:
- To save space, by removing String literal duplicates.
- To speed up String equality compares. Interned Strings
will compare faster even if you use equals instead
of ==, because equals first checks whether the two references are identical.
For example, if you wanted to read CSV files containing the party affiliation of
20,000 people into a HashMap, you would have 20,000 Strings
floating around in memory to record the affiliations. If you interned the
affiliation String, there would only be a dozen or
so. Every Democrat would safely share the same copy of the immutable "democrat"
String.
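The saving can be sketched as follows. This is only a minimal illustration: it assumes each CSV line has the hypothetical form name,party, and the class and method names are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

class AffiliationLoader {
    /**
     * Builds a name -> party map from lines of the assumed form "name,party".
     * Each party String is interned, so all equal parties share one object.
     */
    static Map<String, String> load(String[] csvLines) {
        Map<String, String> partyByName = new HashMap<>();
        for (String line : csvLines) {
            String[] fields = line.split(",");
            String name = fields[0].trim();
            // split() hands back a fresh String for every line; intern()
            // swaps it for the single canonical copy.
            String party = fields[1].trim().intern();
            partyByName.put(name, party);
        }
        return partyByName;
    }

    public static void main(String[] args) {
        String[] lines = { "Alice,democrat", "Bob,republican", "Carol,democrat" };
        Map<String, String> partyByName = load(lines);
        // Both values are the same object, not merely equal Strings.
        System.out.println(partyByName.get("Alice") == partyByName.get("Carol")); // true
    }
}
```

Without the call to intern, the map would hold one "democrat" String per Democrat instead of one in total.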
Interning and String.substring
When you use String.substring,
the JVM allocates a new String descriptor, but it
just points into the characters of the original String.
It does not need to allocate space for the substring. It does not copy any
characters. String.substring
does not intern the result. The
original base String cannot be garbage collected as
long as there are any live references to substrings inside it.
Empty Strings resulting from String.substring
are not automatically interned either.
Because of this, the resulting empty substring can still indefinitely encumber a
long base String, preventing it from being garbage
collected. (This describes older JVMs. Since Java 7 update 6, String.substring
copies the characters it needs, so a substring no longer keeps its base String
alive.)
Interning and the void String
To ensure you don’t accidentally encumber base Strings,
and to avoid the confusion of using a mixture of blank
(i.e. x.length() != 0 && x.trim().length() == 0,
e.g. " "), empty
(i.e. x.length() == 0, e.g. ""),
and null (i.e. x == null) to
represent the void String,
you may want to use code like this:
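A minimal sketch of such a helper follows. The class name VoidStrings and the methods isVoid and canonical are hypothetical, chosen for illustration, and returning non-void Strings unchanged is an assumption about what is wanted.

```java
class VoidStrings {
    /** True if s is null, empty, or blank (nothing left after trim). */
    static boolean isVoid(String s) {
        return s == null || s.trim().length() == 0;
    }

    /**
     * Maps every void String (null, "" or blank) to the one "" literal,
     * which is interned and is not a substring of any larger base String.
     * Non-void Strings are returned unchanged.
     */
    static String canonical(String s) {
        return isVoid(s) ? "" : s;
    }

    public static void main(String[] args) {
        // All three kinds of void String collapse to the same object.
        System.out.println(canonical(null) == canonical("   ")); // true
        System.out.println(canonical("") == canonical(null));    // true
        System.out.println(canonical("sin"));                    // sin
    }
}
```

After canonicalising, a single test such as s.length() == 0 is enough to detect a void String.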
The Intern Gotcha
All String literals present at compile time are
automatically interned. It is only Strings generated
on the fly as the program runs that might not be interned. A nasty side effect
of this behaviour is that a program will work fine for some simple cases, but
fail on complex ones. The problem comes if you used ==
to test for String equality where you should have
used equals. The wrong code will still work much of
the time because most String literals are naturally
interned.
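The gotcha is easy to reproduce. In this small sketch (the class and method names are made up for the example), the String built at run time holds the same characters as the literal but is a different object until it is interned:

```java
class InternGotcha {
    /** Joins two pieces at run time; the result is a new, uninterned String. */
    static String build(String head, String tail) {
        return new StringBuilder(head).append(tail).toString();
    }

    public static void main(String[] args) {
        String literal = "hello";
        String sameLiteral = "hello";
        String built = build("hel", "lo");

        System.out.println(literal == sameLiteral);    // true: both literals are interned
        System.out.println(literal == built);          // false: built is a separate object
        System.out.println(literal.equals(built));     // true: same characters
        System.out.println(literal == built.intern()); // true: intern() returns the master
    }
}
```

Code that uses == passes every test that happens to feed it literals, then fails on the first String that arrives from a file, a socket, or a concatenation.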
Intern and new String( String )
Newbies often say foolish things like
String s = new String( "hello" );
instead of
String s = "hello";
This is the opposite of interning. You are deliberately creating a duplicate
distinct (but identically valued, and definitely not interned) "hello"
String object. There are two legitimate uses for
doing that:
- To provide a unique String synchronization object.
- To unencumber the huge base String in which a
substring is embedded. By making a copy with new String( String ),
the original String
is free to be garbage collected. It can pay to use
new String( String ) if you
have only a few short substrings into a common mother base String.
Then garbage collection can let go of the mother String.
If you have a large number of substrings so that the entire mother String
is represented in some substring, then there is no point in doing that. It is
more efficient to just reference into the common mother String
with the substring.
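Both uses can be sketched in a few lines. The class NewStringUses and its methods are hypothetical names for illustration. The copy in detach only matters on JVMs where a substring shares storage with its base String; where substring already copies, it is harmless but redundant.

```java
class NewStringUses {
    // Use 1: a private lock. A plain "lock" literal would be interned and so
    // shared with every other "lock" literal in the JVM.
    private final String lock = new String("lock");

    private int counter;

    /** Increments the counter while holding the private lock. */
    int increment() {
        synchronized (lock) {
            return ++counter;
        }
    }

    // Use 2: copy a short piece out of a big String so that the result does
    // not depend on the big String's storage.
    static String detach(String big, int start, int end) {
        return new String(big.substring(start, end));
    }

    public static void main(String[] args) {
        System.out.println(detach("hippopotamus", 5, 8)); // pot
        System.out.println(new String("lock") == "lock"); // false
    }
}
```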
Is new String compelled
to create a brand new underlying String when you use
new String( String )? Yes!
You might imagine a clever JVM that always interned every new
String or that simply passed back the original
reference, treating it as a no-op. The language specification says that it is
in fact compelled: new String
must create a new unique reference. However, the JVM could theoretically do that
by building a new String descriptor that shares the characters of the original
(or of its interned master), in the manner of String.substring,
and so avoid actually making a physical copy.
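This brings up yet another related question. Is s ==
s.substring( 0 )
compelled to be false? No. The Javadoc for substring does not promise a new
object, and OpenJDK simply returns s itself when the start index is 0.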
One other place you will see new String
used legitimately is in:
String password = new String( jpassword.getPassword() );
getPassword returns a char[],
so it is not the silliness it first appears to be. It does this to permit you to
empty the char array after use in high security
situations.
Consider a piece of code like this: String s = new String( "Hello" );
The compiler puts the literal "Hello"
in the class file in such a way that it will become an interned String
when the class is loaded. When you stupidly use new String
you create a new String on the heap, one with an
address different from the interned version. Had you written sensible code like
this: String s = "Hello";
you would not have created a duplicate String Object.
You would not have defeated the interning. s would
point directly to the interned String "Hello".
Intern and Garbage Collection
In the early JDKs, any String you interned could
never be garbage collected because the JVM had to keep a reference to it in its Hashtable
so it could check each incoming String to see if it
already had it in the pool. With JDK 1.2 came weak
references. Now unused interned Strings will be
garbage collected.
Overflow
java.lang.OutOfMemoryError: String
intern table overflow means you have too many interned Strings.
Some older JVMs may limit you to 64K Strings,
which leaves perhaps 50,000 for your application. The IBM Java 1.1.8 JRE has
this limit. Note that this is an Error, not an Exception,
if you want to catch it. Here is the source for a simple Java program called InternTest
to test your JVM.
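What follows is a minimal sketch of such a test, not the original InternTest listing. The method name internMany, the "intern-test-" prefix and the default limit of 100000 are all assumptions. On a modern JVM it will most likely run to completion without hitting any limit.

```java
import java.util.ArrayList;
import java.util.List;

class InternTest {
    /**
     * Tries to intern 'limit' distinct Strings and returns how many were
     * interned before the JVM gave up (or 'limit' if it never did).
     */
    static int internMany(int limit) {
        // Strong references, so no interned String can be collected mid-test.
        List<String> keep = new ArrayList<>();
        try {
            for (int i = 0; i < limit; i++) {
                keep.add(("intern-test-" + i).intern());
            }
        } catch (OutOfMemoryError e) {
            // An Error, not an Exception: catch (Exception e) would miss it.
        }
        return keep.size();
    }

    public static void main(String[] args) {
        int limit = args.length > 0 ? Integer.parseInt(args[0]) : 100000;
        System.out.println("Interned " + internMany(limit) + " of " + limit + " Strings");
    }
}
```

Pass a larger limit on the command line to push the JVM harder.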
Under The Hood
This is a simplified version of how interning works under the hood. Inside the
JVM is the heap where all allocated Objects reside.
This includes Strings both interned and ordinary.
In addition, interned Strings are registered in a
weak HashMap.
The collection of Strings registered in this HashMap
is sometimes called the String pool. However, they are
ordinary Objects and live on the heap just like any
other (perhaps in an optimised way since interned Strings
tend to be long lived). The String Object
lives on the heap and a reference to it lives in the HashMap.
There is no separate pool of interned String objects.
Whenever a String is interned, it is looked up in
the HashMap to see if it exists already. If so the
user gets passed a reference to the master copy. Normally he will use that copy
in preference to his. His duplicate copy then will likely soon have no
references to it and will be eventually garbage collected. If the String
has never been seen before, a reference to it will be added to the HashMap and intern
will hand him a reference to his own String, now
registered as the unique master. Note that the intern process does not make a
copy of the String, it just keeps a reference to the
unique master copies.
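The lookup just described can be modelled in plain Java. This is only a sketch of the idea, not how any real JVM implements String.intern; the class name WeakPoolSketch and the choice of java.util.WeakHashMap with WeakReference values are assumptions made for illustration.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

class WeakPoolSketch {
    // Keys are held weakly. The values are weak too, otherwise each value
    // would keep its own key strongly reachable and nothing would be collected.
    private final Map<String, WeakReference<String>> pool = new WeakHashMap<>();

    /** Returns the master copy equal to s, registering s itself if there is none. */
    synchronized String intern(String s) {
        WeakReference<String> ref = pool.get(s);
        String master = (ref == null) ? null : ref.get();
        if (master != null) {
            return master;                   // seen before: hand back the master
        }
        pool.put(s, new WeakReference<>(s)); // never seen: s becomes the master
        return s;
    }

    public static void main(String[] args) {
        WeakPoolSketch pool = new WeakPoolSketch();
        String first = new String("sin");
        String second = new String("sin");
        System.out.println(pool.intern(first) == first);  // true: first is registered as master
        System.out.println(pool.intern(second) == first); // true: the duplicate maps to the master
    }
}
```

Note that intern never copies the String. It either returns the master it already knows or registers the caller's own object as the new master.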
All the Strings, interned and ordinary live on the
heap. When there are no references left to a String
except the intern HashMap
registry reference, it will be garbage collected since intern
keeps only a weak reference to it.
When you say new String,
it is not automatically interned. Thus there may then be duplicates on the heap.
If you later use intern on that String,
those duplicates won’t be cleaned up. Only when you intern
all copies of a String, and discard references to
the uninterned versions do you maintain but a single copy.
Manual Interning
The big problem with intern in older JVMs is that once you intern
a String, you are stuck with it in RAM until the
program ends. It is no longer eligible for garbage collection, even if there are
no more references to it. If you want a temporary interned String,
you might consider interning manually.
However, in more recent JVMs, the interned String table usually holds its
entries in weak-reference fashion, so that interned Strings become
eligible for garbage collection as soon as they are no longer strongly
referenced.
For example, let us assume you were reading a CSV file of names and addresses
and storing it internally in a Collection of
some sort. Since many people live in the same city, RAM will soon become
cluttered with hundreds of duplicate String object
copies of the names of local cities.
Create a HashMap (not a HashSet)
to look up by city a master String object for each
city. Every time you get a city, you look it up in the HashMap.
If it is there, replace your reference with a reference to the master copy. Your
String object copy will then become eligible for garbage collection. If it is not
there, add the city’s String to the HashMap.
When you are finished adding cities, you can discard the HashMap.
The master city Strings you put in the HashMap
will still exist, will still be unique, and will still behave as if they had been
interned with String.intern, except that those without any other references
will become eligible for garbage collection.
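The recipe above might look like this in code. It is a minimal sketch; the class name CityInterner and the method canonical are illustrative, not part of any library.

```java
import java.util.HashMap;
import java.util.Map;

class CityInterner {
    // Maps each city name to its master copy. A HashMap rather than a HashSet,
    // because a set cannot hand back the element it already holds.
    private final Map<String, String> masters = new HashMap<>();

    /** Returns the master copy of city, registering city itself on first sight. */
    String canonical(String city) {
        String master = masters.get(city);
        if (master == null) {
            masters.put(city, city);
            master = city;
        }
        return master;
    }

    public static void main(String[] args) {
        CityInterner cities = new CityInterner();
        String a = cities.canonical(new String("Victoria"));
        String b = cities.canonical(new String("Victoria"));
        System.out.println(a == b); // true: the second copy was replaced by the master
    }
}
```

Once loading is finished, dropping the last reference to the CityInterner discards the HashMap, and each master String then lives only as long as something else still refers to it.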
Learning More
Sun’s Javadoc on String.intern.