Strings in Java are quite different from strings in C++.
They are immutable, i.e. you can’t
change the characters in a String. To look at individual characters, you
need to use charAt(). Strings in Java are 16-bit Unicode. To edit strings, you need to use a
StringBuffer object or a char[].
In Java version 1.5 or later you use StringBuilder,
which works exactly like StringBuffer, but
it is faster and not thread-safe.
You get the size of a String (length in
chars) with String.length(),
not the .length field or the size() method
used in other classes.
For manipulating 8-bit characters, you
want an array of bytes, i.e. byte[].
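To make the points above concrete, here is a minimal sketch. The class name StringBasics and the helper withCharAt are made up for illustration; the String, StringBuilder and StandardCharsets calls are standard (StandardCharsets needs Java 7 or later).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class StringBasics {
    /** Returns a copy of s with the char at index replaced. s itself is never changed. */
    static String withCharAt(String s, int index, char c) {
        StringBuilder sb = new StringBuilder(s); // mutable working copy
        sb.setCharAt(index, c);
        return sb.toString();                    // fresh immutable String
    }

    public static void main(String[] args) {
        String s = "hello";
        System.out.println(s.charAt(1));           // e
        System.out.println(s.length());            // 5 (a method call, not a field)
        System.out.println(withCharAt(s, 0, 'j')); // jello
        System.out.println(s);                     // hello (the original is untouched)

        // 8-bit characters live in a byte[], here encoded as ISO-8859-1.
        byte[] eightBit = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(eightBit.length);       // 5
    }
}
```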
There are three types of empty string: null, "" and " ".
You check for each flavour differently: s == null for null,
s.length() == 0 for "" and s.trim().length() == 0 for a String
containing only blanks.
    if ( "abc".equals( s ) ) echo( "matched" );
is better than
    if ( s.equals( "abc" ) ) echo( "matched" );
because the first form won’t raise
an exception if s is null. It will treat the
strings as not equal.
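The three flavours of empty String can be told apart roughly like this. It is only a sketch; the class EmptyChecks and its method names are invented for the example.

```java
// Hypothetical helper class for illustration only.
class EmptyChecks {
    /** Classifies s as "null", "empty", "blank" or "text". */
    static String classify(String s) {
        if (s == null) return "null";                // no String object at all
        if (s.length() == 0) return "empty";         // the "" flavour
        if (s.trim().length() == 0) return "blank";  // only whitespace, e.g. " "
        return "text";
    }

    /** Null-safe test: calling equals on the literal never throws. */
    static boolean isAbc(String s) {
        return "abc".equals(s);
    }

    public static void main(String[] args) {
        System.out.println(classify(null));  // null
        System.out.println(classify(""));    // empty
        System.out.println(classify(" "));   // blank
        System.out.println(isAbc(null));     // false, and no NullPointerException
    }
}
```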
Unless Strings have been interned, with String.intern(),
you cannot compare them for equality with ==.
You have to use equals() instead.
The compiler will not warn you if you inadvertently use ==.
Unfortunately, the bug may take a long time to surface if your
compiler or virtual machine is doing transparent interning. Interning
gets you a reference to the master copy of a String.
This allows the duplicates to be garbage collected sooner. However,
there are three disadvantages to interning:
- It takes extra time to look up the master string in a Hashtable.
- In some implementations, you can have a maximum of 64K
interned Strings.
- In some implementations, interned Strings
are never garbage collected, even when they are no longer used. The
interning process itself acts as a packratter. The answer is to
implement them with weak references.
String comparison does not logically trim leading and trailing
whitespace before comparing. If you want that effect, use trim().
If you want to compare for < or >, you cannot use the usual
comparison operators. You have to use compareTo()
or compareToIgnoreCase() instead.
    String s = "apple";
    String t = "orange";
    if ( s.compareTo( t ) < 0 )
        System.out.println( "s < t" );
You can think of it roughly like treating the Strings as numbers and
computing s - t. compareTo returns:
- some positive number if string s lexically comes after t.
- 0 if s is the same as t.
- some negative number if s sorts earlier than t.
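Here is a small sketch of the difference between ==, equals, intern and compareTo. The class InternAndCompare and its sign helper are invented for the example; only the sign of compareTo's result is meaningful, so the helper reduces it to -1, 0 or 1.

```java
// Hypothetical demo class for illustration only.
class InternAndCompare {
    /** Reduces compareTo's result to -1, 0 or 1. */
    static int sign(String s, String t) {
        return Integer.signum(s.compareTo(t));
    }

    public static void main(String[] args) {
        String a = "apple";              // literals are interned automatically
        String b = new String("apple");  // a distinct object with the same chars

        System.out.println(a == b);          // false: two different objects
        System.out.println(a.equals(b));     // true: same characters
        System.out.println(a == b.intern()); // true: intern returns the master copy

        System.out.println(sign("apple", "orange")); // -1: apple sorts earlier
        System.out.println(sign("orange", "apple")); // 1
        System.out.println(sign("apple", "apple"));  // 0
    }
}
```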
When you write your own classes, the default Object.equals
does not do a field by field comparison. You have to
write your own version of equals to get
that effect. The default version simply tests the equality of the two
references, i.e. that they both point to the same object.
Novices might be astonished by the following results:
- "abc".compareTo( "ABC" ) returns a positive number, i.e.
"abc" > "ABC". compareTo is case-sensitive.
- "abc ".compareTo( "abc" ) returns a positive number.
Blanks are treated like any other character.
- "".compareTo( null )
raises a java.lang.NullPointerException.
- "" is not the same thing as null. Most
String functions will be happy to handle "",
but very few will accept null.
- The comparison is done by straightforward Unicode numeric
character by character comparison. There is no adjustment for locale.
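The surprises listed above can be reproduced with a few lines. This is a sketch only; the class CompareGotchas and the method throwsOnNull are made-up names.

```java
// Hypothetical demo class for illustration only.
class CompareGotchas {
    /** Returns true if comparing against null raises NullPointerException. */
    static boolean throwsOnNull() {
        try {
            "".compareTo(null);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Lower case 'a' (97) is numerically greater than upper case 'A' (65).
        System.out.println("abc".compareTo("ABC") > 0);            // true
        System.out.println("abc".compareToIgnoreCase("ABC") == 0); // true

        // A trailing blank counts like any other character.
        System.out.println("abc ".compareTo("abc") > 0);           // true
        System.out.println("abc ".trim().compareTo("abc") == 0);   // true

        System.out.println(throwsOnNull());                        // true
        System.out.println("".equals(null));                       // false, no exception
    }
}
```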
Case-Sensitive and Case-Insensitive Comparison
Your basic tools for searching are indexOf and lastIndexOf.
They both have variants that take a fromOffset parameter saying
where to start searching. The result is relative to the start of the
entire String, not the fromOffset. The common17
package contains a StringSearch class that
will search for many different strings. These searches are all
case-sensitive. To get case-insensitive searches, convert both Strings
to all upper case or all lower case first. You must be
There are variants of the methods that search for a single character.
These are faster than the equivalent methods that look for a 1-character
String. It would be nice if the compiler were smart enough to optimise a
1-character String constant to a char
as the parameter of indexOf. You can
write x.indexOf( y ) >= 0
as x.contains( y ).
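A short sketch of the searching methods follows. The class SearchDemo and the helper indexOfIgnoreCase are invented names. The helper uses the convert-both-to-one-case approach with Locale.ROOT; that is an assumption chosen for predictability, and for the rare characters whose case conversion changes the String's length the returned index may not line up with the original.

```java
import java.util.Locale;

// Hypothetical demo class for illustration only.
class SearchDemo {
    /** Case-insensitive indexOf, done by lower-casing both Strings first. */
    static int indexOfIgnoreCase(String haystack, String needle) {
        return haystack.toLowerCase(Locale.ROOT)
                       .indexOf(needle.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String s = "banana";
        System.out.println(s.indexOf("an"));      // 1
        System.out.println(s.indexOf("an", 2));   // 3, counted from the start of s
        System.out.println(s.lastIndexOf("an"));  // 3
        System.out.println(s.indexOf('n'));       // 2, the char variant
        System.out.println(s.contains("nan"));    // true, same as indexOf("nan") >= 0
        System.out.println(indexOfIgnoreCase("Banana", "NAN")); // 2
    }
}
```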
Strings are immutable. Therefore they can be reused indefinitely and
they can be shared for many purposes. When you assign one String
variable to another, no copy is made. In older JVMs, even when you took
a substring no new character data was created, though a new String
descriptor was. Since Java 7 update 6, substring copies the characters
it needs. New Strings are created when:
- you concatenate.
- you read Strings from files.
- you foolishly use new
String( String ). There is one situation where its use is
legit. See substring
for the explanation.
- you use new String( somethingElse )
for conversion.
- you use StringBuffer/StringBuilder toString/substring.
Every Object has a method called toString
that makes some sort of attempt to convert the contents of the Object
into human-readable form as a Unicode String
for display. Normally, when you write a new class, you write your own
corresponding toString method for it, even
if just for debugging.
You use it like this: String toShow
= myThing.toString();
The default Object.toString
method is not very clever. It does not display all
the primitives in your class with field names as you might expect. If
you want that, you must code it yourself. A default toString
will typically, instead, do something lame like dump the class name and the hashCode
or the Object’s address.
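A hand-written toString might look something like this. The class DemoPoint and its fields are hypothetical, and the output layout is just one reasonable choice, not a convention the language imposes.

```java
// Hypothetical class for illustration only.
class DemoPoint {
    private final int x;
    private final int y;

    DemoPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    /** Shows every field with its name, which the default toString does not do. */
    @Override
    public String toString() {
        return "DemoPoint[x=" + x + ", y=" + y + "]";
    }

    public static void main(String[] args) {
        DemoPoint p = new DemoPoint(3, 4);
        String toShow = p.toString();
        System.out.println(toShow);       // DemoPoint[x=3, y=4]

        // Without an override you get the class name, '@' and the hashCode in hex.
        System.out.println(new Object());
    }
}
```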
toString has a magical property. It
appears to get invoked automatically to convert to String
without you having to mention toString.
- In one case, System.out.println (and brothers), it is not really
magic. println pulls it off with a
plethora of overloaded
methods, one for each of the primitive
types. Each overloaded method converts its primitive
parameter to a String for you and
passes that on to the variant of println
that can only handle Strings. But, you
say (glad to see you are so attentive), primitives don’t have
a toString method! That is true, but
there are static conversion
methods to get that effect, such as String.valueOf( int
). For any Object other than a String,
println invokes the Object’s
usually-overridden custom toString
method and passes the result on to the String-eating
println.
- When you use concatenation, toString
truly does get called for you magically, sometimes. If ever you try
to add a String and some other Object, Java presumes you
are really trying to concatenate them, transparently calls the
Object’s toString method and
concatenates the results giving a String.
It even works when you try to add a String
and a primitive. Concatenation will convert the primitive to a String for you and concatenate the results,
transparently. This can lead to surprising
results.
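Here is a sketch of the kind of surprise that automatic conversion during concatenation can produce. The class ConcatSurprises and its method names are made up. The key point is that + is evaluated left to right, so it only means concatenation once a String has appeared.

```java
// Hypothetical demo class for illustration only.
class ConcatSurprises {
    static String sumThenText() {
        return 1 + 2 + "x";    // (1 + 2) is integer addition first
    }

    static String textThenSum() {
        return "x" + 1 + 2;    // ("x" + 1) is already a String
    }

    static String charPlusInt() {
        return "" + ('a' + 1); // 'a' + 1 is the int 98, not the char 'b'
    }

    static String withNull(Object o) {
        return "v=" + o;       // a null reference is converted to "null"
    }

    public static void main(String[] args) {
        System.out.println(sumThenText());  // 3x
        System.out.println(textThenSum());  // x12
        System.out.println(charPlusInt());  // 98
        System.out.println(withNull(null)); // v=null
    }
}
```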
String.replace( char target, char replacement ) is
considerably faster than String.replace(
String target, String replacement ).
Both replace all occurrences. So if you are replacing
just a char, use single quotes.
Unfortunately, String.replace(
String target, String replacement ) is
only available in Java version 1.5 or later.
String.replaceAll( String
regex, String replacement ) also replaces all instances. The
difference is, replaceAll looks for a
regex String not a simple String.
Beware of using replaceAll( String
regex, String replacement ) when you meant replace(
String target, String replacement ).
The second parameter of replaceAll is not just a simple String.
It behaves like the replacement in Matcher.replaceAll.
$ is a reference to a captured String in
the search pattern and \ is the regex
quote character, meaning in Java source a literal \ must
be coded as \\\\ and a literal $ as \\$.
String.replaceFirst( String
regex, String replacement ) also takes a regex. There is no replaceFirst that takes only a simple String.
String.replace
in the Javadoc is shown with CharSequence
parameters. Don’t let this frighten you. String
implements CharSequence, so replace
works fine on Strings. Its parameters
can be some other things as well, such as StringBuilders.
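The differences between the replace family members can be seen in a short sketch. The class ReplaceDemo and its method names are invented for the example. Matcher.quoteReplacement is the standard way to avoid hand-escaping $ and \ in a replacement String.

```java
import java.util.regex.Matcher;

// Hypothetical demo class for illustration only.
class ReplaceDemo {
    /** Plain-text replace: the dot is just a dot. */
    static String literalDots(String s) {
        return s.replace(".", "-");
    }

    /** Regex replace: an unescaped dot matches every character. */
    static String regexDots(String s) {
        return s.replaceAll(".", "-");
    }

    /** Inserts a literal "$5" without tripping over the group-reference syntax. */
    static String price(String template) {
        return template.replaceAll("X", Matcher.quoteReplacement("$5"));
    }

    public static void main(String[] args) {
        System.out.println("a.b.c".replace('.', '-'));        // a-b-c (char variant)
        System.out.println(literalDots("a.b.c"));             // a-b-c
        System.out.println(regexDots("a.b.c"));               // -----
        System.out.println("a.b.c".replaceAll("\\.", "-"));   // a-b-c
        System.out.println("a.b.c".replaceFirst("\\.", "-")); // a-b.c
        System.out.println(price("cost: X"));                 // cost: $5
        System.out.println("cost: X".replaceAll("X", "\\$5")); // cost: $5
    }
}
```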
You can use
isLegal to ensure a String
contains only the characters you consider legal. You can download it. It is pretty
simple, using indexOf on the legal String.
You can also use charAt to extract the
characters one by one, then categorise them with the Character
methods such as isDigit.
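The following is not the downloadable isLegal itself, just a guess at how such a check could be written with indexOf on a String of legal characters, plus the charAt and Character.isDigit alternative. The class LegalChars and both method signatures are assumptions made for illustration.

```java
// Hypothetical sketch; the real isLegal may differ in name, signature and behaviour.
class LegalChars {
    /** True if every char of candidate appears somewhere in legalChars. */
    static boolean isLegal(String candidate, String legalChars) {
        for (int i = 0; i < candidate.length(); i++) {
            if (legalChars.indexOf(candidate.charAt(i)) < 0) {
                return false; // found a char that is not on the legal list
            }
        }
        return true; // note: the empty String counts as legal
    }

    /** True if every char of candidate is a digit according to Character.isDigit. */
    static boolean isAllDigits(String candidate) {
        for (int i = 0; i < candidate.length(); i++) {
            if (!Character.isDigit(candidate.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isLegal("cab", "abc")); // true
        System.out.println(isLegal("cat", "abc")); // false
        System.out.println(isAllDigits("2024"));   // true
        System.out.println(isAllDigits("20x4"));   // false
    }
}
```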
String borrows some convenience regex methods, such as split,
and replaceFirst. Normally you would use
the more efficient java.util.regex methods
where you precompile your Pattern and reuse
it. The String versions are for one-shot use
where efficiency is not a concern.
Not only replaceAll but, in older JDKs, replace too
is implemented in an inefficient way, compiling a regex pattern every
time it is invoked.
So, if you are going to use the same replaceAll more than once (or replace on an older JDK), you should use a
separate regex compile done only once.
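Compiling the regex once and reusing it might look like this. It is a sketch; the class PrecompiledReplace, the constant WHITESPACE_RUN and the method squeeze are made-up names, while Pattern.compile and Matcher.replaceAll are the standard java.util.regex calls.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class PrecompiledReplace {
    // Compiled once when the class loads, then shared by every call.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    /** Collapses each run of whitespace to a single space. */
    static String squeeze(String s) {
        return WHITESPACE_RUN.matcher(s).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(squeeze("a  b\t\tc")); // a b c

        // One-shot equivalent that recompiles the pattern on every call:
        System.out.println("a  b\t\tc".replaceAll("\\s+", " ")); // a b c
    }
}
```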
- String.replaceAll(
a, b ) is not the method to use to replace all
instances of a with b. Instead you use String.
replace( a, b ). replaceAll
is a convenience regex method.
- x.replace
( a, b ) does not modify x. It creates a new modified String.
This is true of all String methods. Strings are immutable. No method can modify
the original String.
- Consider lastIndexOf( s,
fromOffset ). fromOffset
is the offset near the end of the string where to start searching
backwards for a match earlier in the string. It is not the index of
the start of a substring to search, i.e. the place where the reverse
searching stops.
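A small sketch of the last two gotchas. The class LastIndexDemo and the helper lastAtOrBefore are invented names; the helper merely wraps the standard two-argument lastIndexOf.

```java
// Hypothetical demo class for illustration only.
class LastIndexDemo {
    /** Finds the last match that starts at or before fromOffset, or -1. */
    static int lastAtOrBefore(String s, String target, int fromOffset) {
        return s.lastIndexOf(target, fromOffset);
    }

    public static void main(String[] args) {
        String s = "abcabcabc";
        System.out.println(s.lastIndexOf("abc"));        // 6
        System.out.println(lastAtOrBefore(s, "abc", 4)); // 3, searching backwards from 4
        System.out.println(lastAtOrBefore(s, "abc", 2)); // 0

        // replace returns a new String; the original is never modified.
        String t = s.replace("abc", "x");
        System.out.println(t); // xxx
        System.out.println(s); // abcabcabc
    }
}
```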
I have three ideas to improve the efficiency of the way String
is implemented:
- When a String is created the JVM (Java Virtual Machine)
copies a char[] into a new
char[] attached to the String.
It does not simply put a reference to the char[]
into the String because it is worried
somebody might subsequently change the contents of the char[],
thus violating the immutability contract of String.
However, almost never does anyone change a char[]
after feeding it to new String. I think
the JVM could take advantage of that. If it knew for sure no one
would change it, it could safely and rapidly just insert a
reference. If it was not sure, it might just insert the reference
anyway and put a lock on it, so that if anyone ever did try to
change it, they would be blocked, the char[]
could be copied and the write completed to the old char[]
and the String could attach itself to
the new copy of the char[].
- Interned Strings
are dangerous. Both interned and
non-interned Strings have the same type,
namely String. You must manually
manage the intern method to control
precisely when the interning is done. You must use either == or equals, as appropriate.
If you get it wrong, there are no error messages, just puzzling
results. So I suggest using a separate type, Interned, for
interned Strings. You would use == both for interned and non-interned String
compare. They would be automatically interned as appropriate (managed in
much the way hashCode is).
- There are two kinds of String:
  - sequential: Strings you either treat
  as an atomic whole, or treat char by char starting from the
  beginning.
  - random: Strings you operate on
  randomly with charAt or substring.
I suggest sequential Strings be stored
internally in UTF-8 to conserve RAM (Random Access Memory). I suggest random Strings be
stored in 32-bit UTF-32 code points for ease of processing. The
compiler would have to guess which type a given String
should be. It might have to change its mind and convert the String
part way through run time. It might even store some Strings in both
forms. The readers and writers when using UTF-8 should be able to
take advantage of the fact no translation is necessary. You could
have two separate types for sequential (UString) and random (String)
in the Java language, but that would be too onerous for programmers.