codePoint : Java Glossary

codePoint
Sometimes written as two words: code point. Unicode started out as a 16-bit code with 64K (65536) possible characters. At first, this seemed more than enough to encode all the world’s alphabets, since people had been getting by with 8-bit charsets with a gamut of only 256 possible characters. Unicode was such a success that scholars soon wanted to encode dead scripts such as cuneiform as well, and the 65536 slots proved too few.

Unicode was extended to a 21-bit range of code points, 0 through 0x10ffff, which Java holds in 32-bit ints. The corresponding UTF-16 encoding was also extended, with a clumsy system of surrogate pairs: two 16-bit chars that together encode one character above 0xffff.
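To make the surrogate scheme concrete, here is a minimal sketch of the arithmetic that splits a code point above 0xffff into its two 16-bit halves. The class and method names (SurrogateMath, toSurrogates) are made up for illustration; Character.toChars is the real JDK call that does the same job, and the sketch checks itself against it.

```java
import java.util.Arrays;

// Illustrative only: SurrogateMath and toSurrogates are hypothetical names.
final class SurrogateMath {
    /** Split a code point in the range 0x10000..0x10FFFF into a high and a low surrogate. */
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20 significant bits remain
        char high = (char) (0xD800 + (offset >>> 10));   // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int cp = 0x1F600; // GRINNING FACE
        char[] mine = toSurrogates(cp);
        char[] jdk = Character.toChars(cp);
        System.out.printf("%04x %04x%n", (int) mine[0], (int) mine[1]); // prints "d83d de00"
        System.out.println(Arrays.equals(mine, jdk));                   // prints "true"
    }
}
```

In real code you would simply call Character.toChars; the hand-rolled version is here only to show where the bits go.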

The term codepoint in Java tends to be used to mean a slot in the full Unicode assignment, held in a 32-bit int, though the term is equally valid for a slot in 16-bit Unicode or any other character set.

Java now straddles the 16-bit and 32-bit worlds. You might think Java would now have a 32-bit analog to Character, perhaps called CodePoint, and a 32-bit analog to String, perhaps called CodePoints, but it does not. Instead, Strings and char[] are permitted to contain surrogate pairs, each of which encodes a single codepoint above 0xffff.
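A small sketch shows what that straddling looks like in practice: a String holding one ordinary letter and one character above 0xffff reports three chars but only two code points. The class name PairInString and its describe helper are invented for this example; length, codePointCount and the Character surrogate tests are standard JDK methods.

```java
// Illustrative only: PairInString and describe are hypothetical names.
final class PairInString {
    /** Report the size of a String in 16-bit chars and in code points. */
    static String describe(String s) {
        return "chars=" + s.length()
             + " codePoints=" + s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'A' followed by U+1D11E MUSICAL SYMBOL G CLEF, which needs a surrogate pair.
        String s = "A" + new String(Character.toChars(0x1D11E));
        System.out.println(describe(s));                            // prints "chars=3 codePoints=2"
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // prints "true"
        System.out.println(Character.isLowSurrogate(s.charAt(2)));  // prints "true"
    }
}
```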

StringBuilder.appendCodePoint( int codepoint ) will accept 32-bit codepoints to append.

StringBuilder.append( int number ) just converts the number to a decimal String and appends that, which is not what you want!
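The difference between the two methods is easy to see side by side. This is a minimal sketch; the class AppendDemo and its two helper methods are invented names, while append and appendCodePoint are the real StringBuilder methods.

```java
// Illustrative only: AppendDemo and its helpers are hypothetical names.
final class AppendDemo {
    /** Appends the decimal digits of the number, e.g. "128512". */
    static String withAppend(int codePoint) {
        return new StringBuilder().append(codePoint).toString();
    }

    /** Appends the character itself, as one char or as a surrogate pair. */
    static String withAppendCodePoint(int codePoint) {
        return new StringBuilder().appendCodePoint(codePoint).toString();
    }

    public static void main(String[] args) {
        int cp = 0x1F600; // GRINNING FACE, decimal 128512
        System.out.println(withAppend(cp));                   // prints "128512"
        System.out.println(withAppendCodePoint(cp).length()); // prints "2", one surrogate pair
    }
}
```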

FontMetrics.charWidth( int codepoint ) will tell you the width in pixels to render a given codepoint.

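Character.isValidCodePoint( int codepoint ) will tell you only whether the value lies in the legal Unicode range 0 to 0x10ffff. To find out whether a character is actually assigned to that codepoint, use Character.isDefined( int codepoint ). Even that is no guarantee your Font will render it though; Font.canDisplay( int codepoint ) answers that question. Character.codePointAt and Character.codePointBefore let you deal with 32-bit codepoints encoded as surrogate pairs in char arrays. Most of the Character methods, such as toLowerCase, now have a version that accepts an int codepoint.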
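Here is a short sketch exercising those Character methods. Note that isValidCodePoint only checks the range 0 to 0x10ffff, while isDefined checks for an actual assignment. The class name CharacterCodePoints is invented; every method called is a standard java.lang.Character method. The sample values are Deseret letters, chosen only because they live above 0xffff and have upper and lower case.

```java
// Illustrative only: CharacterCodePoints is a hypothetical class name.
final class CharacterCodePoints {
    public static void main(String[] args) {
        // Range check versus assignment check.
        System.out.println(Character.isValidCodePoint(0x10FFFF)); // true, inside the Unicode range
        System.out.println(Character.isValidCodePoint(0x110000)); // false, beyond the range
        System.out.println(Character.isDefined(0x10FFFF));        // false, a permanent noncharacter
        System.out.println(Character.isDefined(0x10400));         // true, DESERET CAPITAL LETTER LONG I

        // Reading a surrogate pair out of a char array.
        char[] cs = { 'x', (char) 0xD801, (char) 0xDC00, 'y' };   // 'x', U+10400, 'y'
        System.out.println(Integer.toHexString(Character.codePointAt(cs, 1)));     // 10400
        System.out.println(Integer.toHexString(Character.codePointBefore(cs, 3))); // 10400

        // int overload of toLowerCase works above 0xffff.
        System.out.println(Integer.toHexString(Character.toLowerCase(0x10400)));   // 10428
    }
}
```

Asking codePointAt for the offset of the low surrogate (index 2 above) does not fail; it just hands back the lone surrogate value 0xdc00, so you must keep your offsets on character boundaries yourself.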

Iterating Over A String

What’s the problem? The way the JVM (Java Virtual Machine) represents the String internally is hidden. Traditionally it was a char[] array of 16-bit values; since Java 9 it may be a byte[] holding either Latin-1 or UTF-16. It could in theory be implemented as UTF-8 or int[]. However, to the programmer, String.length, String.charAt and String.codePointAt all index as if the representation were char[]. String.length tells you the length of the String in 16-bit chars; to get its length in code points you must call String.codePointCount, which scans the String. The programmer cannot directly ask for the 42nd code point in the String. He can only ask for the codepoint that starts at 16-bit offset 42. To get the 42nd code point, the programmer must iterate along the String from the beginning, or call String.offsetByCodePoints, which does the same walk for him. There is no internal index structure for the String, even for ones chock full of 32-bit characters.
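The usual idiom for walking a String one code point at a time is sketched below, along with a helper that finds the nth code point. The class CodePointWalk and its methods are invented names. String.codePointAt, Character.charCount and String.offsetByCodePoints are real JDK methods; offsetByCodePoints still scans from the starting offset, so it saves typing rather than time.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: CodePointWalk and its methods are hypothetical names.
final class CodePointWalk {
    /** Collect every code point in the String, stepping 1 or 2 chars at a time. */
    static List<Integer> codePoints(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(cp);
            i += Character.charCount(cp); // 2 for a surrogate pair, otherwise 1
        }
        return out;
    }

    /** Return the n-th code point (0-based); throws if n is out of range. */
    static int nthCodePoint(String s, int n) {
        int charOffset = s.offsetByCodePoints(0, n); // walks from the start
        return s.codePointAt(charOffset);
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";
        System.out.println(codePoints(s));             // prints "[97, 128512, 98]"
        System.out.println((char) nthCodePoint(s, 2)); // prints "b"
        System.out.println(s.charAt(2) == 'b');        // prints "false", offset 2 is a low surrogate
    }
}
```

On Java 8 or later, s.codePoints() gives you the same sequence as an IntStream if you prefer that style.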

32-bit literals

You might think you could simply embed 32-bit characters into Java String literals, the way you can in C with \Uxxxxxxxx instead of \uxxxx, but that method has not yet been made part of the Java language. Instead you must encode each such character as a pair of 16-bit surrogate characters, written as two consecutive \uxxxx escapes.
To make it easier for programmers to compose Java code with 32-bit characters embedded in String literals, I offer the online Surrogate Pair Amanuensis Applet. Source is provided so you can run it locally as a hybrid application or Applet.
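If you just want to see what such a literal looks like, or compute the escapes yourself, the following sketch may help. It is not the Amanuensis source; the class SurrogateEscape and its toEscape method are invented for this example and lean on Character.toChars to do the splitting.

```java
// Illustrative only: SurrogateEscape and toEscape are hypothetical names,
// not taken from the Surrogate Pair Amanuensis.
final class SurrogateEscape {
    /** Build the Java source escape(s) for a code point: one escape, or two for a pair. */
    static String toEscape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F600 GRINNING FACE written as a surrogate pair in a literal.
        String literal = "\uD83D\uDE00";
        System.out.println(literal.codePointAt(0) == 0x1F600); // prints "true"
        System.out.println(literal.length());                  // prints "2"

        // Compute the escapes to paste into source code.
        System.out.println(toEscape(0x1F600));
        System.out.println(toEscape('A'));
    }
}
```

If you would rather not hard-code the pair at all, new String( Character.toChars( 0x1F600 ) ) builds the same String at run time.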

You can get the freshest copy of this page from http://mindprod.com/jgloss/codepoint.html. Please send corrections and comments to Roedy Green.