Unicode™ : Java Glossary


menu
Unicode Glyph Ranges	BOMs : Byte Order Marks
What Is Unicode?	What’s Missing From Unicode?
Symbols	Unicode Editors
Arrows	Viewer Applet
Hyphens	Notepad Unicode
Viewing Glyphs	Books
Creating Unicode Documents	Links
Unicode Literals in Java

Unicode 9.0 Glyph Ranges

The links to PDFs show all the Unicode glyphs. The table is organised in groups for each language or function.
16-bit and 32-bit Unicode Glyphs
in Downloadable Acrobat PDF (Portable Document Format) Files
hex code (⁶ = new in Unicode 6)	size	Sample Glyph	Description
0000	383k	A	Basic Latin
0080	412k		Latin-1 Supplement: accented letters, basic symbols
0100	191k		Latin Extended-A: Esperanto accented letters
0180	362k	Ɖ	Latin Extended-B: African
0250	246k		IPA Extensions: International Phonetic Alphabet
02B0	195k	ˤ	Spacing Modifier Letters
0300	214k		Combining Diacritical Marks
0370	281k	Ω	Greek
0400	242k	Д	Cyrillic
0500	115k	Ԏ	Cyrillic Supplement
0530	106k	Մ	Armenian
0590	109k	א	Hebrew
0600	172k	ص	Arabic
0700	91k	ܛ	Syriac
0780	74k	ޘ	Thaana: Maldives
⁶0840	69k		Mandaic: Iraq and Iran
0900	110k		Devanagari: Hindi
0980	103k	ত	Bengali
0A00	98k	ਣ	Gurmukhi: Punjabi
0A80	96k	ઇ	Gujarati: Gujarat
0B00	105k	ଚ	Oriya: Odiya Orissa
0B80	136k		Tamil: India and Sri Lanka
0C00	137k	మ	Telugu: Andhra Pradesh
0C80	122k		Kannada: Karnataka
0D00	123k		Malayalam: Kerala
0D80	104k		Sinhala: Sri Lanka
0E00	100k	ฏ	Thai
0E80	100k	ຟ	Lao
0F00	219k	ཌ	Tibetan
1000	116k		Myanmar
10A0	100k	Ⴇ	Georgian
1100	131k	ᄘ	Hangul Jamo: Korean
1200	179k	ጜ	Ethiopic
13A0	85k	Ꮡ	Cherokee
1400	183k		Canadian Aboriginal Syllabic
1680	106k	ᚔ	Ogham: Old Irish
16A0	122k		Runic
1700	73k		Tagalog: Filipino
1720	76k		Hanunoo: Mindoro in the Philippines
1740	68k		Buhid: Mindoro in the Philippines, used to write Tagalog
1760	73k		Tagbanwa: Philippines
1780	128k	ផ	Khmer: Cambodian
1800	146k	ᠠ	Mongolian
1900	83k		Limbu: Tibeto-Burman language of Nepal and Sikkim
1950	72k	ᥠ	Tai Le: China
19E0	75k	᧤	Khmer Symbols: Cambodian
⁶1BC0	69k		Batak: Sumatra Indonesia
1D00	250k	ᴂ	Phonetic Extensions
1E00	247k	Ḍ	Latin Extended Additional: dotted letters, letters with two accents.
1F00	175k	ἁ	Greek Extended
2000	283k	’	General Punctuation
2070	108k	₅	Superscripts and Subscripts
20A0	238k		Currency Symbols: including new 20b9 Rupee
20D0	145k		Combining Marks for Symbols
2100	276k	™	Letterlike Symbols
2150	184k		Number Forms ⅐ ⅑ ⅒
2190	109k		Arrows
2200	309k		Mathematical Operators: ∇ del, ∈ element, ∃ there exists, ∀ for all, ∪ union, ∩ intersection, ∋ contains member, ⋅ dot product, ∴ therefore, √ square root, ∧ logical and, ∨ logical or, ∑ summation, ∏ product, ≠ not equal, ≤ less or equal
2300	263k		Miscellaneous Technical: APL operators.
2400	88k		Control Pictures: for displaying unprintable ASCII control characters.
2440	73k		Optical Character Recognition
2460	140k		Enclosed Alphanumerics: see Dingbats 2700 for more circled digits.
2500	121k		Box Drawing: single/double lines also triangles
2580	78k		Block Elements
25A0	182k		Geometric Shapes
2600	337k		Miscellaneous Symbols: chess, astrology, I-ching, telephones, hazards, religious symbols, hammer and sickle.
2700	215k		Dingbats: asterisks, ornaments, hands, right-pointing arrows, pencils, scissors, pens. See 2460 for more circled digits.
27C0	150k		Miscellaneous Mathematical Symbols-A: including SQL left, right and full joins.
27F0	95k	⟰	Supplemental Arrows-A
2800	95k		Braille Patterns
2900	134k	⤱	Supplemental Arrows-B
2980	196k		Miscellaneous Mathematical Symbols-B
2A00	164k		Supplemental Mathematical Operators: including variants of + - × ÷
2B00	158k		Miscellaneous Symbols and Arrows
2C00	128k		Glagolitic: pre-Cyrillic Bulgarian
2E80	184k	⺮	CJK Radicals Supplement: Chinese Japanese Korean
2F00	184k	⼮	Kangxi Radicals: fragments combined to write Chinese
2FF0	67k	⿱	Ideographic Description Characters
3000	206k	〖	CJK Symbols and Punctuation: Chinese Japanese Korean
3040	142k		Hiragana: (Japanese) Used when no Kanji character exists.
30A0	148k		Katakana: (Japanese) mainly for foreign names
3100	125k		Bopomofo: phonetic script for Mandarin
3130	124k	ㄱ	Hangul Compatibility Jamo: Korean
3190	124k	㆙	Kanbun: used by Japanese to annotate classic Chinese
31A0	102k	ㆥ	Bopomofo Extended: phonetic script for Mandarin
31F0	84k	ㇻ	Katakana Phonetic Extensions: Japanese
3200	250k	㈄	Enclosed CJK Letters and Months: Chinese Japanese Korean
3300	261k	㌗	CJK Compatibility: Chinese Japanese Korean
3400	5781k	㖣	CJK Unified Ideographs Extension A: Chinese Japanese Korean
4DC0	75k	䷱	Yijing Hexagram Symbols: I Ching symbols
4E00	25871k		CJK Unified Ideographs: Chinese Japanese Korean including Kanji digits 零一二三四五六七八九
A000	424k	ꅖ	Yi Syllables: classical Yi language of China
A490	83k	꒶	Yi Radicals: classical Yi language of China
AB00	79k		Ethiopic Extended-A
AC00	701k	귖	Hangul Syllables: Korean
D800	23k		High Surrogates
DC00	23k		Low Surrogates
E000	23k		Private Use Area
F900	590k	麟	CJK Compatibility Ideographs: Chinese Japanese Korean
FB00	116k		Alphabetic Presentation Forms: ligatures including Hebrew
FB50	236k	ﱺ	Arabic Presentation Forms-A
FE00	69k		Variation Selectors: non-printing control characters
FE20	82k		Combining Half Marks
FE30	129k	︾	CJK Compatibility Forms: Chinese, Japanese, Korean vertical brackets
FE50	148k	﹟	Small Form Variants: small punctuation
FE70	117k	ﺚ	Arabic Presentation Forms-B
FF00	274k	Ｈ	Halfwidth and Fullwidth Forms: wide and narrow letters, digits and punctuation
FFF0	72k		Specials: byte order marks.
00010000	93k		Linear B Syllabary: ancient Cretan
00010080	123k		Linear B Ideograms
00010100	84k		Aegean Numbers
00010300	102k		Old Italic
00010330	97k		Gothic
00010380	100k		Ugaritic: Cuneiform
00010400	108k	𐐁	Deseret: Mormon
00010450	112k	𐑻	Shavian: George Bernard Shaw’s alphabet
00010480	102k	𐒁	Osmanya: Somali
00010800	106k		Cypriot Syllabary
⁶00011000	81k		Brahmi: ancient Indian scripts
⁶00016800	322k		Bamum Supplement: Cameroon
⁶0001B000	95k		Kana Supplement: Japanese
0001D000	230k		Byzantine Musical Symbols
0001D100	172k		Musical Symbols
0001D300	125k	𝍎	Tai Xuan Jing Symbols: Look like I-Ching hexagrams truncated to four lines.
0001D400	418k		Mathematical Alphanumeric Symbols
⁶0001F0A0	106k		Playing Cards
⁶0001F300	625k		Miscellaneous symbols and pictographs: including pile of poo.
⁶0001F600	119k		Emoticons
⁶0001F680	130k		Transport and Map Symbols
⁶0001F700	193k		Alchemical symbols
00020000	28317k		CJK Unified Ideographs Extension B: Chinese Japanese Korean
⁶0002B740	212k		CJK Unified Ideographs Extension D: Chinese Japanese Korean
0002F800	548k		CJK Compatibility Ideographs Supp.: Chinese Japanese Korean
000E0000	136k		Tags: control characters.
000E0100	84k		Variation Selectors Supp.: non-printing control characters
000F0000	23k		Supplementary Private Use Area-A
00100000	23k		Supplementary Private Use Area-B
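
The ranges in the table correspond to what Java calls Unicode blocks. As a minimal sketch, the following looks up the block for a few code points with Character.UnicodeBlock.of. The class name BlockLookup and the helper blockName are invented for this illustration. Which blocks Java recognises depends on the Unicode version your JDK supports, so newer ranges may come back as unassigned on older JDKs.

```java
public class BlockLookup {

    /** Names the Unicode block holding a code point, or "UNASSIGNED" if Java knows none. */
    static String blockName(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == null ? "UNASSIGNED" : block.toString();
    }

    public static void main(String[] args) {
        // Sample code points: Latin A, Greek Omega, Cyrillic De, musical G clef.
        int[] samples = {0x0041, 0x03A9, 0x0414, 0x1D11E};
        for (int cp : samples) {
            System.out.printf("U+%04X is in %s%n", cp, blockName(cp));
        }
    }
}
```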

What Is Unicode?

Informally, Unicode is a 16-bit character encoding, with surrogate pairs to handle characters that do not fit in 16 bits, used internally in programs written in Java. More precisely, Unicode is not a character encoding, but a character set whose code points run from 0 to 0x10FFFF. That range needs 21 bits, though code points are usually stored in 32. UTF-8, UTF-16 and UTF-32 are character encodings in which the Unicode character set can be encoded.
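
To make the difference between the character set and its encodings concrete, here is a small sketch that counts how many bytes the same character takes in each encoding. The class name EncodingSizes and the helper encodedLength are made up for this example. UTF-32BE is looked up by name because it is not one of the StandardCharsets constants, so its presence in your JDK is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizes {

    /** Number of bytes the text occupies in the given encoding. */
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Assumption: the running JDK ships a UTF-32BE charset.
        Charset utf32 = Charset.forName("UTF-32BE");
        String division = "\u00f7";                            // fits in one 16-bit char
        String clef = new String(Character.toChars(0x1D11E));  // needs a surrogate pair
        for (String s : new String[] {division, clef}) {
            System.out.printf("U+%04X: UTF-8 %d, UTF-16 %d, UTF-32 %d bytes%n",
                    s.codePointAt(0),
                    encodedLength(s, StandardCharsets.UTF_8),
                    encodedLength(s, StandardCharsets.UTF_16BE),
                    encodedLength(s, utf32));
        }
    }
}
```

The big-endian variants are used so that no byte order mark inflates the counts.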

See the example glyphs, in PDF format, which require Adobe Acrobat to view. They are also available as an ASCII text file describing the glyphs, with cross references to similar glyphs. Unicode does not standardise the precise shapes of the letters, i.e. the glyphs. It does, however, provide example glyphs. This distinction is most important for unified Han, which encodes the ideographs shared by Chinese, Japanese and Korean. The three languages use the same Unicode code points, but quite different looking renderings of the characters. These differences are handled by the font designer, who uses a Chinese, Japanese or Korean style.

Unicode is sometimes called UCS (Universal Character Set) or ISO (International Organization for Standardization) 10646. Unicode allows Java to handle international characters for most of the world’s living languages, including Arabic, Armenian, Bengali, Bopomofo, Chinese (via unified Han), Cyrillic, English, Georgian, Greek, Gujarati, Gurmukhi, Hebrew, Hindi (Devanagari), Japanese (Hiragana, Katakana and Kanji via unified Han), Kannada, Korean (Hangul, plus Hanja via unified Han), Lao, Malayalam, Oriya, Tamil, Telugu, Thai, Tibetan… Unicode will make it much easier for non-English speaking programmers to write programs for English speaking users and vice versa.

To get musical symbols you need 32-bit Unicode support, i.e. support for code points above 0xFFFF.

Emoji are scattered all over the map, not collected together in a block the way everything else is.

In Java, you get at the exotic characters by encoding them in hex in your strings like this: \u00f7\u2713 to produce ÷ ✓. See String literals for more details.
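
As a minimal sketch of those escapes in use (the class name EscapeDemo is invented here), the program below builds the same two characters and prints their code points rather than the glyphs, since what a console can display varies.

```java
public class EscapeDemo {

    /** The two characters from the text, written as Unicode escapes. */
    static String sample() {
        return "\u00f7\u2713";
    }

    public static void main(String[] args) {
        String s = sample();
        // Print code points rather than glyphs, since console fonts and
        // encodings vary.
        for (int i = 0; i < s.length(); i++) {
            System.out.printf("char %d = U+%04X%n", i, (int) s.charAt(i));
        }
        // An escape also works in a char literal.
        char check = '\u2713';
        System.out.println(check == s.charAt(1)); // prints true
    }
}
```

One thing to watch: the compiler translates these escapes before it does anything else, even inside comments, so a malformed escape in a comment is a compile error.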

In HTML (Hypertext Markup Language), you get at the exotic characters by encoding them as entities such as &divide;&#x2713; to produce ÷ ✓.

Unicode Symbols

There are even codes for:

A sampling of Unicode symbols
apple		'\uf8ff' unofficial, private use area, Apple fonts only
British pound sign	£	'\u00a3'
checkmark	✓	'\u2713'
copyright	©	'\u00a9'
degree	°	'\u00b0'
dharma wheel	☸	'\u2638'
division	÷	'\u00f7'
bullet	•	'\u2022'
euro	€	'\u20ac'
female	♀	'\u2640'
funeral urn	⚱	'\u26b1'
heart	♥	'\u2665'
bullet (as mathematical operator)	∙	'\u2219'
infinity	∞	'\u221e'
integral	∫	'\u222b'
male	♂	'\u2642'
pi	π	'\u03c0'
PI	Π	'\u03a0'
registered trade mark	®	'\u00ae'
sun	☀	'\u2600'
telephone	☎	'\u260e'
trademark	™	'\u2122'

This does not mean your fonts will support all these wonders, of course.

In addition there are all kinds of interesting special characters such as: Alphabetic Presentation Forms, APL (A Programming Language), Arrows, Bengali, Block Elements, Box Drawing, Braille Patterns, Byzantine Musical Symbols, Combining Diacritical Marks, Combining Half Marks, Combining Marks for Symbols, Control Pictures (icons for control chars), Currency Symbols, Dingbats, Enclosed Alphanumerics, General Punctuation, Geometric Shapes, Halfwidth and Fullwidth Forms, High Surrogates, Ideographic Description Characters, IPA (International Phonetic Alphabet) Extensions, Letterlike Symbols, Low Surrogates, Mathematical Alphanumeric Symbols (32-bit Unicode), Mathematical Operators, Mathematical Symbols, Miscellaneous Symbols (astrology, chess, playing cards), Miscellaneous Technical (del, grad, integral), Musical Symbols, Number Forms (e.g. Roman numerals), OCR (Optical Character Recognition: the OCR-A (Optical Character Recognition font-A) and MICR (Magnetic Ink Character Recognition) characters used in magnetic ink cheque encoding), Old Italic, Runic, Small Form Variants, Spacing Modifier Letters, Specials, Superscripts and Subscripts, Tags (invisible language-tag control characters), Unified Canadian Aboriginal Syllabic and Variation Selectors.

Unicode Arrows

There are also arrows:

Unicode arrow characters
←	\u2190
↑	\u2191
→	\u2192
↓	\u2193
↔	\u2194
↕	\u2195
↢	\u21a2
↬	\u21ac
↭	\u21ad
↰	\u21b0
↶	\u21b6
⇅	\u21c5
⇎	\u21ce
⇐	\u21d0
⇑	\u21d1
⇒	\u21d2
⇓	\u21d3
⇔	\u21d4
⇕	\u21d5
⇜	\u21dc

There are even more arrows defined in Unicode: 2190-21ff. To use these characters in HTML, you need to code them as &… entities.
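
If you would rather not look up entity names, hex numeric character references work for any code point. The following is a sketch only: the class name HtmlEntities and the method toNumericEntities are hypothetical, and the method does not escape the ASCII characters &, < and >, which real HTML output would also need.

```java
public class HtmlEntities {

    /** Replaces every non-ASCII code point with a hex numeric character reference. */
    static String toNumericEntities(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp < 0x80) {
                sb.append((char) cp);
            } else {
                sb.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
            // Step by one or two chars, so surrogate pairs stay together.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNumericEntities("left \u2190 right \u2192"));
    }
}
```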

Hyphens

There is also a variety of hyphen characters:

Unicode hyphen characters
-	\u002d	hyphen-minus
	\u00ad	soft-hyphen
‐	\u2010	hyphen
‑	\u2011	non-breaking hyphen
‒	\u2012	figure dash
–	\u2013	en dash
—	\u2014	em dash
−	\u2212	minus sign
𐆑	0x10191 (\ud800\udd91)	roman uncia sign

Viewing Unicode Glyphs

Nic Fulton of Reuters has written a Java test applet that can display all 64 thousand 16-bit Unicode characters, including the Chinese/Korean Han. How many of them actually display on your screen depends on the font handling ability of your browser and operating system and on which fonts you have installed. In Java programs, Unicode characters that are awkward to type are represented in the form '\uffff', with four hex digits. Ordinary characters like 'A' are actually 16-bit Unicode too.

Creating Unicode Documents

How do you create and edit the various flavours of Unicode documents? You can create them in some specific encoding, then convert them. To write a little utility to do that, read up on encoding and ask the File I/O Amanuensis for sample code. You can use lowly Notepad in Windows NT/W2K/XP, but not in earlier Windows versions, to edit existing documents. To start a new document, you would have to acquire an almost empty Unicode document. Notepad is even clever enough to deal with byte order (endian) marks. Recent versions of MS Word in Windows NT/W2K/XP/W2K3 also work.
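
A conversion utility of the kind described can be quite short. This is a sketch, not the File I/O Amanuensis output: the class name Transcode, its two methods and the hard-coded ISO-8859-1 to UTF-8 choice in main are all assumptions made for illustration. It reads the whole file into memory, and malformed input is silently replaced rather than reported.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Transcode {

    /** Decodes bytes from one encoding and re-encodes them in another. */
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    /** Whole-file version; reads everything into memory, so only for modest files. */
    static void transcodeFile(Path in, Charset from, Path out, Charset to) throws IOException {
        Files.write(out, transcode(Files.readAllBytes(in), from, to));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: java Transcode input.txt output.txt");
            return;
        }
        // The encodings are hard-coded here purely for illustration.
        transcodeFile(Paths.get(args[0]), StandardCharsets.ISO_8859_1,
                      Paths.get(args[1]), StandardCharsets.UTF_8);
    }
}
```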

Java

See the literals section for a full explanation of how to code 16-bit Unicode characters in Java programs.

Java does not have 32-bit String literals like C-style code points, e.g. \U0001d504. Note the capital U, versus the usual 16-bit form such as \ud504. I wrote the SurrogatePair applet to convert C-style code points to arcane surrogate pairs, to let you use 32-bit Unicode glyphs in your programs.
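
The conversion itself can also be done with the standard Character class. The sketch below is not the SurrogatePair applet; the class name SurrogateDemo and the method toJavaEscapes are invented here. It prints the escapes you would paste into a Java string literal for a given code point, for instance \ud835\udd04 for 0x1d504.

```java
public class SurrogateDemo {

    /** Spells a code point as the Java escapes needed to write it in source code. */
    static String toJavaEscapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // toChars yields one char for 16-bit code points, a high/low surrogate pair otherwise.
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04x", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toJavaEscapes(0x1D504));
        System.out.println(toJavaEscapes(0x10191));
        // Round trip: the pair combines back into the original code point.
        char[] pair = Character.toChars(0x1D504);
        System.out.println(Integer.toHexString(Character.toCodePoint(pair[0], pair[1])));
    }
}
```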

Byte Order Marks

There are two different layers: the Unicode character set, which assigns numbers to characters, and UTF (Unicode Transformation Format), which describes how you encode these numbers in a file. Byte order marks belong to the UTF encoding layer, not to the character set. See more on BOMs (Byte Order Marks).
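
Here is a sketch of how a program might sniff the mark at the start of a file. The class name BomSniffer and both methods are hypothetical. Treat the result as a hint only: many files have no mark at all, and a UTF-16LE file whose first character is NUL would be mistaken for UTF-32LE.

```java
import java.nio.charset.StandardCharsets;

public class BomSniffer {

    /** True if data begins with the given byte values. */
    static boolean startsWith(byte[] data, int... mark) {
        if (data.length < mark.length) {
            return false;
        }
        for (int i = 0; i < mark.length; i++) {
            if ((data[i] & 0xFF) != mark[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the encoding implied by a leading byte order mark, or null if there is none. */
    static String detectBom(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        // Test UTF-32 first: its little-endian mark begins with the UTF-16LE mark.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    public static void main(String[] args) {
        // Java's plain UTF-16 encoder writes a big-endian mark first.
        System.out.println(detectBom("hi".getBytes(StandardCharsets.UTF_16)));
        // Java's UTF-8 encoder writes no mark, so this prints null.
        System.out.println(detectBom("hi".getBytes(StandardCharsets.UTF_8)));
    }
}
```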

What’s Missing From Unicode?

There are no Unicode glyphs for the following:

bold
italic
Small caps
Old style numerals
Variant forms for Arabic letters used at the beginning, middle and end of words.

Unicode is not concerned with typesetting, just with raw text. In other words, it is about characters (logical letters), not glyphs (how letters are precisely shaped). Unicode has various flavours of digits that look much the same, but they are intended to be used in different contexts.
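
Java treats all these flavours of decimal digit alike when it comes to their numeric value. As a sketch (the class name DigitFlavours and the method parseDigits are invented for this example, and overflow is not handled), Character.digit recovers the value whichever script the digit comes from.

```java
public class DigitFlavours {

    /** Parses a run of decimal digits from any script into an int. */
    static int parseDigits(String text) {
        int value = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int digit = Character.digit(cp, 10);
            if (digit < 0) {
                throw new NumberFormatException("not a decimal digit: U+" + Integer.toHexString(cp));
            }
            value = value * 10 + digit;
            i += Character.charCount(cp);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(parseDigits("42"));            // ASCII digits
        System.out.println(parseDigits("\u0664\u0662"));  // Arabic-Indic digits
        System.out.println(parseDigits("\uff14\uff12"));  // fullwidth digits
    }
}
```

All three calls print 42.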

To typeset, you need separate fonts to handle such variants, with the letters encoded with the same Unicode character. The word processor automatically selects the appropriate variant. I don’t know the mechanism by which a word processor can tell which fonts are related and which styles and font-weights each supports. Presumably it is encoded somehow in the font files.

To a large extent ligatures are handled outside Unicode by automatically combining Unicode characters, though there are a few ligatures that rate a special Unicode character.

Unicode Editors

Where do Unicode files come from? You can create them with:

a custom Java program that uses an OutputStreamWriter (or, from Java 11, a FileWriter) with UTF-16, UTF-16BE, UTF-16LE, or UTF-8 encoding.
native2ascii.exe, Oracle’s encoding translation utility.
Eclipse IDE (Integrated Development Environment).
JEdit: a programmer’s text editor that also supports a few dozen other encodings and has piles of plugins for various purposes, plus syntax highlighting for lots of languages.

You can edit or create UTF-8 or UTF-16 files with Windows Notepad.

Unicode 8.0.0

At the time of writing, Unicode 8.0.0 was the latest version of the Unicode Standard. JDK (Java Development Kit) 1.8.0_131 supports version 6.2.0, though I doubt Java will need to change at all to support 8.0.0. All later versions of Unicode do is add more potential characters to fonts.

Books

Recommended book: The Unicode 5.0 Standard
author	The Unicode Consortium (founded 1991)
ISBN	978-0-321-48091-0
format	hardcover
publisher	Addison-Wesley
published	2006-11-19

Unicode 5.0 adds the following:

Security mechanisms.
A standard collation algorithm for various national orderings.
A common locale data repository.
Improvements to the encoding model for UTF-8.
Rigorous stability of case folding.
A systematic framework covering combining characters, Unicode strings, line breaking and segmentation.

When this was written, the current version was 8.0, and no comprehensive book had been published since this one.


BOM
Characters vs Glyphs
codepoint
common Fonts
complete list of Unicode character names in HTML
coverage fonts
CSS default Fonts
Emoji in Unicode
encoding
endian
entities
Esperanto Fonts
Fonts
Free Fonts
Google Noto Fonts: available for most of the world’s languages
hex entities
how emoji render on different OSes
HTML Cheat Sheet
HTML5 entities
i18guy
Joel Spolsky on Unicode
last resort font
ligature
literal
logical Fonts
Macintosh OS X Fonts
monospaced Fonts
Notepad Unicode
nt
office Fonts
ornament
physical Fonts
PostScript Fonts
proportional Fonts
Reuters Unicode Test Applet
sans serif Fonts
searchable ASCII text list of Unicode chars
String literals
surrogate pair
Tiresias Fonts
Unicode 5.2
Unicode 6.1: supported in Java 1.7
Unicode 6.2
Unicode 6.3
Unicode 7.0
Unicode 8.0
Unicode as a set of text files
Unicode character search: search by partial name
Unicode Code Chart: arranged alphabetically
Unicode FAQ
Unicode on one big page
Using Unicode for Math
UTF
UTF-16 explanation
Vista Fonts
XP Fonts

	This page is posted on the web at:	http://mindprod.com/jgloss/unicode.html