UTF-8 is especially compact when most of your characters are in the range 0..0x7f (ordinary 7-bit ASCII (American Standard Code for Information Interchange)). It uses a mixture of 8, 16, 24 and 32-bit codes. UTF-8 and ISO-8859-1 encode the 7-bit characters 0x00…0x7f identically, but beyond that they are quite different. To the casual glance, UTF-8 looks like ISO-8859-1 sprinkled with odd combinations of two or three glyphs standing in for characters like “. You need a modern text editor that handles UTF-8 to view it properly. RFC 3629 officially describes the UTF-8 format.
If you’re viewing a file and it contains bits of gibberish beginning with â€ or Ã, chances are the file is encoded in UTF-8, but your viewer thinks it is in ISO-8859-1.
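A minimal sketch of how that gibberish arises: the text is encoded as UTF-8, then the bytes are decoded as ISO-8859-1. The class name MojibakeDemo and the method misread are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class MojibakeDemo {
    /** Encodes text as UTF-8, then decodes the same bytes with the wrong charset. */
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00E9 (e-acute) is stored in UTF-8 as the two bytes c3 a9.
        String garbled = misread(new String(Character.toChars(0x00e9)));
        // Each byte becomes its own ISO-8859-1 character: U+00C3 then U+00A9.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

One accented letter turns into two glyphs, and a character such as the euro sign turns into three, which is the pattern to look for.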
UTF-16 normally uses purely 16-bit codes, either big or little endian. It can be extended to also encode 32-bit Unicode.
UTF-32 uses purely 32-bit codes, either big or little endian.
UTF-7 encodes 16-bit Unicode using only 7-bit ASCII characters.
Byte Order Marks | UTF-32 |
How UTF-8 Works | DataOutputStream.writeUTF |
UTF-8 Encoding | Exploring |
UTF-8 Decoding | 32-bit Unicode |
UTF-8 Fine Points | Debugging |
UTF-7 | Notepad UTF |
UTF-16 | Links |
UTF BOM (Byte Order Mark) Unicode-encoding Endian Indicators | |
---|---|
0xfeff BOM as it appears encoded | Description |
ef bb bf | UTF-8. Endianness, strictly speaking, does not apply, though it uses a big-endian, most-significant-byte-first representation. |
fe ff | UTF-16 for 16-bit internal UCS-2, big endian, Java/network order. |
ff fe | UTF-16 for 16-bit internal UCS-2, little endian, Intel/Microsoft order. Note you must examine subsequent bytes to tell this apart from a UTF-32 BOM, since they both start ff fe. |
00 00 fe ff | UTF-32 for 32-bit internal UCS-4, big endian, Java/network order. |
ff fe 00 00 | UTF-32 for 32-bit internal UCS-4, little endian, Intel/Microsoft order. |
There are also variants of these encodings that have an implied endian marker.
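The table of byte order marks can be turned into a small sniffing routine. This is only a sketch: BomSniffer, detect and startsWith are hypothetical names, and the returned strings are plain labels rather than guaranteed charset names.

```java
// Hypothetical helper class for illustration only.
class BomSniffer {
    /** Returns the encoding implied by a leading BOM, or null if none is present. */
    static String detect(byte[] b) {
        // Test the 4-byte UTF-32 marks first: ff fe 00 00 also begins with the UTF-16 mark ff fe.
        if (startsWith(b, 0x00, 0x00, 0xfe, 0xff)) return "UTF-32BE";
        if (startsWith(b, 0xff, 0xfe, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(b, 0xef, 0xbb, 0xbf)) return "UTF-8";
        if (startsWith(b, 0xfe, 0xff)) return "UTF-16BE";
        if (startsWith(b, 0xff, 0xfe)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] b, int... prefix) {
        if (b.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((b[i] & 0xff) != prefix[i]) return false; // mask to compare as unsigned
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xef, (byte) 0xbb, (byte) 0xbf, 'h', 'i'};
        System.out.println(detect(sample)); // UTF-8
    }
}
```

A little-endian UTF-16 file whose first character is NUL would also start ff fe 00 00, so the UTF-32 answer is a best guess, not a proof.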
How UTF-8 Encoding Works | |||||
---|---|---|---|---|---|
Use | Range | Unicode Bit Assignment | UTF-8 Bit Assignment | bytes required to represent the character in UTF-8 | bits required to represent the character internally |
ASCII | 0 .. 0x007f | 00000000 0xxxxxxx | 0xxxxxxx | 1 | 7 |
Latin, Greek, Hebrew, Arabic | 0x0080 .. 0x07ff | 00000yyy yyxxxxxx | 110yyyyy 10xxxxxx | 2 | 11 |
Asian languages, symbols | 0x0800 .. 0xffff | zzzzyyyy yyxxxxxx | 1110zzzz 10yyyyyy 10xxxxxx | 3 | 16 |
Ugaritic, musical symbols. CodePoints required to access this range. | 0x1_0000 .. 0x1f_ffff | 00000000 000aaazz zzzzyyyy yyxxxxxx | 11110aaa 10zzzzzz 10yyyyyy 10xxxxxx | 4 | 21 |
future use: range not yet assigned. | 0x20_0000 .. 0x3ff_ffff | 000000bb aaaaaazz zzzzyyyy yyxxxxxx | 111110bb 10aaaaaa 10zzzzzz 10yyyyyy 10xxxxxx | 5 | 26 |
For example:
é Unicode 0x00e9 in UTF-8 is 0xc3a9.
ï Unicode 0x00ef in UTF-8 is 0xc3af.
€ Unicode 0x20ac in UTF-8 is 0xe282ac.
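The bit assignments in the table translate almost directly into shifts and masks. The following is a rough sketch, not the UTF8Encoder class this page refers to: Utf8Sketch, encode and hex are hypothetical names, and the code assumes a valid code point in 0..0x10ffff without checking for surrogates.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper class for illustration only.
class Utf8Sketch {
    /** Encodes one code point (assumed valid, 0..0x10ffff) into 1 to 4 UTF-8 bytes. */
    static byte[] encode(int cp) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(4);
        if (cp < 0x80) {
            out.write(cp);                          // 0xxxxxxx
        } else if (cp < 0x800) {
            out.write(0xc0 | (cp >> 6));            // 110yyyyy
            out.write(0x80 | (cp & 0x3f));          // 10xxxxxx
        } else if (cp < 0x10000) {
            out.write(0xe0 | (cp >> 12));           // 1110zzzz
            out.write(0x80 | ((cp >> 6) & 0x3f));   // 10yyyyyy
            out.write(0x80 | (cp & 0x3f));          // 10xxxxxx
        } else {
            out.write(0xf0 | (cp >> 18));           // 11110aaa
            out.write(0x80 | ((cp >> 12) & 0x3f));  // 10zzzzzz
            out.write(0x80 | ((cp >> 6) & 0x3f));   // 10yyyyyy
            out.write(0x80 | (cp & 0x3f));          // 10xxxxxx
        }
        return out.toByteArray();
    }

    /** Formats bytes as lower-case hex with no separators. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b & 0xff));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x00e9))); // c3a9
        System.out.println(hex(encode(0x00ef))); // c3af
        System.out.println(hex(encode(0x20ac))); // e282ac
    }
}
```

In real programs String.getBytes with StandardCharsets.UTF_8 does the same job and also handles surrogate pairs.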
The UTF8Encoder/UTF8Decoder example classes above do not handle 32-bit characters (aka code points). The IETF (Internet Engineering Task Force)’s RFC 2279 (obsolete, but it has easy-to-understand bit diagrams) and RFC 3629 explain the UTF-8 format.
You can edit or create UTF-8 files with Windows Notepad.
How UTF-16 Encoding Works | |||
---|---|---|---|
Unicode | UTF-16 | bytes required to represent the character | Notes |
00000000 yyyyyyyy xxxxxxxx | yyyyyyyy xxxxxxxx | 2 | for numbers in range 0x0000 to 0xffff just encode them as they are in 16 bits. |
000zzzzz yyyyyyyy xxxxxxxx | 110110zz zzyyyyyy 110111yy xxxxxxxx | 4 | for numbers above 16 bits, in the range 0x10000 to 0x10ffff, you have 21 bits to encode. This is reduced to 20 bits by subtracting 0x10000. The high order bits are encoded as a 16-bit base 0xd800 + high order 10 bits and the low order bits are encoded as a 16-bit base 0xdc00 + low order 10 bits. The resulting pair of 16-bit characters are in the so-called high-half zone or high surrogate area (the 2^10 = 1024-wide band 0xd800-0xdbff) and low-half zone or low surrogate area (the 2^10 = 1024-wide band 0xdc00-0xdfff). Characters with values greater than 0x10ffff cannot be encoded in UTF-16. Values in the bands 0xd800-0xdbff and 0xdc00-0xdfff are specifically reserved for use with UTF-16 for encoding high characters and don’t have any characters assigned to them. |
You can edit or create UTF-16 files with Windows Notepad.
Here is how you would encode 32-bit Unicode to UTF-16:
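A minimal sketch of that arithmetic, following the table: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. SurrogateSketch and toSurrogates are hypothetical names; the JDK's own Character.toChars performs the same conversion.

```java
// Hypothetical helper class for illustration only.
class SurrogateSketch {
    /** Splits a code point in 0x10000..0x10ffff into a high and a low surrogate. */
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10ffff) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = cp - 0x10000;                      // 20 bits remain
        char high = (char) (0xd800 + (v >> 10));   // base + high order 10 bits
        char low = (char) (0xdc00 + (v & 0x3ff));  // base + low order 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1d504);
        System.out.printf("%04x %04x%n", (int) pair[0], (int) pair[1]); // d835 dd04
    }
}
```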
UTF-32 is not of much practical use since any file using it is mostly zeroes. It is perhaps 3 times as bulky as UTF-8 with nothing special to recommend it. Perhaps it will catch on with the conspicuous consumers who designed XML.
Java does not have 32-bit String literals, like C-style code points, e.g. \U0001d504. Note the capital U vs the usual \ud504. I wrote the SurrogatePair applet to convert C-style code points to arcane surrogate pairs to let you use 32-bit Unicode glyphs in your programs.
Further, 0x00 is encoded as 0xc0 0x80 instead of 0x00, to keep C from getting confused reading such a file and thinking the 00 meant end-of-string.
But the biggest difference in Oracle’s writeUTF variant is in the handling of 32-bit codepoints. Most Java programs don’t use 32-bit codepoints, but if you do, beware! UTF-8 codes them as 4-byte sequences. Sun is coding them as 6-byte sequences! e.g. consider the encoding of 0x10302, standard UTF-8 gives:
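A short sketch that puts the two encodings of 0x10302 side by side, and also shows the two-byte encoding of 0x00. WriteUtfDemo, modified and hex are hypothetical names; note that writeUTF also prepends a 2-byte length.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class WriteUtfDemo {
    /** Bytes produced by DataOutputStream.writeUTF, including its 2-byte length prefix. */
    static byte[] modified(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        }
        return buf.toByteArray();
    }

    /** Formats bytes as space-separated lower-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x ", b & 0xff));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x10302));
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8))); // f0 90 8c 82
        System.out.println(hex(modified(s)));                        // 00 06 ed a0 80 ed bc 82
        System.out.println(hex(modified("\0")));                     // 00 02 c0 80
    }
}
```

The six bytes after the length prefix are simply the two surrogates 0xd800 and 0xdf02, each encoded as an ordinary 3-byte sequence.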
Why would Sun use such an inefficient encoding? I believe it is to be backward compatible with datastreams written prior to the introduction of codepoints, where the surrogate pairs were treated as just ordinary data characters. Sun has to be able to read files written by earlier versions of Java. At some point, Sun will have to deprecate writeUTF and invent something else that properly encodes 32-bit codepoints and has a scheme to handle arbitrarily long Strings, using a variable length count. Alternatively, they could use the sign bit of the count field as an indicator of the new format.
The writeUTF variant shows up in Serialized Objects, RMI (Remote Method Invocation) streams, class file formats…
Here is the output of the program if you are curious, but not so curious that you feel compelled to run the program yourself:
I bought an early book on Unicode and marvelled at the extravagant number of symbols for every imaginable purpose. I thought surely no font would ever support all this. I thought, they won’t be going beyond 16 bits until rendering technology catches up to let us use the 64,000 symbols they have already provided, which was already over 100 times bigger than fonts of the time were supporting. But before long, the slots 0..0xffff were used up and Unicode had to be expanded to 32 bits.
Personally, I don’t see the point of any great rush to support 32-bit Unicode. The new symbols will be rarely used. Consider what’s there. The only ones I would conceivably use are musical symbols and Mathematical Alphanumeric symbols (especially the German black letters so favoured in real analysis). The rest I can’t imagine ever using unless I took up a career in anthropology, i.e. Linear B syllabary (I have not a clue what it is), Linear B ideograms (looks like symbols for categorising cave petroglyphs), Aegean Numbers (counting with stones and sticks), Old Italic (looks like Phoenician), Gothic (medieval script), Ugaritic (cuneiform), Deseret (Mormon), Shavian (George Bernard Shaw’s phonetic script), Osmanya (Somalian), Cypriot syllabary, Byzantine music symbols (looks like Arabic), Musical Symbols, Tai Xuan Jing Symbols (truncated I-Ching), CJK (Chinese-Japanese-Korean) extensions and tags (letters with blank price tags).
I think 32-bit Unicode becomes a matter of the tail wagging the dog, spurred by the technical challenge rather than a practical necessity. In the process, ordinary 16-bit character handling is turned into a bleeding mess, for almost no benefit.
I think programmers should for the most part simply ignore 32-bit and continue using the String class as we always have, presuming every character is 16 bits.
Various ingenious and convoluted schemes have been invented to allow gradual and partial migration to 32-bit Unicode.
To allow 32-bit code points in 16-bit Java internal Strings, Sun encoded them using UTF-16, so that chars in the range 0..0xffff (exclusive of the reserved low and high surrogate bands) are encoded in 16 bits and the characters above 0xffff are encoded in 32 bits, as a pair of 16-bit surrogates.
To allow 32-bit code points in UTF-8 streams, UTF-8 was extended to handle them. If you want the details, see the IETF’s RFC 2279 (obsolete, but it has easy-to-understand bit diagrams) and RFC 3629, which explain the extended UTF-8 format.
You have to laugh at what a Rube Goldberg machine the process of 32-bit encoding and decoding becomes. In order for Java’s classes to encode an internal String in UTF-8, they must watch out for embedded 32-bit characters encoded with UTF-16, decode them and then encode them again with 32-bit extended UTF-8. To decode, Java must decode the extended UTF-8 to 32 bits internally, then re-encode to 16 bits.
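A small sketch of that round trip using only standard JDK calls, so you can watch the sizes change at each stage. RoundTripDemo, toUtf8 and fromUtf8 are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class RoundTripDemo {
    static byte[] toUtf8(String s) {
        // The encoder pairs up surrogates and emits one 4-byte sequence per pair.
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String fromUtf8(byte[] bytes) {
        // The decoder turns each 4-byte sequence back into a surrogate pair.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "A" + new String(Character.toChars(0x1d504));
        System.out.println(s.length());                      // 3 chars: 'A' plus two surrogates
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
        byte[] utf8 = toUtf8(s);
        System.out.println(utf8.length);                     // 5 bytes: 1 + 4
        System.out.println(fromUtf8(utf8).equals(s));        // true
    }
}
```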
Perhaps the implementation of String will change at some point in future so that Strings internally are all pure 8-bit, 16-bit or 32-bit characters, rather than containing a variable number of bytes per character as they do now.
Java does not have a 32-bit String literal, like C-style code points, e.g. \U0001d504. I wrote the SurrogatePair applet to convert C-style code points to arcane surrogate pairs to let you use 32-bit Unicode glyphs in your programs.
You need a tool to see what the codes in the file actually look like. Try:
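If no hex viewer is handy, a few lines of Java will do in a pinch. This is only a sketch, not one of the tools this page recommends; HexPeek and dump are hypothetical names, and the file to inspect is passed as the single command-line argument.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class for illustration only.
class HexPeek {
    /** Formats bytes as two-digit hex values, 16 per line. */
    static String dump(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(i % 16 == 0 ? '\n' : ' ');
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java HexPeek file");
            return;
        }
        // Reads the whole file into memory, so this suits small files only.
        System.out.println(dump(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```

A file that starts ef bb bf, or that shows pairs such as c3 a9 where you expected an accented letter, is almost certainly UTF-8.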
You also need a tool to validate the encoding:
This page is posted at http://mindprod.com/jgloss/utf.html