encoding : Java Glossary


Unfortunately, Oracle has effectively decommitted Applets. This means you can no longer run the various CMP programs in a browser. You must download them and install them. You must have the most recent Java JRE (Java Runtime Environment) 1.8.0_131 32-bit or 64-bit. It no longer matters which browser you use.

encoding

This page contains two signed Applets and one unsigned Applet. You must grant permission for the two signed Applets to run to view the page.

Normally, Readers translate from various 8-bit byte streams into standard 16-bit Unicode as they read. You can specify the sort of translation to use when you create the Reader. Similarly, Writers normally translate from internal 16-bit Unicode into various 8-bit byte streams.
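
Here is a minimal sketch of naming the encoding when you create a Reader or Writer. It works on in-memory streams so it needs no files. The class name ExplicitEncodingDemo and its helper methods are invented for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitEncodingDemo {
    /** Encode text to bytes through a Writer that uses the given charset. */
    static byte[] write(String text, Charset cs) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, cs)) {
            w.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decode bytes back to a String through a Reader that uses the given charset. */
    static String read(byte[] data, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), cs))) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "cafe" with an acute accent on the e
        byte[] utf8 = write(text, StandardCharsets.UTF_8);
        byte[] latin1 = write(text, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length);   // 5: the accented e takes two bytes in UTF-8
        System.out.println(latin1.length); // 4: it takes one byte in ISO-8859-1
        System.out.println(read(utf8, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

With real files you would wrap a FileInputStream or FileOutputStream the same way.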

However, encodings are more versatile than that. They also let you read and write big or little endian 16-bit Unicode character streams. In theory encodings could support complex encoding structures, translation, compression, or quoting. One letter may become many or vice versa. Letters may be suppressed.
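
To see the big and little endian 16-bit variants side by side, this small sketch dumps the bytes that the letter A becomes under three of the standard charsets. The class name ByteOrderDemo and the hex helper are invented for illustration.

```java
import java.nio.charset.StandardCharsets;

final class ByteOrderDemo {
    /** Render bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "A"; // U+0041
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 41 00
        // Plain UTF-16 writes big endian, preceded by the byte order mark FE FF.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
    }
}
```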

Encodings are usually trap doors: the translation works in one direction only. When you translate to 8-bit you lose information. When you translate back to Unicode, some characters will not come back the same as they were originally. Some may even be missing.
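
A quick way to watch the trap door close is to push a character through an encoding that has no slot for it. This sketch uses the euro sign, which ISO-8859-1 lacks. The class name LossyRoundTripDemo and the roundTrip helper are invented for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class LossyRoundTripDemo {
    /** Encode text to bytes and decode it again with the same charset. */
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String original = "5\u20ac"; // a digit followed by the euro sign U+20AC

        // ISO-8859-1 has no euro sign, so String.getBytes substitutes '?'.
        System.out.println(roundTrip(original, StandardCharsets.ISO_8859_1)); // 5?

        // UTF-8 can represent the euro sign, so nothing is lost.
        System.out.println(roundTrip(original, StandardCharsets.UTF_8).equals(original)); // true

        // You can ask in advance whether a charset can represent a character.
        System.out.println(StandardCharsets.ISO_8859_1.newEncoder().canEncode('\u20ac')); // false
    }
}
```

If silent substitution is not acceptable, a CharsetEncoder configured with CodingErrorAction.REPORT will throw instead of replacing.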

Encodings are not used in AWT (Advanced Windowing Toolkit) or Swing. You use pure 16-bit Unicode chars and Strings. How it displays depends on how clever the Font is at displaying Unicode. Normally it will display only some small subset of the characters properly. See FontShower to learn a bit about what your Font supports.

Java 1.8.0_131 supports 166 different encodings. If you are an English-speaking Windows developer, the ones you will use most often are UTF-8, windows-1252 (the default), ISO-8859-1, ASCII (American Standard Code for Information Interchange), UTF-16 and IBM437.

Don’t assume that all files are UTF-8. Your Java source code may be windows-1252 and your console may be IBM437. When you shift presumed encodings without translation, characters mysteriously change.
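
The mysterious changes are easy to reproduce. This sketch writes an accented letter as UTF-8, then reads the bytes back as windows-1252 without translation. It also shows a proper translation that goes through Unicode. The class name MojibakeDemo and its helpers are invented for illustration, and it assumes your JVM supports windows-1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    /** Decode bytes written in one encoding as if they were in another. */
    static String misread(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    /** Translate bytes from one encoding to another by going through Unicode. */
    static byte[] translate(byte[] data, Charset from, Charset to) {
        return new String(data, from).getBytes(to);
    }

    public static void main(String[] args) {
        Charset windows1252 = Charset.forName("windows-1252");
        String text = "\u00e9"; // e with acute accent

        // UTF-8 stores the letter as the two bytes C3 A9. Read as windows-1252,
        // each byte becomes a character of its own.
        String garbled = misread(text, StandardCharsets.UTF_8, windows1252);
        System.out.println(garbled.length());               // 2
        System.out.println(garbled.equals("\u00c3\u00a9")); // true

        // Translating properly: decode with the old encoding, encode with the new.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] cp1252 = translate(utf8, StandardCharsets.UTF_8, windows1252);
        System.out.println(cp1252.length); // 1, the single byte E9
    }
}
```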

menu
Possible Encodings	Reversibility
Encodings Supported in your Browser	Tracking
Determining the Default Encoding	Rant on Encoding Identification
Official Encoding Name Given Alias	Choosing An Encoding
Table of Possible Encodings	HEX
Why So Many Encodings?	Java Source Code Encoding
ISO	Default File and Console Encoding
Roll Your own	Learning More
Converting	Links
native2ascii

Possible Supported Encodings

The complete set of encodings supported anywhere/everywhere is not documented. However, starting with JDK (Java Development Kit) 1.4 there is a way to find out just which encodings are supported in your particular JVM (Java Virtual Machine) using java.nio.charset.Charset.availableCharsets().

The class is called java.nio.charset.Charset not java.nio.charset.CharSet. To aid in the confusion, CharSet is a class in the Apache commons classes.
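
A minimal sketch of that call follows. It prints each canonical name with its aliases. The count and the names vary from JVM to JVM, so no expected output is shown. The class name ListCharsets and the supported helper are invented for illustration.

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

final class ListCharsets {
    /** Canonical charset names mapped to Charset objects for this JVM. */
    static SortedMap<String, Charset> supported() {
        return Charset.availableCharsets();
    }

    public static void main(String[] args) {
        SortedMap<String, Charset> all = supported();
        System.out.println(all.size() + " encodings supported by this JVM");
        for (Charset cs : all.values()) {
            // name() is the canonical java.nio name; aliases() are the alternatives.
            System.out.println(cs.name() + " " + cs.aliases());
        }
    }
}
```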

There are five sources of information:

Oracle’s Java 1.8 documentation on nio encodings.

That lists them, but does not tell you much about them. Note that java.nio uses different canonical names from java.io and java.lang.
A place to look for supported character sets:
- in C:\Program Files\java\jre1.8.0_131\lib\charsets.jar in JRE 1.8.0_131 on your local Windows C: drive.
- in J:\Program Files\java\jdk1.8.0_131\jre\lib\charsets.jar in JDK 1.8.0_131 on your local Windows J: drive.
- in X:\Program Files (x86)\jet12.0-pro-x86\profile1.8.0_131\jre\lib\charsets.jar in Jet jet12.0-pro-x86/1.8.0_131 on your local Windows X: drive.
- in X:\Program Files\JetBrains\IntelliJ IDEA 2017.3\jre\jre\lib\charsets.jar in IntelliJ IDEA 2017.3 on your local Windows X: drive.
Another place to look for supported character sets:
- in C:\Program Files\java\jre1.8.0_131\lib\rt.jar in JRE 1.8.0_131 on your local Windows C: drive.
- in J:\Program Files\java\jdk1.8.0_131\jre\lib\rt.jar in JDK 1.8.0_131 on your local Windows J: drive.
- in X:\Program Files (x86)\jet12.0-pro-x86\profile1.8.0_131\jre\lib\rt.jar in Jet jet12.0-pro-x86/1.8.0_131 on your local Windows X: drive.
- in X:\Program Files\JetBrains\IntelliJ IDEA 2017.3\jre\jre\lib\rt.jar in IntelliJ IDEA 2017.3 on your local Windows X: drive.
The following Applet, which lists the encodings supported by your particular browser/Java.
The following Table, which lists the encodings supported on some Java, somewhere. I collect this lore manually.

In addition, you could consider compression and encryption as specialised types of encoding.

Supported Encodings in this Browser

List of encodings supported in this browser and this Java. Source available.

The key to this Applet is java.nio.charset.Charset.availableCharsets().

Java Requirements and Troubleshooting

Encodings is a signed Java Applet (that can also be run as an application) to Encodings. You are welcome to install it on your own website. If it does not work…

For this Applet hybrid to work, you must click grant/accept/always run on this site/I accept the risk to give it permission to discover the default encoding via the file.encoding restricted system property. If you refuse to grant permission, the program may crash with an inscrutable stack dump on the console complaining about AccessController.checkPermission.
In the Java Control Panel security tab, click Start ⇒ Control Panel ⇒ Programs ⇒ Java ⇒ Security, configure medium security to allow self-signed and vanilla unsigned applets to run. If medium is not available, or if Java security is blocking you from running the program, configure high security and add http://mindprod.com to the Exception Site List at the bottom of the security tab.
Often problems can be fixed simply by clicking the reload button on your browser.
Make sure you have both JavaScript and Java enabled in your browser.
Make sure the Java in your browser is enabled in the security tab of the Java Control panel. Click Start ⇒ Control Panel ⇒ Programs ⇒ Java ⇒ Security ⇒ Enable Java Content in the browser.
This signed Java Applet (that can also be run as an application) needs 32-bit or 64-bit Java 1.8 or later. For best results use the latest 1.8.0_131 Java.
You also need a recent browser.
It works under any operating system that supports Java e.g. W2K, XP, W2003, Vista, W2008, W7-32, W7-64, W8-32, W8-64, W2012, W10-32, W10-64, Linux, LinuxARM, LinuxX86, LinuxX64, Ubuntu, Solaris, SolarisSPARC, SolarisSPARC64, SolarisX86, SolarisX64 and OSX
You should see the Applet hybrid above looking much like this screenshot. If you don’t, the following hints should help you get it working:
Optionally, you may permanently install the Canadian Mind Products code-signing certificate so you don’t have to grant each time.
If the above Applet hybrid appears to freeze-up, click Alt-Esc repeatedly to check for any buried permission dialog box.
If you have certificate troubles, check the installed certificates and remove or update any obsolete or suspected defective certificates. The only certificate used by this program is mindprodcert2017rsa.cer.
Especially if this Applet hybrid has worked before, try clearing the browser cache and rebooting.
To ensure your Java is up to date, check with Wassup. First, download it and run it as an application independent of your browser, then run it online as an Applet to add the complication of your browser.
If the above Applet hybrid does not work, check the Java console for error messages.
If the above Applet hybrid does not work, you might have better luck with the downloadable version available below.
If you are using Mac OS X and would like an improved Look and Feel, download the QuaQua look & feel from randelshofer.ch/quaqua. UnZip the contained quaqua.jar and install it in ~/Library/Java/Extensions or one of the other ext dirs.
Upgrade to the latest version of Internet Explorer or another browser.
Click the Information bar, and then click Allow blocked content. Unfortunately, this also allows dangerous ActiveX code to run. However, you must do this in order to get access to perfectly-safe Java Applets running in a sandbox. This is part of Microsoft’s war on Java.
Try upgrading to a more recent version of your browser, or try a different browser e.g. Firefox, SeaMonkey, IE or Avant.
If you still can’t get the program working click the red HELP button below for more detail.
If you can’t get the above Applet hybrid working after trying the advice above and from the red HELP button below, have bugs to report or ideas to improve the program or its documentation, please send me an email at.

How to Determine the Default Encoding in Java
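
Here is a minimal sketch of three ways to ask, assuming Java 1.5 or later for Charset.defaultCharset(). The class name DefaultEncodingDemo and its helpers are invented for illustration. The three answers may be spelled differently, because InputStreamReader.getEncoding() reports the older java.io style name where one exists.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

final class DefaultEncodingDemo {
    /** Canonical java.nio name of the JVM default encoding. */
    static String defaultName() {
        return Charset.defaultCharset().name();
    }

    /** Name reported by a Reader created without an explicit encoding. */
    static String readerDefault() {
        return new InputStreamReader(new ByteArrayInputStream(new byte[0])).getEncoding();
    }

    public static void main(String[] args) {
        System.out.println(defaultName());
        System.out.println(readerDefault());
        // file.encoding is a restricted system property. Reading it can fail
        // with a security exception in a sandboxed environment.
        System.out.println(System.getProperty("file.encoding"));
    }
}
```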

Finding Official Encoding Name Given an Alias
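
One way to do this, sketched on the assumption that you are on Java 1.4 or later, is to hand the alias to Charset.forName and ask the result for its name. The class name OfficialNameDemo and the officialName helper are invented for illustration. The sample aliases are taken from the table on this page.

```java
import java.nio.charset.Charset;

final class OfficialNameDemo {
    /**
     * Look up the canonical java.nio name for an alias.
     * Throws UnsupportedCharsetException if this JVM does not know the alias,
     * or IllegalCharsetNameException if the alias is not a legal charset name.
     */
    static String officialName(String alias) {
        return Charset.forName(alias).name();
    }

    public static void main(String[] args) {
        String[] aliases = {"8859_1", "ASCII", "UTF8", "Cp1252"};
        for (String alias : aliases) {
            System.out.println(alias + " -> " + officialName(alias));
        }
    }
}
```

Going the other way, Charset.aliases() gives the set of aliases for an official name.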

Java Requirements and Troubleshooting

OfficialEncoding is a Java Applet (that can also be run as an application) to Official Encoding. You are welcome to install it on your own website. If it does not work…

If Copy/Paste (Ctrl-C/Ctrl-V) do not work, you can turn them back on by modifying your java.policy file. This is not for the novice or faint of heart. Your alternative is to download this program and run it without a browser.

Table of Possible Supported Encodings

Here are some encodings typically supported. You often see names with dash, underscore and space variations, e.g. ISO (International Standards Organisation) 8859-1, ISO8859_1 and ISO-8859-1. The encodings you will encounter most often are: ISO-8859-1 (Latin-1), UTF-8 and windows-1252. These are the latest fashion in naming.

Java Encodings
Common Encoding name	Supported?	Official Java Name	Description
8859_1		ISO-8859-1	Latin-1 ASCII (the USA default). This just takes the low order 8 bits and tacks on a high order 0 byte. Same as ISO-8859-1. Microsoft’s variant of Latin-1 is called Cp1252. UTF-8 and ISO-8859-1 encode 7 bit characters identically, 0x00…0x7f, but after that are quite different.
ASCII		US-ASCII	7-bit ASCII, plus forms like \uxxxx for the exotic characters.
base64			base64 source code is available. armouring
base64u			base64u source code is available. A variant of Base64 also URL-encoded. armouring
base85
Big5		Big5	Big5, Traditional Chinese
Big5-HKSCS		Big5-HKSCS	Big5 with Hong Kong extensions, Traditional Chinese
Big5-Solaris			Not supported in Windows. Big5 with seven additional Hanzi ideograph character mappings for the Solaris zh_TW.BIG5 locale
CESU-8		CESU-8	Added in JDK 1.8.0. A modified UTF-8.
Cp037		IBM037	USA, Canada (Bilingual, French), Netherlands, Portugal, Brazil, Australia, EBCDIC, aka Cp1140
Cp038			International EBCDIC, aka IBM038
Cp273		IBM273	IBM (International Business Machines) Austria, Germany, aka Cp1141
Cp277		IBM277	IBM Denmark, Norway, EBCDIC, aka Cp1142
Cp278		IBM278	IBM Finland, Sweden, EBCDIC, aka Cp1143
Cp280		IBM280	IBM Italy, EBCDIC, aka Cp1144
Cp284		IBM284	IBM Catalan/Spain, Spanish Latin America, EBCDIC, aka Cp1145
Cp285		IBM285	IBM United Kingdom, Ireland, EBCDIC, aka Cp1146
Cp297		IBM297	IBM France, EBCDIC, aka Cp1147
Cp420		IBM420	IBM Arabic, EBCDIC, aka IBM420
Cp424		IBM424	IBM Hebrew, EBCDIC
Cp437		IBM437	Original IBM PC (Personal Computer) OEM (Original Equipment Manufacturer) DOS (Disk Operating System) character set (with line drawing characters and some Greek and math), MS-DOS United States, Australia, New Zealand, South Africa. The rest of the world uses Cp850 for the DOS box.
Cp500		IBM500	IBM Belgium and Switzerland, EBCDIC, 500V1, aka Cp1148
Cp737		x-IBM737	PC Greek
Cp775		IBM775	PC Baltic
Cp838		IBM-Thai	IBM Thailand extended SBCS, aka IBM838
Cp850		IBM850	Microsoft DOS Multilingual Latin-1 (with line drawing characters). For true Latin-1 see ISO-8859-1. See Cp437.
Cp852		IBM852	Microsoft DOS Multilingual Latin-2 Slavic
Cp855		IBM855	IBM Cyrillic
Cp857		IBM857	IBM Turkish
Cp858		IBM00858	variant of Cp850 with the Euro. Microsoft DOS Multilingual Latin-1 (with line drawing characters). For true Latin-1 see ISO-8859-1.
Cp860		IBM860	MS-DOS Portuguese
Cp861		IBM861	MS-DOS Icelandic
Cp862		IBM862	PC Hebrew
Cp863		IBM863	MS-DOS Canadian French
Cp864		IBM864	PC Arabic
Cp865		IBM865	MS-DOS Nordic
Cp866		IBM866	MS-DOS Russian
Cp868		IBM868	MS-DOS Pakistan
Cp869		IBM869	IBM Modern Greek
Cp870		IBM870	IBM Multilingual Latin-2, EBCDIC
Cp871		IBM871	IBM Iceland, EBCDIC, aka Cp1149
Cp874		x-IBM874	IBM Thai
Cp875		x-IBM875	IBM Greek
Cp918		IBM918	IBM Pakistan(Urdu), EBCDIC
Cp921		x-IBM921	IBM Latvia, Lithuania (AIX (Advanced Interactive eXecutive), DOS).
Cp922		x-IBM922	IBM Estonia (AIX, DOS).
Cp930		x-IBM930	Japanese Katakana-Kanji mixed with 4370 UDC, superset of 5026
Cp933		x-IBM933	Korean Mixed with 1880 UDC, superset of 5029
Cp935		x-IBM935	Simplified Chinese Host mixed with 1880 UDC, superset of 5031
Cp937		x-IBM937	Traditional Chinese Host mixed with 6204 UDC, superset of 5033
Cp939		x-IBM939	Japanese Latin Kanji mixed with 4370 UDC, superset of 5035
Cp942		x-IBM942	Japanese (OS/2) superset of 932
Cp942C		x-IBM942C	variant of Cp942. Japanese (OS/2) superset of Cp932
Cp943		x-IBM943	Japanese (OS/2) superset of Cp932 and Shift-JIS.
Cp943C		x-IBM943C	Variant of Cp943. Japanese (OS/2) superset of Cp932 and Shift-JIS.
Cp948		x-IBM948	OS/2 Chinese (Taiwan) superset of 938
Cp949		x-IBM949	PC Korean
Cp949C		x-IBM949C	variant of Cp949, PC Korean
Cp950		x-IBM950	PC Chinese (Hong Kong, Taiwan)
Cp964		x-IBM964	AIX Chinese (Taiwan)
Cp970		x-IBM970	AIX Korean
Cp1006		x-IBM1006	IBM AIX Pakistan (Urdu).
Cp1025		x-IBM1025	IBM Multilingual Cyrillic: Bulgaria, Bosnia, Herzegovina, Macedonia (FYR).
Cp1026		IBM1026	IBM Latin-5, Turkey
Cp1046		x-IBM1046	IBM Open Edition US EBCDIC
Cp1047		IBM1047	IBM System 390 EBCDIC, Java version 1.2 or later only.
Cp1048			IBM EBCDIC. aka IBM1048.
Cp1097		x-IBM1097	IBM Iran(Farsi)/Persian
Cp1098		x-IBM1098	IBM Iran(Farsi)/Persian (PC)
Cp1112		x-IBM1112	IBM Latvia, Lithuania
Cp1122		x-IBM1122	IBM Estonia
Cp1123		x-IBM1123	IBM Ukraine
Cp1124		x-IBM1124	IBM AIX Ukraine
Cp1140		IBM01140	USA, Canada (Bilingual, French), Netherlands, Portugal, Brazil, Australia, aka Cp037.
Cp1141		IBM01141	IBM Austria, Germany, aka Cp273.
Cp1142		IBM01142	IBM Denmark, Norway, aka Cp277.
Cp1143		IBM01143	IBM Finland, Sweden, aka Cp278.
Cp1144		IBM01144	IBM Italy, aka Cp280.
Cp1145		IBM01145	IBM Catalan/Spain, Spanish Latin America, aka Cp284.
Cp1146		IBM01146	IBM United Kingdom, Ireland, aka Cp285.
Cp1147		IBM01147	IBM France, aka Cp297.
Cp1148		IBM01148	EBCDIC 500V1.
Cp1149		IBM01149	IBM Iceland.
Cp1250		windows-1250	Windows Eastern European
Cp1251		windows-1251	Windows Cyrillic (Russian)
Cp1252		windows-1252	Microsoft Windows variant of Latin-1, NT default. Beware. Some unexpected translations occur when you read with this default encoding, e.g. codes 128..159 are translated to 16-bit chars with bits in the high order byte on. It does not just truncate the high byte on write and pad with 0 on read. For true Latin-1 see ISO-8859-1.
Cp1253		windows-1253	Windows Greek
Cp1254		windows-1254	Windows Turkish
Cp1255		windows-1255	Windows Hebrew
Cp1256		windows-1256	Windows Arabic
Cp1257		windows-1257	Windows Baltic
Cp1258		windows-1258	Windows Vietnamese
Cp1381		x-IBM1381	IBM OS/2, DOS People’s Republic of China (PRC)
Cp1383		x-IBM1383	IBM AIX People’s Republic of China (PRC)
Cp33722		x-IBM33722	IBM-eucJP — Japanese (superset of 5050)
Default		US-ASCII	7-bit ASCII (not the actual default!). Strips off the high order bit 7 and tacks on a high order 0 byte. The actual default is controlled in W95, W98, Me, NT, W2K, XP, W2003, Vista, W2008, W7-32, W7-64, W8-32, W8-64, W2012, W10-32 and W10-64 in the Control Panel national settings.
EBCDIC			Not directly supported. EBCDIC comes in dozens of variants, most of which do not have Java support. Check out Cp037, Cp038, Cp278, Cp280, Cp284, Cp285, Cp297, Cp424, Cp500, Cp871, Cp918, Cp1046, Cp1047, Cp1048, Cp1148.
Filode		n/a	Used to encode filenames with fancy characters in them to make them usable on systems with ASCII-only filenames.
EUC-JP		EUC-JP
EUC-KR		EUC-KR
Gb18030		Gb18030	Simplified Chinese, PRC standard
Gb2312		Gb2312	Chinese. Popular in email.
GBK		GBK	GBK, Simplified Chinese
gzip		gzip	compressed, often used in HTML (Hypertext Markup Language) sent from a website.
IBM-Thai		IBM-Thai
IBM00858		IBM00858	variant of Cp850 with the Euro. Microsoft DOS Multilingual Latin-1 (with line drawing characters). For true Latin-1 see ISO-8859-1.
IBM01140		IBM01140	USA, Canada (Bilingual, French), Netherlands, Portugal, Brazil, Australia, aka Cp037.
IBM01141		IBM01141	IBM Austria, Germany, aka Cp273.
IBM01142		IBM01142	IBM Denmark, Norway, aka Cp277.
IBM01143		IBM01143	IBM Finland, Sweden, aka Cp278.
IBM01144		IBM01144	IBM Italy, aka Cp280.
IBM01145		IBM01145	IBM Catalan/Spain, Spanish Latin America, aka Cp284.
IBM01146		IBM01146	IBM United Kingdom, Ireland, aka Cp285.
IBM01147		IBM01147	IBM France, aka Cp297.
IBM01148		IBM01148	EBCDIC 500V1.
IBM01149		IBM01149	IBM Iceland.
IBM037		IBM037	USA, Canada (Bilingual, French), Netherlands, Portugal, Brazil, Australia, EBCDIC, aka Cp1140
IBM1026		IBM1026	IBM Latin-5, Turkey
IBM1047		IBM1047	IBM System 390 EBCDIC, Java version 1.2 or later only.
IBM273		IBM273	IBM Austria, Germany, aka Cp1141
IBM277		IBM277	IBM Denmark, Norway, EBCDIC, aka Cp1142
IBM278		IBM278	IBM Finland, Sweden, EBCDIC, aka Cp1143
IBM280		IBM280	IBM Italy, EBCDIC, aka Cp1144
IBM284		IBM284	IBM Catalan/Spain, Spanish Latin America, EBCDIC, aka Cp1145
IBM285		IBM285	IBM United Kingdom, Ireland, EBCDIC, aka Cp1146
IBM290		IBM290	alias for EBCDIC-JP-KANA. New with JDK 1.7.0-51
IBM297		IBM297	IBM France, EBCDIC, aka Cp1147
IBM420		IBM420	IBM Arabic, EBCDIC, aka Cp420
IBM424		IBM424	IBM Hebrew, EBCDIC
IBM437		IBM437	Original IBM PC OEM DOS character set (with line drawing characters and some Greek and math), MS-DOS United States, Australia, New Zealand, South Africa. The rest of the world uses Cp850 for the DOS box.
IBM500		IBM500	IBM Belgium and Switzerland, EBCDIC, 500V1, aka Cp1148
IBM775		IBM775	PC Baltic
IBM850		IBM850	Microsoft DOS Multilingual Latin-1 (with line drawing characters). For true Latin-1 see ISO-8859-1. See Cp437.
IBM852		IBM852	Microsoft DOS Multilingual Latin-2 Slavic
IBM855		IBM855	IBM Cyrillic
IBM857		IBM857	IBM Turkish
IBM860		IBM860	MS-DOS Portuguese
IBM861		IBM861	MS-DOS Icelandic
IBM862		IBM862	PC Hebrew
IBM863		IBM863	MS-DOS Canadian French
IBM864		IBM864	PC Arabic
IBM865		IBM865	MS-DOS Nordic
IBM866		IBM866	MS-DOS Russian
IBM868		IBM868	MS-DOS Pakistan
IBM869		IBM869	IBM Modern Greek
IBM870		IBM870	IBM Multilingual Latin-2, EBCDIC
IBM871		IBM871	IBM Iceland, EBCDIC, aka Cp1149
IBM918		IBM918	IBM Pakistan(Urdu), EBCDIC
IBMOEM			Cp437
ISO-2022-CN		ISO-2022-CN	ISO 2022 CN, Chinese
ISO-2022-CN-CNS		x-ISO-2022-CN-CNS	CNS 11643 in ISO-2022-CN form, T. Chinese
ISO-2022-CN-GB		x-ISO-2022-CN-GB	GB 2312 in ISO-2022-CN form, S. Chinese
ISO-2022-JP		ISO-2022-JP	JIS0201, 0208, 0212, ISO-2022 Encoding, Japanese
ISO-2022-JP-2		ISO-2022-JP-2
ISO-2022-KR		ISO-2022-KR	ISO 2022 KR, Korean
ISO-8859-1		ISO-8859-1	ISO 8859-1, same as 8859_1, USA, Europe, Latin America, Caribbean, Canada, Africa, Latin-1, (Danish, Dutch, English, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish). Beware, for NT, the default is Cp1252 a variant of Latin-1, controlled by the control panel regional settings. UTF-8 and ISO-8859-1 encode 7 bit characters identically, 0x00…0x7f, but after that are quite different.
ISO-8859-2		ISO-8859-2	ISO 8859-2, Eastern Europe, Latin-2, (Albanian, Czech, English, German, Hungarian, Polish, Rumanian, (Serbo-)Croatian, Slovak, Slovene and Swedish)
ISO-8859-3		ISO-8859-3	ISO 8859-3, SE Europe/miscellaneous, Latin-3 (Afrikaans, Catalan, English, Esperanto, French, Galician, German, Italian, Maltese and Turkish)
ISO-8859-4		ISO-8859-4	ISO 8859-4, Scandinavia/Baltic, Latin-4, (Danish, English, Estonian, Finnish, German, Greenlandic, Lappish, Latvian, Lithuanian, Norwegian and Swedish)
ISO-8859-5		ISO-8859-5	ISO 8859-5, Cyrillic, (Bulgarian, Bielorussian, English, Macedonian, Russian, Serb(o-Croat)ian and Ukrainian)
ISO-8859-6		ISO-8859-6	ISO 8859-6, Arabic ASMO 449
ISO-8859-7		ISO-8859-7	ISO 8859-7, Greek ELOT-928
ISO-8859-8		ISO-8859-8	ISO 8859-8, Hebrew
ISO-8859-9		ISO-8859-9	ISO 8859-9, Turkish Latin-5, (English, Finnish, French, German, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish and Turkish)
ISO-8859-10			ISO 8859-10, Lappish/Nordic/Eskimo languages, Latin-6. (Danish, English, Estonian, Faeroese, Finnish, German, Greenlandic, Icelandic, Lappish, Latvian, Lithuanian, Norwegian and Swedish)
ISO-8859-11		x-iso-8859-11	ISO 8859-11, Thai.
ISO-8859-12			ISO 8859-12, Devanagari.
ISO-8859-13		ISO-8859-13	ISO 8859-13, Baltic Rim, Latin-7.
ISO-8859-14			ISO 8859-14, Celtic, Latin-8.
ISO-8859-15		ISO-8859-15	ISO 8859-15, Euro, including Euro currency sign, aka Latin9, not Latin-15 as you would expect. Like Latin-1 with 8 replacements.
JIS		ISO-2022-JP	Japanese
JIS0201		JIS_X0201	JIS 0201, Japanese
JIS0212		JIS_X0212-1990	JIS 0212, Japanese
JISAutoDetect		x-JISAutoDetect	Detects and converts from Shift-JIS, EUC-JP, ISO-2022-JP (conversion to Unicode only)
JIS_X0201		JIS_X0201	Japanese
JIS_X0212-1990		JIS_X0212-1990	Japanese
KOI8		KOI8	Added in JDK 1.8.0
KOI8-R		KOI8-R	KOI8-R, Russian
KOI8-U		KOI8-U
ks_c_5601-1987		EUC-KR	Korean standard often used in emails. See KSC5601.
KSC5601		EUC-KR	Korean
Latin-1			see ISO-8859-1 and Cp1252.
Latin-2			see ISO-8859-2.
Latin-3			see ISO-8859-3.
Latin-4			see ISO-8859-4.
Latin Extended-A			MSWord
Latin Extended-B			MSWord
LocaleDefault			Mad as it sounds, the only way to get this is to look up the Locale default yourself and pass it explicitly, or use a variant method that does not specify the encoding. Default won’t do it! In my opinion, all methods that use a LocaleDefault without an encoding parameter should be deprecated. You can also find out the encoding used on an InputStreamReader with InputStreamReader.getEncoding(). It will pick up the default, or the explicit encoding specified.
MacArabic		x-MacArabic	Macintosh Arabic
MacCentralEurope		x-MacCentralEurope	Macintosh Latin-2
MacCroatian		x-MacCroatian	Macintosh Croatian
MacCyrillic		x-MacCyrillic	Macintosh Cyrillic (Russian)
MacDingbat		x-MacDingbat	Macintosh Dingbat
MacGreek		x-MacGreek	Macintosh Greek
MacHebrew		x-MacHebrew	Macintosh Hebrew
MacIceland		x-MacIceland	Macintosh Iceland
MacRoman		x-MacRoman	Macintosh Roman, default encoding for Mac OS (Operating System). Note it is not MacroMan.
MacRomania		x-MacRomania	Macintosh Romania
MacSymbol		x-MacSymbol	Macintosh Symbol
MacThai		x-MacThai	Macintosh Thai
MacTurkish		x-MacTurkish	Macintosh Turkish
MacUkraine		x-MacUkraine	Macintosh Ukraine
Ms874		x-windows-874	Windows Thai
Ms932		windows-31j	Windows Japanese. Microsoft JIS.
SingleByte			This does not expand low order eight-bits with high order zero as its name implies. It looks to be a complex encoding for some Asian language.
Shift_JIS		Shift_JIS	Shift JIS. Japanese. A Microsoft code that extends csHalfWidthKatakana to include kanji by adding a second byte when the value of the first byte is in the ranges 81-9F or E0-EF.
TIS-620		TIS-620	TIS620, Thai
Transporter			Transporter source code is available. A variant of Base64u also URL-encoded. It also optionally handles serialization/reconstituting, compression/decompression, signing/verifying and heavy duty encryption/decryption. armouring
truncation			chop high byte, or 0-pad high byte. ISO-8859-1
UCS-2			Use UTF-16.
Unicode		UTF-16	use UTF-16BE instead. Big endian, must be marked.
Unicode-8			see UTF-8.
Unicode-16			see UTF-16.
UnicodeBig		UTF-16	use UTF-16BE instead. 16-bit UCS-2 Transformation Format, big endian byte order identified by an optional byte-order mark; FE FF . On read, defaults to big-endian. On write puts out a big-endian marker. Same as Unicode.
UnicodeBigUnmarked		UTF-16BE	16-bit UCS-2 Transformation Format, big endian byte order, definitely without Byte Order Mark. Not written on write, ignored on read. Same as UTF-16BE.
UnicodeLittle		x-UTF-16LE-BOM	Use UTF-16LE instead. 16-bit UCS-2 Transformation Format, little endian byte order identified by an optional byte-order mark; FF FE. On read, defaults to little-endian. On write puts out a little-endian marker.
UnicodeLittleUnmarked		UTF-16LE	16-bit UCS-2 Transformation Format, little endian byte order, definitely without Byte Order Mark. Not written on write, ignored on read.
URL (Uniform Resource Locator)			For x-www-form-urlencoded use java.net.URLEncoder.encode and java.net.URLDecoder.decode instead. Used to encode CGI command lines. It encodes space as + and special characters as %xx hex. Don’t confuse it with BASE64 or BASE64u. Neither java.net.URLEncoder.encode nor java.net.URLDecoder.decode is for encoding/decoding URLs (Uniform Resource Locators). They are for encoding/decoding application/x-www-form-urlencoded form data.
US-ASCII		US-ASCII	7-bit American Standard Code for Information Interchange.
Uuencode			Similar to base64.
UTF-7			7-bit encoded Unicode.
UTF-8		UTF-8	8-bit encoded Unicode. née UTF8. Optional marker on front of file: EF BB BF for reading. Unfortunately, OutputStreamWriter does not automatically insert the marker on writing. Notepad can’t read the file without this marker. Now the question is, how do you get that marker in there? You can’t just emit the bytes EF BB BF since they will be encoded and changed. However, the solution is quite simple. prw.write( '\ufeff' ); at the head of the file. This will be encoded as EF BB BF. RFC 3629 officially describes the UTF-8 format. DataOutputStreams have a binary length count in front of each string. Endianness does not apply to 8-bit encodings. Java DataOutputStream and ObjectOutputStream use a slight variant of kosher UTF-8. To aid with compatibility with C in JNI (Java Native Interface), the null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls. Only the 1-byte, 2-byte and 3-byte formats are used. Supplementary characters, (above 0xffff), are represented in the form of surrogate pairs (a pair of encoded 16-bit characters in a special range), rather than directly encoding the character. UTF-8 and ISO-8859-1 encode 7 bit characters identically, 0x00…0x7f, but after that are quite different.
UTF-16		UTF-16	Same as Unicode. Default big endian, optionally marked. UTF-16 is officially defined in Annex Q of ISO/IEC 10646-1. (Copies of ISO standards are quite expensive.) It is also described in the Unicode Consortium’s Unicode Standard, as well as in the IETF (Internet Engineering Task Force)’s RFC 2781. To put the byte order mark in at the head of the file use prw.write( '\ufeff' ); This will be encoded as FE FF .
UTF-16BE		UTF-16BE	16-bit UCS-2 Transformation Format, big endian byte order. No byte-order mark is written on write. If you definitely have a BOM (Byte Order Mark), use x-UTF-16BE-BOM.
UTF-16LE		UTF-16LE	16-bit UCS-2 Transformation Format, little endian byte order. No byte-order mark is written on write. If you definitely have a BOM, use x-UTF-16LE-BOM.
UTF-32		UTF-32	32-bit UCS-4 Transformation Format, byte order identified by an optional byte-order mark: FF FE 00 00 for little endian, 00 00 FE FF for big endian.
UTF-32BE		UTF-32BE	32-bit UCS-4 Transformation Format, big-endian byte order. If you definitely have a BOM, use X-UTF-32BE-BOM.
UTF-32LE		UTF-32LE	32-bit UCS-4 Transformation Format, little-endian byte order. If you definitely have a BOM, use X-UTF-32LE-BOM.
windows-1250		windows-1250	Windows Eastern European
windows-1251		windows-1251	Windows Cyrillic (Russian)
windows-1252		windows-1252	Microsoft Windows variant of Latin-1, NT default. Beware. Some unexpected translations occur when you read with this default encoding, e.g. codes 128..159 are translated to 16-bit chars with bits in the high order byte on. It does not just truncate the high byte on write and pad with 0 on read. For true Latin-1 see ISO-8859-1.
windows-1253		windows-1253	Windows Greek
windows-1254		windows-1254	Windows Turkish
windows-1255		windows-1255	Windows Hebrew
windows-1256		windows-1256	Windows Arabic
windows-1257		windows-1257	Windows Baltic
windows-1258		windows-1258	Windows Vietnamese
windows-31j		windows-31j	Windows 31j
x-Big5-Solaris		x-Big5-Solaris
x-EUC-CN		Gb2312	Gb2312, EUC encoding, Simplified Chinese
x-EUC-JP		EUC-JP	JIS0201, 0208, 0212, EUC Encoding, Japanese
x-euc-jp-linux		x-euc-jp-linux	JISX0201, 0208, EUC Encoding, Japanese for Linux
x-EUC-KR			KS C 5601, EUC Encoding, Korean
x-EUC-TW		x-EUC-TW	CNS11643 (Plane 1-3), T. Chinese, EUC encoding
x-eucJP-Open		x-eucJP-Open
x-IBM1006		x-IBM1006	IBM AIX Pakistan (Urdu).
x-IBM1025		x-IBM1025	IBM Multilingual Cyrillic: Bulgaria, Bosnia, Herzegovina, Macedonia (FYR).
x-IBM1046		x-IBM1046	IBM Arabic - Windows
x-IBM1097		x-IBM1097	IBM Iran(Farsi)/Persian
x-IBM1098		x-IBM1098	IBM Iran(Farsi)/Persian (PC)
x-IBM1112		x-IBM1112	IBM Latvia, Lithuania
x-IBM1122		x-IBM1122	IBM Estonia
x-IBM1123		x-IBM1123	IBM Ukraine
x-IBM1124		x-IBM1124	IBM AIX Ukraine
x-IBM1381		x-IBM1381	IBM OS/2, DOS People’s Republic of China (PRC)
x-IBM1383		x-IBM1383	IBM AIX People’s Republic of China (PRC)
x-IBM300		x-IBM300	alias CP-300. New with JDK 1.7.0-51
x-IBM33722		x-IBM33722	IBM-eucJP — Japanese (superset of 5050)
x-IBM737		x-IBM737	PC Greek
x-IBM833		x-IBM833
x-IBM834		x-IBM834
x-IBM856		x-IBM856
x-IBM874		x-IBM874	IBM Thai
x-IBM875		x-IBM875	IBM Greek
x-IBM921		x-IBM921	IBM Latvia, Lithuania (AIX, DOS).
x-IBM922		x-IBM922	IBM Estonia (AIX, DOS).
x-IBM930		x-IBM930	Japanese Katakana-Kanji mixed with 4370 UDC, superset of 5026
x-IBM933		x-IBM933	Korean Mixed with 1880 UDC, superset of 5029
x-IBM935		x-IBM935	Simplified Chinese Host mixed with 1880 UDC, superset of 5031
x-IBM937		x-IBM937	Traditional Chinese Host mixed with 6204 UDC, superset of 5033
x-IBM939		x-IBM939	Japanese Latin Kanji mixed with 4370 UDC, superset of 5035
x-IBM942		x-IBM942	Japanese (OS/2) superset of 932
x-IBM942C		x-IBM942C	variant of Cp942. Japanese (OS/2) superset of Cp932
x-IBM943		x-IBM943	Japanese (OS/2) superset of Cp932 and Shift-JIS.
x-IBM943C		x-IBM943C	Variant of Cp943. Japanese (OS/2) superset of Cp932 and Shift-JIS.
x-IBM948		x-IBM948	OS/2 Chinese (Taiwan) superset of 938
x-IBM949		x-IBM949	PC Korean
x-IBM949C		x-IBM949C	variant of Cp949, PC Korean
x-IBM950		x-IBM950	PC Chinese (Hong Kong, Taiwan)
x-IBM964		x-IBM964	AIX Chinese (Taiwan)
x-IBM970		x-IBM970	AIX Korean
x-ISCII91		x-ISCII91	ISCII91 encoding of Indic scripts
x-ISO-2022-CN-CNS		x-ISO-2022-CN-CNS	CNS 11643 in ISO-2022-CN form, T. Chinese
x-ISO-2022-CN-GB		x-ISO-2022-CN-GB	GB 2312 in ISO-2022-CN form, S. Chinese
x-iso-8859-11		x-iso-8859-11	ISO 8859-11, Thai.
x-JIS0208		x-JIS0208	JIS 0208, Japanese
x-JISAutoDetect		x-JISAutoDetect	Detects and converts from Shift-JIS, EUC-JP, ISO-2022-JP (conversion to Unicode only)
x-Johab		x-Johab	Johab, Korean
x-Ms950-HKSCS		x-Ms950-HKSCS	Windows Traditional Chinese with Hong Kong extensions
x-MacArabic		x-MacArabic	Macintosh Arabic
x-MacCentralEurope		x-MacCentralEurope	Macintosh Latin-2
x-MacCroatian		x-MacCroatian	Macintosh Croatian
x-MacCyrillic		x-MacCyrillic	Macintosh Cyrillic (Russian)
x-MacDingbat		x-MacDingbat	Macintosh Dingbat
x-MacGreek		x-MacGreek	Macintosh Greek
x-MacHebrew		x-MacHebrew	Macintosh Hebrew
x-MacIceland		x-MacIceland	Macintosh Iceland
x-MacRoman		x-MacRoman	Macintosh Roman
x-MacRomania		x-MacRomania	Macintosh Romania
x-MacSymbol		x-MacSymbol	Macintosh Symbol
x-MacThai		x-MacThai	Macintosh Thai
x-MacTurkish		x-MacTurkish	Macintosh Turkish
x-MacUkraine		x-MacUkraine	Macintosh Ukraine
x-mswin-936		x-mswin-936	Windows Simplified Chinese PRC
x-PCK		x-PCK
SJIS-0213		SJIS-0213
x-UTF-16BE-BOM			Unicode 16-bit big ended, with a BOM definitely present.
x-UTF-16LE-BOM		x-UTF-16LE-BOM	Unicode 16-bit little ended, with a BOM definitely present.
X-UTF-32BE-BOM		X-UTF-32BE-BOM	Unicode 32-bit big ended, with a BOM definitely present.
X-UTF-32LE-BOM		X-UTF-32LE-BOM	Unicode 32-bit little ended, with a BOM definitely present.
x-windows-50220		x-windows-50220	Japanese Hiragana
x-windows-50221		x-windows-50221	Multilingual, Russian, Japanese, Greek
x-windows-874		x-windows-874	Windows Thai
x-windows-949		x-windows-949	Windows Korean
x-windows-950		x-windows-950	Windows Traditional Chinese

Where two fonts are shown separated by a /, the second one is the new version including the euro symbol. Adam Dingle did the research on how these encodings work.

Many new encodings were added in Java 1.4.1 and some were dropped. This list contains even the dropped items. Before you use an encoding, make sure it is supported by your version of Java.

Note that what Java and the HTML 4.0 specification call a character encoding is actually called a character set at IANA (Internet Assigned Numbers Authority) and in the HTTP (Hypertext Transfer Protocol) proposed standard.

I would like to do some experiments to find out for sure what happens with BOMs (Byte Order Marks) in various encodings. I discovered that native2ascii would not work with a BOM until I used x-UTF-16LE-BOM encoding.
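One way to run such an experiment is to encode the same short text with several encodings and look at the bytes that come out. The sketch below is only an illustration: the class name BomExperiment and the helper hex are made up for this page, and the results you see depend on the Java version you run it on.

```java
import java.nio.charset.Charset;

class BomExperiment {
    /** Encode text with the given charset and render the bytes as spaced upper-case hex. */
    static String hex(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] names = {"UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE"};
        for (String name : names) {
            Charset cs = Charset.forName(name);
            // First the bare letter, then the letter preceded by an explicit U+FEFF.
            System.out.println(name + " plain       : " + hex("A", cs));
            System.out.println(name + " with U+FEFF : " + hex("\uFEFFA", cs));
        }
    }
}
```

Comparing the two lines printed for each encoding shows which encoders insert a mark on their own and which need the explicit '\ufeff' written first.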

Why So Many Encodings?

You may wonder why there are so many encodings, or why there are any at all. The reason is historical chaos. In the beginning, computers used a 4-channel, 16 possible characters, paper tape, allowing for only the hexadecimal characters 0..9 and A..F. To allow for alphabetic messages, not just numbers, two more channels of holes on the paper tape were added. This 6-bit, 64-character encoding allows digits, upper case A..Z and some punctuation. Every university or major computer installation invented its own code, allowing for the punctuations and symbols of most local importance.

Universities started exchanging programs and data on magnetic tape. The 7-bit, 128-character ASCII code was invented to allow for a common character set and encoding. It allowed for both upper and lower case and a reasonably rich set of punctuation.

About that time, computers started to standardise on the 8-bit byte. Every national group then expanded the code to 8 bits, giving them 256 possible characters. They filled the extra slots with various accented letters, nationally important symbols and letters from non-Roman alphabets. The Chinese had a difficult problem. They needed thousands of symbols, not just the 256 offered by 8-bit codes. So they invented various multi-byte encodings and 16-bit encodings. IBM invented EBCDIC, its own proprietary sets of codes to help lock customers into its equipment. There was very little document sharing, so the fact that every country had its own way of encoding data, and sometimes dozens of ways, caused little trouble.

To allow for exchange, especially on the Internet, 16-bit, 65,536-character Unicode was invented. Surely this would be sufficient to handle all of earth’s languages! It had one big drawback. At least for English, its files were twice as fat as the old 8-bit encodings. The world was not prepared to abandon their hundreds of encodings, even for new files. Not only were they firmly entrenched in email, they were burned into hardware, such as printers and modems. Java needed tables called encodings to translate scores of these 8-bit encodings into Unicode. Java’s Readers and Writers automatically handle the translations.

Then UTF-8 was invented to give the benefit of compact 8-bit encoding, with the full 16-bit Unicode character set.

Then scholars complained that Unicode did not handle various dead languages and obscure musical notation. So Unicode was extended to 32 bits to shut them up. Java half-heartedly supports this with code points.

When you write Java programs, there are at least three encodings you will be forced to deal with:

UTF-16 : how Java stores characters and codepoints (32-bit characters) internally in Strings and char[].
UTF-8 : the usual compact way to store Unicode data on disk or text files.
your local default : For me, this is windows-1252. It is the default encoding of *.bat files and notepad.
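A small sketch may make the difference between 16-bit chars and code points concrete. It builds a String holding one musical symbol from above 0xffff and counts it three ways. The class name CodePointDemo and the helper codePoints are invented for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePointDemo {
    /** Number of code points in s, counting a surrogate pair as one character. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies above 0xFFFF, so UTF-16 needs two chars for it.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println("UTF-16 chars  : " + clef.length());
        System.out.println("code points   : " + codePoints(clef));
        System.out.println("UTF-8 bytes   : " + clef.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("local default : " + Charset.defaultCharset());
    }
}
```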

ISO

You can buy documentation standard from ISO. They cost approximately

Roll Your Own

You can find out what is already supported with java.nio.charset.Charset.availableCharsets().
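A minimal sketch of that lookup follows. It lists every encoding the running JVM supports along with its aliases, and also shows how to get the official name for an alias. The class name ListCharsets and the helper officialName are made up for the example.

```java
import java.nio.charset.Charset;
import java.util.Map;

class ListCharsets {
    /** Official (canonical) name of the encoding known under the given alias. */
    static String officialName(String alias) {
        return Charset.forName(alias).name();
    }

    public static void main(String[] args) {
        // availableCharsets returns a map sorted by canonical name.
        for (Map.Entry<String, Charset> e : Charset.availableCharsets().entrySet()) {
            System.out.println(e.getKey() + "  aliases: " + e.getValue().aliases());
        }
        System.out.println("8859_1 is officially " + officialName("8859_1"));
    }
}
```

Charset.forName throws UnsupportedCharsetException for a name your JVM does not know, so call Charset.isSupported first if you are not sure.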

If you don’t see the character set encoding you need, you can write your own translate/encoding tables and insert them as part of the official set. See the java.nio.charset.spi.CharsetProvider, Charset, CharsetEncoder and CharsetDecoder classes.

To create a new character set, you extend CharsetProvider to provide one or more custom Charsets with look-up by name. To create the custom Charset, you extend the Charset class mainly to flesh it out with methods for newEncoder and newDecoder which provide your own custom CharsetEncoder and CharsetDecoder respectively.

To write your custom CharsetEncoder you extend CharsetEncoder and write a custom encodeLoop method. To write your custom CharsetDecoder you extend CharsetDecoder and write a custom decodeLoop method. You can of course borrow these methods from some other Charset and just code some exceptions to the rule. You can borrow either by extending or by delegation.

After all this is ready, to include your Charsets as part of the official ones, you register your new CharsetProvider with a configuration file named java.nio.charset.spi.CharsetProvider in the resource directory META-INF/services. This file contains a list of your fully-qualified CharsetProvider class names, one per line. The file must be encoded in UTF-8.
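The steps above can be sketched with a toy encoding. Everything named here is hypothetical: X-ROT13-DEMO is an invented encoding (7-bit ASCII with the letters rotated 13 places), and Rot13Charset and Rot13Provider are names chosen for the illustration. The sketch skips the META-INF/services registration and asks the provider directly, so Charset.forName will not find the encoding until you add that file.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

/** Hypothetical encoding: 7-bit ASCII with the letters rotated 13 places. */
class Rot13Charset extends Charset {
    static final String NAME = "X-ROT13-DEMO";

    Rot13Charset() {
        super(NAME, null); // no aliases
    }

    static int rot13(int c) {
        if (c >= 'a' && c <= 'z') return 'a' + (c - 'a' + 13) % 26;
        if (c >= 'A' && c <= 'Z') return 'A' + (c - 'A' + 13) % 26;
        return c;
    }

    @Override public boolean contains(Charset cs) {
        return cs instanceof Rot13Charset;
    }

    @Override public CharsetEncoder newEncoder() {
        return new CharsetEncoder(this, 1.0f, 1.0f) {
            @Override protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                    char c = in.get();
                    if (c > 0x7F) { // not ASCII: back up and report it as unmappable
                        in.position(in.position() - 1);
                        return CoderResult.unmappableForLength(1);
                    }
                    out.put((byte) rot13(c));
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    @Override public CharsetDecoder newDecoder() {
        return new CharsetDecoder(this, 1.0f, 1.0f) {
            @Override protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                    byte b = in.get();
                    if (b < 0) { // high bit set: back up and report it as malformed
                        in.position(in.position() - 1);
                        return CoderResult.malformedForLength(1);
                    }
                    out.put((char) rot13(b));
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }
}

/** Provider offering the one demo encoding with look-up by name. */
class Rot13Provider extends CharsetProvider {
    private final Charset rot13 = new Rot13Charset();

    @Override public Charset charsetForName(String name) {
        return Rot13Charset.NAME.equalsIgnoreCase(name) ? rot13 : null;
    }

    @Override public Iterator<Charset> charsets() {
        return Collections.singletonList(rot13).iterator();
    }

    public static void main(String[] args) {
        Charset cs = new Rot13Provider().charsetForName("x-rot13-demo");
        byte[] encoded = "Hello".getBytes(cs);
        System.out.println(new String(encoded, StandardCharsets.US_ASCII));
        System.out.println(new String(encoded, cs));
    }
}
```

A real encoding would more likely borrow the loops of an existing Charset by delegation and code only the exceptions, as described above.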

Converting

The key thing in converting to keep uppermost in your mind is that all encoded files are conceptually composed of 8-bit byte[], even UTF-16 encoded files. Java internally works with Unicode 16-bit chars. Don’t try to go from String to String or byte[] to byte[]. You are always encoding String to byte[] or decoding byte[] to String. There are three basic ways to do the conversions:

With Reader and Writer file I/O. See the File I/O Amanuensis for details. Your files are byte-encoded and you read and write translating into Strings internally. Use a Reader to decode bytes to Strings and a Writer to encode Strings to bytes.

Use the String constructor to decode bytes to Strings.

// use String constructor to decode bytes to String
byte[] someBytes = ...;
String encodingName = "Shift_JIS";
String s = new String ( someBytes, encodingName );

Use String.getBytes to encode Strings to bytes.

// Using String.getBytes to encode String to bytes
String s = ...;
byte [] b = s.getBytes( "8859_1" /* encoding */ );

If you have more than one conversion to do, use java.nio.charset.Charset. Looking the Charset up once saves the overhead of finding the encoding by name on every conversion.
If you want very fast conversions, you must avoid the hidden copies that are inherent in the above methods. For speed, you would use CharBuffer and ByteBuffer.
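The last two approaches can be sketched together: look the Charset up once, keep it, and convert through CharBuffer and ByteBuffer. The class name ReuseCharset and its two helpers are invented for this example, and ISO-8859-1 is used only because every JVM is required to support it.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

class ReuseCharset {
    // Looked up by name once. Charset objects are safe to share between threads.
    static final Charset LATIN1 = Charset.forName("ISO-8859-1");

    /** Encode a String to bytes via the buffer API. */
    static byte[] encode(String s) {
        ByteBuffer bb = LATIN1.encode(CharBuffer.wrap(s));
        byte[] result = new byte[bb.remaining()];
        bb.get(result);
        return result;
    }

    /** Decode bytes back to a String via the buffer API. */
    static String decode(byte[] bytes) {
        return LATIN1.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        byte[] b = encode("caf\u00e9");
        System.out.println(b.length + " bytes, decoded: " + decode(b));
    }
}
```

The copy into a byte[] here is for convenience only. Code that truly needs speed would keep working on the ByteBuffer itself.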

native2ascii

Sun has included a utility misnamed native2ascii.exe with the JDK. In JDK 1.8.0_131 you will find it at J:\Program Files\java\jdk1.8.0_131\bin\native2ascii.exe on your local Windows J: drive.

It converts files from any encoding to a printable ASCII form and back. The printable form uses ASCII characters plus escapes like \u95e8 for the exotic characters.
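The escaped form is easy to produce in your own code too. The following is a rough sketch of the idea, not a replacement for the utility: the class name AsciiEscaper and both method names are invented, every char above 0x7f is escaped, and the reverse direction assumes the escapes are well formed.

```java
class AsciiEscaper {
    /** Replace every char above 0x7F with a backslash, a u and four hex digits. */
    static String toAscii(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Turn the escapes back into chars. Throws NumberFormatException on a bad escape. */
    static String fromAscii(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            if (s.startsWith("\\u", i) && i + 6 <= s.length()) {
                sb.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                sb.append(s.charAt(i++));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String escaped = toAscii("gate: \u95e8");
        System.out.println(escaped);
        System.out.println(fromAscii(escaped).equals("gate: \u95e8"));
    }
}
```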

details on how to use native2ascii

Reversibility

You won’t necessarily get exactly back to where you started if you encode then decode. If you chose a traditional single-byte, 8-bit encoding, say Cp437 as your target, there are only 256 codes to go round for all 64K Unicode characters. Obviously, some Unicode characters are going to have to collapse onto the same 8-bit character and so won’t decode back to where they started. Further, some of these 8-bit encodings have a few strange characters that don’t exist in Unicode. UTF-8 does not suffer from this problem.

Further, the encode/decode routines are permitted to combine pairs such as 0x0055 (LATIN CAPITAL LETTER U) followed by 0x0308 (COMBINING DIAERESIS) to a single character 0x00DC (LATIN CAPITAL LETTER U WITH DIAERESIS), or vice versa.
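A short round-trip sketch shows the loss in action. It pushes a string containing the euro sign through ISO-8859-1, which has no euro, and through UTF-8, which does. The class name RoundTrip and the helper roundTrip are made up for the illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTrip {
    /** Encode s to bytes with cs, then decode those bytes with the same cs. */
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String original = "price: \u20ac5";
        System.out.println("ISO-8859-1 : " + roundTrip(original, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8      : " + roundTrip(original, StandardCharsets.UTF_8));
        // canEncode lets you find out ahead of time whether a character will survive.
        System.out.println("Latin-1 can encode euro? "
                + StandardCharsets.ISO_8859_1.newEncoder().canEncode('\u20ac'));
    }
}
```

String.getBytes quietly substitutes the encoding's replacement byte for anything it cannot represent, which is why the loss goes unnoticed unless you check.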

Tracking which characters get Translated Where

Be careful when translating between character sets using the encoding feature of Readers. Everything goes through the intermediate 16-bit Unicode which may not have all the characters of the source and destination character sets. Some characters may be translated to codes with some high byte bits on. For more accurate translation, do it yourself with a one-step table. You can use the following program to discover what translations are being done with any particular encoding and use that information to generate the source for your own translate table, using the automatic encodings, so that you can see any inaccuracies and fix them.
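A program of that kind can be sketched as below. This is a stand-in written for this page, not the original listing: the class name TranslationTable and the helper decodeByte are invented, and the approach only makes sense for single-byte encodings. It decodes each of the 256 possible byte values and prints the Unicode character it turns into.

```java
import java.nio.charset.Charset;

class TranslationTable {
    /** Decode the single byte value b (0..255) with the given encoding. */
    static String decodeByte(int b, Charset cs) {
        return new String(new byte[] {(byte) b}, cs);
    }

    public static void main(String[] args) {
        // Encoding name comes from the command line, windows-1252 if none is given.
        Charset cs = Charset.forName(args.length > 0 ? args[0] : "windows-1252");
        for (int b = 0; b < 256; b++) {
            String s = decodeByte(b, cs);
            // Bytes with no mapping usually come back as U+FFFD, the replacement character.
            String target = s.isEmpty() ? "(nothing)" : String.format("U+%04X", (int) s.charAt(0));
            System.out.printf("0x%02X -> %s%n", b, target);
        }
    }
}
```

Running it for windows-1252 and again for ISO-8859-1 shows exactly where the two part company in the range 0x80..0x9f.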

Rant on Encoding Identification

Files are not marked with a signature to denote the encoding used. Further, the encoding is not recorded externally in some sort of resource fork. You are just supposed to know what sort of encoding was used or track it by some ad hoc means. There are three exceptions.

Emails have the encoding recorded in the header, e.g. Content-Type: text/plain; charset=iso-8859-1
HTML files have the optional content-type meta tag to tell you.
XML (extensible Markup Language) files have the optional encoding parameter.
<?xml version="1.0" encoding="ISO-8859-1"?>

I feel spitting mad at the gross incompetence of this situation. It conjures images of unshaved young men in dirty underwear surrounded by empty pizza boxes and dirty coffee cups. The encoding should have been embedded in the file from the start. Perhaps it is not too late to embed it this way and hope the convention catches on: BOM BOM encoding-name BOM. This convention does not require any new reserved characters. Unfortunately, it is not transparent to programs ignorant of the convention. It could be transparent. Java could strip or add this prefix automatically, hiding it from the application programs. You could add a new method to Readers to find out what the encoding actually was. Perhaps we need to think this out and embed other meta information in an extensible way at the same time, such as MIME (Multipurpose Internet Mail Extensions) type.

You can make a guess by reading the text presuming various encodings. This is how the Encoding Recogniser below works. The language gives a clue to the likely encoding used. The way common words are encoded gives a clue. Try looking at the document in various encodings and see which makes the most sense.

The Unicode little-endian or big-endian BOM is a strong clue you have 16-bit Unicode.
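Checking for a BOM is simple enough to sketch. The code below looks at the first few bytes of a file and names the encoding the mark suggests. The class name BomSniffer and the method sniff are invented here, and a missing mark tells you nothing either way.

```java
class BomSniffer {
    /** Returns the encoding suggested by a leading byte-order mark, or null if there is none. */
    static String sniff(byte[] head) {
        // The 4-byte marks are tested first because FF FE 00 00 also begins with the
        // UTF-16 little-endian mark FF FE. A UTF-16LE file that starts with a NUL
        // character would be misread, so treat the answer as a strong hint only.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        System.out.println(sniff(sample));
    }
}
```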

To automate the guessing, you could look for common foreign words to see how they are encoded. You could compute letter frequencies and compare them against documents with known encodings.

You might want to tackle this student project to solve the problem.

The following Applet helps you determine the encoding of a file by displaying the beginning of it in hex and decoded characters in any of the supported Java encodings. If the file is made only of printable ASCII characters, then almost any encoding can be used to read it. If the display shows blanks between each character then chances are you have some variant of UTF-16 encoding.

You can fine tune your guesses by entering them in the Official Encoding Applet above to see which sample character set looks most plausible for documents such as yours. The biggest clue is what country the document came from. Try the national encodings first.

Java Requirements and Troubleshooting

EncodingRecogniser is a signed Java Applet (that can also be run as an application) to Encoding Recogniser. You are welcome to install it on your own website. If it does not work…

For this Applet hybrid to work, you must click grant/accept/always run on this site/I accept the risk to give it permission to read a file whose encoding you want to determine. If you refuse to grant permission, the program may crash with an inscrutable stack dump on the console complaining about AccessController.checkPermission.
In the Java Control Panel security tab, click Start ⇒ Control Panel ⇒ Programs ⇒ Java ⇒ Security, configure medium security to allow self-signed and vanilla unsigned applets to run. If medium is not available, or if Java security is blocking you from running the program, configure high security and add http://mindprod.com to the Exception Site List at the bottom of the security tab.
Often problems can be fixed simply by clicking the reload button on your browser.
Make sure you have both JavaScript and Java enabled in your browser.
Make sure the Java in your browser is enabled in the security tab of the Java Control panel. Click Start ⇒ Control Panel ⇒ Programs ⇒ Java ⇒ Security ⇒ Enable Java Content in the browser.
This signed Java Applet (that can also be run as an application) needs 32-bit or 64-bit Java 1.8 or later. For best results use the latest 1.8.0_131 Java.
You also need a recent browser.
It works under any operating system that supports Java e.g. W2K, XP, W2003, Vista, W2008, W7-32, W7-64, W8-32, W8-64, W2012, W10-32, W10-64, Linux, LinuxARM, LinuxX86, LinuxX64, Ubuntu, Solaris, SolarisSPARC, SolarisSPARC64, SolarisX86, SolarisX64 and OSX
You should see the Applet hybrid above looking much like this screenshot. If you don’t, the following hints should help you get it working:
Optionally, you may permanently install the Canadian Mind Products code-signing certificate so you don’t have to grant each time.
If the above Applet hybrid appears to freeze-up, click Alt-Esc repeatedly to check for any buried permission dialog box.
If you have certificate troubles, check the installed certificates and remove or update any obsolete or suspected defective certificates. The only certificate used by this program is mindprodcert2017rsa.cer.
Especially if this Applet hybrid has worked before, try clearing the browser cache and rebooting.
To ensure your Java is up to date, check with Wassup. First, download it and run it as an application independent of your browser, then run it online as an Applet to add the complication of your browser.
If the above Applet hybrid does not work, check the Java console for error messages.
If the above Applet hybrid does not work, you might have better luck with the downloadable version available below.
If you are using Mac OS X and would like an improved Look and Feel, download the QuaQua look & feel from randelshofer.ch/quaqua. UnZip the contained quaqua.jar and install it in ~/Library/Java/Extensions or one of the other ext dirs.
Upgrade to the latest version of Internet Explorer or another browser.
Click the Information bar, and then click Allow blocked content. Unfortunately, this also allows dangerous ActiveX code to run. However, you must do this in order to get access to perfectly-safe Java Applets running in a sandbox. This is part of Microsoft’s war on Java.
Try upgrading to a more recent version of your browser, or try a different browser e.g. Firefox, SeaMonkey, IE or Avant.
If you still can’t get the program working click the red HELP button below for more detail.
If you can’t get the above Applet hybrid working after trying the advice above and from the red HELP button below, have bugs to report or ideas to improve the program or its documentation, please send me an email at.


Choosing an Encoding

Applet
When you are working in an Applet you don’t have to concern yourself with encoding. You use 16-bit Unicode Strings and all comes out in the wash.
Servlet
If you are writing a Servlet, you can nearly always send UTF-8 encoding. But in any case, HTTP requests will come in with a list of encodings the receiver is prepared to accept.
Email
If you are sending an email, you can look at the encoding the person used to send emails to you and use that for replies. Most modern email programs will all support UTF-8 so you could use it universally.
Reading
If you are reading a file, you have to get somebody to tell you what encoding it is. You can look for BOMs to detect likely Unicode encodings. You can discover the default encoding with the restricted file.encoding system property.
Writing
If you know someone’s locale (language/country/variant), you can make a guess at what encodings they might like in files you prepare for them, but Java has no built-in support for making that guess. The move is toward using UTF-8 all the time, especially when documents contain multiple languages or they are being sent to multiple people. You can use the system default.
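The default-encoding lookup mentioned under Reading can be sketched as follows. The class name DefaultEncoding and the helper defaultName are made up for the example. Because file.encoding is a restricted property, System.getProperty may throw a SecurityException in an unsigned Applet, so the sketch also shows Charset.defaultCharset, which needs no permission.

```java
import java.nio.charset.Charset;

class DefaultEncoding {
    /** Official name of the encoding the JVM uses when you do not specify one. */
    static String defaultName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("file.encoding property : " + System.getProperty("file.encoding"));
        System.out.println("Charset.defaultCharset : " + defaultName());
    }
}
```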

The times you would send people special encodings are when:

They are using old text-based software that supports only one particular encoding.
You are sending large volumes of data and want the efficiency of a national encoding perhaps combined with compression.

Java Source Code Encoding

If your Java source code contains awkward characters encoded with \uxxxx, then there is nothing special you need to do. However, if they are encoded as naked UTF-8 characters, then you need to tell javac with its -encoding option, e.g. javac -encoding UTF-8 MyClass.java. To set it in ANT (A Neat Tool), use the encoding attribute of the javac task, e.g. encoding="UTF-8".

Default File and Console Encoding

If you want to change the default encoding for files you read and write, including the console, you need to set the file.encoding system property. You can try this programmatically, but the JVM normally reads the property once at startup, so a change made later may have no effect:

// setting the default encoding programmatically
System.setProperty( "file.encoding", "UTF-8" );

You can also do it on the java.exe command line like this:

Rem setting the default encoding on the command line
java.exe "-Dfile.encoding=UTF-8" -jar myprog.jar
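If you would rather not depend on the default at all, you can name the encoding explicitly each time you create a Reader or Writer. The sketch below does that with in-memory streams so it runs anywhere. The class name ExplicitEncoding and the helpers write and read are invented for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitEncoding {
    /** Encode text through a Writer that was told which encoding to use. */
    static byte[] write(String text, Charset cs) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Decode bytes through a Reader that was told which encoding to use. */
    static String read(byte[] bytes, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), cs)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = write("caf\u00e9", StandardCharsets.UTF_8);
        System.out.println(bytes.length + " bytes, read back: " + read(bytes, StandardCharsets.UTF_8));
    }
}
```

For real files you would wrap a FileOutputStream or FileInputStream in the same way.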

Learning More

Oracle’s Technetwork Supported Locales

Oracle’s Java 1.8 documentation on : Locales and Encoding

Oracle’s Java 1.8 documentation on : nio encodings

Oracle’s Javadoc, available on the web at Oracle.com and in the current JDK 1.8.0_131 on your local Windows J: drive, covers:

Charset class
Charset.forName
CharsetDecoder class
CharsetProvider class
ByteBuffer class
CharBuffer class
System.getProperty

Oracle’s Java 1.8 documentation on : native2ascii

Oracle’s Technetwork Character Conversions Browser ⇒ Database

RFC 2978 Describes the IANA procedure for registering new encodings.

standard footer
	This page is posted on the web at:	http://mindprod.com/jgloss/encoding.html
	Please read the feedback from other visitors, or send your own feedback about the site. Contact Roedy. Please feel free to link to this page without explicit permission.
