BOM : Java Glossary


BOM

BOMs (Byte Order Marks) are special characters at the beginning of a Unicode file that indicate whether it is big or little endian, in other words, whether the high or low order byte comes first. These codes also tell whether the encoding is 8, 16 or 32-bit. You can recognise Unicode files by their starting byte order marks and, when the text is mostly Latin characters, by the way UTF-16 files are half zeroes and UTF-32 files are three-quarters zeroes.

UTF BOM (Byte Order Mark) Unicode-encoding Endian Indicators
BOM (0xfeff) as it appears encoded	Description
ef bb bf	UTF-8. Endianness, strictly speaking, does not apply, though UTF-8 uses a big-endian, most-significant-byte-first representation.
fe ff	UTF-16 for 16-bit internal UCS-2, big endian, Java network order
ff fe	UTF-16 for 16-bit internal UCS-2, little endian, Intel/Microsoft order. Note you must examine subsequent bytes to tell this apart from a UTF-32 BOM since they both start ff fe.
00 00 fe ff	UTF-32 for 32-bit internal UCS-4, big-endian, Java network order
ff fe 00 00	UTF-32 for 32-bit internal UCS-4, little endian, Intel/Microsoft order.

The actual Unicode character encoded in all cases is 0xfeff.
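The table above can be turned into a small lookup. This is only a sketch: the class name BomSniffer and the method detect are made up for illustration, and the strings returned are just labels. It checks the four-byte UTF-32 marks before the two-byte UTF-16 marks, because ff fe 00 00 also begins with ff fe. Note the remaining ambiguity: a UTF-16 little endian file whose first real character is U+0000 would be reported as UTF-32.

```java
/** Hypothetical helper: guesses an encoding from the leading BOM bytes. */
final class BomSniffer {

    /** Returns a label for the BOM found at the start of head, or null if there is none. */
    static String detect(byte[] head) {
        // Longest marks first: ff fe 00 00 (UTF-32LE) starts with ff fe (UTF-16LE).
        if (startsWith(head, 0x00, 0x00, 0xfe, 0xff)) return "UTF-32BE";
        if (startsWith(head, 0xff, 0xfe, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xef, 0xbb, 0xbf)) return "UTF-8";
        if (startsWith(head, 0xfe, 0xff)) return "UTF-16BE";
        if (startsWith(head, 0xff, 0xfe)) return "UTF-16LE";
        return null;
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xff) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xff, (byte) 0xfe, 'H', 0};
        System.out.println(detect(sample)); // prints UTF-16LE
    }
}
```

In practice you would read the first four bytes of the file into the array and pass them in.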

There are also variants of these encodings, such as UTF-16BE and UTF-16LE, where the byte order is implied by the name of the encoding rather than by a marker.

Unfortunately, applications, even Javac.exe, often choke on these byte order marks. Java Readers don’t automatically filter them out. There is not much you can do but remove them manually.

Avoiding BOMs

How can you get rid of these pesky BOMs? Here are some ideas, with the ones I consider best/simplest near the top.

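Use UTF-8. It has no byte order, so it does not need a BOM, and most tools write UTF-8 without one. You can use native2ascii.exe to help with the conversion: it turns your given encoding into ASCII with \uxxxx escapes, and with -reverse it turns that back into UTF-8.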
Write a utility that reads the first character of a file. If it is a BOM, copy the rest of the file to a temp file, then delete the original and rename the temp to the original, effectively permanently chopping off the leading BOM. Unfortunately, this discards the useful information about the encoding of the file. It will be more efficient if you don’t use Readers, but use byte-based InputStreams instead.
Write an encoding UTF-16HIDEBOM that wraps itself around UTF-16 and install it as one of the official encodings.
Write a FilterInputStream that discards BOMs, and use it in your apps.
Lobby Oracle to provide a solution.
Look for the character in your application code and ignore it. This technique is very clumsy and will seriously interfere with your application logic.
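The FilterInputStream idea above might look something like the following. It is a minimal sketch, not a finished tool: the names BomSkipper and skipUtf8Bom are invented here, it only recognises the UTF-8 mark ef bb bf, and it leans on java.io.PushbackInputStream (itself a FilterInputStream) so that bytes read while peeking can be put back when they turn out not to be a BOM.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/** Hypothetical helper: hides a leading UTF-8 BOM from whoever reads the stream. */
final class BomSkipper {

    /** Wraps in so that a leading ef bb bf, if present, is consumed and never seen by the caller. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // Peek at up to three bytes; short files may give us fewer.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) break;
            n += r;
        }
        boolean isBom = n == 3
                && (head[0] & 0xff) == 0xef
                && (head[1] & 0xff) == 0xbb
                && (head[2] & 0xff) == 0xbf;
        // Not a BOM: put the peeked bytes back so the caller sees the stream unchanged.
        if (!isBom && n > 0) pushback.unread(head, 0, n);
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xef, (byte) 0xbb, (byte) 0xbf, 'o', 'k'};
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(data));
        System.out.println((char) in.read()); // prints o
    }
}
```

Wrapping the stream before handing it to an InputStreamReader keeps the BOM out of your application logic. The same peek-and-unread approach would extend to the UTF-16 and UTF-32 marks with a four-byte pushback buffer.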

TestBOM

This program tests how Java handles BOMs. It discovers that Java never inserts BOMs and never removes them on its own. You have to bypass, insert and delete them explicitly.
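The original TestBOM listing is not reproduced here. As a stand-in, the following sketch checks the same two questions for UTF-8 only: does a Writer add a BOM, and does a Reader strip one? The class name Utf8BomCheck and its methods are invented for this illustration, and other encodings, plain UTF-16 in particular, may behave differently, so test the ones you care about.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

/** Hypothetical check of how Java's UTF-8 Reader and Writer treat the BOM. */
final class Utf8BomCheck {

    /** Encodes text with a UTF-8 OutputStreamWriter and returns the raw bytes. */
    static byte[] write(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(text);
        }
        return out.toByteArray();
    }

    /** Decodes raw with a UTF-8 InputStreamReader and returns the first char, or -1 if empty. */
    static int firstChar(byte[] raw) throws IOException {
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(raw), StandardCharsets.UTF_8)) {
            return r.read();
        }
    }

    public static void main(String[] args) throws IOException {
        // Writing: one byte out for "A", so no ef bb bf was prepended.
        System.out.println(write("A").length); // prints 1

        // Reading: the BOM comes through as the character 0xfeff.
        byte[] withBom = {(byte) 0xef, (byte) 0xbb, (byte) 0xbf, 'A'};
        System.out.println(Integer.toHexString(firstChar(withBom))); // prints feff
    }
}
```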

Encodings Matter

You would think if there is a BOM at the start of a file, Java could tell all on its own if the file were UTF-8, UTF-16BE or UTF-16LE encoded. However, Java is not clever. You must get the encoding right in the InputStreamReader, or you will just read gibberish and you will not get an error message.
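To see the silent gibberish for yourself, something along these lines should do. It is an illustrative sketch with made-up names (WrongEncodingDemo, decode): it takes the bytes of "Hi" encoded as UTF-16LE with a BOM and decodes them twice, once with the right encoding and once with UTF-8. Neither call throws.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: the same bytes decoded with the right and the wrong encoding. */
final class WrongEncodingDemo {

    /** Reads every char from raw through an InputStreamReader using the given charset. */
    static String decode(byte[] raw, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(raw), charset)) {
            for (int c; (c = r.read()) != -1; ) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // ff fe is the little endian BOM, followed by 'H' and 'i' as 16-bit little endian.
        byte[] raw = {(byte) 0xff, (byte) 0xfe, 'H', 0, 'i', 0};

        String right = decode(raw, StandardCharsets.UTF_16LE);
        String wrong = decode(raw, StandardCharsets.UTF_8);

        System.out.println(right.endsWith("Hi")); // prints true
        System.out.println(wrong.contains("Hi")); // prints false: garbage, but no exception
    }
}
```

With the wrong encoding the Reader quietly substitutes replacement characters for bytes it cannot decode, which is why no error ever surfaces.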

Here is how I discovered this:

encoding
encoding recogniser
native2ascii.exe
Unicode
UTF

	This page is posted on the web at:	http://mindprod.com/jgloss/bom.html