Converting binary data into printable gibberish so that data transport systems will not corrupt it. You see it
used often in certificates, email, and HTTP communications.
There are many data transport systems that either ignore, act on or otherwise meddle with control characters
embedded in the data. They may trim trailing blanks, change line-end characters, convert tabs to spaces and so on.
Any of these actions would totally corrupt binary data. To pass binary data through such a meddlesome channel,
e.g. the email system, it must first be armoured, i.e. converted to use only safe
printable characters that will not be meddled with, e.g. a-z A-Z
0-9 and the vanilla punctuation. I sometimes refer to characters that need special processing to pass
through a channel as awkward.
MIME email and email attachments have a configurable encoding scheme, controlled via the Content-Transfer-Encoding MIME header, often base64 or Quoted-Printable.
Unfortunately, armouring bulks the message up, anywhere from about 33% to, in a pathological case, triple the
original size, depending on the technique you use. The other end has to recognise the armouring technique and
do the reverse to get the binary back.
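The armour/dearmour round trip can be sketched in Java as follows. This is a minimal illustration, assuming base64 as the armouring technique and Java 8 or later for java.util.Base64; the class and method names (ArmourRoundTrip, armour, dearmour) are made up for this example.

```java
import java.util.Arrays;
import java.util.Base64;

/** Round trip: armour raw bytes as base64 text, then dearmour them at the far end. */
class ArmourRoundTrip {

    /** Sender side: convert arbitrary bytes to printable base64 characters. */
    static String armour(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    /** Receiver side: reverse the armouring to recover the original bytes. */
    static byte[] dearmour(String armoured) {
        return Base64.getDecoder().decode(armoured);
    }

    public static void main(String[] args) {
        // Bytes a meddlesome channel might mangle:
        // NUL, tab, CR, LF, a blank and a high-bit byte.
        byte[] raw = {0x00, 0x09, 0x0D, 0x0A, 0x20, (byte) 0xFF};

        String armoured = armour(raw);         // only safe printable characters
        byte[] recovered = dearmour(armoured); // identical to raw

        System.out.println(armoured);
        System.out.println("round trip intact: " + Arrays.equals(raw, recovered));
    }
}
```

Both ends must agree on the technique. Here that agreement is hard-coded; in MIME email the header tells the receiver which decoder to apply.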
When 8-bit data are encoded in printable characters, the more printable characters used in the representation,
generally the more efficient the protocol. However, the more characters used, the greater the odds one of the
characters used will be interfered with by your communication channel.
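To put rough numbers on that tradeoff: a character drawn from an alphabet of N symbols can carry at most log2(N) bits, so each 8-bit byte needs about 8 / log2(N) output characters. The sketch below computes that theoretical figure for 16-symbol and 64-symbol alphabets. It assumes each output character occupies one byte on the wire and ignores padding and line breaks; the class and method names are made up for this example.

```java
/** Theoretical size overhead of armouring with an alphabet of a given size. */
class ArmourOverhead {

    /** Bits of payload each printable character can carry: log2(alphabetSize). */
    static double bitsPerChar(int alphabetSize) {
        return Math.log(alphabetSize) / Math.log(2);
    }

    /** Encoded size divided by original size, assuming one byte per output character. */
    static double expansion(int alphabetSize) {
        return 8.0 / bitsPerChar(alphabetSize);
    }

    public static void main(String[] args) {
        // 16 symbols corresponds to hexadecimal, 64 symbols to base64-style schemes.
        for (int size : new int[] {16, 64}) {
            System.out.printf("alphabet of %d: %.2f bits per char, %.2f times the original size%n",
                    size, bitsPerChar(size), expansion(size));
        }
    }
}
```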
Armouring Schemes
Unfortunately, there is a plethora of techniques. It is not always obvious just from looking which was used to
encode the data:
- base64: common in certificates, passwords, email, email attachments, cookies and
HTTP
Base64 uses a small cast of characters to convert 8-bit data into printable characters: a to z, A to Z, 0 to 9, + / and =. You might do this to
convert any binary data to printable. This makes base64 suitable for encoding binary data as SQL strings
that will work no matter what the encoding. Unfortunately + / and = all have special meaning in URLs. See Base64 for
free Java source code. Every three bytes in the original fluff up to four characters in the encoded
form. This 33% increase in size occurs independent of what bytes appear
in your data. At the receiving end you convert the printable characters back to the 8-bit data.
- url-encoded. See the separate entry on it.
- base64u: A variant of Base64 that avoids the + / and = characters that have
special meaning in URLs, GET and POST. You can treat its output either as not needing URLEncoding, or as
already URLEncoded. Used to armour bytes or anything that can be converted to bytes, e.g. via serialization.
- the Transporter: optionally handles
serialising/reconstituting, compression/decompression, signing/verifying, heavy duty encryption/decryption
and Base64u armouring/dearmouring, all with lightweight classes.
Use it when you want to include arbitrary Java Objects in your CGI GETs and POSTs.
- Quoted-Printable (RFC 2045): used in newsgroup messages
and email. It uses the following set of
characters to convert 8-bit data into printable characters: space, a to z, A to Z, 0 to 9 and the punctuation in the ranges ! to < and > to ~, with = reserved as the escape character. It converts unsafe bytes into =XX, where XX is the two-digit hex value of the byte. In the best case, your message is the same size as the
original. In a pathological case, your message can balloon up to three times the original size.
- hexadecimal: two characters per byte, each a hex digit 0..9 or A..F. The result is always exactly double the
size of the original. This is one of the easiest schemes to write code for.
- binhex: a hex variant used on the Macintosh.
- UUEncode: similar to Base64 in that both use 64 ASCII characters, each
representing 6 bits of the original data, but they are not compatible.
Base64 uses upper case, lower case, digits and only three punctuation symbols. UUEncode uses 28 punctuation
symbols, digits and only upper case letters. Also, the uuencode command has a structure to its output, with a
header containing a file name and permissions, line-length encoding characters, and a footer, none of which are
part of Base64.
- CMP Encode: dates back to 1985. Very efficient for text that is mostly printable already. CMP Encode uses the full 95 ASCII
printable characters excluding space. Printable characters it leaves as is. It encodes control characters with
a lead ^, e.g. code 3 becomes ^C. High bit chars are
encoded with a lead `. It has a simple compression scheme for repeating character
strings. In the best case, your message can be even smaller than the original. In a pathological case, your
message can balloon up to twice the original size. Unfortunately, Java code for this algorithm is not currently
available. Pascal source and an executable are available. This algorithm is
not a recognised official MIME encoding.
- CMP Encrypt: dates back to 1985. It also encrypts the data with a theoretically uncrackable one-time pad. Pascal source and an executable are available.
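To compare a few of the schemes listed above on the same bytes, here is a hedged Java sketch. It uses the JDK's own java.util.Base64 (Java 8 or later), whose URL-safe encoder follows RFC 4648 and may not match the output of the Base64u class mentioned above. The Quoted-Printable method is a deliberate simplification: it escapes every byte other than the printable ASCII characters ! to ~ (and escapes = itself), and it ignores the line-length rules of RFC 2045. The class and method names are made up for this example.

```java
import java.util.Base64;

/** Side-by-side sketch of several armouring schemes applied to the same bytes. */
class ArmourSchemes {

    /** Hexadecimal: two characters 0..9 A..F per byte, always double the size. */
    static String hex(byte[] raw) {
        StringBuilder sb = new StringBuilder(raw.length * 2);
        for (byte b : raw) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Standard base64: output may contain + / and trailing = padding. */
    static String base64(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    /** URL-safe base64 (RFC 4648): - and _ replace + and /, padding omitted. */
    static String base64Url(byte[] raw) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
    }

    /** Simplified Quoted-Printable: no soft line breaks, space always escaped. */
    static String quotedPrintableSketch(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            int v = b & 0xFF;
            if (v >= 33 && v <= 126 && v != '=') {
                sb.append((char) v);                   // safe printable: pass through
            } else {
                sb.append(String.format("=%02X", v));  // unsafe: =XX hex escape
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xFB, (byte) 0xFF};       // two high-bit bytes
        System.out.println("hex:        " + hex(raw));
        System.out.println("base64:     " + base64(raw));
        System.out.println("base64 url: " + base64Url(raw));
        System.out.println("QP sketch:  " + quotedPrintableSketch(raw));
    }
}
```

With all-binary input like this, the Quoted-Printable form is three characters per byte, the pathological case. With mostly printable text it would pass nearly everything through unchanged, which is why the best choice of scheme depends on the data.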