Converting binary data into printable gibberish so that data transport systems will
not corrupt it. You see it used often in certificates, email and
HTTP (Hypertext Transfer Protocol) communications.
There are many data transport systems that either ignore, act on or otherwise
meddle with control characters embedded in the data. They may trim trailing blanks,
change line end characters, convert tabs to spaces etc. etc. Any of these actions
would totally corrupt binary data. To pass binary data through such a meddlesome
channel, e.g. the email system, it must first be armoured,
converted to use only safe printable characters that will
not be meddled with, e.g. a-z A-Z 0-9 and the vanilla
punctuation. I sometimes refer to characters that need special processing to pass
through a channel as awkward.
MIME (Multipurpose Internet Mail Extensions) email and email
attachments have a configurable encoding scheme, controlled via the Content-Transfer-Encoding MIME header, often base64 or
Quoted-Printable.
Unfortunately this bulks the message up by 30 to 300%
depending on the technique you use. The other end has to recognise the armouring
technique and do the reverse to get the binary back.
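As a rough illustration of why armouring is needed, the sketch below simulates a meddlesome channel and sends the same bytes through it twice, once raw and once armoured with Base64. The class name MeddlesomeChannelDemo and the meddle method are hypothetical, invented for this example; the channel is assumed to convert CR LF to LF and trim trailing blanks, which is only one of many ways a real transport might interfere.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

/** Hypothetical demo: a text channel that damages raw bytes but leaves Base64 alone. */
final class MeddlesomeChannelDemo {

    /** Simulated channel (an assumption): converts CR LF to LF and trims trailing blanks. */
    static String meddle(String text) {
        return text.replace("\r\n", "\n").replaceAll(" +$", "");
    }

    /** Sends the bytes unarmoured, one char per byte (ISO-8859-1 maps all 256 values). */
    static byte[] sendRaw(byte[] data) {
        String onTheWire = new String(data, StandardCharsets.ISO_8859_1);
        return meddle(onTheWire).getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Armours with Base64 before sending, then reverses the armouring at the far end. */
    static byte[] sendArmoured(byte[] data) {
        String onTheWire = Base64.getEncoder().encodeToString(data);
        return Base64.getDecoder().decode(meddle(onTheWire));
    }

    public static void main(String[] args) {
        // Binary data containing a control char, CR LF, a high-bit byte and trailing blanks.
        byte[] binary = {0x01, 0x0D, 0x0A, (byte) 0xFF, 0x20, 0x20};
        System.out.println("raw survived:      " + Arrays.equals(binary, sendRaw(binary)));
        System.out.println("armoured survived: " + Arrays.equals(binary, sendArmoured(binary)));
    }
}
```

The raw copy loses its CR and its trailing blanks, while the armoured copy comes back intact because the Base64 text contains nothing the simulated channel cares about.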
When 8-bit data are encoded in printable characters, the
more printable characters used in the representation, generally the more efficient
the protocol. However, the more characters used, the greater the odds one of the
characters used will be interfered with by your communication channel.
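The trade-off can be put in numbers. An alphabet of N printable characters can carry at most log2(N) bits per character, so each 8-bit byte needs at least 8 / log2(N) characters. The sketch below computes that theoretical lower bound; the class name AlphabetEfficiency and its method are hypothetical, and real schemes add padding, line breaks or headers on top of this figure.

```java
/** Hypothetical helper: theoretical expansion for a printable alphabet of a given size. */
final class AlphabetEfficiency {

    /** Minimum output characters needed per input byte when N distinct characters are allowed. */
    static double charsPerByte(int alphabetSize) {
        double bitsPerChar = Math.log(alphabetSize) / Math.log(2);
        return 8.0 / bitsPerChar;
    }

    public static void main(String[] args) {
        // 16 = hexadecimal, 64 = Base64 or UUEncode, 94 = every printable ASCII character except space.
        for (int size : new int[] {16, 64, 94}) {
            System.out.printf("%d characters: %.3f chars per byte%n", size, charsPerByte(size));
        }
    }
}
```

With 16 characters the bound is 2 characters per byte, and with 64 it is about 1.33. Going all the way to 94 characters buys comparatively little extra, which is part of why 64-character schemes are so common.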
Armouring Schemes
Unfortunately, there is a plethora of techniques. It is
not always obvious just from looking which was used to encode the data:
- base64: common in certificates,
passwords, email, email attachments, cookies and HTTP.
Base64 uses a small cast of characters to convert 8-bit
data into printable characters: a to z, A to Z, 0 to 9, + /
and =. You might do this to convert any binary data
to printable. This makes base64 suitable for encoding binary data as
SQL (Structured Query Language) strings that will work no matter what the
encoding. Unfortunately + / and = all have special meaning in URLs (Uniform Resource Locators).
See Base64 for free
Java source code. Every three bytes in the original fluff up to four
characters in the encoded form. This 33% increase in
size occurs independent of what characters appear in your data. At the receiving
end you convert the printable characters back to the 8-bit data.
- url-encoded. See the separate entry
on it.
- base64u: A variant of Base64 that
avoids the + / and = characters that have special meaning in
URLs, GET and POST. You can treat its output either
as not needing URLEncoding, or as already URLEncoded. Used to armour bytes or
anything that can be converted to bytes, e.g. via serialization.
- the Transporter
which optionally handles serialising/reconstituting, compression/decompression,
signing/verifying, heavy duty encryption/decryption and Base64u
armouring/dearmouring, all with lightweight classes.
Use it when you want to include arbitrary Java Objects in your
CGI (Common Gateway Interface) GETS and POSTS.
- Quoted-Printable (RFC 2045): used in newsgroup messages and email.
It uses the following set of characters to convert
8-bit data into printable characters: space, a to z, A to Z, 0 to 9, ! to <, > to ~,
and = as the escape character. It converts unsafe characters
into =FF where FF is the hex equivalent. In the best
case, your message is the same size as the original. In a pathological case, your
message can balloon up to three times the original size.
- hexadecimal: two characters per byte, each 0..9 or A..F.
The result is always exactly double the size of the original. This is one of the
easiest schemes to write code for.
- binhex: a hex variant used on the Macintosh.
- UUEncode: similar to Base64 in that they both use 64
ASCII (American Standard Code for Information Interchange) characters to represent 6 bits in the printable representation, but they are not
compatible. Base64 uses upper case, lower case, digits and only three
punctuation symbols. UUEncode uses 28 punctuation symbols and it uses only upper
case letters. Also, the uuencode command has a structure to its output, with a
header containing a file name and permissions, line-length encoding characters,
and a footer, none of which are part of Base64.
- CMP Encode: dates back to 1985. Very efficient
for text that is mostly printable already. CMP (Canadian Mind Products)
Encode uses the full 95 ASCII
printable characters excluding space. Printable characters it leaves as is. It
encodes control characters with a lead ^, e.g. code 3
becomes ^C. High bit chars are encoded with a lead
`. It has a simple compression scheme for repeating
character strings. In the best case, your message can be even smaller than the
original. In a pathological case, your message can balloon up to twice the
original size. Unfortunately, Java code for this algorithm is not currently
available. Pascal source and
executable are available. This algorithm is not a recognised official
MIME encoding.
- CMP Encrypt: dates back to 1985. It also encrypts the data
with a theoretically uncrackable one-time pad. Pascal source and executable are available.
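To compare the size behaviour of some of the schemes above, here is a minimal sketch using the java.util.Base64 class from the standard library plus two hand-rolled helpers. The class name ArmourSizes is hypothetical. The URL-safe encoder shown is Java's own variant (it swaps + and / for - and _ and here drops the = padding), which may differ in detail from the base64u described above. The quotedPrintableLike method is a deliberate simplification: it escapes unsafe bytes as =XX but ignores the line-length and trailing-space rules of real Quoted-Printable.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical comparison of how much several armouring schemes expand their input. */
final class ArmourSizes {

    /** Standard Base64: the alphabet includes + and /, with = as padding. */
    static String base64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    /** Java's URL-safe Base64 without padding: - and _ replace + and /. */
    static String base64Url(byte[] data) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(data);
    }

    /** Hexadecimal: always exactly two characters per byte. */
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Simplified Quoted-Printable-style escaping (assumption: no line wrapping, spaces left as is). */
    static String quotedPrintableLike(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            int v = b & 0xFF;
            boolean safe = v >= 32 && v <= 126 && v != '=';
            if (safe) {
                sb.append((char) v);
            } else {
                sb.append(String.format("=%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] text = "plain printable text".getBytes(StandardCharsets.US_ASCII);
        byte[] binary = {(byte) 0xFB, (byte) 0xFF, (byte) 0xFE};
        for (byte[] sample : new byte[][] {text, binary}) {
            System.out.println(sample.length + " bytes in:"
                    + " base64=" + base64(sample).length()
                    + " base64url=" + base64Url(sample).length()
                    + " hex=" + hex(sample).length()
                    + " qp-like=" + quotedPrintableLike(sample).length());
        }
    }
}
```

Printable text passes through the Quoted-Printable-style helper unchanged, while the all-high-bit sample triples in size. Base64 and hexadecimal expand both samples by the same fixed ratio regardless of content.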