A phobia of binary format files. It is especially widespread in the Unix
community.
Then
To understand how this fear got started you must understand what computers were
like in the beginning.
In the early days, there was almost no standardisation. A byte was anywhere from
6 to 12 bits, a word anywhere from 8 to 64 bits. Every machine had at least two
proprietary floating point formats. Sometimes each installation defined
its own custom character set. Some machines were big-endian, some little-endian,
some two’s-complement, some ones’-complement. Reading a file from someone else’s
computer was quite an undertaking. It was easier if the file was pure characters,
because then the format was easier to decipher.
Documentation on formats was typically sketchy, so when a program did not work,
it was easier to deal with human-readable character data than binary data, even
though it was bulkier.
Data formats were not taken very seriously. Formats were defined procedurally:
the format was whatever the program produced. This sufficed because there was
very little interchange of data. When data was exchanged, it was always read and
written by the same program on the same hardware, so there was no need to define
precisely what the format was.
Even mag tape densities, proprietary formats and labels caused interchange
problems.
Microsoft used binary formats for its MS Word and Excel products. However they
considered the format a proprietary secret. They would often change the format
without telling anyone. They arranged formats to be deliberately incompatible as
a dodge to trick customers into upgrading. Once Version N+1 had touched a
document, Version N could no longer read it. Everyone had to upgrade to Version
N+1 at considerable cost, just to be able to read their documents again.
Microsoft only sold version N+1, so there was no legitimate way new users in a
shop could stick with version N to avoid the problem. Microsoft traumatised
programmers against binary formats. Programmers gradually decoded and documented
the formats as best they could. It was an undertaking comparable to breaking the
German Enigma code. And there was no guarantee the result was 100% accurate.
Whenever programmers think binary format, they instantly associate it with
Microsoft’s wicked behaviour. In NLP (neuro-linguistic programming) terms,
binary format has become a negative anchor.
CORBA made a brave stab at letting you exchange binary data between different
platforms. The catch was that CORBA made such a production of it that the very
thought made programmers want to lie down and take a nap.
Now
Today, things have changed:
- We have converged on the IEEE 754 standard for binary floating point interchange.
- We have standardised on big-endian format for network order.
- Nearly all hardware can read/write 8-bit, 16-bit, 32-bit and 64-bit binary
two’s-complement integers, signed and unsigned, as well as IEEE floating point.
- Java allows serialised objects to be exchanged between machines with totally
different internal hardware. Though it started in Java, there is no reason other
languages could not implement the same protocols.
- The Internet means data is now routinely exchanged between computers from
different manufacturers, using software from different vendors on each end. It
now becomes extremely important to precisely define the data formats, and to
create programs to verify that the standards are being adhered to.
- The Internet means it is more important than ever to exchange data in compact
formats. If you don’t, you waste bandwidth, air time, computing power,
battery life in hand-held devices, and, most of all, people time waiting for
transmissions to complete.
- We are moving to an age with an explosion of hand-held devices that communicate
with the Internet the same way cell phones do. These too must be accommodated.
They have very tight RAM and CPU constraints. Further, air time is considerably
more expensive and considerably slower than the cable connections that desktop
machines enjoy, and the amount of bandwidth is limited by the radio frequency
spectrum. We are rapidly running out of cell-phone-type bandwidth. You have to
be ultra-efficient to even play the game.
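The points above about network order and IEEE floating point can be sketched in Java with the standard DataOutputStream and DataInputStream classes, which always write multi-byte values high byte first. This is only an illustration: the record layout (an id, a timestamp and a reading) and the class name PortableRecordDemo are assumptions invented for the example, not part of any standard.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class PortableRecordDemo {

    /** Encodes one hypothetical record: a 32-bit id, a 64-bit timestamp and a double reading. */
    static byte[] encode(int id, long timestamp, double reading) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(id);          // 4 bytes, big-endian two's-complement
            out.writeLong(timestamp);  // 8 bytes, big-endian two's-complement
            out.writeDouble(reading);  // 8 bytes, IEEE 754 bit pattern, big-endian
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decodes the same layout, reading the fields in the order they were written. */
    static String decode(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int id = in.readInt();
            long timestamp = in.readLong();
            double reading = in.readDouble();
            return id + " " + timestamp + " " + reading;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = encode(42, 1_000_000L, 98.6);
        System.out.println(data.length + " bytes: " + decode(data));
    }
}
```

The same 20 bytes decode identically on any machine, whatever its native byte order, because the format is fixed by the stream classes rather than by the hardware.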
Advantages
There are several major advantages to binary formats:
Compactness
They are compact to store, compact to process in RAM, and compact to transmit
over the Internet. In contrast, some text formats such as XML can be an order of
magnitude fluffier.
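As a rough illustration of the size difference, the following sketch encodes the same thousand integers once as raw 32-bit binary and once as a simple XML document. The element names readings and reading, and the class name SizeComparison, are hypothetical; real XML vocabularies vary, and the exact ratio depends on tag length and on the values themselves.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class SizeComparison {

    /** Packs each int into 4 bytes, big-endian (the ByteBuffer default). */
    static byte[] asBinary(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    /** Wraps each int in a hypothetical <reading> element, one per line. */
    static byte[] asXml(int[] values) {
        StringBuilder sb = new StringBuilder("<readings>\n");
        for (int v : values) {
            sb.append("  <reading>").append(v).append("</reading>\n");
        }
        sb.append("</readings>\n");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Sample data: 1000 seven-digit values. */
    static int[] sample() {
        int[] values = new int[1000];
        for (int i = 0; i < values.length; i++) {
            values[i] = 1_000_000 + i;
        }
        return values;
    }

    public static void main(String[] args) {
        int[] values = sample();
        System.out.println("binary: " + asBinary(values).length + " bytes");
        System.out.println("xml:    " + asXml(values).length + " bytes");
    }
}
```

With these sample values the XML form comes out several times larger than the binary form; longer tag names or attributes would widen the gap further.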
Speed
A well designed binary format is computer-friendly. The computer can rapidly
navigate the data finding what it wants without having to parse that which it
does not want.
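One way a binary format lets the computer skip what it does not want is fixed-length records: the position of record n is simple arithmetic. The sketch below assumes a made-up layout (a 4-byte id followed by an 8-byte price) and a hypothetical class name FixedRecordLookup; it is not drawn from any particular file format.

```java
import java.nio.ByteBuffer;

final class FixedRecordLookup {

    // Hypothetical layout: int id (4 bytes) followed by double price (8 bytes).
    static final int RECORD_SIZE = Integer.BYTES + Double.BYTES;

    /** Builds an in-memory "file" of count records; record i has id i and price i * 1.5. */
    static ByteBuffer build(int count) {
        ByteBuffer buf = ByteBuffer.allocate(count * RECORD_SIZE);
        for (int i = 0; i < count; i++) {
            buf.putInt(i).putDouble(i * 1.5);
        }
        buf.flip();
        return buf;
    }

    /** Reads the price of record n by computing its offset; earlier records are never parsed. */
    static double priceOf(ByteBuffer buf, int n) {
        return buf.getDouble(n * RECORD_SIZE + Integer.BYTES);
    }

    public static void main(String[] args) {
        ByteBuffer file = build(1000);
        System.out.println(priceOf(file, 700)); // prints 1050.0
    }
}
```

A text format holding the same data would normally have to be scanned from the start, counting delimiters, to find record 700.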
Simplicity
Though a binary format might look terrifying to a human viewing it with the
wrong tool, such as NOTEPAD, from the computer’s point of view, it takes
much less code to read and analyse a binary format file. This is especially
important in hand-held devices where RAM for code, and battery power to drive
that code, are at a premium.
Accuracy
If you use text files for information interchange there will be a conversion
from binary to prepare them and a conversion back to binary to read them. Each
of those conversions can introduce small errors if you are not careful,
especially with IEEE floating point. If you go direct binary to binary there are
two fewer places you can go wrong.
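A small sketch of the kind of error meant here: formatting a double with a fixed number of decimals, a common careless choice, does not survive the trip back to binary, whereas copying the IEEE bit pattern does. The class name RoundTripAccuracy and the choice of six decimals are assumptions made for the illustration.

```java
import java.nio.ByteBuffer;
import java.util.Locale;

final class RoundTripAccuracy {

    /** Text round trip: format with six decimals, then parse the text back. */
    static double viaText(double value) {
        String text = String.format(Locale.ROOT, "%.6f", value);
        return Double.parseDouble(text);
    }

    /** Binary round trip: store the 8-byte IEEE bit pattern and read it back. */
    static double viaBinary(double value) {
        ByteBuffer buf = ByteBuffer.allocate(Double.BYTES);
        buf.putDouble(value);
        buf.flip();
        return buf.getDouble();
    }

    public static void main(String[] args) {
        double third = 1.0 / 3.0;
        System.out.println("text round trip exact:   " + (viaText(third) == third));
        System.out.println("binary round trip exact: " + (viaBinary(third) == third));
    }
}
```

A careful text encoder can avoid the loss, for example by writing enough digits to identify the value uniquely, as Java's Double.toString does, but that is exactly the care the binary route does not require.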
Symptoms
XML is probably the fluffiest, least efficient text
format ever conceived. It is the complete antithesis of a binary format.
Addiction to XML is a symptom of a severe case of binaphobia.
Treatment
The binaphobic wakes in the night terrified he has written a program to create a
binary format and now for some reason he cannot read the data. What can be done
to reassure the binaphobic?
- Use industry standard protocols and well tested libraries to read and write the
data. Then at worst you will be missing a field. You are then no worse off than
had you done the whole thing in text.
- Remind him, "When was the last time anyone lost a serialised object because
of bugs in readObject or writeObject?"
- Use the proper debugging tools to study your binary format files. You would not
use NOTEPAD to modify an MS Word document, so why do you think it the
appropriate tool to examine and edit a binary format document? Use a binary
format editor/inspector. Programmers have no fear of the binary TCP/IP format,
because they use proper tools to examine the bits in the packets, rather than
trying to analyse them with NOTEPAD.
- If you invent a new binary format, get different people to write the reader,
writer, verifier, and inspector/editor. That will help iron out inconsistencies
or ambiguities in the format specification, and cross-check each other’s
work. Then get lots of other people to use it. The more people using it, the
less likely a bug will slip through unnoticed.
- Remind him that bugs in properly tested programs are rare. Errors in text files
prepared with NOTEPAD are extremely common. He is like the fool who sits in his
car in the garage with the motor running to avoid being hit by lightning.
- Let the binaphobic use a binary editor. He imagines somehow it will be harder to
use than NOTEPAD. He imagines that it will force him to fiddle bits with hex
notation. He has no idea that a modern binary editor is like a spreadsheet with
the formulas locked that validates each field as you enter it.
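For the reminder above about readObject and writeObject, here is a minimal sketch of a serialised object making the round trip through Java's standard ObjectOutputStream and ObjectInputStream. The Reading class, its fields and the class name SerialRoundTrip are hypothetical, invented only to have something to serialise.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class SerialRoundTrip {

    /** Hypothetical value class used only for this illustration. */
    static final class Reading implements Serializable {
        private static final long serialVersionUID = 1L;
        final String sensor;
        final double value;

        Reading(String sensor, double value) {
            this.sensor = sensor;
            this.value = value;
        }
    }

    /** Serialises any Serializable object to a byte array. */
    static byte[] write(Object obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    /** Reconstitutes an object from bytes produced by write. */
    static Object read(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Reading back = (Reading) read(write(new Reading("t1", 21.5)));
        System.out.println(back.sensor + " " + back.value);
    }
}
```

The application code never touches a byte offset; the library owns the binary format, which is the point of the advice to use well tested libraries.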