The primary function of XML is to consume RAM and data communication
bandwidth. Presumably it was promoted to its current frenzy by companies who
sell either RAM or bandwidth. Others promoting it have patents they hope to
spring on the public once it is entrenched. XML is the biggest con game going in
computers. You probably guessed, I am known for my rabid dislike of XML.
The Basics
XML is the Extensible Markup Language, a W3C Recommendation. Like
HTML, XML is based on SGML, an International Standard (ISO 8879) for creating
markup languages. However, while HTML is a single SGML document type, with a
fixed set of element type names (AKA "tag names"), XML is a simplified
profile of SGML: you can use it to define many different document types, each of
which uses its own element type names (instead of HTML’s html, body, h1, ol,
etc.). For example, in XML, you can mark up an online transaction like this:
Fields that there can be only zero or one of are usually specified as attributes,
e.g. unit="box".
Fields that there can be many of are enclosed in tags, e.g. <item>…</item>.
Just like HTML, comments begin with <!-- and end with -->. You can abbreviate
<mytag myattrib="something"></mytag> as <mytag myattrib="something" />.
XML was designed to make it easy to write a parser. I think this was an
unfortunate decision. Only a handful of people in the world will ever write an
XML parser, but hundreds of thousands have to compose XML. They should have
designed it to be easy and terse to write. For example, its mandatory quotes
around each field are there solely for the convenience of the parser writer. The
tag names in closing tags such as </mytag> are redundant, and
should be optional. They are not needed at all in XML designed solely for
machine consumption. Even in human-read XML, they add nothing on the innermost
nest on a single line.
Encoding
UTF-8 is the default encoding, but unfortunately the encoding could be any ruddy
encoding ever invented. Using other encodings destroys XML as an interchange
format. Don’t do it!
Schemas
You describe your little XML subgrammar by writing a DTD (Document Type
Definition) file. Optionally, you can include the DTD inline inside your XML
file. There are other more elaborate schema grammars including RELAX NG,
Schematron, XSD and various other schemes.
Validation
Each schema language has its corresponding technique for checking that an XML
file is valid. If you use a DTD, here is how to do it.
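As a minimal sketch of DTD validation in Java, the class below uses the JDK's own JAXP parser with validation switched on. The class name, the inline DTD and the sample documents are made up for illustration; they are not from the original listing, which did not survive.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class DtdValidationSketch {
    // Hypothetical inline DTD: an empty <square> element with a width attribute.
    static final String DTD =
        "<!DOCTYPE square [\n"
      + "  <!ELEMENT square EMPTY>\n"
      + "  <!ATTLIST square width CDATA \"0\">\n"
      + "]>\n";

    /** Returns true if the document parses and conforms to the DTD it declares. */
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXParseException {
                    throw e; // validity errors are merely reported unless we rethrow them
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(DTD + "<square width=\"100\"/>"));  // declared attribute
        System.out.println(isValid(DTD + "<square height=\"100\"/>")); // undeclared attribute
    }
}
```

The error handler matters: without one, the parser reports validity problems but still hands back a tree.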
Parsing
There are two popular parsing techniques: SAX (Simple API for XML), which hands
you each field as it parses, and the W3C DOM (Document Object Model), which
creates a complete parse tree you can prune and repeatedly scan.
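To make the difference concrete, here is a rough sketch that runs the same document through both styles using the JDK's built-in parsers. The class name and the sample order document are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxVersusDomSketch {
    // Hypothetical document, invented for illustration.
    static final String XML =
        "<order><item unit=\"box\">pencils</item><item unit=\"each\">stapler</item></order>";

    /** SAX: the parser calls back once per element as it streams through the text. */
    static List<String> saxElementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attrs) {
                        names.add(qName); // nothing is retained unless we keep it
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    /** DOM: the whole tree is built first, then queried as often as you like. */
    static int domCount(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName(tag).getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(saxElementNames(XML));
        System.out.println(domCount(XML, "item"));
    }
}
```

SAX suits one-pass extraction from large files; DOM costs RAM proportional to the document but lets you revisit nodes.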
I personally detest XML. However, it has caught on like a cocaine wave. It must
have some redeeming features.
XML Benefits
- XML is the latest fad. Almost every program is learning to import and export
data in XML format, which makes it a lot easier to glue programs created by
different people together.
- It unifies the grammar of thousands of little files so that you don’t have
to learn the syntax quirks of each one.
- It is relatively easy to whip up a DTD to describe an XML grammar for some
little data file. That DTD is all you need to generate a parser.
- The XML files can be viewed or composed by humans using a text editor.
- XML is about as simple a grammar as you can get.
- XML can work with almost any 8-bit or 16-bit character set.
- XML is good at handling hierarchical data.
- You can have Pick OS-like data, with arbitrarily long fields, and arbitrarily
repeated fields.
- XML is platform independent. It has no big-little endian problems.
- It is possible to parse XML without writing a DTD. This process presumes the XML
file is perfectly formed.
- XML search engines can take into account the tag context, e.g. "Washington"
inside tag <state>, <president>, <mountain>, <moviestar>.
An XML search engine can show you which tags the term was found in and let you
choose the relevant ones.
- XML settles on Unicode character encoding to allow transmitting data in any
language, though it does require clumsy entity encoding/decoding.
- A program does not need to understand the entire structure of a file. It can
just pick out the tags of interest. This means new tags can be easily added
without disturbing existing software that uses the file.
XML Drawbacks
- XML is incredibly fluffy and repetitive. It wastes bandwidth in transmission.
You must compress it. Happily, ZIP-style compression works very well on
XML. Unfortunately, you have to fluff it back up to process it, wasting RAM with
unprecedented abandon. In practice no one does compress it.
- It takes up huge amounts of RAM and disk space to store it.
- The DOM parse tree considers every space significant, even spaces between
tags, even spaces for indenting, even trailing spaces on a line, even double
spaces embedded in data.
- There is no mechanism to describe the types of the data. To XML, everything is a
string. There is no way to specify a field must be numeric, that it needs two
decimal places, that it must represent a date in some range, that it must not
have accented letters, that it be restricted to certain punctuation, or be one
of a certain set of legal values. There are scores of tack-ons trying to fix
this and other shortcomings turning the simple XML into a tower of Babel.
- You can’t use the XML files directly, they need to be parsed first.
Perhaps some day there will be pre-parsed, compact, computer-friendly versions
of XML. I have heard a rumour that such a beast, called XMLC, has been proposed.
- It uses HTML’s fluffy system of entities such as &lt; for <.
- There is a raft of recommendations surrounding XML, such as XPath, XPointer,
XSL, CSS, XLink and so forth. In the pipeline are XHTML, Metadata and Namespaces
and a Schema system. XML is fast becoming very complicated, because it is not
really standalone. You need added extras to make it usable. Competing standards
will have to fight it out. The #1 reason XML caught on was its raging-idiot
simplicity. Now it has not even that advantage.
- XML advocates say "Memory is cheap and bandwidth is cheap, so what the hell,
let’s squander it." However, this is not true with handhelds. Memory
consumes battery power, the main limit today of handheld capabilities. Bandwidth
consumes radio air time and battery time. We are running out of broadcast
frequencies. You can’t manufacture more of them once the channels are
filled, just use them more efficiently. Further, the delays caused by bloated
XML packets consume precious people time, and frustrate the heck out of users
completely needlessly.
- In an Applet or a handheld device, memory for data and code is at a premium.
You normally carefully massage the data off-line to be as predigested and as
compact as possible, e.g. serialised objects. As well as being fat, XML needs
considerable processing before it can be used. This consumes RAM for both data
and code, and battery power to do the massaging.
- There is no standard way to compress XML. You can use ZIP, which is very CPU and
RAM heavy. You can use WBXML (Wireless Binary XML). The problem is that on
receipt it is fluffed back up to regular XML then parsed, so it has even more
parsing overhead than regular XML. There are other compressed formats: ASN.1 and
WML. In practice most XML gets sent in its outrageously fluffy default form.
People think XML files are always tiny little 1K configuration files and so why
worry. The point is once a format gets established, it gets used for all sorts
of things the originators would never have dreamed of, like 3 gig image files.
ASN.1 schemas now can be used to validate XML files. XML files with XML schemas
can be automatically converted to ASN.1. ASN.1 files can be decoded 100 times
faster than XML. I think it is time to start thinking of using ASN.1 instead of
XML for large files, or for when they must be transported over the wire.
- There is a sort of mania to convert everything to XML, even things for which it is
only marginally well-suited.
“This obsession of XMLing everything (build scripts, database mapping,
setup & configuration,… etc.) without proper GUI tools to
intelligently and efficiently edit and maintain such data contradicts the
very fundamental role of the programmers’ profession.”
~ Hani Hammami
- You pay for forcing all data into the XML mould in the circumlocutions necessary
to say everything in XML, e.g. about 8 lines of code to conditionally
copy a file in ANT with XML.
- XML assumes all data in the universe come in the form of a tree. XML becomes a
Procrustean bed if the data are not tree-structured.
- XML DTD uses an ugly syntax with gratuitous punctuation. #IMPLIED really means
optional. #PCDATA means string. <!ATTLIST means attributes.
- There are no standard tag names for XML. Everyone still codes postal addresses
differently which means data exchange still requires custom coding. RDF
ontologies address this problem.
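The whitespace drawback in the list above is easy to see for yourself. The sketch below (class and method names are invented) counts the direct children of the root element, once with the tags jammed together and once with ordinary indenting, using the JDK's default non-validating DOM parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DomWhitespaceSketch {
    /** Counts the direct children of the root element, text nodes included. */
    static int rootChildCount(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same data; the second version merely adds newlines and indenting.
        System.out.println(rootChildCount("<a><b/></a>"));
        System.out.println(rootChildCount("<a>\n  <b/>\n</a>"));
    }
}
```

The indented version carries two extra text nodes that hold nothing but the line breaks and spaces, so code that walks child nodes has to skip them.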
recommend book⇒The Theory Of The Leisure Class
| | paperback | hardcover |
|---|---|---|
| ISBN13: | 978-0-14-018795-3 | 978-0-8488-1659-9 |
| ISBN10: | 0-14-018795-2 | 0-8488-1659-5 |
| publisher: | Penguin | |
| published: | 1994-02-01 | |
| by: | Thorstein Veblen | |

This is one of the most amusing books I ever read. It is funny by being so on. He coined the terms conspicuous consumption and conspicuous waste to explain modern status displays.
XML is an example of conspicuous waste, waste for waste’s sake. I find it
morally repugnant. It reminds me of Roman Emperor Caligula who took a bite of a
peach, tossed it away, then grabbed a fresh one. The authors went out of their
way to create a bloated, ugly syntax.
Using XML to transmit data is the analog of insisting that all code be passed
around as triple spaced Java source files, with added dummy comments, rather
than as binary byte code. There is no guarantee a source file is even
syntactically correct. It is impossible to create a syntactically incorrect byte
code file. Byte code files can be processed without time-consuming parsing. In
byte code, repeating strings are naturally specified only once. XML, as it
stands, suffers from all those analogous drawbacks and more.
What Should Replace XML?
The characteristics a replacement needs include:
- It needs to be a binary format for compactness. Files have to both be
transmitted and stored. Size does matter. People think in terms of one page XML
files, but they potentially could be gigabytes long. If XML becomes an
established interchange format we will pay for the slop in XML trillions of
times over. It is not good enough to say XML files will always be stored in
compressed form. In my experience in practice XML files are never compressed.
Files should be both compact and quick to process. XML as it stands is neither.
- It needs to be a binary format to ensure correctness. Human readable formats
tempt people to manually compose documents that are almost syntactically correct,
e.g. HTML. This is too sloppy for an interchange format. Consider how much
better chance you have of getting a working program first time if someone sends
you Java byte code rather than Java source that may not even compile.
- It needs to be computer-friendly so that a program can rapidly find the data it
wants without having to parse for delimiters of various flavours. If people want
to examine the file detail for debugging, let them use a binary reader/editor.
You could use counted strings rather than delimited strings and use integers to
encode the field types so they can be used directly as table indexes. I would
not go quite so far as to ask for a serialised tree of nodes, but push for a
representation that can rapidly be turned into one.
- For giant files, the representation should not have substantially more overhead
than the raw binary. There need to be ways of efficiently expressing repeating
patterns. For example, there is no need for delimiters for fixed length data.
There is no need for individual field identifiers for standard groupings of
fields. You want to push as much as possible of the file format description into
the descriptor file, out of the data file. The descriptor file need be
transmitted only once. The data file will typically be transmitted again and
again. There is no need to make the format simple, just compact and fast to
process. All you need is a simple programmer’s interface to it.
Only a handful of programmers ever need concern themselves with its inner
structure.
- XML currently only allows for hierarchical trees of data. There are one or two
other types of data out there in the world (e.g. tables, relations, references,
graphs). A universal interchange format should be a little more flexible. If it
is worth doing, it is worth doing right. Obviously the format can’t be
expected to handle every conceivable data structure and obsolete every
specialised interchange format ever devised. However, XML is talking big about
becoming universal and should deliver. It can’t even handle ordinary
business data which is typically relational not strictly hierarchical.
- One possible example of the sort of inner structure I am thinking of is my HTML
compactor project.
- The other thing it needs in the DTD is some information about the allowed data
types. There need to be the usual bounded ints, IEEE floats, IEEE doubles, 8-bit
encoded strings in some reasonably small number of character sets, with maximum
and minimum lengths, as well as a variety of business types, such as zip, zip+4,
state, country, Canusan phone, international phone, date, time, credit card
number, latitude, longitude, etc. When someone is handing you data you need to
know how clean it is. You need to know ahead of time the minimum and maximum
enforced limits on various field sizes.
- Ideally the new binary format, or a variant of it would also handle the function
HTML does now. This would, in a stroke, give four benefits:
  - Much more compact transmissions, which means much faster transmissions and
  lighter loaded servers.
  - No more syntax errors. In the process of converting to binary format all syntax
  would either have to be manually or automatically corrected. This means the
  browser no longer has to deal with both the official standard, and also all the
  common variant errors that people type. This means pages would always render
  properly. As it is, pages render properly only in the browser used by the author
  which forgives his particular errors. The binary protocol effectively blocks
  human HTML coding errors from getting out on the net.
  - Faster rendering since the data would arrive already preparsed. The browser
  would know for example how big tables are before it had finished reading the
  entire file, and so could start rendering the top part of the document
  accurately immediately.
  - Consider the total dollars invested in equipment in the world to transmit HTML,
  including servers, satellite links, fibre optic links, cable connections…
  In a stroke, you would double the capacity of that equipment to deliver HTML,
  simply by switching to a binary delivery format.
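As a rough illustration of the counted strings and integer field types suggested in the list above, here is a tiny made-up record format written with the JDK's DataOutputStream. The field codes, the class name and the layout are assumptions for the sketch, not a proposed standard.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class CountedRecordSketch {
    // Hypothetical integer field-type codes, usable directly as table indexes.
    static final int FIELD_NAME = 1;
    static final int FIELD_WIDTH = 2;

    /** Encodes one record as: type byte + counted string, type byte + 4-byte int. */
    static byte[] encode(String name, int width) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeByte(FIELD_NAME);
            out.writeUTF(name);        // 2-byte length prefix, then the characters
            out.writeByte(FIELD_WIDTH);
            out.writeInt(width);       // fixed length, so no delimiter is needed
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Decodes a record without scanning for delimiters. */
    static String decode(byte[] data) {
        String name = null;
        int width = 0;
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                int fieldType = in.readUnsignedByte();
                switch (fieldType) {
                    case FIELD_NAME:  name = in.readUTF();  break;
                    case FIELD_WIDTH: width = in.readInt(); break;
                    default: throw new IOException("unknown field type " + fieldType);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return name + ":" + width;
    }

    public static void main(String[] args) {
        byte[] record = encode("square", 100);
        System.out.println(record.length + " bytes, decoded as " + decode(record));
    }
}
```

For comparison, the equivalent <square width="100"></square> takes 29 characters and still has to be parsed.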
One possible candidate for the XML replacement job is the Java serialised object
format. It can handle just about any data structure imaginable. It is platform
independent. It has a simple DTD — Java source code for the corresponding
class. Some claim it is Java-only. Not so. It is no more difficult for C++ to
parse than any other similar newly concocted protocol. It is not tied to any
hardware or OS. It is just that Java has a head start implementing it. Java can
implement it with no extra overhead.
There have been some efforts made to patch up the shortcomings of XML, in fact
there are dozens of them. XML is no longer simple. It is a raggedy
patchwork quilt. People were sucked in by the initial simplicity, then
discovered that it was not really all that useful in its simple form. Schema was
added to allow specifying types (but still only permitting strings). Yes we need
a standard interchange format, but XML was only a back of the envelope stab at
it. XML was destined to fail since it totally ignored so many factors in coming
up with a good design.
One such effort is VTD (Virtual Token Descriptor). A VTD record is a 64-bit
integer that encodes the starting offset, length,
type and nesting depth of a token in an XML document. Because VTD records don’t
contain data fields, they work alongside of the original XML document, which is
maintained intact in memory by the processing model.
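The idea of a token record that points into the untouched document can be sketched with ordinary bit packing. The field widths and their order below are assumptions chosen for illustration; the text above does not give the real VTD layout, and the class and method names are invented.

```java
class TokenRecordSketch {
    // Assumed field widths, for illustration only; not taken from the VTD specification.
    static final int OFFSET_BITS = 30;
    static final int LENGTH_BITS = 20;
    static final int DEPTH_BITS = 8;
    static final int TYPE_BITS = 4;

    /** Packs the four fields into one 64-bit integer, offset in the low bits. */
    static long pack(int offset, int length, int depth, int type) {
        return ((long) type << (DEPTH_BITS + LENGTH_BITS + OFFSET_BITS))
             | ((long) depth << (LENGTH_BITS + OFFSET_BITS))
             | ((long) length << OFFSET_BITS)
             | offset;
    }

    static int offset(long record) {
        return (int) (record & ((1L << OFFSET_BITS) - 1));
    }

    static int length(long record) {
        return (int) ((record >>> OFFSET_BITS) & ((1L << LENGTH_BITS) - 1));
    }

    static int depth(long record) {
        return (int) ((record >>> (OFFSET_BITS + LENGTH_BITS)) & ((1L << DEPTH_BITS) - 1));
    }

    static int type(long record) {
        return (int) ((record >>> (OFFSET_BITS + LENGTH_BITS + DEPTH_BITS)) & ((1L << TYPE_BITS) - 1));
    }

    /** The record holds no data itself; the token text is read from the original document. */
    static String token(String xml, long record) {
        return xml.substring(offset(record), offset(record) + length(record));
    }

    public static void main(String[] args) {
        String xml = "<square width=\"100\"></square>";
        long record = pack(8, 5, 1, 2); // "width" starts at offset 8; type 2 is a made-up code
        System.out.println(token(xml, record));
    }
}
```

The records stay small and fixed-size, which is the point: the bulky text is kept once, and the index over it is an array of longs.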
Due to the stupidity, duplicity and/or greed of those promoting XML, we will
likely be stuck with some committee-patched variant of it forever —
something that will make even HTML look clean. We need a common data interchange
format, but not so inept.
DTD
You need to compose a DTD file that describes the format of the XML file. The
<!ELEMENT statement is used to list the various tags you will use, and which tags
may be used inside which tags, and how often and in which order. The <!ATTLIST
statement is used to list the various attributes (mandatory and optional) of
each tag. The <!ENTITY statement lets you make up your own abbreviations.
Here is a simple example:
DTD:
<!ELEMENT square EMPTY>
<!ATTLIST square width CDATA "0">
The CDATA means the value of the field is a string.
XML:
<square width="100"></square>
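The "0" in the <!ATTLIST line is a default value. A small sketch, assuming the DTD above is placed inline in the document and parsed with the JDK's default DOM parser, shows the default being supplied when the attribute is left out. The class and method names are invented.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DtdDefaultSketch {
    // The DTD from the example above, written as an internal subset.
    static final String DOCTYPE =
        "<!DOCTYPE square [<!ELEMENT square EMPTY><!ATTLIST square width CDATA \"0\">]>";

    /** Parses one <square> element and returns the value the parser reports for width. */
    static String widthOf(String squareTag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(DOCTYPE + squareTag)));
            return doc.getDocumentElement().getAttribute("width");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(widthOf("<square width=\"100\"></square>")); // explicit value
        System.out.println(widthOf("<square/>"));                       // falls back to the DTD default
    }
}
```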
Schema
A schema is a document that describes what constitutes a legitimate XML document.
It might be very generic, describing all XML documents, or some particular class
of XML documents, say ones describing an invoice for the XYZ company. The
original XML schema was called DTD, borrowed from SGML. It was clumsy
and did not allow very tight specification. It basically just let you specify
the names of the tags and attributes. Since then there have been several other
flavours of schema: RELAX NG, Schematron and a new one from W3C called XML
Schema. DTDs look nothing like XML itself. XML Schema is itself a flavour of
XML. XML Schema is a major advance over DTD. It is described in three documents:
Primer, Structures and Data Types. It can define datatypes, ranges,
enumerations, dates and complex datatypes to much more rigidly specify what
constitutes a valid XML file.
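Here is a minimal sketch of an XML Schema datatype and range check, using the JDK's javax.xml.validation classes (SchemaFactory, Schema, Validator). The schema itself, a <width> element restricted to an integer from 0 to 1000, is a made-up example, as is the class name.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

class XsdTypeSketch {
    // Hypothetical schema: <width> must be an int in the range 0..1000.
    static final String XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
      + "  <xs:element name=\"width\">"
      + "    <xs:simpleType>"
      + "      <xs:restriction base=\"xs:int\">"
      + "        <xs:minInclusive value=\"0\"/>"
      + "        <xs:maxInclusive value=\"1000\"/>"
      + "      </xs:restriction>"
      + "    </xs:simpleType>"
      + "  </xs:element>"
      + "</xs:schema>";

    /** Returns true if the document satisfies the schema above. */
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // With no error handler set, validate() throws on the first violation.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<width>100</width>"));  // in range
        System.out.println(isValid("<width>abc</width>"));  // not a number
        System.out.println(isValid("<width>5000</width>")); // out of range
    }
}
```

Note that the schema is itself XML, unlike a DTD.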
Awkward Characters
XML has a similar problem to HTML with reserved characters. What if <
incidentally appears in your data? It would look like the beginning of some
tag such as </end>. There is only one truly awkward character, namely <,
and you deal with it the same way you do in HTML, by encoding it as an entity
reference, namely &lt;. (They are not called entities in XML since that term is
already taken to mean a group of data.)
HTML has scores of entities whereas XML has only five:
&lt; ( < ), &amp; ( & ), &gt; ( > ), &quot; ( " ), &apos; ( ' ).
All of the entity references are optional except for &lt; and &amp;.
But what about awkward non-ASCII characters such as é
and Ω and ⇔?
There are six ways around the restriction that XML does not support the full set
of HTML character entity references.
- If you use UTF-8 encoding, you can use any Unicode characters plain without
entification.
- If you use an 8-bit encoding such as ISO-8859-1, you can stick to just the 256
characters defined in that encoding.
- You could use decimal NCRs (Numeric Character References), e.g. &#8364; for
the euro sign €. Values of numeric character references are interpreted as
Unicode characters, no matter what encoding you use for your document. To be
perverse, you could use decimal numeric character references in place of the
basic entity references, i.e. &#60; ( &lt; ), &#38; ( &amp; ), &#62; ( &gt; ),
&#34; ( &quot; ), &#39; ( &apos; ).
- You could write a DTD to create the additional alphabetic character entity
references you need, e.g. &euro;
- You could use hexadecimal NCRs (Numeric Character References), e.g. &#x20ac;
for the euro sign €. Again the values of numeric character references are
interpreted as Unicode characters, no matter what encoding you use for your
document.
- If you take a depraved pleasure in deformity, you could use the CDATA
sandwich. Place pretty well whatever data you want, including raw (un-entified)
<, > and &, within a bizarre sandwich of characters, namely: <![CDATA[ … ]]>
e.g. <caption><![CDATA[Rah! <><><> Rah! & all that.]]></caption>
Handling awkward characters is a concern if:
- You compose XML “by hand” with a text editor.
- You are developing code and read XML files directly.
- You write code to generate XML directly without using any sort of XML package.
Otherwise, the XML package will transparently handle awkward characters for you
both on writing and reading, so you can forget about them.
UTF-8 files using the basic five character-entity encodings, or ISO-8859-1, with
the basic five character entities (possibly excluding &apos;) plus decimal NCRs,
will create the files easiest to read and compose manually, XML’s saving grace.
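If you do generate XML without a package, the escaping is short enough to write yourself. The sketch below (class and method names are invented) replaces the five reserved characters with their entity references and, as one possible policy, turns anything beyond plain ASCII into a decimal NCR so the output survives any encoding.

```java
class XmlEscapeSketch {
    /** Escapes text for use in element content or in a quoted attribute value. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            switch (cp) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:
                    if (cp > 126) {
                        // Decimal NCR: the number is the Unicode code point.
                        sb.append("&#").append(cp).append(';');
                    } else {
                        sb.append((char) cp);
                    }
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a<b & \"c\""));
        System.out.println(escape("\u20AC5")); // euro sign followed by 5
    }
}
```

With UTF-8 output the NCR branch is unnecessary; it is there for 8-bit or unknown encodings.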
XML Serialization
There is another form of serialization that produces XML instead of binary
ObjectOutputStreams. It uses the java.beans.XMLEncoder class. It does not use
the Serializable interface, but writes ordinary Objects that have
JavaBean-style getter and setter methods and a no-arg constructor. It does not
persist fields, but rather properties (in the Delphi sense, not
System.setProperty), implemented with get/set. Basically it looks for all the
getXXX methods, calls them, and emits a stream of tags recording each property
name and its value. To reconstitute, XMLDecoder instantiates an Object of the
class, and calls the corresponding setXXX methods with the values in the XML
stream. The source and target classes need not have matching code the way they
do with true serialization. Most trouble using this feature comes from thinking
it behaves like ordinary serialization. They have almost nothing in common.
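A minimal round trip looks roughly like this. The Square bean and the class and method names are hypothetical; XMLEncoder and XMLDecoder are the real java.beans classes. The decoder is handed a class loader explicitly so that it can find the bean class however the program was launched.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

class XmlEncoderSketch {
    /** Hypothetical bean: public class, public no-arg constructor, getter and setter. */
    public static class Square {
        private int width;
        public Square() { }
        public int getWidth() { return width; }
        public void setWidth(int width) { this.width = width; }
    }

    /** Writes the bean's properties (not its fields) as XML. */
    static byte[] toXml(Object bean) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (XMLEncoder encoder = new XMLEncoder(bytes)) {
            encoder.writeObject(bean);
        }
        return bytes.toByteArray();
    }

    /** Rebuilds the bean by calling its no-arg constructor and setters. */
    static Object fromXml(byte[] xml) {
        try (XMLDecoder decoder = new XMLDecoder(new ByteArrayInputStream(xml),
                null, null, XmlEncoderSketch.class.getClassLoader())) {
            return decoder.readObject();
        }
    }

    public static void main(String[] args) {
        Square square = new Square();
        square.setWidth(100);
        byte[] xml = toXml(square);
        System.out.println(new String(xml, java.nio.charset.StandardCharsets.UTF_8));
        System.out.println(((Square) fromXml(xml)).getWidth());
    }
}
```

Only properties whose values differ from those of a freshly constructed instance are written, which is one more way this differs from ordinary serialization.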
Digitally Signing XML
You would think XML would be a nightmare for digital signing, with its variable
amounts of whitespace, and variable newline characters and lax attitude toward
the encoding. However, W3C
has invented a slick scheme to let you digitally sign various fields in an XML
document (by specifying #xxxx HTML-like targets) and
embed the signature in the document. You can also sign documents external to the
XML file. The secret is canonicalisation.
You use an algorithm to tidy the document to standard form. The transforms leave
embedded, leading and trailing whitespace on fields intact, but collapse the rest
to standard patterns. The scheme allows for various canonicalisation transforms
and various signing algorithms. As you would expect from XML, the signature
block is gargantuan.
Apache has written classes to make the work easier.
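For a feel of what is involved, here is a rough sketch that signs a whole document and then checks the signature. It uses the JDK's own javax.xml.crypto.dsig API rather than the Apache classes mentioned above; the throwaway RSA key pair, the sample document and the class name are assumptions for illustration only.

```java
import java.io.StringReader;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.util.Collections;
import javax.xml.crypto.dsig.CanonicalizationMethod;
import javax.xml.crypto.dsig.DigestMethod;
import javax.xml.crypto.dsig.Reference;
import javax.xml.crypto.dsig.SignedInfo;
import javax.xml.crypto.dsig.Transform;
import javax.xml.crypto.dsig.XMLSignature;
import javax.xml.crypto.dsig.XMLSignatureFactory;
import javax.xml.crypto.dsig.dom.DOMSignContext;
import javax.xml.crypto.dsig.dom.DOMValidateContext;
import javax.xml.crypto.dsig.spec.C14NMethodParameterSpec;
import javax.xml.crypto.dsig.spec.TransformParameterSpec;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XmlSignatureSketch {
    // Algorithm URI for RSA with SHA-256, given as a string.
    static final String RSA_SHA256 = "http://www.w3.org/2001/04/xmldsig-more#rsa-sha256";

    /** Embeds a signature over the whole document, then validates it. */
    static boolean signAndVerify(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required by the signature API
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            KeyPair keys = generator.generateKeyPair(); // throwaway key pair

            XMLSignatureFactory factory = XMLSignatureFactory.getInstance("DOM");
            // URI "" means the whole document; the enveloped transform excludes the signature itself.
            Reference wholeDocument = factory.newReference("",
                factory.newDigestMethod(DigestMethod.SHA256, null),
                Collections.singletonList(
                    factory.newTransform(Transform.ENVELOPED, (TransformParameterSpec) null)),
                null, null);
            SignedInfo signedInfo = factory.newSignedInfo(
                factory.newCanonicalizationMethod(CanonicalizationMethod.INCLUSIVE,
                    (C14NMethodParameterSpec) null),
                factory.newSignatureMethod(RSA_SHA256, null),
                Collections.singletonList(wholeDocument));

            // Appends a <Signature> element as the last child of the root.
            factory.newXMLSignature(signedInfo, null)
                   .sign(new DOMSignContext(keys.getPrivate(), doc.getDocumentElement()));

            Node signatureNode = doc.getElementsByTagNameNS(XMLSignature.XMLNS, "Signature").item(0);
            DOMValidateContext context = new DOMValidateContext(keys.getPublic(), signatureNode);
            return factory.unmarshalXMLSignature(context).validate(context);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(signAndVerify("<order><item unit=\"box\">pencils</item></order>"));
    }
}
```

The canonicalisation method named in SignedInfo is what makes the signature independent of incidental formatting.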
Books
recommend book⇒Java and XML
| | paperback |
|---|---|
| ISBN13: | 978-0-596-10149-7 |
| ISBN10: | 0-596-10149-X |
| publisher: | O’Reilly |
| published: | 2006-12-08 |
| by: | Brett McLaughlin, Justin Edelson |

Covers SAX2, DTDs, XML Schema, XSL, JDOM, JAXP, JAXB, RSS and remote procedure calls with XML.
Learning More
- Sun’s Javadoc on the Schema class
- Sun’s Javadoc on the SchemaFactory class
- Sun’s Javadoc on the Validator class
- Sun’s Javadoc on the XMLConstants class
- Sun’s Javadoc on the SAXParser class
- Sun’s Javadoc on the XMLEncoder class