XML Compactor

This essay does not describe an existing computer program, just one that should exist. This essay is about a suggested student project in Java programming. This essay gives a rough overview of how it might work. I have no source, object, specifications, file layouts or anything else useful to implementing this project. Everything I have prepared to help you is right here.

This project outline is not like the artificial, tidy little problems you are spoon-fed in school, when all the facts you need are included, nothing extraneous is mentioned, the answer is fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems where it is up to you to fully the define the end point, or a series of ever more difficult versions of this project and research the information yourself to solve them.

Everything I have to say to help you with this project is written below. I am not prepared to help you implement it; or give you any additional materials. I have too many other projects of my own.

Though I am a programmer by profession, I don’t do people’s homework for them. That just robs them of an education.

You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.

These compact, binary XML-on-a-diet files would look to application programs that use tools like SAX as identical to bloated XML. Very few programs deal with XML directly. They would not notice the difference, just faster more efficient processing and cleaner more consistent data.

The basic idea is to convert the tags to 8/16-bit binary tokens and the strings to counted UTF-8 strings. Short strings (1 to 255 chars) are preceded by a 1-byte length count. Long strings are preceded by a four byte big-endian length count. You can then rapidly chase through the file jumping over strings you don’t want and indexing the 16-bit tokens into a table to decide what to do with that tag. You reserve a couple of 8-bit tokens for use in introducing strings. There are no quoted characters. Counted strings don’t need to be parsed to be processed or bypassed.

XML has other features besides simple begin/end tags that you will also have to encode. You might encode the offset to the corresponding virtual end tag for even faster processing. Numeric strings can be encoded instead as binary ints or IEEE (Institute of Electrical & Electronics Engineers). Dates can be encoded as binary int days since 1970.

Since tags are constrained only to appear inside certain other tags, you could usually get away with 8-bit tokens to represent tags. The encoding for a tag is specific to the context. You don’t need to assign a unique number to every possible tag or attribute.

Your binary version of the DTD (Document Type Definition) would contain the same information as the usual one, with additional information about which tokens you assigned to which tags. You could assign a token for a tag + int, tag + short string, or tag + long string, high/low bounds on a field, fixed length restrictions or character set restrictions. The binary descriptor for an xml-on-a-diet file could be automatically generated from a traditional XML schema, such as Schematron, RELAX NG or a DTD. The binary scheme might be optimised by letting the optimiser study a number of real world XML files.

You could use a different token to encode the same field as 1, 2, 3, 4 etc. byte integers, signed or unsigned.

You could use a different token to encode a fix-length uncounted string field, e.g. a postal code in Canada that is always 6 long, but allow the same field with a different token to encode it for variable length string contents.

Of course, you would mark your files with a format version number so that the binary format could evolve without concerning anyone except people to who write DOM (Document Object Model) drivers, much the way PKZIP format has evolved.

	This page is posted on the web at:	http://mindprod.com/project/xmlcompactor.html
	Optional Replicator mirror of mindprod.com on local hard disk J:	J:\mindprod\project\xmlcompactor.html
	Please read the feedback from other visitors, or send your own feedback about the site. Contact Roedy. Please feel free to link to this page without explicit permission.
	Canadian Mind Products IP:[65.110.21.43] Your face IP:[216.73.217.127]
Feedback	You are visitor number

XML Compactor

Disclaimer