SAX File Transfer Protocol
This essay does not describe an existing computer program, just one that should exist. It outlines a suggested
student project in Java programming and gives a rough overview of how it might work. I have no source, object,
specifications, file layouts or anything else useful to implementing this project.
This project outline is not like the artificial, tidy little problems you
are spoon-fed in school, where all the facts you need are included, nothing
extraneous is mentioned, and the answer is fully specified, along with hints
to nudge you toward a single expected canonical solution. This project is
much more like the real world of messy problems, where it is up to you to
fully define the end point, or a series of ever more difficult versions
of this project, and research the information yourself to solve them.
Everything I have to say to help you with this project is written below.
I am not prepared to help you implement it or give you any additional
materials. I have too many other projects of my own.
Though I am a programmer, I don’t do people’s homework for
them. That just robs them of an education.
You have my full permission to implement this project in any way you please
and to keep all the profits from your endeavour.
Please do not email me about this project without reading the disclaimer above.
SAX (not an acronym) is a more efficient file transfer protocol than HTTP (Hypertext Transfer Protocol) or FTP (File Transfer Protocol), which are based on TCP/IP (Transmission Control Protocol/Internet Protocol). TCP/IP spends half its time
acking every packet, which wastes bandwidth and slows things down. SAX is intended to work efficiently even when the
transport layer is flaky and loses packets or loses connections.
I originally designed SAX before anyone used the Internet, and before I had ever heard of TCP/IP or UDP (User Datagram Protocol). The
error-correcting modem had not yet even been invented. I was thinking in terms of dial-up augmented by X.25.
I wanted something as simple to use as a FAX (Facsimile), so I made the name rhyme with FAX, like a sexy FAX, but it is not an acronym.
SAX was coined simply as a blend of FAX and SEX, since it was intended to replace
inefficient FAXing of documents with a sexy new protocol. Recently the XML (eXtensible Markup Language) people have absconded with the term SAX for
a parser. I got it first, but of course had no official registration on the word.
What Is SAX For
The basic SAX algorithm could be used to:
- Invent a replacement for FTP that uploads and downloads files, that works efficiently even over lines that
disconnect frequently and lose packets. If you go that route, see my essay on how an atomic FTP uploader should work.
- Invent a replacement for BitTorrent that uploads and
downloads files, that works efficiently even over lines that disconnect frequently and lose packets.
- Invent a replacement for HTTP for downloading files, that works efficiently even over lines that disconnect
frequently and lose packets. It would also use multiple mirror servers, dynamically selecting the fastest and
least busy one. Of course, it would automatically recover if any server or connection to it failed by using alternate
servers. See my essay on closest download mirrors.
- Invent a fast file copy for LANs (Local Area Networks). You save all the acks on each packet. The low-level packet protocol in
Ethernet does not guarantee delivery. As the LAN (Local Area Network) becomes ever more heavily loaded, the percentage of lost packets
goes up. This is where SAX shines, by cutting the total number of packets it throws onto the LAN in half, and by
more efficiently handling retransmission of lost packets.
Here are the key features of the SAX protocol. I will leave it up to you to fill in the details. The protocol itself
is much simpler than other protocols:
- You can think of SAX as a sliding window protocol, like Kermit or Zmodem, where the window is the entire file.
SAX does not ack individual packets. It only acks every 60 seconds or so, as an are-you-there? I’m-alive
heartbeat, and when the file transfer is complete. Either end can ask the other at any time: Are you still
alive? or What is your view of the work still to be done?
- Packets are fixed length. They carry with them the record number where they fit in the file. You multiply the
record number by packet data length to get the offset where they fit.
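As a rough sketch of that arithmetic, assuming an arbitrary 512-byte chunk size (the essay leaves the actual size open; the class and method names here are invented for illustration):

```java
// Sketch: mapping a SAX record number to a file offset, assuming
// fixed-size chunks. The 512-byte size is an arbitrary choice.
public class SaxOffset {
    static final int CHUNK_SIZE = 512;

    // Offset in the file where a given chunk's data begins.
    static long offsetOf(long recordNumber) {
        return recordNumber * CHUNK_SIZE;
    }

    // Total number of chunks needed for a file of the given length.
    static long chunkCount(long fileLength) {
        return (fileLength + CHUNK_SIZE - 1) / CHUNK_SIZE;
    }
}
```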
- Every SAX packet is self-identifying — it says where it fits into the file being transferred. Packets may arrive out of
order, or be repeated, without any damage. Good packets are never discarded the way they are in most other protocols.
- Sender and receiver maintain bitmaps of which parts of the file they think have been successfully sent. When a
sender sends a packet, it marks that chunk done. When the receiver receives a packet successfully it marks that
chunk done. When the sender receives an ack, it marks unreceived chunks as undone. The file transfer is complete
when the receiver acks with an empty to-do list.
- An ack is a list of all the parts of the file that have not been received yet. The file transfer is over
when the receiver’s ack says all has been received, that there is no more work to do. The ack consists of a
complete list of the parts of the file that still need to be transferred according to that end, given as a list of
ranges and individual chunk numbers.
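One way to derive such an ack from the receiver’s bitmap, using java.util.BitSet (the class name and range representation here are my own invention, not part of any nailed-down SAX spec):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch: the receiver keeps a BitSet of chunks received; an ack is
// the list of gaps, expressed as [start, endExclusive) chunk ranges.
public class SaxAck {
    static List<long[]> missingRanges(BitSet received, int totalChunks) {
        List<long[]> ranges = new ArrayList<>();
        int i = received.nextClearBit(0);           // first missing chunk
        while (i < totalChunks) {
            int end = received.nextSetBit(i);       // end of this gap
            if (end < 0 || end > totalChunks) end = totalChunks;
            ranges.add(new long[]{i, end});
            i = received.nextClearBit(end);         // next gap, if any
        }
        return ranges;
    }
}
```

An empty list means the transfer is done.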
- The SAX protocol must be built atop IP (Internet Protocol) or UDP, not TCP/IP sockets, to avoid the automatic acks on every packet.
Have a look at the java.net.DatagramPacket and java.net.DatagramSocket classes. Unlike TCP/IP, SAX never discards any good packet, and never asks for
a good packet to be retransmitted. This is particularly important on poor connections when a high percentage of the
packets are lost or corrupted.
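As a minimal sketch of those java.net classes in action, here is one datagram sent and received over the loopback interface. The demo class and helper method are my own; a real SAX implementation would of course talk to a remote host:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

// Sketch: one UDP round trip over loopback using java.net.DatagramSocket
// and java.net.DatagramPacket. UDP checksumming and framing happen below
// this layer; SAX only ever sees whole packets.
public class SaxDatagramDemo {
    static byte[] sendAndReceive(byte[] payload) {
        try (DatagramSocket receiver = new DatagramSocket(0);   // any free port
             DatagramSocket sender = new DatagramSocket()) {
            DatagramPacket out = new DatagramPacket(
                payload, payload.length,
                InetAddress.getLoopbackAddress(), receiver.getLocalPort());
            sender.send(out);

            byte[] buf = new byte[1500];                        // typical Ethernet MTU
            DatagramPacket in = new DatagramPacket(buf, buf.length);
            receiver.setSoTimeout(2000);                        // don't hang forever
            receiver.receive(in);
            return java.util.Arrays.copyOf(in.getData(), in.getLength());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```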
- Any packets received with bad checksums are just ignored. The Datagram classes should handle creating and
verifying checksums and discarding bad packets for you.
- SAX is restartable in the event of a disconnect. The receiver tells the sender which bits of the file have not
arrived yet, so a restarted transfer begins the same way as a fresh one. The receiver keeps its bit map in the event of disconnect. The
sender rebuilds its bit map from the list of unfinished work the receiver sends it to restart the transfer.
- If the file is not already compressed, it is compressed prior to starting the transfer. Compressed files have a
jar, cab or zip signature in the first few bytes. If the file used some unknown compression, SAX would
recompress/decompress it, which would do no harm other than waste a few CPU (Central Processing Unit) cycles.
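A sketch of that signature check, using GZIP from the standard java.util.zip package as the compressor. The signature list and class name are illustrative, not exhaustive:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;

// Sketch: decide whether a file is already compressed by inspecting its
// magic bytes, and gzip it if not. Only two signatures shown here.
public class SaxCompress {
    static boolean looksCompressed(byte[] head) {
        if (head.length >= 2 && (head[0] & 0xFF) == 0x1F
                && (head[1] & 0xFF) == 0x8B)
            return true;                                  // gzip signature
        if (head.length >= 4 && head[0] == 'P' && head[1] == 'K'
                && head[2] == 3 && head[3] == 4)
            return true;                                  // zip/jar signature
        return false;
    }

    static byte[] maybeCompress(byte[] data) {
        if (looksCompressed(data)) return data;           // leave it alone
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(data);
            }
            return bos.toByteArray();
        } catch (java.io.IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```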
- SAX can also be used to transmit a stream of data, but it requires that a log of the stream be kept back to the last acknowledged point.
- Pacing, that is avoiding sending packets so fast that many are lost, or too slowly, is not strictly part of the
protocol. The sender can be clever, if it wants, by requesting an ack from the receiver. If it notices a pattern of
lost intermediate packets, it can lower its sending rate. If it sees a pattern of perfect transmission, it can
raise it. Perfect pacing speed is not necessary. It is free to experiment on the fly with different sending rates
to see what works best. It may optimise for speed or efficiency (ratio of packets received to sent) or some blend.
It may take into account its own load from other customers.
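One naive pacing rule a sender might start from, before experimenting. The thresholds and scaling factors below are arbitrary guesses, not part of the protocol:

```java
// Sketch: after each ack, compare packets sent against packets the
// receiver reports missing, and nudge the sending rate up or down.
// Thresholds and factors are invented starting points to tune by hand.
public class SaxPacer {
    static double adjustRate(double packetsPerSec, int sent, int lost) {
        double lossRatio = sent == 0 ? 0.0 : (double) lost / sent;
        if (lossRatio > 0.10) return packetsPerSec * 0.5;  // heavy loss: back off hard
        if (lossRatio > 0.02) return packetsPerSec * 0.9;  // mild loss: ease off
        return packetsPerSec * 1.1;                        // clean run: probe upward
    }
}
```

A cleverer strategy engine could track the trend over several acks, or optimise for efficiency instead of raw speed.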
- Packets in the pipeline cause a complication. When the sender requests an ack, the receiver sends back its
state. However several packets could be in the pipeline on their way to the receiver. When the sender gets the ack,
it will look as if those packets had been lost. The protocol per se has nothing to say about this problem. However,
how can the sender deal with it?
- Ignore the problem. It will all come out in the wash. On the next ack, these packets will be marked
received. If you prefer to send new packets rather than resend old ones, the problem will most likely be
resolved before you get around to resending the questionable packets.
- Mark these unacked recently-sent packets as state unknown. When sending packets, avoid sending state
unknown ones until you have sent any packets known unambiguously to be unsent. Only send state
unknown packets when you have nothing better to do, waiting for an ack.
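That second option amounts to three-state bookkeeping on the sender’s side. A sketch, with invented names, of what it might look like:

```java
import java.util.Arrays;

// Sketch: three-state chunk tracking for the sender. Chunks known to be
// unsent are preferred; state-unknown chunks (sent but not yet acked)
// are resent only when nothing better remains to do.
public class SaxSendQueue {
    enum State { UNSENT, UNKNOWN, DONE }
    final State[] chunks;

    SaxSendQueue(int n) {
        chunks = new State[n];
        Arrays.fill(chunks, State.UNSENT);
    }

    // Pick the next chunk to send: any UNSENT first, then any UNKNOWN.
    int next() {
        for (int i = 0; i < chunks.length; i++)
            if (chunks[i] == State.UNSENT) return i;
        for (int i = 0; i < chunks.length; i++)
            if (chunks[i] == State.UNKNOWN) return i;
        return -1;                                // nothing left to send
    }

    void markSent(int i)  { chunks[i] = State.UNKNOWN; }
    void markAcked(int i) { chunks[i] = State.DONE; }
}
```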
- If your acks don’t get a response within X seconds you retry a few times, then give up and start a new
connection. For heartbeat acks, where you are primarily just making sure the other end is still alive, you keep
sending data packets while you await the ack. The sender might send heartbeats even when for some reason it was too
busy to send packets, just to keep the channel alive. It might send heartbeats at a higher packet priority.
- You might consider piggybacking your ack requests on an outgoing data packet.
- You will probably want to implement this as a connection object with two associated threads, one to handle
sending packets, ack requests and acks, and another thread to handle receiving them. You might as well make both
ends symmetrical, capable of sending a file in either direction.
- Have each connection object deal with transferring only one file. If you need to send more than one file, use
multiple connection objects, that may or may not run in parallel.
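A sketch of that two-thread connection object. Here the “wire” is just a pair of in-memory queues so the threading shape can be shown without sockets; a real version would pump DatagramSockets instead. All names are invented:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: one connection object with two threads, a sender and a
// receiver, both ends symmetrical. The network is simulated with queues.
public class SaxConnection {
    final BlockingQueue<byte[]> delivered = new LinkedBlockingQueue<>();
    final Thread sender, receiver;

    SaxConnection(BlockingQueue<byte[]> wireOut, BlockingQueue<byte[]> wireIn,
                  BlockingQueue<byte[]> toSend) {
        sender = new Thread(() -> pump(toSend, wireOut));   // drain outbound work
        receiver = new Thread(() -> pump(wireIn, delivered)); // collect arrivals
        sender.setDaemon(true);
        receiver.setDaemon(true);
        sender.start();
        receiver.start();
    }

    // Move packets from one queue to another until interrupted.
    static void pump(BlockingQueue<byte[]> from, BlockingQueue<byte[]> to) {
        try {
            while (true) to.put(from.take());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // clean shutdown
        }
    }
}
```

Wiring two such objects back-to-back gives a full-duplex channel; either end can originate a file transfer.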
- How big should the packets be?
- Make them the size of the biggest packet that can flow down the link without being broken into subpackets
on one of the hops. See the TweakDUN utility for more information on how that works.
- Make the packets huge, say 64K (or whatever datagrams allow) to lower disk I/O overhead. The disadvantage
of this is datagrams need to be reassembled from subpackets before any processing. If any subpacket is lost,
the whole datagram is lost.
- Run a little test to find the optimal size by experiment. It will depend on error rate, MTU (Maximum Transmission Unit) and network load.
- Make it a configurable option. The user does experiments to tweak the number for each remote site and time
of day if they are fanatical.
- It is important to avoid per-packet overhead since the packets may be relatively small. Unlike most layered
protocols, SAX trusts the lower layers to do their job and verify checksums and figure out how long the packet
is, and therefore how much actual data it contains. I would think you could trim the overhead down to a
one-byte command followed by a 2 or 4 byte chunk number. You might implement the command byte as a set of bits
that can be ORed so that you can piggyback a request for ack on a normal data packet.
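A sketch of that layout: a one-byte command of OR-able flag bits, a four-byte chunk number, then the data. The flag values and names are invented for illustration:

```java
import java.nio.ByteBuffer;

// Sketch: a five-byte SAX header — one command byte of OR-able flags,
// then a four-byte chunk number — followed by the chunk data.
public class SaxPacketCodec {
    static final byte CMD_DATA        = 0x01;  // packet carries file data
    static final byte CMD_REQUEST_ACK = 0x02;  // piggybacked ack request
    static final byte CMD_HEARTBEAT   = 0x04;  // are-you-there probe

    static byte[] encode(byte command, int chunkNumber, byte[] data) {
        ByteBuffer buf = ByteBuffer.allocate(5 + data.length);
        buf.put(command).putInt(chunkNumber).put(data);
        return buf.array();
    }

    static int chunkNumberOf(byte[] packet) {
        return ByteBuffer.wrap(packet, 1, 4).getInt();
    }

    static boolean hasFlag(byte[] packet, byte flag) {
        return (packet[0] & flag) != 0;
    }
}
```

The data length needs no field of its own: the lower layer already tells you how long the datagram is.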
- How does the process end? You don’t want to get into an endless ack-the-ack sequence. In the very worst
case, you can recover from anything pathological when the receiver restarts the transfer.
The sender will be the first to discover it has sent all the packets (simply by maintaining an unsent packet
count). Then it sends its status along with a request for status. If it does not hear back, it repeats this and
eventually gives up, then throws away its bit map. Normally it will either get a request back for some missing
packets, or an all-done status in which case it can shut down the channel.
The receiver will eventually notice that it has received all packets. It sends its status then shuts down, and
throws away its bit map. It does not really matter if that final status packet got through. The sender will give
up soon enough.
If the receiver is expecting packets and does not hear anything, it can prod the sender with a status packet,
telling it of the packets it is still waiting for. If it still hears nothing, after several attempts, it shuts
down the channel, but keeps its bit map, and restarts the entire process, perhaps after some long delay. If at
any time the channel goes dead, the sender just gives up and forgets the whole thing, and the receiver must
restart, sending its status describing the chunks it still wants.
Because of the way the protocol works, in theory the receiver might be able to request just a sampling
of the file. It would fake a restart, specifying just the chunks it wanted. To the sender it would look like a
restart with a few missing packets.
- SAX might implement a total file hash as a double check on the individual packet checksums, but it would not
recheck each packet with a second checksum. The file level checksum would have to be defined in such a way that it
could be computed as the chunks arrived in higgledy-piggledy order. Either end could request a checksum of the
other and any range of chunks. This feature could be used by a clever strategy routine to narrow down where the
damage is and repair a chunk that was corrupted in transmission but fluked out and passed the packet checksum.
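One way to define such a file-level check so it can be computed as chunks arrive in any order: give each chunk its own hash and combine them with a commutative operation. CRC32 plus XOR is a simplistic illustrative choice, not a cryptographic one, and the names are invented:

```java
import java.util.zip.CRC32;

// Sketch: an order-independent whole-file check. Each chunk gets a
// CRC32-based hash mixed with its chunk number; hashes are combined
// with XOR, which is commutative and associative, so arrival order
// does not matter.
public class SaxFileHash {
    static long chunkHash(int chunkNumber, byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        // Mix in the chunk number so identical chunks at different
        // positions do not cancel out under XOR.
        return crc.getValue() * 31 + chunkNumber;
    }

    static long combine(long a, long b) {
        return a ^ b;
    }
}
```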
- You may find it simpler to organise the protocol as several intercommunicating threads, each with a single
purpose, e.g. one thread may just send packets. Another might just receive them. Another might keep track of
whether the other end is still alive. Another ensures the other end knows we are still alive. You may find the
logic of the protocol is thus simplified.
- Once you get the raw transfer going, add this wrinkle. With each file, send a serialised descriptor object. The
object contains fields such as:
- description of what the file is for.
- which application the file belongs to.
- date and time.
- URL (Uniform Resource Locator) to go for an update.
- URN (Uniform Resource Name) (future).
- platform for which the file is intended.
- mime type.
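A sketch of that descriptor as a Serializable class. The field names are my guesses from the list above; a real design would pin them down:

```java
import java.io.*;

// Sketch: the serialised file descriptor sent along with each file,
// with a byte-array round trip using standard Java serialisation.
public class SaxFileDescriptor implements Serializable {
    String description;     // what the file is for
    String application;     // which application it belongs to
    long timestamp;         // date and time, millis since the epoch
    String updateUrl;       // URL to go to for an update
    String platform;        // platform the file is intended for
    String mimeType;

    byte[] toBytes() {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
             ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(this);
            oos.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    static SaxFileDescriptor fromBytes(byte[] bytes) {
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (SaxFileDescriptor) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```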
- SAX dates back to 1985 when I hired a guy to implement a SAX-based EMAIL system based on dial up phone lines
and the X.25 packet net. I was trying to figure out how you communicate over long distance telephone satellite
links or packet nets where you have a very long ack turn-around delay. I was also concerned about how you
communicate with someone in Africa in the field using primitive non-error-correcting modems and noisy local phone lines.
- SAX seems a good fit with the Internet and its long ack times caused by packets pinging from hop to hop like
balls in a pin ball machine. You would think implementing SAX atop UDP should be even simpler than implementing it
directly via modem since you don’t have the problem of resynching after an error to find the beginning of the
next packet, and you don’t have the problem of reserving magic control characters that you must somehow quote
when they occur in data. However, then you have a problem of flow control. You could have very badly mismatched
speed between transmitter and receiver, a problem that does not come up in direct dial connections. So you might
use TCP/IP just to get the flow control. Doing it with UDP, if you could handle the flow control problem even
passably well, would probably be more efficient since you would save all the acks SAX tries so hard to avoid. Your
flow control could be handled by the receiver sending a message to the sender, please send at x packets per minute.
This way the receiver could deliberately choke down the traffic to conserve bandwidth, or run at its full receiving speed.
- FTP has a problem if you want to maintain three mirror sites. You must upload a file three times over your
narrow-band modem connection to the Internet. You should need to upload the file just once to one of the servers,
and then orchestrate a direct server to server transfer to copy to the other two servers. All you should need is
the necessary read/download write/upload privilege at the servers. Ideally, this would happen transparently. You
would just tell SAX you wanted a file sent to three places, and it would figure out the fastest way to do it. FTP
can do this now, but only if you have the rarely given shell script Telnet privilege. FTP can, in theory, even
handle such a third-party transfer within the protocol itself, but the feature is rarely implemented. SAX would not
require shell privilege, so it could be used by ordinary folk. On a cold night, years from now, you might even
think about how you could use multicast packets to send to more than one site all at once.
- Eventually you want to encapsulate SAX so that it is as built-in as FTP. Instead of saying ftp://
in your HTML (Hypertext Markup Language) you should be able to say sax:// to download a file.
- FTP has a problem with conflict between uploaders and downloaders. If somebody is downloading a file off my
website, I can’t upload a new version of it. I have to wait until he is finished downloading. If the file is
popular, I may have to wait until 3 AM
when there is a tiny window of opportunity when no-one is downloading it.
Similarly, when I am in the process of uploading a file, no one can download it. SAX should be clever about this.
When you upload a file that already exists, you upload to a temporary file, so you don’t block people from
using the old version. When the transfer is complete, you can do a rename (if the OS (Operating System) is clever enough to allow
renames of open files), or go into a wait loop until the file is free to do the rename. You need a much smaller
time window in which to do the rename than you need to upload the entire file.
- Writing a smart file uploader in Java (that used renames and wait loops to
avoid conflicts as described in the previous item) could be a little project all on its own, using ordinary FTP
instead of SAX. You would use the ftpClient class methods. The advantage is, you would not need to convince your
servers to install the SAX protocol. The disadvantage is it would not be as fast.
- The SAX protocol is very simple, or will be, once it is precisely nailed down. However, the strategy engine
behind SAX can be as simple or sophisticated as you want. Therein lies the challenge that allows for true
competition between SAX implementors. There are some pieces in the SAX protocol that don’t appear to be used,
such as receiver being able to request the sender’s bitmap status. These are for potential use by more
sophisticated strategies and for troubleshooting/debugging.
- SAX extends trivially to download simultaneously from several servers at once. You just tell each server you
are missing a different part of the file. You treat all messages coming in almost as if they came from a single
source. Multi-server support will be useful in a Napster-like situation where you are downloading from a mixture
of PCs (Personal Computers), servers, dialup etc., any of which could disconnect or suddenly drop in performance. You just carry on with
whatever servers are still working, or add some more.
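A sketch of that partitioning: split the missing-chunk list round-robin across however many servers are available, so each is asked for a different part of the file. The class name and round-robin strategy are my own; a cleverer plan would weight by server speed:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: divide the missing chunks among several servers round-robin,
// so each server is asked for a different part of the file.
public class SaxMultiServer {
    static List<List<Integer>> partition(List<Integer> missing, int servers) {
        List<List<Integer>> plan = new ArrayList<>();
        for (int s = 0; s < servers; s++) plan.add(new ArrayList<>());
        for (int i = 0; i < missing.size(); i++)
            plan.get(i % servers).add(missing.get(i));  // deal like cards
        return plan;
    }
}
```

If a server drops out, its share simply reappears in the missing list and gets re-dealt to the survivors.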
- SAX is similar to BitTorrent in many ways. However,
BitTorrent is based on TCP/IP, where SAX is based on UDP.
- Your clients will likely have firewalls. They typically block all UDP traffic until configured
to let it pass.
- Before you pump effort into implementing SAX, do a simple simulation over your lines to see how
much of an improvement in speed and reduction in packet usage you actually get over your particular nasty connections. If
you don’t see much, you might as well do something much simpler based on TCP/IP.
Why bother with SAX?
- You have flaky connections that tend to lose packets, either because of dialup over non-error-correcting
modems, or network congestion.
- You want automatic recovery from failed connections.
- You want to cut the number of packets sent in half, especially if you pay per packet sent rather than per byte sent.