SAX File Transfer Protocol
©1996-2017 Roedy Green of Canadian Mind Products
Disclaimer
This essay does not describe an existing computer program, just one that should exist. This essay is about a suggested student project in
Java programming. This essay gives a rough overview of how it might work. I have no source, object, specifications, file layouts or anything
else useful for implementing this project. Everything I have prepared to help you is right here.
This project outline is not like the artificial, tidy little problems you are spoon-fed in school, where all the facts you need are included, nothing extraneous is mentioned and the answer is
fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems, where it is up to you to fully
define the end point, or a series of ever more difficult versions of this project, and to research the information yourself to solve them.
Everything I have to say to help you with this project is written below. I am not prepared to help you implement it or give you any additional materials. I have too many
other projects of my own.
Though I am a programmer by profession, I don’t do people’s homework for them. That just robs them of an education.
You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.
Please do not email me about this project without reading the disclaimer above.
SAX (not an acronym) is a more efficient file transfer protocol than
HTTP (Hypertext Transfer Protocol) or
FTP (File Transfer Protocol), which are based on TCP/IP (Transmission Control Protocol/Internet Protocol).
TCP/IP
spends half its time acking every packet, which wastes bandwidth and slows things down.
SAX
is intended to work efficiently even when the transport layer is flaky and loses packets
or loses connections.
I originally designed SAX
before anyone used the Internet and before I had ever heard of
TCP/IP or UDP (User Datagram Protocol). The error-correcting modem
had not yet even been invented. I was thinking in terms of dial-up augmented by X.25. I
wanted something as simple to use as a FAX (Facsimile),
so I made it rhyme with FAX, like a
sexy FAX, but it is not an
acronym.
The name was originally coined as just a mix of FAX
and SEX, since it was intended to replace inefficient FAXing of documents with a sexy new
protocol. Recently the XML (Extensible Markup Language)
people have absconded with the term SAX
for a parser. I got it first, but, of course, had no
official registration on the word.
What Is SAX For
The basic SAX
algorithm could be used to:
- Invent a replacement for FTP that uploads and downloads files and works efficiently
even over lines that disconnect frequently and lose packets. If you go that route, see
my essay on how an atomic FTP uploader
should work.
- Invent a replacement for BitTorrent that uploads and downloads files and works
efficiently even over lines that disconnect frequently and lose packets.
- Invent a replacement for HTTP for downloading files that works efficiently even over
lines that disconnect frequently and lose packets. It would also use multiple mirror servers,
dynamically selecting the fastest and least busy one. Of course, it would
automatically recover if any server or the connection to it failed, by using alternate
servers. See my essay on closest download
mirrors.
- Invent a fast file copy for LANs (Local Area Networks).
You save all the acks on each packet. The low-level packet protocol in Ethernet
does not guarantee delivery. As the LAN (Local Area Network)
becomes ever
more heavily loaded, the percentage of lost packets goes up. This is where
SAX
shines, by cutting the total number of packets it throws onto the
LAN
in half and by handling retransmission of lost packets more efficiently.
Protocol Features
Here are the key features of the
SAX protocol. I will leave
it up to you to fill in the details. The protocol itself is much simpler than other
protocols:
- You can think of SAX
as a sliding window protocol, like Kermit or Zmodem, where the window is the entire
file being transferred.
- SAX
does not ack individual packets. It only acks every 60
seconds or so as an are-you-there? I’m-alive heartbeat and when file transfer is
complete. Either end can ask the other at any time Are you still
alive? or What is your view of the work still to be
done?
- Packets are fixed length. They carry with them the record number where they fit in
the file. You multiply the record number by the packet data length to get the offset where
they fit. (A sketch of one possible packet layout appears after this list.)
- Every SAX
packet is self-identifying — where it fits into the file being transferred.
Packets may arrive out of order, or be repeated without any damage. Good packets are
never discarded the way they are in most other protocols.
- Sender and receiver maintain bitmaps of which parts of the file they think have
been successfully sent. When the sender sends a packet, it marks that chunk done. When
the receiver receives a packet successfully, it marks that chunk done. When the sender
receives an ack, it marks unreceived chunks as undone. The file transfer is complete
when the receiver acks with an empty to-do list. (See the bitmap sketch after this list.)
- An ack is a list of all the parts of the file that have not been
received yet. The file transfer is over when the receiver’s ack says all has been
received, that there is no more work to do. The ack consists of a complete list of the
parts of the file that still need to be transferred according to that end, given as a
list of ranges and individual chunk numbers.
- The SAX protocol must be built atop
IP (Internet Protocol) or
UDP, not TCP/IP
sockets, to avoid the automatic acks on every packet. Have a
look at the java.net.DatagramPacket and java.net.DatagramSocket classes (a sketch of their
use appears after this list). Unlike TCP/IP,
SAX
never discards any good packet and never asks for a good packet to be retransmitted.
This is particularly important on poor connections where a high percentage of the
packets are lost or corrupted.
- Any packets received with bad checksums are just ignored. The UDP layer beneath the
Datagram classes creates and verifies checksums and discards bad packets for you.
- SAX is restartable in the
event of a disconnect. The receiver tells the sender which bits of the file have not
arrived yet, and the transfer restarts the same way it originally began. The receiver keeps its bit map in
the event of a disconnect. The sender rebuilds its bit map from the list of unfinished
work the receiver sends it to restart the transfer.
- If the file is not already compressed, it is compressed prior to starting the
transfer. Already-compressed files are detected by a jar, cab or zip signature in the first few bytes. If
the file used some unknown compression, SAX
would recompress/decompress it, which would do no harm other than wasting a few
CPU (Central Processing Unit) cycles. (A sketch of the signature check appears after this list.)
- SAX can also be used to
transmit a stream of data, but it requires that a log be kept back to the last
checkpoint.
- Pacing, that is, avoiding sending packets so fast that many are lost or so slowly
that bandwidth sits idle, is not strictly part of the protocol (a naive pacing sketch
appears after this list). The sender can be clever, if it wants, by
requesting an ack from the receiver. If it notices a pattern of lost intermediate
packets, it can lower its sending rate. If it sees a pattern of perfect transmission,
it can raise it. Perfect pacing speed is not necessary. It is free to experiment on the
fly with different sending rates to see what works best. It may optimise for speed or
efficiency (ratio of packets received to sent) or some blend. It may take into account
its own load from other customers.
- Packets in the pipeline cause a complication. When the sender requests an ack, the
receiver sends back its state. However, several packets could be in the pipeline on
their way to the receiver. When the sender gets the ack, it will look as if those
packets had been lost. The protocol per se has nothing to say about this problem. So
how can the sender deal with it?
- Ignore the problem. It will all come out in the wash. On the next ack, these
packets will be marked received. If you prefer to send new packets rather than
resend old ones, the problem will most likely be resolved before you get around to
resending the questionable packets.
- Mark these unacked recently-sent packets as state unknown. When
sending packets, avoid sending state unknown ones until you have sent any packets
known unambiguously to be unsent. Only send state unknown packets
when you have nothing better to do, waiting for an ack.
- If your acks don’t get a response within X seconds you retry a few times,
then give up and start a new connection. For heartbeat acks, where you are primarily
just making sure the other end is still alive, you keep sending data packets while you
await the ack. The sender might send heartbeats even when for some reason it was too
busy to send packets, just to keep the channel alive. It might send heartbeats at a
higher packet priority status.
- You might consider piggybacking your ack requests on an outgoing data packet.
- You will probably want to implement this as a connection object with two associated
threads, one to handle sending packets, ack requests and acks and another thread to
handle receiving them. You might as well make both ends symmetrical, capable of sending
a file in either direction.
- Have each connection object deal with transferring only one file. If you need to
send more than one file, use multiple connection objects, which may or may not run in
parallel.
- How big should the packets be?
- Make them the size of the biggest packet that can flow down the link without
being broken into subpackets on one of the hops.
- Make the packets huge, say 64K (or whatever datagrams
allow) to lower disk I/O overhead. The disadvantage of this is datagrams need to be
reassembled from subpackets before any processing. If any subpacket is lost, the
whole datagram is lost.
- Run a little test to find the optimal size by experiment. It will depend on
error rate, MTU (Maximum Transmission Unit), network congestion, etc.
- Make it a configurable option. The user does experiments to tweak the number
for each remote site and time of day if they are fanatical.
- It is important to avoid per-packet overhead since the packets may be
relatively small. Unlike most layered protocols, SAX
trusts the lower layers to do their job and verify checksums and figure out how
long the packet is and therefore how much actual data it contains. I would think
you could trim the overhead down to a one-byte command followed by a 2 or 4 byte
chunk number. You might implement the command byte as a set of bits that can be
ORed so that you can piggyback a request for ack on a normal data packet.
- How does the process end? You don’t want to get into an endless ack-the-ack
sequence. In the very worst case, you can recover from anything pathological when the
receiver restarts the transfer.
The sender will be the first to discover it has sent all the packets (simply by
maintaining an unsent-packet count). Then it sends its status along with a request
for status. If it does not hear back, it repeats this and eventually gives up, then
throws away its bit map. Normally it will either get a request back for some missing
packets, or an all-done status in which case it can shut down the channel.
The receiver will eventually notice that it has received all packets. It sends its
status then shuts down and throws away its bit map. It does not really matter if
that final status packet got through. The sender will give up soon enough.
If the receiver is expecting packets and does not hear anything, it can prod the
sender with a status packet, telling it of the packets it is still waiting for. If it
still hears nothing, after several attempts, it shuts down the channel, but keeps its
bit map and restarts the entire process, perhaps after some long delay. If at any
time the channel goes dead, the sender just gives up and forgets the whole thing and
the receiver must restart, sending its status describing the chunks it still
wants.
Because of the way the protocol works, in theory the receiver might be able to
request just a sampling of the file. It would fake a restart,
specifying just the chunks it wanted. To the sender it would look like a restart with
a few missing packets.
- SAX might implement a
total file hash as a double check on the individual packet checksums, but it would not
recheck each packet with a second checksum. The file-level checksum would have to be
defined in such a way that it could be computed as the chunks arrived in
higgledy-piggledy order (one way to do that is sketched after this list). Either end
could request from the other a checksum of any range of chunks. This feature could be
used by a clever strategy routine to narrow down where the damage is and repair a
damaged chunk that was corrupted in transmission but fluked out and passed the packet
checksum test.
- You may find it simpler to organise the protocol as several intercommunicating
threads, each with a single purpose, e.g. one thread may just send packets. Another
might just receive them. Another might keep track of whether the other end is still
alive. Another ensures the other end knows we are still alive. You may find the logic
of the protocol is thus simplified.
- Once you get the raw transfer going, add this wrinkle. With each file send a
serialised descriptor object (sketched after this list). The object contains fields such as:
- description of what the file is for.
- which application the file belongs to.
- version.
- date and time.
- creator.
- URL (Uniform Resource Locator) to go to for an update.
- URN (Uniform Resource Name) (future).
- platform for which the file is intended.
- MIME type.
- SAX dates back to 1985
when I hired a guy to implement a SAX-based EMAIL system based on dial-up phone lines
and the X.25 packet net. I was trying to figure out how you communicate over long
distance telephone satellite links or packet nets where you have a very long ack
turn-around delay. I was also concerned about how you communicate with someone in
Africa in the field using primitive non-error-correcting modems and noisy local phone
lines.
- SAX seems a good fit with
the Internet and its long ack times caused by packets pinging from hop to hop like
balls in a pinball machine. You would think implementing SAX
atop UDP
should be even simpler than implementing it directly via modem since you don’t
have the problem of resyncing after an error to find the beginning of the next packet,
and you don’t have the problem of reserving magic control characters that you
must somehow quote when they occur in data. However, then you have a problem of flow
control. You could have very badly mismatched speed between transmitter and receiver, a
problem that does not come up in direct dial connections. So you might use
TCP/IP just to get the flow control. Doing it with
UDP, if you could
handle the flow control problem even passably well, would probably be more efficient
since you would save all the acks SAX
tries so hard to avoid. Your flow control could be handled by the receiver sending a
message to the sender, please send at x packets per minute. This way the receiver could
deliberately choke down the traffic to conserve bandwidth, or run at its full receiving
capacity.
- FTP has a problem
if you want to maintain three mirror sites. You must upload a file three times over
your narrow-band modem connection to the Internet. You should need to upload the file
just once to one of the servers and then orchestrate a direct server-to-server
transfer to copy it to the other two servers. All you should need is the necessary
read/download and write/upload privileges on the servers. Ideally, this would happen
transparently. You would just tell SAX
you wanted a file sent to three places and it would figure out the fastest way to do
it. FTP
can do this now, but only if you have the rarely-granted shell script Telnet privilege.
FTP can, in
theory, even handle such a third-party transfer within the protocol itself, but the
feature is rarely implemented. SAX
would not require shell privilege, so it could be used by ordinary folk. On a cold
night, years from now, you might even think about how you could use multicast packets
to send to more than one site all at once.
- Eventually you want to encapsulate SAX
so that it is as built-in as FTP. Instead of saying ftp://
in your HTML (Hypertext Markup Language) you should be
able to say sax:// to download a file.
- FTP has a
problem with conflict between uploaders and downloaders. If somebody is downloading a
file off my website, I can’t upload a new version of it. I have to wait until
he is finished downloading. If the file is popular, I may have to wait until
3 AM when there is a tiny window of opportunity when no-one
is downloading it. Similarly, when I am in the process of uploading a file, no one
can download it. SAX should be clever about this. When you upload a file
that already exists, you upload to a temporary file, so you don’t block people
from using the old version. When the transfer is complete, you can do a rename (if
the OS (Operating System) is clever enough
to allow renames of open files), or go into a wait loop until the file is free to do
the rename. You need a much smaller time window in which to do the rename than you
need to upload the entire file.
- Writing a smart file uploader in Java (that used
renames and wait loops to avoid conflicts, as described in the previous item) could be a
little project all on its own, using ordinary FTP
instead of SAX; a sketch appears after this list.
You would use the ftpClient class methods. The advantage is that you would not need to
convince your servers to install the SAX
protocol. The disadvantage is that it would not be as fast.
- The SAX
protocol is very simple, or will be, once it is precisely nailed down. However, the
strategy engine behind SAX
can be as simple or sophisticated as you want. Therein lies the challenge
that allows for true competition between SAX
implementors. There are some pieces in the SAX
protocol that don’t appear to be used, such as
the receiver being able to request the sender’s bitmap status. These are for
potential use by more sophisticated strategies and for troubleshooting/debugging.
- SAX extends trivially to
downloading from several servers at once. You just tell each server you are
missing a different part of the file. You treat all messages coming in almost as if
they came from a single source. Multi-server support will be useful in a Napster-like
situation where you are downloading from a mixture of PCs (Personal Computers),
servers, dialup connections, etc., any of which could disconnect or suddenly drop in performance.
You just carry on with whatever servers are still working, or add some more.
- SAX is similar to
BitTorrent in many ways.
However, BitTorrent is based on TCP/IP,
whereas SAX
is based on UDP.
- Your clients will likely have firewalls. They typically block all
UDP traffic until
configured to let it pass.
- Before you pump effort into implementing
SAX, do a simple
simulation over your lines to see how much of an improvement in speed and reduction in packet
usage you actually get over your particular nasty connections. If you don’t see
much, you might as well do something much simpler based on
TCP/IP.
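To make the fixed-length, self-identifying packet idea concrete, here is a rough sketch in Java of one possible packet layout: a one-byte command whose bits can be ORed together, a 4-byte chunk number, then a fixed block of file data. The flag values, field widths, 8 KB chunk size and class name are assumptions of my own, not part of any nailed-down SAX specification.

import java.nio.ByteBuffer;

// A minimal sketch of one possible SAX packet layout:
// 1 command byte (bit flags that can be ORed together),
// a 4-byte chunk number, then a fixed-length block of file data.
public class SaxPacket {
    // Hypothetical command bits; the real assignment is up to the implementor.
    public static final byte CMD_DATA        = 0x01; // packet carries file data
    public static final byte CMD_REQUEST_ACK = 0x02; // piggybacked request for an ack
    public static final byte CMD_HEARTBEAT   = 0x04; // are-you-there / I-am-alive

    public static final int DATA_LENGTH = 8 * 1024;  // assumed chunk size
    public static final int HEADER_LENGTH = 1 + 4;   // command byte + chunk number

    public final byte command;
    public final int chunkNumber;
    public final byte[] data;      // up to DATA_LENGTH bytes of file data

    public SaxPacket(byte command, int chunkNumber, byte[] data) {
        this.command = command;
        this.chunkNumber = chunkNumber;
        this.data = data;
    }

    /** Where this chunk's data belongs in the file: chunk number times chunk size. */
    public long fileOffset() {
        return (long) chunkNumber * DATA_LENGTH;
    }

    /** Flatten the packet into bytes ready to hand to a DatagramPacket. */
    public byte[] encode() {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_LENGTH + data.length);
        buf.put(command);
        buf.putInt(chunkNumber);
        buf.put(data);
        return buf.array();
    }

    /** Rebuild a packet from raw bytes received off the wire. */
    public static SaxPacket decode(byte[] raw, int length) {
        ByteBuffer buf = ByteBuffer.wrap(raw, 0, length);
        byte command = buf.get();
        int chunkNumber = buf.getInt();
        byte[] data = new byte[buf.remaining()];
        buf.get(data);
        return new SaxPacket(command, chunkNumber, data);
    }
}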
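The done/undone bookkeeping each end keeps could be as simple as a java.util.BitSet indexed by chunk number, with the ack built from it as a list of still-missing ranges. Again only a sketch; representing a range as a two-element int array is my own choice.

import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch of the bookkeeping both ends keep: one bit per chunk,
// set when that chunk is believed to have been transferred.
public class ChunkMap {
    private final int totalChunks;
    private final BitSet done;

    public ChunkMap(int totalChunks) {
        this.totalChunks = totalChunks;
        this.done = new BitSet(totalChunks);
    }

    public void markDone(int chunk)   { done.set(chunk); }
    public void markUndone(int chunk) { done.clear(chunk); }
    public boolean isComplete()       { return done.cardinality() == totalChunks; }

    /**
     * Build the ack: the chunk ranges still missing, as {firstChunk, lastChunk}
     * pairs. An empty list means the transfer is complete.
     */
    public List<int[]> missingRanges() {
        List<int[]> ranges = new ArrayList<>();
        int start = done.nextClearBit(0);
        while (start < totalChunks) {
            int end = done.nextSetBit(start);
            if (end < 0 || end > totalChunks) end = totalChunks;
            ranges.add(new int[] { start, end - 1 });
            start = done.nextClearBit(end);
        }
        return ranges;
    }

    /** Sender side: when an ack arrives, mark the listed chunks undone again. */
    public void applyAck(List<int[]> stillMissing) {
        for (int[] range : stillMissing) {
            done.clear(range[0], range[1] + 1);
        }
    }
}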
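Pushing those packets over UDP with java.net.DatagramSocket and java.net.DatagramPacket might look like the following. The UDP checksum is computed and verified by the network stack underneath, so corrupted datagrams are dropped before this code ever sees them. The port number is an arbitrary placeholder, and the sketch builds on the SaxPacket class sketched earlier.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

// Minimal sketch of the UDP plumbing under SAX, using a made-up port number.
public class SaxTransport {
    public static final int SAX_PORT = 4444; // arbitrary choice for illustration

    private final DatagramSocket socket;

    public SaxTransport(int localPort) throws Exception {
        this.socket = new DatagramSocket(localPort);
    }

    /** Fire one SAX packet at the other end. No per-packet ack is expected. */
    public void send(SaxPacket packet, InetAddress remote, int remotePort) throws Exception {
        byte[] raw = packet.encode();
        socket.send(new DatagramPacket(raw, raw.length, remote, remotePort));
    }

    /** Block until the next datagram arrives and decode it. */
    public SaxPacket receive() throws Exception {
        byte[] buffer = new byte[SaxPacket.HEADER_LENGTH + SaxPacket.DATA_LENGTH];
        DatagramPacket datagram = new DatagramPacket(buffer, buffer.length);
        socket.receive(datagram); // corrupted datagrams were already dropped below us
        return SaxPacket.decode(datagram.getData(), datagram.getLength());
    }
}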
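Pacing is left to the implementor, but one trivial strategy is to track, at each ack, the fraction of recently sent chunks the receiver reports missing, and nudge the gap between packets up or down accordingly. The thresholds and step sizes below are pure guesses.

// Sketch of a naive pacing strategy: widen the gap between packets when the
// last ack shows heavy loss, narrow it when transmission looks clean.
public class Pacer {
    private long delayMillis = 10;               // assumed starting inter-packet gap

    /** Call after each ack with the loss ratio observed since the previous ack. */
    public void adjust(double lossRatio) {
        if (lossRatio > 0.10) {
            delayMillis = Math.min(delayMillis * 2, 1000);   // back off sharply
        } else if (lossRatio < 0.01) {
            delayMillis = Math.max(delayMillis - 1, 0);      // creep faster
        }
    }

    /** Pause between packets according to the current rate. */
    public void pace() throws InterruptedException {
        if (delayMillis > 0) {
            Thread.sleep(delayMillis);
        }
    }
}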
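Detecting whether a file is already compressed by peeking at its signature bytes, and gzipping it otherwise, could be sketched like this. The signature list covers only the formats mentioned above (zip and jar share the PK signature, gzip has its own, cab files start with the letters MSCF) and the .sax.gz naming is just an illustration.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

// Sketch: decide whether a file is already compressed by peeking at its
// signature bytes; if not, gzip it into a temporary file before transfer.
public class PreCompressor {

    /** True if the first bytes look like a zip/jar, gzip or cab archive. */
    public static boolean looksCompressed(String filename) throws IOException {
        byte[] head = new byte[4];
        try (FileInputStream in = new FileInputStream(filename)) {
            if (in.read(head) < 4) return false;
        }
        boolean zipOrJar = head[0] == 0x50 && head[1] == 0x4B;                  // "PK"
        boolean gzip     = (head[0] & 0xFF) == 0x1F && (head[1] & 0xFF) == 0x8B;
        boolean cab      = head[0] == 'M' && head[1] == 'S' && head[2] == 'C' && head[3] == 'F';
        return zipOrJar || gzip || cab;
    }

    /** Gzip the file into filename + ".sax.gz" and return the new name. */
    public static String compress(String filename) throws IOException {
        String compressed = filename + ".sax.gz";
        try (FileInputStream in = new FileInputStream(filename);
             GZIPOutputStream out = new GZIPOutputStream(new FileOutputStream(compressed))) {
            byte[] buffer = new byte[8 * 1024];
            int n;
            while ((n = in.read(buffer)) > 0) {
                out.write(buffer, 0, n);
            }
        }
        return compressed;
    }
}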
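One way to define a file-level checksum that can be computed while chunks arrive in higgledy-piggledy order: hash each chunk as it lands into a slot indexed by chunk number, then hash the per-chunk digests in order once the transfer is complete. The choice of SHA-256 is my own, not part of the protocol.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: a whole-file checksum that can be built up in any chunk order.
// Each chunk is hashed when it arrives; the file hash is the hash of the
// per-chunk digests taken in chunk-number order once all chunks are present.
public class FileHash {
    private final byte[][] chunkDigests;

    public FileHash(int totalChunks) {
        this.chunkDigests = new byte[totalChunks][];
    }

    /** Call whenever a chunk arrives, in whatever order. */
    public void addChunk(int chunkNumber, byte[] data, int length) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        md.update(data, 0, length);
        chunkDigests[chunkNumber] = md.digest();
    }

    /** Combine the per-chunk digests into one file-level checksum. */
    public byte[] fileDigest() throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        for (byte[] digest : chunkDigests) {
            md.update(digest); // assumes every chunk has arrived
        }
        return md.digest();
    }
}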
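The descriptor object sent along with each file could be nothing more than a plain Serializable bean holding the fields listed above. The class and field names here are guesses for illustration.

import java.io.Serializable;
import java.util.Date;

// Sketch of the descriptor object sent along with each file.
// Field names are illustrative only.
public class SaxFileDescriptor implements Serializable {
    private static final long serialVersionUID = 1L;

    public String description;   // what the file is for
    public String application;   // which application the file belongs to
    public String version;
    public Date   dateTime;
    public String creator;
    public String updateUrl;     // URL to go to for an update
    public String urn;           // URN (future)
    public String platform;      // platform for which the file is intended
    public String mimeType;
}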
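Finally, a sketch of the smart FTP uploader side project: upload under a temporary name so downloaders keep using the old copy, then rename into place, retrying while the target is busy. I have used the Apache Commons Net FTPClient here as an assumption; the essay mentions the ftpClient class, and any FTP library with store and rename calls would do. The temporary suffix, retry count and ten-second retry interval are placeholders.

import java.io.FileInputStream;
import java.io.IOException;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;

// Sketch: upload to a temporary name so downloaders keep using the old copy,
// then rename into place, retrying while the server reports the target busy.
public class SmartUploader {

    public static void upload(String host, String user, String password,
                              String localFile, String remoteFile)
            throws IOException, InterruptedException {
        FTPClient ftp = new FTPClient();
        ftp.connect(host);
        try {
            if (!ftp.login(user, password)) {
                throw new IOException("login refused");
            }
            ftp.setFileType(FTP.BINARY_FILE_TYPE);

            String tempName = remoteFile + ".uploading";
            try (FileInputStream in = new FileInputStream(localFile)) {
                if (!ftp.storeFile(tempName, in)) {
                    throw new IOException("upload failed: " + ftp.getReplyString());
                }
            }

            // Wait-loop the rename until the old file is no longer busy.
            for (int attempt = 0; attempt < 30; attempt++) {
                ftp.deleteFile(remoteFile);          // ignore failure; file may not exist
                if (ftp.rename(tempName, remoteFile)) {
                    return;                          // new version is live
                }
                Thread.sleep(10000);                 // assumed retry interval
            }
            throw new IOException("could not rename " + tempName + " into place");
        } finally {
            ftp.logout();
            ftp.disconnect();
        }
    }
}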
Summary
Why bother with SAX?
- You have flaky connections that tend to lose packets, either because of dialup over
non-error-correcting modems, or network congestion.
- You want automatic recovery from failed connections.
- You want to cut the number of packets sent in half, especially if you pay per packet
sent rather than per byte sent.