image provider

SAX File Transfer Protocol


Disclaimer

This essay does not describe an existing computer program, just one that should exist. This essay is about a suggested student project in Java programming. This essay gives a rough overview of how it might work. I have no source, object, specifications, file layouts or anything else useful to implementing this project. Everything I have prepared to help you is right here.

This project outline is not like the artificial, tidy little problems you are spoon-fed in school, when all the facts you need are included, nothing extraneous is mentioned, the answer is fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems where it is up to you to fully the define the end point, or a series of ever more difficult versions of this project and research the information yourself to solve them.

Everything I have to say to help you with this project is written below. I am not prepared to help you implement it; or give you any additional materials. I have too many other projects of my own.

Though I am a programmer by profession, I don’t do people’s homework for them. That just robs them of an education.

You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.

Please do not email me about this project without reading the disclaimer above.

SAX (not an acronym) is a more efficient file transfer protocol than HTTP (Hypertext Transfer Protocol) or FTP (File Transfer Protocol) which are based on TCP/IP (Transmission Control Protocol/Internet Protocol). TCP/IP sends half its time acking every packet, which wastes bandwidth and slows things down. SAX is intended to work efficiently even when the transport layer is flaky and loses packets or loses connections.

I originally designed SAX before anyone used the Internet and before I had ever heard of TCP/IP or UDP (User Datagram Protocol). The error-correcting modem had not yet even been invented. I was thinking in terms of dial-up augmented by X.25. I wanted something as simple to use as a FAX (Facsimile), so made it rhyme with FAX, like a sexy FAX, but it is not an acronym.

SAX is not an acronym. It was originally coined as just a mix of FAX and SEX since it was intended to replace inefficient FAXing of documents with sexy new protocol. Recently the XML (extensible Markup Language) people have absconded with the term SAX for a parser. I got it first, but, of course, had no official registration on the word.

What Is SAX For

The basic SAX algorithm could be used to:

Protocol Features

Here are the key features of the SAX protocol. I will leave it up to you to fill in the details. The protocol itself is much simpler than other protocols:
  1. You can think of SAX as a sliding window protocol, like Kermit or Zmodem, where the window is the entire file being transferred.
  2. SAX does not ack individual packets. It only acks every 60 seconds or so as an are-you-there? I’m-alive heartbeat and when file transfer is complete. Either end can ask the other at any time Are you still alive? or What is your view of the work still to be done?
  3. Packets are fixed length. They carry with them the record number where they fit in the file. You multiply the record number by packet data length to get the offset where they fit.
  4. Every SAX packet is self-identifying — where it fits into the file being transferred. Packets may arrive out of order, or be repeated without any damage. Good packets are never discarded the way they are in most other protocols.
  5. Sender and receiver maintain bitmaps of which parts of the file they think have been successfully sent. When a sender sends a packet, it marks that chunk done. When the receiver receives a packet successfully it marks that chunk done. When the sender receives an ack, it marks unreceived chunks as undone. The file transfer is complete when the receiver acks with an empty to do list.
  6. An ack is a list of all the parts of the file that have not been received yet. The file transfer is over when the receiver’s ack says all has been received, that there is no more work to do. The ack consists of a complete list of the parts of the file that still need to be transferred according to that end, given as a list of ranges and individual chunk numbers.
  7. The SAX protocol must be built atop IP (Internet Protocol) or UDP, not sockets TCP/IP to avoid the automatic acks on every packet. Have a look at the java.net.DatagramPacket and java.net.DatagramSocket classes. Unlike TCP/IP, SAX never disards any good packet and never asks for a good packet to be retransmitted. This is particularly important on poor connections when a high percentage of the packets are lost or corrupted.
  8. Any packets received with bad checksums are just ignored. The Datagram classes should handle creating and verifying checksums and discarding bad packets for you.
  9. SAX is restartable in event of disconnect. The receiver tells the sender which bits of the file have not arrived yet. The file transfer starts the same way. The receiver keeps its bit map in the event of disconnect. The server rebuilds its bit map from the list of unfinished work the receiver sends it to restart the transfer.
  10. If the file is not already compressed, it is compressed prior to starting the transfer. Compressed files have a jar, cab or zip signature in the first few bytes. If the file used some unknown compression, SAX would recompress/decompress it, which would do no harm other than waste a few CPU (Central Processing Unit) cycles.
  11. SAX can also be used to transmit a stream of data, but it requires that a log be kept back to the last checkpoint.
  12. Pacing, that is avoiding sending packets so fast that many are lost, or too slowly, is not strictly part of the protocol. The sender can be clever, if it wants, by requesting an ack from the receiver. If it notices a pattern of lost intermediate packets, it can lower its sending rate. If it sees a pattern of perfect transmission, it can raise it. Perfect pacing speed is not necessary. It is free to experiment on the fly with different sending rates to see what works best. It may optimise for speed or efficiency (ratio of packets received to sent) or some blend. It may take into account its own load from other customers.
  13. Packets in the pipeline cause a complication. When the sender requests an ack, the receiver sends back its state. However, several packets could be in the pipeline on their way to the receiver. When the sender gets the ack, it will look as if those packets had been lost. The protocol per has nothing to say about this problem. However, how can the sender deal with it?
  14. If your acks don’t get a response within X seconds you retry a few times, then give up and start a new connection. For heartbeat acks, where you are primarily just making sure the other end is still alive, you keep sending data packets while you await the ack. The sender might send heartbeats even when for some reason it was too busy to send packets, just to keep the channel alive. It might send heartbeats at a higher packet priority status.
  15. You might consider piggybacking your ack requests on an outgoing data packet.
  16. You will probably want to implement this as a connection object with two associated threads, one to handle sending packets, ack requests and acks and another thread to handle receiving them. You might as well make both ends symmetrical, capable of sending a file in either direction.
  17. Have each connection object deal with transferring only one file. If you need to send more than one file, use multiple connection objects, that may or may not run in parallel.
  18. How big should the packets be?
  19. How does the process end? You don’t want to get into an endless ack the ack sequence. In the very worst case, you can recover from anything pathological when the receiver restarts the transfer.

    The sender will be the first to discover it has sent all the packets, (simply by maintaining an unsent packet count). Then it sends its status along with a request for status. If it does not hear back, it repeats this and eventually gives up, then throws away its bit map. Normally it will either get a request back for some missing packets, or an all-done status in which case it can shut down the channel.

    The receiver will eventually notice that it has received all packets. It sends its status then shuts down and throws away its bit map. It does not really matter if that final status packet got through. The sender will give up soon enough.

    If the receiver is expecting packets and does not hear anything, it can prod the sender with a status packet, telling it of the packets it is still waiting for. If it still hears nothing, after several attempts, it shuts down the channel, but keeps its bit map and restarts the entire process, perhaps after some long delay. If at any time the channel goes dead, the sender just gives up and forget the whole thing and the receiver must restart, sending its status describing the chunks it still wants.

    Because of the way the protocol works, in theory the receiver might be able to request just a sampling of the file. It would fake a restart, specifying just the chunks it wanted. To the sender it would look like a restart with a few missing packets.

  20. SAX might implement a total file hash as a double check on the individual packet checksums, but it would not recheck each packet with a second checksum. The file level checksum would have to be defined in such a way that it could be computed as the chunks arrived in higgledy-piggledy order. Either end could request a checksum of the other and any range of chunks. This feature could be used by a clever strategy routine to narrow down where the damage is and repair a damaged chunk that was corrupted in transmission but fluked out and passed the packet checksum test.
  21. You may find it simpler to organise the protocol as several intercommunicating threads, each with a single purpose, e. g. one thread may just send packets. Another might just receive them. Another might keeps track of whether the other end is still alive. Another ensures the other end knows we are still alive. You may find the logic of the protocol is thus simplified.
  22. Once you get he raw transfer going, add this wrinkle. With each file send a serialised descriptor object. The object contains fields such as:
  23. SAX dates back to 1985 when I hired a guy to implement a SAX-based EMAIL system based on dial up phone lines and the X.25 packet net. I was trying to figure out how you communicate over long distance telephone satellite links or packet nets where you have a very long ack turn-around delay. I was also concerned about how you communicate with someone in Africa in the field using primitive non-error correcting modems and noisy local phone lines.
  24. SAX seems a good fit with the Internet and its long ack times caused by packets pinging from hop to hop like balls in a pin ball machine. You would think implementing SAX atop UDP should be even simpler than implementing it directly via modem since you don’t have the problem of resyncing after an error to find the beginning of the next packet, and you don’t have the problem of reserving magic control characters that you must somehow quote when they occur in data. However, then you have a problem of flow control. You could have very badly mismatched speed between transmitter and receiver, a problem that does not come up in direct dial connections. So you might use TCP/IP just to get the flow control. Doing it with UDP, if you could handle the flow control problem even passably well, would probably be more efficient since you would save all the acks SAX tries so hard to avoid. Your flow control could be handled by the receiver sending a message to the sender, please send at x packets per minute. This way the receiver could deliberately choke down the traffic to conserve bandwidth, or run at its full receiving capacity.
  25. FTP has a problem if you want to maintain three mirror sites. You must upload a file three times over your narrow-band modem connection to the Internet. You should need to upload the file just once to one of the servers and then orchestrate a direct server to server transfer to copy to the other two servers. All you should need is the necessary read/download write/upload privilege at the servers. Ideally, this would happen transparently. You would just tell SAX you wanted a file sent to three places and it would figure out the fastest way to do it. FTP can do this now, but only if you have the rarely given shell script Telnet privilege. FTP can, in theory, can even handle such a third party transfer within the protocol itself, but the feature is rarely implemented. SAX would not require shell privilege, so it could be used by ordinary folk. On a cold night, years from now, you might even think about how you could use multicast packets to send to more than one site all at once.
  26. Eventually you want to encapsulate SAX so that it is as built-in as FTP. Instead of saying ftp:// in your HTML (Hypertext Markup Language) you should be able to say sax:// to download a file.
  27. FTP has a problem with conflict between uploaders and downloaders. If somebody is downloading a file off my website, I can’t upload a new version of it. I have to wait until he is finished downloading. If the file is popular, I may have to wait until 3 AM when there is a tiny window of opportunity when no-one is downloading it. Similarly, when I am in the process of uploading a file, no one can download it. SAX should be clever about this. When you upload a file that already exists, you upload to a temporary file, so you don’t block people from using the old version. When the transfer is complete, you can do a rename (if the OS (Operating System) is clever enough to allow renames of open files), or go into a wait loop until the file is free to do the rename. You need a much smaller time window in which to do the rename than you need to upload the entire file.
  28. Writing a smart file uploader in Java (that used renames and wait loops to avoid conflicts as described in the previous item) could be a little project all on its own, using ordinary FTP instead of SAX. You would use the ftpClient class methods. The advantage is, you would not need to convince your servers to install the SAX protocol. The disadvantage is it would not be as fast.
  29. The SAX protocol is very simple, or will be, once it is precisely nailed down. However, the strategy engine behind SAX can be as simple or sophisticated as you want. There is the challenge that allows for true competition between SAX implementors. There are some pieces in the SAX protocol that don’t appear to be used, such as receiver being able to request the sender’s bitmap status. These are for potential use by more sophisticated strategies and for troubleshooting/debugging.
  30. SAX extends trivially to download simultaneously from several servers at once. You just tell each server you are missing a different part of the file. You treat all messages coming in almost as if they came from a single source. Multi-server support will be useful in a Napster-like situtation where you are downloading from a mixture of PC (Personal Computer) s, servers, dialup etc, any of which could disconnect or suddenly drop in performance. You just carry on with whatever servers are still working, or add some more.
  31. SAX is similar to BitTorrent in many ways. However, BitTorrent is based on TCP/IP, where SAX is based on UDP.
  32. Your clients will likely have firewalls. They typically block all UDP traffic until configured to let it pass.
  33. Before you pump effort into implementing SAX, do a simple simulation over your lines to see how much of an improvement in speed and reduce packet usage you actually get over your particular nasty connections. If you don’t see much, you might as well do something much simpler based on TCP/IP.

Summary

Why bother with SAX ?
  1. You have flaky connections that tend to lose packets, either because of dialup over non-error-correcting modems, or network conjestion.
  2. You want automatic recovery from failed connections.
  3. You want to cut the number of packet sent in half, especially if you pay per packet sent rather than per byte sent.
BitTorrent
bulk file distributor
FAX
firewall
FTP
TCP/IP
UDP
X.25

This page is posted
on the web at:

http://mindprod.com/project/sax.html

Optional Replicator mirror
of mindprod.com
on local hard disk J:

J:\mindprod\project\sax.html
Canadian Mind Products
Please the feedback from other visitors, or your own feedback about the site.
Contact Roedy. Please feel free to link to this page without explicit permission.

IP:[65.110.21.43]
Your face IP:[18.221.238.204]
You are visitor number