Atomic FTP Uploader
©1996-2017 Roedy Green of Canadian Mind Products
Disclaimer
This essay does not describe an existing computer program, just one that should exist. This essay is about a suggested student project in
Java programming. This essay gives a rough overview of how it might work. I have no source, object, specifications, file layouts or anything
else useful to implementing this project. Everything I have prepared to help you is right here.
This project outline is not like the artificial, tidy little problems you are spoon-fed in school, where all the facts you need are included, nothing extraneous is mentioned and the answer is
fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems, where it is up to you to fully
define the end point, or a series of ever more difficult versions of this project, and to research the information yourself to solve them.
Everything I have to say to help you with this project is written below. I am not prepared to help you implement it or to give you any additional materials. I have too many
other projects of my own.
Though I am a programmer by profession, I don’t do people’s homework for them. That just robs them of an education.
You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.
Please do not email me about this project without reading the disclaimer above.
I added a new section on implementation details to this essay on 2005-07-08.
The Problem
FTP (File Transfer Protocol) software is notoriously difficult to use and notoriously
unreliable. I have tried dozens of packages. FTP
clients are all utterly hopeless at the basic task of keeping a server website identical
to the client side. They are really dinosaurs left over from the days when people
downloaded files over dial-up phone lines with FTP.
What are the problems?
- The software gets confused and fails to upload or delete files on the server, or
uploads them when it does not need to.
- When someone out on the net is reading a file on the server, that locks it from
being updated and bombs the update run.
- If I make a massive set of changes to the website, it may take hours to upload.
During that time people out on the web will see an incompatible mixture of old and new
files. I don’t want the new files to be visible until they are all ready. Uploads
should be atomic.
- Server and workstation clocks may be out of sync. This should not confuse the
software. Usually the workstation is the one in trouble. Its clock should be reset from
an atomic clock with software similar to SetClock. Your software should work even when the
server’s clock is badly out of whack or in a different time zone.
- You upload entire files even if only a few bytes in them have changed.
Ideally you would like to do this without running any software on the server, since
usually ISPs (Internet Service Providers)
will not let you, or will charge you considerably more if you do run your own software.
Approaches
- Upload files to a different directory branch. When they are all ready, delete the
master and rename the uploaded files. It might be possible to do this without
server-side code since FTP supports a rename function. I know of no product that
does this. It would be very useful, since a website can fail when you have half old and
half new files being served to the public, or old files referencing images in the
process of being deleted or renamed. A minimal sketch of this staging-then-rename trick appears after this list.
- Use the Replicator as a core. You upload
replicator-style zips to the server and, only once they are all uploaded, unzip them.
If a file is busy, you get on with unzipping the next one and put the busy file on a queue
to handle later.
- Use the Subversion version control as a core. From the workstation’s point of
view, it is just like checking in changes to a set of source files to version control.
How do you get the files in shape for the HTML (Hypertext Markup Language)
server?
- You write an HTML server that asks the Subversion server to
reconstruct the file each time.
- You persuade Subversion to export/publish flat files after an update.
- You use a client version of Subversion to extract the changed files, but run it
on the server (also sketched after this list).
Subversion handles the problems of atomicity and picking up after a disconnect
in the middle of an upload. It is also smart about only uploading changes, thus
saving bandwidth.
- Look into Rsync for site
mirroring. You can use the --delay-updates option for
reasonable atomicity. It has the usual Unix utility problems: novice-unfriendly
documentation and the need to tweak and compile source code for your particular
server. Perhaps you might write a wrapper to hide Rsync’s installation
complexities.
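To make the first approach concrete, here is a minimal sketch of the staging-then-rename idea, assuming the Apache Commons Net FTPClient library. The host, credentials and directory names are placeholders, not a real configuration:

```java
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class StagedUpload {
    public static void main(String[] args) throws IOException {
        FTPClient ftp = new FTPClient();
        ftp.connect("ftp.example.com");          // placeholder host
        ftp.login("user", "password");           // placeholder credentials
        ftp.enterLocalPassiveMode();
        ftp.setFileType(FTP.BINARY_FILE_TYPE);

        // 1. Upload everything into a staging directory the public never sees.
        ftp.makeDirectory("/staging");
        try (InputStream in = Files.newInputStream(Path.of("index.html"))) {
            ftp.storeFile("/staging/index.html", in);
        }
        // ... repeat for every changed file ...

        // 2. Only once the whole set is safely uploaded, swap each file into
        //    place with a quick delete plus rename (FTP's RNFR/RNTO commands).
        ftp.deleteFile("/htdocs/index.html");
        ftp.rename("/staging/index.html", "/htdocs/index.html");

        ftp.logout();
        ftp.disconnect();
    }
}
```

The final delete-and-rename pass is quick, but note FTP gives you no way to make the whole set of renames one atomic operation; the window of inconsistency merely shrinks to a few seconds.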
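The Subversion option of running a client on the server could be as simple as a scheduled job that pulls the latest revision into the web root. A sketch, assuming the svn command-line client is installed on the server and that /var/www/htdocs is a checked-out working copy (both assumptions, not givens):

```java
import java.io.IOException;

public class SvnPull {
    public static void main(String[] args) throws IOException, InterruptedException {
        // "svn update" fetches only the files that changed since the last run,
        // which gives you the bandwidth savings mentioned above for free.
        Process p = new ProcessBuilder("svn", "update", "/var/www/htdocs")
                .inheritIO()   // let svn's output flow to this job's console/log
                .start();
        int exitCode = p.waitFor();
        if (exitCode != 0) {
            System.err.println("svn update failed with exit code " + exitCode);
        }
    }
}
```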
Implementation Details
I would like to see a product
specialized for FTP uploads that runs unattended. It would work in conjunction
with my Replicator software
for automatically distributing and keeping large file sets up to date without needing any
server-side software.
However, you don’t have to know a thing about the Replicator to understand this project. I am just telling you I have a
couple of paying customers ready for you if you decide to write this.
What I need is a streamlined FTP upload-only program designed specifically to upload
website files to a server, with the following features:
- There is no human operator, just a script running it saying which directories to
upload to which websites.
- It must try heroically to do the upload, redoing later any file it has trouble with.
Often files temporarily cannot be updated because someone is downloading
them (see the retry sketch after this list).
- It should only upload files if they have changed (one clock-independent way to detect this is sketched after this list).
- If someone deletes, adds or updates files, uploads old files,
creates or deletes directories, or restores from backup to the website behind its
back, this proposed uploader should notice and solve the problem all on its own without
human help. It also deletes files that no longer exist in the master tree. The script
defines rules to prevent it from deleting files from the website that don’t
belong to it, e.g. a set of wildcards (*.cnt) to tell it
which files to leave alone.
- It should set the ERRORLEVEL so that the program or bat file that spawned it can
tell how successful it was.
- Design the program to expect disconnections. Just pick up and carry on where you
left off. Presume a permanent Internet connection rather than dial-up, for
simplicity.
- I would like updates to be almost atomic. By that I mean
the outside world viewing my uploaded website sees no changes until the updated files
have all been successfully uploaded. Only then are they instantly
revealed, within a few seconds, by a set of quick deletes and renames.
I want the atomicity because it can take a long time to update the website if
there have been global changes. For an hour or two the entire website is half under
the old scheme and half under the new, and nothing works properly.
Further, my webserver is not very clever. It won’t let me upload a file if
anyone out there on the web is reading it. With NetLoad, which I use now, that aborts
the entire run, which forces me to do my big uploads late at night when traffic is
lower.
Doing it with renames makes me much less vulnerable to a file being locked in use
and unchangeable. Further, this way, I take the file out of commission from the
outside world for only a second or two, not for the entire upload.
- You can use two different strategies for doing the deletes and renames.
- Do all the deletes and then all the renames. This takes files out of commission longer,
but never exposes the website in an inconsistent state, just a lot of
file-not-founds.
- When all the new versions are uploaded, delete the old and rename the new in
pairs. This is not quite as atomic, but it leaves webpages out of commission for less
time.
- The strategy should be configurable, unless you can think of even better
ones.
- In either case, you have to deal with a stubbornly locked file that is constantly
busy being downloaded, or that the OS (Operating System)
thinks is locked when it really is not, a state that won’t clear for hours. You
eventually have to handle it, even if not until tomorrow.
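Here is a sketch of that heroic-retry idea: files that fail (for example, because a visitor is mid-download) go back on a queue and are retried with a growing delay instead of aborting the run. The uploadOnce method is a hypothetical stand-in for the real FTP transfer:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RetryQueue {
    /** Stand-in for the real FTP transfer; returns false when a file is busy. */
    static boolean uploadOnce(String file) {
        return Math.random() > 0.2;   // placeholder for a real attempt
    }

    public static void main(String[] args) throws InterruptedException {
        Deque<String> pending = new ArrayDeque<>();
        pending.add("index.html");
        pending.add("logo.png");

        long delayMillis = 1_000;
        while (!pending.isEmpty()) {
            int size = pending.size();
            for (int i = 0; i < size; i++) {
                String file = pending.removeFirst();
                if (!uploadOnce(file)) {
                    pending.addLast(file);   // busy; try again next pass
                }
            }
            if (!pending.isEmpty()) {
                Thread.sleep(delayMillis);   // back off before the next pass,
                delayMillis = Math.min(delayMillis * 2, 60 * 60 * 1_000); // cap at an hour
            }
        }
    }
}
```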
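And a sketch of clock-independent change detection: remember a digest of each file as it was last uploaded, and re-upload only when the digest changes, so the server’s out-of-whack clock never matters. The cache file name digests.properties is purely illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Properties;

public class ChangeDetector {
    private final Properties lastUploaded = new Properties();
    private final Path cacheFile = Path.of("digests.properties");

    public ChangeDetector() throws IOException {
        if (Files.exists(cacheFile)) {
            try (var in = Files.newInputStream(cacheFile)) {
                lastUploaded.load(in);   // digests recorded on previous runs
            }
        }
    }

    /** True if the file differs from what we last uploaded. */
    public boolean needsUpload(Path file) throws IOException {
        return !sha256(file).equals(lastUploaded.getProperty(file.toString()));
    }

    /** Call after a successful upload so the next run skips this file. */
    public void markUploaded(Path file) throws IOException {
        lastUploaded.setProperty(file.toString(), sha256(file));
        try (var out = Files.newOutputStream(cacheFile)) {
            lastUploaded.store(out, "digests of files as last uploaded");
        }
    }

    private static String sha256(Path file) throws IOException {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(md.digest(Files.readAllBytes(file)));
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("SHA-256 is always available", e);
        }
    }
}
```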
I have been so frustrated with GUI-style FTP
programs for uploading websites that I put the task of writing my own
implementation of this student project on my to-do list. It is a much simpler beast than something like
FTP Voyager. You might use a
GUI (Graphic User Interface) like FTP Voyager to compose the connection information
and test the configurations out, so that you don’t have to compose that stuff from
scratch in your scripts. Hopefully you can find a companion GUI
that will export that information in an easy-to-use format.
From-Scratch FTP Replacement
The FTP protocol is old and has a number of disadvantages:
- It does not preserve timestamps.
- It gets confused by time zones and DST (Daylight Saving Time).
- It is slow at uploading a number of small files.
- It is not secure.
- It does no compression.
- It does not do deltas. It always sends entire files even if just one byte has
changed.
- It is not atomic. An upload is revealed to the public a file at a time. This can
lead to files pointing to ones that have not been uploaded yet. Ideally the upload
should appear to the public all at once.
- If someone is downloading a file that is being uploaded, the entire session
aborts.
Perhaps what is needed is a completely fresh start. The catch is that you then need to
write software for both the client and the server. It might work something like the Replicator.
You might implement deltas, compression, UDP (User Datagram Protocol),
a SAX-like protocol, automatic recovery from disconnect…
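As one illustration of what the delta feature might look like, here is a toy block-level diff: split the file into fixed-size blocks, hash each, and ship only the blocks whose hashes differ from the server’s copy. A real protocol (rsync-style rolling checksums that survive insertions, handling of truncated files) is considerably more involved; this is only a starting point:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockDelta {
    static final int BLOCK_SIZE = 4096;

    /** Indices of blocks in newContent that differ from oldContent. */
    static List<Integer> changedBlocks(byte[] oldContent, byte[] newContent)
            throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        List<Integer> changed = new ArrayList<>();
        int blocks = (newContent.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
        for (int i = 0; i < blocks; i++) {
            // digest() resets the MessageDigest, so it can be reused per block
            byte[] newHash = md.digest(block(newContent, i));
            byte[] oldHash = md.digest(block(oldContent, i));
            if (!Arrays.equals(newHash, oldHash)) {
                changed.add(i);   // only these blocks need to cross the wire
            }
        }
        // Note: if the new file is shorter, truncation must be signalled separately.
        return changed;
    }

    /** The i-th block of content, or an empty array past end of file. */
    static byte[] block(byte[] content, int i) {
        int from = i * BLOCK_SIZE;
        if (from >= content.length) return new byte[0];
        return Arrays.copyOfRange(content, from,
                Math.min(from + BLOCK_SIZE, content.length));
    }
}
```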