Atomic FTP Uploader
©1996-2009 Roedy Green, Canadian Mind Products
This essay does not describe an existing computer program, just
one that should exist. This essay is about a suggested student project in
Java programming. This essay gives a rough overview of how it
might work. I have no source, object, specifications, file layouts or
anything else useful to implementing this project.
This project outline is not like the artificial tidy problems you are spoon-fed
in school, when all the facts you need are included, nothing extraneous is
mentioned, the answer is fully specified, along with hints to nudge you toward a
single expected canonical solution. This project is much more like the real
world of messy problems where it is up to you to fully define the end point,
or a series of ever more difficult versions of this project, and research the
information yourself to solve them.
Everything I have to say to help you with this project is written below. I am not
prepared to help you implement it or give you any additional materials. I have
too many other projects of my own.
Though I am a programmer, I don’t do people’s homework
for them. That just robs them of an education.
You have my full permission to implement this project in any way you please and
to keep all the profits from your endeavor.
Please do not email me about this project without reading the disclaimer above.
I added a new section on implementation details
to this essay on 2005-07-08.
The Problem
FTP software is notoriously difficult to use and notoriously unreliable. I have
tried dozens of packages. FTP clients are all utterly hopeless at the basic task
of keeping a server website identical to the client side. They are really
dinosaurs left over from the days when people downloaded files over dial-up
phone lines with FTP.
What are the problems?
- The software gets confused and fails to upload or delete files on the server, or
uploads them when it does not need to.
- When someone out on the net is reading a file on the server, that locks it from
being updated, and bombs the update run.
- If I make a massive set of changes to the website, it may take hours to upload.
During that time people out on the web will see an incompatible mixture of old
and new files. I don’t want the new files to be visible until they are all
ready. Uploads should be atomic.
- Server and workstation clocks may be out of sync. This should not confuse the
software. Usually the workstation is the one in trouble. Its clock should be
reset from an atomic clock with software similar to SetClock.
Your software should work even when the server’s clock is badly out of
whack or in a different time zone.
- You upload entire files even if only a few bytes in them have changed.
Ideally you would like to do this without running any software on the server,
since usually ISPs will not let you, or will charge you considerably more if you
do run your own software.
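One way to keep clock and time zone trouble from confusing the software is never to compare timestamps at all. The following is a minimal sketch, assuming the uploader keeps a local manifest (relative path mapped to a content digest) recording what it last uploaded successfully. The class name ManifestDiff and the manifest itself are illustrative assumptions, not part of any existing product. A manifest only records what the uploader itself did, so it cannot by itself notice changes made on the server by someone else.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: decides what to upload or delete by comparing content digests, never clocks. */
final class ManifestDiff {

    /** Hex SHA-256 digest of a file's contents. */
    static String digest(byte[] contents) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(contents)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is a standard algorithm name
        }
    }

    /** Paths whose digest is new or differs from the last successful upload. */
    static Set<String> toUpload(Map<String, String> local, Map<String, String> lastUploaded) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, String> e : local.entrySet()) {
            if (!e.getValue().equals(lastUploaded.get(e.getKey()))) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Paths that were uploaded before but no longer exist in the master tree. */
    static Set<String> toDelete(Map<String, String> local, Map<String, String> lastUploaded) {
        Set<String> result = new TreeSet<>(lastUploaded.keySet());
        result.removeAll(local.keySet());
        return result;
    }
}
```

Both maps go from a relative path such as index.html to the digest of that file. The uploader would rewrite the manifest only after a run completes, so an interrupted run simply repeats the unfinished work next time.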
Approaches
- Upload files to a different directory branch. When they are all ready, delete
the master, and rename the uploaded files. It might be possible to do this
without server-side code since FTP supports a rename function. I know of no
product that does this. It would be very useful since a website can fail when
you have half old and half new files being served to the public, or old files
referencing images in the process of being deleted or renamed.
- Use the Replicator as a core. You upload Replicator-style zips to the server, and
only once they are all uploaded, unzip them. If a file to be replaced is busy,
you get on with unzipping the next file and put the busy one on a queue to
handle later.
- Use the Subversion version control as a core. From the workstation’s point
of view, it is just like checking in changes to a set of source files to version
control. How do you get the files in shape for the HTML server?
- You write an HTML server that asks the Subversion server to reconstruct the file
each time.
- You persuade Subversion to export/publish flat files after an update.
- You use a client version of Subversion to extract the changed files, but run it
on the server.
Subversion handles the problems of atomicity, and picking up after a disconnect
in the middle of an upload. It also is smart about only uploading changes, thus
saving bandwidth.
- Look into Rsync for site
mirroring. You can use the --delay-updates option for
reasonable atomicity. It has the usual Unix utility problems, novice-unfriendly
documentation, and the need to tweak and compile source code for your particular
server. Perhaps you might write a wrapper to hide Rsync’s installation
complexities.
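The idea in the Replicator approach above of getting on with the next file and queueing a busy one for later can be sketched as follows. This is only an illustration: the class RetryQueue, the Predicate standing in for "try to install this file", and the maxPasses limit are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch: process every file, deferring busy ones instead of aborting the run. */
final class RetryQueue {

    /**
     * @param files     files to process, in order
     * @param attempt   assumed callback: returns true if the file was handled, false if it was busy
     * @param maxPasses how many times to go through the outstanding files before giving up
     * @return files that were still busy after the last pass
     */
    static List<String> processAll(List<String> files, Predicate<String> attempt, int maxPasses) {
        Deque<String> pending = new ArrayDeque<>(files);
        for (int pass = 0; pass < maxPasses && !pending.isEmpty(); pass++) {
            Deque<String> deferred = new ArrayDeque<>();
            while (!pending.isEmpty()) {
                String file = pending.removeFirst();
                if (!attempt.test(file)) {
                    deferred.addLast(file); // busy: carry on with the next file, retry this one later
                }
            }
            pending = deferred;
        }
        return new ArrayList<>(pending);
    }
}
```

A real program would presumably wait between passes, and would report whatever comes back in the returned list rather than silently dropping it.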
Implementation Details
I would like to see a product specialized for FTP uploads that runs unattended.
It would work in conjunction with my Replicator
software for automatically distributing and keeping large file sets up to date
without needing any server-side software.
However, you don’t have to know a thing about the Replicator to understand
this project. I am just telling you I have a couple of paying customers ready
for you if you decide to write this.
What I need is a streamlined FTP upload-only program designed specifically to
upload website files to a server, with the following features:
- There is no human operator, just a script running it saying which directories to
upload to which websites.
- It must try heroically to do the upload, redoing any file it has trouble with
later. Often files cannot be temporarily updated because someone is downloading
them.
- It should only upload files if they have changed.
- If someone, behind its back, deletes, adds or updates files, uploads old files,
creates or deletes directories, or restores the website from backup, this
proposed uploader should notice and solve the problem all on its
own without human help. It also deletes files that no longer exist in the master
tree. The script defines rules to prevent it from deleting files from the
website that don’t belong to it, e.g. a set of wildcards (*.cnt)
to tell it which files to leave alone.
- It should set the ERRORLEVEL so that the program or bat file that spawned it can
tell how successful it was.
- Design the program to expect disconnections. Just pick up and carry on where you
left off. Presume a permanent Internet connection rather than dialup for
simplicity.
- I would like updates to be almost atomic. By that I
mean, the outside world viewing my uploaded website sees no changes until the
updated files have all been successfully uploaded. Only then are they
revealed, within a few seconds, by a set of quick deletes and renames.
I want the atomicity because it can take a long time to update the website if
there have been global changes. For an hour or two the entire website is half
working under the old scheme and half the new and nothing works properly.
Further my webserver is not very clever. It won’t let me upload a file if
anyone out there on the web is reading it. With NetLoad, which I use now, that
aborts the entire run, which forces me to do my big uploads late at night when
traffic is lower.
Doing it with renames makes me much less vulnerable to a file being locked in
use and unchangeable. Further, this way, I take the file out of commission from
the outside world for only a second or two, not for the entire upload.
- You can use two different strategies for doing the deletes and renames.
- Do the deletes and then all renames. This takes files out of commission longer,
but never exposes the website in an inconsistent state, just a lot of
file-not-found errors.
- When all the new versions are uploaded, delete each old file and rename its
replacement, one pair at a time. This is not quite as atomic, but it leaves
webpages out of commission less time.
- The strategy should be configurable, if you can’t think of even better
ones.
- In either case, you have to deal with a stubbornly locked file, one that is
constantly busy being downloaded, or that the OS thinks is locked when it is not
really, and that won’t clear for hours. You have to handle it eventually, even
if tomorrow.
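The two delete-and-rename strategies above differ only in the order of the final commands. Here is a minimal sketch that produces that order, assuming the new versions have already been uploaded to a separate staging directory on the same server. DELE, RNFR and RNTO are the standard FTP commands for delete and rename. The class SwapPlanner, the enum names and the directory names are hypothetical. Whether a given server permits renaming across directories is something you would have to test.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: order the final DELE / RNFR / RNTO commands under the two strategies. */
final class SwapPlanner {

    enum Strategy { DELETES_THEN_RENAMES, DELETE_RENAME_PAIRS }

    /**
     * @param relativePaths files already fully uploaded under stagingDir
     * @param stagingDir    server directory holding the new versions (assumed layout)
     * @param liveDir       server directory the public sees (assumed layout)
     */
    static List<String> plan(List<String> relativePaths, String stagingDir, String liveDir,
                             Strategy strategy) {
        List<String> commands = new ArrayList<>();
        if (strategy == Strategy.DELETES_THEN_RENAMES) {
            // All old files vanish first, then all new files appear.
            for (String p : relativePaths) {
                commands.add("DELE " + liveDir + "/" + p);
            }
            for (String p : relativePaths) {
                addRename(commands, stagingDir + "/" + p, liveDir + "/" + p);
            }
        } else {
            // Each file is out of commission only for its own delete and rename.
            for (String p : relativePaths) {
                commands.add("DELE " + liveDir + "/" + p);
                addRename(commands, stagingDir + "/" + p, liveDir + "/" + p);
            }
        }
        return commands;
    }

    /** An FTP rename is a two-command sequence: RNFR (rename from), then RNTO (rename to). */
    private static void addRename(List<String> commands, String from, String to) {
        commands.add("RNFR " + from);
        commands.add("RNTO " + to);
    }
}
```

The sketch issues a DELE for every path, so a real program would have to tolerate the delete failing for a brand-new file that has no old version, and would need to queue any file whose delete or rename is refused because it is locked.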
I have been so frustrated with GUI-style FTP programs for uploading websites,
that I put on my to do list the task of writing my own implementation of this
student project. It is a much simpler beast than something like FTP
Voyager. You might use a GUI like FTP Voyager to compose the connection
information and test the configurations out so that you don’t have to
compose that stuff from scratch in your scripts. Hopefully you can find a
companion GUI that will export that information in an easy-to-use format.
You can get started with Peter van der Linden’s little LinLyn
FTP class and by watching the conversations back and forth between a GUI-style
FTP client and an FTP server during an upload. It is much simpler than you might
imagine.
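When you watch such a conversation, you will see the client send short text commands (USER, PASS, TYPE, PASV, STOR and so on) and the server answer each with a line that starts with a three-digit reply code. The first digit tells you most of what an unattended program needs to know: 2 means success, 4 means a transient failure worth retrying (450 is the standard code for a file that is busy), and 5 means a permanent failure. Below is a minimal sketch of interpreting one reply line. The class FtpReply is hypothetical and is not taken from any particular FTP library.

```java
/** Hypothetical sketch: interpret one line of an FTP server reply from the control connection. */
final class FtpReply {
    final int code;        // three-digit reply code, e.g. 226
    final boolean isLast;  // false for a "ddd-text" line, which opens a multi-line reply
    final String text;     // human-readable remainder of the line

    private FtpReply(int code, boolean isLast, String text) {
        this.code = code;
        this.isLast = isLast;
        this.text = text;
    }

    /** Parses a line that starts with a reply code; other lines inside a multi-line reply are not handled. */
    static FtpReply parse(String line) {
        if (line.length() < 3) {
            throw new IllegalArgumentException("not a reply line: " + line);
        }
        for (int i = 0; i < 3; i++) {
            char c = line.charAt(i);
            if (c < '0' || c > '9') {
                throw new IllegalArgumentException("not a reply line: " + line);
            }
        }
        int code = Integer.parseInt(line.substring(0, 3));
        boolean isLast = line.length() == 3 || line.charAt(3) != '-';
        String text = line.length() > 4 ? line.substring(4) : "";
        return new FtpReply(code, isLast, text);
    }

    /** 2xx: the command completed. */
    boolean isSuccess() { return code >= 200 && code < 300; }

    /** 4xx: try again later, for example because the file is busy. */
    boolean isTransientFailure() { return code >= 400 && code < 500; }

    /** 5xx: retrying the same command unchanged will not help. */
    boolean isPermanentFailure() { return code >= 500 && code < 600; }
}
```

An uploader built on this would queue a file for a later attempt when isTransientFailure() is true, and record an error for the final exit status when isPermanentFailure() is true.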
From Scratch, FTP Replacement
FTP protocol is old and has a number of disadvantages:
- It does not preserve timestamps.
- It gets confused by timezones and DST.
- It is slow at uploading a number of small files.
- It is not secure.
- It does no compression.
- It does not do deltas. It always sends entire files even if just one byte has
changed.
- It is not atomic. An upload is revealed to the public a file at a time. This can
lead to files pointing to ones that have not been uploaded yet. Ideally the
upload should appear to the public all at once.
- If someone is downloading a file that is being uploaded, the entire session
aborts.
Perhaps what is needed is a completely fresh start. The catch is then you need
to write software both for the client and server. It might work something like
the Replicator.
You might implement deltas, compression, UDP, SAX-like protocol, automatic
recovery from disconnect…
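To give a feel for the deltas idea, here is a deliberately crude sketch that compares two versions of a file in fixed-size blocks and reports which blocks of the new version would have to be sent. It assumes the sender has both versions on hand. A real protocol would more likely exchange block checksums with the server, and would need something smarter to cope with insertions that shift every later block. The class name BlockDelta is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: find which fixed-size blocks of a file changed between two versions. */
final class BlockDelta {

    /**
     * @return indexes of the blocks of newVersion that differ from the same block of
     *         oldVersion, or that extend past the end of oldVersion
     */
    static List<Integer> changedBlocks(byte[] oldVersion, byte[] newVersion, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<Integer> changed = new ArrayList<>();
        int index = 0;
        for (int start = 0; start < newVersion.length; start += blockSize) {
            int end = Math.min(start + blockSize, newVersion.length);
            boolean same = end <= oldVersion.length
                    && Arrays.equals(Arrays.copyOfRange(oldVersion, start, end),
                                     Arrays.copyOfRange(newVersion, start, end));
            if (!same) {
                changed.add(index);
            }
            index++;
        }
        return changed;
    }
}
```

The receiver would also need the new file length so that it can truncate when the file has shrunk.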