Automatic File Updates
by Roedy Green ©1996-2008 Canadian Mind Products
This essay describes a suggested student project in Java programming. It gives
a rough overview of how the project might work. It does not describe an actual
complete program. I have no source, object code, specifications, file layouts
or anything else useful to implementing this project. Everything I have to say
to help you with this project is written below. I am not prepared to help you
implement it; I have too many other projects of my own.
I do contract work for a living, which could include writing a program such as
this. However, I don’t do people’s homework
for them. That just robs them of an education.
You have my full permission to implement this project any way you please.
The intent of this project is to keep any files (including *.jar
and *.zip files) up-to-date on the client site,
automatically. It is designed to dovetail with the Delta
File Creator that makes the process even more efficient by sending just the
parts of files that have changed. This project sends entire files, or entire zip
members, if so much as a comma changes in them.
There are already some push tools that update files on
the client sites. These include Marimba,
Java Web Start, DMP,
Funduc Patch, and Symantec
LiveUpdate. However, they are not suitable for two simple applications I
have:
- Keeping people’s expanded copies of cmp*.zip up-to-date.
These are giant files containing all the *.html, *.gif,
*.jpg and *.mp3 files on my
website in downloadable zipped format.
- Keeping a JavaHelp.jar file up-to-date on a client
site.
What are the problems?
- Every time even so much as a comma changes in a file, the client must download
the whole thing.
- Some tools require running custom software on the server. ISPs don’t like you
doing that, at least not without paying them a fat fee.
- The tools may be too expensive for the hobbyist.
- You can’t FTP upload the master copy of the files on the server because people
are always busy downloading them.
How are Automatic Update Files Stored on the Server?
Instead of storing entire jars or zips on the server, the members of the jars
are stored as separate *.upd files on the server. The
files are given sequentially numbered names, e.g. 00000042.upd.
The server is a perfectly standard HTTP server. The only thing you need to
configure on your server is a new MIME type
for the *.upd extension for your component files. When
the component of a jar is updated, the new contents are assigned a new
sequential number. That way there is no problem uploading a file to the server
that others are downloading. Any change gets a new number. You retire the
numbers of files that have been updated. All the *.upd files are stored in GZIP
compressed format. Content files in jars and zips are thus compressed only once.
*.upd files representing stand-alone files are compressed too, so stand-alone
files that are already compressed are effectively compressed twice. Fanatics
might want to invent a way to avoid that tiny extra overhead.
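The naming and packing rules above can be sketched in a few lines of Java. This is only an illustration: the class name UpdFiles and its methods are hypothetical, and the eight-digit width is taken from the 00000042.upd example. It uses the standard java.util.zip GZIP streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper: names and GZIP-packs content for transport as *.upd files. */
final class UpdFiles {
    private UpdFiles() {}

    /** Zero-pads the sequence number to eight digits, e.g. 42 becomes "00000042.upd". */
    static String updName(long sequenceNumber) {
        return String.format("%08d.upd", sequenceNumber);
    }

    /** Compresses the raw bytes of a file or jar member into GZIP format. */
    static byte[] pack(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The GZIP stream is closed (and its trailer written) before we read the buffer.
        return buffer.toByteArray();
    }

    /** Reverses pack(): decompresses the body of a downloaded *.upd file. */
    static byte[] unpack(byte[] packed) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            return gzip.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A real tool would stream to and from disk rather than hold whole files in memory; byte arrays just keep the sketch short.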
How are Automatic Update Files Stored on the Client?
The files are conventional jar, zip, program or stand-alone data files. They can
have any name or extension. They can live anywhere on disk. They may be
compressed or uncompressed in any conceivable format. Jars can even be digitally
signed. They are not permanently stored as *.upd files.
The master copies of the files created by the programmers are also maintained in
this conventional way. No one is aware of the *.upd files. They are a transport
mechanism only. The conversion to and from the *.upd
files is fully automatic.
How Does Automatic Update Work?
There is a tiny root file on the server. It contains
the sequential number of the current state-of-the-union
file. The state-of-the-union file is also stored on the server in GZIP
compressed form. It has an entry containing the following data for each
file/member managed by the file updating system.
State-Of-The-Union File Fields

| Field | Purpose |
|---|---|
| Status | A=Active, R=Replaced, D=Deleted. Active means this file is necessary for the client. Replaced means this file has been replaced by some other version. Deleted means this file is no longer used. To start a client from scratch, all you need do is examine the Active entries. Usually all the Replaced entries will be filtered out. They are there mainly for debugging. Similarly, very old Deleted entries might be filtered out. |
| Sequence Number | Appending .upd will give you the name of the corresponding file on the server. All the files for a project are stored in the same directory on the server, even if they are stored in many different directories on the client. |
| Install Root Code | Usually this will be 1 to mean that files are installed relative to the client’s installation directory. For a complex project, you may have multiple installation directories. 0 means file names are absolute. |
| File Name | The fully qualified filename of where this content eventually ends up on the client. |
| Member Name | The fully qualified filename of the jar or zip entry. If this is blank, this entry represents a standalone file. |
| Date/Time Updated | Milliseconds since 1970 GMT, as in a Java timestamp. This is used to set or check the file’s system date. |
| Checksum | 32-bit Adler checksum (Adler-32) of the data. It is computed on the uncompressed form of the file. Adler checksums are faster to compute and verify than most other types, such as CRC-32. |
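As a rough illustration, one entry of the state-of-the-union file could be held in memory like this. The class name UnionEntry, its field names and its constructor are hypothetical, since no file layout is specified here. Only java.util.zip.Adler32 is a real library class.

```java
import java.util.zip.Adler32;

/** Hypothetical in-memory form of one state-of-the-union entry. */
final class UnionEntry {
    enum Status { ACTIVE, REPLACED, DELETED }

    final Status status;
    final long sequenceNumber;   // appending ".upd" names the file on the server
    final int installRootCode;   // 0 = absolute file names, 1 = relative to install directory
    final String fileName;       // where the content ends up on the client
    final String memberName;     // jar/zip member name, blank for a standalone file
    final long updatedMillis;    // milliseconds since 1970 GMT
    final long checksum;         // Adler-32 of the uncompressed data

    UnionEntry(Status status, long sequenceNumber, int installRootCode, String fileName,
               String memberName, long updatedMillis, long checksum) {
        this.status = status;
        this.sequenceNumber = sequenceNumber;
        this.installRootCode = installRootCode;
        this.fileName = fileName;
        this.memberName = memberName;
        this.updatedMillis = updatedMillis;
        this.checksum = checksum;
    }

    /** A blank member name marks a standalone file rather than a jar/zip member. */
    boolean isStandalone() {
        return memberName.isEmpty();
    }

    /** Computes the Adler-32 checksum of uncompressed content. */
    static long adler32(byte[] uncompressed) {
        Adler32 adler = new Adler32();
        adler.update(uncompressed, 0, uncompressed.length);
        return adler.getValue();
    }

    /** True if the given uncompressed content matches the recorded checksum. */
    boolean matches(byte[] uncompressed) {
        return adler32(uncompressed) == checksum;
    }
}
```

Adler32.getValue() returns the 32-bit checksum in a long, so the checksum field is a long that never goes negative.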
When the client wants to refresh its files, it first downloads the tiny root
file. From there it can download and decompress the current state-of-the-union
file. It knows it already has current files, up to and including sequence number
N. It knows this even if it had to restore its data files from backup. It then
looks in the state-of-the-union file and processes the entries. If it sees a
Deleted entry, it deletes the corresponding file or member. If the zip or jar
then has no more members, it deletes the file itself. If it sees a Replaced
entry, it ignores it. If it sees an Active entry, it inserts/replaces that file
or member. It may optionally verify the checksum of newly updated or all active
members. If there are failures, it can automatically redownload any failed
entries, and optionally even totally recreate any jar/zip files from
scratch. This makes your applications and files self-healing.
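The refresh pass might be planned along the following lines. This is a sketch under stated assumptions, not a specification. The names RefreshPlanner, Entry and Step are hypothetical. The essay does not say whether a deletion gets its own sequence number, so the sketch assumes deletes are cheap to repeat: it applies every Deleted entry and installs only Active entries numbered above N.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical planner that turns state-of-the-union entries into client actions. */
final class RefreshPlanner {
    enum Status { ACTIVE, REPLACED, DELETED }
    enum Action { INSTALL, DELETE }

    /** Minimal stand-in for a state-of-the-union entry. */
    static final class Entry {
        final Status status;
        final long sequenceNumber;
        final String target;   // file name, plus member name if any
        Entry(Status status, long sequenceNumber, String target) {
            this.status = status;
            this.sequenceNumber = sequenceNumber;
            this.target = target;
        }
    }

    /** One thing the client must do. */
    static final class Step {
        final Action action;
        final Entry entry;
        Step(Action action, Entry entry) {
            this.action = action;
            this.entry = entry;
        }
    }

    /**
     * Plans a refresh for a client that already holds everything up to and
     * including highestApplied (the N of the text).
     */
    static List<Step> plan(List<Entry> entries, long highestApplied) {
        List<Step> steps = new ArrayList<>();
        for (Entry e : entries) {
            switch (e.status) {
                case ACTIVE:
                    // Only content newer than what the client holds needs downloading.
                    if (e.sequenceNumber > highestApplied) {
                        steps.add(new Step(Action.INSTALL, e));
                    }
                    break;
                case DELETED:
                    // Assumption: deleting an already-absent file is harmless, so always apply.
                    steps.add(new Step(Action.DELETE, e));
                    break;
                case REPLACED:
                    break;   // superseded by a later Active entry, nothing to do
            }
        }
        return steps;
    }
}
```

The caller would then carry out each step, downloading the matching *.upd file for an INSTALL and removing the file or member for a DELETE.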
You need a tool to help you prepare your *.upd files
for uploading to the server. It starts with a list of directories and files to
process. It detects file and member changes via file and member dates or
possibly with the checksums, or even with comparison with your current set of *.upd
files. It is probably easiest to use file dates exclusively in determining which *.upd
files to create, and have a checksum verify routine you run periodically. If you
get a failure, you manually redate the affected files with a touch utility to
force a correction.
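Date-based detection could look something like the sketch below. It assumes, hypothetically, that the preparation tool remembers the timestamp it last published for each file or member name and compares that against the current timestamps. The class ChangeDetector and its methods are invented for illustration.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical date-only change detector for the *.upd preparation tool. */
final class ChangeDetector {
    private ChangeDetector() {}

    /** Names that are new, or whose timestamp differs from the one last published. */
    static SortedSet<String> changed(Map<String, Long> published, Map<String, Long> current) {
        SortedSet<String> result = new TreeSet<>();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            Long old = published.get(e.getKey());
            if (old == null || !old.equals(e.getValue())) {
                result.add(e.getKey());   // needs a fresh *.upd file with a new number
            }
        }
        return result;
    }

    /** Names that were published before but no longer exist in the master copy. */
    static SortedSet<String> deleted(Map<String, Long> published, Map<String, Long> current) {
        SortedSet<String> result = new TreeSet<>(published.keySet());
        result.removeAll(current.keySet());
        return result;
    }
}
```

Because only dates are compared, a file that was merely redated shows up as changed.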
How do you handle a file whose date has changed, but whose contents have not?
You could:
- Avoid changing file dates on your master files unless you really change them.
- Propagate the redated file as if it were a truly updated file.
- Fix the original file date back to the previous date.
- Ignore the problem. Leave the file with the old date on the client site. Don’t
worry that its date does not exactly match.
- Treat this as a special case of delta compression. See below.
- Invent a special propagation mechanism that changes file dates, without actually
transmitting the corresponding files.
What if the client refreshes so infrequently that the necessary Delete
entries are no longer present?
- You might not sweat it, and just leave the ancient deleted files lying around on
the client site.
- You might tell the client to start from scratch.
- It is probably easiest to just keep all Delete entries present forever.
Deletes are fairly rare in comparison with Replaces.
- Treat this with multiple state-of-the-union files. See below.
Extending Automatic Update
There are seven directions you could take this project once you get these basics
handled:
- You get some eager beavers who check every ten minutes if there have been
updates. The way the scheme works now, they would download the entire rather fat
state-of-the-union *.upd file just to discover nothing
had changed. You get around this by maintaining several state-of-the-union files.
You might have a yearly, monthly, weekly, daily, and hourly version. The hourly
version just has changes made in the last hour. The root file
points to all of them. In addition the root file tells you the low and high
sequence number each state-of-the-union file covers. The yearly version may be
completely up-to-date, or it may not. By looking at the ranges, the client can
figure out which of these files it needs to download and process, if any.
It may need to process more than one or none. You could have as many of them or
as few of them as you wanted, spanning any range of sequence numbers.
- Getting an install started from scratch is rather inefficient since the client
downloads a zillion tiny files. It is also rather inefficient to do a massive
update, since a large number of individual files/members would have to be
downloaded. Therefore each upd file might live also in one or more lump files,
where a number of upd files are consolidated. The client downloader can then
decide the most efficient way to get the individual files it needs. The lump
files can be retired just like upd files. They can be updated, to agglutinate
groups of upd files in different ways, or to drop retired/replaced upd files.
The restriction is, the client must download the entire lump file if it wants
even one upd file in it. As a last resort, all upd files are always available
individually. Ironically, each lump file is also an upd file, using the same
sequence number naming scheme. The client would look at the lump files available,
and decide which ones have the most stuff they need and the least stuff they don’t.
If there is too much unwanted stuff, then it would pay to download files
individually. In practice you might have a lump for updates up to the first of
this year, one for updates from first of the year to the first of this month,
one for the first of the month until yesterday, and one for today’s updates.
Note that ideally you rebuild all the lumps each day to prune them of deadwood,
and add any new files to them. A client coming in cold would need to download
all four lump files. A client who updated daily would need to download only one.
A client who updated hourly would download individually. The server is free to
update the lump files at any time even when clients are in the middle of
downloads, because of the way updates are done by always creating new upd and
lump files.
- You need to do your updates to the server’s copy of the *.upd files in batches.
You don’t update the root file until all the upd files in the batch are complete.
The client does not start using his system again until all the upd files
mentioned in the root file are downloaded and installed. You don’t want the
client using his system when only a few of the files of the update have been
installed.
- Some files can’t just simply be plopped on the client site. They need to be
installed, e.g. inserted into the registry or specially processed, e.g. to set
special attribute bits, reboot to replace a DLL, etc. the way Java Web Start
does. You need a way to specify custom installers. See Installer
and the Installer Project.
- This scheme still redownloads large non-jar files even if so much as a comma in
them changes. See the Delta Creator project for
how to tackle that problem. The same problem applies to large members where only
a tiny part of them has actually changed.
- Automatically notify clients when there are changes. This could be done by email,
or by a tiny UDP or TCP/IP probe to the running application. This just probes
them to consider doing a refresh cycle. It does not actually send them any
update data. If there are very frequent updates to the master files, you have to
avoid pestering your clients more frequently than they want to be pestered. You
also have to consider they may have many independent applications using this
scheme. The probe had better identify the application and how to unregister
email probes. You eventually have to give up on notifying clients that never
bother to update or respond to probes. The prober is not necessarily the server
where the client gets updates.
- If you have a great many clients, you need a way to clone your server files and
have clients use all the mirrored servers TuCows-style, picking one close,
functioning, up-to-date and not too busy. Ideally you want mirror site selection
automatic. Further you want propagation to the various mirror sites automatic.
You want seriously out of date clients to first use the less-up-to-date servers
to avoid overloading the up-to-date ones.
- Other projects that could be based on such automatic update include Bulk
file distributor, HTML Glossary
Presenter, On-Line Books, Sanity
Checker, Infinite Disk, Prebranded
Software rental with auto updates.
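For the first extension in the list above (several state-of-the-union files, each covering a low-to-high range of sequence numbers), the client's choice might be sketched as follows. The class UnionFilePicker, the Range type and the selection heuristic are all assumptions made for illustration. The essay only says the client looks at the ranges. The heuristic here prefers, at each step, the file that starts latest while still covering the next sequence number the client needs, on the guess that such a file carries the least already-known material.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical chooser of which state-of-the-union files a client should fetch. */
final class UnionFilePicker {
    /** One state-of-the-union file as advertised in the root file. */
    static final class Range {
        final String name;
        final long low;    // lowest sequence number covered
        final long high;   // highest sequence number covered
        Range(String name, long low, long high) {
            this.name = name;
            this.low = low;
            this.high = high;
        }
    }

    /** Returns the files to download, in order, for a client current up to highestApplied. */
    static List<Range> pick(List<Range> available, long highestApplied) {
        long latest = highestApplied;
        for (Range r : available) {
            latest = Math.max(latest, r.high);
        }
        List<Range> chosen = new ArrayList<>();
        long next = highestApplied + 1;   // first sequence number the client lacks
        while (next <= latest) {
            Range best = null;
            for (Range r : available) {
                boolean covers = r.low <= next && next <= r.high;
                if (covers && (best == null || r.low > best.low
                        || (r.low == best.low && r.high > best.high))) {
                    best = r;
                }
            }
            if (best == null) {
                throw new IllegalStateException("no state-of-the-union file covers " + next);
            }
            chosen.add(best);
            next = best.high + 1;   // strictly increases, so the loop terminates
        }
        return chosen;
    }
}
```

An up-to-date client gets an empty list back and downloads nothing beyond the root file. A real implementation would probably weigh file sizes as well, if the root file advertised them.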
You can use the File Transfer
classes to transfer the files around locally and remotely. You can use the File I/O
Amanuensis to teach you how to compress and decompress files. You will need
to study the ZipFile, ZipEntry,
ZipInputStream and GZIPInputStream classes
for taking apart jar files and compressing/decompressing.
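As a starting point, taking a jar or zip apart into its members, and putting one back together, might look like the sketch below. The class JarSplitter and its method names are hypothetical. The java.util.zip classes are the real ones mentioned above. Everything is held in memory to keep the example short.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper for splitting jars/zips into members and rebuilding them. */
final class JarSplitter {
    private JarSplitter() {}

    /** Reads every non-directory member into memory, keyed by member name, in archive order. */
    static Map<String, byte[]> members(InputStream jar) {
        Map<String, byte[]> result = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(jar)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    // readAllBytes stops at the end of the current member.
                    result.put(entry.getName(), zip.readAllBytes());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    /** Rebuilds an archive from uncompressed member contents, e.g. to recreate a jar from scratch. */
    static byte[] zipOf(Map<String, byte[]> members) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> m : members.entrySet()) {
                zip.putNextEntry(new ZipEntry(m.getKey()));
                zip.write(m.getValue());
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

Each member's bytes could then be GZIP compressed and stored under the next sequence number. Note that a rebuilt jar loses any digital signature unless the signature files are carried along as members too.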
Baby Steps Toward Automatic Update Nirvana
The evolution of automatic update goes like this:
- Download entire applications as a lump, e.g. with an installer,
or with a giant zip file.
- Download just the files that have changed, using Java Web
Start.
- Download just the members of jars and zips that have changed using the Automatic
File Updater described here.
- Use lumping to avoid downloading separate upd files most of the time.
- Download just the chunks of files or members that have changed using the Delta
Creator.
- Use the bulk file distributor project so
that you can efficiently use multiple client-based distributed servers. This
lets you distribute to millions of customers using only a small server.
- Instead of using simple HTTP file transfer protocols, use custom server software
to let the client grab all the updates in a single TCP/IP session.
- Use instantaneous update so that applications can use up-to-the-second
information, even information that becomes available after the app has started.
This requires storing data in specially structured files, usually an SQL
database. Ironically this type of update is much more evolved than the simpler
types described above. See Oracle
Distributed Databases and Oracle
Replicated Databases.