Automatic File Updates
©1996-2017 Roedy Green of Canadian Mind Products
Disclaimer
This essay does not describe an existing computer program, just one that should exist. This essay is about a suggested student project in
Java programming. This essay gives a rough overview of how it might work. I have no source, object, specifications, file layouts or anything
else useful to implementing this project. Everything I have prepared to help you is right here.
This project outline is not like the artificial, tidy little problems you are spoon-fed in school, where all the facts you need are included, nothing extraneous is mentioned, the answer is
fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems where it is up to you to fully
define the end point, or a series of ever more difficult versions of this project, and research the information yourself to solve them.
Everything I have to say to help you with this project is written below. I am not prepared to help you implement it or give you any additional materials. I have too many
other projects of my own.
Though I am a programmer by profession, I don’t do people’s homework for them. That just robs them of an education.
You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.
Please do not email me about this project without reading the disclaimer above.
The intent of this project is to keep any files (including
*.jar and *.zip files)
up-to-date on the client site, automatically. It is designed to dovetail with the
Delta File Creator that makes the process
even more efficient by sending just the parts of files that have changed. This project
sends entire files, or entire zip members, if so much as a comma changes in them.
There are already some push tools that update files on the
client sites. These include Marímba, Java Web Start, DMP, Funduc Patch and Symantec LiveUpdate. However, they are not suitable for two
simple applications I have:
- Keeping people’s expanded copies of cmp*.zip
up-to-date. These are giant files containing all the *.html,
*.gif, *.jpg and *.mp3 files on my website in downloadable zipped format.
- Keeping a JavaHelp.jar file up-to-date on a client
site.
What are the problems?
- Every time even so much as a comma changes in a file, the client must download the
whole thing.
- Some tools require running custom software on the server.
ISPs (Internet Service Providers)
don’t like you doing that, at least not without paying them a fat fee.
- The tools may be too expensive for the hobbyist.
- You can’t FTP (File Transfer Protocol) upload the master copy of the files on the server
because people are always busy downloading them.
How are Automatic Update Files Stored on the Server?
Instead of storing entire
jars or zips on the server, the members of the jars are stored as separate *.upd files on the server. The files are given sequential numbered names,
e.g. 00000042.upd. The server is a perfectly standard
HTTP (Hypertext Transfer Protocol) server. The only
thing you need to configure on your server is a new MIME type for the *.upd extension for
your component files. When the component of a jar is updated, the new contents are
assigned a new sequential number. That way there is no problem uploading a file to the
server that others are downloading. Any change gets a new number. You retire updated file
numbers. All the *.upd files are stored in GZIP compressed
format. Content files in jars and zips are compressed only once. The *.upd files representing standalone files are compressed too, so
standalone files that are already compressed are effectively compressed twice. Fanatics might want to
invent a way to avoid that tiny extra overhead.
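Here is a minimal sketch of the naming and packing step described above. The class name UpdPacker and its method names are my own inventions for illustration. I also assume the sequence number is zero-padded to eight digits, which matches the 00000042.upd example but is not otherwise specified.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper: names and packs one component as a *.upd file. */
final class UpdPacker {
    /** Assumed convention: eight zero-padded digits, e.g. 42 gives 00000042.upd. */
    static String updName(long sequenceNumber) {
        return String.format(Locale.ROOT, "%08d.upd", sequenceNumber);
    }

    /** GZIP-compresses the raw contents of a standalone file or jar member. */
    static byte[] pack(byte[] raw) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(raw);
        }
        return bytes.toByteArray();
    }

    /** Reverses pack(): recovers the original contents from *.upd bytes. */
    static byte[] unpack(byte[] packed) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPInputStream gzip =
                 new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = gzip.read(buffer)) != -1) {
                bytes.write(buffer, 0, n);
            }
        }
        return bytes.toByteArray();
    }
}
```

Because a changed component always gets a fresh number, the packer never overwrites a file that somebody may be downloading.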
How are Automatic Update Files Stored on the Client?
The files are conventional
jar, zip, program or stand-alone data files. They can have any name or extension. They
can live anywhere on disk. They may be compressed or uncompressed in any conceivable
format. Jars can even be digitally signed. They are not permanently stored as *.upd
files.
The master copies of the files created by the programmers are also maintained in this
conventional way. No one is aware of the *.upd files. They are a transport mechanism
only. The conversion to and from the *.upd files is fully
automatic.
How Does Automatic Update Work?
There is a tiny root
file on the server. It contains the sequential number of the current state-of-the-union file. The state-of-the-union file is also stored on
the server in GZIP compressed form. It has entries that contain the following data for
each file/member managed by the file updating system.
Fields to control Autoupdate
State-Of-The-Union File Fields

| Field | Purpose |
|---|---|
| Status | A=Active, R=Replaced, D=Deleted. Active means this file is necessary for the client. Replaced means this file has been replaced by some other version. Deleted means this file is no longer used. To start a client from scratch, all you need do is examine the Active entries. Usually all the Replaced entries will be filtered out. They are there mainly for debugging. Similarly, very old Deleted entries might be filtered out. |
| Sequence Number | Appending .upd will give you the name of the corresponding file on the server. All the files for a project are stored in the same directory on the server, even if they are stored in many different directories on the client. |
| Install Root Code | Usually this will be 1 to mean that files are installed relative to the client’s installation directory. For a complex project, you may have multiple installation directories. 0 means file names are absolute. |
| File Name | The fully qualified filename of where this content eventually ends up on the client. |
| Member Name | The fully qualified filename of the jar or zip entry. If this is blank, this entry represents a standalone file. |
| Date/Time Updated | Milliseconds since 1970 GMT (Greenwich Mean Time), the same epoch used by Unix and Java timestamps. This is used to set or check the file’s system date. |
| Checksum | 32-bit Adler checksum (Adler-32) of the data. It is computed on the uncompressed form of the file. Adler-32 checksums are faster to compute than CRC-32. |
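As a rough sketch, one entry of the state-of-the-union file could be modelled in Java like this. The class name UnionEntry, the field names and the eight-digit padding are assumptions of mine. The essay does not define a file layout, so parsing is left out. The checksum uses the standard java.util.zip.Adler32 class.

```java
import java.util.Locale;
import java.util.zip.Adler32;

/** One state-of-the-union entry; the names here are illustrative only. */
final class UnionEntry {
    final char status;          // 'A' = Active, 'R' = Replaced, 'D' = Deleted
    final long sequenceNumber;  // identifies the *.upd file on the server
    final int installRootCode;  // 0 = absolute name, 1 = relative to install dir
    final String fileName;      // where the content ends up on the client
    final String memberName;    // jar/zip entry name, empty if standalone
    final long updatedMillis;   // milliseconds since 1970 GMT
    final long checksum;        // Adler-32 of the uncompressed data

    UnionEntry(char status, long sequenceNumber, int installRootCode,
               String fileName, String memberName,
               long updatedMillis, long checksum) {
        this.status = status;
        this.sequenceNumber = sequenceNumber;
        this.installRootCode = installRootCode;
        this.fileName = fileName;
        this.memberName = memberName;
        this.updatedMillis = updatedMillis;
        this.checksum = checksum;
    }

    /** A blank member name marks a standalone file. */
    boolean isStandalone() {
        return memberName.isEmpty();
    }

    /** Assumed naming convention: eight zero-padded digits plus .upd. */
    String serverName() {
        return String.format(Locale.ROOT, "%08d.upd", sequenceNumber);
    }

    /** Computes the Adler-32 checksum of uncompressed data. */
    static long adler32(byte[] uncompressed) {
        Adler32 adler = new Adler32();
        adler.update(uncompressed, 0, uncompressed.length);
        return adler.getValue();
    }

    /** True if the data matches the checksum recorded in this entry. */
    boolean matches(byte[] uncompressed) {
        return adler32(uncompressed) == checksum;
    }
}
```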
When the client wants to refresh its files, it first downloads the tiny
root file. From there it can download and decompress the current
state-of-the-union file. It knows it already has current files,
up to and including sequence number N. It knows this even if it had to restore its data
files from backup. It then looks in the state-of-the-union file and processes the
entries. If it sees a Delete entry, it deletes the
corresponding file or member. If the zip or jar has no more members, it deletes the file
itself. If it sees a Replace entry, it ignores it. If it sees
an Active entry, it inserts/replaces that file or member. It
may optionally verify the checksum of newly updated or all active members. If there are
failures, it can automatically redownload any failed entries and, optionally, even
totally recreate any jar/zip files from scratch. This makes your applications and files
self-healing.
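The decision the client makes for each entry might be sketched as follows. RefreshPlanner, Action and Entry are hypothetical names. I am also assuming that every entry, including a Delete entry, carries the sequence number of the change that produced it, so the client can skip anything at or below N. The essay does not spell that out, so treat it as one possible design.

```java
import java.util.List;

/** Hypothetical planner: decides what the client does with each entry. */
final class RefreshPlanner {
    enum Action { DOWNLOAD_AND_INSTALL, DELETE_LOCAL, IGNORE }

    /** Minimal view of an entry: only the fields the decision needs. */
    static final class Entry {
        final char status;         // 'A', 'R' or 'D'
        final long sequenceNumber; // assumed: the number of this change
        Entry(char status, long sequenceNumber) {
            this.status = status;
            this.sequenceNumber = sequenceNumber;
        }
    }

    /** lastApplied is N, the highest sequence number already installed. */
    static Action actionFor(Entry entry, long lastApplied) {
        if (entry.sequenceNumber <= lastApplied) {
            return Action.IGNORE; // the client already has this change
        }
        switch (entry.status) {
            case 'A': return Action.DOWNLOAD_AND_INSTALL;
            case 'D': return Action.DELETE_LOCAL;
            default:  return Action.IGNORE; // Replaced entries are skipped
        }
    }

    /** The new N to remember once every entry has been processed. */
    static long newHighWater(List<Entry> entries, long lastApplied) {
        long highest = lastApplied;
        for (Entry entry : entries) {
            highest = Math.max(highest, entry.sequenceNumber);
        }
        return highest;
    }
}
```

The client should store the new N only after every download and delete has succeeded. That way an interrupted refresh simply repeats the same work next time.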
You need a tool to help you prepare your *.upd files for
uploading to the server. It starts with a list of directories and files to process. It
detects file and member changes via file and member dates or possibly with the checksums,
or even with comparison with your current set of *.upd files. It
is probably easiest to use file dates exclusively in determining which *.upd files to create and have a checksum verify routine you run
periodically. If you get a failure, you manually redate the affected files with a touch
utility to force a correction.
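The date-only approach could look something like this sketch. ChangeDetector is a made-up name. I assume the preparation tool keeps a map from each file or member name to the timestamp it had when last published, and compares that with the timestamps it sees now.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical date-based change detector for the preparation tool. */
final class ChangeDetector {
    /**
     * Returns, in sorted order, the names that are new or whose timestamp
     * differs from the one recorded at the last publish. Each needs a new
     * *.upd file with a fresh sequence number.
     */
    static List<String> changed(Map<String, Long> publishedMillis,
                                Map<String, Long> currentMillis) {
        List<String> result = new ArrayList<>();
        // TreeMap gives a stable, alphabetical processing order.
        for (Map.Entry<String, Long> e : new TreeMap<>(currentMillis).entrySet()) {
            Long previous = publishedMillis.get(e.getKey());
            if (previous == null || !previous.equals(e.getValue())) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Names published earlier that no longer exist: candidates for Delete entries. */
    static List<String> removed(Map<String, Long> publishedMillis,
                                Map<String, Long> currentMillis) {
        List<String> result = new ArrayList<>();
        for (String name : new TreeMap<>(publishedMillis).keySet()) {
            if (!currentMillis.containsKey(name)) {
                result.add(name);
            }
        }
        return result;
    }
}
```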
How do you handle a file whose date has changed, but whose contents have not? You
could:
- Avoid changing file dates on your master files unless you really change them.
- Propagate the redated file as if it were a truly updated file.
- Fix the original file date back to the previous date.
- Ignore the problem. Leave the file with the old date on the client site.
Don’t worry that its date does not exactly match.
- Treat this as a special case of delta compression. See below.
- Invent a special propagation mechanism that changes file dates, without actually
transmitting the corresponding files.
What if the client refreshes so infrequently that the necessary Delete entries are no longer present?
- You might not sweat it and just leave the ancient deleted files lying around on
the client site.
- You might tell the client to start from scratch.
- It is probably easiest to just keep all Delete entries
present forever. Deletes are fairly rare in comparison with Replaces.
- Treat this with multiple state-of-the-union files. See below.
Extending Automatic Update
There are seven directions you could take this project
once you get these basics handled:
- You get some eager beavers who check every ten minutes if there have been updates.
The way the scheme works now, they would download the entire rather fat
state-of-the-union *.upd file just to discover nothing had changed. You get
around this by maintaining several state-of-the-union files. You might have a yearly,
monthly, weekly, daily and hourly version. The hourly version just has changes made in
the last hour. The root file points to all of them. In
addition the root file tells you the low and high sequence number each
state-of-the-union file covers. The yearly version may be completely up-to-date, or it
may not. By looking at the ranges, the client can figure out which of these files it
needs to download and process, if any. It may need to process more
than one or none. You could have as many of them or as few of them as you wanted,
spanning any range of sequence numbers.
- Getting an install started from scratch is rather inefficient since the client
downloads a zillion tiny files. It is also rather inefficient to do a massive update,
since a large number of individual files/members would have to be downloaded. Therefore
each upd file might live also in one or more lump files, where a number of upd files
are consolidated. The client downloader can then decide the most efficient way to get
the individual files it needs. The lump files can be retired just like upd files. They
can be updated, to agglutinate groups of upd files in different ways, or to drop
retired/replaced upd files. The restriction is, the client must download the entire
lump file if it wants even one upd file in it. As a last resort, all upd files are
always available individually. Ironically, each lump file is also an upd file, using
the same sequence number naming scheme. The client would look at the lump files
available and decide which ones have the most stuff they need and the least stuff they
don’t. If there is too much unwanted stuff, then it would pay to download files
individually. In practice you might have a lump for updates up to the first of this
year, one for updates from first of the year to the first of this month, one for the
first of the month until yesterday and one for today’s updates. Note that
ideally you rebuild all the lumps each day to prune them of deadwood and add any new
files to them. A client coming in cold would need to download all four lump files. A
client who updated daily would need to download only one. A client who updated hourly
would download individually. The server is free to update the lump files at any time
even when clients are in the middle of downloads because of the way updates are done
by always creating new upd and lump files.
- You need to do your updates to the server’s copy of the *.upd files in
batches. You don’t update the root file until all the upd files in the batch are
complete. The client does not start using his system again until all the upd files
mentioned in the root file are downloaded and installed. You don’t want the
client using his system when only a few of the files of the update have been
installed.
- Some files can’t just simply be plopped on the client site. They need to be
installed, e.g. inserted into the registry or specially processed, e.g. to set special
attribute bits, reboot to replace a DLL (Dynamic Link Library),
etc. the way Java Web
Start does. You need a way to specify custom installers. See Installer and the Installer Project.
- This scheme still redownloads large non-jar files even if so much as a comma in
them changes. See the Delta Creator
project for how to tackle that problem. The same problem applies to large members where
only a tiny part of them has actually changed.
- Automatically notify clients when there are changes. This could be done by email,
or by a tiny UDP (User Datagram Protocol) or TCP/IP (Transmission Control Protocol/Internet Protocol)
probe to the running application. This just probes them to consider doing a refresh
cycle. It does not actually send them any update data. If there are very frequent
updates to the master files, you have to avoid pestering your clients more frequently
than they want to be pestered. You also have to consider they may have many independent
applications using this scheme. The probe had better identify the application and how
to unregister email probes. You eventually have to give up on notifying clients that
never bother to update or respond to probes. The prober is not necessarily the server
where the client gets updates.
- If you have a great many clients, you need a way to clone your server files and
have clients use all the mirrored servers TuCows-style, picking one close, functioning,
up-to-date and not too busy. Ideally you want mirror site selection automatic. Further,
you want propagation to the various mirror sites automatic. You want seriously out of
date clients to first use the less-up-to-date servers to avoid overloading the
up-to-date ones.
- Other projects that could be based on such automatic update include Bulk file distributor, HTML Glossary Presenter, On-Line Books, Sanity Checker, Infinite Disk, Prebranded Software rental with auto updates.
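To illustrate the first extension in the list above (several state-of-the-union files, each covering a range of sequence numbers), here is one way the client might pick which files to fetch. The names UnionFileChooser and Range are hypothetical, and the greedy rule of preferring the narrowest file that covers the next missing number is just my assumption about what "most efficient" means. Other rules would work too.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical chooser for multiple state-of-the-union files. */
final class UnionFileChooser {
    /** One line of the root file: a file name and the range it covers. */
    static final class Range {
        final String name;
        final long low;   // lowest sequence number covered
        final long high;  // highest sequence number covered
        Range(String name, long low, long high) {
            this.name = name;
            this.low = low;
            this.high = high;
        }
    }

    /**
     * Picks files that together cover lastApplied+1 through head.
     * Returns an empty list when the client is already up to date.
     */
    static List<Range> choose(List<Range> available, long lastApplied, long head) {
        List<Range> picks = new ArrayList<>();
        long next = lastApplied + 1; // first sequence number the client lacks
        while (next <= head) {
            Range best = null;
            for (Range r : available) {
                boolean covers = r.low <= next && next <= r.high;
                boolean narrower = best == null
                        || (r.high - r.low) < (best.high - best.low);
                if (covers && narrower) {
                    best = r;
                }
            }
            if (best == null) {
                // Nothing covers the gap: the client must start from scratch.
                throw new IllegalStateException("no file covers sequence " + next);
            }
            picks.add(best);
            next = best.high + 1;
        }
        return picks;
    }
}
```

An eager beaver who is only a few numbers behind gets just the small hourly file, and a client who is fully current downloads nothing beyond the root file.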
You can use the File
Transfer classes to transfer the files around locally and remotely. You can use the
File I/O Amanuensis to teach you how
to compress and decompress files. You will need to study the ZipFile, ZipEntry, ZipInputStream and GZIPInputStream classes for
taking apart jar files and compressing/decompressing.
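As a starting point, this sketch takes a jar or zip apart into its members with ZipInputStream. JarSplitter is a name I made up. It holds everything in memory, which is fine for experimenting but not for the giant zips mentioned earlier.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: splits a jar or zip into its members. */
final class JarSplitter {
    /**
     * Reads every non-directory member into memory, keyed by member name,
     * in the order the members appear in the archive.
     */
    static Map<String, byte[]> members(InputStream zipData) throws IOException {
        Map<String, byte[]> result = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(zipData)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no content to ship
                }
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current member
                while ((n = zip.read(buffer)) != -1) {
                    bytes.write(buffer, 0, n);
                }
                result.put(entry.getName(), bytes.toByteArray());
            }
        }
        return result;
    }
}
```

Each byte array is the uncompressed member, which is what the checksum is computed on and what gets GZIP compressed into a *.upd file.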
Baby Steps Toward Automatic Update Nirvana
The evolution of automatic update goes
like this:
- Download entire applications as a lump, e.g. with an installer, or with a giant zip file.
- Download just the files that have changed, using Java Web Start.
- Download just the members of jars and zips that have changed using the Automatic File Updater described here.
- Use lumping to avoid downloading separate upd files most of the time.
- Download just the chunks of files or members that have changed using the
Delta Creator.
- Use the bulk file
distributor project so that you can efficiently use multiple client-based
distributed servers. This lets you distribute to millions of customers using only a
small server.
- Instead of using simple HTTP file transfer protocols, use custom server software to
let the client grab all the updates in a single TCP/IP
session.
- Use instantaneous update so that applications can use up-to-the-second
information, even information that becomes available after the app has started. This
requires storing data in specially structured files, usually an
SQL (Structured Query Language) database. Ironically this type of update is much more
evolved than the simpler types described above. See Oracle Distributed
Databases and Oracle Replicated
Databases.