The word zip refers both to American postal codes and to PKWARE’s public
domain file archiving and compression format. Sun has extended it in its JAR and
WAR files to have a formal table of contents (the manifest).
Zip Postal Codes
Zip stands for Zone Improvement Plan, the American postal code made of five
digits plus an optional four-digit extension (ZIP+4). The codes are assigned so
that you can determine the state from the first three digits of the zip code.
The US Post Office has an online zip code lookup.
Zip File Format
Zip files and jars have a similar format. Each element is preceded by a local
header, then there is a summary set of headers (the central directory) at the
very end of the file. PKWARE documents the ZIP file header format.
PKZIP and WinZip use / as the directory separator
character. It is up to you to convert the \ to /
in element names for the ZipEntry write, and back
again on read. If you don’t bother, the \ will
get in the zip file, and you will have a platform-dependent zip.
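The separator conversion described above can be sketched as follows. The class and method names (EntryNames, toZipName, toPlatformPath) are hypothetical, chosen only for illustration. Note that on Unix a backslash is a legal filename character, so the blind replacement is only appropriate for Windows-style paths.

```java
import java.io.File;

final class EntryNames {
    /** Converts a Windows-style path such as docs\readme.txt to the zip form docs/readme.txt. */
    static String toZipName(String platformPath) {
        return platformPath.replace('\\', '/');
    }

    /** Converts a zip entry name back to the local platform's separator on extraction. */
    static String toPlatformPath(String zipName) {
        return zipName.replace('/', File.separatorChar);
    }
}
```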
Apache VFS gives you a common API for files that works both for regular files
and zip file members. Normally you do your work with ZipFile, ZipEntry,
ZipInputStream and ZipOutputStream, or for simpler tasks GZIPInputStream
and GZIPOutputStream.
Writing Elements of a Zip File
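To write a zip file with a single compressed element, open a ZipOutputStream,
call putNextEntry with a new ZipEntry, write the data, then call closeEntry.
The classes in package java.util.zip such as ZipFile, ZipInputStream and
ZipOutputStream will let you read and create zip or jar files. Don’t worry about
ZipEntry.setCrc, since it and setCompressedSize get set automatically for
compressed (DEFLATED) entries. For STORED entries you must set the size and CRC
yourself before calling putNextEntry.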
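As a minimal sketch of writing one compressed element, the class below builds the archive in memory so it is easy to experiment with. The class name ZipWriteSketch and the method writeSingleEntry are hypothetical. To write to disk you would wrap a FileOutputStream instead of the ByteArrayOutputStream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class ZipWriteSketch {
    /** Returns the bytes of a zip archive holding one DEFLATED entry. */
    static byte[] writeSingleEntry(String entryName, byte[] content) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buffer)) {
            // Entry names use '/' as the separator, whatever the platform.
            ZipEntry entry = new ZipEntry(entryName);
            zos.putNextEntry(entry);   // DEFLATED is the default method
            zos.write(content);
            zos.closeEntry();          // CRC and sizes are computed for us here
        }
        return buffer.toByteArray();
    }
}
```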
Reading Elements of a Zip File Sequentially
Sequential code that reads with ZipInputStream and relies on
ZipEntry.getSize() won’t work if ZipOutputStream was used to create the zip
file. This includes *.jar files created by jar.exe.
To read all the elements of a zip, you might think you would use
ZipFile.entries() to enumerate all the entries.
Unfortunately, this enumeration may come back in "random" order (Hashtable
order, really). So you need to use the random access method below. To efficiently
move the disk arms over the file, you really should sort the entries first in
the order they appear in the zip.
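One way to sidestep the size problem when reading sequentially is to read each element until end-of-entry instead of sizing a buffer from getSize(). The following is a sketch under that assumption, not the original author's code. It assumes Java 9 or later for InputStream.readAllBytes, and the names ZipSequentialReadSketch and readAll are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class ZipSequentialReadSketch {
    /** Reads every file element in archive order; keys are the entry names. */
    static Map<String, byte[]> readAll(InputStream rawZip) throws IOException {
        Map<String, byte[]> contents = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(rawZip)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // getNextEntry closes the previous entry for us
                }
                // entry.getSize() may be unknown here, so read to end-of-entry instead.
                contents.put(entry.getName(), zis.readAllBytes());
            }
        }
        return contents;
    }
}
```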
Reading Elements of a Zip File Randomly
Random access through ZipFile will work to read elements given the
element name, even if ZipOutputStream was used to create the zip file, which
does not record the lengths in the local headers.
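A minimal sketch of random access by element name follows. It assumes Java 9 or later for readAllBytes, and the names ZipRandomReadSketch and readEntry are hypothetical. ZipFile needs a real file on disk because it seeks to the index at the end of the archive.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class ZipRandomReadSketch {
    /** Returns the bytes of the named element, or null if the archive has no such entry. */
    static byte[] readEntry(Path zipPath, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zip.getEntry(entryName); // looked up via the central directory
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return in.readAllBytes();
            }
        }
    }
}
```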
Verifying
To verify that a zip for distribution contains all the files in the
corresponding jar, compare the entry names in the two archives.
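One possible way to do the comparison is sketched below. It only checks that the names match, not the contents, and the names ZipVerifySketch, fileNames and missingFromZip are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.Set;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class ZipVerifySketch {
    /** Collects the names of all non-directory entries in a zip or jar. */
    static Set<String> fileNames(Path archive) throws IOException {
        Set<String> names = new TreeSet<>();
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.isDirectory()) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    /** Names present in the jar but absent from the distribution zip; empty means verified. */
    static Set<String> missingFromZip(Path distributionZip, Path jar) throws IOException {
        Set<String> missing = fileNames(jar);
        missing.removeAll(fileNames(distributionZip));
        return missing;
    }
}
```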
Directories
Normally directories are not explicitly created or even stored as separate
entries in a zip file. When the file is extracted, any directories needed to
contain the extracted files are automatically created as needed. However, you
can store empty directories in a zip file. They appear as filenames ending in /.
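Storing an empty directory can be sketched as writing an entry whose name ends in / and which has no data. The class name EmptyDirSketch and the method zipWithEmptyDirectory are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class EmptyDirSketch {
    /** Returns a zip archive containing a single empty directory entry. */
    static byte[] zipWithEmptyDirectory(String dirName) throws IOException {
        // The trailing '/' is what marks the entry as a directory.
        String name = dirName.endsWith("/") ? dirName : dirName + "/";
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buffer)) {
            zos.putNextEntry(new ZipEntry(name));
            zos.closeEntry(); // no data is written for a directory
        }
        return buffer.toByteArray();
    }
}
```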
Nesting
The member files in a zip file can be accessed individually, just like the files
in a jar file (a species of zip file). However, when one zip is contained within
another zip, ZipFile can only access the contained zip file itself, not its
individual members. You would need to expand it to disk somewhere before
accessing its members randomly.
There are three approaches to the problem:
- Put all members in the same jar/zip.
- Use several individual jar files, and arrange to have them on the classpath.
- Use a JWS installer class to unpack a nested jar into individual jars.
Why would you nest?
- To get super-compression. You create a zip as a pure archive, turning off
compression. (In WinZip you select compression:none.) Then you compress the
whole thing as a single file, this time with compression on. The compression
algorithm can then exploit repeated strings across members.
- Because you want the user to leave some jars packed for use. You bundle them up
for transport as a single download.
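The super-compression idea from the list above can be sketched with java.util.zip alone. This version assumes that setting the Deflater level to NO_COMPRESSION on the inner archive is an acceptable stand-in for WinZip's compression:none. The names NestedZipSketch, zip and superCompress are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Collections;
import java.util.Map;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class NestedZipSketch {
    /** Builds a zip of the given members at the given Deflater level. */
    static byte[] zip(Map<String, byte[]> members, int level) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buffer)) {
            zos.setLevel(level);
            for (Map.Entry<String, byte[]> member : members.entrySet()) {
                zos.putNextEntry(new ZipEntry(member.getKey()));
                zos.write(member.getValue());
                zos.closeEntry();
            }
        }
        return buffer.toByteArray();
    }

    /** Packs the members uncompressed, then compresses that archive as one member. */
    static byte[] superCompress(Map<String, byte[]> members, String innerName) throws IOException {
        byte[] inner = zip(members, Deflater.NO_COMPRESSION);
        return zip(Collections.singletonMap(innerName, inner), Deflater.BEST_COMPRESSION);
    }
}
```

The gain is largest when many small members share content, because members compressed separately cannot refer back to each other.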
Gotchas
- ZipOutputStream produces a slightly unusual Zip format. It puts the compressed
size, uncompressed size and CRC in a data descriptor after each member’s data,
instead of in the local header just in front of it. Unfortunately, when you come
to read this file with ZipInputStream, ZipEntry.getSize() will return -1
(unknown), because ZipInputStream is a stream, and can’t look ahead to find the
size. There is a second copy of each header put at the end of the file forming
an index, and ZipFile is able to use this index to randomly access the file to
read individual elements. A normal zip file has the information recorded
redundantly to make it easier to read ahead and to recover a damaged zip file.
- java.util.zip has one big limitation. It only understands a few of the
possible compression methods: stored (no compression) and deflated. If the
*.zip came from the outside world and uses some other method, you need to exec
something like WinZip wzunzip.exe or PKWARE pkunzip.exe.
- Zip format (and by extension jar format) stores file
timestamps accurate only to the even second. When you archive and restore a file,
it will no longer have a timestamp precisely matching the original. This is
above and beyond the similar problem with Java using 1 millisecond precision and
Microsoft Windows using 100 nanosecond increments. PKZIP format derives from MS
DOS days and hence uses only 16 bits for time and 16 bits for date. The revised
PKZIP format defines an extended timestamp, but Java originally did not use it;
later JDKs (Java 8 onward) can write it through ZipEntry.setLastModifiedTime.
Inside zip files, dates and times are stored in local time in 16 bits each using
an old MS DOS format. Bit 0 is the least significant bit. The format is little-endian.
There was not room in 16 bits to represent time even to the second, so
the seconds field contains the seconds divided by two, giving accuracy only to
the even second.
To make matters worse, standard tools like WinZip or PKZIP will always round the
time up to the next even second when they restore, thereby possibly making the
file one second younger. The JDK (i.e. javaToDosTime in ZipEntry)
rounds the time down, thereby making the file one second older.
PKZIP/MSDOS DOSTIME 16-bit Packed Time format

| field | hour | minute | seconds/2 |
|---|---|---|---|
| values | 0…23 | 0…59 | 0…29 |
| width | 5 bits | 6 bits | 5 bits |
| position | 15…11 | 10…5 | 4…0 |

PKZIP/MSDOS DOSDATE 16-bit Packed Date format

| field | year | month | day |
|---|---|---|---|
| values | 1980⇒0 | 1…12 | 1…31 |
| width | 7 bits | 4 bits | 5 bits |
| position | 15…9 | 8…5 | 4…0 |
- Sun’s early jar files had no compression. Compression is optional in Sun’s
zip and jar classes.
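The packed DOSTIME and DOSDATE layouts tabulated in the timestamp gotcha above can be sketched as plain bit shifts. This is an illustration of the field layout only, not the JDK's own javaToDosTime code, and the names DosDateTimeSketch, packTime, packDate, unpackTime and unpackDate are hypothetical.

```java
final class DosDateTimeSketch {
    /** Packs a time into 16 bits: hour in bits 15..11, minute in 10..5, seconds/2 in 4..0. */
    static int packTime(int hour, int minute, int second) {
        return (hour << 11) | (minute << 5) | (second >> 1); // odd seconds round down
    }

    /** Packs a date into 16 bits: year-1980 in bits 15..9, month in 8..5, day in 4..0. */
    static int packDate(int year, int month, int day) {
        return ((year - 1980) << 9) | (month << 5) | day;
    }

    /** Returns {hour, minute, second}; the second is always even. */
    static int[] unpackTime(int dosTime) {
        return new int[] { (dosTime >> 11) & 0x1F, (dosTime >> 5) & 0x3F, (dosTime & 0x1F) * 2 };
    }

    /** Returns {year, month, day}. */
    static int[] unpackDate(int dosDate) {
        return new int[] { ((dosDate >> 9) & 0x7F) + 1980, (dosDate >> 5) & 0x0F, dosDate & 0x1F };
    }
}
```

For example, 13:45:31 packs to 0x6DAF and unpacks as 13:45:30, which shows the loss of the odd second. The date 2004-02-29 packs to 0x305D.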
GZIP vs Zip
GZIP is a more primitive file format than zip. GZIPInputStream
and GZIPOutputStream let you read and create
compressed files, but not using the zip directory structure. The file consists
of just one compressed lump, without any embedded directory of member filenames,
timestamps etc. For sample GZIPOutputStream code, consult the File
I/O Amanuensis.
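A minimal in-memory round trip shows how little there is to the GZIP API compared with zip: no entries, just one stream in and one stream out. This sketch assumes Java 9 or later for readAllBytes, and the names GzipSketch, compress and decompress are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipSketch {
    /** Compresses the data into a single GZIP lump. */
    static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(data);
        } // closing writes the GZIP trailer (CRC and length)
        return buffer.toByteArray();
    }

    /** Expands a GZIP lump back to the original bytes. */
    static byte[] decompress(byte[] gzipped) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            return gz.readAllBytes();
        }
    }
}
```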
Encryption
WinZip supports:
- AES (Advanced Encryption Standard) 128-bit.
- AES 256-bit, slower but more secure.
- Zip 2.0, weak encryption, primarily to discourage casual snoops.
Java’s ZipOutputStream does not support the WinZip encryption schemes. So you
must manually encrypt and decrypt either on the plaintext or the compressed
form, perhaps using JCE.
Unfortunately you will need your code at both ends to encrypt/decrypt. You won’t
be able to create encrypted files that WinZip can decrypt on its own.
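As a hedged sketch of the do-it-yourself route, the class below encrypts the compressed form (the finished zip bytes) with JCE. The choice of AES in GCM mode, the 12-byte IV prepended to the output, and the names ZipEncryptSketch, encrypt and decrypt are all assumptions made for illustration. The result is not a WinZip-compatible encrypted zip, and key management is left out entirely.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

final class ZipEncryptSketch {
    private static final int IV_BYTES = 12;   // assumed layout: IV first, then ciphertext
    private static final int TAG_BITS = 128;  // GCM authentication tag length

    /** Encrypts already-zipped bytes; output is the IV followed by ciphertext and tag. */
    static byte[] encrypt(byte[] zipBytes, SecretKey key) throws GeneralSecurityException {
        byte[] iv = new byte[IV_BYTES];
        new SecureRandom().nextBytes(iv); // a fresh IV for every message
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] sealed = cipher.doFinal(zipBytes);
        byte[] out = new byte[IV_BYTES + sealed.length];
        System.arraycopy(iv, 0, out, 0, IV_BYTES);
        System.arraycopy(sealed, 0, out, IV_BYTES, sealed.length);
        return out;
    }

    /** Reverses encrypt(); throws if the data was tampered with or the key is wrong. */
    static byte[] decrypt(byte[] encrypted, SecretKey key) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, encrypted, 0, IV_BYTES));
        return cipher.doFinal(encrypted, IV_BYTES, encrypted.length - IV_BYTES);
    }
}
```

Encrypting after compression keeps the compression effective, since well-encrypted data no longer has repeated strings to exploit.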
Learning More
- Sun’s Javadoc on the ZipFile class
- Sun’s Javadoc on the ZipEntry class
- Sun’s Javadoc on the ZipInputStream class
- Sun’s Javadoc on the ZipOutputStream class
- Sun’s Javadoc on the GZIPInputStream class
- Sun’s Javadoc on the GZIPOutputStream class