image provider

Super Compressor


Disclaimer

This essay does not describe an existing computer program, just one that should exist. This essay is about a suggested student project in Java programming. This essay gives a rough overview of how it might work. I have no source, object, specifications, file layouts or anything else useful to implementing this project. Everything I have prepared to help you is right here.

This project outline is not like the artificial, tidy little problems you are spoon-fed in school, when all the facts you need are included, nothing extraneous is mentioned, the answer is fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems where it is up to you to fully the define the end point, or a series of ever more difficult versions of this project and research the information yourself to solve them.

Everything I have to say to help you with this project is written below. I am not prepared to help you implement it; or give you any additional materials. I have too many other projects of my own.

Though I am a programmer by profession, I don’t do people’s homework for them. That just robs them of an education.

You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.

Please do not email me about this project without reading the disclaimer above.

Ordinary ZIP (Lempel, Ziv Welch) compression depends on finding repetition and, roughly speaking, creating abbreviations for common words. This approach does not do well on small files. Here are some ideas for super compression.
  1. Imagine you had a dictionary of 64,000 words. You numbered them 1 to 64,000. Before you compress a text file, you look up each word and replace the word and its following space by a 16-bit binary number — the word’s index in the dictionary. You create specialized dictionaries for compressing/decompressing different kinds of files. You assign the most frequently used words the lowest index numbers. This is a pre-process to ordinary compression.
  2. Imagine the state of a compression engine just after it had finished compressing the entire Java version 1.4 Java source code. What if you preloaded that state just before compressing any Java source code file. You would then have a set of common Java abbreviations all ready to go. There would be no penalty for compressing a short file. You need to create loadable states for various types of files and distribute them along with the compression/decompression engine. java.util.zip.Deflater.setDictionary gives you just that ability.

    Compression is even tighter if you use a dictionary compiled from sampled words of a single author. There is research on how text correction can be done with a dictionary that knows the relative probabilities of various words occuring.

  3. Imagine you had fixed length records to compress. If you transposed the fields and records so that all the names, then all the addresses, then all the postal codes, then all the phone numbers… came in order, the compression engine could do much better because the repetition is clearer. Further, you might recognize certain field types such as YYYY/MM/DD and replace them with 16-bit days since 1970 or some such.
  4. Write a compressor that uses information it gleaned from compressing other files in the archive. This gets tricky. If you delete an item, you may have to adjust the remaining entries.
  5. You can use variable length unsigned tokens. For example, token numbers 0..127 are 1 byte long. They represent your 128 most common words/punctuation etc. Token numbers 128..32768 are 2 bytes long. They represent your next 32K of most common words. Token numbers 32768..16,777,215 are three bytes long. They represent what’s left, possibly including common phrases as well. You might experiment with reserving some of the prime 1 or 2 byte tokens for encoding unique words not in the standard dictionary.
  6. What we are after here are ways to compress short messages that use the same vocabulary. We don’t need to send the dictionary along with each message. Everyone gets a copy of the needed dictionaries when they install the software. You will have to send a short dictionary for any novel words not in the official dictionary however.
Delta Creator
HTML Compactor
voice compression

This page is posted
on the web at:

http://mindprod.com/project/supercompressor.html

Optional Replicator mirror
of mindprod.com
on local hard disk J:

J:\mindprod\project\supercompressor.html
Canadian Mind Products
Please the feedback from other visitors, or your own feedback about the site.
Contact Roedy. Please feel free to link to this page without explicit permission.

IP:[65.110.21.43]
Your face IP:[18.224.73.107]
You are visitor number