Defragger Defragger

This essay does not describe an existing computer program, just one that should exist. This essay is about a suggested student project in Java programming. This essay gives a rough overview of how it might work. I have no source, object, specifications, file layouts or anything else useful to implementing this project.

This project outline is not like the artificial, tidy little problems you are spoon-fed in school, when all the facts you need are included, nothing extraneous is mentioned, the answer is fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems where it is up to you to fully the define the end point, or a series of ever more difficult versions of this project, and research the information yourself to solve them.

Everything I have to say to help you with this project is written below. I am not prepared to help you implement it; or give you any additional materials. I have too many other projects of my own.

Though I am a programmer, I don’t do people’s homework for them. That just robs them of an education.

You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.

Please do not email me about this project without reading the disclaimer above.

To write an efficient disk defragger, model how you tidy your apartment, better still, how Martha Stewart tidies her house.

A defragger is a program that rearranges files on hard disk to speed up access. See defragger in the Java & Internet Glossary. You would want the defragger to put files in order by last access date so that the rarely used files are near the center (end) of the disk and the most used ones are near the outer edge (beginning). You want each file in one contiguous lump. Ideally, you want to move frequently used system files like the directories, free space tables, and system swap file as well. Within the i-node/MFT table you want to internally defrag as well, putting the directories together, and ordering by last access date, within that by directory. You can seek to optimise either file open time or find-file if you don’t have indexes to rapidly search.

A defragger may also do other tidying such as consolidating free space fragments, removing deleted dummy directory entries, sorting directories, and collecting files from the same directory together when their access timestamps are on the same day.

There may also be quick variants of the defragger that just seek to quickly clean up the files with the most fragmentation.

The defragger comes in five pieces:

  1. The algorithm for how you optimally place the files and system tables.
  2. The algorithm for how you shuffle the pieces of the files around. It is a bit like a game of sliding squares. You can’t move a chunk into its final resting place if there is some other file in that slot at present.
  3. The display that entertains and hypnotises the user as you work. You display a coloured map of the disk showing its state and what you are doing. It would display groups of 256 pixels each representing a cluster, arranged in 16x16 squares. For small disks each cluster may be represented by more than one pixel. You might experiment with other groupings of pixels so that a contiguous file showed as a long thin line, or a blob. You could colour each block with the wavelength proportional to where that block’s final resting place is. That way a spectrum gradually appears as the defrag prgresses. You could also have an arbitrary picture appear, sort of like a jigsaw coming together.
  4. The platform-dependent Java code for accessing and changing the low-level disk structure.
  5. The platform-dependent native code for accessing and changing the low-level disk structure.
Most operating systems have some files you may not move. You have to simply work around them.

Because undebugged defraggers trash hard disks, you will want to write a defragger simulator that is platform independent. You may want to write platform dependent data-gatherers, and ask your friends to send you copies of their messy disk layouts — not the data, just the current file positions and fragmentation so you can test your program on real-world situations.

In later debugging stages, you will want a large hard disk with a free partition you can trash and recreate with a partition copier.

The simulator lets various people try coding the algorithm safely without trashing any disks or getting involved in the low level details of defragging. You can then plug in the most efficient defragger.

Your simulator’s scorekeeper should reward doing I/O in large contiguous lumps. It should reward keeping disk arm motions to a minimum both in number and length of leap. It should, as best as possible, simulate true elapsed time.

I have noticed that most of the I/O of defragging with the OS (Operating System) interface is not moving the file bodies, but rather updating the directory to point to the new location. So I suspect an algorithm that only moved files whose destination is already free will be faster, even if the arms jump all over. You want very hard to avoid moving a file to a temporary resting place.

How do you defrag when there are other tasks using the files as you work?

  1. You can use the operating system defrag interface such as the inept one provided in NT. The problem with NT is that it lets you move only one file at a time, though you can move a run of clusters from that file at a time so long as they are all being moved into a contiguous chunk of free space. You can read about how the NT official defrag interface works at SysInternals. Moving small chunks at a time this way moving the heads back and forth from source to destination is very, very slow. It is quite incompatible with the algorithms I have described for fast defragging. Further, you can’t afford to do a thorough defrag with such a limitation.

    Ideally, you should be able to specify a long list of moves to make, and NT would do them in one or more sweeps across the disk, handling perhaps dozens of files per sweep, depending on how much RAM (Random Access Memory) it has available. The only restriction is that all the target clusters need to be free before you submit the list. You would not be allowed to do a list of moves that requires an internal ordering of the moves to free up space before the rest of the list of moves can complete. When you need to do that, you must break the list in two and submit them separately.

    If you watch Raxco PerfectDisk defragger’s display as it uses the NT interface, it can drive you nuts. Nothing appears to be happening for 5 minutes at a stretch. It behaves like some bored barber endlessly picking away at your hair, cutting individual strands, shortening by never more than a millimeter per snip. Then, all of a sudden, there is a noticeable little burst of activity as its moves a large file, which can be done in reasonably efficient chunks. Then it goes back to its nitting.

  2. You can run under some other OS on some other partition so that the partition undergoing defragging is not active, e. g. you might run your defragger under a tiny JavaOS partition to defrag your Win9x FAT (File Allocation Table), OS/2 HPFS (High Performance File System), NT NTFS (New Technology File System) and Linux ext-2 partitions in sequence.
  3. You do your defragging by copying from one partition to another empty one, perhaps on a spare hard disk.
  4. You run your program very early in the boot process before any other disk-using tasks have started. This is how PartitionMagic deals with a similar problem.
What sort of algorithm should you use? Here is a suggestion. Sort all your files so that you have a table of every cluster on disk and where its final destination should be. Order by last access date with the recently accessed files near the end. This table can be compressed where there are contiguous clusters within a file. You need plenty of RAM and virtual RAM. Order them by last access date.

Imagine the disk arms are like a bus picking up passengers and dropping them off as it sweeps back and forth across the disk moving clusters to their new locations. The clusters being moved by read/write are like passengers. Here are some rules of thumb.

  1. Don’t pick up any more passengers if the bus is full.
  2. Try to keep the bus full.
  3. Don’t pick up a passenger unless the passenger’s destination slot is currently free.
  4. Don’t pick up a passenger unless the destination slot is in the direction you are currently headed.
  5. Sweep the disk smoothly like an elevator. Don’t hop all over picking up and dropping off passengers the way most defraggers do in their passion to defrag the disk methodically starting at track 0.
  6. Prefer to pick up or drop off a line of passengers contiguous on disk.
  7. Prefer to pick up clusters belonging to badly fragmented files.
  8. Prefer to pick up clusters that belong to recently used files.
  9. Prefer to pick up passengers that are far from their destinations.
  10. If there are no passengers you can move directly to their destinations, free up some space by moving a busload of passengers to any free slots just past the end of where all the files will eventually be moved. Don’t move them all the way to the end of the disk the way Norton’s defragger does.
  11. Prefer to get recently accessed files (presumably frequently accessed) to their proper position.
  12. It is more important to get files to their approximate position on disk than to get them to their precise places.
I think the O & O Complete Access algorithm works something like this. Sort files by last access date and figure out where they ideally belong, oldest first, most recently acessed near the pool of free space where new files will be formed. Move the oldest file into place. If that spot is occupied, move the occupier to where it belongs. If the spot where occurier belong too is occupied, move it to where it belongs, etc. until you can finally move a file, then you can back up and do the postponed moves. If this process just leads you round in circles, you must break the logjam by moving some file out past the end, or to some temporarily unoccupied space, and later move it to its correct spot. I see two problems with this algorithm:
  1. They fiddle with the least important files first. If the defrag is interrupted you won’t see any improvement in disk speed.
  2. They make no attempt to optimise disk arm motions. The defrag is slower than it need be.
The conventional wisdom is that defraggers must be crashproof. If the power fails at any time during the defrag process, you must lose no data. Perfection is impossible, since you can always fail in the middle of a write to the directory or the free space table. However, you can aim for recovery if the failure occurs between writes. You must move files before you marked them moved in the directory and free space tables. For a moment they exist in two places. Only after the move is recorded in the directory may you reuse the old space. The problem is crashproofing makes the defrag take longer. NT takes this to ridiculous extremes. It would take days to properly defrag a disk using the official defrag one cluster at a time interface. Ironically this extra time exposes you to more potential for power failure.

Your algorithm can be adapted to work with any sort of partition. You could produce versions for various types of Unix partitions or even mainframe partitions as well. Your product might then sell for thousands of dollars per copy. You must have a way of locking the entire disk for your exclusive access, and there must be a way to do low level physical i/o.

You might try simulating the algorithms used in the commercial defraggers such as Norton SpeedDisk for Windows9X, and NT, Executive Software Diskeeper, Raxco PerfectDisk or PowerQuest PartitionMagic. Then if your algorithms are substantially faster, you might be able to persuade them to reissue their products using your algorithms.

You might use your algorithms in proprietary firmware inside a hard disk to handle defragging/marthaing that was transparent to the operating system, thus creating a seemingly super-fast drive. It would appear to have almost 0 seek time. It might be implemented internally as two separate drives that spend their lives mirroring each other while simultaneously defragging to the other drive. By having two sets of heads, either on the same set of surfaces on separate sets, you would get away with much less head motion. The software can then be as messy as it likes. The hardware will clean up after it. In fact, the irony will be, defragging such a smart hard disk with software will cause the system to temporarily slow down! The hard drive internally is totally unaware of the file and directory structure the operating system is using. It works purely with a giant list of physical clusters. Because everything happens inside the hard disk, there is absolutely no problem with the disk being used simultaneously with being defragged.

What if you are a poor student with no access to such firmware? You can write a simulator that can be used to compute just how much faster such a smart disk would be in real life. Then perhaps you could persuade one of the high end SCSI (Small Computer System Interface) hard disk makers to build a prototype. They might be interested. It is extremely difficult to build a hard disk even a tiny bit faster than your competitors. If you can do it, you can charge a big premium.

You might build a realistic prototype using a PCI (Peripheral Component Interconnect) KVM (Kilobyte Virtual Machine) card that has hard disk support. You then need to write a device driver for your card that makes it look like an ordinary disk drive. The KVM runs your Martha software.

If you had an OS that would let you reserve additional physical space for a file beyond what is written, then you could monitor file growth rate, and allocate additional space so that a file would stay in one fragment until the next defrag. I don’t think Windows lets you do this. You could do it in the old IBM (International Business Machines) OS/360. The problem with Windows is every log file instantly fragments again after a defrag.

You could also write a defragger than required a spare empty drive to copy the data to, tidying as it went sorting by last-access-date, then copy back, always in the background. The OS would be tricked into looking for a file on one of the two disks, treating the pair as if it were one logical disk. Even a file in the process of being moved could be simultaneously used by the actual apps, with writes always going to the target.


available on the web at:
ClustrMaps is down

optional Replicator mirror
on local hard disk J:

Please the feedback from other visitors, or your own feedback about the site.
Contact Roedy.
Your face IP:[]
You are visitor number 17,512.