Java File System


This essay does not describe an existing computer program, just one that should exist. This essay is about a suggested student project in Java programming. This essay gives a rough overview of how it might work. I have no source, object, specifications, file layouts or anything else useful to implementing this project. Everything I have prepared to help you is right here.

This project outline is not like the artificial, tidy little problems you are spoon-fed in school, where all the facts you need are included, nothing extraneous is mentioned and the answer is fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems, where it is up to you to fully define the end point, or a series of ever more difficult versions of this project, and to research the information yourself to solve them.

Everything I have to say to help you with this project is written below. I am not prepared to help you implement it or give you any additional materials. I have too many other projects of my own.

Though I am a programmer by profession, I don’t do people’s homework for them. That just robs them of an education.

You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.

Please do not email me about this project without reading the disclaimer above.

One problem with Java is it lacks a portable file system. Because every file system is different on every platform, the classic java.io.File API implements only the lowest common denominator. With it, you can’t set hidden, archive or system attribute bits.

What you do is allocate a giant file, say 10 gigabytes. It looks like one file to the host OS (Operating System), but to Java it will be a file system containing many files. If you do your work well, eventually it may be the only file system of some future OS. You invent methods to allocate space inside this file for subfiles. You want to create all the ordinary file services:

  1. Look up file by fully qualified name.
  2. Look up file by high order part of name.
  3. Look up file by low order part of name.
  4. Look up files by last access or modify date. This allows a program like AltaVista Discovery to rapidly discover the files that have changed and need reindexing. Recently deleted files need to be logged too so that indexers can drop them from the indexes. Perhaps privileged application programs should be allowed to hook themselves in and monitor all file status changes system wide.
  5. Enumerate files that match some filter composed of a number of positive and negative wildcards, or a user-written filter.
  6. Look up file by handle.
  7. Read/write, sequential and random.
  8. Caching, including delayed-write (lazy-write) caching.
  9. associations — by extension is a bit dangerous. A file should be permitted to have several associations. You record the mime type of each file. None of the goofiness of Linux running file type guessers that sniff the first few bytes of the file or Windows file extensions that can have a multiplicity of meanings.
  10. Which app installed/created this file? Is it user data, or part of an app package? URL (Uniform Resource Locator) where I can get an update of it if it is corrupt.
  11. Filters so that you can rapidly search only files that contain your data, not vendor files.
  12. Security — owners, read-only, hidden, system, file signatures/keys/checksums.
  13. Dates — last access, last update.
  14. locking.
  15. defragging. See the notes on the defragging project. The whole key to a fast file system is to organise your disk layout to minimise arm movement, i.e. to keep it defragged as you go.
  16. Marthaing: dynamically keeping the most active files in the prime real estate near the outer edge (near the beginning of the disk), near the swap file and the directory, and the least active files and the bulk of the disk space in the inner tracks (near the end of the disk). This could be done on a per-cylinder, per-file or per-cluster basis.

    This file placement might presume a weekly reorg, or even better an ongoing low priority background reorg.

    When you start using a file, it gets moved by a background process to prime real estate near the MFT (Master File Table), swapfile and directories. Background processes move files that have not been used in a while to less desirable real estate further away, freeing up space, but not too much free space, in the desirable real estate territory. The bulk of your free space should be out in the least desirable territory. Roughly speaking, files should be contiguous and in order by last access date and within that by directory, with the most recently accessed files in the more prime locations.

    The reorg process might be more integrated with ordinary I/O, so that whenever you write, that cluster gets moved to prime real estate as a side effect. You might do all your writes sequentially, interleaving writes from various files. If you had a two headed disk, you could dedicate one head to writes and one to reads. The write head would not move much. You write in a sequential circular squirrel cage pattern to a hot zone on disk. A background process moves the oldest data out of the hot zone so you don’t wrap around and overwrite. You can think of it as a second level of disk caching. You have a primary cache in RAM (Random Access Memory) and a secondary level in prime disk real estate.

    CPUs (Central Processing Units) are getting faster relative to hard disk. RAM for caching control information is getting cheaper. This means that, in future, it will pay to use more and more elaborate techniques to minimise disk head motions. You will be able to afford to do a lot of thinking to avoid a tiny bit of needless movement.

    This leaves the problem of a giant file that you update only rarely. Perhaps the positioning algorithm needs to take into account both frequency of use and last access date. Perhaps you need a decaying hit counter. You multiply previous hits by 0.9 and add today’s hits to determine priority for prime real estate. (The actual decay algorithm would be more sophisticated.) You would also factor in size. A 1K file with 10 hits is much more deserving of prime real estate than a 100 MB file with 10 hits.

  17. Temporary files — clean up of orphaned temporary files. Tracking who orphaned them.
  18. Getting more advanced, you might want an API (Application Programming Interface) for persistent objects. Traditional file systems are hierarchical. You need something so you can find stuff in a million different ways.
  19. Your file system would probably be designed with the idea of massive amounts of RAM being available, and would not worry quite so much about getting everything on rotating media immediately. You would only have to deal with graceful shutdown, perhaps with 10 minutes notice from your UPS (Uninterruptible Power Supply). This would give you a big speed boost over traditional file and database systems that are overly paranoid.
  20. Record which encoding was used to create a text file.
  21. Record the associated DTD (Document Type Definition) for an XML (Extensible Markup Language) file.
  22. hooks to file viewer and editor.
  23. Ability to retrieve older versions of a file. Like a simplified CVS (Concurrent Versions System), it stores the differences between generations allowing it to reconstruct older versions.
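
The decaying hit counter from item 16 might be sketched like this. It is only a rough illustration: the class name, the once-a-day rollover and the idea of dividing the score by file size are my assumptions, and as noted above a real decay algorithm would be more sophisticated.

```java
/** Hypothetical sketch of the decaying hit counter described in item 16. */
public class DecayingHitCounter {
    /** Decay factor from the text: previous hits are multiplied by 0.9. */
    public static final double DECAY = 0.9;

    private double score;     // decayed hit count carried over from earlier days
    private long hitsToday;   // raw hits since the last rollover

    /** Record one access to the file. */
    public void hit() {
        hitsToday++;
    }

    /** Call once per day: fold today's hits into the decayed score. */
    public void endOfDay() {
        score = score * DECAY + hitsToday;
        hitsToday = 0;
    }

    public double score() {
        return score;
    }

    /**
     * Assumption: priority is decayed hits per byte, so that a small busy
     * file outranks a huge file with the same number of hits.
     */
    public static double priority(double score, long sizeInBytes) {
        return score / Math.max(1L, sizeInBytes);
    }

    public static void main(String[] args) {
        DecayingHitCounter counter = new DecayingHitCounter();
        for (int i = 0; i < 10; i++) {
            counter.hit();
        }
        counter.endOfDay();   // score = 0 * 0.9 + 10
        counter.endOfDay();   // idle day: score = 10 * 0.9 + 0
        System.out.println(counter.score());

        // Same score, very different sizes: the 1K file should win.
        double small = priority(10, 1024);
        double large = priority(10, 100L * 1024 * 1024);
        System.out.println(small > large);
    }
}
```

A background reorg process could sort files by this priority and move the highest-ranked ones toward the prime real estate.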
You can start very simply, without a directory, requiring users to allocate files by absolute block numbers and just ensuring you don’t allocate the same block twice. You have been bitching that Microsoft and IBM (International Business Machines) could not write a file system to save their lives. Here’s your chance to see if you can write a faster one. Once you prove your structures, maybe the Microsoft, IBM or Linux folk will consider supporting your new partition type.
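
That simplest starting point, one big host file with blocks handed out by absolute number and never twice, might look roughly like the sketch below. The class name BlockStore, the 4096-byte block size and the in-memory BitSet are all my assumptions for illustration. A real version would have to persist the allocation map inside the container file itself.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical sketch: one host file carved into fixed-size blocks. */
public class BlockStore implements AutoCloseable {
    /** Assumed block size; the essay does not specify one. */
    public static final int BLOCK_SIZE = 4096;

    private final RandomAccessFile file;
    private final int blockCount;
    private final BitSet allocated;   // kept in memory only in this sketch

    public BlockStore(File host, int blockCount) throws IOException {
        this.file = new RandomAccessFile(host, "rw");
        this.blockCount = blockCount;
        this.allocated = new BitSet(blockCount);
        // Reserve the whole container up front, like the "giant file".
        file.setLength((long) blockCount * BLOCK_SIZE);
    }

    /** Claim a block by absolute number; refuses to hand it out twice. */
    public void allocate(int block) {
        checkRange(block);
        if (allocated.get(block)) {
            throw new IllegalStateException("block " + block + " is already allocated");
        }
        allocated.set(block);
    }

    /** Return a block to the free pool. */
    public void free(int block) {
        checkRange(block);
        allocated.clear(block);
    }

    /** Write up to one block of data at the start of an allocated block. */
    public void write(int block, byte[] data) throws IOException {
        checkAllocated(block);
        if (data.length > BLOCK_SIZE) {
            throw new IllegalArgumentException("data exceeds one block");
        }
        file.seek((long) block * BLOCK_SIZE);
        file.write(data);
    }

    /** Read the first length bytes of an allocated block. */
    public byte[] read(int block, int length) throws IOException {
        checkAllocated(block);
        if (length < 0 || length > BLOCK_SIZE) {
            throw new IllegalArgumentException("bad length: " + length);
        }
        byte[] buffer = new byte[length];
        file.seek((long) block * BLOCK_SIZE);
        file.readFully(buffer);
        return buffer;
    }

    private void checkRange(int block) {
        if (block < 0 || block >= blockCount) {
            throw new IndexOutOfBoundsException("no such block: " + block);
        }
    }

    private void checkAllocated(int block) {
        checkRange(block);
        if (!allocated.get(block)) {
            throw new IllegalStateException("block " + block + " is not allocated");
        }
    }

    @Override
    public void close() throws IOException {
        file.close();
    }

    public static void main(String[] args) throws IOException {
        File host = File.createTempFile("blockstore", ".bin");
        host.deleteOnExit();
        try (BlockStore store = new BlockStore(host, 16)) {
            store.allocate(3);
            store.write(3, "hello".getBytes(StandardCharsets.UTF_8));
            System.out.println(new String(store.read(3, 5), StandardCharsets.UTF_8));
        }
    }
}
```

Once this much works, a directory that maps names to block lists can be layered on top without changing the block-level code.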

I suggest that the file system directory be implemented as an SQL (Structured Query Language) database. That handles integrity and all manner of alternative indexing for you automatically. It also allows a low-level hook where the user can query the directory using SQL code.
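
To get a feel for what those alternative indexes have to answer before committing to a database, here is a small in-memory stand-in. It is not SQL and not a proposed design: the class DirectoryIndex, its Entry fields and its method names are hypothetical. It keeps two sorted maps, roughly what an index on name and an index on modify date would give you, and answers lookups by full name, by high order part of the name, and by last modify date.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for a directory table with two indexes. */
public class DirectoryIndex {
    /** One directory row; the fields are illustrative, not a real layout. */
    public static final class Entry {
        public final String name;
        public final long lastModified;   // epoch milliseconds
        public final int firstBlock;

        Entry(String name, long lastModified, int firstBlock) {
            this.name = name;
            this.lastModified = lastModified;
            this.firstBlock = firstBlock;
        }
    }

    // Plays the role of an index on the name column.
    private final TreeMap<String, Entry> byName = new TreeMap<>();
    // Plays the role of an index on the modify-date column.
    private final TreeMap<Long, List<Entry>> byModified = new TreeMap<>();

    /** Insert a file, or replace the row of an existing one. */
    public void put(String name, long lastModified, int firstBlock) {
        Entry entry = new Entry(name, lastModified, firstBlock);
        Entry old = byName.put(name, entry);
        if (old != null) {
            byModified.get(old.lastModified).remove(old);
        }
        byModified.computeIfAbsent(lastModified, k -> new ArrayList<>()).add(entry);
    }

    /** Look up by fully qualified name; null if absent. */
    public Entry byFullName(String name) {
        return byName.get(name);
    }

    /** Look up by high order part of the name, in sorted order. */
    public List<String> byPrefix(String prefix) {
        // Every key starting with prefix sorts before prefix + '\uffff'.
        return new ArrayList<>(
            byName.subMap(prefix, true, prefix + Character.MAX_VALUE, false).keySet());
    }

    /** Names of files modified at or after the given time, oldest first. */
    public List<String> modifiedSince(long since) {
        List<String> names = new ArrayList<>();
        for (List<Entry> bucket : byModified.tailMap(since, true).values()) {
            for (Entry entry : bucket) {
                names.add(entry.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        DirectoryIndex dir = new DirectoryIndex();
        dir.put("/docs/a.txt", 100L, 1);
        dir.put("/docs/b.txt", 300L, 2);
        dir.put("/apps/x.jar", 200L, 3);
        System.out.println(dir.byPrefix("/docs/"));
        System.out.println(dir.modifiedSince(200L));
    }
}
```

Lookup by low order part of the name would need a third index, for example on the reversed name. Every extra query shape costs another hand-maintained structure like this, which is the bookkeeping an SQL database would take over.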

Contact Roedy. Please feel free to link to this page without explicit permission.
