Computers spend a considerable amount of time copying files, often with a command called COPY
built into the script processor. In Linux, the copy command is called cp. COPY uses the CWD (Current Working Directory) as both the default
for the source of the files and the target. This is rarely what you want, but that is the tradition and hardly
anyone questions it.
Windows is a case-insensitive operating system, but that does not mean you can forget about case. For example,
let us assume you have a file called Abc.txt in C:\temp and a
file called aBc.txt in D:\temp, and you type copy C:\temp\abC.txt D:\temp. What is the name of
the file in D:\temp when you are done?
Hint: the answer rhymes with the most popular word in advertising.
Nothing could be simpler conceptually than copying a file, but to do it efficiently is surprisingly complex.
It is easy and common to write appallingly bad copy code, and I have seen a lot of it over the years. I think the problem is that modern programmers treat
the OS (Operating System) and the hardware like a black box. In the olden days, when I cut my teeth, every programmer had to be intimately aware of just what was going
on physically inside hard disks and OSes. With a realistic mental model of what is going on under the hood,
you would never dream of writing some of the silly copying code I have seen.
If you write your own COPY, or have a COPY function included in some other program, there are several things to consider:
- How big a buffer should you use to hold the data in intermediate RAM (Random Access Memory) before you write it back to the target disk? The bigger the buffer, the faster the copy,
but there are diminishing returns as the buffer grows.
Further, by grabbing too much RAM, you
could slow down other tasks, or even your own program, since the buffer lives in virtual RAM,
not necessarily real RAM.
It is a Goldilocks problem: the copy will
slow down if the buffer is either too small or too big, and the sweet spot depends very much on what else is going on in the computer, completely unrelated to your copy.
A smart copy would monitor performance and dynamically adjust the size of the buffer. For copying large files, I
typically use either a 32K or 64K stream buffer.
For small files, I read the entire file in one fell swoop, with
no stream buffer at all and a precisely file-sized FileInputStream byte buffer.
In the olden days, 512 bytes was common, the size of one sector.
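The two strategies above can be sketched in Java as follows. QuickCopy is a hypothetical class name, and the 64K buffer size and 1 MB small-file cutoff are my illustrative choices, not tuned values:

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class QuickCopy {
    private static final int BUFFER_SIZE = 64 * 1024;          // 64K stream buffer for big files
    private static final long SMALL_FILE_LIMIT = 1024 * 1024;  // arbitrary 1 MB cutoff

    public static void copy(File source, File target) throws IOException {
        if (source.length() <= SMALL_FILE_LIMIT) {
            // small file: read the whole thing in one fell swoop, no stream buffer,
            // using a precisely file-sized byte buffer
            byte[] wholeFile = new byte[(int) source.length()];
            try (DataInputStream in = new DataInputStream(new FileInputStream(source))) {
                in.readFully(wholeFile);
            }
            try (FileOutputStream out = new FileOutputStream(target)) {
                out.write(wholeFile);
            }
        } else {
            // large file: pump it through a fixed-size buffer
            byte[] buffer = new byte[BUFFER_SIZE];
            try (InputStream in = new FileInputStream(source);
                 OutputStream out = new FileOutputStream(target)) {
                int bytesRead;
                while ((bytesRead = in.read(buffer)) > 0) {
                    out.write(buffer, 0, bytesRead);
                }
            }
        }
    }
}
```

A smarter version would watch throughput and resize the buffer on the fly; this sketch just picks the strategy once, by file size.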
- If there is much chance the copy will fail from lack of room on the target, check that there is sufficient space
before you start. Unfortunately, this check can be quite an expensive operation on some operating systems. Further, there is
no guarantee some other task won't allocate most of the free space a millisecond later. To truly reserve the space, you must actually allocate it
ahead of time.
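A minimal pre-flight check can use the standard File.getUsableSpace() call. SpaceCheck and roomFor are hypothetical names, and as noted above the answer can be stale the moment it is returned:

```java
import java.io.File;

public class SpaceCheck {
    /**
     * Rough pre-flight check before a copy. Caveats: getUsableSpace can be
     * slow on some operating systems and network drives, and nothing stops
     * another task from grabbing the free space right after we look.
     */
    public static boolean roomFor(File source, File targetDir) {
        long needed = source.length();
        long free = targetDir.getUsableSpace();
        return free > needed;
    }
}
```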
- Allocate the space for the target all at once, rather than implicitly, cluster by cluster as you go. The resulting target file will more likely be
contiguous, and the copying process will be faster. You know how big the source file is before you start, so you know
how big the target will be too. Even when you modify the file slightly as you go, allocate the approximate
amount of space you need, then add to or trim the target file later as needed.
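One way to claim the space up front in Java is RandomAccessFile.setLength(). This is only a sketch, with the caveat that some filesystems satisfy setLength with a sparse file, so physically reserved, contiguous clusters are not guaranteed:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class Preallocate {
    /**
     * Claim space for the target all at once, before copying any data.
     * Caveat: on some filesystems setLength creates a sparse file, so the
     * clusters are not guaranteed to be physically allocated or contiguous.
     */
    public static void preallocateTarget(File source, File target) throws IOException {
        try (RandomAccessFile out = new RandomAccessFile(target, "rw")) {
            out.setLength(source.length()); // we know the source size, so we know the target size
        }
        // ... copy the data into the already-allocated file here, then call
        // setLength again to trim if the final size came up short.
    }
}
```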
- What if the copy fails for some reason in the middle? It is best to copy the file to a temporary file, and only
after the copy has completed, quickly delete the old target file and rename the temp to the target's name. A power failure in the
middle of the copy will then leave you with the old target file intact. Of course, this technique requires more free space
on the target drive than simply overwriting the old target would. HTML Tidy does this, but has a habit of failing just after the delete
but before the rename. If I detect the trouble, I can repair it manually by doing the rename myself.
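A sketch of the copy-to-temp-then-rename technique using java.nio.file.Files. SafeReplace is a hypothetical name; whether the final move is truly atomic depends on the filesystem, which is why the temp file lives in the target's own directory:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SafeReplace {
    /**
     * Copy source to a temp file, then swap it in for the target in one move,
     * so an interrupted copy leaves the old target intact.
     */
    public static void safeCopy(Path source, Path target) throws IOException {
        // put the temp file in the SAME directory as the target,
        // so the final move stays on one volume and can be a cheap rename
        Path temp = Files.createTempFile(target.toAbsolutePath().getParent(), "copy", ".tmp");
        try {
            Files.copy(source, temp, StandardCopyOption.REPLACE_EXISTING);
            // replace the old target in a single step; filesystem permitting,
            // there is no window where the target is missing
            Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            Files.deleteIfExists(temp); // don't leave orphan temp files behind
            throw e;
        }
    }
}
```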
- Copying between two physical drives is more than twice as fast as copying to another spot on the same physical disk.
The arms barely have to move. Because drives have transparent internal caching, you can even get reading and
writing going on fully simultaneously. With one physical drive, only one I/O operation can be in progress at a time.
- If you have an SSD (Solid State Disk), you can use it as your buffer. Copy the
file to it, then to the target. Then your source and target disk arms will barely have to move. The whole trick
to copying quickly is reducing arm motions and rotational latency (waiting for data you want to spin round
under the read head). That’s why doing I/O in big chunks works. You don’t ask the arms to move as
often, or to wait for a particular sector on disk to spin round as often.
- Multitasking generally does not buy you anything. If you get several copies going at once all you do is
fibrillate the disk heads. Disk hardware can do only one operation at a time.
- Your application could just spawn/exec a command processor and feed it a script of COPY
commands, trusting that whoever wrote the code for the command processor was clever. The problem with that
approach is your code will not be portable, and the COPY code might have been written by a Microsoft intern.
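Delegating to the platform's own copy command might look like this sketch. ShellCopy is a hypothetical name, and the code itself illustrates the portability problem: the command line has to differ between Windows and everything else, and all you get back is an exit code:

```java
import java.io.IOException;

public class ShellCopy {
    /**
     * Delegate the copy to the platform's command processor.
     * Not portable: the command line differs per OS, and error
     * reporting is reduced to an exit code.
     */
    public static int shellCopy(String source, String target)
            throws IOException, InterruptedException {
        boolean windows = System.getProperty("os.name").toLowerCase().contains("win");
        ProcessBuilder pb = windows
            ? new ProcessBuilder("cmd", "/c", "copy", "/y", source, target)
            : new ProcessBuilder("cp", source, target);
        pb.inheritIO(); // let the child's messages show through
        return pb.start().waitFor();
    }
}
```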
- If the file is compressed, just copy across the raw compressed bytes. Don't waste time decompressing and recompressing.
Similarly for encoded files, just copy the raw bytes. Don't decode and re-encode, unless you are trying to change the encoding.
- Just when you thought you understood all this sufficiently to write a perfect COPY method, consider what happens when other, unrelated tasks
unexpectedly lock, or attempt to lock, either the source or target file for read or write at any time before or during your copy.