Introduction
Computers spend a considerable amount of time copying files, often with a command
called COPY built into the command processor. In Linux, the
copy command is called cp. COPY
uses the CWD (Current Working Directory)
as the default for both the source of the files and the target. This is rarely what
you want, but that is the tradition and hardly anyone questions it. Take Command also
has a copy command. Another common copy command is
xcopy, specialised for copying trees of files. Note the
command is just plain copy, not copy.exe; it is an internal command built into the command processor,
not a separate executable.
Case Sensitivity
Windows is a case-insensitive operating system, but that does not mean you can
forget about case. For example, let us assume you have a file called Abc.txt in C:\temp and a file called
aBc.txt in D:\temp and you type
copy C:\temp\abC.txt D:\temp.
What is the name of the file in D:\temp when you are done?
- Abc.txt
- aBc.txt
- abC.txt
- abc.txt
- ABC.txt
The answer is (2), aBc.txt. The copy replaces the contents of the existing target file
but keeps that file’s name in its original case; the case of the name you type on the
command line is ignored.
Under the Hood
Nothing could be simpler conceptually than copying a file, but to do it
efficiently is surprisingly complex. It is easy and common to write appallingly bad
copy code and I have seen a lot of it over the years. I think the problem is modern
programmers treat the OS (Operating System)
and the hardware like a black box. In the olden days, when I cut my teeth, every
programmer had to be intimately aware of just what was going on physically inside hard
disks and OSes (Operating Systems). With a realistic mental model of what is going on under the hood, you
would never dream of writing some of the silly copying code I have seen. If you write
your own COPY, or have a COPY
function included in some other program, there are several things to consider:
- How big a buffer should you use to hold the data in intermediate
RAM (Random Access Memory)
before you write it back to the target disk? In general, the bigger the buffer,
the faster the copy, but there are diminishing returns for bigger buffer sizes.
Further, by grabbing too much RAM, you could slow down other tasks, or even your own
program, since the buffer lives in virtual RAM,
not necessarily real RAM. It is a Goldilocks problem: the copy will slow
down if the buffer is either too small or too big. It depends very much on what
else is going on in the computer, completely unrelated to your copy. A smart copy
would monitor performance and dynamically adjust the size of the buffer. For
copying large files, I typically use either a 32K or
64K stream buffer. For small files, I read the entire
file in one fell swoop, with no stream buffer at all and a precisely file-sized
FileInputStream byte buffer (see the first sketch after this list). In the olden days,
512 bytes was common, the size of one sector.
- If there is much chance the copy will fail from lack of room on the target, check
whether there is sufficient space before you start. Unfortunately, this check can be
quite an expensive operation on some operating systems. Further, there is no
guarantee some other task won’t allocate most of the free space a millisecond
later. To truly reserve the space, you must actually allocate it ahead of time,
as the next point describes (see the second sketch after this list).
- Allocate the space for the target all at once, rather than implicitly, cluster
by cluster as you go. The resulting target file is more likely to be contiguous and
the copying process will be faster. You know how big the source file is before you
start, so you know how big the target will be too. Even when you modify the file
slightly as you go, allocate the approximate amount of space you need, then add to
or trim the target file later as needed.
- What if the copy fails for some reason in the middle? It is best to copy the
file to a temporary file and, only after the copy has completed, quickly delete the
old target file and rename the temp to the target’s name (see the third sketch after
this list). A power failure in the middle of the copy will then leave you with the
old target file intact. Of course, this technique requires more free space on the
target drive than you would need to simply overwrite the old target. HTMLTidy does
this, but has a habit of failing just after the delete but before the rename. If I
detect the trouble, I can repair it manually by doing the rename myself.
- Copying between two physical drives can be more than twice as fast as copying to
another spot on the same physical disk. The arms barely have to move. Because
drives have transparent internal caching, you can even get reading and writing
going on fully simultaneously. With one physical drive, only one I/O operation can
be in progress at a time.
- If you have an SSD (Solid State Disk), you can use it as your
buffer. Copy the file to it, then to the target. Then your source and target disk
arms will barely have to move. The whole trick to copying quickly is reducing arm
motions and rotational latency (waiting for data you want to spin round under the
read head). That’s why doing I/O in big chunks works. You don’t ask
the arms to move as often, or to wait for a particular sector on disk to spin
round as often.
- Multitasking generally does not buy you anything. If you get several copies
going at once all you do is fibrillate the disk heads. Disk hardware can do only
one operation at a time.
- Your application could just spawn/exec a command processor and feed it a script
of COPY commands (see the fourth sketch after this list), trusting that whoever
wrote the code for the command processor was clever. The problem with that approach
is your code will not be portable and the COPY code might have been written
by a Microsoft intern.
- If the file is compressed, just copy across the raw compressed bytes.
Don’t waste time decompressing and recompressing. Similarly for encoded
files, just copy the raw bytes. Don’t decode and encode, unless you are
trying to change the encoding.
- Just when you thought you understood all this sufficiently to write a perfect
COPY method, consider what happens when other unrelated tasks, unexpectedly and at
any time before or during your copy, lock or attempt to lock either the source or
target file for read or write.
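
Here is a minimal sketch in Java of the buffered copy from the first point above.
The 64K stream buffer and the whole-file read for small files come from the text;
the 100K small-file threshold is an arbitrary assumption for illustration, not a
measured optimum.

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class BufferedCopy {
        // arbitrary threshold below which the whole file is read at once.
        private static final int SMALL_FILE = 100 * 1024;
        // 64K stream buffer for large files.
        private static final int BUFFER_SIZE = 64 * 1024;

        public static void copy(File source, File target) throws IOException {
            long length = source.length();
            if (length <= SMALL_FILE) {
                // small file: one precisely file-sized buffer, no stream buffer.
                byte[] image = new byte[(int) length];
                try (FileInputStream in = new FileInputStream(source)) {
                    int off = 0;
                    while (off < image.length) {
                        int got = in.read(image, off, image.length - off);
                        if (got < 0) throw new IOException("premature EOF");
                        off += got;
                    }
                }
                try (FileOutputStream out = new FileOutputStream(target)) {
                    out.write(image);
                }
            } else {
                // large file: stream it through 64K buffers.
                try (BufferedInputStream in = new BufferedInputStream(
                             new FileInputStream(source), BUFFER_SIZE);
                     BufferedOutputStream out = new BufferedOutputStream(
                             new FileOutputStream(target), BUFFER_SIZE)) {
                    byte[] chunk = new byte[BUFFER_SIZE];
                    int got;
                    while ((got = in.read(chunk)) != -1) {
                        out.write(chunk, 0, got);
                    }
                }
            }
        }
    }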
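
The space check and up-front allocation from the next two points might look like
the following sketch. File.getUsableSpace and RandomAccessFile.setLength are
standard Java APIs, but note that on some file systems setLength merely creates a
sparse file rather than truly reserving blocks, so treat this as best effort.

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class Preallocate {
        public static void reserve(File source, File target) throws IOException {
            long needed = source.length();
            // cheap sanity check; another task may still grab the
            // free space a millisecond later.
            if (target.getAbsoluteFile().getParentFile().getUsableSpace() < needed) {
                throw new IOException("insufficient room on target drive");
            }
            // allocate the space all at once rather than cluster by cluster,
            // so the target is more likely to be contiguous.
            try (RandomAccessFile raf = new RandomAccessFile(target, "rw")) {
                raf.setLength(needed);
            }
        }
    }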
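
The copy-to-a-temporary technique is sketched below with java.nio.file. The temp
file is created beside the target so the final rename does not cross file systems;
the delete-then-rename pair at the end is the same narrow window HTMLTidy has a
habit of failing in.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class SafeCopy {
        public static void copy(Path source, Path target) throws IOException {
            // build the copy as a temp file in the target's directory.
            Path temp = Files.createTempFile(target.toAbsolutePath().getParent(),
                    "copy", ".tmp");
            try {
                Files.copy(source, temp, StandardCopyOption.REPLACE_EXISTING);
                // quickly delete the old target and rename the temp into place.
                // a power failure before this point leaves the old target intact;
                // the vulnerable window is just these two calls.
                Files.deleteIfExists(target);
                Files.move(temp, target);
            } finally {
                // clean up the temp if anything above failed.
                Files.deleteIfExists(temp);
            }
        }
    }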
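
For completeness, the spawn-a-command-processor approach might look like this
ProcessBuilder sketch. cmd.exe and its internal copy command are Windows-only,
which is precisely the portability problem noted above.

    import java.io.IOException;

    public class SpawnCopy {
        public static void copy(String source, String target)
                throws IOException, InterruptedException {
            // /c tells cmd.exe to run one command and exit; copy is internal
            // to cmd.exe, so it cannot be exec'ed directly. /y suppresses
            // the overwrite prompt.
            Process p = new ProcessBuilder(
                    "cmd.exe", "/c", "copy", "/y", source, target)
                    .inheritIO()
                    .start();
            int rc = p.waitFor();
            if (rc != 0) {
                throw new IOException("copy failed, exit code " + rc);
            }
        }
    }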
Futures
Copying is such a fundamental operation, it should be built into the OS. Why?
- Because only the OS knows how abundant RAM is. Only it can allocate an optimally
sized buffer.
- Because the copies can then be safely hidden from virus scanners which
currently needlessly scan every copy, slowing things down.
- So that copies can occur in the background, with the optimum number of threads,
but appear to apps as if they were instantaneous.
- So that copies can be done using hardware assist with very low overhead.
- So that we can retire all the incompetent copying code built into applications.
- Because the copy code can be highly optimised once, and that optimisation will apply to every copy done in every app.