Fast copy for disks

Introduction

This page examines ways to improve the speed of copying one disk to another. Copying a partition from one disk to another is a closely related problem. The dd(1) command (see man dd) is tailor-made for this purpose. Given the multi-gigabyte size of modern disks, a disk to disk copy can take an hour or more. Also of interest are the CPU utilization during a copy and any other adverse effects on system performance. This page looks at using the sg_dd and sgp_dd utilities and various other techniques to improve the speed of transfers. Although primarily aimed at SCSI disks (or cdroms/dvds), these utilities and techniques can also improve copy times between IDE disks.

Device special files

SCSI disks have block special file names of the form /dev/sda, /dev/sdb etc. IDE disks have block special file names of the form /dev/hda, /dev/hdb etc. Both types of disks contain one or more partitions which are addressed by adding a numeric value (starting at 1) to the above block special file names. The block subsystem maintains buffers to cache data in case it is re-accessed. When something as large as a disk to disk copy is being done, the caching of data is probably a waste of time that has the side effect of pushing other useful data out of the cache. The block subsystem also makes it difficult to run read and write operations in parallel (but it does offer "read ahead" and "write behind" capability which is some consolation).

SCSI generic (sg) character special file names are of the form /dev/sg0, /dev/sg1 etc, and some of these will correspond to any SCSI disks in the system (see sg_map in the sg_utils package for the mapping). To use sg devices to access disk partitions, block offsets and lengths need to be found (typically with a command like 'fdisk -ul /dev/sda') and fed to the skip, seek and count arguments of the sg_dd or sgp_dd utilities.
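The arithmetic behind those arguments can be sketched as follows. This is only an illustration: it assumes the partition listing gives the first and last sector of the partition in 512 byte units, and both the helper name partition_args and the sector numbers are made up for the example rather than taken from a real disk.

```shell
#!/bin/sh
# Hypothetical helper: turn a partition's first and last sector
# (assumed to be in 512 byte units, last sector inclusive) into
# dd-style skip and count values.
partition_args() {
    start=$1    # first sector of the partition
    end=$2      # last sector of the partition (inclusive)
    count=$((end - start + 1))
    echo "skip=$start count=$count"
}

# Example with made-up numbers: a partition spanning sectors 63 to 2104514
partition_args 63 2104514    # prints: skip=63 count=2104452
```

The printed values would then be given to sg_dd or sgp_dd along with bs=512. When the partition is the output rather than the input, the same start sector goes to the seek argument instead of skip.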

Raw devices have character special file names of the form /dev/raw/raw1, /dev/raw/raw2, etc. They can be bound to existing block special files such as disks or their partitions. Raw devices can be bound with raw(8) (see man raw) and utilities that use them need to meet various alignment requirements. Recent versions of dd can be used on raw devices. The lmdd utility from the lmbench package is also suitable. The sg_dd and sgp_dd utilities have been recently modified to meet the alignment requirements of raw devices. Raw device support is standard in the Linux kernel (lk) 2.4 series and is available as a patch in later versions of the lk 2.2 series. [Some major distributions include the raw device patch in their lk 2.2 series products.]

The simple approach

For disks of similar size and geometry, the following could be used to copy one to the other:
$ dd if=/dev/sda of=/dev/sdb bs=512
Here 'bs' is given the physical sector size of the disk. Often a larger integral multiple of 512 is recommended (e.g. 8192 bytes) but this does not make a large difference in speed due to the buffer cache being used by these two device files. More significant is the fact that dd is single threaded, which means that reads and writes are done in strict sequence. Given read and write times that are roughly similar, a dd-based copy will be at least twice as slow as the best case. As the IO times of the two disks involved in the copy become dissimilar, the time of the slower medium tends to dominate and the slowdown factor of a dd-based copy becomes less pronounced. Also, system performance during a dd copy of this type will be very lethargic, as outlined below.
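The reasoning about strict sequencing can be put into a toy model. The sketch below is an idealization, not a measurement: it assumes a sequential copy costs the read time plus the write time, while a perfectly overlapped copy costs only the larger of the two. The function names and the times fed to them are illustrative.

```shell
#!/bin/sh
# Toy model of copy time in whole seconds.
# Sequential: every read must finish before its write starts.
sequential_time() { echo $(( $1 + $2 )); }
# Overlapped: reads and writes proceed in parallel, so the slower side dominates.
overlapped_time() { if [ "$1" -gt "$2" ]; then echo "$1"; else echo "$2"; fi; }

# Similar disks: 500 s to read, 500 s to write
sequential_time 500 500     # prints 1000
overlapped_time 500 500     # prints 500 (sequential is 2 times slower)

# Dissimilar disks: 500 s to read, 2000 s to write
sequential_time 500 2000    # prints 2500
overlapped_time 500 2000    # prints 2000 (sequential is only 1.25 times slower)
```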

dd, lmdd, sg_dd, sgp_dd and hdparm

The dd command is a single threaded program that typically is given block special devices (e.g. /dev/sdb) when used to do a disk to disk copy. Recent versions of dd can take large arguments (up to 2^31 in the i386 architecture) in the 'skip', 'seek' and 'count' arguments. This gives it the flexibility to cope with disks up to 1 TB (assuming 512 byte sector size). Recent versions of dd can also take raw device names as input and/or output arguments. The main advantage of doing this is that the buffer cache will not be polluted with data from the copy; a minor advantage is the time saved from not doing the futile check on the buffer cache to see if the new data to be read is already present. The dd command cannot be used on sg devices as these require a special protocol. [Typically no serious damage is done to the data on a disk that has its "sg" device name given to dd by mistake.]
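The 1 TB figure follows directly from the argument limit and the sector size. A quick check, assuming a shell whose arithmetic is 64 bits wide (the variable names are just for illustration):

```shell
#!/bin/sh
# Largest byte offset reachable with a block number of up to 2^31
# and 512 byte sectors.
max_blocks=$((1 << 31))
sector_size=512
max_bytes=$((max_blocks * sector_size))
echo "$max_bytes bytes"                          # prints: 1099511627776 bytes
echo "$((max_bytes / 1024 / 1024 / 1024)) GB"    # prints: 1024 GB (i.e. 1 TB)
```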

The lmdd command is part of the lmbench package. It can take both block special devices and raw devices as its input and output (but not sg devices). On completion, rather than output the number of blocks (fully and partially) transferred, lmdd outputs throughput and timing information. Its 'skip' argument cannot cope with a block number that resolves to a byte offset greater than 2 GB and it does not support a seek argument. So it is not well suited to copying data (e.g. partitions) contained on large disks. lmdd is primarily useful for timing IO.

The sg_dd command is single threaded like dd. While it does not offer any of the conversion facilities of dd, it does offer the other features listed above with the addition of recognizing sg devices given as arguments. The sgp_dd command uses POSIX threads to run up to 16 worker tasks (default 4) in parallel. Each worker task is a loop that starts by reading from the next available block offset and when that is complete, writing that data out. Locks are used to maintain read and write sequencing. These locks have little performance impact when sg devices are used but slow down raw and block special device access (since the locks are held over the entire read() and write() operation). Both sg_dd and sgp_dd can be found in the sg_utils package.
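For readers without sgp_dd at hand, the general idea of overlapping reads with writes can be loosely imitated with two ordinary dd processes joined by a pipe. This is a different technique from sgp_dd's worker threads and locks, it cannot be used with sg devices (dd does not speak their protocol), and whether it helps on a given system would have to be measured. The helper name piped_copy is hypothetical, and the demonstration below deliberately uses temporary files rather than real disks.

```shell
#!/bin/sh
# Hypothetical helper: copy SRC to DST with two dd processes joined by a
# pipe, so the reader can fetch the next chunk while the writer is still
# writing an earlier one.
piped_copy() {
    dd if="$1" bs=64k 2>/dev/null | dd of="$2" bs=64k 2>/dev/null
}

# Demonstration on ordinary temporary files, not disks.
src=$(mktemp) && dst=$(mktemp)
dd if=/dev/zero of="$src" bs=512 count=2048 2>/dev/null    # 1 MB test file
piped_copy "$src" "$dst"
cmp "$src" "$dst" && echo "copy verified"
rm -f "$src" "$dst"
```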

The hdparm command (see man hdparm) is primarily aimed at adjusting the parameters of IDE disks. It includes a "-t" option to 'perform timings of device reads for benchmark and comparison purposes'. With this option the given device may be a SCSI disk (or any other type of disk). It calculates throughput by reading 64 MB of data. The code of hdparm suggests that the following two commands are roughly similar:
$ hdparm -t /dev/sda
$ dd if=/dev/sda of=/dev/null bs=1024k count=64

Test Hardware

The following tests were done on an AMD K6-2 500 MHz machine with 64 MB of RAM and an Advansys 940UW ultra wide adapter and two ultra wide disks of the same model. The disks are IBM DCHS04Us with a capacity of 4.4 GBytes each (8,813,870 blocks of 512 bytes). An ultra wide SCSI bus has a bandwidth of 40 MBytes/sec. [A third SCSI disk on a separate controller contains all mounted partitions and the swap partition. This is to minimize the effects of normal system operations (e.g. cron jobs) on the measurements.] As shown below, the streaming continuous data rate both reading and writing to those disks is roughly 10 MBytes/sec so the SCSI bus has bandwidth to spare even when those 2 disks are working to capacity. Linux kernel 2.4.0-test12 was used for the measurements. On the test system /dev/sdb, /dev/sg1 and /dev/raw/raw1 refer to one DCHS04U disk while /dev/sdc, /dev/sg2 and /dev/raw/raw2 refer to the other DCHS04U disk.

Drive characteristics

First some separate read and write tests at different block addresses on those disks. The read and write tests respectively were:
$ time sg_dd if=/dev/sg1 of=/dev/null bs=512 count=200k skip=<block_number>
$ time sg_dd if=/dev/zero of=/dev/sg2 bs=512 count=200k seek=<block_number>

So a 100 MByte transfer is being measured in both cases. The following results are in seconds (from the elapsed time provided by the time command).

**Time in seconds to read/write 100 MB at different block offsets**
	block 0	block 4,000,000	block 7,000,000	block 8,000,000	block 8,600,000
read	10.42	10.88	13.12	14.56	16.02
write	10.41	10.87	13.11	14.59	16.02

This shows that read and write speed tends to degrade (in a non-linear fashion) from 10 MBytes/sec to 6 MBytes/sec. An average of 9 MBytes/sec may not be a bad guess for reading (or writing) the complete disk on these figures. This works out to around 8 minutes 20 seconds (500 seconds) to read the complete disk.
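These figures can be rederived from the table with a few lines of awk. The sketch treats each transfer as 100 MB, as the text does, and for the whole-disk estimate takes 1 MB as 10^6 bytes; the function name throughput is just for illustration.

```shell
#!/bin/sh
# Throughput in MB/sec for a 100 MB transfer that took the given
# number of seconds.
throughput() { awk -v secs="$1" 'BEGIN { printf "%.1f\n", 100 / secs }'; }

throughput 10.42    # block 0:         prints 9.6
throughput 16.02    # block 8,600,000: prints 6.2

# Estimated seconds to read the whole disk (8,813,870 blocks of 512 bytes)
# at an assumed average of 9 MB/sec.
awk 'BEGIN { printf "%.0f\n", 8813870 * 512 / 1000000 / 9 }'    # prints 501
```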

The reason for this degradation is that modern disks have more sectors on their outer tracks than on the inner ones. Since the spindle speed is usually fixed, there is simply less data to read on the inner tracks. Usually block 0 is on the outermost track while the last block is on the innermost track. Those who set up benchmarks should be aware of this characteristic and make sure partitions are assigned at roughly similar block addresses on comparison machines.

**Times to read a 4.4 GB disk (into /dev/null)**
	time (seconds)	time (mins:secs)	CPU utilization
dd if=/dev/sdb bs=512	552	9:12	25 %
dd if=/dev/sdb bs=8192	549	9:09	24 %
sg_dd if=/dev/sg1	506	8:26	9 %
sg_dd if=/dev/raw/raw1	506	8:26	7 %

The transfers using sg_dd have 'bs=512' as is required to match the physical block size of the disk. The 'bpt' (blocks per transfer) argument of sg_dd is left at its default value of 128 which means 64 KB transfers are being done per IO operation. The dd read times using /dev/sdb are slower than they were in the lk 2.2 series. This is probably due to a new "elevator" design in the block subsystem buffer cache. The news is not all bad as the elevator seems to improve block device copy times (over earlier measurements done on the lk 2.2 series).

The times to write /dev/zero to the whole disk are roughly the same as the above "read" table. This leads us to the expectation that the best copy time from one disk to another will be around 510 seconds (8 minutes 30 seconds). A strictly synchronous copy with only one thread should take about twice this time.

**Time to copy a 4.4 GB disk to another 4.4 GB disk**
	time (seconds)	time (mins:secs)	CPU utilization
dd [sdb -> sdc] bs=512	1075	17:55	26 %
dd [sdb -> sdc] bs=8192	1062	17:42	27 %
sgp_dd [sdb -> sdc]	994	16:34	32 %
sg_dd [sg1 -> sg2]	1342	22:22	6 %
sg_dd [raw1 -> raw2]	1351	22:31	5 %
sgp_dd [sg1 -> sg2]	511	8:31	17 %
sgp_dd [raw1 -> raw2]	560	9:20	12 %
sgp_dd [sg1 -> raw2]	559	9:19	17 %
sgp_dd [sg1 -> sdc]	528	8:48	47 %
sgp_dd [raw1 -> sg2]	563	9:23	15 %

All of the above sg_dd and sgp_dd commands have 'bs=512' and 'bpt' (blocks per transfer) at its default value of 128 which meant transfers of 64 KB per IO operation were being attempted. The sgp_dd command had 4 worker threads attempting to run in parallel.

Quantifying the system lethargy when either the input or output file is a block device (e.g. sdb or sdc) is rather tricky but here is a rough measure. The time to start X is recorded on the test machine. Each test is done 3 times approximately a minute apart. The first column of times is when the machine is doing nothing other than starting X. The next column shows the times to start X when dd is being used for a disk to disk copy via the raw devices. The last column is the time to start X when the same disk to disk copy is being done, this time via the block devices. Timings were done with a wrist watch.

**System lethargy. Time (in secs) to start X.**
	doing nothing else	dd [raw1 -> raw2]	dd [sdb -> sdc]
First time	15	16	23
Second time	7	8	23
Third time	7	8	23

Analysis

The strictly synchronous, single-threaded sg_dd gave the worst copy times. 22 minutes is almost 3 times the read and write times for each disk. This is probably caused by rotational latency effects on the disk and would most likely be improved by increasing (or decreasing) the transfer size which was 64 KB. The times associated with copying via the block devices (i.e. sdb and sdc) approach a factor of 2 greater than the best time at 16 to 17 minutes. The "read ahead" and "write behind" action of the elevator in the block cache seems to be using the SCSI disk tagged queuing to negate any rotational latency problems. The "winners" are the sg and raw devices when used with sgp_dd. In these cases reads and writes (of already fetched data) are being done in parallel. The sg devices have the edge since they can queue commands to the disks, a technique that cannot be used with raw devices while maintaining read/write sequencing.
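The factors quoted in this analysis can be checked against the tables above. The following is a small sketch using awk; the helper name ratio is made up for the example and the times are those from the read and copy tables.

```shell
#!/bin/sh
# Ratio of two elapsed times, to two decimal places.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'; }

ratio 1342 506    # sg_dd [sg1 -> sg2] vs. sg_dd read of one disk: prints 2.65
ratio 1075 511    # dd bs=512 vs. sgp_dd [sg1 -> sg2]:             prints 2.10
ratio 1062 511    # dd bs=8192 vs. sgp_dd [sg1 -> sg2]:            prints 2.08
```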

While the CPU utilization figures are interesting, they fail to convey the effect of using block devices (e.g. sdb or sdc) as either input or output on overall system performance. When this happens, the previous contents of the buffer cache are flushed out and filled with transient data from the disks being copied. The "system lethargy" table above demonstrates this point. It should also be noted that the elevator design in the buffer cache is not optimized for this type of "abuse". Various defaults of the elevator can be modified with the elvtune utility.

Other results

Abel Deuring <a.deuring@satzbau-gmbh.de> has done similar tests on 2 (older) SCSI disks and found a 35% and a 52% reduction in copy times (depending on which direction he copied). While the percent reduction was not as high as above, it does corroborate these results. Abel also did some tests on a RAID unit and got roughly similar results.

There is a (now archived) variant of sgp_dd called sgq_dd that uses asynchronous IO techniques on sg devices within a single threaded main loop. While the two programs yielded almost identical timings for a disk to disk copy, sgq_dd had a smaller CPU utilization. However sgq_dd would not have been of any benefit to block special and raw devices since they do not lend themselves to asynchronous IO techniques. So, as fast web server designers have noted, there is still a place for designs that use asynchronous IO rather than multiple threads.

Conclusion

When copying large amounts of data between disks (or their partitions), using a multi-threaded variant of the classic Unix dd command can yield considerable speed improvements. There are overall system performance benefits to not using block special devices (such as /dev/hda or /dev/sdb) for large copies. Raw devices and a program like sgp_dd should give good results for IDE disks. Sg devices and sgp_dd gave the best results for SCSI disk copies in these measurements.

Doug Gilbert (dgilbert@interlog.com)
Last updated: 3rd January 2001