SCSI generic (sg) character special file names are of the form /dev/sg0, /dev/sg1, etc., and some of these will correspond to any SCSI disks in the system (see sg_map in the sg_utils package for the mapping). To use sg devices to access disk partitions, block offsets and lengths need to be found (typically with a command like 'fdisk -ul /dev/sda') and fed to the 'skip', 'seek' and 'count' arguments of the sg_dd or sgp_dd utilities.
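For example, to copy a single partition via an sg device, the start sector and length that 'fdisk -ul' reports for that partition can be fed to sg_dd. A sketch, with illustrative values only:
$ fdisk -ul /dev/sda
$ sg_dd if=/dev/sg0 of=/tmp/part1.img bs=512 skip=63 count=2048193
Here 'skip=63' would be the partition's start sector and 'count=2048193' its length in 512 byte sectors, both taken from the fdisk output.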
Raw devices are character special files with names of the form /dev/raw/raw1, /dev/raw/raw2, etc. They can be bound to existing block special files such as disks or their partitions with the raw(8) command (see 'man raw'), and utilities that use them need to meet various alignment requirements. Recent versions of dd can be used on raw devices, and the lmdd utility from the lmbench package is also suitable. The sg_dd and sgp_dd utilities have recently been modified to meet the alignment requirements of raw devices. Raw device support is standard in the lk 2.4 series and is available as a patch in later versions of the lk 2.2 series. [Some major distributions include the raw device patch in their lk 2.2 series products.]
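As a sketch (the device names are illustrative), raw devices could be bound and the bindings queried like this:
$ raw /dev/raw/raw1 /dev/sdb
$ raw /dev/raw/raw2 /dev/sdc1
$ raw -qa
The first two commands bind raw1 to the whole disk /dev/sdb and raw2 to the partition /dev/sdc1; 'raw -qa' lists all current bindings.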
The lmdd command is part of the lmbench package. It can take both block special devices and raw devices as its input and output (but not sg devices). On completion, rather than outputting the number of blocks (fully and partially) transferred as dd does, lmdd outputs throughput and timing information. Its 'skip' argument cannot cope with a block number that resolves to a byte offset greater than 2 GB, and it does not support a 'seek' argument, so it is not well suited to copying data (e.g. partitions) held on large disks. lmdd is primarily useful for timing IO.
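As an illustrative sketch, a 100 MByte read might be timed with lmdd like this (lmdd's 'of=internal' pseudo file discards the data read):
$ lmdd if=/dev/raw/raw1 of=internal bs=64k count=1600
On completion lmdd reports the amount of data moved, the elapsed time and the throughput.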
The sg_dd command is single threaded like dd. While it does not offer any of the conversion facilities of dd, it does offer the other features listed above and additionally recognizes sg devices given as arguments. The sgp_dd command uses POSIX threads to run up to 16 worker tasks (4 by default) in parallel. Each worker task is a loop that starts by reading from the next available block offset and, when that read is complete, writing that data out. Locks are used to maintain read and write sequencing. These locks have little performance impact when sg devices are used but slow down raw and block special device access (since the locks are held over the entire read() and write() operations). Both sg_dd and sgp_dd can be found in the sg_utils package.
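For instance (device names illustrative), the number of worker tasks can be raised from the default with sgp_dd's 'thr' argument:
$ sgp_dd if=/dev/sg1 of=/dev/sg2 bs=512 thr=8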
The hdparm command (see 'man hdparm') is primarily aimed at adjusting the parameters of IDE disks. It includes a "-t" option to 'perform timings of device reads for benchmark and comparison purposes'. With this option the given device may be a SCSI disk (or any other type of disk). It calculates throughput by reading 64 MB of data. The code of hdparm suggests that the following two commands are roughly similar:
$ hdparm -t /dev/sda
$ dd if=/dev/sda of=/dev/null bs=1024k count=64
In the following measurements a 100 MByte transfer is timed, starting at various 512-byte block addresses on the disk. The results are in seconds (from the elapsed time reported by the time command).
      | block 0 | block 4,000,000 | block 7,000,000 | block 8,000,000 | block 8,600,000
read  |   10.42 |           10.88 |           13.12 |           14.56 |           16.02
write |   10.41 |           10.87 |           13.11 |           14.59 |           16.02
This shows that read and write speeds tend to degrade (in a non-linear fashion) from 10 MBytes/sec to 6 MBytes/sec across the disk. An average of 9 MBytes/sec may not be a bad guess for reading (or writing) the complete disk on these figures. This works out to around 8 minutes 20 seconds (500 seconds) to read the complete disk.
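For reference, a single read cell of the above table could be reproduced with a command along these lines (204800 blocks of 512 bytes make up the 100 MBytes; the device name and block offset are illustrative):
$ time dd if=/dev/sda of=/dev/null bs=512 skip=4000000 count=204800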
The reason for this degradation is that modern disks have more sectors
on their outer tracks than on the inner ones. Since the spindle speed is
usually fixed, there is simply less data to read on the inner tracks. Usually
block 0 is on the outermost track while the last block is on the innermost
track. Those who set up benchmarks should be aware of this characteristic
and make sure partitions are assigned at roughly similar block addresses
on comparison machines.
Times to read the whole disk:

                       | time (seconds) | time (mins:secs) | CPU utilization
dd if=/dev/sdb bs=512  |            552 |             9:12 |            25 %
dd if=/dev/sdb bs=8192 |            549 |             9:09 |            24 %
sg_dd if=/dev/sg1      |            506 |             8:26 |             9 %
sg_dd if=/dev/raw/raw1 |            506 |             8:26 |             7 %
The transfers using sg_dd have 'bs=512', as is required to match the physical block size of the disk. The 'bpt' (blocks per transfer) argument of sg_dd is left at its default value of 128, which means 64 KB transfers are being done per IO operation. The dd read times using /dev/sdb are slower than they were in the lk 2.2 series. This is probably due to a new "elevator" design in the block subsystem buffer cache. The news is not all bad, as the elevator seems to improve block device copy times (over earlier measurements done on the lk 2.2 series).
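Written out in full, with the 'bpt' default made explicit, the sg_dd rows of the table correspond to commands such as the following (if sg_dd cannot deduce the disk's size, a 'count' argument covering the whole disk would also be needed):
$ time sg_dd if=/dev/sg1 bs=512 bpt=128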
The times to write /dev/zero to the whole disk are roughly the same as those in the above "read" table. This leads us to the expectation that
the best copy time from one disk to another will be around 510 seconds
(8 minutes 30 seconds). A strictly synchronous copy with only one thread
should take about twice this time.
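One way the /dev/zero write timing mentioned above might be obtained is shown below (warning: this destroys the contents of /dev/sdb):
$ time dd if=/dev/zero of=/dev/sdb bs=8192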
Times to copy the whole disk from one disk to another:

                        | time (seconds) | time (mins:secs) | CPU utilization
dd [sdb -> sdc] bs=512  |           1075 |            17:55 |            26 %
dd [sdb -> sdc] bs=8192 |           1062 |            17:42 |            27 %
sgp_dd [sdb -> sdc]     |            994 |            16:34 |            32 %
sg_dd [sg1 -> sg2]      |           1342 |            22:22 |             6 %
sg_dd [raw1 -> raw2]    |           1351 |            22:31 |             5 %
sgp_dd [sg1 -> sg2]     |            511 |             8:31 |            17 %
sgp_dd [raw1 -> raw2]   |            560 |             9:20 |            12 %
sgp_dd [sg1 -> raw2]    |            559 |             9:19 |            17 %
sgp_dd [sg1 -> sdc]     |            528 |             8:48 |            47 %
sgp_dd [raw1 -> sg2]    |            563 |             9:23 |            15 %
All of the above sg_dd and sgp_dd commands have 'bs=512' and leave 'bpt' (blocks per transfer) at its default value of 128, which means transfers of 64 KB per IO operation were being attempted. The sgp_dd commands used the default of 4 worker threads attempting to run in parallel.
Quantifying the system lethargy when either the input or output file
is a block device (e.g. sdb or sdc) is rather tricky but here is a rough
measure. The time to start X is recorded on the test machine. Each test
is done 3 times approximately a minute apart. The first column of times
is when the machine is doing nothing other than starting X.
The next column shows the times to start X when dd is being used for
a disk to disk copy via the raw devices. The last column is the time to
start X when the same disk to disk copy is being done, this time via the
block devices. Timings were done with a wrist watch.
Time to start X (seconds):

            | doing nothing else | dd [raw1 -> raw2] | dd [sdb -> sdc]
First time  |                 15 |                16 |              23
Second time |                  7 |                 8 |              23
Third time  |                  7 |                 8 |              23
While the CPU utilization figures are interesting, they fail to convey the effect that using block devices (e.g. sdb or sdc) as either input or output has on overall system performance. When a block device is used, the previous contents of the buffer cache are flushed out and replaced with transient data from the disks being copied. The "system lethargy" table above demonstrates this point. It should also be noted that the elevator design in the buffer cache is not optimized for this type of "abuse". Various defaults of the elevator can be modified with the elvtune utility.
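As a sketch (the latency values are purely illustrative, not recommendations), a block device's elevator settings can be viewed and changed like this:
$ elvtune /dev/sdb
$ elvtune -r 1024 -w 2048 /dev/sdb
The first command displays the current settings; the second sets the read and write latency values.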
There is a (now archived) variant of sgp_dd called sgq_dd that uses asynchronous IO techniques on sg devices within a single threaded main loop. While the two programs yielded almost identical timings for a disk to disk copy, sgq_dd had a lower CPU utilization. However, sgq_dd would have been of no benefit with block special and raw devices since they do not lend themselves to asynchronous IO techniques. So, as the designers of fast web servers have noted, there is still a place for designs that use asynchronous IO rather than multiple threads.
Doug Gilbert (dgilbert@interlog.com)
Last updated: 3rd January 2001