Comparing Direct and Indirect IO times in sg
Most modern disks have a reasonably sized cache in the drive electronics, so transfer times can be measured without reading data from the media. In many cases this cache can be read using the SCSI Read Buffer command. Although the SCSI standards do not require any level of performance from the Read Buffer command, on most modern disks it can source data at close to the maximum bandwidth of the SCSI bus. [There have been some reports of disks that don't, and since the Read Buffer command is not mandatory it may not be implemented (e.g. on a RAID).] SCSI devices other than disks may also support the Read Buffer command.
In the sg utilities (both sg_utils and sg3_utils found on the main page) there is a timing program called sg_rbuf that uses the SCSI Read Buffer command. The sg_rbuf program first determines whether the Read Buffer command is available and the maximum size of the device cache that one command can read. The program then reads the device cache repeatedly until a given transfer size is reached (default is 200 MB). The user can optionally select the size read by each Read Buffer command (default is the maximum reported size of the device cache).
The default action of the sg_rbuf command (and sg in general) is to do indirect IO. To simulate direct IO before it was implemented, the "-q" flag in sg_rbuf suppresses the copy from the kernel buffer to user space. The sg_rbuf command from the sg3_utils package has a "-d" flag which requests direct IO; this is only supported by the sg version 3 driver. There are several reasons why direct IO may not be available, and in such cases the sg driver does indirect IO instead. The sg_rbuf command reports in its output when the "-d" flag was given but indirect IO was performed.
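The fallback from direct to indirect IO can be detected by a program because the sg version 3 interface reports the IO type that was actually used. The following is a minimal sketch, assuming the constant values from the Linux <scsi/sg.h> header; the helper names io_type and fell_back_to_indirect are hypothetical and this is not a copy of what sg_rbuf does internally.

```python
# Constants as defined in the Linux <scsi/sg.h> header (sg version 3 interface).
SG_FLAG_DIRECT_IO = 0x1        # set in sg_io_hdr.flags to request direct IO
SG_INFO_DIRECT_IO_MASK = 0x6   # bits of sg_io_hdr.info that report the IO type
SG_INFO_INDIRECT_IO = 0x0
SG_INFO_DIRECT_IO = 0x2
SG_INFO_MIXED_IO = 0x4


def io_type(info):
    """Decode the IO type bits of the 'info' field returned by the driver."""
    bits = info & SG_INFO_DIRECT_IO_MASK
    if bits == SG_INFO_DIRECT_IO:
        return "direct"
    if bits == SG_INFO_MIXED_IO:
        return "mixed"
    if bits == SG_INFO_INDIRECT_IO:
        return "indirect"
    return "unknown"


def fell_back_to_indirect(flags, info):
    """True if direct IO was requested but the driver reports indirect IO."""
    requested = bool(flags & SG_FLAG_DIRECT_IO)
    return requested and io_type(info) == "indirect"


if __name__ == "__main__":
    # Hypothetical values: direct IO requested, driver reports indirect IO.
    print(fell_back_to_indirect(SG_FLAG_DIRECT_IO, 0x0))
```

Only the two IO type bits are examined, so other bits in the info field do not affect the result.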
Timings were done for a variety of buffer sizes. Each cell shows the elapsed time in seconds, with the CPU utilization in parentheses.
buffer size | simulated direct IO (sg_rbuf -q) | direct IO (sg_rbuf -d) | indirect IO (sg_rbuf) |
512 KB | 15.1 (0%) | 16.3 (2%) | 26.3 (42%) |
256 KB | 16.1 (0%) | 17.3 (2%) | 27.2 (42%) |
32 KB | 30.5 (2%) | 31.9 (5%) | 35.4 (15%) |
16 KB | 47.0 (3%) | 48.8 (5%) | 50.4 (10%) |
8 KB | 79.9 (3%) | 82.1 (5%) | 83.5 (7%) |
4 KB | 145.7 (4%) | 147.6 (4%) | 149.1 (6%) |
| simulated direct IO | direct IO | indirect IO |
latency per command (implied) | 0.50 ms | 0.50 ms | 0.47 ms |
best throughput after latency | 73 MB/sec | 67 MB/sec | 40 MB/sec |
best raw throughput | 68 MB/sec | 63 MB/sec | 39 MB/sec |
worst raw throughput | 7.0 MB/sec | 7.0 MB/sec | 6.9 MB/sec |
The Unix time command was used to obtain elapsed times (and CPU utilization percentages). The implied latency per command = (4KB_time - best_time) / commands_for_4KB_test. The best throughput after latency = transfer_size / (best_time - (implied_latency_per_command * commands_for_best_test)).
There are various factors at play in these figures. Here is a summary:
Also, in sg version 3.1.18 all direct IO (dio) operations are disallowed by default. A user with root permissions can enable them with: "echo 1 > /proc/scsi/sg/allow_dio".
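Before timing with the "-d" flag it can be useful to check whether direct IO is currently allowed. The sketch below assumes that reading /proc/scsi/sg/allow_dio returns the integer last written to it (0 or 1); the function name dio_allowed is hypothetical.

```python
from pathlib import Path


def dio_allowed(proc_path="/proc/scsi/sg/allow_dio"):
    """Return True if direct IO is allowed, False if not, None if unknown.

    Assumes the proc file holds a single integer; None is returned when the
    file is missing, unreadable, or does not contain an integer.
    """
    try:
        text = Path(proc_path).read_text()
    except OSError:
        return None
    try:
        return int(text.strip()) != 0
    except ValueError:
        return None


if __name__ == "__main__":
    state = dio_allowed()
    if state is None:
        print("allow_dio not available (sg driver not loaded or too old?)")
    else:
        print("direct IO allowed" if state else "direct IO disallowed")
```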
In the upper part of the following table each cell shows two times, in seconds. The first, called direct, is the time to DMA the data from the disk buffer into the kernel buffers. The second, called indirect, is the direct time plus the additional time to copy the data from the kernel buffers into user space.
buffer size | A, 2xCeler 450, UW direct / indirect | B, PII 400, U2W direct / indirect | C, 2xCel 533, U2W direct / indirect (CPU util) | D, 2xPPro 200, fibre direct / indirect |
671 KB | | 5.55 / 8.02 | | |
512 KB | | 5.59 / 7.69 | 2.94 / 5.03 (41%) | 2.95 / 9.49 |
488 KB | 5.82 / 7.32 | 5.61 / 7.70 | 2.95 / 5.04 (41%) | 3.01 / 9.56 |
256 KB | 6.32 / 7.83 | 5.96 / 8.01 | 3.14 / 5.48 (42%) | 3.34 / 9.88 |
32 KB | 13.31 / 14.83 | 10.30 / 11.99 | 5.91 / 6.74 (11%) | 9.82 / 15.80 |
16 KB | 21.91 / 23.15 | 15.61 / 17.19 | 9.09 / 9.74 (9%) | 17.52 / 23.34 |
8 KB | 37.57 / 38.54 | 26.34 / 27.86 | 15.47 / 16.07 (6%) | 32.58 / 38.32 |
4 KB | 75.12 / 75.14 | 42.54 / 43.02 | 28.15 / 28.75 (5%) | 63.26 / 68.76 |
latency per command (implied) | 1.35 ms | 0.72 ms | 0.50 ms | 1.18 ms |
Best throughput after latency | 38 MB/sec | 38 MB/sec | 73 MB/sec | 80 MB/sec |
Best raw throughput | 34 MB/sec | 36 MB/sec | 68 MB/sec | 67 MB/sec |
Worst raw throughput | 2.7 MB/sec | 4.6 MB/sec | 7.0 MB/sec | 2.9 MB/sec |
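Because the indirect time is the direct time plus the kernel to user space copy, the difference between the two gives the cost of that copy. The sketch below works this out for the 512 KB row of systems C and D. It assumes the transfer size was the sg_rbuf default of 200 MB and that 1 MB means 2^20 bytes; neither is stated explicitly for this table.

```python
MIB = 1024 * 1024  # assumption: 1 MB = 2**20 bytes


def copy_overhead(direct_s, indirect_s, transfer_bytes=200 * MIB):
    """Return (copy time in seconds, effective copy rate in MB/sec)."""
    copy_s = indirect_s - direct_s
    if copy_s <= 0:
        raise ValueError("indirect time must exceed direct time")
    return copy_s, (transfer_bytes / MIB) / copy_s


if __name__ == "__main__":
    # (direct, indirect) times in seconds from the 512 KB row of the table.
    rows = {"C": (2.94, 5.03), "D": (2.95, 9.49)}
    for name, (direct_s, indirect_s) in rows.items():
        copy_s, rate = copy_overhead(direct_s, indirect_s)
        print(f"system {name}: copy {copy_s:.2f} s, about {rate:.0f} MB/sec")
```

On these figures the copy costs system D roughly three times as long as system C for the same amount of data.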
The SCSI Read Buffer command should not involve any physical disk IO but this is not guaranteed. It is assumed that the buffer on the disk can source data fast enough not to be significant in these timings. Ultra Wide SCSI has a maximum transfer rate of 40 MB/sec while Ultra 2 Wide's maximum transfer rate is 80 MB/sec. Fibre channel has a maximum transfer rate of around 100 MB/sec.
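To put the measured figures next to these bus limits, the best raw throughput of each system can be expressed as a fraction of its bus maximum. This is only a sketch: the dictionary names are hypothetical and the 100 MB/sec fibre channel figure is the approximate value quoted above.

```python
# Maximum transfer rates in MB/sec, as quoted in the text.
BUS_MAX_MB_S = {"UW": 40, "U2W": 80, "fibre": 100}

# (bus type, best raw throughput in MB/sec) taken from the table.
BEST_RAW = {
    "A": ("UW", 34),
    "B": ("U2W", 36),
    "C": ("U2W", 68),
    "D": ("fibre", 67),
}


def bus_fraction(system):
    """Fraction of the bus maximum reached by a system's best raw throughput."""
    bus, best = BEST_RAW[system]
    return best / BUS_MAX_MB_S[bus]


if __name__ == "__main__":
    for name in sorted(BEST_RAW):
        print(f"system {name}: {bus_fraction(name):.0%} of bus maximum")
```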
The implied latency per command = (4KB_direct_time - best_direct_time) / commands_for_4KB_test. The best throughput after latency = transfer_size / (best_direct_time - (implied_latency_per_command * commands_for_best_test)).
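The two formulas can be checked against the table with a short calculation. The sketch below uses the direct times for system C (28.15 s at 4 KB, 2.94 s at 512 KB) and assumes the transfer size was the sg_rbuf default of 200 MB with 1 MB = 2^20 bytes; the function names are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB  # assumption: 1 MB = 2**20 bytes


def implied_latency(t_4kb_s, t_best_s, transfer_bytes):
    """Implied latency per command in seconds."""
    commands_4kb = transfer_bytes // (4 * KIB)
    return (t_4kb_s - t_best_s) / commands_4kb


def throughput_after_latency(t_best_s, best_buf_bytes, latency_s, transfer_bytes):
    """Best throughput in MB/sec once per-command latency is subtracted."""
    commands_best = transfer_bytes // best_buf_bytes
    return (transfer_bytes / MIB) / (t_best_s - latency_s * commands_best)


if __name__ == "__main__":
    transfer = 200 * MIB  # assumed: sg_rbuf default transfer size
    latency = implied_latency(28.15, 2.94, transfer)
    rate = throughput_after_latency(2.94, 512 * KIB, latency, transfer)
    print(f"implied latency: {latency * 1000:.2f} ms per command")
    print(f"best throughput after latency: {rate:.0f} MB/sec")
```

Under these assumptions the results land close to the 0.50 ms and 73 MB/sec shown for system C; small differences come from rounding and from the assumed definition of a megabyte.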
System A has dual Celerons (300 MHz) overclocked to 450 MHz with 128 MB of RAM, and an Advansys 940UW controller with an IBM DCHS04U disk. The kernel used was 2.2.7. The system is owned by John Meijer (meijer@pathcom.com) and the tests were done by the author.
System B has a Pentium II running at 400 MHz with an Adaptec 7890 controller and an IBM DRSV09V disk. The kernel used was 2.2.5-ac3. Tests performed by Jens Axboe (axboe@suse.de).
System C has dual 533 MHz Celerons with 128 MB of RAM on a BP6 motherboard and an Advansys 3940U2W controller with a Seagate ST318451LW disk. The kernel used was 2.4.0-test12. The CPU utilizations shown in parentheses are of one of the available CPUs for the indirect transfer; CPU utilizations on the direct transfers were negligible. System owned and tests done by the author (dgilbert@interlog.com).
System D has 2xPPro at 200 MHz with 256 MB of RAM, and a Qlogic QLA2100 FC HBA with a Seagate ST19171GC fibre channel disk. The kernel was 2.2.5 with the sg driver upgraded (to what later appeared in 2.2.6). The tests were run by Matthew Jacob (mjacob@feral.com), who wrote the SCSI low level driver.
Doug Gilbert (dgilbert@interlog.com)
Last updated: 8th April 2001