Comparing Direct and Indirect IO times in sg

Introduction

The normal action of the sg driver when data is read from a SCSI device is a two-stage process:
  1. use the SCSI adapter to transfer the data from the SCSI device to a kernel buffer
  2. then copy the data from that kernel buffer to user space.
A write operation is similar (with the direction and the above stages reversed). Most SCSI adapters have a DMA element that speeds the first step and reduces the load on the CPU. In the sg documentation this double handling of data is called "indirect IO". Why do it like this? The simple answer is that it is easier. The term "direct IO" is used by the sg driver when the above two steps are combined into one. This document gives timing and CPU utilization figures comparing direct and indirect IO.
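
To make the distinction concrete, here is a minimal sketch (not taken from any sg utility) of a single READ(10) command issued through the sg version 3 SG_IO ioctl. Setting SG_FLAG_DIRECT_IO in the flags field requests direct IO; leaving flags at 0 gives the default, indirect IO. The device name, transfer size and timeout below are arbitrary placeholders.

/* Sketch: one READ(10) through the sg v3 interface, selecting indirect
 * or direct IO with the flags field. Error handling is pared down. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(void)
{
    unsigned char cdb[10] = {0x28, 0, 0, 0, 0, 0, 0, 0, 8, 0}; /* READ(10): 8 blocks from LBA 0 */
    unsigned char buff[8 * 512];
    unsigned char sense[32];
    struct sg_io_hdr hdr;
    int fd = open("/dev/sg0", O_RDWR);      /* placeholder sg device */

    if (fd < 0) { perror("open"); return 1; }
    memset(&hdr, 0, sizeof(hdr));
    hdr.interface_id = 'S';
    hdr.cmd_len = sizeof(cdb);
    hdr.cmdp = cdb;
    hdr.dxfer_direction = SG_DXFER_FROM_DEV;
    hdr.dxfer_len = sizeof(buff);
    hdr.dxferp = buff;
    hdr.mx_sb_len = sizeof(sense);
    hdr.sbp = sense;
    hdr.timeout = 20000;                    /* milliseconds */
    hdr.flags = SG_FLAG_DIRECT_IO;          /* 0 here would give the default, indirect IO */
    if (ioctl(fd, SG_IO, &hdr) < 0)
        perror("SG_IO");
    else
        printf("SCSI status: 0x%x\n", hdr.status);
    close(fd);
    return 0;
}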

SCSI Read Buffer command

Usually, modern SCSI (parallel and fibre channel) bus bandwidth exceeds the ability of a single disk to source data from the media. As an example, a Seagate ST318451LW disk spins at 15K rpm and has a 3.9 ms access time yet can "only" read sequential data at around 40 MB/sec. Such a disk would typically be connected to an Ultra 160 SCSI bus which offers 4 times as much bandwidth as a single disk of this type can utilize.

Rather than reading data from the media, these timings use the reasonably large cache that most modern disks have in their drive electronics. In many cases this cache can be read using the SCSI Read Buffer command. Although the SCSI standards do not require any level of performance from the Read Buffer command, on most modern disks this command can source data at close to the maximum bandwidth of the SCSI bus. [There have been some reports of disks that don't; also, the Read Buffer command is not mandatory, so it may not be implemented at all (e.g. on a RAID).] SCSI devices other than disks may also optionally support the Read Buffer command.

In the sg utilities (both sg_utils and sg3_utils, found on the main page) there is a timing program called sg_rbuf that uses the SCSI Read Buffer command. The sg_rbuf program first determines whether the Read Buffer command is available and the maximum size of the device cache that can be read by one command. The program then repeatedly reads the device cache until a given transfer size is reached (the default is 200 MB). The user can optionally select the size read by each Read Buffer command (the default is the maximum reported size of the device cache).
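
The following sketch shows that approach in C using the sg version 3 SG_IO ioctl: query the readable buffer size with Read Buffer in descriptor mode, then read that buffer repeatedly until 200 MB have been transferred. It is only an illustration, not the sg_rbuf source, and it assumes the device at the placeholder name /dev/sg0 implements the Read Buffer command.

/* Sketch of the sg_rbuf idea: find the readable buffer size with
 * READ BUFFER (descriptor mode), then read that buffer repeatedly
 * until roughly 200 MB have been transferred. Not the real sg_rbuf. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

static int rbuf(int fd, unsigned char mode, unsigned char *buff, unsigned int len)
{
    unsigned char cdb[10];
    unsigned char sense[32];
    struct sg_io_hdr hdr;

    memset(cdb, 0, sizeof(cdb));
    cdb[0] = 0x3C;                  /* READ BUFFER */
    cdb[1] = mode;                  /* 0x3 = descriptor, 0x2 = data */
    cdb[6] = (len >> 16) & 0xff;    /* allocation length, big-endian */
    cdb[7] = (len >> 8) & 0xff;
    cdb[8] = len & 0xff;
    memset(&hdr, 0, sizeof(hdr));
    hdr.interface_id = 'S';
    hdr.cmd_len = sizeof(cdb);
    hdr.cmdp = cdb;
    hdr.dxfer_direction = SG_DXFER_FROM_DEV;
    hdr.dxfer_len = len;
    hdr.dxferp = buff;
    hdr.mx_sb_len = sizeof(sense);
    hdr.sbp = sense;
    hdr.timeout = 20000;            /* milliseconds */
    return ioctl(fd, SG_IO, &hdr);
}

int main(void)
{
    unsigned char desc[4];
    unsigned char *buff;
    unsigned int cap;
    long long done, total = 200LL * 1024 * 1024;    /* 200 MB */
    int fd = open("/dev/sg0", O_RDWR);              /* placeholder device */

    if (fd < 0) { perror("open"); return 1; }
    if (rbuf(fd, 0x3, desc, sizeof(desc)) < 0)      /* descriptor mode */
        { perror("READ BUFFER (descriptor)"); return 1; }
    cap = (desc[1] << 16) | (desc[2] << 8) | desc[3];   /* buffer capacity */
    if (0 == cap) { fprintf(stderr, "no readable buffer\n"); return 1; }
    printf("device buffer capacity: %u bytes\n", cap);

    buff = malloc(cap);
    for (done = 0; done < total; done += cap)
        if (rbuf(fd, 0x2, buff, cap) < 0)           /* data mode */
            { perror("READ BUFFER (data)"); return 1; }
    free(buff);
    return 0;
}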

The default action of the sg_rbuf command (and sg in general) is to do indirect IO. To simulate direct IO before it was implemented, the "-q" flag of sg_rbuf suppresses the copy from the kernel buffer to user space. The sg_rbuf command from the sg3_utils package has a "-d" flag which requests direct IO; this is only supported by the sg version 3 driver. There are several reasons why direct IO may not be available, and the sg driver falls back to indirect IO in such cases. The sg_rbuf command reports in its output if the "-d" flag was given but indirect IO was performed.
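
A program can make the same check itself: after an SG_IO ioctl returns, the sg version 3 driver records in the info field of the sg_io_hdr whether direct, indirect or mixed IO was actually used. A fragment in the style of the earlier sketch:

/* hdr is the sg_io_hdr just passed to ioctl(fd, SG_IO, &hdr);
 * the driver reports what kind of IO it actually performed. */
switch (hdr.info & SG_INFO_DIRECT_IO_MASK) {
case SG_INFO_DIRECT_IO:
    printf("direct IO was used\n");
    break;
case SG_INFO_MIXED_IO:
    printf("partly direct, partly indirect IO\n");
    break;
default:            /* SG_INFO_INDIRECT_IO */
    printf("indirect IO was used\n");
    break;
}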

Time and Utilization comparison

The following tests were done on a system running lk 2.4.2-ac24 with sg driver version 3.1.18. The test system had dual 533 MHz Celerons with 128 MB of RAM on a BP6 motherboard and an Advansys 3940U2W controller with a Seagate ST318451LW disk. That SCSI adapter has a maximum SCSI bus transfer rate of 80 MB/sec.

Timings were done for a variety of buffer sizes.
 
 

buffer size | simulated direct IO | direct IO    | indirect IO
            | (sg_rbuf -q)        | (sg_rbuf -d) | (sg_rbuf)
------------+---------------------+--------------+-------------
512 KB      | 15.1  (0%)          | 16.3  (2%)   | 26.3  (42%)
256 KB      | 16.1  (0%)          | 17.3  (2%)   | 27.2  (42%)
32 KB       | 30.5  (2%)          | 31.9  (5%)   | 35.4  (15%)
16 KB       | 47.0  (3%)          | 48.8  (5%)   | 50.4  (10%)
8 KB        | 79.9  (3%)          | 82.1  (5%)   | 83.5  (7%)
4 KB        | 145.7  (4%)         | 147.6  (4%)  | 149.1  (6%)

Time in seconds (CPU utilization) for 1024 MByte transfer

 
                              | simulated direct IO | direct IO  | indirect IO
------------------------------+---------------------+------------+-------------
latency per command (implied) | 0.50 ms             | 0.50 ms    | 0.47 ms
best throughput after latency | 73 MB/sec           | 67 MB/sec  | 40 MB/sec
best raw throughput           | 68 MB/sec           | 63 MB/sec  | 39 MB/sec
worst raw throughput          | 7.0 MB/sec          | 7.0 MB/sec | 6.9 MB/sec

Some throughput numbers derived from above table

The Unix time command was used to obtain the elapsed times (and CPU utilization percentages). The implied latency per command = (4KB_time - best_time) / commands_for_4KB_test. The best throughput after latency = transfer_size / (best_time - (implied_latency_per_command * commands_for_best_test)).
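
As a worked example of those formulas, here is the arithmetic for the direct IO column in a few lines of C (the numbers come straight from the table above; 1 MB means 1024*1024 bytes):

#include <stdio.h>

int main(void)
{
    double total_mb = 1024.0;                 /* size of the whole transfer */
    double best_time = 16.3, best_kb = 512.0; /* direct IO, 512 KB buffers */
    double worst_time = 147.6;                /* direct IO, 4 KB buffers */
    double cmds_4kb = total_mb * 1024.0 / 4.0;      /* 262144 commands */
    double cmds_best = total_mb * 1024.0 / best_kb; /*   2048 commands */

    double latency = (worst_time - best_time) / cmds_4kb;      /* seconds per command */
    double tput = total_mb / (best_time - latency * cmds_best);

    printf("implied latency: %.2f ms per command\n", latency * 1000.0); /* ~0.50 ms */
    printf("throughput after latency: %.0f MB/sec\n", tput);           /* ~67 MB/sec */
    return 0;
}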

There are various factors at play in these figures.

Some of these factors are a bit technical. For example, "kiobufs" are a kernel mechanism that facilitates direct IO. The "raw" devices (e.g. /dev/raw/raw1) use the same mechanism. Anyone who is interested in using direct IO should also read the "notes" for the sg version 3 driver listed on the main page as there are some other issues.

Also, in sg version 3.1.18 all direct IO (dio) operations are disallowed by default. They can be enabled by a user with root permissions doing: "echo 1 > /proc/scsi/sg/allow_dio".

Other simulated direct IO figures

Following is data collected from several systems some time ago. The "direct IO" figures in this case are actually simulated direct IO (i.e. obtained from "sg_rbuf -q"). System C below is the same hardware used to obtain the results in the prior section. The upper part of the following table shows the number of seconds it took to transfer 200 MB (1 MB == 1024**2 bytes) on 4 different hardware configurations. Multiple SCSI Read Buffer commands were used to transfer the 200 MB. The size of the buffer used for each command is given to the left of each row. The maximum size of this buffer depends on the disk in question.

In the upper part of the following table each cell shows 2 times. The first, called direct, is the time to DMA the data from the disk buffer into the kernel buffers. The second, called indirect, is the direct time plus the additional time to copy the data from the kernel buffers into user space.
 
                              | A, 2xCeler 450, UW  | B, PII 400, U2W    | C, 2xCel 533, U2W           | D, 2xPPro 200, fibre
buffer size                   | direct / indirect   | direct / indirect  | direct/indirect (CPU util)  | direct / indirect
------------------------------+---------------------+--------------------+-----------------------------+---------------------
671 KB                        |                     | 5.55 / 8.02        |                             |
512 KB                        |                     | 5.59 / 7.69        | 2.94 / 5.03 (41%)           | 2.95 / 9.49
488 KB                        | 5.82 / 7.32         | 5.61 / 7.70        | 2.95 / 5.04 (41%)           | 3.01 / 9.56
256 KB                        | 6.32 / 7.83         | 5.96 / 8.01        | 3.14 / 5.48 (42%)           | 3.34 / 9.88
32 KB                         | 13.31 / 14.83       | 10.30 / 11.99      | 5.91 / 6.74 (11%)           | 9.82 / 15.80
16 KB                         | 21.91 / 23.15       | 15.61 / 17.19      | 9.09 / 9.74 (9%)            | 17.52 / 23.34
8 KB                          | 37.57 / 38.54       | 26.34 / 27.86      | 15.47 / 16.07 (6%)          | 32.58 / 38.32
4 KB                          | 75.12 / 75.14       | 42.54 / 43.02      | 28.15 / 28.75 (5%)          | 63.26 / 68.76
------------------------------+---------------------+--------------------+-----------------------------+---------------------
latency per command (implied) | 1.35 ms             | 0.72 ms            | 0.50 ms                     | 1.18 ms
Best throughput after latency | 38 MB/sec           | 38 MB/sec          | 73 MB/sec                   | 80 MB/sec
Best raw throughput           | 34 MB/sec           | 36 MB/sec          | 68 MB/sec                   | 67 MB/sec
Worst raw throughput          | 2.7 MB/sec          | 4.6 MB/sec         | 7.0 MB/sec                  | 2.9 MB/sec

Upper section shows seconds to transfer 200 MB

The SCSI Read Buffer command should not involve any physical disk IO but this is not guaranteed. It is assumed that the buffer on the disk can source data fast enough not to be significant in these timings. Ultra Wide SCSI has a maximum transfer rate of 40 MB/sec while Ultra 2 Wide's maximum transfer rate is 80 MB/sec. Fibre channel has a maximum transfer rate of around 100 MB/sec.

The implied latency per command = (4KB_direct_time - best_direct_time) / commands_for_4KB_test. The best throughput after latency = transfer_size / (best_direct_time - (implied_latency_per_command * commands_for_best_test)).

System A has dual Celerons (300 MHz) overclocked to 450 MHz with 128 MB of RAM, and an Advansys 940UW controller with an IBM DCHS04U disk. The kernel used was 2.2.7. System owned by John Meijer (meijer@pathcom.com); tests done by the author.

System B has a Pentium II running at 400 MHz with an Adaptec 7890 controller and an IBM DRSV09V disk. The kernel used was 2.2.5-ac3. Tests performed by Jens Axboe (axboe@suse.de).

System C has dual 533 MHz Celerons with 128 MB of RAM on a BP6 motherboard and an Advansys 3940U2W controller with a Seagate ST318451LW disk. The kernel used was 2.4.0-test12. The CPU utilizations shown in parentheses are of one of the available CPUs for the indirect transfer; CPU utilizations on the direct transfers were negligible. System owned and tests done by the author (dgilbert@interlog.com).

System D has dual Pentium Pros at 200 MHz with 256 MB of RAM, and a Qlogic QLA2100 FC HBA with a Seagate ST19171GC fibre channel disk. The kernel was 2.2.5 with the sg driver upgraded (to what later appeared in 2.2.6). The tests were run by Matthew Jacob (mjacob@feral.com), who wrote the SCSI low level driver.

Conclusion

For data transfer commands with small buffer sizes (<= 8 KB), there is little benefit from using direct IO. For large buffer sizes there can be substantial throughput gains (35% at 512 KB buffer size) and much better CPU utilization. Like all benchmarks, these results are somewhat artificial, and real life improvements will usually be less than the best case measured here. Direct IO also introduces some other complexities that should be considered.
 
 


Doug Gilbert (dgilbert@interlog.com)
Last updated: 8th April 2001