Linux Version 3 SCSI Generic Driver

Introduction
Sg version numbers
Downloads
Direct IO
Features of the version 3 driver
Disadvantages
Backward Compatibility
Acknowledgements
New sg_io_hdr structure
DIO notes
Mmap notes

Introduction

This page describes the version 3 Linux SCSI generic device driver (sg). The primary reason for this extension is to add an interface that is easy to use, makes new information available to user programs and supports direct IO. The ability to memory map sg's reserve buffer into user space was added in sg version 3.1.22, which was first available in lk 2.4.17.

Sg version numbers

The sg version numbering system arbitrarily assigns the "version 1" status to the original sg driver found for several years in Linux prior to kernel version 2.2.6. "Version 2" sg drivers include a version number of the form 2.x.y. That version improves the implementation while allowing a few more features in the interface by "stretching" the existing sg_header structure. As several new features became available in the 2.3 series of Linux development kernels, it was decided to add a new interface that was more general and extensible. The idea is to support all existing applications via the old sg_header structure interface while providing an additional interface (based on struct sg_io_hdr, which is shown below) for new and upgraded applications that wish to access the extra features. So the sg driver available in the lk 2.4 series of production kernels (and the lk 2.5 development series) is referred to as "version 3".

Downloads

The latest version 3 sg drivers plus utilities and documentation can be found in the tables on the main page. There are drivers for the 2.4 series of kernels, which is the primary target environment, and an optional driver for the 2.2 series.

Direct IO

In sg version 3.1.19, released in lk 2.4.7, direct IO ("dio") is more stable than in earlier lk 2.4 versions. Hence the "dio" code is compiled in, but a new procfs sg variable disables it by default. To turn it on use 'echo 1 > /proc/scsi/sg/allow_dio'. See DIO notes at the bottom of this page.
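A program that wants direct IO might first check whether the feature is switched on. The sketch below simply reads the procfs variable; `sg_dio_allowed()` is a hypothetical helper name, and the path is a parameter (normally "/proc/scsi/sg/allow_dio") so it can also be pointed at a copy of the file. Even with dio allowed the driver may still perform indirect IO, so the 'info' member remains the authoritative answer after each command.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if the sg procfs variable (normally
 * /proc/scsi/sg/allow_dio) reads as non-zero, 0 if it reads as zero,
 * and -1 if the file cannot be opened or parsed. */
int sg_dio_allowed(const char *path)
{
    FILE *fp = fopen(path, "r");
    int val;

    if (NULL == fp)
        return -1;      /* e.g. an older sg driver without this variable */
    if (1 != fscanf(fp, "%d", &val)) {
        fclose(fp);
        return -1;      /* unexpected contents */
    }
    fclose(fp);
    return (0 != val) ? 1 : 0;
}
```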

The advantage of direct IO is that it allows the SCSI adapter's DMA chip to transfer data directly into (or from) user memory. This avoids the copy through kernel buffers that normally occurs. For relatively slow SCSI devices such as CD writers and scanners (except high end professional models), direct IO offers very little performance advantage.

Sg is still capable of transferring data at 40 MB/sec (at less than 25% CPU utilization) using typical PC hardware (e.g. 500 MHz Celeron) with indirect IO. That is about as fast as a single high performance SCSI disk can source continuous data today (e.g. Seagate ST318451LW 15,000 rpm 3.9ms access).

If direct IO is requested and if it is not available then the driver transparently performs indirect IO. The 'info' output member in sg_io_hdr flags what type of transfer actually occurred. Reasons why direct IO may not be available include:

Features of the version 3 driver

The version 3 sg driver has the following extra features compared with the version 2 driver. Features marked with "**" are only available on 2.3/2.4 series Linux kernels. The former (sg_header) interface limited the sense buffer to 16 bytes of data. The sense buffer is actually the response to a REQUEST SENSE that is automatically sent following most failed commands; this mechanism is sometimes referred to as "autosense".
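Because the sense buffer can now exceed 16 bytes, a caller typically checks 'sb_len_wr' and then picks out the sense key and additional sense code. A minimal sketch, assuming the fixed (response codes 0x70/0x71) and descriptor (0x72/0x73) sense data formats defined in the SCSI Primary Commands standard; `decode_sense()` is a hypothetical helper, not part of sg.

```c
#include <stddef.h>

/* Hypothetical helper: extract sense key, ASC and ASCQ from the sense
 * data sg wrote to sbp ('len' would be sb_len_wr). Handles the fixed
 * (0x70/0x71) and descriptor (0x72/0x73) formats.
 * Returns 0 on success, -1 if the data is too short or unrecognised. */
int decode_sense(const unsigned char *sbp, size_t len,
                 int *key, int *asc, int *ascq)
{
    if (len < 4)
        return -1;
    switch (sbp[0] & 0x7f) {
    case 0x70:
    case 0x71:                      /* fixed format */
        if (len < 14)
            return -1;              /* ASC/ASCQ not present */
        *key = sbp[2] & 0x0f;
        *asc = sbp[12];
        *ascq = sbp[13];
        return 0;
    case 0x72:
    case 0x73:                      /* descriptor format */
        *key = sbp[1] & 0x0f;
        *asc = sbp[2];
        *ascq = sbp[3];
        return 0;
    default:
        return -1;
    }
}
```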

The user space scatter gather is independent of the scatter gather performed by the DMA chip associated with most modern SCSI adapters. It has no alignment requirements and was added as a convenience for applications that need to marshal a lot of data.
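To illustrate user space scatter gather, the sketch below describes a data-in transfer spread over two unrelated user buffers. It uses the sg_iovec_t and sg_io_hdr_t definitions from <scsi/sg.h> (listed later on this page); `setup_sg_read()` and the two-buffer layout are hypothetical, the SCSI command itself is left to the caller, and dxfer_len is set here to the total of the iov_len values.

```c
#include <string.h>
#include <scsi/sg.h>

/* Hypothetical helper: prepare a data-in request scattered over two
 * user buffers. The iovec array must remain valid until the command
 * completes. cmdp/cmd_len/sbp are still to be filled in by the caller. */
void setup_sg_read(sg_io_hdr_t *hp, sg_iovec_t iov[2],
                   void *buf_a, size_t len_a, void *buf_b, size_t len_b)
{
    iov[0].iov_base = buf_a;        /* no alignment requirement */
    iov[0].iov_len = len_a;
    iov[1].iov_base = buf_b;
    iov[1].iov_len = len_b;

    memset(hp, 0, sizeof(*hp));
    hp->interface_id = 'S';
    hp->dxfer_direction = SG_DXFER_FROM_DEV;
    hp->iovec_count = 2;            /* non-zero: dxferp is an iovec list */
    hp->dxferp = iov;
    hp->dxfer_len = (unsigned int)(len_a + len_b); /* total byte count */
}
```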

Pointers to the SCSI command, the data buffer (which may be a scatter gather list) and the sense buffer are passed through the sg_io_hdr interface in the version 3 driver. This frees the user of the new interface from having to marshal and unmarshal this distinct data into a single buffer as required in the sg_header interface. Several applications that support multiple OSes in their transport layer are forced to allocate additional buffers and do otherwise redundant copies to cope with the alignment restrictions of the sg_header structure interface to sg. The SANE application for scanners is an example of this wasteful procedure. That "buffer shuffling" will no longer be required with the new interface.
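As a sketch of what passing pointers through looks like in practice, the command, the data buffer and the sense buffer below stay in three separate caller-owned arrays and sg_io_hdr_t simply points at them. The INQUIRY command (opcode 0x12, 96 byte allocation length) is a standard SCSI command chosen for illustration; `setup_inquiry()` is a hypothetical name and the 20 second timeout is an arbitrary choice.

```c
#include <string.h>
#include <scsi/sg.h>

#define INQ_REPLY_LEN 96
#define SENSE_LEN 32

/* Hypothetical helper: CDB, reply buffer and sense buffer are separate
 * arrays owned by the caller; nothing is marshalled into one buffer. */
void setup_inquiry(sg_io_hdr_t *hp, unsigned char cdb[6],
                   unsigned char *reply, unsigned char *sense)
{
    static const unsigned char inq_cdb[6] =
        {0x12, 0, 0, 0, INQ_REPLY_LEN, 0};  /* INQUIRY, 96 byte reply */

    memcpy(cdb, inq_cdb, sizeof(inq_cdb));
    memset(hp, 0, sizeof(*hp));
    hp->interface_id = 'S';
    hp->cmd_len = sizeof(inq_cdb);
    hp->cmdp = cdb;                         /* [*i] SCSI command */
    hp->dxfer_direction = SG_DXFER_FROM_DEV;
    hp->dxfer_len = INQ_REPLY_LEN;
    hp->dxferp = reply;                     /* [*io] data */
    hp->mx_sb_len = SENSE_LEN;
    hp->sbp = sense;                        /* [*o] sense data */
    hp->timeout = 20000;                    /* milliseconds */
}
```

The filled-in header could then be handed to ioctl(sg_fd, SG_IO, &hdr) or to the write(2)/read(2) pair.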

The residual DMA count is not supported by all SCSI adapters. Those that don't support it yield 0. At the time of writing, the advansys, sym53c8xx, aha152x and aic7xxx adapter drivers support resid.
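Since resid is defined as dxfer_len minus the number of bytes actually transferred, the actual count can be recovered as sketched below. On adapters without resid support the result is just dxfer_len, so it should be treated as an upper bound there. `sg_bytes_transferred()` is a hypothetical helper.

```c
#include <scsi/sg.h>

/* Hypothetical helper: bytes actually transferred by a completed
 * request. A resid of 0 (or a nonsensical negative value) is treated
 * as "everything transferred". */
unsigned int sg_bytes_transferred(const sg_io_hdr_t *hp)
{
    if (hp->resid <= 0)
        return hp->dxfer_len;
    if ((unsigned int)hp->resid > hp->dxfer_len)
        return 0;                   /* defensive: inconsistent values */
    return hp->dxfer_len - (unsigned int)hp->resid;
}
```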

Disadvantages

While CAM3 is very interesting, it requires a significant re-organization of the existing SCSI sub-system and some extensions to Linux or work-arounds (e.g. callback functions). The CAM3 "pass through" interface offers much finer-grained control over a SCSI device than sg does. For example, it can control a target mode device and talk at the messaging level to devices (the messaging level sits logically below the command level in SCSI). While this level of control is important in some areas (e.g. IP over SCSI), the reduced capabilities offered by sg seem sufficient for most purposes.
 

Backward Compatibility

The intention is that the new sg_io_hdr structure will be backward compatible with the existing sg_header structure. Each packet given to write(2) is examined to see whether it is using the sg_header or the sg_io_hdr interface. The decision is made by inspecting the second integer position. In sg_io_hdr that position holds 'int dxfer_direction', which is always negative. In sg_header the same position holds 'int reply_len', which must be a positive number (>= sizeof(struct sg_header) ). This is why the direction constants associated with the sg_io_hdr interface are chosen to have negative values (e.g. "#define SG_DXFER_TO_DEV -2").
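The discrimination rule fits in a few lines. This is only an illustration of the test described above, not the driver's own code; it assumes, as the structure listings on this page show, that both headers begin with two ints. `uses_sg_io_hdr()` is a hypothetical name.

```c
#include <string.h>

/* Hypothetical illustration: the second int of a packet passed to
 * write(2) is dxfer_direction (always negative) in sg_io_hdr, but
 * reply_len (a positive size) in sg_header. */
int uses_sg_io_hdr(const void *packet)
{
    int words[2];

    memcpy(words, packet, sizeof(words)); /* sidestep alignment/aliasing */
    return words[1] < 0;
}
```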

This definition of sg_io_hdr opens up the possibility of it also being used as an alternate structure passed to the SCSI_IOCTL_SEND_COMMAND ioctl() command. The current rather inflexible structure (called Scsi_Ioctl_Command) also requires a non-negative integer in its second integer position. This would make the new interface structure available to all SCSI devices (not just sg).
 

Acknowledgements

Robin Miller <rmiller@bit-net.com>, George Stabler <gms@worksta.com> and Grant Guenther <grant@torque.net> have all made useful suggestions about this design. Many of their suggestions have been incorporated. I am very grateful for their input.

New sg_io_hdr structure follows:


typedef struct sg_iovec { /* parallels "struct iovec" in readv() system call */
    void * iov_base;            /* start address */
    size_t iov_len;             /* length in bytes */
} sg_iovec_t; /* the scatter-gather list is an array of objects of this type */

typedef struct sg_io_hdr
{
    int interface_id;           /* [i] 'S' for SCSI generic (required) */
    int dxfer_direction;        /* [i] data transfer direction  */
    unsigned char cmd_len;      /* [i] SCSI command length ( <= 16 bytes) */
    unsigned char mx_sb_len;    /* [i] max length to write to sbp */
    unsigned short iovec_count; /* [i] 0 implies no scatter gather */
    unsigned int dxfer_len;     /* [i] byte count of data transfer */
    void * dxferp;              /* [i] [*io] points to data transfer memory or
                                             scatter gather list */
    unsigned char * cmdp;       /* [i] [*i] points to SCSI command to perform */
    unsigned char * sbp;        /* [i] [*o] points to sense_buffer memory */
    unsigned int timeout;       /* [i] MAX_UINT->no timeout (unit: millisec) */
    unsigned int flags;         /* [i] 0 -> default, see SG_FLAG... */
    int pack_id;                /* [i->o] unused internally (normally) */
    void * usr_ptr;             /* [i->o] unused internally */
    unsigned char status;       /* [o] scsi status */
    unsigned char masked_status;/* [o] shifted, masked scsi status */
    unsigned char msg_status;   /* [o] messaging level data (optional) */
    unsigned char sb_len_wr;    /* [o] byte count actually written to sbp */
    unsigned short host_status; /* [o] errors from host adapter */
    unsigned short driver_status;/* [o] errors from software driver */
    int resid;                  /* [o] dxfer_len - actual_transferred */
    unsigned int duration;      /* [o] time taken (unit: millisec) */
    unsigned int info;          /* [o] auxiliary information */
} sg_io_hdr_t;  /* around 64 bytes long (on i386) */

/* Use negative values to flag difference from original sg_header structure */
#define SG_DXFER_NONE -1        /* e.g. a SCSI Test Unit Ready command */
#define SG_DXFER_TO_DEV -2      /* e.g. a SCSI WRITE command */
#define SG_DXFER_FROM_DEV -3    /* e.g. a SCSI READ command */
#define SG_DXFER_TO_FROM_DEV -4 /* treated like SG_DXFER_FROM_DEV with the
                                   additional property that during indirect
                                   IO the user buffer is copied into the
                                   kernel buffers before the transfer */
#define SG_DXFER_UNKNOWN -5     /* Unknown data direction */

/* following flag values can be "or"-ed together */
#define SG_FLAG_DIRECT_IO 1     /* default is indirect IO */
#define SG_FLAG_LUN_INHIBIT 2   /* default is to put device's lun into */
                                /* the 2nd byte of SCSI command */
#define SG_FLAG_MMAP_IO 4       /* selects memory mapped IO. Introduced in
                                   version 3.1.22. May not be present in
                                   GNU library headers for some time */
#define SG_FLAG_NO_DXFER 0x10000 /* no transfer of kernel buffers to/from */
                                /* user space (debug indirect IO) */
 

/* following 'info' values are "or"-ed together */
#define SG_INFO_OK_MASK 0x1
#define SG_INFO_OK 0x0          /* no sense, host nor driver "noise" */
#define SG_INFO_CHECK 0x1       /* something abnormal happened */

#define SG_INFO_DIRECT_IO_MASK 0x6
#define SG_INFO_INDIRECT_IO 0x0 /* data xfer via kernel buffers (or no xfer) */
#define SG_INFO_DIRECT_IO 0x2
#define SG_INFO_MIXED_IO 0x4    /* part direct, part indirect IO */
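After a request completes, the 'info' bit fields above can be decoded as sketched here, for example to learn whether a request made with SG_FLAG_DIRECT_IO actually fell back to indirect IO. The constants are the ones listed above, taken from <scsi/sg.h>; the function names are hypothetical.

```c
#include <scsi/sg.h>

/* Hypothetical helper: describe how the data of a completed request
 * was moved, using the SG_INFO_* masks listed above. */
const char *sg_xfer_kind(unsigned int info)
{
    switch (info & SG_INFO_DIRECT_IO_MASK) {
    case SG_INFO_DIRECT_IO:
        return "direct";
    case SG_INFO_MIXED_IO:
        return "mixed";
    case SG_INFO_INDIRECT_IO:
        return "indirect";          /* also reported when nothing moved */
    default:
        return "unknown";
    }
}

/* Non-zero if sense, host or driver status flagged something abnormal. */
int sg_info_check(unsigned int info)
{
    return (info & SG_INFO_OK_MASK) == SG_INFO_CHECK;
}
```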
 



Comments and suggestions welcome.
 

DIO notes

As noted above, the "dio" code is active in sg 3.1.19 [lk 2.4.7] but it is disabled by a variable controlled by procfs: /proc/scsi/sg/allow_dio . The SYS_ADMIN (typically root permissions) or SYS_RAWIO capability is required to set that procfs variable.

Direct IO involves locking down user allocated memory. This is to stop user memory "moving" while the DMA element in the SCSI adapter is accessing it. Linux locks RAM in PAGE_SIZE units (4096 bytes on the i386 architecture). The user of "dio" needs to take care that a single unit of memory is not locked more than once during one sg transaction. Without precautions this could happen when queuing commands, accessing sg via multiple threads or via some external mechanism (e.g. mlock() ?). Another worrying scenario is several processes sharing memory that contains a buffer used by sg (this happens in some SANE drivers). The following code snippet ensures that no two allocations are given the same page:
    psz = getpagesize();
    if (NULL == (alloc_bp = malloc(sz + psz)))  /* 'sz' bytes required */
        exit(1);  /* out of memory */
    buffp = (unsigned char *)
            (((unsigned long)alloc_bp + psz - 1) & (~(psz - 1)));

The above code additionally aligns 'buffp' to the beginning of a page which is not strictly necessary but may improve performance.
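The same technique can be wrapped as a self-contained function. One subtlety: unless the requested size is itself a multiple of the page size, the last page of the aligned region may still be shared with a neighbouring heap allocation, so this sketch rounds the size up to whole pages first. `alloc_dio_buffer()` is a hypothetical name; it uses sysconf(_SC_PAGESIZE), which gives the same value as getpagesize(), and the caller must later free() the base pointer rather than the aligned one.

```c
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: return a page aligned buffer of at least 'sz'
 * bytes whose pages belong to this allocation only. The pointer to pass
 * to free() is stored in *base. Returns NULL on failure. */
unsigned char *alloc_dio_buffer(size_t sz, void **base)
{
    size_t psz = (size_t)sysconf(_SC_PAGESIZE);
    size_t rounded = ((sz + psz - 1) / psz) * psz; /* whole pages only */
    void *alloc_bp = malloc(rounded + psz);        /* slack for alignment */

    *base = alloc_bp;
    if (NULL == alloc_bp)
        return NULL;
    return (unsigned char *)
           (((uintptr_t)alloc_bp + psz - 1) & ~((uintptr_t)psz - 1));
}
```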

Memory that looks contiguous in user space (e.g. from a single malloc() ) is typically non-contiguous when the kernel looks at it. That normally limits the largest single transfer using dio to (PAGE_SIZE * adapter_scatter_gather_list_length). The latter variable is determined by the adapter driver, but sg limits it to 255. So on the i386 architecture a single transfer is limited to just under 1 MB (4096 * 255 = 1,044,480 bytes).
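The limit can be computed as sketched below. `adapter_sg_len` stands for whatever scatter gather table length the adapter driver supports, and `SG_MAX_SGLIST` is a hypothetical name for sg's cap of 255 stated above. The sketch assumes a page aligned buffer; an unaligned one needs an extra element for its partial first page.

```c
#include <stddef.h>

#define SG_MAX_SGLIST 255   /* hypothetical name for sg's cap stated above */

/* Largest single dio transfer for a page aligned buffer: one scatter
 * gather element per (possibly non-contiguous) page. */
size_t sg_max_dio_bytes(size_t page_size, unsigned int adapter_sg_len)
{
    unsigned int elems = adapter_sg_len;

    if (elems > SG_MAX_SGLIST)
        elems = SG_MAX_SGLIST;
    return page_size * elems;
}
```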

The following utilities in the sg3_utils package (see the main page) have options to use dio: sg_dd, sgp_dd, sgq_dd and sg_rbuf .
 

Mmap Notes

In the case of the sg driver, memory mapped IO maps the reserve buffer into the user space. There is only one reserve buffer per sg file descriptor (but there can be many sg file descriptors per SCSI device). The default size of the reserve buffer is 32 KB and this size can be changed with the SG_SET_RESERVED_SIZE ioctl().
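Because the driver may not be able to allocate the full amount requested, it is prudent to read the size back after changing it. A sketch using SG_SET_RESERVED_SIZE and its companion SG_GET_RESERVED_SIZE from <scsi/sg.h>; `sg_resize_reserve()` is a hypothetical wrapper.

```c
#include <sys/ioctl.h>
#include <scsi/sg.h>

/* Hypothetical wrapper: request a reserve buffer of 'want' bytes on an
 * open sg file descriptor and return the size actually in place, or -1
 * if either ioctl() fails (errno is left as set by ioctl). */
int sg_resize_reserve(int sg_fd, int want)
{
    int got;

    if (ioctl(sg_fd, SG_SET_RESERVED_SIZE, &want) < 0)
        return -1;
    if (ioctl(sg_fd, SG_GET_RESERVED_SIZE, &got) < 0)
        return -1;
    return got;     /* may be smaller than requested */
}
```

A caller would then mmap() no more than the returned number of bytes.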

Mmap-ed IO, like direct IO, removes the extra copy usually performed from sg's kernel buffers into the user space (or vice versa). Unfortunately the strategy used by direct IO causes significant per command overhead (think 1 millisecond) so that it is only a performance win for SCSI commands with big data payloads (e.g. a READ of 256 KB in one command). Mmap-ed IO has next to no per command overhead imposed by the sg driver.

Using mmap-ed IO requires an application to change its buffer management. An application will no longer call malloc() [or one of its friends] but rather call mmap() which returns a valid pointer if successful. Here is some pseudo code to illustrate the point:

sg_fd = open(sg_dev_filename, O_RDWR);
k = res_sz = 128 * 1024;  /* max data transfer size required */
psz = getpagesize();
if (0 != (k % psz))
    k = ((k / psz) + 1) * psz; /* round up to page size multiple */
if (ioctl(sg_fd, SG_SET_RESERVED_SIZE, &k) < 0)
    ...  /* handle error */
mmBuff = mmap(NULL, res_sz, PROT_READ | PROT_WRITE,
                       MAP_SHARED, sg_fd, 0);
if (MAP_FAILED == mmBuff)
    ...  /* handle error: mmap() returns MAP_FAILED, not NULL */
for ( .... ) {
    sg_io_hdr_t io_hdr;
    memset(&io_hdr, 0, sizeof(sg_io_hdr_t));
    ...
    /* no need to set io_hdr.dxferp */
    io_hdr.flags = SG_FLAG_MMAP_IO;
    if (ioctl(sg_fd, SG_IO, &io_hdr) < 0)
        ...  /* see the EBUSY/ENOMEM notes below */
    ...
    /* assuming a READ like SCSI command */
    /* application now reads data at mmBuff */
}
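For readers who want something closer to compilable C, here is the same sequence filled out as one function that issues a single INQUIRY through the mapped reserve buffer on an sg file descriptor opened O_RDWR. The INQUIRY command, the one page reserve size, the 20 second timeout and the name `mmap_inquiry()` are illustrative assumptions; the SG_FLAG_MMAP_IO fallback definition mirrors the value in the header listing, for C library headers that predate it. A real program would keep the mapping for many commands.

```c
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <scsi/sg.h>

#ifndef SG_FLAG_MMAP_IO
#define SG_FLAG_MMAP_IO 4          /* value from the sg header listing */
#endif

/* Hypothetical example: INQUIRY via mmap-ed IO, copying up to 'len'
 * (<= 96) response bytes into 'out'. Returns 0 on success, 1 if the
 * command completed with a SCSI/host/driver problem, -1 on a bad
 * argument or a failed system call. */
int mmap_inquiry(int sg_fd, unsigned char *out, size_t len)
{
    unsigned char cdb[6] = {0x12, 0, 0, 0, 96, 0};
    unsigned char sense[32];
    long psz = sysconf(_SC_PAGESIZE);
    int k = (int)psz;              /* one page easily holds 96 bytes */
    sg_io_hdr_t io_hdr;
    unsigned char *mm;
    int res;

    if (len > 96 || ioctl(sg_fd, SG_SET_RESERVED_SIZE, &k) < 0)
        return -1;
    mm = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE, MAP_SHARED,
              sg_fd, 0);
    if (MAP_FAILED == mm)
        return -1;
    memset(&io_hdr, 0, sizeof(io_hdr));
    io_hdr.interface_id = 'S';
    io_hdr.cmd_len = sizeof(cdb);
    io_hdr.cmdp = cdb;
    io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
    io_hdr.dxfer_len = 96;         /* dxferp unused: data lands in mm */
    io_hdr.mx_sb_len = sizeof(sense);
    io_hdr.sbp = sense;
    io_hdr.timeout = 20000;        /* milliseconds */
    io_hdr.flags = SG_FLAG_MMAP_IO;
    if (ioctl(sg_fd, SG_IO, &io_hdr) < 0)
        res = -1;
    else if (SG_INFO_OK != (io_hdr.info & SG_INFO_OK_MASK))
        res = 1;                   /* sense, host or driver status set */
    else {
        memcpy(out, mm, len);      /* response sits in the mapped buffer */
        res = 0;
    }
    munmap(mm, (size_t)psz);
    return res;
}
```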

Here are a few points to note:

When SG_FLAG_MMAP_IO is set and the reserve buffer is occupied [EBUSY], too small [ENOMEM] or mmap-ed IO cannot be done for some other reason, then no operation is performed on the SCSI device. Mmap-ed IO differs in this respect from dio, which falls back to normal (indirect) IO if it cannot be done.
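Since the driver does not retry a rejected mmap-ed request, an application that wants dio-like "best effort" behaviour has to fall back itself. A minimal sketch, assuming the caller has already filled in the header apart from 'flags' and 'dxferp' and owns an ordinary buffer of at least dxfer_len bytes for the indirect path; `sg_io_mmap_or_indirect()` is a hypothetical name.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

#ifndef SG_FLAG_MMAP_IO
#define SG_FLAG_MMAP_IO 4
#endif

/* Hypothetical helper: try mmap-ed IO first; if the reserve buffer is
 * busy (EBUSY) or too small (ENOMEM), reissue the request as ordinary
 * indirect IO into 'fallback_buf'. Returns the final ioctl() result.
 * *used_mmap is set to 1 only if the mmap-ed attempt was accepted. */
int sg_io_mmap_or_indirect(int sg_fd, sg_io_hdr_t *hp, void *fallback_buf,
                           int *used_mmap)
{
    int res;

    hp->flags |= SG_FLAG_MMAP_IO;
    hp->dxferp = NULL;              /* data goes to the mapped buffer */
    res = ioctl(sg_fd, SG_IO, hp);
    *used_mmap = (res >= 0);
    if (res < 0 && (EBUSY == errno || ENOMEM == errno)) {
        hp->flags &= ~(unsigned int)SG_FLAG_MMAP_IO;
        hp->dxferp = fallback_buf;  /* normal (indirect) IO */
        res = ioctl(sg_fd, SG_IO, hp);
    }
    return res;
}
```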

Mmap() can be called multiple times on a single sg file descriptor. The process can also be forked or the sg file descriptor duplicated with dup(). I can't think why these variations would be useful but they have been tested. Zero copy between two SCSI devices (but still DMA in and DMA out) is possible if one device uses mmap-ed IO and the other uses direct IO .... [my guess prior to timing it is that copying between two mmap-ed buffers in user space (one for the read device, the other for the write device) will be quicker than the mmap/direct IO combination. Simpler still: do mmap-ed IO on the read side and normal IO on the write side. The latter approach is what sgm_dd does.]

The best example is real code. See the latest sg3_utils package (downloads on the main page), which includes a new sg_dd variant called sgm_dd that uses mmap-ed IO. sg_simple4 is a bare bones example of mmap-ed IO on an INQUIRY response. The sg_rbuf SCSI bus speed tester has a new "-m" argument. There is also a new program called "sg_read" that reads multiple blocks from the same logical address. It has a command line syntax like "sg_dd" but with no "of" argument. Both "sg_rbuf" and "sg_read" offer internal transfer timing.


Author: Douglas Gilbert (dgilbert@interlog.com)
Last Updated: 13th April 2002 13:00