The Linux SCSI Generic Interface
================================
20010329 D. Gilbert

Introduction
------------
This document outlines the Linux SCSI Generic (sg) interface as found in
the 2.4 series kernels. It presents the third major version of the sg
driver. A summary of the sg driver history looks like this:
    - sg version 1 (original) from 1992 to lk 2.2.5
    - sg version 2 from lk 2.2.6 in the 2.2 series
    - sg version 3 from lk 2.3.43 and then in the 2.4 series

Documentation for these versions is available at:
    - sg v1   www.torque.net/sg/p/original/SCSI-Programming-HOWTO.txt
    - sg v2   www.torque.net/sg/p/scsi-generic.txt  * [abridged]
              www.torque.net/sg/p/scsi-generic_long.txt
    - sg v3   www.torque.net/sg/p/scsi-generic_v3.txt
              [latest version of this]

The document marked with "*" can be found in the kernel source of the
2.2 series usually in /usr/src/linux/Documentation/scsi-generic.txt .
This "v3" document may be found at the same location in 2.4 kernels once
the series matures. The version 2 documentation supersedes version 1
while the version 3 documentation _supplements_ version 2.

Identifying the version of the SG driver
----------------------------------------
Existing versions of the sg device driver either have no version number
(e.g. the original driver) or a version number starting with "2". The
drivers that support this new interface have a major version number of
"3". The sg version numbers are of the form "x.y.z" and the single number
given by the SG_GET_VERSION_NUM ioctl() is calculated by
(x * 10000 + y * 100 + z). The sg driver discussed here will yield a
number greater than or equal to 30000 from SG_GET_VERSION_NUM. The
version number can also be seen using 'cat /proc/scsi/sg/version' in the
new driver.

This document describes sg version 3.1.18 for the lk 2.4 series. There is
an sg version 3.0.17 which is an optional driver for the lk 2.2 series.
It has the following limitations:
    - sense buffer limited to 16 bytes
    - resid (residual data transfer count) is always 0
    - direct IO not supported (defaults to indirect IO)

Interface
---------
This driver supports the system calls that would be expected of any
character device driver in Linux. They are: open(), close(), write(),
read() and ioctl() (plus a few others associated with polling and
asynchronous notification).

The major addition in sg v3 is an ioctl() called SG_IO which is
functionally equivalent to a write() followed by a blocking read(). In
certain contexts the write()/read() combination has advantages over SG_IO
(e.g. command queuing) and continues to be supported.

The existing (and original) sg interface based on the sg_header structure
is still available using a write()/read() sequence as before. The SG_IO
ioctl will only accept the new interface based on the sg_io_hdr_t
structure.

The sg v3 driver thus has a write() call that can accept either the older
sg_header structure or the new sg_io_hdr_t structure. It makes its
decision based on the second integer position of the passed header
(i.e. sg_header::reply_len or sg_io_hdr_t::dxfer_direction). If it is a
positive number then the old interface is assumed. If it is a negative
number then the new interface is assumed. The direction constants placed
in 'dxfer_direction' in the new interface have been chosen to have
negative values.

If a request is sent to a write() with the sg_io_hdr_t interface then the
corresponding read() that fetches the response must also use the
sg_io_hdr_t interface. The same rule applies to the sg_header interface.

Some seldom used ioctl()s introduced in the sg 2.x series drivers have
been withdrawn.
They are:
    - SG_SET_UNDERRUN_FLAG (and _GET_) [use 'resid' in this new interface]
    - SG_SET_MERGE_FD (and _GET) [added complexity with little benefit]

Theory of operation
-------------------
The path of a request through the sg driver can be broken into 3 distinct
stages:
  1) The request is received from the user, resources are reserved as
     required (e.g. kernel buffer for indirect IO). If necessary, data in
     the user space is transferred into kernel buffers. Then the request
     is submitted to the SCSI mid level (and then onto the adapter) for
     execution. The SCSI mid level maintains a queue so the request may
     have to wait. If the device supports tagged queuing then it may be
     able to accommodate multiple outstanding requests.
  2) Assuming the SCSI adapter supports interrupts, then an interrupt is
     received when the request is completed. When this interrupt arrives
     the data transfer is complete. This means that if the SCSI command
     was a READ then the data is in kernel buffers (indirect IO) or in
     user buffers (direct IO). The sg driver is informed of this
     interrupt via a kernel mechanism called a "bottom half" handler.
     Some kernel resources are freed up.
  3) The user makes a call to fetch the result of the request. If
     necessary, data in kernel buffers is transferred to the user space.
     If necessary, the sense buffer is written out to the user space. The
     remaining kernel resources associated with this request are freed
     up.

The write() call performs stage 1 while the read() call performs stage 3.
If the read() call is made before stage 2 is complete then it will either
wait or yield EAGAIN (depending on whether the file descriptor is
blocking or not). If asynchronous notification is being used then stage 2
will send a SIGPOLL signal to the user process. The poll() system call
will show this file descriptor is now readable (unless it was sent by the
SG_IO ioctl()).

The SG_IO ioctl() performs stage 1, waits for stage 2 and then performs
stage 3.
If the file descriptor in question is set O_NONBLOCK then SG_IO will
ignore this and still block! Also a SG_IO call will not affect the poll()
state nor cause a SIGPOLL signal to be sent. If you really want
non-blocking operation (e.g. for command queuing) then don't use SG_IO;
use the write() read() sequence instead.

The sg_io_hdr_t structure in detail
-----------------------------------

    int interface_id;               /* [i] 'S' for SCSI generic */
This must be set to 'S' (capital ess). If not, the ENOSYS error message
is placed in errno. The idea is to allow interface variants in the future
that identify themselves with a different value. [The parallel port
generic driver (pg) uses the letter 'P' to identify itself.]

    int dxfer_direction;            /* [i] data transfer direction */
This is required to be one of the following:
        SG_DXFER_NONE           /* e.g. a SCSI Test Unit Ready command */
        SG_DXFER_TO_DEV         /* e.g. a SCSI WRITE command */
        SG_DXFER_FROM_DEV       /* e.g. a SCSI READ command */
        SG_DXFER_TO_FROM_DEV
        SG_DXFER_UNKNOWN
The value SG_DXFER_NONE should be used when there is no data transfer
associated with a command (e.g. TEST UNIT READY).
The value SG_DXFER_TO_DEV should be used when data is being moved from
user memory towards the device (e.g. WRITE).
The value SG_DXFER_FROM_DEV should be used when data is being moved from
the device towards user memory (e.g. READ).
The value SG_DXFER_TO_FROM_DEV is only relevant to indirect IO (otherwise
it is treated like SG_DXFER_FROM_DEV). Data is moved from the user space
to the kernel buffers. The command is then performed and most likely a
READ-like command transfers data from the device into the kernel buffers.
Finally the kernel buffers are copied back into the user space. This
technique allows application writers to initialize the buffer and perhaps
deduce the actual number of bytes read from the device (i.e. detect
underrun). This is better done by using 'resid' if it is supported.
The value SG_DXFER_UNKNOWN is for those (rare) situations where the data
direction is not known. It may be useful for backward compatibility of
existing applications when the relevant direction information is not
available in the sg interface layer. There is a (very minor) performance
"hit" associated with choosing this option (e.g. on the PCI bus). Some
recent pseudo device drivers (e.g. USB mass storage) may have problems
handling this value (especially on vendor-specific SCSI commands).
N.B. 'dxfer_direction' must have one of the five indicated values and
cannot be uninitialized or zero.
If 'dxfer_len' is zero then all values are treated like SG_DXFER_NONE.

    unsigned char cmd_len;          /* [i] SCSI command length ( <= 16 bytes) */
This is the length in bytes of the SCSI command that 'cmdp' points to.
Since a SCSI command is expected, an EMSGSIZE error number is produced if
the value is less than 6 or greater than 16. Further, if the subsystem
has a tighter limit (as it does at 12 bytes) then EMSGSIZE is produced in
this case as well.

    unsigned char mx_sb_len;        /* [i] max length to write to sbp */
This is the maximum size that can be written back to the 'sbp' pointer
when a sense_buffer is output, which is usually in an error situation.
The actual number written out is given by 'sb_len_wr'. In all cases
'sb_len_wr' <= 'mx_sb_len' .

    unsigned short iovec_count;     /* [i] 0 implies no scatter gather */
This is the number of scatter gather elements in an array pointed to by
'dxferp'. If the value is zero then scatter gather (in the user space) is
_not_ being used and 'dxferp' points to the data transfer buffer. If the
value is greater than zero then each element of the array is assumed to
be of the form:
        typedef struct sg_iovec
        {
            void * iov_base;        /* starting address */
            size_t iov_len;         /* length in bytes */
        } sg_iovec_t;
Note that this structure has been named and defined in such a way to
parallel "struct iovec" used by the readv() and writev() system calls in
Linux.
See "man 2 readv".

    unsigned int dxfer_len;         /* [i] byte count of data transfer */
This is the number of bytes to be moved in the data transfer associated
with the command. The direction of the transfer is indicated by
'dxfer_direction'. If 'dxfer_len' is zero then no data transfer takes
place.
If iovec_count is non-zero then 'dxfer_len' should be equal to the sum of
iov_len lengths. If not, the minimum of the 2 is the transfer length.

    void * dxferp;                  /* [i], [*io] points to data transfer
                                       memory or scatter gather array */
If 'iovec_count' is zero then this value is a pointer to user memory of
at least 'dxfer_len' bytes in length. If there is a data transfer
associated with the command then the data will be transferred to or from
this user memory. If 'iovec_count' is greater than zero then this value
points to a scatter-gather array in user memory. Each element of this
array should be an object of type sg_iovec_t. Note that data is sometimes
written to user memory (e.g. from a failed SCSI READ) even when an error
has occurred.

    unsigned char * cmdp;           /* [i], [*i] points to command to perform */
This value points to the SCSI command to be executed. The command is
assumed to be 'cmd_len' bytes long. If cmdp is NULL then the system call
yields an EMSGSIZE error number. The user memory pointed to is only read
(not written to).

    unsigned char * sbp;            /* [i], [*o] points to sense_buffer memory */
This value points to user memory of at least 'mx_sb_len' bytes length
where the SCSI sense buffer will be output. Most successful commands do
not output a sense buffer and this will be indicated by 'sb_len_wr' being
zero.

    unsigned int timeout;           /* [i] MAX_UINT->no timeout (unit: millisec) */
This value is used to timeout the given command. The units of this value
are milliseconds. The time being measured is from when a command is sent
until when sg is informed the request has been completed. A following
read() can take as long as the user likes.
When a timeout is exceeded the command is aborted and DID_TIME_OUT is set
in the 'host_status' and DRIVER_TIMEOUT is set in the 'driver_status'.

    unsigned int flags;             /* [i] 0 -> default */
These are single or multi-bit values that can be "or-ed" together:
    SG_FLAG_DIRECT_IO
        This is a request for direct IO on the data transfer. If it
        cannot be performed then the driver automatically performs
        indirect IO instead. If it is important to find out which type
        of IO was performed then check the values from the
        SG_INFO_DIRECT_IO_MASK in 'info' when the request packet is
        completed (i.e. after read() or ioctl(,SG_IO,) ). The default
        action is to do indirect IO.
    SG_FLAG_LUN_INHIBIT
        The default action of the sg driver is to overwrite internally
        the top 3 bits of the second SCSI command byte with the LUN
        associated with the file descriptor's device. To inhibit this
        action set this flag. For SCSI 3 (or later) devices, this
        internal LUN overwrite does not occur.
    SG_FLAG_NO_DXFER
        When set, user space data transfers to or from the kernel
        buffers do not take place. This only has effect during indirect
        IO. This flag is for testing bus speed (e.g. the "sg_rbuf"
        utility uses it).

    int pack_id;                    /* [i->o] unused internally (normally) */
This value is not normally acted upon by the sg driver. It is provided so
the user can identify the request. This is useful when command queuing is
being used. The "abnormal" case is when SG_SET_FORCE_PACK_ID is set and a
'pack_id' other than -1 is given to read(). In this case the read() will
wait to fetch a request that matches this 'pack_id'. If this mode is used
be careful to set 'dxfer_direction' to a valid value (actually any of the
SG_DXFER_* values will do) on input to the read(), together with the
wanted pack_id.

    void * usr_ptr;                 /* [i->o] unused internally */
This value is not acted upon by the sg driver. It is meant to allow the
user to associate some object with this request (e.g. to maintain state
information).
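
As an illustration of the input members described so far, the following
is a minimal sketch (not taken from the sg sources) that fills in an
sg_io_hdr_t for a SCSI INQUIRY. The function name prepare_inquiry(), the
INQ_CMD_LEN define and the 20 second timeout are assumptions made for
this example only; the sg_io_hdr_t type and the SG_DXFER_FROM_DEV
constant are assumed to come from the user space header <scsi/sg.h>.
Nothing is sent to a device here.

```c
#include <string.h>
#include <scsi/sg.h>    /* assumed to provide sg_io_hdr_t and SG_DXFER_* */

#define INQ_CMD_LEN 6   /* hypothetical name: INQUIRY is a 6 byte command */

/* Fill in the input ([i]) members of 'hdr' for a SCSI INQUIRY command.
 * All buffers belong to the caller; this function only sets members. */
void prepare_inquiry(sg_io_hdr_t *hdr, unsigned char cdb[INQ_CMD_LEN],
                     unsigned char *reply, unsigned char reply_len,
                     unsigned char *sense, unsigned char sense_len)
{
    memset(cdb, 0, INQ_CMD_LEN);
    cdb[0] = 0x12;                  /* INQUIRY operation code */
    cdb[4] = reply_len;             /* allocation length */

    memset(hdr, 0, sizeof(*hdr));   /* flags, pack_id, usr_ptr stay 0 */
    hdr->interface_id = 'S';        /* anything else yields ENOSYS */
    hdr->dxfer_direction = SG_DXFER_FROM_DEV;   /* device -> user memory */
    hdr->cmd_len = INQ_CMD_LEN;     /* must be in the range 6 to 16 */
    hdr->mx_sb_len = sense_len;     /* upper bound on 'sb_len_wr' */
    hdr->iovec_count = 0;           /* no user space scatter gather */
    hdr->dxfer_len = reply_len;
    hdr->dxferp = reply;            /* plain buffer as iovec_count == 0 */
    hdr->cmdp = cdb;
    hdr->sbp = sense;
    hdr->timeout = 20000;           /* milliseconds; arbitrary choice */
}
```

The header prepared this way could then be handed to either write() or
the SG_IO ioctl(); the output members are filled in by the driver.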

    unsigned char status;           /* [o] scsi status */
This is the SCSI status byte as defined by the SCSI standard. Note that
it can have vendor information set in bits 0, 6 and 7 (although this is
uncommon). Further note that this 'status' data does _not_ match the
definitions in the SCSI header file (e.g. CHECK_CONDITION). The following
'masked_status' does match those definitions.

    unsigned char masked_status;    /* [o] shifted, masked scsi status */
Logically:
        masked_status == ((status & 0x3e) >> 1)
So 'masked_status' strips the vendor information bits off 'status' and
then shifts it right one position. This makes it easier to do things like
"if (CHECK_CONDITION == masked_status) ..." using the definitions in that
header file. The defined values in this file are:
        GOOD                    0x00
        CHECK_CONDITION         0x01
        CONDITION_GOOD          0x02
        BUSY                    0x04
        INTERMEDIATE_GOOD       0x08
        INTERMEDIATE_C_GOOD     0x0a
        RESERVATION_CONFLICT    0x0c
        COMMAND_TERMINATED      0x11
        QUEUE_FULL              0x14
        /* N.B. 1 bit offset from usual SCSI status values */
Note that SCSI 3 defines some additional status codes.

    unsigned char msg_status;       /* [o] messaging level data (optional) */
The messaging level in SCSI is under the command level and knowledge of
what is happening at the messaging level is very rarely needed.
Furthermore most modern chip-sets used in SCSI adapters completely hide
this value. Nearly all adapters will return zero in 'msg_status' all the
time.

    unsigned char sb_len_wr;        /* [o] byte count actually written to sbp */
This is the actual number of bytes written to the user memory pointed to
by 'sbp'. 'sb_len_wr' is always <= 'mx_sb_len'. Linux 2.2 series kernels
(and earlier) truncate this value to a maximum of 16 bytes. The actual
number of bytes written will not exceed the length indicated by the
"Additional Sense Length" field (byte 7) of the Request Sense response.

    unsigned short host_status;     /* [o] errors from host adapter */
These codes potentially come from the firmware on a host adapter or from
one of several hosts that an adapter driver controls.
The 'host_status' field has the following values whose #defines mimic
those which are only visible within the kernel (with the "SG_ERR_"
removed from the front of each define). A copy of these defines can be
found in sg_err.h (see the utilities section):
        SG_ERR_DID_OK           0x00 /* NO error */
        SG_ERR_DID_NO_CONNECT   0x01 /* Couldn't connect before timeout period */
        SG_ERR_DID_BUS_BUSY     0x02 /* BUS stayed busy through time out period */
        SG_ERR_DID_TIME_OUT     0x03 /* TIMED OUT for other reason */
        SG_ERR_DID_BAD_TARGET   0x04 /* BAD target, device not responding? */
        SG_ERR_DID_ABORT        0x05 /* Told to abort for some other reason */
        SG_ERR_DID_PARITY       0x06 /* Parity error */
        SG_ERR_DID_ERROR        0x07 /* Internal error [DMA underrun on aic7xxx] */
        SG_ERR_DID_RESET        0x08 /* Reset by somebody. */
        SG_ERR_DID_BAD_INTR     0x09 /* Got an interrupt we weren't expecting. */
        SG_ERR_DID_PASSTHROUGH  0x0a /* Force command past mid-layer */
        SG_ERR_DID_SOFT_ERROR   0x0b /* The low level driver wants a retry */

    unsigned short driver_status;   /* [o] errors from software driver */
One driver can potentially control several hosts. For example Advansys
provide one Linux adapter driver that controls all adapters made by that
company - if 2 or more Advansys adapters are in 1 machine, then 1 driver
controls both. When ('driver_status' & SG_ERR_DRIVER_SENSE) is true the
'sense_buffer' is also output.
The 'driver_status' field has the following values whose #defines mimic
those which are only visible within the kernel (with the "SG_ERR_"
removed from the front of each define).
A copy of these defines can be found in sg_err.h (see the utilities
section):
        SG_ERR_DRIVER_OK        0x00 /* Typically no suggestion */
        SG_ERR_DRIVER_BUSY      0x01
        SG_ERR_DRIVER_SOFT      0x02
        SG_ERR_DRIVER_MEDIA     0x03
        SG_ERR_DRIVER_ERROR     0x04
        SG_ERR_DRIVER_INVALID   0x05
        SG_ERR_DRIVER_TIMEOUT   0x06
        SG_ERR_DRIVER_HARD      0x07
        SG_ERR_DRIVER_SENSE     0x08 /* Implies sense_buffer output */
        /* above status 'or'ed with one of the following suggestions */
        SG_ERR_SUGGEST_RETRY    0x10
        SG_ERR_SUGGEST_ABORT    0x20
        SG_ERR_SUGGEST_REMAP    0x30
        SG_ERR_SUGGEST_DIE      0x40
        SG_ERR_SUGGEST_SENSE    0x80

    int resid;                      /* [o] dxfer_len - actual_transferred */
This is the residual count from the data transfer. It is 'dxfer_len' less
the number of bytes actually transferred. In practice it only reports
underruns (i.e. positive number) as data overruns should never happen.
This value will be zero if there was no underrun or the SCSI adapter
doesn't support this feature.

    unsigned int duration;          /* [o] time taken (unit: millisec) */
This value will be the number of milliseconds from when a SCSI command
was sent until sg is informed that it is complete. For i386 machines the
granularity is 10ms while on alpha machines it is 1ms. This value is
rounded toward zero.

    unsigned int info;              /* [o] auxiliary information */
This value is designed to convey useful information back to the user
about the associated request. This information does not necessarily
indicate an error. Several single bit and multi-bit fields are "or-ed"
together to make this value.

A single bit component contained in SG_INFO_OK_MASK indicates whether
some error or status field is non-zero. If any of 'masked_status',
'host_status' or 'driver_status' is non-zero then SG_INFO_CHECK is set.
The associated values are:
        SG_INFO_OK_MASK         0x1
        SG_INFO_OK              0x0 /* no sense, host nor driver "noise" */
        SG_INFO_CHECK           0x1 /* something abnormal happened */

A multi bit component contained in SG_INFO_DIRECT_IO_MASK indicates what
type of data transfer has just taken place.
If indirect IO (or no data transfer) has taken place then
SG_INFO_INDIRECT_IO is matched. Note that even if direct IO was requested
in 'flags' the driver may choose to do indirect IO instead. If direct IO
was requested and performed then SG_INFO_DIRECT_IO will be matched.
Currently SG_INFO_MIXED_IO is never set. The associated values are:
        SG_INFO_DIRECT_IO_MASK  0x6
        SG_INFO_INDIRECT_IO     0x0 /* data xfer via kernel buffers (or no xfer) */
        SG_INFO_DIRECT_IO       0x2
        SG_INFO_MIXED_IO        0x4 /* part direct, part indirect IO */


The new write() and read() calls
--------------------------------

write(int sg_fd, const void * buffer, size_t count)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The action of write() with a control block based on struct sg_header is
discussed in the earlier document: scsi-generic.txt (v2). This section
describes the action of write() when it is given a control block based on
struct sg_io_hdr.

The 'buffer' should point to an object of type sg_io_hdr_t and 'count'
should be sizeof(sg_io_hdr_t) [it can be larger but the excess is
ignored]. If the write() call succeeds then the 'count' is returned as
the result.

Up to SG_MAX_QUEUE (16) write()s can be queued up before any finished
requests are completed by read(). An attempt to queue more than that will
result in an EDOM error. The write() command should return more or less
immediately. [There is a small probability it will spend some time
waiting for a command block to become available. If O_NONBLOCK is active
then this scenario will cause an EAGAIN.]

The version 2 sg driver defaulted the maximum queue length to 1 (and made
available the SG_SET_COMMAND_Q ioctl() to switch it to SG_MAX_QUEUE). So
for backward compatibility a file descriptor that only receives sg_header
structures in its write() will have a default "max" queue length of 1. As
soon as a sg_io_hdr_t structure is seen by a write() then the maximum
queue length is switched to SG_MAX_QUEUE on that file descriptor.

The "const" on the 'buffer' pointer is respected by the sg driver. Data
is read in from the sg_io_hdr object that is pointed to. Significantly
this is when the 'sbp' and the 'dxferp' are recorded internally (i.e. not
from the sg_io_hdr object given to the corresponding read() ).

read(int sg_fd, void * buffer, size_t count)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The action of read() with a control block based on struct sg_header is
discussed in the earlier document: scsi-generic.txt (v2). This section
describes the action of read() when it is given a control block based on
struct sg_io_hdr.

The 'buffer' should point to an object of type sg_io_hdr_t and 'count'
should be sizeof(sg_io_hdr_t) [it can be larger but the excess is
ignored]. If the read() call succeeds then the 'count' is returned as the
result.

By default, read() will return the oldest completed request that is
queued up. A read() will not interfere with any request associated with
the SG_IO ioctl() on this file descriptor except in a special case when a
SG_IO ioctl() is interrupted by a signal.

If the SG_SET_FORCE_PACK_ID,1 ioctl() is active then read() will attempt
to fetch the packet whose pack_id (given earlier to write()) matches the
sg_io_hdr_t::pack_id given to this read(). If not available it will
either wait or yield EAGAIN. As a special case, -1 in
sg_io_hdr_t::pack_id given to read() will match the request whose
response has been waiting for the longest time. Take care to also set
'dxfer_direction' to any valid value (e.g. SG_DXFER_NONE) when in this
mode. The 'interface_id' member should also be set appropriately.

Apart from the SG_SET_FORCE_PACK_ID case (and then only for the 3
indicated fields), the sg_io_hdr_t object given to read() can be
uninitialized. Note that the 'sbp' pointer value for optionally
outputting a sense buffer was recorded from the earlier, corresponding
write().
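
The write() and read() rules above can be sketched as a pair of small
helpers. This is an illustrative sketch only: the names sg_submit() and
sg_fetch() are hypothetical, and <scsi/sg.h> is assumed to provide
sg_io_hdr_t and SG_DXFER_NONE. The fetch helper sets the 3 members that
matter when SG_SET_FORCE_PACK_ID is active, which is harmless when that
mode is off.

```c
#include <errno.h>      /* errno holds the cause when a helper returns -1 */
#include <string.h>
#include <unistd.h>
#include <scsi/sg.h>    /* assumed to provide sg_io_hdr_t and SG_DXFER_* */

/* Stage 1: queue a prepared request, tagging it with 'pack_id' so the
 * response can be recognised later. Returns 0, or -1 with errno set
 * (e.g. EDOM when too many requests are already queued). */
int sg_submit(int sg_fd, sg_io_hdr_t *hdr, int pack_id)
{
    hdr->pack_id = pack_id;
    if (write(sg_fd, hdr, sizeof(*hdr)) < 0)
        return -1;
    return 0;
}

/* Stage 3: fetch a response into 'out'. Passing -1 as 'want_pack_id'
 * asks for the response that has been waiting longest. Returns 0, or
 * -1 with errno set (e.g. EAGAIN on a non-blocking descriptor). */
int sg_fetch(int sg_fd, sg_io_hdr_t *out, int want_pack_id)
{
    memset(out, 0, sizeof(*out));
    out->interface_id = 'S';
    out->dxfer_direction = SG_DXFER_NONE;   /* any valid value will do */
    out->pack_id = want_pack_id;
    if (read(sg_fd, out, sizeof(*out)) < 0)
        return -1;
    return 0;
}
```

A caller would typically prepare the header, call sg_submit(), wait with
poll() or a blocking sg_fetch(), and then examine the output members of
the fetched header.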

poll(struct pollfd *ufds, unsigned int nfds, int timeout);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For file descriptors associated with sg devices:
    - POLLIN    one or more responses are awaiting a read()
    - POLLOUT   command can be sent to write() without causing an EDOM
                error (i.e. sufficient space on sg's queues)
    - POLLHUP   SCSI device has been detached, awaiting cleanup
    - POLLERR   internal structures are inconsistent

POLLOUT indicates the sg will not block a new write() or SG_IO ioctl().
However it is still possible (but unlikely) that the mid level or an
adapter may block (or yield EAGAIN).


New ioctl()s
------------

SG_IO 0x2285
The idea is deceptively simple: just hand a sg_io_hdr_t object to an
ioctl() and it will return when the SCSI command is finished. It is
logically equivalent to doing a write() followed by a blocking read().
The word "blocking" here implies the read() will wait until the SCSI
command is complete.

The same file descriptor can be used both for SG_IO synchronous calls and
the write() read() sequences at the same time. [This implies several
threads which may just as easily use separate file descriptors.] The sg
driver makes sure that the response to a SG_IO call will never
accidentally be fetched by a read().

It is possible for the wait for the command completion to be interrupted
by a signal. In this case the SG_IO call will yield an EINTR error. This
is reasonably complex to handle and is discussed in the
SG_SET_KEEP_ORPHAN ioctl description below.

The following SCSI commands will be permitted by SG_IO when the user only
has read permissions on the device:
    TEST UNIT READY, INQUIRY, READ CAPACITY, READ BUFFER, READ(6),
    READ(10), READ(12), MODE_SENSE(6), MODE_SENSE(10)
This is experimental and may change. All commands to device type scanner
are accepted. Other cases yield an EACCES error.
Note that the write() read() interface must have read-write permissions
on the device as write permission is required by Linux to execute a
write() call.

SG_GET_REQUEST_TABLE 0x2286
This ioctl outputs an array of information about the status of requests
associated with the current file descriptor. Its 3rd argument should
point to memory large enough to receive SG_MAX_QUEUE objects of the
sg_req_info_t structure. This structure has the following members:
    req_state       0 -> request not in use
                    1 -> request has been sent, but is not finished
                         (i.e. it is between stages 1 and 2 in the
                         "theory of operation")
                    2 -> request is ready to be read() (i.e. it is
                         between stages 2 and 3 in the "theory of
                         operation")
    orphan          0 -> normal request
                    1 -> request sent by SG_IO ioctl() which has been
                         interrupted by a signal
    sg_io_owned     0 -> request sent by a write()
                    1 -> request sent by a SG_IO ioctl()
    problem         0 -> no problem (or 1 == req_state)
                    1 -> req_state is 2 and either masked_status,
                         host_status or driver_status is non-zero
    duration        [if 1 == req_state] time since request was sent
                         (in millisecs)
                    [if 2 == req_state] duration of request (in
                         millisecs). Clock is stopped when stage 2 in
                         "theory of operation" is reached
    pack_id
    usr_ptr         these are user provided values in the sg_io_hdr_t
                    (or struct sg_header) that sent the request

SG_SET_KEEP_ORPHAN 0x2287
SG_GET_KEEP_ORPHAN 0x2288
These ioctl()s allow the setting and reading of the "keep_orphan" flag.
This controls what happens to the request associated with a SG_IO ioctl()
that is interrupted (i.e. errno is EINTR). The default action is to drop
the response as soon as it is received. This corresponds to the
"keep_orphan" flag being 0. When the "keep_orphan" flag is 1 then the
response is transformed in such a way that it can be fetched by a read().
This is the only circumstance in which a request sent by a SG_IO ioctl()
can have the associated response fetched by a read().
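
A synchronous request through SG_IO, followed by a first look at the
result, might be sketched as below. The names sg_io_needs_check() and
sg_io_run() are hypothetical helpers invented for this example; SG_IO,
SG_INFO_OK_MASK and SG_INFO_OK are assumed to come from <scsi/sg.h>.
An interrupted call (EINTR) is simply reported to the caller and not
retried, since what happens to the response depends on the
"keep_orphan" flag.

```c
#include <errno.h>      /* errno holds the cause when ioctl() fails */
#include <sys/ioctl.h>
#include <scsi/sg.h>    /* assumed to provide SG_IO and SG_INFO_* */

/* Returns 1 if the completed request has SG_INFO_CHECK set in 'info',
 * i.e. masked_status, host_status or driver_status is non-zero. */
int sg_io_needs_check(const sg_io_hdr_t *hdr)
{
    return (hdr->info & SG_INFO_OK_MASK) != SG_INFO_OK;
}

/* Perform stages 1 to 3 in one call.
 * Returns -1 if the ioctl() itself failed (errno set, e.g. EINTR or
 *            EACCES), in which case the output members are not valid;
 *          0 if the command completed without any "noise";
 *          1 if it completed but the status members need examining. */
int sg_io_run(int sg_fd, sg_io_hdr_t *hdr)
{
    if (ioctl(sg_fd, SG_IO, hdr) < 0)
        return -1;
    return sg_io_needs_check(hdr);
}
```

When sg_io_run() returns 1 the caller would go on to look at
'masked_status', 'host_status', 'driver_status' and, if 'sb_len_wr' is
non-zero, the sense buffer.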


Other changes to system calls
-----------------------------
The ability of the SG_IO ioctl() to issue certain SCSI commands has led
to some relaxation on file descriptors open()ed "read-only". The open()
call will now attempt to allocate a reserve buffer for all newly opened
file descriptors. The SG_SET_RESERVED_SIZE ioctl() will now work on
"read-only" file descriptors.


Changes to SCSI mid-level ioctls in 2.4
---------------------------------------
SCSI_IOCTL_GET_IDLUN:
This ioctl takes a pointer to a "struct scsi_idlun" object as its third
argument. The "struct scsi_idlun" definition is found in a kernel SCSI
header file. It gets populated with scsi host, channel, device id and lun
data for the given device. Unfortunately that header file "hides" that
structure behind a "#ifdef __KERNEL__" block. To use this, that structure
needs to be replicated in the user's program. Something like:
    typedef struct my_scsi_idlun {
        int four_in_one;    /* 4 separate bytes of info compacted into 1 int */
        int host_unique_id; /* distinguishes adapter cards from same supplier */
    } My_scsi_idlun;

"four_in_one" is made up as follows:
    (scsi_device_id | (lun << 8) | (channel << 16) | (host_no << 24))
These 4 components are assumed (or masked) to be 1 byte each. The
'host_no' element is a change in lk 2.4 kernels. [In the lk 2.2 series
and prior it was 'low_inode & 0xff' from the procfs entry corresponding
to the host.] This change makes the use of the SCSI_IOCTL_GET_BUS_NUMBER
ioctl superfluous.


open and close
--------------
Open() and close() remain essentially the same as in sg v2. Note that
multiple file descriptors may be open to the same SCSI device. [This is a
way of side stepping the SG_MAX_QUEUE limit.] At the sg level separate
state information is maintained. This means that even if multiple file
descriptors are open to a single SCSI device their write() read()
sequences are essentially independent.

Open() calls may be blocked due to exclusive locks (i.e. O_EXCL).
An exclusive lock applies to a single SCSI device and only to sg's use of
that device (i.e. it has no effect on access via sd, sr or st to that
device). If the O_NONBLOCK flag is used then open() calls that would have
otherwise blocked, yield EBUSY. Applications that scan sg devices trying
to determine their identity (e.g. whether one is a scanner) should use
the O_NONBLOCK flag otherwise they run the risk of blocking.

The close() system call never blocks. Seen from the sg driver's point of
view, even when an application is aborted, any sg file descriptors it is
holding will be closed. A close() may occur while a SCSI command is
active, in which case the response is discarded when it is received. If
sg is a module, precautions are taken so 'rmmod sg' will return a busy
status until all outstanding responses are received (and duly discarded).

A SCSI device can be detached while an application has a sg file
descriptor open. All subsequent system calls to sg for that file
descriptor, other than close(), will yield ENODEV. A subsequent attempt
to open() that device name will yield ENODEV. A call to close() will
succeed.


Errors
------
With the original interface almost any string could be accidentally given
to write() and potentially something nasty could happen. If some error
was detected then more than likely EIO was placed in errno. Unfortunately
this can still happen with write() since it can accept either the
original struct sg_header or the newer sg_io_hdr_t described in this
note. However since the SG_IO ioctl() will only accept the sg_io_hdr_t
structure there is less chance of a random string being interpreted as a
command.

Since the sg_io_hdr_t interface does a lot more error checking, it
attempts to give out more precise errno values to help the user pinpoint
the problem. [Admittedly some of these errno values are picked in an
arbitrary way from the large set of available values.]

Below is a table of errno values indicating which calls to sg will give
them and the meaning of the error. A write() call is indicated by "w", a
read() call by "r" and an open() call by "o".

errno     which_calls   Meaning
-----     -----------   ----------------------------------------------
EACCES    o,w,r,SG_IO   User does not have permissions to do this. They
                        will need read,write permissions currently.
                        This has been relaxed for certain SCSI commands
                        via SG_IO.
EAGAIN    r             The file descriptor is non-blocking and the
                        request has not been completed yet.
EAGAIN    w,SG_IO       SCSI sub-system has (temporarily) run out of
                        command blocks.
EBADF     w             File descriptor was not open()ed O_RDWR.
EBUSY     o             Someone else has an O_EXCL lock on this device.
EBUSY                   Attempt to change something (e.g. reserved
                        buffer size) when the resource was in use.
EDOM      w,SG_IO       Too many requests queued against this file
                        descriptor. Limit is SG_MAX_QUEUE active
                        requests. If sg_header interface is being used
                        then the default queue depth is 1. Use
                        SG_SET_COMMAND_Q ioctl() to increase it.
EFAULT    w,r,SG_IO     Pointer to user space invalid.
EINVAL    w,r           Size given as 3rd argument not large enough for
                        the sg_io_hdr_t structure.
EIO       w             Size given as 3rd argument less than size of
                        old header structure (sg_header). Additionally
                        a write() with the old header will yield this
                        error for most detected malformed requests.
EINTR     o             While waiting for the O_EXCL lock to clear this
                        call was interrupted by a signal.
EINTR     r,SG_IO       While waiting for the request to finish this
                        call was interrupted by a signal.
EINTR     w             [Very unlikely] While waiting for an internal
                        SCSI resource this call was interrupted by a
                        signal.
EMSGSIZE  w,SG_IO       SCSI command size ('cmd_len') was too small
                        (i.e. < 6) or too large
ENODEV    o             Tried to open() a file with no associated
                        device. [Perhaps sg has not been built into the
                        kernel or is not available as a module?]
ENODEV    o,w,r,SG_IO   SCSI device has detached, awaiting cleanup.
                        User should close fd. Poll() will yield
                        POLLHUP.
ENOENT    o             Given filename not found.
ENOMEM    o             [Very unlikely] Kernel was not even able to find
                        enough memory for this file descriptor's
                        context.
ENOMEM    w,SG_IO       Kernel unable to find memory for internal
                        buffers. This is usually associated with
                        indirect IO.
ENOSYS    w,SG_IO       'interface_id' of a sg_io_hdr_t object was _not_
                        'S'.
ENXIO     o             "remove-single-device" may have removed this
                        device.
ENXIO     o, w,r,SG_IO  Internal error (including SCSI sub-system busy
                        doing error processing - e.g. SCSI bus reset).
                        This can be bypassed by opening with O_NONBLOCK.


Direct IO
---------
The term 'direct IO' in the context of sg refers to using the SCSI adapter
to directly transfer data to and from user memory. Most modern SCSI adapters
can be "bus masters" and use DMA for those data transfers. This lightens the
load of the CPU. Speed and lower CPU utilization come at the expense of
complexity (as always). The Linux kernel must be careful not to touch that
part of the user process's memory that is accessed, for the duration of the
data transfer. Due to these issues most drivers, including sg in the past,
have taken the simpler approach which involves the double handling of data
through pre-allocated kernel buffers. This latter approach is called
"indirect IO".

Direct IO is available as an option in sg 3.1.18 (before that the sg driver
needed to be recompiled with an altered define). Direct IO support is
designed in such a way that if it is requested and cannot be performed then
the command will still be performed using indirect IO. If direct IO is
requested and has been performed then the SG_INFO_DIRECT_IO bit will be set
in the 'info' member of the sg_io_hdr_t control structure after the request
has been completed.

Direct IO uses facilities that are only available in the lk 2.4 series. You
can request direct IO with the 2.2 series version of this driver but
indirect IO will be performed. Further, direct IO is not supported on ISA
SCSI adapters since they can only address a 24 bit address space.
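
The request-then-check pattern described above might look like the sketch
below. It assumes the sg_io_hdr_t definitions from <scsi/sg.h>; the two
function names and the 20 second timeout are choices made for this example
only. The first function fills in a header that asks for direct IO on a
device-to-host transfer, and the second inspects the 'info' member once the
request has completed.

```c
#include <assert.h>
#include <string.h>
#include <scsi/sg.h>

/* Fill in an sg_io_hdr_t for a device-to-host transfer that requests
 * direct IO. The caller supplies the command, data and sense buffers. */
void prepare_dio_read(sg_io_hdr_t *hdr,
                      unsigned char *cdb, unsigned char cdb_len,
                      void *data, unsigned int data_len,
                      unsigned char *sense, unsigned char sense_len)
{
    assert(hdr != NULL);
    memset(hdr, 0, sizeof(*hdr));
    hdr->interface_id = 'S';            /* any other value yields ENOSYS */
    hdr->dxfer_direction = SG_DXFER_FROM_DEV;
    hdr->cmdp = cdb;
    hdr->cmd_len = cdb_len;
    hdr->dxferp = data;
    hdr->dxfer_len = data_len;
    hdr->iovec_count = 0;               /* no user scatter gather with DIO */
    hdr->sbp = sense;
    hdr->mx_sb_len = sense_len;
    hdr->timeout = 20000;               /* milliseconds; arbitrary choice */
    hdr->flags = SG_FLAG_DIRECT_IO;     /* a request, not a guarantee */
}

/* After completion: non-zero if the transfer really used direct IO,
 * zero if the driver fell back to indirect IO. */
int used_direct_io(const sg_io_hdr_t *hdr)
{
    return (hdr->info & SG_INFO_DIRECT_IO_MASK) == SG_INFO_DIRECT_IO;
}
```

The prepared header would be handed to ioctl(fd, SG_IO, &hdr), or to a
write()/read() pair, before used_direct_io() is consulted.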

One limit on direct IO is that sg_io_hdr_t::iovec_count must be 0. So the
user cannot (currently) use application level scatter gather and direct IO
on the same request.

For direct IO to be worthwhile, a reasonable amount of data should be
requested for data transfer. For transfers less than 8 KByte it is probably
not worth the trouble. On the other hand "locking down" multiple 512 KB
blocks of data for direct IO could adversely impact overall system
performance. Remember that for the duration of a direct IO request, the data
transfer buffer is mapped to a fixed memory location and locked in such a
way that it won't be swapped out. This can "cramp the style" of the kernel
if it is overdone.

The memory given to sg as the data buffer for direct IO would usually come
from the heap or be an external or static array. It is probably not wise to
use the stack or shared memory.

Experience has shown that a large block of memory malloc-ed by a user does
not look contiguous at all seen from the DMA hardware's point of view. This
means that direct IO relies on the scatter gather capabilities of the DMA
hardware on the SCSI adapter. [This is a _different_ scatter gather
mechanism to that which the user sees in the interface based on iovec.]
This puts an effective limit on the size of a direct IO transfer whose size
in bytes can be approximately calculated by:

    (max_scsi_adapter_scatter_gather_elements - 1) * PAGE_SIZE

The "-1" component allows for alignment considerations.

Prior to sg 3.1.18 the direct IO code was commented out with the
"SG_ALLOW_DIO" define. In sg 3.1.18 (available for lk 2.4.2 and later) the
direct IO code is active but is defaulted off by a run time value. This
value can be accessed via the "proc" file system at
/proc/scsi/sg/allow_dio . Direct IO is enabled when a user with root
permissions writes "1" to that file:

    echo 1 > /proc/scsi/sg/allow_dio
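
The approximate size limit given earlier in this section can be written as
a small helper. This is only a sketch: the function name is hypothetical,
the element count is assumed to be the adapter's scatter gather limit (the
'sgat' value reported in /proc/scsi/sg/hosts), and the page size is assumed
to come from something like sysconf(_SC_PAGESIZE).

```c
#include <assert.h>
#include <stddef.h>

/* Approximate upper bound, in bytes, on a single direct IO transfer:
 * (sg_elements - 1) * page_size. One element is held back to allow for
 * a buffer that does not start on a page boundary. */
size_t dio_max_transfer(unsigned int sg_elements, size_t page_size)
{
    assert(page_size > 0);
    if (sg_elements <= 1)
        return 0;   /* no room left once the alignment element is removed */
    return (size_t)(sg_elements - 1) * page_size;
}
```

For example, an adapter with 33 scatter gather elements and 4096 byte pages
gives 32 * 4096 = 131072 bytes (128 KB) as the estimate.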


Driver and module initialisation
--------------------------------
The size of the default reserved buffer can be specified when the sg driver
is loaded. If it is built into the kernel then use:

    sg_def_reserved_size=<n>

on the boot line (only supported in 2.4 kernels). If sg is a module, an
explicit "insmod" could look like:

    insmod sg def_reserved_size=<n>

"<n>" is an integer (non negative). The default value is the value of the
SG_DEF_RESERVED_SIZE define in sg.h . This is currently 32768.


Additions to the "proc" file system
-----------------------------------
The provision of SCSI mid level and sg information via the "proc" file
system is in its infancy and may change as more experience is gained. The
following files readable by all are defined in the sub directory
"/proc/scsi/sg" :

    allow_dio          0 indicates direct IO disabled, 1 for enabled
    debug              debug information including active request data
    def_reserved_size  default buffer size reserved for each file descriptor
    devices            one line of numeric data per device
    device_hdr         single line of column names corresponding to 'devices'
    device_strs        one line of vendor, product and rev info per device
    hosts              one line of numeric data per host
    host_hdr           single line of column names corresponding to 'hosts'
    host_strs          one line of host information (string) per host
    version            sg version as a number followed by a string
                       representation

Each line in 'devices' and 'device_strs' corresponds to an sg device. For
example the first line corresponds to /dev/sg0 (or /dev/sga). The line
number (origin 0) also corresponds to the sg minor device number. This
mapping is local to sg and is normally the same as given by
'cat /proc/scsi/scsi' reported by the SCSI mid level code. The two mappings
diverge when 'remove-single-device' and 'add-single-device' are used.

Each line in 'hosts' and 'host_strs' corresponds to a SCSI host. For example
the first line corresponds to the host normally represented as "scsi0".
This mapping is invariant across the SCSI sub system.
[So these entries could arguably be migrated to the mid level.]

The column headers in 'device_hdr' are given below. If the device is not
present (and one is present after it) then a line of "-1" entries is output.
Entries are separated by whitespace (currently a tab):

    host    host number (indexes 'hosts' table, origin 0)
    chan    channel number of device (is this ever non-zero?)
    id      SCSI id of device
    lun     Logical Unit number of device
    type    SCSI type (e.g. 0->disk, 5->cdrom, 6->scanner)
    bopens  number of block device (sd or sr) opens at this time
    depth   maximum queue depth supported by device
    busy    number of commands being processed by host for this device

The column headers in 'host_hdr' are given below. Entries are separated by
whitespace (currently a tab):

    uid     unique id (non-zero if multiple hosts of same type)
    busy    number of commands being processed for this host
    cpl     maximum number of commands per lun (may be 0 if "device depth"
            is given)
    sgat    maximum elements of scatter gather the adapter (pseudo) DMA
            can accommodate
    isa     0 -> non-ISA adapter, 1 -> ISA adapter. ISA adapters are
            assumed to have a 24 bit address bus limit (16 MB).
    emu     0 -> real SCSI adapter, 1 -> emulated SCSI adapter (e.g.
            ide-scsi device driver)

The 'def_reserved_size' is both readable and writeable. It is only
writeable by root. It is initialized to the value of SG_DEF_RESERVED_SIZE
in the "sg.h" file. Values between 0 and 1048576 (which is 2 ** 20) are
accepted and can be set from the command line with the following syntax:

    # echo "262144" > /proc/scsi/sg/def_reserved_size

Note that the actual reserved buffer associated with a file descriptor
could be less than 'def_reserved_size' if appropriate memory is not
available.

If the sg driver is compiled into the kernel (but not when it is a module)
this value can also be read at /proc/sys/kernel/sg-big-buff . This latter
feature is deprecated.

The 'allow_dio' is both readable and writeable. It is only writeable by root.
When it is 0 (default) any request to do direct IO (i.e. by setting
SG_FLAG_DIRECT_IO) will be ignored and indirect IO will be done instead.

The 'debug' file outputs the current internal state of the sg driver. SCSI
commands that are being processed have state, opcode, pack_id, length,
timeout and elapsed time (in milliseconds) shown. Apart from using
'cat /proc/scsi/sg/debug' from the command line, application code may
consider lines like:

    system("cat /proc/scsi/sg/debug");

at appropriate points when in debugging mode.

N.B. If sg has lots of activity then the "debug" output may span many lines
and in some cases appear to be corrupted. This occurs because procfs
requests fixed buffer sizes of information and, if there is more data to
output, returns later to get the rest. The problem with this strategy is
that sg's internal state may have changed. Rather than double buffering, the
sg driver just continues from the same offset. While procfs is very useful,
ioctl()s (such as SG_GET_REQUEST_TABLE) still have their place.


Asynchronous usage of sg
------------------------
It is recommended that synchronous sg-based applications use the new SG_IO
ioctl() command. Existing applications (which are mainly synchronous) can
continue to use the older sg_header based interface which is still
supported.

Asynchronous usage allows multiple SCSI commands to be queued up to the
device. If the device supports tagged queuing then there can be a major
performance gain. Even if the device doesn't support tagged queuing (or is
temporarily busy) then queuing up commands in the mid level or the host
driver can be a minor performance win (since there will be a lower latency
to transmit the next command when the device becomes free).

Asynchronous usage usually starts with setting the O_NONBLOCK flag on
open() [or thereafter by using the fcntl(fd, F_SETFL, old_flags |
O_NONBLOCK) system call]. A similar effect can be obtained without using
O_NONBLOCK when POSIX threads are used.
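
Setting O_NONBLOCK on a descriptor that is already open can be sketched as
follows. O_NONBLOCK is a file status flag, so the fcntl() commands involved
are F_GETFL and F_SETFL. The helper name is hypothetical and the function is
an illustration only, not part of the sg interface.

```c
#include <assert.h>
#include <fcntl.h>

/* Add O_NONBLOCK to the file status flags of an open descriptor.
 * Returns 0 on success, -1 on failure (errno is set by fcntl()). */
int set_nonblocking(int fd)
{
    int old_flags;

    assert(fd >= 0);    /* a negative descriptor is a caller bug here */
    old_flags = fcntl(fd, F_GETFL);
    if (old_flags < 0)
        return -1;
    if (old_flags & O_NONBLOCK)
        return 0;       /* already non-blocking, nothing to do */
    return fcntl(fd, F_SETFL, old_flags | O_NONBLOCK) < 0 ? -1 : 0;
}
```

Reading the current flags first keeps any other status flags, such as
O_APPEND, from being cleared by the F_SETFL call.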

There are several strategies that can then be followed:

    1) set O_NONBLOCK and use a poll() loop
    2) set O_NONBLOCK and use SIGPOLL signal to alert app when readable
    3) use POSIX threads and a single sg file descriptor
    4) use POSIX threads and multiple sg file descriptors to same device

In Linux SIGIO and SIGPOLL are the same signal. If POSIX realtime signals
are used (e.g. when SA_SIGINFO is used with sigaction() and
fcntl(fd, F_SETSIG, SIGRTMIN + <n>) ) then the file descriptor with which
the signal is associated is available to the signal handler. The associated
file descriptor is in the si_fd member of the siginfo_t structure. The
poll() system call that is often used after a signal is received can thus
be bypassed.


Other references
----------------
The most recent news on the sg driver can be found at:

    http://www.torque.net/sg

A package of utilities called "sg_utils" can be found on this page. A
similar package of utilities called "sg3_utils" targeted at the sg v3
driver can also be found on this page. Some notes on the sg v3 driver can
be found at:

    http://www.torque.net/sg/s_packet.html

For some timing (and CPU utilization) comparisons between direct and
indirect IO see:

    http://www.torque.net/sg/rbuf_tbl.html

The Linux Documentation Project's SCSI-2.4-HOWTO may help to put this
driver into perspective:

    http://linuxdoc.org/HOWTO/SCSI-2.4-HOWTO


Douglas Gilbert
dgilbert@interlog.com
dougg@torque.net
SG web site: http://www.torque.net/sg