The Linux SCSI Generic (SG3) Interface
======================================
991105   D. Gilbert

Introduction
------------
This document is a brief explanatory note about the new Linux SCSI Generic
interface. To distinguish this document from others, the number "3" is
appended to "sg". This is just for identification purposes: the driver will
continue to be known to the kernel as "sg" and sg devices will be of the
form "/dev/sg" followed by a single letter or a number. The existing sg
driver in 2.2.13 (for instance) is documented in
/usr/src/linux/Documentation/scsi-generic.txt .

Identifying a SG driver with the new interface
----------------------------------------------
Existing versions of the sg device driver either have no version number
(e.g. the original driver) or a version number starting with "2". The
drivers that support this new interface have a major version number of "3".
The sg version numbers are of the form "x.y.z" and the single number given
by the SG_GET_VERSION_NUM ioctl() is calculated by
(x * 10000 + y * 100 + z). The sg driver discussed here will yield a number
greater than or equal to 30000 from SG_GET_VERSION_NUM. In the new driver
the version number can also be seen using 'cat /proc/scsi/sg/version'.

Interface
---------
This driver supports the system calls that would be expected of any
character device driver in Linux. They are: open(), close(), write(),
read() and ioctl() (plus a few others associated with polling and
asynchronous notification). The major addition in sg3 is an ioctl() called
SG_IO which is functionally equivalent to a write() followed by a blocking
read(). In certain contexts the write()/read() combination has advantages
over SG_IO (e.g. command queuing) and continues to be supported.

The existing (and original) sg interface based on the sg_header structure
is still available using a write()/read() sequence as before. The SG_IO
ioctl will only accept the new interface based on the sg_io_hdr structure.
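
Because SG_IO and the sg_io_hdr structure are only understood by version 3
drivers, a program may want to check the driver version before relying on
them. The following is a minimal sketch of the version arithmetic described
above, not code taken from the driver. The names sg_version_parts,
sg_decode_version() and sg_supports_v3() are hypothetical, and it is assumed
that <scsi/sg.h> on the build machine defines SG_GET_VERSION_NUM.

```c
#include <sys/ioctl.h>
#include <scsi/sg.h>

/* Hypothetical container for the "x.y.z" parts of the version number. */
struct sg_version_parts {
    int x;
    int y;
    int z;
};

/* Invert the documented encoding: num = x * 10000 + y * 100 + z. */
static struct sg_version_parts sg_decode_version(int num)
{
    struct sg_version_parts v;

    v.x = num / 10000;
    v.y = (num / 100) % 100;
    v.z = num % 100;
    return v;
}

/*
 * Returns 1 if the driver behind sg_fd reports a version >= 30000,
 * otherwise 0. A failing ioctl() is treated as "old driver" here, which
 * is an assumption made for this sketch.
 */
static int sg_supports_v3(int sg_fd)
{
    int ver = 0;

    if (ioctl(sg_fd, SG_GET_VERSION_NUM, &ver) < 0)
        return 0;
    return ver >= 30000;
}
```

For instance, a reported value of 30112 decodes to version "3.1.12".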
The sg3 driver thus has a write() call that can accept either the older
sg_header structure or the new sg_io_hdr structure. It makes its decision
based on the second integer position of the passed header (i.e.
sg_header::pack_len or sg_io_hdr::dxfer_direction). If it is a positive
number then the old interface is assumed. If it is a negative number then
the new interface is assumed. [I'm still thinking about zero ...]

If a request is sent to a write() with the sg_io_hdr interface then the
corresponding read() that fetches that request on completion must also use
the sg_io_hdr interface. The same rule applies to the sg_header interface.

Some seldom used ioctl()s introduced in the sg 2.x series drivers have been
removed. They are:
  - SG_SET_UNDERRUN_FLAG (and _GET_) [use 'resid' in this new interface]
  - SG_SET_MERGE_FD (and _GET) [added complexity with little benefit]

Theory of operation
-------------------
The path of a request through the sg driver can be broken into 3 distinct
functions:

1) The request is received from the user, resources are reserved as
   required (e.g. kernel buffer for indirect IO) and then the request is
   submitted to the SCSI mid level (and then onto the adapter) for
   execution. The SCSI mid level maintains a queue so the request may have
   to wait. If the device supports tagged queuing then it may be able to
   accommodate multiple outstanding requests.

2) Assuming the SCSI adapter supports interrupts, then an interrupt is
   received when the request is complete. When this interrupt arrives the
   data transfer is complete. This means that if the SCSI command was a
   READ then the data is in kernel buffers (indirect IO) or in user buffers
   (direct IO). The sg driver is informed of this interrupt via a kernel
   mechanism called a "bottom half" handler. Some kernel resources are
   freed up.

3) The user makes a call to fetch the result of the request. If necessary
   data in kernel buffers is transferred to the user space.
   If necessary the sense buffer is written out to the user space. The
   remaining kernel resources associated with this request are freed up.

The write() call performs function 1 while the read() call performs
function 3. If the read() call is made before function 2 is complete then
it will either wait or yield EAGAIN (depending on whether the file
descriptor is blocking or not). If asynchronous notification is being used
then function 2 will send a SIGIO signal to the user process (assuming it
is properly registered). Also values are set so that the poll() system call
will show this file descriptor is now readable (unless it was sent by the
SG_IO ioctl()).

The SG_IO ioctl() performs function 1, waits for function 2 and then
performs function 3. If the file descriptor in question is set O_NONBLOCK
then SG_IO will ignore this and still block! Also a SG_IO call will not
affect the poll() state nor cause a SIGPOLL signal to be sent. If you
really want non-blocking operation (e.g. for command queuing) then don't
use SG_IO; use the write() read() sequence instead.

The Sg_io_hdr structure in detail
---------------------------------
char interface_id;          /* [i] 'S' for SCSI generic */
    This must be set to 'S' (capital ess). If not, the ENOSYS error number
    is placed in errno. The idea is to allow interface variants in the
    future that identify themselves with a different letter.

unsigned char cmd_len;      /* [i] SCSI command length ( <= 16 bytes) */
    This is the length in bytes of the SCSI command that 'cmdp' points to.
    As a SCSI command is expected, an EMSGSIZE error number is produced if
    the value is less than 6 or greater than 16. If the subsystem imposes
    a lower limit (as it currently does at 12 bytes) then EMSGSIZE is
    produced when that limit is exceeded as well.

unsigned char iovec_count;  /* [i] 0 implies no scatter gather */
    This is the number of scatter gather elements in an array pointed to
    by 'dxferp'.
    If the value is zero then scatter gather (in the user space) is _not_
    being used and 'dxferp' points to the data transfer buffer. If the
    value is greater than zero then each element of the array is assumed
    to be of the form:

        typedef struct sg_iovec
        {
            void * iov_base;    /* starting address */
            size_t iov_len;     /* length in bytes */
        } Sg_iovec;

    Note that this structure has been named and defined in such a way to
    parallel "struct iovec" used by the readv() and writev() system calls
    in Linux. See "man 2 readv".

unsigned char mx_sb_len;    /* [i] max length to write to sbp */
    This is the maximum size that can be written back to the 'sbp' pointer
    when a sense_buffer is output, which is usually in an error situation.
    The actual number written out is given by 'sb_len_wr'. In all cases
    'sb_len_wr' <= 'mx_sb_len' .

int dxfer_direction;        /* [i] data transfer direction */
    This is required to be one of the following:

        SG_DXFER_NONE
        SG_DXFER_TO_DEV         /* e.g. a SCSI WRITE command */
        SG_DXFER_FROM_DEV       /* e.g. a SCSI READ command */
        SG_DXFER_TO_FROM_DEV

    The value SG_DXFER_NONE should be used when there is no data transfer
    associated with a command (e.g. TEST UNIT READY). The value
    SG_DXFER_TO_DEV should be used when data is being moved from user
    memory towards the device (e.g. WRITE). The value SG_DXFER_FROM_DEV
    should be used when data is being moved from the device towards user
    memory (e.g. READ).

    The value SG_DXFER_TO_FROM_DEV is only relevant to indirect IO
    (otherwise it is treated like SG_DXFER_FROM_DEV). Data is moved from
    the user space to the kernel buffers. The command is then performed
    and most likely a READ-like command transfers data from the device
    into the kernel buffers. Finally the kernel buffers are copied back
    into the user space. This technique allows application writers to
    initialize the buffer and perhaps deduce the actual number of bytes
    read from the device (i.e. detect underrun). This is better done by
    using 'resid' if it is supported.
    If 'dxfer_len' is zero then all values are treated like SG_DXFER_NONE.

unsigned int dxfer_len;     /* [i] byte count of data transfer */
    This is the number of bytes to be moved in the data transfer
    associated with the command. The direction of the transfer is
    indicated by 'dxfer_direction'. If 'dxfer_len' is zero then no data
    transfer takes place.

void * dxferp;              /* [i], [*io] points to data transfer memory
                               or scatter gather array */
    If 'iovec_count' is zero then this value is a pointer to user memory
    of at least 'dxfer_len' bytes in length. If there is a data transfer
    associated with the command then the data will be transferred to or
    from this user memory. If 'iovec_count' is greater than zero then this
    value points to a scatter-gather array in user memory. Each element of
    this array should be an object of type Sg_iovec. Note that data is
    sometimes written to user memory (e.g. from a failed SCSI READ) even
    when an error has occurred.

unsigned char * cmdp;       /* [i], [*i] points to command to perform */
    This value points to the SCSI command to be executed. The command is
    assumed to be 'cmd_len' bytes long. If cmdp is NULL then the system
    call yields an EMSGSIZE error number. The user memory pointed to is
    only read (not written to). So the overwriting of the LUN on the 2nd
    byte (disabled by SG_FLAG_LUN_INHIBIT) occurs within the sg driver.

unsigned char * sbp;        /* [i], [*o] points to sense_buffer memory */
    This value points to user memory of at least 'mx_sb_len' bytes length
    where the SCSI sense buffer will be output. Most successful commands
    do not output a sense buffer and this will be indicated by 'sb_len_wr'
    being zero.

unsigned int timeout;       /* [i] MAX_UINT->no timeout (unit: millisec) */
    This value is used to timeout the given command. The units of this
    value are milliseconds. The time being measured is from when a command
    is sent until when sg is informed the request has been completed. A
    following read() can take as long as the user likes.
    When a timeout is exceeded the command is aborted and DID_TIME_OUT is
    set in the 'host_status' and DRIVER_TIMEOUT is set in the
    'driver_status'.

unsigned int flags;         /* [i] 0 -> default */
    These are single or multi-bit values that can be "or-ed" together:

    SG_FLAG_DIRECT_IO
        This is a request for direct IO on the data transfer. If it cannot
        be performed then the driver automatically performs indirect IO
        instead. If it is important to find out which type of IO was
        performed then check the values from the SG_INFO_DIRECT_IO_MASK in
        'info' when the request packet is completed (i.e. after read() or
        ioctl(,SG_IO,) ). The default action is to do indirect IO.

    SG_FLAG_LUN_INHIBIT
        The default action of the sg driver is to overwrite internally the
        top 3 bits of the second SCSI command byte with the LUN associated
        with the file descriptor's device. To inhibit this action set this
        flag.

int pack_id;                /* [i->o] unused internally */
    This value is not normally acted upon by the sg driver. It is provided
    so the user can identify the request. This is useful when command
    queuing is being used. The "abnormal" case is when
    SG_SET_FORCE_PACK_ID is set and a 'pack_id' other than -1 is given to
    read(). In this case the read() will wait to fetch a request that
    matches this 'pack_id'.

void * usr_ptr;             /* [i->o] unused internally */
    This value is not acted upon by the sg driver. It is meant to allow
    the user to associate some object with this request (e.g. to maintain
    state information).

unsigned char status;       /* [o] scsi status */
    This is the SCSI status byte as defined by the SCSI standard. Note
    that it can have vendor information set in bits 0, 6 and 7 (although
    this is uncommon). Further note that this 'status' data does _not_
    match the status definitions in the SCSI header file (e.g.
    CHECK_CONDITION). The following 'masked_status' does match those
    definitions.
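
To make the difference between the raw 'status' byte and the shifted
definitions concrete, here is a small sketch of the masking that the
'masked_status' field applies. The RAW_* names and mask_status() are
hypothetical and exist only for this illustration; the raw values are the
status codes from the SCSI standard.

```c
/* Raw status byte values as defined by the SCSI standard (names are
 * hypothetical, chosen for this illustration only). */
#define RAW_GOOD             0x00
#define RAW_CHECK_CONDITION  0x02
#define RAW_BUSY             0x08

/*
 * Strip the vendor bits (0, 6 and 7) and shift right one position,
 * i.e. masked_status == ((status & 0x3e) >> 1).
 */
static unsigned char mask_status(unsigned char status)
{
    return (unsigned char)((status & 0x3e) >> 1);
}
```

With this, a raw CHECK CONDITION (0x02) becomes 0x01 and a raw BUSY (0x08)
becomes 0x04, regardless of any vendor bits that happen to be set.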
unsigned char masked_status;/* [o] shifted, masked scsi status */
    Logically:
        masked_status == ((status & 0x3e) >> 1)
    So 'masked_status' strips the vendor information bits off 'status' and
    then shifts it right one position. This makes it easier to do things
    like "if (CHECK_CONDITION == masked_status) ..." using the definitions
    from the SCSI header file. The defined values in this file are:

        GOOD                  0x00
        CHECK_CONDITION       0x01
        CONDITION_GOOD        0x02
        BUSY                  0x04
        INTERMEDIATE_GOOD     0x08
        INTERMEDIATE_C_GOOD   0x0a
        RESERVATION_CONFLICT  0x0c
        COMMAND_TERMINATED    0x11
        QUEUE_FULL            0x14
        /* N.B. 1 bit offset from usual SCSI status values */

    Note that SCSI 3 defines some additional status codes.

unsigned char msg_status;   /* [o] messaging level data (optional) */
    The messaging level in SCSI is under the command level and knowledge
    of what is happening at the messaging level is very rarely needed.
    Furthermore most modern chip-sets used in SCSI adapters completely
    hide this value. Nearly all adapters will return zero in 'msg_status'
    all the time.

unsigned char sb_len_wr;    /* [o] byte count actually written to sbp */
    This is the actual number of bytes written to the user memory pointed
    to by 'sbp'. 'sb_len_wr' is always <= 'mx_sb_len'. Linux 2.2 series
    kernels (and earlier) truncate this value to a maximum of 16 bytes.
    The actual number of bytes written will not exceed the length
    indicated by the "Additional Sense Length" field (byte 7) of the
    Request Sense response.

unsigned short host_status; /* [o] errors from host adapter */
    These codes potentially come from the firmware on a host adapter or
    from one of several hosts that a driver controls. The 'host_status'
    field has the following values whose #defines mimic those which are
    only visible within the kernel (with the "SG_ERR_" removed from the
    front of each define).
    A copy of these defines can be found in sg_err.h (see the utilities
    section):

        SG_ERR_DID_OK          0x00 /* NO error */
        SG_ERR_DID_NO_CONNECT  0x01 /* Couldn't connect before timeout
                                       period */
        SG_ERR_DID_BUS_BUSY    0x02 /* BUS stayed busy through time out
                                       period */
        SG_ERR_DID_TIME_OUT    0x03 /* TIMED OUT for other reason */
        SG_ERR_DID_BAD_TARGET  0x04 /* BAD target, device not
                                       responding? */
        SG_ERR_DID_ABORT       0x05 /* Told to abort for some other
                                       reason */
        SG_ERR_DID_PARITY      0x06 /* Parity error */
        SG_ERR_DID_ERROR       0x07 /* Internal error [DMA underrun on
                                       aic7xxx] */
        SG_ERR_DID_RESET       0x08 /* Reset by somebody. */
        SG_ERR_DID_BAD_INTR    0x09 /* Got an interrupt we weren't
                                       expecting. */
        SG_ERR_DID_PASSTHROUGH 0x0a /* Force command past mid-layer */
        SG_ERR_DID_SOFT_ERROR  0x0b /* The low level driver wants a
                                       retry */

    As well as these defines in the sg_err.h header file that mimic those
    inside kernel header files, a mirror

unsigned short driver_status;/* [o] errors from software driver */
    One driver can potentially control several hosts. For example Advansys
    provide one Linux adapter driver that controls all adapters made by
    that company - if 2 or more Advansys adapters are in 1 machine, then 1
    driver controls both. When ('driver_status' & SG_ERR_DRIVER_SENSE) is
    true the 'sense_buffer' is also output. The 'driver_status' field has
    the following values whose #defines mimic those which are only visible
    within the kernel (with the "SG_ERR_" removed from the front of each
    define).
    A copy of these defines can be found in sg_err.h (see the utilities
    section):

        SG_ERR_DRIVER_OK       0x00 /* Typically no suggestion */
        SG_ERR_DRIVER_BUSY     0x01
        SG_ERR_DRIVER_SOFT     0x02
        SG_ERR_DRIVER_MEDIA    0x03
        SG_ERR_DRIVER_ERROR    0x04
        SG_ERR_DRIVER_INVALID  0x05
        SG_ERR_DRIVER_TIMEOUT  0x06
        SG_ERR_DRIVER_HARD     0x07
        SG_ERR_DRIVER_SENSE    0x08 /* Implies sense_buffer output */
        /* above status 'or'ed with one of the following suggestions */
        SG_ERR_SUGGEST_RETRY   0x10
        SG_ERR_SUGGEST_ABORT   0x20
        SG_ERR_SUGGEST_REMAP   0x30
        SG_ERR_SUGGEST_DIE     0x40
        SG_ERR_SUGGEST_SENSE   0x80

int resid;                  /* [o] dxfer_len - actual_transferred */
    This is the residual count from the data transfer. It is 'dxfer_len'
    less the number of bytes actually transferred. In practice it only
    reports underruns (i.e. a positive number) as data overruns should
    never happen. At the time of writing no SCSI adapters supported
    'resid' but hopefully this will soon change. This value will be zero
    if there was no underrun or the SCSI adapter doesn't support this
    feature.

unsigned int duration;      /* [o] time taken (unit: millisec) */
    This value will be the number of milliseconds from when a SCSI command
    was sent until sg is informed that it is complete. For i386 machines
    the granularity is 10ms while on alpha machines it is 1ms. This value
    is rounded toward zero.

unsigned int info;          /* [o] auxiliary information */
    This value is designed to convey useful information back to the user
    about the associated request. This information does not necessarily
    indicate an error. Several single bit and multi-bit fields are "or-ed"
    together to make this value.

    A single bit component contained in SG_INFO_OK_MASK indicates whether
    some error or status field is non-zero. If any of 'masked_status',
    'host_status' or 'driver_status' is non-zero then SG_INFO_CHECK is
    set.
    The associated values are:

        SG_INFO_OK_MASK         0x1
        SG_INFO_OK              0x0 /* no sense, host nor driver "noise" */
        SG_INFO_CHECK           0x1 /* something abnormal happened */

    A multi bit component contained in SG_INFO_DIRECT_IO_MASK indicates
    what type of data transfer has just taken place. If indirect IO (or no
    data transfer) has taken place then SG_INFO_INDIRECT_IO is matched.
    Note that even if direct IO was requested in 'flags' the driver may
    choose to do indirect IO instead. If direct IO was requested and
    performed then SG_INFO_DIRECT_IO will be matched. If direct IO is
    requested together with scatter gather in the user space (i.e. when
    'iovec_count' > 0) then the initial version of this driver will only
    do direct IO on the first element of the scatter gather array and
    indirect IO on the remaining elements. If this occurs then
    SG_INFO_MIXED_IO will be matched. The associated values are:

        SG_INFO_DIRECT_IO_MASK  0x6
        SG_INFO_INDIRECT_IO     0x0 /* data xfer via kernel buffers (or no
                                       xfer) */
        SG_INFO_DIRECT_IO       0x2
        SG_INFO_MIXED_IO        0x4 /* part direct, part indirect IO */

The new write() and read() calls
--------------------------------

write(int sg_fd, const void * buffer, size_t count)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The action of write() with a control block based on struct sg_header is
discussed in the document scsi-generic.txt . This section describes the
action of write() when it is given a control block based on struct
sg_io_hdr.

The 'buffer' should point to an object of type Sg_io_hdr and 'count' should
be sizeof(Sg_io_hdr) [it can be larger but the excess is ignored]. If the
write() call succeeds the 'count' is returned as the result.

Up to SG_MAX_QUEUE (16) write()s can be queued up before any finished
requests are completed by read(). An attempt to queue more than that will
result in an EDOM error. The write() command should return more or less
immediately.
[There is a small probability it will spend some time waiting for a command
block to become available and if O_NONBLOCK is active this scenario will
cause an EAGAIN.]

read(int sg_fd, void * buffer, size_t count)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The action of read() with a control block based on struct sg_header is
discussed in the document scsi-generic.txt . This section describes the
action of read() when it is given a control block based on struct
sg_io_hdr.

The 'buffer' should point to an object of type Sg_io_hdr and 'count' should
be sizeof(Sg_io_hdr) [it can be larger but the excess is ignored]. If the
read() call succeeds the 'count' is returned as the result.

By default, read() will return the oldest completed request that is queued
up. A read() will not interfere with any request associated with the SG_IO
ioctl() on this file descriptor except in a special case when a SG_IO
ioctl() is interrupted by a signal.

If the SG_SET_FORCE_PACK_ID,1 ioctl() is active then read() will attempt to
fetch the packet whose pack_id (given earlier to write()) matches the
sg_io_hdr::pack_id given to this read(). If not available it will either
wait or yield EAGAIN. As a special case, -1 in sg_io_hdr::pack_id given to
read() will match the oldest completed request.

New ioctl()s
------------
SG_IO 0x2285
    The idea is deceptively simple: just hand a sg_io_hdr object to an
    ioctl() and it will return when the SCSI command is finished. It is
    logically equivalent to doing a write() followed by a blocking read().
    The word "blocking" here implies the read() will wait until the SCSI
    command is complete.

    One file descriptor can be used both for SG_IO synchronous calls and
    the write() read() sequences at the same time. [This implies several
    threads which may just as easily use separate file descriptors.] The
    sg driver makes sure that the response to a SG_IO call will never
    accidentally be fetched by a read().
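
As an illustration of handing a sg_io_hdr object to the SG_IO ioctl(), the
following sketch issues a standard INQUIRY. It is a hedged example rather
than a reference implementation: prep_inquiry(), do_inquiry(), the buffer
sizes and the 20 second timeout are all hypothetical choices, and it is
assumed that <scsi/sg.h> provides struct sg_io_hdr and the SG_* constants
used in this document.

```c
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

#define INQ_CMD_LEN   6     /* INQUIRY is a 6 byte command */
#define INQ_REPLY_LEN 96    /* arbitrary reply size chosen for this sketch */

/* Hypothetical helper: fill in the input fields for an INQUIRY request. */
static void prep_inquiry(struct sg_io_hdr *hdr, unsigned char *cdb,
                         unsigned char *reply, unsigned char *sense,
                         unsigned char sense_len)
{
    memset(cdb, 0, INQ_CMD_LEN);
    cdb[0] = 0x12;                      /* INQUIRY opcode */
    cdb[4] = INQ_REPLY_LEN;             /* allocation length */

    memset(hdr, 0, sizeof(*hdr));
    hdr->interface_id = 'S';            /* anything else yields ENOSYS */
    hdr->cmd_len = INQ_CMD_LEN;
    hdr->cmdp = cdb;
    hdr->dxfer_direction = SG_DXFER_FROM_DEV;
    hdr->dxfer_len = INQ_REPLY_LEN;
    hdr->dxferp = reply;                /* iovec_count stays 0 */
    hdr->mx_sb_len = sense_len;
    hdr->sbp = sense;
    hdr->timeout = 20000;               /* milliseconds */
}

/*
 * Hypothetical wrapper. 'reply' must hold at least INQ_REPLY_LEN bytes.
 * Returns -1 if the ioctl() itself failed (errno is set), 0 if the
 * request completed cleanly and 1 if SG_INFO_CHECK was set in 'info'.
 */
static int do_inquiry(int sg_fd, unsigned char *reply)
{
    unsigned char cdb[INQ_CMD_LEN];
    unsigned char sense[32];
    struct sg_io_hdr hdr;

    prep_inquiry(&hdr, cdb, reply, sense, sizeof(sense));
    if (ioctl(sg_fd, SG_IO, &hdr) < 0)
        return -1;
    return ((hdr.info & SG_INFO_OK_MASK) == SG_INFO_OK) ? 0 : 1;
}
```

A caller would open() an sg device, pass the file descriptor to
do_inquiry() and, on a return of 1, inspect the status fields and sense
buffer in a version of the wrapper that exposes them.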
    It is possible for the wait for the command completion to be
    interrupted by a signal (e.g. SIGIO) and the SG_IO call yielding an
    EINTR error. This is reasonably complex to handle and is discussed in
    more detail later.

    The following SCSI commands will be permitted by SG_IO when the user
    only has read permissions on the device:
        TEST UNIT READY, INQUIRY, READ CAPACITY, READ(6), READ(10),
        READ(12)
    This is experimental and may change. All commands to device type
    scanner are accepted. Other cases yield an EACCES error. Note that the
    write() read() interface must have read-write permissions on the
    device as write permission is required by Linux to execute a write()
    call.

SG_GET_REQUEST_TABLE 0x2286
    This ioctl outputs an array of information about the status of
    requests associated with the current file descriptor. Its 3rd argument
    should point to memory large enough to receive SG_MAX_QUEUE objects of
    the Sg_req_info structure. This structure has the following members:

    req_state
        0 -> request not in use
        1 -> request has been sent, but is not finished (i.e. it is
             between functions 1 and 2 in the "theory of operation")
        2 -> request is ready to be read() (i.e. it is between functions 2
             and 3 in the "theory of operation")
    orphan
        0 -> normal request
        1 -> request sent by SG_IO ioctl() which has been interrupted by a
             signal
    sg_io_owned
        0 -> request sent by a write()
        1 -> request sent by a SG_IO ioctl()
    problem
        0 -> no problem (or 1 == req_state)
        1 -> req_state is 2 and either masked_status, host_status or
             driver_status is non-zero
    duration
        [if 1 == req_state] time since request was sent (in millisecs)
        [if 2 == req_state] duration of request (in millisecs). Clock is
        stopped when function 2 in "theory of operation" is reached
    pack_id
    usr_ptr
        these are user provided values in the Sg_io_hdr (or Sg_header)
        that sent the request

SG_SET_KEEP_ORPHAN 0x2287
SG_GET_KEEP_ORPHAN 0x2288
    These ioctl()s allow the setting and reading of the "keep_orphan"
    flag.
    This controls what happens to the request associated with a SG_IO
    ioctl() that is interrupted (i.e. errno is EINTR). The default action
    is to drop the response as soon as it is received. This corresponds to
    the "keep_orphan" flag being 0. When the "keep_orphan" flag is 1 the
    response is transformed in such a way that it can be fetched by a
    read(). This is the only circumstance in which a request sent by a
    SG_IO ioctl() can be fetched by a read().

Other changes to system calls
-----------------------------
The ability of the SG_IO ioctl() to issue certain SCSI commands has led to
some relaxation on file descriptors open()ed "read-only". The open() call
will now attempt to allocate a reserve buffer for all newly opened file
descriptors. The SG_SET_RESERVED_SIZE ioctl() will now work on "read-only"
file descriptors.

Errors that can occur in the write(), read() and ioctl() calls
--------------------------------------------------------------
With the original interface almost any string could be accidentally given
to write() and potentially something nasty could happen. If some error was
detected then more than likely EIO was placed in errno. Unfortunately this
can still happen with write() since it can accept both the original
Sg_header and the newer Sg_io_hdr described in this note. However since the
SG_IO ioctl() will only accept the Sg_io_hdr structure there is less chance
of a random string being interpreted as a command.

Since the Sg_io_hdr interface does a lot more error checking, it attempts
to give out more precise errno values to help the user pinpoint the
problem. [Admittedly some of these errno values are picked in an arbitrary
way from the large set of available values.]

Below is a table of errno values indicating which calls to sg will give
them and the meaning of the error. A write() call is indicated by "w", a
read() call by "r" and an open() call by "o".

errno     which_calls   Meaning
-----     -----------   ----------------------------------------------
EACCES    o,w,r,SG_IO   User does not have permissions to do this. They
                        will need read,write permissions currently. This
                        has been relaxed for certain SCSI commands via
                        SG_IO.
EAGAIN    r             The file descriptor is non-blocking and the
                        request has not been completed yet.
EAGAIN    w,SG_IO       Internal error. SCSI sub-system has (temporarily)
                        run out of command blocks.
EBADF     w             File descriptor was not open()ed O_RDWR.
EBUSY     o             Someone else has an O_EXCL lock on this device.
EBUSY                   Attempt to change something (e.g. buffer size)
                        when the resource was in use.
EDOM      w,SG_IO       Too many requests queued against this file
                        descriptor. Limit is SG_MAX_QUEUE active requests.
EFAULT    w,r,SG_IO     Pointer to user space invalid.
EFAULT                  Pointer to user space invalid.
EINVAL    w,r           Size given as 3rd argument not large enough for
                        the Sg_io_hdr structure.
EIO       w             Size given as 3rd argument less than size of old
                        header structure (Sg_header). Additionally a
                        write() with the old header will yield this error
                        for most malformed requests.
EINTR     o             While waiting for the O_EXCL lock to clear this
                        call was interrupted by a signal.
EINTR     r,SG_IO       While waiting for the request to finish this call
                        was interrupted by a signal.
EINTR     w             [Very unlikely] While waiting for an internal SCSI
                        resource this call was interrupted by a signal.
EMSGSIZE  w,SG_IO       SCSI command size ('cmd_len') was too small (i.e.
                        < 6) or too large (i.e. currently > 12).
ENODEV    o             Tried to open() a file with no associated device.
                        [Perhaps sg has not been built into the kernel or
                        is not available as a module?]
ENOENT    o             Given filename not found.
ENOMEM    o             [Very unlikely] Kernel was not even able to find
                        enough memory for this file descriptor's context.
ENOMEM    w,SG_IO       Kernel unable to find memory for internal buffers.
                        This is usually associated with indirect IO.
ENOSYS    w,SG_IO       First char of what should have been a Sg_io_hdr
                        object was _not_ 'S'.
ENXIO     o             "remove-single-device" may have removed this
                        device.
ENXIO     o,w,r,SG_IO   Internal error (including SCSI sub-system busy
                        doing error processing).

Direct IO
---------
This is currently experimental. Direct IO support is designed in such a way
that if it is requested and cannot be performed then the request will still
be performed using indirect IO. If direct IO is requested and has been
performed then the SG_INFO_DIRECT_IO bit will be set in the 'info' member
of the Sg_io_hdr control structure after the request has been completed.
[Although not currently functional, if the direct IO request is partially
performed (e.g. first element of sg_iovec used direct IO while second used
indirect IO) then SG_INFO_MIXED_IO is set.]

Direct IO uses facilities added to the 2.3 series of kernels and hence will
only be available in the 2.3/2.4 series of kernels. You can request direct
IO with the 2.2 series version of this driver but indirect IO will be
performed. Further, direct IO is not supported on ISA SCSI adapters since
they can only address a 24 bit address space.

One limit on direct IO is that Sg_io_hdr::iovec_count==0. So the user
cannot (currently) use scatter gather and direct IO.

For direct IO to be worthwhile, a reasonable amount of data should be
requested for data transfer. For transfers less than 4 KByte it is probably
not worth the trouble (more testing required by me here). On the other hand
"locking down" a megabyte block of data for direct IO could adversely
impact overall system performance. Remember that for the duration of a
direct IO request, the data transfer buffer is mapped to real memory and
locked in such a way that it won't be swapped out.

The memory given to sg as the data buffer for direct IO would usually come
from the heap or be an external or static array. Questions remain as to
whether it could be on the stack (i.e. auto) or in shared memory.
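
Since the driver may silently fall back to indirect IO, a program that
cares about the difference has to inspect 'info' after the request
completes. The sketch below shows one way this could look; the helper
names request_direct_io() and io_type_used() are hypothetical and it is
assumed that <scsi/sg.h> defines the SG_FLAG_* and SG_INFO_* constants
listed in this document.

```c
#include <scsi/sg.h>

/* Hypothetical helper: ask for direct IO before the request is sent. */
static void request_direct_io(struct sg_io_hdr *hdr)
{
    hdr->flags |= SG_FLAG_DIRECT_IO;
    hdr->iovec_count = 0;   /* user scatter gather rules out direct IO */
}

/*
 * Hypothetical helper: report which kind of transfer actually took
 * place, based on the SG_INFO_DIRECT_IO_MASK bits of a completed
 * request.
 */
static const char *io_type_used(const struct sg_io_hdr *hdr)
{
    switch (hdr->info & SG_INFO_DIRECT_IO_MASK) {
    case SG_INFO_DIRECT_IO:
        return "direct";
    case SG_INFO_MIXED_IO:
        return "mixed";
    default:                /* SG_INFO_INDIRECT_IO, or no transfer */
        return "indirect";
    }
}
```

The other bits of 'info' (such as SG_INFO_CHECK) are masked off first, so
the result is unaffected by whether the command itself succeeded.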

Experience has shown that a large block of memory malloc-ed by a user does
not look contiguous at all seen from the DMA hardware's point of view. This
means that direct IO relies on the scatter gather capabilities of the DMA
hardware on the SCSI adapter. [This is a _different_ scatter gather
mechanism to that which the user sees in the interface based on iovec.]
This puts an effective limit on the size of a direct IO transfer whose size
in bytes can be approximately calculated by:
    (max_scsi_adapter_scatter_gather_elements - 1) * PAGE_SIZE
The "-1" component allows for alignment considerations.

Currently the driver source code contains a commented out define called
"SG_ALLOW_DIO"; while it remains commented out, direct IO is disabled. To
test direct IO in 2.3 series kernels this line needs to be uncommented and
sg built into the kernel (i.e. not as a module) or alternatively the
required kernel functions that aren't already exported need to be exported
in kernel/ksyms.c .

Additions to the "proc" file system
-----------------------------------
The provision of SCSI mid level and sg information via the "proc" file
system is in its infancy and may change as more experience is gained. The
following files, readable by all, are defined in the sub directory
"/proc/scsi/sg" :

    debug              debug info from sg down to active request level
    def_reserved_size  default buffer size reserved for each file
                       descriptor
    devices            one line of numeric data per device
    device_hdr         single line of column names corresponding to
                       'devices'
    device_strs        one line of vendor, product and rev info per device
    hosts              one line of numeric data per host
    host_hdr           single line of column names corresponding to
                       'hosts'
    host_strs          one line of host information (string) per host
    version            sg version as a number followed by a string
                       representation

Each line in 'devices' and 'device_strs' corresponds to an sg device. For
example the first line corresponds to /dev/sg0 (or /dev/sga).
This mapping is local to sg and normally the same as given by
'cat /proc/scsi/scsi' reported by the SCSI mid level code. The two mappings
diverge when 'remove-single-device' and 'add-single-device' are used.

Each line in 'hosts' and 'host_strs' corresponds to a SCSI host. For
example the first line corresponds to the host normally represented as
"scsi0". This mapping is invariant across the SCSI sub system. [So these
entries could arguably be migrated to the mid level.]

The column headers in 'device_hdr' are given below. If the device is not
present (and one is present after it) then a line of "-1" entries is
output. Each entry is separated by a whitespace (currently a tab):

    host    host number (indexes 'hosts' table, origin 0)
    chan    channel number of device (is this ever non-zero?)
    id      SCSI id of device
    lun     Logical Unit number of device
    type    SCSI type (e.g. 0->disk, 5->cdrom, 6->scanner)
    discon  0 -> disconnect not supported, 1 -> it is supported
    depth   maximum queue depth supported by device
    tq      0 -> tagged queuing disabled, 1 -> tagged queuing enabled

The column headers in 'host_hdr' are given below. Each entry is separated
by a whitespace (currently a tab):

    uid     unique id (non-zero if multiple hosts of same type)
    busy    0 -> adapter driver free, 1 -> adapter driver busy
    cpl     maximum number of commands per lun (may be 0 if "device depth"
            is given)
    sgat    maximum elements of scatter gather the adapter (pseudo) DMA
            can accommodate
    isa     0 -> non-ISA adapter, 1 -> ISA adapter. ISA adapters are
            assumed to have a 24 bit address bus limit (16 MB).
    emu     0 -> real SCSI adapter, 1 -> emulated SCSI adapter (e.g.
            ide-scsi device driver)

The 'def_reserved_size' is both readable and writeable. It is only
writeable by root. It is initialized to the value of DEF_RESERVED_SIZE in
the "sg.h" file.
Values between 0 and 1048576 (which is 2 ** 20) are accepted and can be set
from the command line with the following syntax:

    # echo "262144" > /proc/scsi/sg/def_reserved_size

Note that the actual reserved buffer associated with a file descriptor
could be less than 'def_reserved_size' if appropriate memory is not
available.

If the sg driver is compiled into the kernel (but not when it is a module)
this value can also be read at /proc/sys/kernel/sg-big-buff . This latter
feature is deprecated.

Douglas Gilbert
dgilbert@interlog.com
dougg@triode.net.au
SG web site: http://www.torque.net/sg