The Linux SG driver version 4.0




1 Introduction

2 SCSI Generic versions 1, 2, 3 and 4 interfaces

3 Changes to sg driver between version 3.5.36 and 4.0

4 Architecture of the sg driver

5 Synchronous usage

6 Sharing file descriptors

7 Async usage in v4

7.1 ioctl(SG_IOABORT)

8 Request sharing

8.1 Slave waiting

9 Sharing design considerations

10 Multiple requests

10.1 Processing mrq responses

10.2 Aborting multiple requests

10.3 Single/multiple (non-)blocking requests

11 pack_id or tag

12 Bi-directional command support

13 SG interface support changes

14 IOCTLs

15 Downloads and testing

16 Sg driver and the block layer

17 Other documents

18 Conclusion


1 Introduction

The SCSI Generic (sg) driver in Linux has been present since version 1.0 of the kernel in 1992. In the 27 years since then the driver has had 3 interfaces to the user space and now a fourth is being added. The first and second interfaces (v1 and v2) use the same header: 'struct sg_header' with only v2 now fully supported. The "v3" interface is based on 'struct sg_io_hdr'. Both these structures are defined in include/scsi/sg.h the bulk of whose contents will move to include/uapi/scsi/sg.h as part of this upgrade. Prior to the changes now proposed, the "v4" interface is only implemented in the block layer's bsg driver ("block SCSI generic" driver which is around 15 years old) . The bsg driver's user interface is found in include/uapi/linux/bsg.h . These changes propose adding support for the "v4" interface via ioctl(SG_IO) for synchronous use, and new ioctl(SG_IOSUBMIT) and ioctl(SG_IORECEIVE) for asynchronous/non-blocking use. The plan is to deprecate and finally remove (or severely restrict) the write(2)/read(2) based asynchronous interface used currently by the v1, v2 and v3 interfaces. The v3 asynchronous interface is supported by the new SG_IOSUBMIT_V3 and SG_IORECEIVE_V3 ioctl(2)s .

If the driver changes are accepted, the driver version, which is visible via ioctl(SG_GET_VERSION_NUM), will be bumped from 3.5.36 (in lk 5.0) to 4.0.x . The opportunity is being taken to clean the driver after 20 years of piecemeal patches. Those patches have left the driver with misleading variable names and comments that don't match the adjacent code. Plus there are new kernel facilities that the driver can take advantage of. Also of note is that much of the low level code once in the sg driver (and remnants remain) has been moved to the block layer and SCSI mid-level. This upgrade has been done as a two stage process: first clean the driver up, remove some restrictions and re-instate some features that have been accidentally lost. The first stage also adds basic v4 interface support using ioctl(SG_IO) for sync/blocking usage, and ioctl(SG_IOSUBMIT) and ioctl(SG_IORECEIVE) for async/non-blocking usage. A first stage patchset (containing 18 patches) was sent to the linux-scsi list on 20190616 and is titled: "sg: add v4 interface". The driver version number is 4.0.03 . The second stage patchset adds the new features such as file and request sharing, multiple requests (in one invocation) support and the so-called extended ioctl(2). The second patchset is only currently available from this page (as patches 0019 to 0032 applied on top of the first stage). The second stage moves the driver version number to 4.0.31 .

Note that the Linux block layer implements the synchronous sg v3 interface via ioctl(SG_IO) on all block devices that use the SCSI subsystem, directly or via translation (e.g. SATA disks use libata which implements the SAT T10 standard). In pseudocode an example like this ' ioctl(open("/dev/sdc"), SG_IO, ptr_to_sg_io_hdr)' works as expected. This is not implemented by the sg driver so it is important that the sg driver's implementation of ioctl(SG_IO) remains consistent with other driver implementations (mainly the one found in block/scsi_ioctl.c kernel source code).

A tarball with patches and driver source files for recent Linux kernel versions can be found in the Downloads section.

2 SCSI Generic versions 1, 2, 3 and 4 interfaces

SCSI and other storage related command sets send a lot of data to storage devices and receive as much if not more data back from those devices. That data can be subdivided into metadata and user data. The SCSI metadata sent to the storage device is the command which is sometimes referred to as the cdb (command descriptor block). The metadata received back from the device is a SCSI status byte and optionally a sense buffer of 18 or more bytes. More generic terms for those transfers are the request and the associated response. User data sent to the storage device is termed as data-out and user data received from the device is called data-in. A few SCSI commands have both data-out and data-in transfers and are referred to as bi-directional (or bidi) while the majority of SCSI commands send user data either out or in, or transfer no user data. The SCSI command sets define where user data will be placed in, or fetched from, the storage device but leave the details of where (and how) that user data is placed in the initiator (i.e. at the computer or local end) to the transport. Examples of transports are: iSCSI, SAS, SATA, FCP, SRP (Infiniband) and USB (UASP).
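To make the response metadata a little more concrete, here is a minimal sketch of pulling the sense key, additional sense code (asc) and qualifier (ascq) out of a sense buffer. The byte offsets and the response codes 0x70 to 0x73 come from the SCSI Primary Commands (SPC) standard rather than from this page, and the structure and function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative summary of a sense buffer; not a kernel or library type. */
struct sense_summary {
    uint8_t sense_key;   /* e.g. 0x5 is ILLEGAL REQUEST */
    uint8_t asc;         /* additional sense code */
    uint8_t ascq;        /* additional sense code qualifier */
    bool deferred;       /* true for response codes 0x71 and 0x73 */
};

/* Returns true if sb holds sense data in a format this sketch understands. */
bool decode_sense(const uint8_t *sb, size_t len, struct sense_summary *out)
{
    uint8_t resp;

    assert(sb != NULL && out != NULL);
    if (len < 1)
        return false;
    resp = sb[0] & 0x7f;
    if (resp == 0x70 || resp == 0x71) {          /* fixed format */
        if (len < 14)
            return false;
        out->sense_key = sb[2] & 0x0f;
        out->asc = sb[12];
        out->ascq = sb[13];
    } else if (resp == 0x72 || resp == 0x73) {   /* descriptor format */
        if (len < 4)
            return false;
        out->sense_key = sb[1] & 0x0f;
        out->asc = sb[2];
        out->ascq = sb[3];
    } else {
        return false;
    }
    out->deferred = (resp == 0x71 || resp == 0x73);
    return true;
}
```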

Another aspect of a SCSI pass-through is whether to map the sending of a request and receiving of the associated response onto a single system call or divide it into two parts. The single system call approach is termed here as blocking or synchronous. The two part approach is termed as non-blocking or asynchronous. Both approaches typically have an associated timeout. It is assumed that any user data transfer associated with the command will have taken place before a successful response is sent by the storage device. As a general rule the blocking approach is simpler to program while the non-blocking approach is more flexible, allowing code to do other chores while waiting for SCSI commands to complete.

Traditionally character device drivers in Unix have had an open(2), close(2), read(2), write(2), ioctl(2) interface to the user space. As well as those system calls this driver supports mmap(2), poll(2) and fasync(). The fasync() driver call is related to the fcntl(2) system call in which the file descriptor flags are changed to add O_ASYNC (e.g. fcntl(fd, F_SETFL, flags | O_ASYNC) ) . When considering how to send SCSI commands and associated data to a pass-through driver such as sg, it soon becomes evident that a structure will be needed to hold all the components. This is the same approach used by other operating systems that offer a SCSI pass-through interface. And in the almost 30 years that Linux has been in existence, it has had three (and a half) such structures.
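The fcntl(2) sequence that ends up invoking a driver's fasync() call might look like the following sketch. The helper name enable_sigio() is hypothetical; F_SETOWN, F_GETFL, F_SETFL and O_ASYNC are the standard fcntl(2) names on Linux. A real program would also install a SIGIO handler first, since the default action for SIGIO is to terminate the process.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* exposes O_ASYNC under strict -std= modes */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Ask the kernel to send SIGIO to this process when fd has something
 * to report. Returns 0 on success, else a negated errno value.
 */
int enable_sigio(int fd)
{
    int flags;

    if (fcntl(fd, F_SETOWN, getpid()) < 0)  /* who receives the signal */
        return -errno;
    flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -errno;
    /* Changing O_ASYNC here is what triggers the driver's fasync() */
    if (fcntl(fd, F_SETFL, flags | O_ASYNC) < 0)
        return -errno;
    flags = fcntl(fd, F_GETFL);
    assert(flags < 0 || (flags & O_ASYNC));
    return 0;
}
```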

The sg driver was present in Linux kernel 1.0.0 released in 1992. It supported just two ioctl(2)s at the time: SG_SET_TIMEOUT and SG_GET_TIMEOUT plus some "pass-through" ioctl(2)s that started with "SCSI_IOCTL_" that were in common with other ULDs (e.g. sd and st) and implemented by the Linux SCSI mid-level. The only method of sending a SCSI command by this driver was with the async write(2) and read(2) system calls (that neglects counting the synchronous pass-through ioctl(2): SCSI_IOCTL_SEND_COMMAND implemented by the SCSI mid-level).

The version 1 SCSI pass-through interface only supported the asynchronous approach. This is its interface structure found in Linux kernel 1.0.0 (1992):

struct sg_header
 {
  int pack_len;    /* length of incoming packet <4096 (including header) */
  int reply_len;   /* maximum length <4096 of expected reply */
  int pack_id;     /* id number of packet */
  int result;      /* 0==ok, otherwise refer to errno codes */
  /* command follows then data for command */
 };

Only the pack_id field is found in all versions of the sg driver interface and its semantics remain the same. However there is an issue with the pack_id and the read(2) system call: pack_id is out-going data (to the driver in this case) while the rest of the data in that structure (with possibly the data-in from the storage device tacked onto the end) is in-coming data. This bidirectional data flow is abnormal for a read(2) which normally only expects in-coming data. It is of note that in the version 4 driver the new ioctl(2) to replace read(2) is ioctl(SG_IORECEIVE) and it is defined with the __IOWR() macro indicating both a write (from the user space) and a read (into the user space) data transfer.

The version 2 SCSI pass-through interface structure is really just a small extension of version 1:

struct sg_header {
        int pack_len;   /* [o] reply_len (ie useless), ignored as input */
        int reply_len;  /* [i] max length of expected reply (inc. sg_header) */
        int pack_id;    /* [io] id number of packet (use ints >= 0) */
        int result;     /* [o] 0==ok, else (+ve) Unix errno (best ignored) */
        unsigned int twelve_byte:1;
            /* [i] Force 12 byte command length for group 6 & 7 commands  */
        unsigned int target_status:5;   /* [o] scsi status from target */
        unsigned int host_status:8;     /* [o] host status (see "DID" codes) */
        unsigned int driver_status:8;   /* [o] driver status+suggestion */
        unsigned int other_flags:10;    /* unused */
        unsigned char sense_buffer[SG_MAX_SENSE];
};

There are various shortcomings of the version 2 (and hence version 1) interface structure: the command (cdb), data-in, and/or data-out were tacked onto the end of the interface structure. The command length was not given explicitly but derived from the cdb making it difficult to support vendor specific commands.
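Deriving the command length from the cdb can be sketched as follows. The table follows the conventional SCSI mapping from opcode group code (the top three bits of the first cdb byte) to command length. Treat it as an illustration of the technique rather than a copy of the driver's code; the function name is invented. It also shows why the twelve_byte flag exists: the vendor specific groups 6 and 7 have no defined length, so a guess has to be made.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Implied cdb length in bytes for the given opcode (cdb byte 0). */
unsigned int cdb_len_from_opcode(uint8_t opcode, bool twelve_byte)
{
    /* indexed by group code, i.e. bits 7..5 of the opcode */
    static const uint8_t len_by_group[] = {6, 10, 10, 12, 16, 12, 10, 10};
    unsigned int group = (opcode >> 5) & 0x7;

    static_assert(sizeof(len_by_group) == 8, "one entry per group code");
    if (twelve_byte && group >= 6)
        return 12;      /* caller overrides the guess for groups 6 and 7 */
    return len_by_group[group];
}
```

A vendor specific command whose real length is, say, 16 bytes cannot be expressed this way, which is the difficulty noted above.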

The version 3 SCSI pass-through interface structure was introduced around 2000 and was a departure from versions 1 and 2:

typedef struct sg_io_hdr {
        int interface_id;       /* [i] 'S' for SCSI generic (required) */
        int dxfer_direction;    /* [i] data transfer direction  */
        unsigned char cmd_len;  /* [i] SCSI command length */
        unsigned char mx_sb_len;/* [i] max length to write to sbp */
        unsigned short iovec_count;     /* [i] 0 implies no sgat list */
        unsigned int dxfer_len; /* [i] byte count of data transfer */
        /* dxferp points to data transfer memory or scatter gather list */
        void __user *dxferp;    /* [i], [*io] */
        unsigned char __user *cmdp;/* [i], [*i] points to command to perform */
        void __user *sbp;       /* [i], [*o] points to sense_buffer memory */
        unsigned int timeout;   /* [i] MAX_UINT->no timeout (unit: millisec) */
        unsigned int flags;     /* [i] 0 -> default, see SG_FLAG... */
        int pack_id;            /* [i->o] unused internally (normally) */
        void __user *usr_ptr;   /* [i->o] unused internally */
        unsigned char status;   /* [o] scsi status */
        unsigned char masked_status;/* [o] shifted, masked scsi status */
        unsigned char msg_status;/* [o] messaging level data (optional) */
        unsigned char sb_len_wr; /* [o] byte count actually written to sbp */
        unsigned short host_status; /* [o] errors from host adapter */
        unsigned short driver_status;/* [o] errors from software driver */
        int resid;              /* [o] dxfer_len - actual_transferred */
        /* unit may be nanoseconds after SG_SET_GET_EXTENDED ioctl use */
        unsigned int duration;  /* [o] time taken by cmd (unit: millisec) */
        unsigned int info;      /* [o] auxiliary information */
} sg_io_hdr_t;

Unused fields should be set to zero on input. It is recommended that the whole sg v3 structure is zeroed (e.g. with memset()) prior to a command request being built and submitted. Note that one of the constants: SG_DXFER_NONE, SG_DXFER_TO_DEV or SG_DXFER_FROM_DEV should be placed in the dxfer_direction field and they all have negative values (-1, -2 and -3 respectively). This is used to differentiate between the v1/v2 interface (which has reply_len in that position) and this (v3) interface.
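The advice above can be sketched as a small helper that zeroes everything and then fills in a v3 object for a standard INQUIRY command. The struct inquiry_req wrapper, the buffer sizes and the function name are assumptions made for this example; the field names and SG_DXFER_FROM_DEV come from <scsi/sg.h>.

```c
#include <assert.h>
#include <string.h>
#include <scsi/sg.h>

enum { INQ_CDB_LEN = 6, INQ_REPLY_LEN = 96, INQ_SENSE_LEN = 32 };

/* Hypothetical bundle keeping the header and its buffers together. */
struct inquiry_req {
    unsigned char cdb[INQ_CDB_LEN];
    unsigned char reply[INQ_REPLY_LEN];
    unsigned char sense[INQ_SENSE_LEN];
    struct sg_io_hdr hdr;
};

void build_inquiry_v3(struct inquiry_req *rq, unsigned int timeout_ms)
{
    assert(rq != NULL);
    memset(rq, 0, sizeof(*rq));         /* unused fields must be zero */
    rq->cdb[0] = 0x12;                  /* INQUIRY opcode */
    rq->cdb[4] = INQ_REPLY_LEN;         /* allocation length */

    rq->hdr.interface_id = 'S';
    rq->hdr.dxfer_direction = SG_DXFER_FROM_DEV;    /* data-in */
    rq->hdr.cmd_len = INQ_CDB_LEN;
    rq->hdr.cmdp = rq->cdb;
    rq->hdr.dxfer_len = INQ_REPLY_LEN;
    rq->hdr.dxferp = rq->reply;
    rq->hdr.mx_sb_len = INQ_SENSE_LEN;
    rq->hdr.sbp = rq->sense;
    rq->hdr.timeout = timeout_ms;
}
```

After the call, 'ioctl(fd, SG_IO, &rq.hdr)' would submit the command and wait for its response.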

The version 3 sg driver supported the version 1, 2 and 3 interface structures. It introduced the blocking ioctl(SG_IO) while keeping the write(2)/read(2) technique for asynchronous usage. The blocking ioctl(SG_IO) has also been implemented in the block layer for SCSI block devices (e.g. /dev/sdb) and in other drivers such as the SCSI tape driver (st). So at this time the version 3 interface structure together with ioctl(SG_IO) is the most used SCSI pass-through in Linux. Over time there has been a transfer of functionality from the write(2) and read(2) system calls to various ioctl(2)s. Using the write(2) and read(2) system calls in the way that this driver does is frowned upon by the Linux kernel architects. Even though adding new ioctl(2)s is also discouraged, two new ioctl(2)s were proposed in this post by a Linux architect (L. Torvalds). Those two ioctl(2)s plus two closely related ioctl(2)s have been implemented in this upgrade.
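For comparison with the blocking ioctl(SG_IO), the write(2)/read(2) technique for asynchronous usage of the v3 interface could be wrapped as in the sketch below. The function name and the choice to wait with poll(2) are illustrative; a real application would normally do useful work between the two parts instead of waiting immediately.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>
#include <scsi/sg.h>

/*
 * Submit the command described by *hdr on an sg file descriptor, then
 * wait up to wait_ms milliseconds for its response.
 * Returns 0 when a response was read back into *hdr, else negated errno.
 */
int sg_v3_async_roundtrip(int sg_fd, struct sg_io_hdr *hdr, int wait_ms)
{
    struct pollfd pfd = { .fd = sg_fd, .events = POLLIN };
    int n;

    assert(hdr != NULL);
    /* Part 1: submit; write(2) returns once the command is queued */
    if (write(sg_fd, hdr, sizeof(*hdr)) < 0)
        return -errno;

    /* ... other chores could be done here ... */

    /* Part 2: wait until a response is ready, then fetch it */
    n = poll(&pfd, 1, wait_ms);
    if (n < 0)
        return -errno;
    if (n == 0)
        return -ETIMEDOUT;
    if (read(sg_fd, hdr, sizeof(*hdr)) < 0)
        return -errno;
    return 0;
}
```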

Some weaknesses of the version 3 interface were that it had no provision for bidirectional commands and that it included pointers. Pointers in interface structures are problematic because they change size when moving from 32 bit to 64 bit architectures (and that was a big issue at the time). Also the version 3 interface was too SCSI command set specific and could not easily pass related protocols such as SCSI task management functions (TMFs) or the SAS Management Protocol (SMP). So around 2005 the version 4 SCSI pass-through interface structure was introduced:

struct sg_io_v4 {
        __s32 guard;            /* [i] 'Q' to differentiate from v3 */
        __u32 protocol;         /* [i] 0 -> SCSI , .... */
        __u32 subprotocol;      /* [i] 0 -> SCSI command, 1 -> SCSI task
                                   management function, .... */

        __u32 request_len;      /* [i] in bytes */
        __u64 request;          /* [i], [*i] {SCSI: cdb} */
        __u64 request_tag;      /* [i] {SCSI: task tag (only if flagged)} */
        __u32 request_attr;     /* [i] {SCSI: task attribute} */
        __u32 request_priority; /* [i] {SCSI: task priority} */
        __u32 request_extra;    /* [i] {spare, for padding} */
        __u32 max_response_len; /* [i] in bytes */
        __u64 response;         /* [i], [*o] {SCSI: (auto)sense data} */

        /* "dout_": data out (to device); "din_": data in (from device) */
        __u32 dout_iovec_count; /* [i] 0 -> "flat" dout transfer else
                                   dout_xfer points to array of iovec */
        __u32 dout_xfer_len;    /* [i] bytes to be transferred to device */
        __u32 din_iovec_count;  /* [i] 0 -> "flat" din transfer */
        __u32 din_xfer_len;     /* [i] bytes to be transferred from device */
        __u64 dout_xferp;       /* [i], [*i] */
        __u64 din_xferp;        /* [i], [*o] */

        __u32 timeout;          /* [i] units: millisecond */
        __u32 flags;            /* [i] bit mask */
        __u64 usr_ptr;          /* [i->o] unused internally */
        __u32 spare_in;         /* [i] */

        __u32 driver_status;    /* [o] 0 -> ok */
        __u32 transport_status; /* [o] 0 -> ok */
        __u32 device_status;    /* [o] {SCSI: command completion status} */
        __u32 retry_delay;      /* [o] {SCSI: status auxiliary information} */
        __u32 info;             /* [o] additional information */
        __u32 duration;         /* [o] time to complete, in milliseconds (or nanoseconds) */
        __u32 response_len;     /* [o] bytes of response actually written */
        __s32 din_resid;        /* [o] din_xfer_len - actual_din_xfer_len */
        __s32 dout_resid;       /* [o] dout_xfer_len - actual_dout_xfer_len */
        __u64 generated_tag;    /* [o] {SCSI: transport generated task tag} */
        __u32 spare_out;        /* [o] multiple requests: secondary error */

        __u32 padding;
};

The __s32 and __u64 types could be replaced by the (more) standard int32_t and uint64_t C types. The pointers are still there but are placed in fixed length (64 bit) unsigned integers. All the other integer sizes are fixed so that the structure is the same size on 32 and 64 bit architectures. Between around 2005 and this upgrade, the version 4 interface structure was only used by the bsg driver which explains why its interface structure is found in the <linux/bsg.h> header file. The version 2 and version 3 interface structures are found in the <scsi/sg.h> header file.
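A minimal sketch of filling in a v4 object for a command that reads data from the device follows. It shows how pointers are carried in the 64 bit fields by casting through uintptr_t. The helper name and parameter list are invented for this example; BSG_PROTOCOL_SCSI and BSG_SUB_PROTOCOL_SCSI_CMD (both zero) are taken from <linux/bsg.h>.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/bsg.h>  /* struct sg_io_v4 */

/* Fill *h for a SCSI command that reads din_len bytes from the device. */
void build_v4_data_in(struct sg_io_v4 *h,
                      const unsigned char *cdb, uint32_t cdb_len,
                      void *din, uint32_t din_len,
                      unsigned char *sense, uint32_t sense_len,
                      uint32_t timeout_ms)
{
    assert(h != NULL && cdb != NULL);
    memset(h, 0, sizeof(*h));           /* unused fields must be zero */
    h->guard = 'Q';
    h->protocol = BSG_PROTOCOL_SCSI;
    h->subprotocol = BSG_SUB_PROTOCOL_SCSI_CMD;
    /* pointers travel as 64 bit integers whatever the pointer size */
    h->request = (uint64_t)(uintptr_t)cdb;
    h->request_len = cdb_len;
    h->din_xferp = (uint64_t)(uintptr_t)din;
    h->din_xfer_len = din_len;
    h->response = (uint64_t)(uintptr_t)sense;
    h->max_response_len = sense_len;
    h->timeout = timeout_ms;
}
```

Adding up the field sizes in the listing above gives 160 bytes (26 fields of 4 bytes and 7 of 8 bytes), and every 8 byte field falls on an 8 byte boundary, so no compiler padding should be needed on either 32 or 64 bit architectures.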

v4 interface field

corresponding field in v3 interface

Notes

guard [io]

[interface_id]

Both fields are the first in their respective structures and are assumed to be 32 bits each. The guard for v4 is an ASCII 'Q' stored as a 32 bit integer (declared as __s32 in the structure above). The interface_id is an ASCII 'S' stored as a 32 bit integer (declared as int). Any difference between signed and unsigned is not important in this case.

protocol [io]


A value of '0' (a 32 bit integer) is used for all SCSI protocols

subprotocol [io]


A value of '0' for SCSI command sets based on SPC. The value '1' is reserved for SCSI Task Management Functions [TMFs] which are not implemented at this time.

request_len [i]

cmd_len

Number of bytes in SCSI command. Since cmd_len is an unsigned char (i.e. an 8 bit byte) the largest number it can represent is 255 in the v3 interface.

request [i, *o]

cmdp

Like all pointers in the v4 interface, request is a pointer value placed in a 64 bit unsigned integer. This is done to make the size of the v4 interface constant (as long as pointers (by C definition able to fit in unsigned long) fit in 64 bits). Conversely, cmdp is a pointer so its size will vary between 32 and 64 bit systems.

request_tag [i]


Used if ioctl(SG_SET_FORCE_PACK_ID) third argument points to non-zero integer and SG_CTL_FLAGM_TAG_FOR_PACK_ID is set via the extended ioctl(2) on this file descriptor. This value is acted upon by ioctl(SG_IORECEIVE) and ioctl(SG_IOABORT).

Note that generated_tag is only written when ioctl(SG_IOSUBMIT) completes. So the user space code needs to copy the contents of generated_tag to this field to match by that tag value in a call to ioctl(SG_IORECEIVE).

request_attr [i]


not currently used

request_priority [i]


not currently used

request_extra [i,{o}]

pack_id

A packet identifier of -1 is taken as a wildcard (i.e. match any). Twos complement is assumed for the 32 bit unsigned request_extra so -1 becomes 0xffffFFFF . Also used by ioctl(SG_IOABORT) for identification.

max_response_len [i]

mx_sb_len

No more than this number of sense bytes will be written out starting at where response points.

response [i, {*i},o]

sbp

A pointer to the sense buffer. Only used when the SCSI device yields sense data for the associated command. In the non-blocking case, the pointer value given to ioctl(SG_IOSUBMIT) is used and any value given to ioctl(SG_IORECEIVE) is ignored and when that ioctl(2) returns this field will contain the original value in it. The note given for request applies here also.

dout_iovec_count [i]

[iovec_count]

If this field is zero then dout_xferp (or dxferp) points to user data to be written from the host to the storage device. If this field is non-zero, then it is the number of elements in the scatter gather list pointed to by dout_xferp (or dxferp).

dout_xfer_len [io]

[dxfer_len]

This field is the number of bytes pointed to by dout_xferp (or dxferp). The data is (or will be) moved from the host to the SCSI device (e.g. a SCSI WRITE command)

din_iovec_count [i]

[iovec_count]

If this field is zero then din_xferp (or dxferp) points to the user space buffer that will receive data read from the storage device. If this field is non-zero, then it is the number of elements in the scatter gather list pointed to by din_xferp (or dxferp).

din_xfer_len [io]

[dxfer_len]

This field is the number of bytes pointed to by din_xferp (or dxferp). The data is (or will be) moved from the SCSI device to the host (e.g. a SCSI READ command).

dout_xferp [i, *o]

[dxferp]

If the dout_iovec_count field is zero then this field points to the first byte to be transferred from the user space memory to the storage device. All the other bytes (indicated by dout_xfer_len) should follow the first byte with no gaps. If the dout_iovec_count field is non-zero then this field points to a scatter gather list which the driver will use to output data from the user space to the storage device. The note given for request applies here also.

din_xferp [i, *i]

[dxferp]

If the din_iovec_count field is zero then this field points to the first byte to be transferred from the storage device to the user space memory. All the other bytes (indicated by din_xfer_len) should follow the first byte with no gaps. If the din_iovec_count field is non-zero then this field points to a scatter gather list which the driver will use to read data from the storage device to the user space. The note given for request applies here also.

timeout [i]

timeout

This is the number of milliseconds the SCSI mid level will wait for a command to finish before it attempts to abort that command. If zero is given, a driver default of SG_DEFAULT_TIMEOUT (60,000 or 60 seconds) is chosen. Several SCSI commands (e.g. FORMAT UNIT with the IMMED bit cleared on a 10 Terabyte disk (hard disk or SSD)) take a lot longer than that. User manuals for disks often indicate how long such commands will take.

flags [io]

flags

This is a 32 bit integer in which the lower numbered bit positions are boolean flags. The available settings are listed in the <include/uapi/scsi/sg.h> header file. They start with SG_FLAG_ or SGV4_FLAG_ .

usr_ptr [io]

usr_ptr

The driver does not use this value. Whatever pointer value that is placed in usr_ptr will be sent back to the user space after the command has completed. This may be useful in async (non-blocking) code when the submission and completion are separated (e.g. in different threads). Whenever multiple submissions are outstanding, the order of completion is up to the storage device. The note given for request applies here also.

spare_in [i]


not currently used

driver_status [o]

driver_status

This value is output by the driver. Zero indicates no errors. These are not so much sg driver errors as errors from the SCSI mid-level. The possible values are listed in the <include/scsi/scsi.h> header and they start with DRIVER_ . If driver_status is non-zero then SG_INFO_CHECK is set in the info field.

transport_status [o]

host_status

This value is output by the driver. Zero indicates no errors. These are not so much sg driver errors as errors from a SCSI Low Level Driver (LLD) typically controlling a Host Bus Adapter (HBA). The possible values are listed in the <include/scsi/scsi.h> header and they start with DID_ . If transport_status is non-zero then SG_INFO_CHECK is set in the info field.

device_status [o]

status

This value is output by the driver. Zero indicates no errors. This is the 8 bit SCSI Status returned in response to all SCSI commands (unless they time out). The possible values are listed in the <include/scsi/scsi_proto.h> header and they start with SAM_STAT_ . If device_status is non-zero then SG_INFO_CHECK is set in the info field.

retry_delay [o]


not currently used. Zero is output by the driver in this field.

info [o]

info

This value is output by the driver. This value contains boolean flags OR-ed together. The possible flags are listed in the <include/uapi/scsi/sg.h> and they start with SG_INFO_ .

duration [o]

duration

This value is output by the driver. It is the time between when a command is issued to the block layer until the internal completion occurs. By default the unit is milliseconds, however if SG_CTL_FLAGM_TIME_IN_NS is set in the extended ioctl(2) on this file descriptor then the unit is nanoseconds.

response_len [o]

sb_len_wr

This value is output by the driver. This is the length of the sense buffer (i.e. the response) that is returned from the storage device. This usually indicates something has gone wrong with the command. A value of 0 indicates there is no sense buffer and the storage device has most likely successfully completed the command. Due to caches in storage devices WRITEs may initially report success and later report a "deferred error". If response_len is greater than zero then SG_INFO_CHECK is set in the info field.

din_resid [o]

resid

This value is output by the driver. This is din_xfer_len less the number of bytes actually transferred in from the storage device.

dout_resid [o]


This value is output by the driver. This is dout_xfer_len less the number of bytes actually transferred out to the storage device.

generated_tag [o ***]


This value is output by the driver. Zero will be placed in this field unless the SGV4_FLAG_YIELD_TAG flag is one of the flags set in the flags field in a call to ioctl(SG_IOSUBMIT). In this case, the block layer's tag value is placed there.

spare_out [o]


Only used with multiple requests, otherwise zero is placed in this field. With multiple requests (both in the control object and in one response array element) the secondary error is placed in this field. A secondary error is an errno value (so 0 is good). It can be viewed as the errno from a failed ioctl(2) (e.g. ioctl(SG_IOSUBMIT)) if the request in question was submitted by itself (i.e. not in a mrq). Secondary errors are typically caused by syntax errors in an input sg_io_v4 object.



Unused fields should be set to zero on input. It is recommended that the whole sg_io_v4 structure is zeroed (e.g. with memset() ) prior to a command request being built and submitted. In the first column of the above table, the "i" and "o" indications within the square brackets are in some cases expansions on what is shown in the sg_io_v4 structure definition comments above the table. Those with "i" should be set (or left as zero) before a call to ioctl(SG_IO) or ioctl(SG_IOSUBMIT). Those with "o" will in some cases be set by this driver and can be checked after a call to ioctl(SG_IO) or ioctl(SG_IORECEIVE). The "*o" indicates a pointer being used as the source starting address to copy data from the user space to the driver and often on to a storage device. The "*i" indicates a pointer being used as the destination starting address to copy data from a storage device into the user space. This level of detail becomes more important when a request is split between an ioctl(SG_IOSUBMIT) and an ioctl(SG_IORECEIVE). Some input values (e.g. din_xfer_len) are copied to the output as a convenience (e.g. to help in this calculation: (din_xfer_len - din_resid) which is the number of bytes actually read). The "[o ***]" indication notes the special case of generated_tag whose value is output after ioctl(SG_IOSUBMIT); all other output values (and generated_tag itself) are output after ioctl(SG_IORECEIVE) has completed.

The square brackets in the second column of the above table imply the v3 interface field is similar to, but not exactly the same as, the v4 interface field.

Note that multiple requests (in one invocation) use an instance of the same sg_io_v4 structure as its control object. Most fields have a different, but related, meaning when they are in a control object. A control object is distinguished by having the SGV4_FLAG_MULTIPLE_REQS flag set. Multiple requests are described in a later section.

3 Changes to sg driver between version 3.5.36 and 4.0

A summary is given as bullet points:

There are still some things to do:

4 Architecture of the sg driver

Nothing much has changed in the overall architecture of the sg driver between version 3 (v3) and version 4 (v4). Having a pictorial summary of the driver's object tree may help later explanations:




The sg driver is shown as a laptop at the top of the object tree. The arrow end of solid lines shows objects that are created automatically or by actions outside the user interface to the sg driver. So the disk-like objects created at the second level come from the device scanning logic undertaken by the SCSI mid-level driver in Linux. Note that there are SCSI devices other than disks such as tape units and SCSI enclosures. Also note that not all storage devices in Linux use the SCSI subsystem; examples are NVMe SSDs and SD cards that are not attached via USB. The type of SCSI device objects is sg_device (and in the driver code they appear as objects of C type 'struct sg_device'). Even though the sg driver's implementation is closely associated with the block subsystem, the sg driver's device nodes are character devices in Linux (e.g. /dev/sg1). The nodes are also known as character special devices.

At the third level are file descriptors which the user creates via the open(2) system call (e.g. 'sg_fd = open("/dev/sg1", O_RDWR);') . Various other system calls such as close(2), write(2), read(2), ioctl(2) and mmap(2) can use that file descriptor made by open(2). The file descriptor will stay in existence until the process containing the code that opened it exits or the user closes it (e.g. 'close(sg_fd);'). A dotted line is shown from the "owning" device to each file descriptor in order to indicate that it was created by direct user action via the sg interface. The type of file descriptor objects is sg_fd. BTW most system calls have "man pages" and the form open(2) indicates that there is a manpage in section 2 which is for system calls. Other common manpage sections are "1" for commands and utilities (e.g. 'man 1 cp' explaining the copy command); "3" for system libraries (e.g. 'man 3 snprintf') and "8" for system administration commands.

At the lowest level are the sg_request objects each of which carries a user provided SCSI command to the target device which is its grandparent in the object tree. These requests are then sent via the block and SCSI mid-level to a Low Level Driver (LLD) and then across the transport (with iSCSI that can be a long way) to the target device (e.g. an SSD). User data that moves in the same direction as the request is termed as "data-out" and the SCSI WRITE command is an example. In nearly all cases (an exception is a command timeout) a response traverses the same route as the request, but in the reverse direction. Optionally it may be accompanied by user data which is termed as "data-in" and the SCSI READ command is an example. Notice that a heavy (thicker) line is associated with the first request of each file descriptor; it points to a reserve request (in earlier sg documentation this was referred to as the reserve buffer). That reserve request is built after each file descriptor is created and before the user has a chance to send a SCSI command/request on that file descriptor. This reserve request was originally created to make sure CD writing programs didn't run out of kernel memory in the middle of a "burn". That is no longer a major concern but the reserve buffer has found other uses: for mmap-ed and direct IO. So when the mmap(2) system call is used on a sg device, it is the associated file descriptor's reserve request's buffer that is being mapped into the user space.

The lifetime of sg_request objects is worth noting. When a sg_request object is active ("inflight" is the term used in the driver) it has both an associated block request and a SCSI mid-level object. They have similar roles and overlap somewhat. However once the response is received (i.e. the internal completion point in the next diagram) the block request and the SCSI mid-level objects are freed up. The sg_request object lives on, along with the data carrying part of the block request called the bio, as that may be carrying "data-in" that has yet to be delivered to the user space. That is because the default user data handling (termed as "indirect IO") is a two stage process. Data-in is first DMA-ed from the target device into kernel memory, typically under the control of the LLD; the second stage is copying from that kernel memory to user space, under the control of this driver. Even after the user has fetched the response and any data-in, the sg_request continues to live. [However once any data-in has been fetched the block request bio is freed.] The sg_request object is then marked "inactive" and placed on a sg_request object free list, one of which is maintained for each file descriptor. So each sg file descriptor contains two request lists: one for any command that is active and the other one is a free list for inactive requests (there is an exception). The next time a user tries to send a SCSI command through that file descriptor, its free list will be checked to see if any inactive sg_request object has a data buffer large enough for the new request; if so that object will be (re-)used for the new request. Only when the user calls close(2) on that file descriptor will all the requests on the free list be truly freed.
Note that in Unix, and thus Linux, the OS guarantees that the release operation behind close(2) (called release() in the kernel and sg_release() in this driver) will be invoked for every file descriptor that the user has opened in a process, irrespective of what the code in that process does. This is important because processes can be shut down by signals from other processes or drivers, segmentation violations (i.e. bad code) or the kernel's OOM (out-of-memory) killer.
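The free list behaviour described above can be pictured with a small user-space model. This is only a sketch: the names model_req, model_fd, model_reuse and model_release are invented for illustration and are not taken from the sg driver source, which has additional states and locking.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one request; not the driver's sg_request. */
struct model_req {
    size_t buf_len;            /* size of the data buffer this request owns */
    int active;                /* 1 while "inflight", 0 when on the free list */
    struct model_req *next;
};

/* Hypothetical model of one file descriptor's free list. */
struct model_fd {
    struct model_req *free_list;   /* inactive requests kept for re-use */
};

/* Look for an inactive request whose buffer can hold 'needed' bytes.
 * On a match, unlink it from the free list, mark it active and return it.
 * NULL means the caller would have to build a new request. */
struct model_req *model_reuse(struct model_fd *fd, size_t needed)
{
    struct model_req **pp;

    assert(fd != NULL);
    for (pp = &fd->free_list; *pp != NULL; pp = &(*pp)->next) {
        struct model_req *rq = *pp;

        if (rq->buf_len >= needed) {
            *pp = rq->next;        /* unlink from the free list */
            rq->next = NULL;
            rq->active = 1;
            return rq;
        }
    }
    return NULL;
}

/* After the user has fetched the response: the request is not freed,
 * it is marked inactive and pushed back onto the free list. */
void model_release(struct model_fd *fd, struct model_req *rq)
{
    assert(fd != NULL && rq != NULL);
    rq->active = 0;
    rq->next = fd->free_list;
    fd->free_list = rq;
}
```

In this model nothing is ever freed by model_release(); freeing would only happen in the equivalent of the last close(2), which matches the lifetime described in the text.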

The above description is setting the stage for a newly added feature called "sharing" introduced in the sg v4 driver. It also uses the reserve request.

5 Synchronous usage

These two forms: ioctl(sg_fd, SG_IO, ptr_to_v3_obj) and ioctl(sg_fd, SG_IO, ptr_to_v4_obj) can be used for submitting SCSI commands (requests) and waiting for the response before returning to the calling thread. This action is termed as synchronous or blocking in this driver. In Linux most block devices that use or can translate the SCSI command set also support the first form (i.e. the ioctl(2) that takes a pointer to a v3 interface object as its third argument). So this pseudo code will work: ioctl(open("/dev/sdc"), SG_IO, ptr_to_v3_obj) but not if the third argument is a ptr_to_v4_obj. Some storage related character devices (e.g. /dev/st2 and /dev/ses3) will also accept the first form.
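As a minimal sketch of the first form, the helpers below fill a v3 interface object (struct sg_io_hdr from <scsi/sg.h>) for a standard INQUIRY and pass it to ioctl(SG_IO). The helper names prep_inquiry_v3 and do_sync_v3 and the 20 second timeout are choices made for this example only.

```c
#include <assert.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

/* Fill a v3 interface object for a standard SCSI INQUIRY (a data-in command).
 * The caller owns cdb, resp and sense; they must outlive the ioctl(2). */
void prep_inquiry_v3(struct sg_io_hdr *hp, unsigned char cdb[6],
                     unsigned char *resp, unsigned char resp_len,
                     unsigned char *sense, unsigned char sense_len)
{
    assert(hp != NULL && cdb != NULL && resp != NULL && sense != NULL);

    memset(cdb, 0, 6);
    cdb[0] = 0x12;                  /* INQUIRY opcode */
    cdb[4] = resp_len;              /* allocation length, low byte */

    memset(hp, 0, sizeof(*hp));
    hp->interface_id = 'S';         /* identifies a v3 interface object */
    hp->dxfer_direction = SG_DXFER_FROM_DEV;
    hp->cmd_len = 6;
    hp->cmdp = cdb;
    hp->dxfer_len = resp_len;
    hp->dxferp = resp;
    hp->mx_sb_len = sense_len;
    hp->sbp = sense;
    hp->timeout = 20000;            /* milliseconds; arbitrary for this sketch */
}

/* Submit and wait in the calling thread; returns what ioctl(2) returns. */
int do_sync_v3(int fd, struct sg_io_hdr *hp)
{
    return ioctl(fd, SG_IO, hp);
}
```

The same two calls should work on a file descriptor opened on a block device such as /dev/sdc, since the text notes that most SCSI capable block devices accept the v3 form.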

Only two drivers currently support the second form (i.e. whose third argument is a ptr_to_v4_obj): this driver and the bsg driver.

It is important to understand that the use of ioctl(SG_IO) is only synchronous seen from the perspective of the calling thread/task/process. It is only the calling thread that waits for completion of the request. Any other thread or process submitting requests to the same or other devices associated with the sg driver will not be impeded by that wait. This assumes that the underlying devices can queue SCSI commands which most current SCSI devices are capable of doing. As an example: a large copy between two storage devices can be broken down into multiple copy segments, with each copy segment copying a comfortable amount of data (say 1 MByte); then multiple threads can each take a copy segment from a pool and fulfil them by doing a READ then a WRITE SCSI command. Each READ/WRITE pair of commands seems synchronous but overall the threads are doing asynchronous READs and WRITEs with respect to one another.

Apart from some special cases (one shown below), it isn't generally useful to mix synchronous and asynchronous commands/requests on the same thread. An asynchronous command/request (i.e. non-blocking) could be submitted followed by a second synchronous command which will go through to completion before it returns; then the first command's completion can be fetched. Care is taken within the driver so that an asynchronous completion, even if it is pending will not be incorrectly supplied as the result of a synchronous command.

The simplest way to issue SCSI commands to any device is with a synchronous ioctl(SG_IO). Asynchronous commands have some advantages (mainly performance) but that comes at the expense of more complexity for the user application. When a program is juggling multiple asynchronous submissions and completions it needs to track either pack_id, tag or a user pointer to correctly match completions with submissions. Since the sg driver maintains strong per file descriptor context, one way to simplify the matching problem is to have one file descriptor per submission/completion. However then multiple file descriptors need to be juggled, which is not so onerous.




In the diagram above a synchronous (i.e. blocking) ioctl(SG_IO) is shown. As a general rule the ioctl(2) will return -1 with a positive errno value if there is a problem creating the object of type sg_request in the top left of the diagram. Examples of this are syntax errors or contradictory information in the v3 or v4 interface object. Another cause could be running out of resources. Once the sg_request object is "inflight" any errors will be reported via the v3 or v4 interface object. As noted in the diagram the user thread is placed in an interruptible wait state, awaiting command/request completion. If the command takes some time the user may use a keyboard interrupt (e.g. control-C), or "kill" the containing process from another terminal (e.g. with kill(1)). This will cause the shown sg_request object to become an orphan. The default action is to remove orphan sg_request objects as soon as practical. However if the file descriptor has the "keep orphan" flag set (see ioctl(SG_SET_KEEP_ORPHAN) below) a further read(2) or ioctl(SG_IORECEIVE) will fetch the response information from the orphan which will then be placed on the free list.
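The two levels of error reporting described above (errno from the ioctl(2) itself versus status fields inside the interface object) can be separated with a small helper. This sketch covers the v3 object only and uses the SG_INFO_OK_MASK and SG_INFO_OK definitions from <scsi/sg.h>; the enum and function names are made up for this example.

```c
#include <errno.h>
#include <scsi/sg.h>

/* Hypothetical result categories for a completed ioctl(SG_IO). */
enum io_outcome {
    IO_OUTCOME_OK,           /* command was sent and completed cleanly */
    IO_OUTCOME_SUBMIT_ERR,   /* ioctl(2) returned -1, see errno */
    IO_OUTCOME_CMD_ERR       /* command was sent, error is in the object */
};

/* Classify the result of ioctl(fd, SG_IO, hp) given its return value. */
enum io_outcome classify_v3(int ioctl_ret, const struct sg_io_hdr *hp)
{
    if (ioctl_ret < 0)
        return IO_OUTCOME_SUBMIT_ERR;   /* request never became inflight */
    if ((hp->info & SG_INFO_OK_MASK) != SG_INFO_OK)
        return IO_OUTCOME_CMD_ERR;      /* driver flagged something to check */
    if (hp->status != 0 || hp->host_status != 0 || hp->driver_status != 0)
        return IO_OUTCOME_CMD_ERR;      /* belt and braces: inspect raw fields */
    return IO_OUTCOME_OK;
}
```

When IO_OUTCOME_CMD_ERR is returned, the application would normally go on to decode the SCSI status and any sense data written to the buffer that sbp points to.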

The main context that a user space application controls in this driver is the file descriptor, shown as a sg_fd object in the earlier object tree diagram. Roughly speaking a file descriptor object is created when sg_fd=open(<sg_device_name>) succeeds and is destroyed by a close(sg_fd). Again, roughly speaking, a file descriptor is confined to a user process. In multi-threaded programs it is often a good idea to have separate sg file descriptors in each thread. Some exceptions to these generalizations are discussed in the next section.

Another feature of the file descriptor object in the sg driver is that each one has a reserve request created at the same time as the file descriptor. This reserve request is immediately placed on the new sg file descriptor's free list. Any new command/request on that file descriptor will use that reserve request if :

When a command request is completed, its sg_request object is placed (or replaced) on the free list. So no sg_request objects are actually deleted (i.e. the memory they use being freed up) until the owning file descriptor is close(2)-d. In the case where there are copies of the file descriptor (e.g. a forked process or due to dup(2)) then it is the last close(2) that frees up all associated sg_request objects.

6 Sharing file descriptors

First a rationale. Copying data between storage devices is a relatively common operation. It can be both time and resource consuming. The best approach is to avoid copying altogether. Another approach is to defer copies (or part of them) until they are really necessary, which is the basis of COW (i.e. copy on write). Then there are offloaded copies: for example, where the source and destination are disks in the same array, a "third party copy" program (e.g. based on SCSI EXTENDED COPY and its related commands) can tell the array to do the copy itself and inform you whether it finishes successfully or not. However in many cases copies are unavoidable.

If the dd(1) program is considered, copying one part of a normal block storage device to another storage device involves a surprising number of copies. Copies of large amounts of data are typically done in a staggered fashion to lessen the impact on other things the system may be doing. So typically 1 MegaByte (say) is read from the source device into a buffer, followed by a write of that buffer to the destination device; if no error occurs, repeat until finished. Copies between a target device and kernel memory are typically done by DMA (direct memory access) controlled by the LLDs owning the storage devices. So another copy is needed on each side of the copy to get the data in and out of kernel buffers to the user space. Moving data between a user space process and the kernel space has a little extra overhead to deal with the situations like the process being killed while data is being copied to and from it. So a reasonable implementation of dd(1) has three buffers (2 in the kernel space) and performs 2 DMAs then 2 copies between the user space and the kernel space. As storage devices and transports get quicker, the time taken to do those copies may become significant compared to the device access time.
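The staggered copy loop described above can be sketched with plain read(2) and write(2). This is not how dd(1) or sgh_dd is implemented; staged_copy is a hypothetical helper that only shows the single user space buffer sitting between the two kernel side transfers.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Copy from in_fd to out_fd until end of file, one segment at a time.
 * Each iteration costs one kernel-to-user copy (read) and one
 * user-to-kernel copy (write), on top of any DMA done by the LLDs.
 * Returns the number of bytes copied, or -1 on error (see errno). */
long long staged_copy(int in_fd, int out_fd, size_t seg_bytes)
{
    long long total = 0;
    unsigned char *buf;

    assert(seg_bytes > 0);
    buf = malloc(seg_bytes);
    if (buf == NULL)
        return -1;

    for (;;) {
        ssize_t off = 0;
        ssize_t n = read(in_fd, buf, seg_bytes);

        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before any data, retry */
            total = -1;
            break;
        }
        if (n == 0)
            break;                  /* end of input */
        while (off < n) {           /* write(2) may be partial */
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));

            if (w < 0) {
                if (errno == EINTR)
                    continue;
                total = -1;
                goto out;
            }
            off += w;
        }
        total += n;
    }
out:
    free(buf);
    return total;
}
```

With a 1 MByte seg_bytes this corresponds to the "read 1 MegaByte, write it, repeat" cycle in the text; the sharing feature introduced below aims to remove the two user space copies from that cycle.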

Another aspect of the sharing being proposed is security. Often a user has the right to copy data but not see it. This is usually accomplished by encrypting the data. Another approach might be to make sure the copy's data is kept in kernel buffers and thus hidden from the user who is copying it. While the v4 sg driver can do this, the sg driver is not written with a view to security, since it offers a pass-through interface which, by definition, is a method to circumvent an Operating System. Those building highly secure computer systems might consider removing the sg driver or restricting its access to highly privileged users.

Sharing is a new technique added to the sg v4 driver to speed copy operations. The user first sets up a sharing relationship between two sg file descriptors, one that will be used for doing SCSI READ commands (more generally any data-in SCSI command), and the other that will be used for doing SCSI WRITE commands using the data received by the previous READ. Any data-out command can be used so, for example, the SCSI WRITE command could be replaced by WRITE AND VERIFY or WRITE SCATTERED. The file descriptor that does the READ is called the master side by the driver and the file descriptor that does the WRITE is called the slave side. The following diagram shows how one share between two file descriptors is set up.




Here the master side is /dev/sg1 and has 4 open file descriptors (fd_s 1 through 4). The slave side is /dev/sg2 and has 3 open file descriptors (fd_s 5 through 7). The share shown is set up when the thread or process containing fd5 calls the "EXTENDED" ioctl on the fd5 file descriptor (i.e. the ioctl's first parameter) with a pointer to an integer containing fd1 as the ioctl's third parameter. The C code is a little more complicated than that.

How does the thread or process containing fd5 know about fd1? That is up to the design of the user space application. If they are both in the same thread then it should be obvious. If they are in different threads within the same process then it should be relatively simple to find out. The interesting case is when they are in different processes. A child process inherits all open file descriptors (including those belonging to the sg driver) from its parent in the Linux fork() system call. For processes that don't have a parent child relationship, UNIX domain sockets can be used to "send" an open file descriptor from one process to another. Note that in this case the file descriptor number might differ (e.g. because the receiver side already is using the same file descriptor number as the sender's number) but they will still logically refer to the same thing. Also that statement above about process termination leading to sg_release() being called for any sg file descriptors open(2)-ed in that process needs qualification: in this case the last process to hold an open file descriptor being terminated causes the driver's sg_release() to be called. In short the last close(2) on a file descriptor causes sg_release() to be called.
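For the unrelated process case, the standard way to "send" an open file descriptor over a UNIX domain socket is an SCM_RIGHTS control message, as documented in unix(7) and cmsg(3). The sketch below is generic (nothing in it is specific to the sg driver) and the helper names send_fd and recv_fd are chosen for this example.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send fd_to_send over the connected UNIX domain socket 'sock'.
 * One dummy data byte is sent because the control message needs a carrier.
 * Returns 0 on success, -1 on failure. */
int send_fd(int sock, int fd_to_send)
{
    char dummy = 'F';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {                         /* union forces correct alignment */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return (sendmsg(sock, &msg, 0) == 1) ? 0 : -1;
}

/* Receive a file descriptor sent by send_fd(). The returned number may
 * differ from the sender's but refers to the same open file.
 * Returns the new descriptor, or -1 on failure. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

A process holding the master's sg file descriptor could call send_fd() and the process that owns the slave side could pass the value returned by recv_fd() when it sets up the share.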

The sg driver's file descriptors can only be part of one share (pair). Given this restriction, in the above diagram, fd5 cannot also be in a share with fd4. fd6 may be in a share with fd7; that would imply that the share could be used for a copy from /dev/sg2 to /dev/sg2 . The master side of the share monopolizes that file descriptor's reserve request hence there can only be one outstanding share request per pair of shared file descriptors. Given this restriction one way to do a copy using queued commands is to use POSIX threads. As an example from the above diagram, if 3 copy worker threads were used then the first thread could utilize fd1 and fd5, the second thread could utilize fd3 and fd6 while the last thread could utilize fd4 and fd7. This is what the sgh_dd test utility does (see below).

After a share of two file descriptors is established command requests can still be sent to both file descriptors in the normal fashion. Only when the new flag SGV4_FLAG_SHARE is given, or OR-ed in with other flags, is request sharing performed. See the Request sharing section below.

7 Async usage in v4

The terms asynchronous and non-blocking are generally used as synonyms in this description. Those terms are related to the Unix file descriptor flags O_ASYNC and O_NONBLOCK which have more precise meanings and are set in either the open(2) or fcntl(2) system calls. In Unix the O_NONBLOCK flag on a regular file descriptor causes read(2) to return promptly with an EAGAIN errno if there is no data available to be read. This driver's ioctl(SG_IORECEIVE) and read(2) will react in the same fashion. However this driver's ioctl(SG_IO) ignores the O_NONBLOCK flag. The O_ASYNC file descriptor flag causes signals to be sent to the process owning the file descriptor whenever something 'interesting' happens (e.g. data arriving) to that file descriptor. When the term asynchronous is used in this description it is more likely referring to non-blocking behaviour rather than enabling signals.

The asynchronous interface in the context of the sg driver means issuing a SCSI command in one operation then at some later time a second operation retrieves the status of that SCSI command. Any data being transferred associated with the SCSI command is guaranteed to have occurred before that second operation succeeds. The synchronous interface can be viewed as combining these two operations into a single system call (e.g. ioctl(SG_IO) ).

The asynchronous interface starts with a call to ioctl(SG_IOSUBMIT) which takes a pointer to the sg v4 interface object. This object includes the SCSI command with data transfer information for either data-in (from device) or data-out (to device). Depending on the storage device accessed (identified by the sg file descriptor given as the first argument to the ioctl() system call) the SCSI command will take milliseconds or microseconds to complete. Chances are the ioctl(SG_IOSUBMIT) will complete in a sub-microsecond timescale (on a modern processor) and that will be done before the SCSI command completes. If further processing depends on the result of that SCSI command then the program must wait until that SCSI command is complete. When that completion occurs, the data-out is guaranteed to be on the nominated storage device (or in its cache). And if a data-in transfer was specified, that data is guaranteed to be in the user space as directed. How does the program find out when that SCSI command has completed?

The exact timing of the data-out and data-in transfers can be thought of as a negotiation between the HBA (Host Bus Adapter controlled by the LLD) and the storage device. The essential point is that the data transfer and the completion are asynchronous to the program that requested the SCSI command. Since the completion is guaranteed to follow any associated data transfer then the completion event is what we will concentrate on. Detecting asynchronous events depends on Operating System features such as signals and polling. Polling is the simpler technique. However the simplest approach is to call the final step in the process which is ioctl(SG_IORECEIVE) as soon as possible. In the likely case that the SCSI command completion has not occurred, then the ioctl(2) can do one of two things: it can wait until the completion does occur or yield an "error" called EAGAIN. Similar to SCSI sense data, a UNIX errno doesn't always imply a hard error. So EAGAIN is not a hard error, but it tells the program that the operation didn't occur but may happen later, so try again, but preferably don't retry immediately. What determines whether the ioctl() waits or returns EAGAIN is the presence of the O_NONBLOCK flag on the file descriptor.
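The "try, get EAGAIN, wait, try again" pattern can be sketched with calls that exist on any Linux system. The example uses read(2), which the text says reacts to O_NONBLOCK in the same fashion as ioctl(SG_IORECEIVE); substituting the ioctl would need the header from the patched driver, so that step is left as an assumption. The helper names set_nonblocking and fetch_when_ready are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Add O_NONBLOCK to an already open file descriptor. Returns 0 or -1. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) ? -1 : 0;
}

/* Try to fetch a completion straight away. If the descriptor reports
 * EAGAIN, sleep in poll(2) until POLLIN (or an error condition) is
 * signalled, then retry rather than spinning.
 * Returns the byte count from read(2), or -1 with errno set; a timeout
 * is reported as -1 with errno set to ETIMEDOUT. */
ssize_t fetch_when_ready(int fd, void *buf, size_t len, int timeout_ms)
{
    for (;;) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        ssize_t n = read(fd, buf, len);
        int r;

        if (n >= 0)
            return n;
        if (errno == EINTR)
            continue;
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;              /* a hard error */

        r = poll(&pfd, 1, timeout_ms);
        if (r == 0) {
            errno = ETIMEDOUT;
            return -1;
        }
        if (r < 0 && errno != EINTR)
            return -1;
        /* ready, or interrupted (the timeout restarts): go round again */
    }
}
```

Calling the fetch first and only polling on EAGAIN follows the suggestion in the text that the simplest approach is to attempt the final step as soon as possible.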

Two file descriptor flags are important to the asynchronous interface of the sg driver: O_NONBLOCK and O_ASYNC. The file descriptor flags are defined in such a way that they can be OR-ed together. The normal place to define flags is in the open(2) system call (its second argument) but they can be changed (and added to) later with the fcntl(2) system call. If O_NONBLOCK is wanted then it will typically be given in the open(2). The O_ASYNC flag is a bit more difficult to handle because it arms the SIGIO (also known as SIGPOLL) signal; if that signal arrives before the program has set up a handler for it, the program will exit. Actually Linux ignores O_ASYNC in the open(2) call (see the BUGS section of 'man 2 open'), so fcntl(2) is the only way to set it. Below is a simplified example of adding the O_ASYNC flag to a file descriptor (sg_fd) that is already open:

flags = fcntl(sg_fd, F_GETFL, NULL);

fcntl(sg_fd, F_SETFL, flags | O_ASYNC);

It is possible to replace the classic Unix SIGIO signal with a POSIX real-time signal by making an additional call:

fcntl(sg_fd, F_SETSIG, SIGRTMIN + 1);

After that call the SIGRTMIN+1 real time signal will be used instead of SIGIO. Even though you could use hard numbers for the real-time signals the advice is to always use an offset from SIGRTMIN or SIGRTMAX (negative offset in the MAX case) because the library can (and does for its POSIX threads implementation) steal some of the lower real time signals and adjust the SIGRTMIN value that the application program sees. Real time signals have improved semantics compared to the classic Unix signals (e.g. multiple instances of the same real time signal can be queued against a process where Unix signals would meld into one signal event in a similar situation).
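A slightly fuller sketch of the fcntl(2) calls above is given here with error checks and with the handler installed before O_ASYNC is armed, so that an early signal cannot terminate the program. The F_SETOWN call is not mentioned in the text; it is included because fcntl(2) documents it as the way to nominate the process that receives the signal. The names arm_async_signal, async_signal_seen and on_rt_signal are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* F_SETSIG is a Linux extension */
#endif
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

#ifndef F_SETSIG
#define F_SETSIG 10             /* Linux value, fallback if not exposed */
#endif

static volatile sig_atomic_t rt_signal_seen;

/* Handler: only records that the signal arrived. */
static void on_rt_signal(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)info;
    (void)ctx;
    rt_signal_seen = 1;
}

/* Returns non-zero once the handler has run at least once. */
int async_signal_seen(void)
{
    return rt_signal_seen != 0;
}

/* Install a handler for rt_sig (e.g. SIGRTMIN + 1), then ask for that
 * signal to be raised on fd activity. Returns 0 on success, -1 on error. */
int arm_async_signal(int fd, int rt_sig)
{
    struct sigaction sa;
    int flags;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_rt_signal;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    if (sigaction(rt_sig, &sa, NULL) < 0)       /* handler first */
        return -1;
    if (fcntl(fd, F_SETOWN, getpid()) < 0)      /* who gets the signal */
        return -1;
    if (fcntl(fd, F_SETSIG, rt_sig) < 0)        /* which signal, not SIGIO */
        return -1;
    flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return (fcntl(fd, F_SETFL, flags | O_ASYNC) < 0) ? -1 : 0;
}
```

A typical call would be arm_async_signal(sg_fd, SIGRTMIN + 1) made once, shortly after the open(2).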

In the diagram below the lifetime of an active sg_request object is shown from when it is created or retrieved from the free list in the top left to when the SCSI command has completed and the user space has been informed on the bottom right. It assumes that either the O_NONBLOCK flag is set on the file descriptor (assumed to be the same in all the system call boxes shown with the blue band at the top), or ioctl(SG_IORECEIVE) has SGV4_FLAG_IMMED OR-ed into its flags. When the first ioctl(SG_IORECEIVE) is called the SCSI command has not completed so it gets rejected with EAGAIN. The first poll(2) system call indicates with POLLOUT that another SCSI command can be issued but there are no SCSI commands waiting for an ioctl(SG_IORECEIVE) on this file descriptor. Note that the poll(2) description refers to a file descriptor, not this particular sg_request object, but for simplicity we will assume there is only one outstanding SCSI command on this file descriptor. At some future time, preferably long before the command approaches its timeout (often 60 seconds or more), the storage device via its LLD informs the sg driver that a SCSI command belonging to this file descriptor has completed. If O_ASYNC has been set on this file descriptor then the sg driver will issue a SIGIO signal to the owning process. A poll(2) system call after the internal completion point yields (POLLIN | POLLOUT), in other words both POLLIN and POLLOUT. That tells us that the next ioctl(SG_IORECEIVE) will be successful as is indicated in the diagram.




While it is useful to think of and illustrate the above mentioned ioctl(2)s and poll(2)s as being in reference to a single sg_request object, they are all actually against the file descriptor that is the parent of that sg_request object. This distinction matters when multiple sg_request objects are outstanding. In the absence of any selection information (e.g. a pack_id or a tag) the ioctl(SG_IORECEIVE) will fetch the oldest sg_request object since the active (and completed) command list acts as a FIFO. Instead of poll(2) the user may call the ioctl(SG_GET_NUM_WAITING) which yields the number of sg_request objects belonging to a file descriptor that have completed internally but are yet to have ioctl(SG_IORECEIVE) [or read(2) for the async v3 interface] called on them.

7.1 ioctl(SG_IOABORT)

After starting an asynchronous request with ioctl(SG_IOSUBMIT) the user may decide to abort the SCSI command associated with that request. This can be a bit tricky in practice and may not succeed because internal processing is beyond the internal completion point shown in the above diagram. In that case the user must complete the normal processing (e.g. by using ioctl(SG_IORECEIVE)) and the call to ioctl(SG_IOABORT) will most likely yield ENODATA. When the request to be aborted is inflight blk_abort_request() is called in which case normal processing should still be done. The user code should expect a driver_status of DRIVER_HARD or DRIVER_SOFT or a transport status of DID_TIMEOUT if the abort "catches" the request it is after.

The request to abort needs to be identified, preferably uniquely. The default case (i.e. when the extended ioctl(2) has not been used on the current file descriptor to set SG_CTL_FLAGM_TAG_FOR_PACK_ID) is to use the pack_id. The user code provides the pack_id input in the request_extra field in the call to ioctl(SG_IOSUBMIT). To abort that request the user code needs to build an empty v4 interface object (with 'Q' in the first 32 bit integer) and place the pack_id in the request_extra field. A pointer to that object can then be given as the third argument of the ioctl(SG_IOABORT).

Alternatively a tag may be used to identify a request to be aborted. This is a little more involved. The SG_CTL_FLAGM_TAG_FOR_PACK_ID flag needs to be set in the extended ioctl(2) on the current file descriptor. Then the SGV4_FLAG_YIELD_TAG flag needs to be set in the flags field in the ioctl(SG_IOSUBMIT). On the completion of that ioctl(2), the tag can be read from the generated_tag field. Then that tag value needs to be placed in the request_tag field of the v4 object pointed to in the third argument of ioctl(SG_IOABORT).
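Building the "empty" v4 interface object for either abort variant might look like the sketch below. It relies only on struct sg_io_v4 from <linux/bsg.h>; the helper names are hypothetical, and the SG_IOABORT ioctl number itself comes from the patched sg header so the call is shown only in a comment.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/bsg.h>

/* Empty v4 object that identifies the request to abort by pack_id.
 * The pack_id must match the request_extra value given at submit time. */
void prep_abort_by_pack_id(struct sg_io_v4 *hp, uint32_t pack_id)
{
    assert(hp != NULL);
    memset(hp, 0, sizeof(*hp));
    hp->guard = 'Q';                /* marks a v4 interface object */
    hp->request_extra = pack_id;
}

/* Empty v4 object that identifies the request to abort by tag. The tag is
 * the generated_tag value read back after the submit; this only applies
 * when tags are being used instead of pack_ids on the file descriptor. */
void prep_abort_by_tag(struct sg_io_v4 *hp, uint64_t tag)
{
    assert(hp != NULL);
    memset(hp, 0, sizeof(*hp));
    hp->guard = 'Q';
    hp->request_tag = tag;
}

/* Assumed usage with the patched driver header in scope:
 *     struct sg_io_v4 h;
 *     prep_abort_by_pack_id(&h, my_pack_id);
 *     if (ioctl(sg_fd, SG_IOABORT, &h) < 0) { ... check errno ... }
 */
```

Whichever variant is used, the original request still needs its normal completion processing, as described at the start of this section.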

By default the scope of the search to find the request is restricted to the file descriptor given as the first argument of ioctl(SG_IOABORT). The SGV4_FLAG_DEV_SCOPE flag may be set in the flags field of the v4 interface object and in this case, if no match is found using the current file descriptor then the search continues on the other sg file descriptors belonging to that device (e.g. /dev/sg2) stopping with the first match found. The abort is then sent to that request. The user code should make no assumptions about the order those other file descriptors are searched (but "oldest first" would be a good guess). The importance of having unique (failing that, random) pack_id or tag values should be apparent. Setting them always to zero (for example) could lead to unpleasant surprises when the DEV_SCOPE flag is used. If all searches (or a single one) find no match then ioctl(SG_IOABORT) fails with errno set to ENODATA.

An asynchronous request started with ioctl(SG_IOSUBMIT_V3) can also be aborted, but only via its pack_id. In this case, even though the submit uses the v3 interface, the ioctl(SG_IOABORT) must use the v4 interface, with the pack_id placed in its request_extra field.

8 Request sharing

Request sharing refers to two requests, usually belonging to different storage devices (e.g. two disks), sharing the same in-kernel data buffer. Before request sharing can take place a share of two file descriptors belonging to those two storage devices needs to be set up. This is discussed in the previous Sharing file descriptors section.

The diagram below shows the synchronous sg driver interface using ioctl(SG_IO) which can take either the v3 or v4 interface. The synchronous interface can be seen as the combination of the various calls that make up the asynchronous interface discussed in the previous section. The time that the synchronous ioctl(SG_IO) takes is directly related to the access time of the underlying storage device. To stress that point the system call rectangles (with a blue band at the top) in the diagram below are shown as elongated rectangles with a beginning component to the left and a completion component to the right. The elongated system call boxes span the access time of the associated storage device.

A request share only takes place when a command request is issued and a SGV4_FLAG_SHARE flag is used (OR-ed with any other flags). This should be done first on the master side with a READ (like) command request. Other flags that might be combined with this are SG_FLAG_NO_DXFER or SG_FLAG_MMAP_IO flags (but not both). The SG_FLAG_NO_DXFER flag stops the copy from the in-kernel data buffer to user space. The SG_FLAG_MMAP_IO flag maps the in-kernel data buffer into the user space; that user space area is made available via a mmap(2) system call preceding the command request being sent. The diagram below shows the simpler case where the minimum number of flags are set. For brevity the leading SGV4_ is removed from the flag values in the following diagrams.





The slave may continue to send normal command requests but at some stage it should send the corresponding WRITE (like) command request with both the SGV4_FLAG_SHARE and SG_FLAG_NO_DXFER flags set. That will use the in-kernel data buffer from the preceding master share command request and send that data (i.e. data-out) to the slave's device. So a single, in-kernel data buffer is used for a master share request followed by a slave share request.

In the terminology of the block subsystem both the master and slave share requests have their own request object, each with their own bio object. However the sg driver provides the data storage for those bios and arranges for the slave share request to use the same data buffer as the preceding master request's bio. And this is the reason that the slave request must use the SG_FLAG_NO_DXFER flag, otherwise a transfer from the user space usually associated with a WRITE (like) command would overwrite the in-kernel data buffer.

Once the slave request has successfully completed another master share request may be issued. Sanity checks ensure that using the SGV4_FLAG_SHARE flag on a non-shared file descriptor will cause an error, as will trying to send a master share request before a prior master share request is complete (which means its matching slave request has finished). Once a pair of file descriptors are shared, the master side's reserve request will only be used for command requests that have the SGV4_FLAG_SHARE flag set.

If the master share request fails (i.e. gives back any non zero status, or fails or warns at some other level) then the master request on completion will go to state "rs_inactive" (i.e. not "rs_swap"). Even if the master request succeeds, it is also possible that the application wants to stop the copy (e.g. because the user wants to abort the copy or there is something wrong with the data copied to the user space near the location marked "***" in the above diagram). This call: ioctl(master_fd, EXTENDED{MASTER_FINI}) manipulates a boolean which can be used to finish a share request after the master request has completed. What is needed here is setting this boolean to 1 (true) which changes the "rs_swap" to "rs_inactive" state. The inverse operation: setting that boolean to 0 (false) changes "rs_inactive" to "rs_swap" state which is used in the single read, multi write case below.

The brown arrow-ed lines in the above diagram show the movement of the "dataset" which is usually an integral number of logical blocks (e.g. each containing 512 or 4096 bytes). The brown arrow-ed lines that are vertical and horizontal do not involve copying (or DMA-ing) of that dataset. That leaves three brown arrow-ed lines at an angle: the DMA from the device being read, the DMA to the device being written, and an optional in-kernel to user space copy (annotated with "***").

A practical single READ, multiple WRITE solution needs the ability to have multiple slaves each associated with a different disk. Looking at the diagram above, two things need to happen to the master: it needs to adopt a new slave and it needs to get back into "rs_swap" state. A variant of the above mentioned ioctl(slave_fd, EXTENDED{SHARE_FD},) called ioctl(master_fd, EXTENDED{CHG_SHARE_FD},) has been added. As long as the new slave file descriptor meets requirements (e.g. it is not part of a file descriptor share already) then it will replace the existing slave file descriptor. To get back into "rs_swap" state, writing the value 0 (false) to the MASTER_FINI boolean in the EXTENDED ioctl will do what is needed. The EXTENDED ioctl is a little tricky to use (because it essentially replaces many ioctls) but a side benefit is that multiple actions can be taken by a single EXTENDED ioctl call. So both the actions required to switch to another slave, ready to do another WRITE, can be done with a single invocation of the EXTENDED ioctl.

Here is a sequence of user space system calls to READ from /dev/sg1 (the master) and WRITE that same data to /dev/sg5, /dev/sg6 and /dev/sg7 (the slaves). Assume that fd1 is a file descriptor associated with /dev/sg1, fd5 with /dev/sg5, etc. In pseudocode that might be:

ioctl(fd5, EXTENDED{SHARE_FD}, fd1);

ioctl(fd1, SG_IO, FLAG_SHARE + READ);

ioctl(fd5, SG_IO, FLAG_SHARE|NO_DXFER + WRITE);

ioctl(fd1, EXTENDED{CHG_SHARE_FD=fd6 + MASTER_FINI=false});

ioctl(fd6, SG_IO, FLAG_SHARE|NO_DXFER + WRITE);

ioctl(fd1, EXTENDED{CHG_SHARE_FD=fd7 + MASTER_FINI=false});

ioctl(fd7, SG_IO, FLAG_SHARE|NO_DXFER + WRITE);

So four ioctls to move data (one READ and three WRITEs) and three "housekeeping" ioctls. Notice that the WRITEs are done sequentially; they could theoretically be done in parallel but that would add complexity. Also note that a second READ cannot be done until the final WRITE from the previous sequence has completed; there is no easy way around that since only one in-kernel buffer is being used (and a second READ would overwrite it). To make this sequence slightly faster (and hide the data from the user space) the flag in the second ioctl (the READ) can be expanded to FLAG_SHARE|NO_DXFER .

The sgh_dd utility in the sg3_utils testing directory (rev 803 or later) has been expanded to test the single READ, multiple WRITE feature. It has two extra "of" (output file) parameters: "of2=" and "ofreg=". The "of2=" is for a second WRITE sg device and the "ofreg=" takes a regular file or a pipe and will use the data that comes from the READ operation marked with "***" in the above diagram. If "ofreg=" is present among sgh_dd's operands then the READ's flag will be FLAG_SHARE, if "ofreg=" is not present its flags will be FLAG_SHARE|NO_DXFER . The latter should be slightly faster, and that difference can be reduced with "iflag=mmap". The "of2=" operand shares "oflag=" and "seek=" with "of=".

8.1 Slave waiting

A simple analysis of an asynchronous copy segment cycle based on the previous section starts with a READ command being sent to the master's file descriptor [one user-to-kernel space context swap], followed by a signal or sequence of polls [one or more context swaps] followed by a read(2) or ioctl(SG_IORECEIVE) to get the result [so another context swap]. Assuming the response is good, then the same sequence is repeated, this time on the slave's file descriptor doing a WRITE. So that is at least six context swaps and importantly they must occur in that order. This is what the diagram in the previous section shows, but with synchronous rather than asynchronous calls.

An enhancement has been added to relax the strict ordering outlined in the previous paragraph. The slave's WRITE command can be sent to the driver in advance of its paired master READ command completing. Again the diagram below shows a copy segment: a READ from one disk followed by a WRITE of the data fetched to a second disk.




The important feature of this diagram is that the slave WRITE is started before the prior master READ has completed. Three synchronization points are shown: S1, S2 and S3. The S1 point is when the slave becomes aware that the master request (the READ) has been issued, but not necessarily completed. The slave request can be issued at any time following S1. If the slave request is in another thread or process, then the application needs a way of signalling to the slave thread/process that it can now issue the slave WRITE. The S2 synchronization point is purely internal (i.e. there is no code needed by the application). S2 is when the driver gets notification that the READ has finished. Assuming the READ was successful, and that S1 is before S2 then the slave WRITE, which has been held internally, can now be issued to the device [/dev/sg2]. Notice that the slave request is in rs_swait state between S1 and S2, indicating that it is being held. The S3 synchronization point is when the slave WRITE has finished and the master transitions from rs_slave to rs_inactive state. After S3 the next copy segment can be started.

Why show the master as an asynchronous request and the slave as a synchronous request? As a practical matter, the application needs to know when the master READ request has been issued so it can then issue the slave WRITE request. The simplest way to do that is to make the master READ asynchronous (a timer is another technique, but it may be too quick (e.g. occurring before S1) or too slow, wasting time). As for the slave WRITE request we are not interested in it until it has completed, hopefully successfully, hence the use of a synchronous request.

So this "slave waiting" approach decouples the strict ordering outlined in the first paragraph of this section into two loosely coupled sequences, the first for the master, the second for the slave. The only addition to application complexity is making the master request asynchronous. Notice that all completions (e.g. the ioctl(master_fd, SG_IORECEIVE,)) must still be processed and checks made for errors.

What about errors? Code would be simpler without error processing, but it would be a lot less interesting. The simpler case is the slave WRITE request failing, in which case the error is conveyed in the WRITE's completion in the normal manner. Then the application can decide whether to repeat the WRITE, or to WRITE somewhere else, or abort the copy. The more interesting case is when the master READ request fails, as the notification of that may occur after the application has issued the slave WRITE request. In that case, a decision is made at the S2 synchronization point not to issue the WRITE request to /dev/sg2. Instead the ioctl(slave_fd, SG_IO,) completes just after S2 with a return value of 0 (so there is no error value in errno) but with sg_io_hdr::driver_status or sg_io_v4::driver_status set to DRIVER_SOFT. And whenever ::driver_status, ::device_status or ::transport_status are non-zero then the SG_INFO_CHECK flag is OR-ed into the info field. So that field is always worth checking on completion. The actual error is given in the master's completion in the normal fashion.

Can the master's call to ioctl(master_fd, SG_IORECEIVE) be after S3? Yes it can. That allows a single thread to do the following pseudo-code sequence:

ioctl(master_fd, SG_IOSUBMIT, <ptr_to_READ_pt_object>);

ioctl(slave_fd, SG_IO, <ptr_to_WRITE_pt_object>);

ioctl(master_fd, SG_IORECEIVE, <ptr_to_READ_pt_object>);

There are only three context swaps in that sequence, with only ioctl(slave_fd, SG_IO,) taking the time required to actually do the READ followed by the WRITE. In real code, those three calls should have their return values checked plus, at the very least, a check that info does not have the SG_INFO_CHECK flag OR-ed into it.
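The completion check on the slave's ioctl(SG_IO) described above can be sketched as follows. This is a minimal illustration, not driver code: classify_slave() and the enum are hypothetical names, and the two EX_ constants are local copies of what SG_INFO_CHECK and DRIVER_SOFT are believed to be in the kernel headers; real code should use the header definitions.

```c
#include <stdint.h>

/* Local copies for illustration only (assumed values); real code should
 * take SG_INFO_CHECK and DRIVER_SOFT from the kernel's headers. */
#define EX_SG_INFO_CHECK 0x1u   /* some status field is non-zero */
#define EX_DRIVER_SOFT   0x02u  /* held request was never sent to the device */

enum slave_outcome {
    SLAVE_OK,           /* WRITE was sent and no status field is set */
    SLAVE_NOT_ISSUED,   /* master READ failed, so the WRITE was not sent */
    SLAVE_FAILED,       /* WRITE was sent and reported a problem */
    SLAVE_IOCTL_ERROR   /* the ioctl(2) itself failed; consult errno */
};

/* Hypothetical helper: classify the completion of ioctl(slave_fd, SG_IO,)
 * from its return value and the info and driver_status output fields. */
enum slave_outcome classify_slave(int ioctl_res, uint32_t info,
                                  uint32_t driver_status)
{
    if (ioctl_res < 0)
        return SLAVE_IOCTL_ERROR;
    if ((info & EX_SG_INFO_CHECK) == 0)
        return SLAVE_OK;
    if (driver_status == EX_DRIVER_SOFT)
        return SLAVE_NOT_ISSUED;
    return SLAVE_FAILED;
}
```

When SLAVE_NOT_ISSUED is returned, the actual error is found in the master's completion, fetched with ioctl(master_fd, SG_IORECEIVE,).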

The testing/sgh_dd.cpp utility (in sg3_utils-1.45 rev 811 or later, see main page) has an oflag=swait command line operand for exercising this feature.

Details: "error" can be a bit difficult to define in SCSI. The interesting ones are like: that READ worked but in the firmware's opinion this storage device will soon fail! You can ignore that if this is the final copy of data on that medium to something safer, but otherwise it is probably more serious than that READ failing. Anyway, when this driver is deciding internally whether a request has failed (e.g. that other requests are queued on), then any non-zero value in the SCSI status, or the driver or transport status is regarded as an error with the queued commands that have not been sent to the device getting DRIVER_SOFT as indicated above.

As the naming suggests the SG_IOSUBMIT and SG_IOSUBMIT_V3 ioctls are closely related. The same is true of SG_IORECEIVE and SG_IORECEIVE_V3. The '_V3' versions take a pointer to a v3 interface object (i.e. struct sg_io_hdr) as their third argument. These ioctl(s) have been separated to simplify 32 bit to 64 bit compatibility handling. The v3 and v4 interface objects have different sizes. Further the v4 interface object is the same size in both 32 and 64 bit environments (by design) while the v3 interface object size differs between 32 and 64 bit environments (due to embedded pointers).

9 Sharing design considerations

The primary application of sharing is likely to be copying from one storage device to another storage device where both are SCSI devices (or translate the SCSI command set such as SATA disks do in Linux). Let's assume the copy is large enough so that it needs to be cut up into segments, implemented by READ (from source), WRITE (to destination) commands, each pair of which share the same data. Even with modern SSDs, maximum performance is usually obtained by queuing commands to storage devices. However the design of sharing in the sg driver requires sequential READ then WRITE commands on a pair of shared file descriptors in a way that precludes queuing using those two file descriptors. Worse still, the storage device that does the READ (i.e. the master side of the share) must wait, effectively doing nothing while its paired WRITE command is being done; it could be doing the next READ while it's waiting.

One relatively simple solution is to take advantage of threading which is well supported by the Linux kernel. Multi-threaded programs are typically multiple threads of execution running in a single process. All threads within a process share the same memory and other resources such as file descriptors. In the case of a copy using request sharing in the sg driver, a good approach would be to have one management thread and multiple worker threads. Each worker thread would go to a distribution centre where information about next segment offsets to be copied would be fetched; then the worker thread could go and do that copy segment using those offsets and return to the distribution centre for information on the next segment offsets to be copied, or be told there is nothing more to do in which case the thread could exit. The distribution centre needs to be stateful which in this context means that it needs to remember when it has given out copy segment offsets and not give them out again (unless the original thread reports an error). One way to protect this distribution centre from two worker threads accessing it at the same time is with a mutex shared between all worker threads. Finer grained threading mechanisms such as atomic integers may be able to provide this protection in the place of a mutex.
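The atomic integer variant of the distribution centre might look like the sketch below. It assumes C11 atomics; struct dist_centre, dc_init() and dc_next_segment() are hypothetical names, not taken from sgh_dd. Handing a segment out again after a worker reports an error is not covered and would need extra state.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical "distribution centre": hands out each copy segment once. */
struct dist_centre {
    atomic_uint_fast64_t next_blk; /* first logical block not yet handed out */
    uint64_t end_blk;              /* one past the last block to copy */
    uint32_t seg_blks;             /* blocks per full segment, must be > 0 */
};

void dc_init(struct dist_centre *dc, uint64_t start_blk, uint64_t num_blks,
             uint32_t seg_blks)
{
    atomic_init(&dc->next_blk, start_blk);
    dc->end_blk = start_blk + num_blks;
    dc->seg_blks = seg_blks;
}

/* Returns true and fills *blk and *num with the next segment, or false
 * when there is nothing more to do (the worker thread may then exit).
 * The atomic fetch-and-add ensures no two callers get the same segment. */
bool dc_next_segment(struct dist_centre *dc, uint64_t *blk, uint32_t *num)
{
    uint64_t b = atomic_fetch_add(&dc->next_blk, dc->seg_blks);

    if (b >= dc->end_blk)
        return false;
    *blk = b;
    *num = (dc->end_blk - b < dc->seg_blks) ? (uint32_t)(dc->end_blk - b)
                                            : dc->seg_blks;
    return true;
}
```

Each worker thread would loop on dc_next_segment(), doing one READ/WRITE pair per segment, until it returns false. The last segment may be shorter than the others.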

With the sg driver there is no limit (in the driver, modulo memory availability) to the number of file descriptors that there can be referring to a single storage device. So for this segmented copy using sg driver sharing, a good approach would be to do a separate open(2) system call on the source and another on the destination in each worker thread. Then each worker thread could set up a file descriptor share with the copy source file descriptor as the master and the copy destination file descriptor as the slave. The number of worker threads should be no more than the maximum queue depth the two devices can comfortably handle. That said, having hundreds of worker threads may consume a lot of the machine's resources. An advantage of this approach is that each worker thread can use the sg driver's simpler synchronous interface (i.e. ioctl(SG_IO) ). Then the reader might wonder, is there any command queuing taking place? The answer is yes, because one way of viewing the sg driver is that under the covers it is always asynchronously accessing the SCSI devices. So even when one thread is blocked on an ioctl(SG_IO) another thread can call ioctl(SG_IO) and that command will be forwarded to the device.

There is a big "gotcha" with this design (and almost any other design for segmented copy that isn't completely single threaded). The gotcha does not apply when the destination device is a SCSI device, or uses the pwrite(2) or pwritev(2) system calls (which take an explicit file offset), but does apply to the write(2) system call, often used to write to a pipe or socket. The problem is that if a read is issued by one thread (or any asynchronous mechanism) called R1 and before it completes another thread issues a read called R2, then there is no guarantee that R1 will complete before R2. And if R2 does complete before R1 and the write(2) system call is called for W2 (i.e. the pair of R2) before W1 then those writes will be out of order. Detecting out-of-order writes when gigabytes are being copied can be painful. If the source and shuffled destination are available as files then a utility like sha1sum will show them as different (because they are) but an old school sum(1) (like from 'sum -s') will give the same value for both. There is a related issue associated with the atomicity of the Linux write(2) system call. There is no corresponding atomicity issue with the SCSI WRITE command.
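When the destination is a regular file, the ordering problem can be avoided by giving every write its own offset with pwrite(2). A minimal sketch follows; write_segment_at() is a hypothetical helper, it assumes all segments have the same size, and it does not retry on EINTR.

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write copy segment number seg_idx at its own file
 * offset, so the order in which worker threads call it does not matter.
 * Returns 0 on success, -1 on failure. */
int write_segment_at(int out_fd, uint64_t seg_idx, const void *buf,
                     size_t seg_bytes)
{
    const char *p = buf;
    off_t off = (off_t)(seg_idx * seg_bytes);
    size_t left = seg_bytes;

    while (left > 0) {
        ssize_t n = pwrite(out_fd, p, left, off);

        if (n <= 0)
            return -1;      /* error, or no progress made */
        p += n;
        off += n;
        left -= (size_t)n;
    }
    return 0;
}
```

With this approach W2 may finish before W1 and the output file still ends up in order. It does not help for pipes or sockets, which have no file offset.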

To save time and resources the master side shared READ request should be issued with SG_FLAG_NO_DXFER flag OR-ed with its other flags. That is assuming that the copy program does not need to "see" the data as it flies past. As a counter example, a copy program might want to do a sha256sum(1) on the data being copied in which case that program needs to "see" the inflight data.

The above design can be extended to the single reader, multiple writer case. In other words each worker thread would open file descriptors to the READ storage device and every WRITE storage device. Code to demonstrate these techniques can be found in the sg3_utils package's testing/sgh_dd.cpp utility. That code uses ioctl(SG_SET_GET_EXTENDED, {SG_SEIM_CHG_SHARE_FD}) to change the slave side of an existing share to the next writer.

SCSI storage devices optionally report a "Block limits" Vital Product Data (VPD) page which contains a field called "Optimal transfer length" whose units are Logical blocks (e.g. usually either 512 or 4096 bytes). There is also a "Maximum transfer length" whose units are the same. If that VPD page is present (fetched via the SCSI INQUIRY command) but those fields are 0 then no guidance is provided. Otherwise the segment size chosen for a copy should probably be the minimum of the source and destination Optimal transfer length fields. However if that implies a segment size in the Megabyte range (say over 4 MB) then the Linux kernel may object.
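The segment size choice described above can be sketched as a small helper. pick_segment_blocks() is a hypothetical name; it assumes both devices use the same logical block size, and the byte ceiling (for example 4 MB) and the fallback value are the caller's own choices rather than values taken from the driver.

```c
#include <stdint.h>

/* Hypothetical helper: pick a copy segment size in logical blocks.
 * src_opt and dst_opt are the "Optimal transfer length" fields of the two
 * devices (0 means no guidance). block_size must be non-zero. max_bytes
 * is a ceiling chosen by the caller; fallback_blocks is used when neither
 * device gives guidance. */
uint32_t pick_segment_blocks(uint32_t src_opt, uint32_t dst_opt,
                             uint32_t block_size, uint32_t max_bytes,
                             uint32_t fallback_blocks)
{
    uint32_t blocks;
    uint32_t cap = max_bytes / block_size;

    if (src_opt == 0 && dst_opt == 0)
        blocks = fallback_blocks;
    else if (src_opt == 0)
        blocks = dst_opt;
    else if (dst_opt == 0)
        blocks = src_opt;
    else
        blocks = (src_opt < dst_opt) ? src_opt : dst_opt;

    if (cap > 0 && blocks > cap)
        blocks = cap;       /* keep the segment under the byte ceiling */
    return blocks;
}
```

For example, with 512 byte blocks and a 4 MB ceiling, two devices both reporting 65535 blocks would yield a segment of 8192 blocks.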

Other copy designs are possible that, instead of using threads, use separate processes. One practical problem with this is the ioctl(2) that sets up the share between a destination file descriptor (fd) and a source fd. That will be done in the process containing the destination fd but how does it find out about the source fd? One way is in a process containing the source file descriptor, to use the Unix fork(2) system call to spawn a new process. The child process will share the same file descriptors as its parent. So if the child then goes on to open the destination storage device then it has the two file descriptors it needs to set up the share. While that solution may look good on paper, it may require a radical rewrite of existing code to implement. Perhaps a better solution is to pass an open file descriptor from one process to another process using a Unix socket. The blog by Keith Packard outlines the technique. Code based on both techniques can be found in the sg3_utils package's testing/sg_tst_ioctl.c (with the '-f' option).

10 Multiple requests

The bsg write(2) based asynchronous interface (removed from the kernel around lk 4.15) supported multiple sg_io_v4 objects in a single invocation. Such an invocation is abbreviated to mrq in the following. That is harder to do with an ioctl(2) based interface as the kernel favours pointers to fixed size objects passed as the third argument. Multiple requests (in one invocation) have been implemented in this driver using an extra level of indirection which is a common technique for solving software challenges.

A new sg v4 interface flag: SGV4_FLAG_MULTIPLE_REQS, has been added to sg_io_v4::flags . An instance of a sg_io_v4 object with the SGV4_FLAG_MULTIPLE_REQS flag set is termed a control object, which is abbreviated to ctl_obj below. A pointer to a ctl_obj can be given as the third argument to either ioctl(SG_IO), ioctl(SG_IOSUBMIT), ioctl(SG_IORECEIVE) or ioctl(SG_IOABORT). The members of a control object are interpreted a little differently from a normal sg v4 interface object:

| control object's fields | input value | Notes (flags are written without the leading SGV4_FLAG_ for brevity) |
|---|---|---|
| guard | 'Q' | associated ctl_obj.protocol and ctl_obj.subprotocol fields must both be 0 implying SCSI command protocol. This is the same as the normal v4 interface object |
| request | 0 or ptr-> array of cdbs | if 0 then ctl_obj.request_len field must be 0. If non-zero then it is a pointer to an array of cdbs (SCSI command descriptor blocks). The number of elements ('n') in this array is ctl_obj.dout_xfer_len divided by the size of a request object (sg_io_v4_sz). The actual length of each cdb in this array is given by the req->request_len field in the corresponding request array element. All actual cdb lengths must be less than or equal to ctl_obj.request_len divided by n. |
| request_len | 0 or length of array of cdbs | if 0 then ctl_obj.request field must be 0. If non-zero then it is the length in bytes of the array of cdbs pointed to by ctl_obj.request |
| request_extra | mrq pack_id if non-zero | If the user wants the option of being able to use ioctl(SG_IOABORT) on this invocation before it finishes, then they may set this field to a non-zero value. Only one outstanding mrq invocation per file descriptor can have a non-zero mrq pack_id. |
| dout_xferp | ptr-> request array | request array is provided by the user space and copied into the driver for processing. In the case of ioctl(SG_IORECEIVE) it may be 0. The ioctl(2) fails with E2BIG if the size of the request array exceeds 2 MB |
| dout_xfer_len | length of request array | length in bytes of array pointed to by ctl_obj.dout_xferp . It must be an integer multiple of the size of a request object (sg_io_v4). |
| din_xferp | ptr-> space to receive response array | pointer to space that will have the response array written out to it. May be the same value as dout_xferp. In the case of ioctl(SG_IOSUBMIT) when MULTIPLE_REQS and IMMED flags are given, may be zero. Size cannot exceed 2 MB |
| din_xfer_len | length of response array | length in bytes which must be an integer multiple of the size of a response object which is the same size as the request object. |
| response | ptr-> space for sense data | this and the next field will be used to "stuff" (overwrite) any element in the request array that has zero in both corresponding fields. It is for SCSI command sense data |
| max_response_len | 18 to 256 | this relies on the assumption that it is unlikely that more than one of the multiple requests will yield sense data |
| flags | MULTIPLE_REQS | plus optionally the IMMED or STOP_IF flags. |
| dout_resid | <<output>> | number of requests implied by dout_xfer_len less the number of requests submitted. 0 is the expected value. Note: unit is v4 requests, not bytes. |
| din_resid | <<output>> | number of responses implied by din_xfer_len less the number actually written to din_xferp . |
| info | <<output>> | if ioctl(SG_IO) or ioctl(SG_IOSUBMIT) then the number of requests submitted is written. For ioctl(SG_IORECEIVE) the number of responses output to din_xferp is written. |
| spare_out | <<output>> | secondary error code. Usually this will be caused by an error detected in submission in one of the requests (e.g. using the 'DO_ON_OTHER' flag when there is no file share established; this will cause spare_out==ERANGE). If multiple submissions have this type of syntax error, spare_out will be set from the last one. |
| <<all other input fields>> | 0 | for example: the ioctl(2) fails with ERANGE if either din_iovec_count or dout_iovec_count is non-zero |
| <<all other output fields>> | <<output>> | 0 written |



Note that 'din' and 'dout' maintain their data transfer direction sense which is with respect to the user space. The response array is a request array with the output fields written to it. However with ioctl(SG_IORECEIVE) the request array is not available and its response array has zero-ed 'in' fields. Further, in that case the response array's elements are in completion order which may be different from the request array which dictates the submission order. The size, in bytes, of the version 4 interface object (i.e. in C: sizeof(struct sg_io_v4) ) is shown as sg_io_v4_sz . Notice the control object can optionally provide an array of cdbs; if given the elements in that array of cdbs will override the cdbs pointed to in each request array element.
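The length rules for the request array can be checked in user space before the ioctl(2) is attempted. The sketch below is hypothetical: mrq_request_count() is not part of any library, elem_sz stands for sizeof(struct sg_io_v4), and only the E2BIG case is documented above; the other return values are assumptions made for this example.

```c
#include <errno.h>
#include <stdint.h>

#define EX_MRQ_MAX_BYTES (2u * 1024u * 1024u)   /* the 2 MB limit */

/* Hypothetical pre-check of ctl_obj.dout_xfer_len. On success stores the
 * number of requests in *n and returns 0; otherwise returns a positive
 * errno style value and leaves *n untouched. */
int mrq_request_count(uint32_t dout_xfer_len, uint32_t elem_sz, uint32_t *n)
{
    if (elem_sz == 0 || dout_xfer_len == 0)
        return EINVAL;      /* assumption: nothing to submit */
    if (dout_xfer_len > EX_MRQ_MAX_BYTES)
        return E2BIG;       /* documented for an oversize request array */
    if (dout_xfer_len % elem_sz != 0)
        return ERANGE;      /* assumption: treated as a syntax violation */
    *n = dout_xfer_len / elem_sz;
    return 0;
}
```

The same arithmetic applies to din_xfer_len and the response array, since request and response objects have the same size.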

The benefit of multiple requests is to lessen the number of context switches and bulk up some transfers of meta-information so more information is transferred in fewer transfers. Three use cases were considered:

A table summarizing four different varieties of multiple requests follows with a more in depth explanation after the table:


| | ordered blocking | variable blocking | submit non-blocking | full non-blocking |
|---|---|---|---|---|
| ioctl arguments of first call | sg_fd, SG_IO, &ctl_obj | sg_fd, SG_IOSUBMIT, &ctl_obj | sg_fd, SG_IOSUBMIT, &ctl_obj | sg_fd, SG_IOSUBMIT, &ctl_obj |
| ctl_obj flags (without leading SGV4_FLAG_ ) | required: MULTIPLE_REQS; optional: STOP_IF; excluded: IMMED | required: MULTIPLE_REQS; optional: STOP_IF; excluded: IMMED | required: MULTIPLE_REQS, IMMED; optional: (none); excluded: STOP_IF | required: MULTIPLE_REQS, IMMED; optional: (none); excluded: STOP_IF |
| req_arr element flags (without leading SGV4_FLAG_ ); MULTIPLE_REQS excluded on all | optional: SHARE, DO_ON_OTHER, NO_DXFER | optional: SHARE, DO_ON_OTHER, NO_DXFER, SIG_ON_OTHER, COMPLETE_B4 | optional: SIG_ON_OTHER; excluded: SHARE, DO_ON_OTHER, COMPLETE_B4 | optional: SIG_ON_OTHER; excluded: SHARE, DO_ON_OTHER |
| ioctl arguments of second call | <<everything completed in first call>> | <<everything completed in first call>> | sg_fd, SG_IORECEIVE, &ctl_obj | sg_fd, SG_IORECEIVE, &ctl_obj |
| ctl_obj flags | | | required: MULTIPLE_REQS; optional: (none); excluded: STOP_IF, IMMED | required: MULTIPLE_REQS, IMMED; optional: (none); excluded: STOP_IF |
| req_arr element flags; MULTIPLE_REQS excluded on all | | | optional: SIG_ON_OTHER; excluded: SHARE, DO_ON_OTHER, COMPLETE_B4 | optional: SIG_ON_OTHER; excluded: SHARE, DO_ON_OTHER, COMPLETE_B4 |



The ordered blocking multiple request method submits every command found in req_arr (read into the driver via ctl_obj.dout_xferp), waiting for each request to complete before moving to the next request in req_arr. It will exit when all the requests have been completed or an error occurs. After (partial) success, the updated req_arr will be written out to ctl_obj.din_xferp. Each completed request will have SG_INFO_MRQ_FINI OR-ed into its req.info field. The updated ctl_obj is written out to the location indicated by the ioctl(SG_IO)'s third argument. The ctl_obj.dout_resid field will contain the number of requests in ctl_obj.dout_xferp less the number successfully submitted; so zero is the expected number. The order that requests appear in req_arr will be the same as the order of the response array written out on completion. The DO_ON_OTHER flag on requests instructs the driver to submit that request on the shared file descriptor rather than the one given as the first argument of the ioctl(2). If there is no file descriptor share already established then the ioctl(2) fails with an errno of ERANGE. Most syntax violations in multiple request handling will yield an ERANGE error. The DO_ON_OTHER flag is only permitted with multiple requests; using it on single request methods will cause the ioctl(2) to fail with ERANGE.

The variable blocking multiple request method is similar to ordered blocking but by default requests are submitted without waiting for the previous submission to complete. This can be overridden on a request by request basis with either the SHARE or COMPLETE_B4 flags. With either of these flags given, the current request will complete before the next request (if any) is submitted. After the submission loop, all outstanding completions are fetched before ioctl(SG_IOSUBMIT) returns to the user. The same information is copied back to the user space as outlined in the previous paragraph.

These two blocking multiple request methods both can optionally take the STOP_IF flag on the control object. That will cause a check to be done at completion of each request for driver, transport or device (SCSI) errors or warnings. If any errors or warnings are detected then no more requests will be submitted. Notice that the STOP_IF flag has no effect in variable blocking if there are no SHARE or COMPLETE_B4 flags as all requests have already been submitted before any completions are checked. The action of the STOP_IF flag has been designed this way so as to not orphan requests that are inflight due to an error occurring on some other request.

The submit non-blocking and full non-blocking multiple request methods are the same on the submission side (i.e. the first call). They both call ioctl(SG_IOSUBMIT) with the MULTIPLE_REQS and IMMED flags set on the ctl_obj. All requests are submitted (which should not block, but could run out of resources) after which control is returned to the caller. Notice that many flags are now "excluded" apart from SIG_ON_OTHER ("signal on other"). Any command in the request array using those excluded flags will cause the ioctl(2) to fail with an errno of ERANGE and no requests will be submitted. File descriptor sharing may be used but this is not request sharing, rather it will allow some of the multiple requests to use the SIG_ON_OTHER flag. When SIG_ON_OTHER is given on a request, then after that request completes, the response array (in its current state) is flushed out (i.e. written to where ctl_obj.din_xferp points), then on the other file descriptor poll(2) will have POLLIN set and a signal will be issued if it has been set up. The other file descriptor is just a convenient auxiliary that selected requests can trigger poll(2) and/or a signal on. The file descriptor given as the first argument to the ioctl(2) will have POLLIN set and optionally signal traffic for every completed request.

The second half of the submit non-blocking multiple request method is performed by calling ioctl(SG_IORECEIVE) with the MULTIPLE_REQS flag set on the control object. The ctl_obj.din_xferp and ctl_obj.din_xfer_len fields are expected to be non-zero. The ctl_obj.din_xfer_len field divided by sg_io_v4_sz is the number of request completions this ioctl(2) will attempt to yield. As an example: if that division yields 5 and 3 requests are already completed then this ioctl(2) will wait for those other two requests to complete before returning with all 5 responses. And if the number already completed was 6 then the first 5 would be written out to ctl_obj.din_xferp and the ioctl(2) would return without blocking, leaving one completed request for another ioctl(SG_IORECEIVE) invocation to "pick up". If there are no requests waiting (i.e. completed) and no requests are submitted, pending completion, then this ioctl(2) fails with an errno of ENODATA. The response array output to ctl_obj.din_xferp is zero filled with only output fields (including the usr_ptr field) filled for those requests that have completed. See the Processing mrq responses section below for more details.

The second half of the full non-blocking multiple request method is performed by calling ioctl(SG_IORECEIVE) with the MULTIPLE_REQS and IMMED flags set on the control object. The ctl_obj.din_xferp and din_xfer_len fields are expected to be non-zero. The ctl_obj.din_xfer_len field divided by sg_io_v4_sz is the maximum number of request completions this ioctl(2) will yield. As an example: if that number is 5 and 3 requests are already completed then the ioctl(2) will only yield those 3 completed requests and then return to the caller. If there are no requests waiting (i.e. completed) and no requests are submitted, pending completion, then this ioctl(2) fails with an errno of ENODATA. If there are no requests waiting (i.e. completed) and there are one or more submitted requests still inflight then the response array output to ctl_obj.din_xferp will be all zeros.
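The difference between the two receive styles can be captured in a small user-space model. This is only a sketch of the behaviour described in the two paragraphs above, not driver code: plan_receive() and struct rcv_plan are hypothetical, and the case where fewer requests are inflight than the caller has room for is an assumption (the model waits for what there is).

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct rcv_plan {
    int err;            /* 0, or ENODATA when nothing is waiting or inflight */
    uint32_t yielded;   /* responses written to the response array */
    uint32_t waited_on; /* inflight requests the call blocks on first */
};

/* Hypothetical model of a mrq ioctl(SG_IORECEIVE). slots is
 * ctl_obj.din_xfer_len / sg_io_v4_sz; completed and inflight describe the
 * file descriptor's requests; immed is true when the IMMED flag is set. */
struct rcv_plan plan_receive(uint32_t slots, uint32_t completed,
                             uint32_t inflight, bool immed)
{
    struct rcv_plan p = { 0, 0, 0 };

    if (completed == 0 && inflight == 0) {
        p.err = ENODATA;
        return p;
    }
    if (completed >= slots) {
        p.yielded = slots;      /* the rest stay for a later call */
    } else if (immed) {
        p.yielded = completed;  /* never waits; may be 0 */
    } else {
        uint32_t want = slots - completed;

        p.waited_on = (want < inflight) ? want : inflight;
        p.yielded = completed + p.waited_on;
    }
    return p;
}
```

With slots=5, completed=3 and inflight=2, the model yields 5 responses after waiting on 2 requests without IMMED, and 3 responses with no waiting when IMMED is set.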

A secondary error is an error that occurs after zero or more commands from a multiple request array have already been submitted. No further commands are submitted after a secondary error is detected. The secondary error is placed in the ctl_obj.spare_out field and in the response_arr[n].spare_out where n is the index of the request that caused the error. Any commands that are inflight when a secondary error is detected, are completed. Secondary errors are (positive) errno values. The expectation is that if the same request was input individually with ioctl(SG_IOSUBMIT) (i.e. a non-mrq invocation) then that ioctl(2) would fail with the same errno value.

Only the ordered blocking and variable blocking multiple request methods (and not the two non-blocking methods) can additionally use request sharing with the following modification. Since all multiple request methods use a single file descriptor (i.e. the first argument of the ioctl(2) ), then there needs to be another way of indicating a particular request should use the other (i.e. shared) file descriptor. This is done with the DO_ON_OTHER flag. File descriptor sharing can be used with all four multiple request methods either to support request sharing or to nominate another file descriptor to which some POLLIN and signal indications are sent, triggered by the SIG_ON_OTHER flag.

With the non-blocking multiple requests methods, rather than use the poll(2) command or signals, ioctl(sg_fd, SG_GET_NUM_WAITING, &an_integer) can be used. It will place the number that are completed but not "picked up" into an_integer with little overhead and it won't block. The user can also find out how many requests are active on the given file descriptor; this includes those requests that are inflight plus those that are waiting to be "picked up". That number can be found with ioctl(SG_SET_GET_EXTENDED, {SG_SEIRV_WAITING}). With the non-blocking multiple requests methods there is no ability to fetch the response of a particular request using a pack_id or tag. However with a normal ioctl(SG_IORECEIVE) a request submitted via a multiple request ioctl(SG_IOSUBMIT) can be found by pack_id or tag.

The O_NONBLOCK flag can be set on a sg driver file descriptor with the open(2) or fcntl(2) system calls. [Note that the related O_ASYNC file descriptor flag for enabling signals can only be set with the fcntl(2) system call.] If the O_NONBLOCK flag is set on the sg_fd given as the first argument of ioctl(SG_IOSUBMIT) or ioctl(SG_IORECEIVE) then it has a similar effect to giving the IMMED flag to ctl_obj.flags . If the O_NONBLOCK flag is set on the sg_fd given as the first argument of ioctl(SG_IO) then the O_NONBLOCK flag is ignored and ioctl(SG_IO) is fully blocking as described above.

Typically few SCSI commands yield sense data and when they do, it is not necessarily related directly to the command response that it is attached to. For example after a WRITE command a SSD may decide to yield sense data indicating that it has run out of resources to do further WRITEs and that the SSD will soon become a read-only device! So it is never a good idea to ignore sense data. On the other hand allocating (and freeing) buffers for each command's possible sense data can be burdensome and error prone. To simplify this a little the control object can be given a sense data pointer and its length in bytes (ctl_obj.response and ctl_obj.max_response_len respectively) and that will be used for any associated command request in the request array that has zero in those two fields. The downside of doing this is that if two or more commands yield sense data, only the last one will be seen.
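The "stuffing" of the control object's sense buffer into the request array can be modelled as shown below. This is a user-space illustration only: struct ex_req is a cut-down, hypothetical stand-in for the two relevant fields of struct sg_io_v4, and stuff_sense() is not a driver function.

```c
#include <stddef.h>
#include <stdint.h>

/* Cut-down stand-in for the two sense related fields of struct sg_io_v4. */
struct ex_req {
    uint64_t response;          /* pointer to sense buffer, as a 64 bit value */
    uint32_t max_response_len;  /* size of that buffer in bytes */
};

/* Model of the stuffing step: every request with zero in both fields
 * inherits the control object's sense buffer. Returns how many requests
 * now share that one buffer. */
size_t stuff_sense(struct ex_req *req_arr, size_t n, const struct ex_req *ctl)
{
    size_t shared = 0;

    for (size_t k = 0; k < n; ++k) {
        if (req_arr[k].response == 0 && req_arr[k].max_response_len == 0) {
            req_arr[k] = *ctl;
            ++shared;
        }
    }
    return shared;
}
```

If the returned count is greater than one, sense data from one command can overwrite that of another, so commands whose sense data matters individually should be given their own buffers.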

The following diagram illustrates some transactions in part of the ordered blocking method that is also using request sharing. Notice that the master share file descriptor is the one given to ioctl(SG_IO) and that requests (i.e. WRITE commands) for the slave file descriptor use the DO_ON_OTHER flag.




The sequence points, shown as blue circles in the above diagram, are where the driver notionally changes its attention from one file descriptor to the other, with the prime (i.e. the trailing quote) showing the receiving end of that attention. SQ1 is where the share between the two file descriptors is established and that does not necessarily need to be immediately before the main multiple request ioctl(2). SQ2 on the master is at the completion of the first READ (i.e. 'ind 0') and at this point the driver starts the first WRITE (i.e. 'ind 1') on the other file descriptor which is the slave. The performance win here is that there is no return to the user space to check the just completed command and issue the next command. At SQ3 the WRITE has completed and this causes the second READ (i.e. 'ind 2') to start. If the SGV4_FLAG_STOP_IF flag has been OR-ed into the ctl_obj.flags field then at SQ2, SQ3 and SQ4 an additional check is made to see if an error or warning has been issued by the storage device, the transport to it, or the LLD (and its associated HBA); if so ioctl(SG_IO) will exit.

Note that multiple requests are not available using the v3 interface object: neither with ioctl(SG_IO) nor ioctl(SG_IOSUBMIT_V3)+ioctl(SG_IORECEIVE_V3) . ioctl(SG_IO) using the v4 interface can be used for issuing a mrq invocation (i.e. 'ordered blocking' as shown in the table above).

10.1 Processing mrq responses

When a mrq succeeds, or fails (other than an ioctl(2) failure that results in errno being set) then the control object is written back to the user space to the same location it was read from (i.e. where the third argument of the ioctl(2) points). The only exception is ioctl(SG_IOSUBMIT) when the SGV4_FLAG_IMMED flag is set in which case nothing is written back. In the writeback case, the control object's fields of interest to the user code are dout_resid, din_resid, info and spare_out. All four refer to the array of normal sg_io_v4 objects written back (to the user space) using the ctl_obj.din_xferp pointer. [That pointer may be zero in some cases and nothing is written back but that is not recommended.]

ctl_obj.dout_resid is the number of elements in the request array (i.e. ctl_obj.dout_xfer_len / sg_io_v4_sz) less the number actually submitted. Like most "residuals" a value of zero is good as it implies all elements in the request array have been submitted. Any non-zero (positive) value suggests something abnormal has happened and that the array of normal sg_io_v4 objects written back (to the user space) using the ctl_obj.din_xferp pointer should be checked.

ctl_obj.din_resid is the number of elements in the response array (i.e. ctl_obj.din_xfer_len / sg_io_v4_sz) less the number actually completed. Any non-zero (positive) value suggests something abnormal has happened and that the array of normal sg_io_v4 objects written back (to the user space) using the ctl_obj.din_xferp pointer should be checked. Note that the full size request array is always written back to the user space but normal sg_io_v4 objects beyond that indicated by din_resid will be all zeros.

ctl_obj.info is the actual number of completions in the response array.

ctl_obj.spare_out holds a secondary error which is an errno value, so zero implies no secondary error. When requests are submitted individually, any problem with the syntax of the normal sg_io_v4 object is reported back via a failed ioctl(2) and setting errno (e.g. when using ioctl(SG_IO) or ioctl(SG_IOSUBMIT) ). However when invoking multiple requests, that technique is not available because a failed mrq ioctl(2) with an errno value indicates something is wrong with the control object. So when a contained normal sg_io_v4 object within a mrq has a syntax error, then that errno value is placed in ctl_obj.spare_out. Note that there may have been good requests submitted before the problematic normal sg_io_v4 object and their completions should be processed as usual.

Each element in the response array written back to the user space using the ctl_obj.din_xferp pointer is more or less what would be expected if that request had been submitted individually. There is one notable addition with mrqs: the SG_INFO_MRQ_FINI mask is OR-ed into each element's info field when its completion is processed, so the absence of the mask implies that a request has not been completed.
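A user program might walk the info fields of the response array to see how many elements carry the finished mask. The sketch below is illustrative only: count_finished() is a hypothetical helper, the info fields are assumed to have been copied into a plain array, and EXAMPLE_MRQ_FINI_MASK is a placeholder value, so the real SG_INFO_MRQ_FINI definition from the driver's header should be used instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder value for illustration only; use SG_INFO_MRQ_FINI from the
 * driver's sg.h header in real code. */
#define EXAMPLE_MRQ_FINI_MASK 0x20u

/* Return how many elements have the "finished" mask set in their info field. */
size_t
count_finished(const uint32_t *info_fields, size_t n, uint32_t fini_mask)
{
    size_t k, done = 0;

    for (k = 0; k < n; ++k) {
        if (info_fields[k] & fini_mask)
            ++done;     /* completion of this element has been processed */
    }
    return done;
}
```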

Note that when ioctl(SG_IORECEIVE) is used, the written back response array is not necessarily in the order of the corresponding mrq submission. This reflects that the storage device may process commands out-of-order. A typical example is when a disk has queued READs and WRITEs and a TEST UNIT READY is sent; that TEST UNIT READY will typically be responded to as soon as the storage device receives it. It is left up to the user code to match the response array(s) with their corresponding mrq submission. The usr_ptr, pack_id (request_extra field in the v4 interface) and request_tag are provided to help the user code do that matching. The usr_ptr is a "raw" pointer in the v3 interface and a 64 bit unsigned value in the v4 interface. In both cases it is treated as an opaque value that the sg driver does not use or modify; the driver just keeps it with a request and sends it back to the user space in the response after its completion. This style of pointer is sometimes called a "closure pointer".
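One way to use a closure pointer with the v4 interface is to store the address of a per-request context in the 64 bit usr_ptr value and cast it back when the response arrives. This is a minimal sketch; struct my_req_ctx and both helper functions are hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-request context owned by the application. */
struct my_req_ctx {
    int pack_id;
    const char *label;
};

/* Pack a context pointer into the 64 bit opaque usr_ptr value (v4 style). */
uint64_t ctx_to_usr_ptr(struct my_req_ctx *ctx)
{
    return (uint64_t)(uintptr_t)ctx;
}

/* Recover the context pointer from the usr_ptr value found in a response. */
struct my_req_ctx *usr_ptr_to_ctx(uint64_t usr_ptr)
{
    return (struct my_req_ctx *)(uintptr_t)usr_ptr;
}
```

Since the driver does not interpret the value, the context must stay valid (e.g. not on a stack frame that has returned) until the response has been fetched.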

Note that ctl_obj.driver_status, ctl_obj.transport_status and ctl_obj.device_status are not used and are set to zero in a mrq response.

10.2 Aborting multiple requests

The ioctl(SG_IOABORT) can be used to abort all inflight and yet-to-be submitted requests associated with a single mrq invocation. Any mrq invocation that the user may want to abort later must be given a non-zero mrq pack_id (in the request_extra field of the control object). There can only be one of these non-zero mrq pack_ids outstanding at a time per file descriptor.

Usually ioctl(SG_IOABORT) would be issued after an ioctl(SG_IOSUBMIT) call (i.e. async or non-blocking usage). It is possible, using another thread, to abort an ioctl(SG_IO) mrq invocation. In this case the first thread would still be waiting inside the ioctl(SG_IO) mrq invocation when a second thread used the same pack_id (and file descriptor) to call ioctl(SG_IOABORT).

Any requests in a mrq invocation that have already reached their internal completion point when the mrq abort is issued must be processed in the normal fashion. Any inflight requests will have blk_abort_request() called on them. Those remaining requests that have not yet been submitted will be dropped. See the Processing mrq responses section above for how an abort will be reported.

In the ioctl(sg_fd, SG_IOABORT, &ctl_obj) invocation the SGV4_FLAG_MULTIPLE_REQS flag must be set and the request_extra field must be set to the non-zero mrq pack_id. SG_PACK_ID_WILDCARD can be given for the mrq pack_id. Optionally the SGV4_FLAG_DEV_SCOPE flag may be given. In that case the current file descriptor (i.e. the one given as the first argument of the ioctl(SG_IOABORT)) is checked first for a mrq match on the pack_id; if that fails to find a match, then the other open file descriptors belonging to the current fd's sg device (e.g. /dev/sg3) are checked for a match. The abort is sent to the first match, if any, on the pack_id and the ioctl(SG_IOABORT) finishes (i.e. no further checks for a match are done).

If a call to ioctl(SG_IOABORT) is successful the array of sg_io_v4 objects pointed to by ctl_obj.din_xferp should be examined carefully. That array will be populated by either ioctl(SG_IORECEIVE), ioctl(SG_IO) or when an element of the array uses the SGV4_FLAG_SIG_ON_OTHER flag. When that flag is detected in an element of the request array (i.e. in a normal v4 interface object) then after that command is completed, the contents of that array of sg_io_v4 objects (complete up to and including that command) is written out to where ctl_obj.din_xferp points. In the case where the mrq is submit or full, non-blocking (i.e. issued with ioctl(SG_IOSUBMIT) with MULTIPLE_REQS and IMMED flags set), then it is the ctl_obj.din_xferp given to ioctl(SG_IOSUBMIT) that is used. This must be the case as the mating ioctl(SG_IORECEIVE) has typically not yet been issued.

10.3 Single/multiple (non-)blocking requests

Almost all interactions between a user space program and the sg driver involve using a sg driver file descriptor. Each sg driver file descriptor belongs to a sg device. [And optionally each file descriptor may be paired (shared) with another sg file descriptor which may belong to the same or a different sg device.] More precisely within the sg driver a file descriptor corresponds to a kernel object of type 'struct file'. Using the terminology found in 'man 2 dup' (i.e. the manpage of the dup system call) that kernel object is an open file description containing a set of flags and a file offset, among other things. In a user space process an open(2) system call returns an integer (zero or greater) which refers to that open file description. That integer is often termed a file descriptor. The dup(2) system call creates a second reference to the same open file description, as does passing a file descriptor to another process using Unix sockets. Since such operations are relatively uncommon, an open file description in this driver and a file descriptor created by using open(2) on a sg device will be regarded as the same thing.
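The distinction between a file descriptor and an open file description can be observed from user space without an sg device. The sketch below uses a pipe as a stand-in: it sets O_NONBLOCK through one descriptor and checks whether the flag is visible through a dup(2) of it. The function name is hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if a status flag set through one descriptor is seen through
 * its dup(2), 0 if it is not, and -1 if a system call fails. */
int dup_shares_status_flags(void)
{
    int pfd[2], fd2, flags, res = -1;

    if (pipe(pfd) < 0)
        return -1;
    fd2 = dup(pfd[0]);      /* second reference, same open file description */
    if (fd2 >= 0) {
        flags = fcntl(pfd[0], F_GETFL);
        if (flags >= 0 && fcntl(pfd[0], F_SETFL, flags | O_NONBLOCK) == 0) {
            flags = fcntl(fd2, F_GETFL);    /* read back via the duplicate */
            if (flags >= 0)
                res = (flags & O_NONBLOCK) ? 1 : 0;
        }
        close(fd2);
    }
    close(pfd[0]);
    close(pfd[1]);
    return res;
}
```

Two separate open(2) calls on the same device node would instead yield two independent open file descriptions, each with its own flags.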

Each sg driver file descriptor has one active request list (and an associated free list). All commands/requests are issued by this driver to lower levels (i.e. levels that are closer to the storage devices) using a non-blocking, asynchronous interface with completion flagged using a software interrupt mechanism. So blocking requests are managed by this driver. Given one file descriptor and any two commands whose execution overlaps, at some point both commands will have entries on that file descriptor's active request list. It is important to match the correct response with each request. If both requests were blocking then this matching is relatively simple since the identity of each request is known and can be searched for on the active request list. If one request was blocking and the other non-blocking then handling the active request list is still relatively simple. Non-blocking request completions are processed in FIFO order (first (completion) in becomes first out (to the user space)). If two or more non-blocking requests are on the same request list then the problem of matching the responses with their corresponding requests is left up to the user space! To aid the user space to do this matching, the pack_id, tag and usr_ptr fields are provided. The driver does do some work in this regard: all blocking requests on an active request list are marked so that they will never be seen by non-blocking mechanisms such as poll(2), ioctl(SG_IORECEIVE), ioctl(SG_GET_NUM_WAITING), or ioctl(SG_SET_GET_EXTENDED, {SG_SEIRV_WAITING}).

No other distinction (other than between requests submitted as blocking or non-blocking) is made on a file descriptor's active request list. This means single non-blocking requests and multiple non-blocking requests can be submitted on the same file descriptor and they are all treated the same way on the active queue. Their responses can be fetched (in FIFO order of completion) by any combination of single and multiple request calls, using SIGPOLL (or RT signals), poll(2) and ioctl(SG_GET_NUM_WAITING) to detect completion, and either read(2) or ioctl(SG_IORECEIVE) to fetch the response once a completion has occurred.

When there are no active requests on a sg file descriptor, its associated free list will have at least one entry which will be the inactive reserve request created when that file descriptor was open(2)-ed. There may be other entries on the free list, which reflects that at some earlier time (in the lifetime of that file descriptor) a newly issued request found that the reserve request was busy, its data buffer was not big enough, or it was unavailable. On the master side of a file descriptor share, the reserve request is only used for requests that have the SGV4_FLAG_SHARE flag set, so the reserve request is unavailable for new requests that don't use the share flag. The number of inactive requests on a file descriptor's free list can be found with ioctl(SG_SET_GET_EXTENDED, {SG_SEIRV_FL_RQS}). The total number of inactive requests of the given file descriptor and all file descriptors that have the same owning sg device can be found with ioctl(SG_SET_GET_EXTENDED, {SG_SEIRV_DEV_FL_RQS}).

11 pack_id or tag

When doing asynchronous IO with the sg driver there needs to be a way to wait for a particular response, not just the response that is the oldest. [By oldest is meant the command request in the active queue (a per file descriptor queue) whose callback occurred at the earliest time; this will usually be the first one in the active queue.] A common example would be a multi-thread application where each worker thread shares the same file descriptor and issues one command request and waits for the response to that request before issuing another command request.

Historically the way to do this with the sg driver is with a pack_id (short for packet identifier) which is a 32 bit integer. The pack_id is generated by the user application and passed into the interface structure (and in the v4 interface the pack_id is placed in request_extra). The pack_id doesn't have to be unique (per file descriptor) but it is practical that it is unique (and the sg driver does not check its uniqueness). The user application should then call ioctl(SG_SET_FORCE_PACK_ID, 1) which alerts the sg driver to read (from the user space) the pack_id given to ioctl(SG_IORECEIVE) or read(2) and then get the (first) matching request on the active queue or wait for it to arrive. The pack_id value -1 (or 0xffffffff if viewed as an unsigned integer) is used as a wildcard or to report nothing is available, depending on the context. The pack_id method has worked well and generated few errors or queries over the years and will continue to be supported in the sg v4 driver.
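The selection rule just described can be modelled in a few lines of user space code. This is only a sketch of the rule as stated in the text and not the driver's implementation: pick_response() and EXAMPLE_PACK_ID_WILDCARD are hypothetical names, and the queue of completed requests is reduced to an array of pack_ids ordered oldest first.

```c
#include <stddef.h>

/* Local stand-in for the wildcard value (-1) mentioned in the text. */
#define EXAMPLE_PACK_ID_WILDCARD (-1)

/*
 * completed[] holds the pack_ids of completed but not yet fetched requests,
 * oldest first. Returns the index of the response that would be chosen, or
 * -1 when nothing qualifies yet (the caller would then wait or poll).
 */
int
pick_response(const int *completed, size_t n, int force_pack_id,
              int wanted_pack_id)
{
    size_t k;

    if (n == 0)
        return -1;
    if (!force_pack_id || wanted_pack_id == EXAMPLE_PACK_ID_WILDCARD)
        return 0;               /* take the oldest completion */
    for (k = 0; k < n; ++k) {
        if (completed[k] == wanted_pack_id)
            return (int)k;      /* first match wins; uniqueness is not checked */
    }
    return -1;
}
```

With a queue holding the pack_ids 11, 22 and 22, asking for 22 with the force flag on selects the second entry, while the wildcard (or the force flag being off) selects the first.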

So what is a tag in this context? It is also a 32 bit integer but instead of being generated by the user application, it is generated by the block system. So instead of being given via the v4 interface structure to SG_IOSUBMIT, it is returned in the interface structure at the completion of ioctl(SG_IOSUBMIT) in the request_tag field (which is a 64 bit integer). Notice that the tag is only available in the v4 interface structure and via the two new async ioctls: SG_IOSUBMIT and SG_IORECEIVE. Using the tag to find a command response is very similar to the way it is done with pack_id described above. As currently implemented the tag logic does not work all the time; its reliability will most likely depend on the SCSI host (HBA driver (LLD)) that the target device belongs to. There seems to be no reliable way for this driver to fetch the tag from the block infrastructure. Currently this driver simply asks for it after forwarding the command request to the block code. However 3 cases have then been observed: it gets a tag; it doesn't get the tag (it is too early); it doesn't get the tag (it is too late) because the request has already finished! The third case may only occur with the scsi_debug driver which can complete requests in a microsecond or less (that is configurable). The tag wildcard is also -1 (or all "f"s in hex when viewed as an unsigned integer) so again the logic is very similar to pack_id.

So given the above, the default remains what it was in v3 of the sg driver, namely, using pack_id unless another indication is given. To use tags to choose a response, ioctl(SG_SET_FORCE_PACK_ID, &one_in_int) is needed first on the file descriptor. Then the v4 interface object given to ioctl(SG_IOSUBMIT) should OR SGV4_FLAG_YIELD_TAG with other flags in that interface object. Then after that ioctl has finished successfully, the request_tag field in that object should be set. If it is -1 then no tag was found (as discussed in the previous paragraph). The matching ioctl(SG_IORECEIVE) call should make sure the request_tag field is set as appropriate and the SGV4_FLAG_FIND_BY_TAG flag should be OR-ed with other flags.

12 Bi-directional command support

N.B. Support for SCSI bidirectional commands has been removed from the Linux kernel in version 5.1 . To allow the driver to merge post lk 5.1, bidi support has been removed from this driver. That bidi support is available as a separate patch if the driver is used with kernels prior to bidi support being removed.

One of the main reasons for designing the sg V4 interface was to handle SCSI (or other storage protocols) bi-directional commands (abbreviated here to bidi). In the SCSI command sets, bidi commands are mainly found in block commands that support RAID (e.g. XDWRITEREAD(10)) and many of the Object Storage Device (OSD) commands. Linux contains an "osd" upper level driver (ULD) and an object based file system called exofs. New SCSI commands are being considered such as READ GATHERED which would most likely be a bidi command. The NVMe command set (NVM) extends the bidi commands concept to "quad-di": data-in and data-out plus metadata-in and metadata-out.

Synchronous SCSI bidi commands have been available in the bsg driver for more than 12 years using ioctl(<bsg_dev_fd>, SG_IO) using the sg V4 interface (i.e. struct sg_io_v4) and are now available with the sg V4 driver where <bsg_dev_fd> is replaced by <sg_dev_fd>. Asynchronous SCSI bidi commands were available for the same period but were withdrawn around Linux kernel 4.15 due to problems with the bsg driver. Those asynchronous commands were submitted via the Unix write(2) call and the response was received using a Unix read(2) call. In the sg v4 driver the submitted and received object structure remains the same but the Unix write(2) and read(2) system calls can no longer be used. Instead two new ioctl(2)s have been introduced called SG_IOSUBMIT and SG_IORECEIVE to replace write(2) and read(2) respectively. The functionality is almost identical, read on for details.

In the sg driver the direct IO flag has the effect of letting the block layer manage the data buffers associated with a command. The effect of indirect IO in the sg driver is to let the sg driver manage the data buffers. Indirect IO is the default for the sg driver with the other options being mmap IO (memory mapped IO) and direct IO. Indirect IO is the most flexible with the sg driver, it can be used by both uni-directional and bidi commands and has no alignment requirements on the user space buffers. Request sharing discussed above cannot be used with direct IO (because the sg driver needs control of the data buffers to implement the share) while mmap IO is not implemented for bidi commands. Also a user space scatter gather list cannot be used for either the data-out or data-in transfers associated with a bidi command.

Other than the exclusions in the previous paragraph, all other capabilities of the sg driver are available to bidi commands. The completion is sent when the second transfer (usually a data-in transfer) has completed. pack_id and/or tags can be used as discussed in the previous section. Signal on completion, polling for completion and multi-threading should also work on bidi commands without issues.

13 SG interface support changes

In the following table, a comparison is made between the supported interfaces of the sg driver found in lk 4.20 (V3.5.36) and the proposed V4 sg driver. The movement of the main header file from the include/scsi directory to the include/uapi/scsi directory should not impact user space programs since modern Linux distributions should check both and the stub header now in include/scsi/sg.h includes the other one. There is a chance the GNU libc maintainers don't pick up this change/addition, but if so the author would expect that to be a transient problem. The sg3_utils/testing directory in the sg3_utils package gets around this problem with a local copy of the "real" new sg header in a file named uapi_sg.h .


 Table 1. sg interfaces supported by various sg drivers

Each row gives: interface, mode, structure --> access method; interface header

sg driver V3.5.36 (lk 2.6, 3, 4 and 5.0)

    v1+v2 interfaces, non-blocking, struct sg_header --> write(2)+read(2); include/scsi/sg.h
    v3 interface, non-blocking, struct sg_io_hdr --> write(2)+read(2); include/scsi/sg.h
    v3 interface, blocking, struct sg_io_hdr --> ioctl(SG_IO); include/scsi/sg.h
    v4 interface, non-blocking, struct sg_io_v4 (bsg.h) --> not available ^^^
    v4 interface, blocking, struct sg_io_v4 (bsg.h) --> not available ***

sg driver V4.0.x (lk ?)

    v1+v2 interfaces, non-blocking, struct sg_header --> write(2)+read(2) ****; include/uapi/scsi/sg.h
    v3 interface, non-blocking, struct sg_io_hdr --> ioctl(SG_IOSUBMIT_V3)+ioctl(SG_IORECEIVE_V3) or write(2)+read(2); include/uapi/scsi/sg.h
    v3 interface, blocking, struct sg_io_hdr --> ioctl(SG_IO); include/uapi/scsi/sg.h
    v4 interface, non-blocking, struct sg_io_v4 (bsg.h) --> ioctl(SG_IOSUBMIT)+ioctl(SG_IORECEIVE); include/uapi/scsi/sg.h + include/uapi/linux/bsg.h
    v4 interface, blocking, struct sg_io_v4 (bsg.h) --> ioctl(SG_IO); include/uapi/scsi/sg.h + include/uapi/linux/bsg.h

*** available via the bsg driver; ^^^ removed from the bsg driver in lk 4.15; **** the plan is to deprecate the write(2)/read(2) based interfaces which would leave v1+v2 interfaces unsupported.

Note that there is no v1+v2 blocking interface. Rather than completely drop the write(2)+read(2) interface, it could be kept alive for only the v1+v2 interfaces. Applications based on the v1+v2 interfaces would have been written around 20 years ago and would need a low level re-write to use the v3 or v4 non-blocking interfaces. So what might be dropped is the ability of the v3 interface to use the write(2)+read(2) interface, as the only code change required should be to change the write(2) to an ioctl(SG_IOSUBMIT_V3) and the read(2) to an ioctl(SG_IORECEIVE_V3).
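The code change for a v3 application might look like the following sketch. The wrapper function names are hypothetical. The two ioctl values are the hex values listed in the IOCTLs section of this document and are only defined locally when the installed scsi/sg.h header does not already provide them.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <scsi/sg.h>

/* Values as listed in this document's ioctl table; an installed v4 header
 * would supply these itself. */
#ifndef SG_IOSUBMIT_V3
#define SG_IOSUBMIT_V3 0xc0582245
#endif
#ifndef SG_IORECEIVE_V3
#define SG_IORECEIVE_V3 0xc0582246
#endif

/* Old style submit: write(2) of a v3 object. 0 on success, else -1. */
int submit_v3_old(int sg_fd, struct sg_io_hdr *hp)
{
    return (write(sg_fd, hp, sizeof(*hp)) < 0) ? -1 : 0;
}

/* Old style receive: read(2) of a v3 object. 0 on success, else -1. */
int receive_v3_old(int sg_fd, struct sg_io_hdr *hp)
{
    return (read(sg_fd, hp, sizeof(*hp)) < 0) ? -1 : 0;
}

/* New style submit: the same object passed to ioctl(SG_IOSUBMIT_V3). */
int submit_v3_new(int sg_fd, struct sg_io_hdr *hp)
{
    return (ioctl(sg_fd, SG_IOSUBMIT_V3, hp) < 0) ? -1 : 0;
}

/* New style receive: the same object passed to ioctl(SG_IORECEIVE_V3). */
int receive_v3_new(int sg_fd, struct sg_io_hdr *hp)
{
    return (ioctl(sg_fd, SG_IORECEIVE_V3, hp) < 0) ? -1 : 0;
}
```

In both styles the struct sg_io_hdr object is prepared in the same way (interface_id set to 'S', the cdb, data buffer and timeout filled in) and errno is examined when -1 is returned.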

14 IOCTLs

Over time there has been a transfer of functionality from the write(2) and read(2) system calls to various ioctl(2)s which are listed below. Using the write(2) and read(2) system calls in the way that this driver does is frowned upon by the Linux kernel architects, as is adding new ioctl(2)s! Only 6 new ioctl(2)s have been added in the sg v4 driver as noted in the status column of the table below. Two of those ioctl(2)s were proposed in this post by a Linux architect (L. Torvalds). And two more (i.e. SG_IOSUBMIT_V3 and SG_IORECEIVE_V3) are very closely related to those proposed. Still, there is a lot of extra information exchanged between the user space and the driver to support the new functionality added in v4 of this driver. That is nearly all done via one new omnibus ioctl(2): SG_SET_GET_EXTENDED using a 96 byte structure; see the second, third and fourth tables below.

The following table lists the ioctl(2)s that the sg v4 driver processes. They are in the alphabetical order of the name of the second ioctl(2) argument. In most cases the scope of the action of the ioctl(2) is that of the file descriptor, given as the first argument and referred to below as the current file descriptor. If the scope is other than the current file descriptor, that is noted in the second column. Note that there is a "fall-through" in the last row of this table, so any ioctl(2)s not processed by this driver will be passed to the SCSI mid-level and, if it doesn't process them, thence on to the LLD (SCSI low level driver) that owns the "host" that the file descriptor's device is connected to. If no driver processes an ioctl(2) then it should return -1 with an errno of ENOTTY (according to POSIX) but sometimes other error codes are given, depending on the LLD.

ioctl name [hex value] (second argument to ioctl(2) call)

Status (output via 3rd arg ptr unless noted)

Notes

BLKSECTGET [0x1267]

active

scope: host (HBA)

this ioctl value replicates what a block layer device file (e.g. /dev/sda) will do with the same value. It calls the queue_max_sectors() helper on the owning device's command queue. The resulting number is multiplied by 512 to get a count in bytes which is output where the third argument points, assumed to be a pointer to int (so a maximum of about 2 GB). It represents the maximum data size of a single request that the block layer will accept.

BLKTRACESETUP [0xc0481273]

active

scope: device

third argument of ioctl(2) is pointer to a struct blk_user_trace_setup object. Needs a kernel with CONFIG_BLK_DEV_IO_TRACE=y . This ioctl(2) and its siblings are passed through to the block layer which implements them: a pass-through inside a pass-through

BLKTRACESTART [0x1274]

active

scope: device

ignores third argument of ioctl(2). See blktrace and blkparse utilities in the blktrace package.

BLKTRACESTOP [0x1275]

active

scope: device

ignores third argument of ioctl(2). Part of blktrace support.

BLKTRACETEARDOWN [0x1276]

active

scope: device

ignores third argument of ioctl(2). Part of blktrace support.

SCSI_IOCTL_GET_BUS_NUMBER [0x5386]

active, deprecated

scope: host

implemented by the SCSI mid-level. Assumes the third argument is a pointer to int (32 bit) and places a field called 'host_no' in it. host_no is an index of SCSI HBAs (host bus adapters) in the system. In this case it will be the host number that the SCSI device is connected to. That SCSI device has been open(2)-ed to yield the file descriptor that this ioctl(2) uses. In modern Linux usage, this information is better obtained from sysfs. Alternatively ioctl(SG_GET_SCSI_ID) can be used (see below).

SCSI_IOCTL_GET_IDLUN [0x5382]

active, deprecated

scope: device

implemented by the SCSI mid-level. Assumes the third argument is a pointer to int (32 bit) and places a packed integer (with 4 components) in it. The lower 8 bits are a target device number, the next 8 bits are the LUN, the next 8 bits are the channel number, and the top 8 bits are the host_no mentioned in the previous item. There are many things wrong with this from a modern SCSI perspective. In modern Linux usage, this information is better obtained from sysfs.

SCSI_IOCTL_PROBE_HOST [0x5385]

active, deprecated

scope: host

implemented by the SCSI mid-level. Yields an identifying string associated with the host. Assumes the third argument is a pointer to a byte array whose length is placed in a (32 bit) int in the first 4 bytes. That length will be overwritten by the ASCII byte array output. This information can also be obtained from sysfs.

SCSI_IOCTL_SEND_COMMAND [0x1]

active, deprecated

this is the SCSI mid-level pass-through which is very old: it is found in lk 1.0, is of the same vintage as the sg v1 interface, and is even worse. Please do not use.

SG_EMULATED_HOST [0x2203]

seems to be "dead"

originally indicated a host emulated SCSI (e.g. ATAPI) but libata does not seem to set this value in the host template provided by each LLD.

SG_GET_ACCESS_COUNT [0x2289]

not supported

returns 1 [unless the owning sg device is missing in which case 0 is returned, very unlikely]

SG_GET_COMMAND_Q [0x2270]

active

see SG_SET_COMMAND_Q notes below. Yields the current state of the COMMAND_Q flag held by this file descriptor.

SG_GET_KEEP_ORPHAN [0x2288]

active

when a synchronous ioctl(SG_IO) is interrupted (e.g. by a signal from another process) the default action (depending on the signal) may be to terminate the ioctl(2) with an errno of EINTR. The driver terms such an inflight command/request an "orphan". The default action is to "throw away" the response from the device and clean up the request's resources. This loses information such as whether the command succeeded. This ioctl returns 0 (the default) or 1 depending on whether this file descriptor will throw away (when 0) or keep (when 1) the response to interrupted requests. Note that closing a sg file descriptor will clean up any outstanding request resources this file descriptor is using at the time of the close(2) [in reality that takes place a little later (when the last response "lands") because nothing is permitted to suspend a close(2)].

SG_GET_LOW_DMA [0x227a]

active, deprecated

scope: host

Yields the host's unchecked_isa_dma flag (0 or 1) via the third argument. The 'host' is typically the host bus adapter (HBA) that this sg device (the parent of the current file descriptor) is connected to.

SG_GET_NUM_WAITING [0x227d]

active

Number of non-blocking requests on the active list that are waiting to be read. That "read" can be done with either an ioctl(SG_IORECEIVE) or a read(2) system call. Requests that are inflight are not counted. If there are any blocking requests waiting on the list, they are not counted. Similar to ioctl(SG_SET_GET_EXTENDED, {SG_SEIRV_SUBMITTED}) which additionally counts (non-blocking) inflight requests. When using non-blocking multiple requests this will be the expected number of responses that ioctl(SG_IORECEIVE, FLAG_MULTIPLE_REQS | FLAG_IMMED) will receive. This ioctl(2) holds no locks in the sg driver and accesses an atomic integer. So it is fast and should never block, making it suitable for polling. In the presence of other producers or consumers the number waiting may change before a user has time to act on the result of this call.

SG_GET_PACK_ID [0x227c]

active

the third argument is expected to be a pointer to int. By default it will set that int to the pack_id of the first (oldest) command that is completed internally but still awaits ioctl(SG_IORECEIVE) or read(2) to finish. If no requests are waiting, -1 (i.e. the wildcard value) is placed in that int. This ioctl(2) yields the pack_id by default, unless the SG_CTL_FLAGM_TAG_FOR_PACK_ID boolean has been set on this file descriptor.

SG_GET_REQUEST_TABLE [0x2286]

active

The third argument is assumed to point to an array of 16 struct sg_req_info objects (that struct is defined in include/uapi/scsi/sg.h). First the array is zeroed making all req_state fields zero which corresponds to INACTIVE state. Then any requests that are active have fields placed in the sg_req_info elements. Then if there is still room requests from the free list are placed in sg_req_info elements. This action stops when either 16 elements are filled or there are no more requests associated with the current file descriptor to transfer.

SG_GET_RESERVED_SIZE [0x2272]

active

this is the size, in bytes, that the reserve request associated with this file descriptor currently has. The third argument is assumed to be a pointer to an int that receives this value.

SG_GET_SCSI_ID [0x2276]

active, enhanced in v4

the third argument should be a pointer to an object of type struct sg_scsi_id . This ioctl(2) fills the fields in that structure. The extension in v4 is to use two 'unused' 32 bit integers at the end of that struct as an array of 8 bytes to which the SCSI LUN is written. This is the preferred LUN format from t10.org . This extension does not change the size of struct sg_scsi_id . For those looking for the corresponding HCTL tuple for the device this file descriptor belongs to, this ioctl(2) is one way: H --> sg_scsi_id::host_no; C --> sg_scsi_id::channel, T --> sg_scsi_id::scsi_id and L --> sg_scsi_id::scsi_lun[8] . Another way is to use 'lsscsi -g' which data-mines in sysfs, or the user can write their own sysfs data-mining code.

SG_GET_SG_TABLESIZE [0x227F]

active

yields the maximum number of scatter gather elements that the associated host (HBA) supports. That is the host through which the sg device is attached, that "owns" the given file descriptor. The third argument is assumed to point to an int.

SG_GET_TIMEOUT [0x2202]

active, deprecated; timeout in seconds is return value

the v1 and v2 interfaces did not contain a command timeout field so this was a substitute. Both the v3 and v4 interfaces have a command timeout field which is better than using this ioctl(2).

SG_GET_TRANSFORM [0x2205]

seems to be "dead"

this driver passes this ioctl value through to the SCSI mid-level which seems to do nothing with it. Testing reveals that it yields an errno of EINVAL

SG_GET_VERSION_NUM [0x2282]

active

uses the third argument as a pointer to write out a 32 bit integer whose digits, when seen in decimal, are in the form [x]xyyzz . [x] means blank (space) if zero. This is usually expressed as an ASCII string as '[x]x.[y]y.zz' .

SG_IO [0x2285]

active, added functionality in v4 driver

both v3 and v4 interface blocking commands can be issued with this ioctl(2). Only returns -1 and sets errno when the preparation for submitting the command/request encounters a problem. Thereafter any problems encountered set the out fields in the v3 or v4 interface object. So both the return value and those out fields should be checked.

SG_IOABORT [0x40a02243]


new in v4 driver

only the v4 interface can use this ioctl(2) to abort a command in process, using either the pack_id (in the request_extra field) or the tag. The current file descriptor is checked first and if no match is found then all of the open file descriptors belonging to the current device are checked. The pack_id is used by default, unless the SG_CTL_FLAGM_TAG_FOR_PACK_ID boolean has been set on this file descriptor. If no corresponding request is found (capable of being aborted) then errno is set to ENODATA. The completion on an aborted command will have DRIVER_SOFT set in the driver_status field.

SG_IORECEIVE [0xc0a02242]


new in v4 driver

only the v4 interface can use this ioctl(2) to complete a command/request started with asynchronous ioctl(SG_IOSUBMIT) on the same file descriptor. If multiple requests are outstanding on the same file descriptor, then setting ioctl(SG_SET_FORCE_PACK_ID) indicates that subsequent requests on this file descriptor should take account of the pack_id (in the ::request_extra field) or the tag (in the ::request_tag field) to choose a matching response.

SG_IORECEIVE_V3 [0xc0582246]

new in v4 driver

only the v3 interface can use this ioctl(2) to complete a command/request started with asynchronous ioctl(SG_IOSUBMIT_V3) on the same file descriptor. If multiple requests are outstanding on the same file descriptor, then setting ioctl(SG_SET_FORCE_PACK_ID) indicates that subsequent requests on this file descriptor should take account of the pack_id field to choose a matching response.

SG_IOSUBMIT [0xc0a02241]


new in v4 driver

only the v4 interface can use this ioctl(2) to issue (submit) new commands. This ioctl(2) will return relatively quickly potentially well before the command has completed. Each call to ioctl(SG_IOSUBMIT) needs to be paired with a call to ioctl(SG_IORECEIVE) using the same (sg) file descriptor. This call is part of the v4 asynchronous (non-blocking) interface.

SG_IOSUBMIT_V3 [0xc0582245]

new in v4 driver

only the v3 interface can use this ioctl(2) to issue (submit) new commands. This ioctl(2) will return relatively quickly potentially well before the command has completed. Each call to ioctl(SG_IOSUBMIT_V3) needs to be paired with a call to ioctl(SG_IORECEIVE_V3) using the same (sg) file descriptor. This call is part of the v3 asynchronous (non-blocking) interface.

SG_NEXT_CMD_LEN [0x2283]

active, deprecated

only applies to the v2 interface which does not include a command (cdb) length field. That interface assumes the driver can work out what the cdb length is. While that works for standard cdbs (from T10) it may not work for vendor specific commands, hence this ioctl(2).

SG_SET_COMMAND_Q [0x2271]

active

in the v1 and v2 drivers the default was 0 (so no command queuing on this file descriptor). In the v3 driver it was 0 until a v3 interface structure was presented, in which case it was turned on (1) for this file descriptor. In the v4 driver it is on (1) by default. 0 --> only allow one command per fd; 1 --> allow command queuing. When command queuing is off, if a second command is presented before the previous one has finished, an errno of EDOM will result.

SG_SET_DEBUG [0x227e]

active, scope=device

0 --> turn off (def), 1 --> turn on . Currently the only impact of setting this is to print out sense data (to the log) of any request on all fds that belong to the current device. Typically only requests that yield a SCSI status of "Check condition" provide sense data.

SG_SET_FORCE_LOW_DMA [0x2279]

does nothing

users of modern Linux systems should not concern themselves with "low DMA", this comes from the ISA era. 0 --> use adapter setting (def); 1 --> force "low dma". However this ioctl(2) has since been neutered and does nothing.

SG_SET_FORCE_PACK_ID [0x227b]

active

when activated, a non-blocking response is only accepted if it has a matching pack_id (or tag). A pack_id (or tag) of -1 is treated as a wildcard. In the v4 interface the request_extra field is used for the pack_id. A non-blocking request is finished with either ioctl(SG_IORECEIVE[_V3]) or read(2). The third argument to this ioctl(2) is assumed to be a pointer to a 32 bit integer. 0 --> take the oldest available response (def); 1 --> match on pack_id (or tag) given in each subsequent request on this fd. Even though the third argument is a pointer to int, this ioctl(2) is effectively boolean. The default is to use the pack_id rather than the tag unless SG_SET_GET_EXTENDED{SG_CTL_FLAGM_TAG_FOR_PACK_ID} is active on this file descriptor.

SG_SET_GET_EXTENDED [0xc0602251]

new in v4 driver

takes pointer to 96 byte sg_extended_info structure; it can set and get 32 bit values and it can set and get boolean values. Each ioctl(2) can perform more than one action. Explained in the following tables and associated descriptions.

SG_SET_KEEP_ORPHAN [0x2287]

active

how to treat a SCSI response when an ioctl(SG_IO), read(2) or ioctl(SG_IORECEIVE) that is waiting is interrupted. 0 --> drop it (def); 1 --> hold it so the response can be fetched with either another read(2) or ioctl(SG_IORECEIVE) call.

SG_SET_RESERVED_SIZE [0x2275]

active

sets or resets the size of the reserve request's data buffer for this file descriptor to the given value (in bytes). If this file descriptor is in use (i.e. sending a SCSI command) then this ioctl(2) will fail with an errno of EBUSY.

SG_SET_TIMEOUT [0x2201]

active, deprecated

command timeout in seconds (pointed to by third argument). See "_GET_" notes.

SG_SET_TRANSFORM [0x2204]

seems to be "dead"

this driver passes this ioctl value through to the SCSI mid-level which seems to do nothing with it. Testing reveals that it yields an errno of EINVAL

<< any others>>

??

sent through to the SCSI mid-level (and then to the LLD associated with the device the fd belongs to) for further processing.
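The bracketed numbers for the newer ioctl(2)s above encode a direction and an argument size, while the older ones (e.g. 0x2283) do not. The following is a minimal sketch of how such a number can be taken apart, assuming the asm-generic ioctl layout used on x86 and arm (8 bit number, 8 bit type, 14 bit size, 2 bit direction). The names ioc_fields and decode_ioc are hypothetical and are not part of any sg header:

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an ioctl request number. Assumption: asm-generic layout
 * (x86, arm); some other architectures use different field widths. */
struct ioc_fields {
        unsigned int dir;   /* 0: none, 1: write, 2: read, 3: read+write */
        unsigned int size;  /* size in bytes of the third argument's object */
        unsigned int type;  /* "magic" byte, 0x22 for the sg driver */
        unsigned int nr;    /* command number within that type */
};

/* Hypothetical helper: split a request number into its four fields */
static struct ioc_fields decode_ioc(uint32_t req)
{
        struct ioc_fields f;

        f.nr = req & 0xff;
        f.type = (req >> 8) & 0xff;
        f.size = (req >> 16) & 0x3fff;
        f.dir = (req >> 30) & 0x3;
        return f;
}
```

With this layout SG_SET_GET_EXTENDED [0xc0602251] decodes to a 96 byte argument, SG_IOSUBMIT [0xc0a02241] to 160 bytes (the size of struct sg_io_v4) and SG_IOSUBMIT_V3 [0xc0582245] to 88 bytes (the size of struct sg_io_hdr on a 64 bit build).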



The third argument to ioctl(SG_SET_GET_EXTENDED) is a pointer to an object of type struct sg_extended_info . That structure is found in the <scsi/sg.h> header and is shown here:

struct sg_extended_info {
        uint32_t sei_wr_mask;       /* OR-ed SG_SEIM_* user->driver values */
        uint32_t sei_rd_mask;       /* OR-ed SG_SEIM_* driver->user values */
        uint32_t ctl_flags_wr_mask; /* OR-ed SG_CTL_FLAGM_* values */
        uint32_t ctl_flags_rd_mask; /* OR-ed SG_CTL_FLAGM_* values */
        uint32_t ctl_flags;         /* bit values OR-ed, see SG_CTL_FLAGM_* */
        uint32_t read_value;        /* write SG_SEIRV_*, read back related */
        uint32_t reserved_sz;       /* data/sgl size of pre-allocated request */
        uint32_t tot_fd_thresh;     /* total data/sgat for this fd, 0: no limit */
        uint32_t minor_index;       /* rd: kernel's sg device minor number */
        uint32_t share_fd;          /* SHARE_FD and CHG_SHARE_FD use this */
        uint32_t sgat_elem_sz;      /* sgat element size (must be power of 2) */
        uint8_t pad_to_96[52];      /* pad so struct is 96 bytes long */
};
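
Since the size of this structure is part of the ioctl(2) number (0x60 in 0xc0602251), a user space program that carries its own copy of the definition may want to check the size at compile time. The sketch below mirrors the structure locally, assuming the trailing pad is a byte array (11 * 4 + 52 = 96 bytes). The helper name sei_init is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the structure shown above. Assumption: pad_to_96 is an
 * array of bytes, so the eleven 32 bit fields plus the pad total 96 bytes. */
struct sg_extended_info {
        uint32_t sei_wr_mask;
        uint32_t sei_rd_mask;
        uint32_t ctl_flags_wr_mask;
        uint32_t ctl_flags_rd_mask;
        uint32_t ctl_flags;
        uint32_t read_value;
        uint32_t reserved_sz;
        uint32_t tot_fd_thresh;
        uint32_t minor_index;
        uint32_t share_fd;
        uint32_t sgat_elem_sz;
        uint8_t pad_to_96[52];
};

/* Compile time check against the size encoded in the ioctl number */
_Static_assert(sizeof(struct sg_extended_info) == 96,
               "sg_extended_info must be 96 bytes long");

/* Hypothetical helper: zero everything so that no mask bits are set by
 * accident; with both sei masks zero the ioctl(2) does nothing. */
static void sei_init(struct sg_extended_info *sei)
{
        memset(sei, 0, sizeof(*sei));
}
```

Starting from a zeroed object means only the actions that are explicitly OR-ed into the masks afterwards will be performed.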

If both the sei_wr_mask and sei_rd_mask fields are zero then ioctl(SG_SET_GET_EXTENDED) does nothing. If those fields are non-zero then they should contain one or more of the following mask values OR-ed together. The field names of struct sg_extended_info are shown in italics:



SG_SET_GET_EXTENDED

sei_wr_mask and sei_rd_mask values

Associated field(s)


Notes [fd: file descriptor given as 1st arg to ioctl(2)]

SG_SEIM_CHG_SHARE_FD [0x40]

share_fd 'read before write' [rbw]

when written, this is only valid if fd is the master side of a share. If so share_fd replaces the prior slave fd (which is the value read back) so that share_fd becomes the new slave side of a fd share.

SG_SEIM_CTL_FLAGS [0x1]

ctl_flags, ctl_flags_wr_mask and ctl_flags_rd_mask

three fields in a sg_extended_info object are associated with this variant of the ioctl(2): a value mask, a write mask and a read mask. The mask values are the SG_CTL_FLAGM_* values shown in a following table.

SG_SEIM_MINOR_INDEX [0x10]

minor_index 'read only' [ro]

when read places the minor number of the sg device that this fd is associated with in minor_index . For example after open(2)-ing "/dev/sg3" that fd should place 3 in the minor_index field.

SG_SEIM_READ_VAL [0x2]

read_value 'read after write' [raw]

when a known value (see SG_SEIRV_* entries in table below) is written to read_value then after this ioctl(2) the corresponding value will be in the read_value field. For this action, SG_SEIM_READ_VAL should be OR-ed into both sei_wr_mask and sei_rd_mask fields.

SG_SEIM_RESERVED_SIZE [0x4]

reserved_sz [raw]

when written, this fd's reserve request's data buffer will be resized to reserved_sz bytes. The given value may be trimmed down by system limits. When read, the actual size of this fd's (resized) data buffer will be placed in reserved_sz when this ioctl(2) completes. So when both written and read, this ioctl(2) is very similar to ioctl(SG_SET_RESERVED_SIZE) combined with ioctl(SG_GET_RESERVED_SIZE) .

SG_SEIM_SGAT_ELEM_SZ [0x80]

sgat_elem_sz [rbw]

when the driver builds a scatter gather list for a request's data buffer a fixed element size is used which is a power of 2 and greater than or equal to the machine's page size (often 4 KB). The default size is currently 32 KB (2**15). When written, sgat_elem_sz will replace the prior element size. When read the prior element size is placed in sgat_elem_sz . Affects future requests on this fd that use data-in or data-out.

SG_SEIM_SHARE_FD [0x20]

share_fd [rbw]

when written, a shared fd relationship is set up by this ioctl(2). The fd that is the first argument of the ioctl(2) should be the future slave (i.e. the WRITE side of a copy) and share_fd identifies the future master. Neither fd can already be part of a share. When read (read before write), if successful share_fd should yield 0xffffffff which indicates (internally) both fds were not previously part of a share.

When read, but not written, then share_fd will yield: 0xffffffff (-1) if the first argument is not part of a share; 0xfffffffe (-2) if the first argument is the master side of a share; or the master's fd if the first argument is the slave side of a share.

SG_SEIM_TOT_FD_THRESH [0x8]

tot_fd_thresh [raw]

By default, a limit of all data buffers that can be active on a fd is set at 16 MB. A request that tries to exceed this will be rejected with an errno of E2BIG. The default can be changed by writing to tot_fd_thresh . A value of 0 is taken as unlimited.
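
The three kinds of value that share_fd can yield when SG_SEIM_SHARE_FD is read (but not written) can be told apart with a small helper. This is only a sketch based on the values given in the table above; the enum and the function name classify_share_fd are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the three cases described for SG_SEIM_SHARE_FD */
enum sg_share_state {
        SG_SHARE_NONE,    /* share_fd read back as 0xffffffff (-1) */
        SG_SHARE_MASTER,  /* share_fd read back as 0xfffffffe (-2) */
        SG_SHARE_SLAVE    /* share_fd read back as the master's fd */
};

/* Interpret share_fd after a read-only SG_SEIM_SHARE_FD action. If the fd
 * is the slave side and master_fd is not NULL, the master's fd is stored
 * through master_fd. */
static enum sg_share_state classify_share_fd(uint32_t share_fd, int *master_fd)
{
        if (share_fd == UINT32_C(0xffffffff))
                return SG_SHARE_NONE;
        if (share_fd == UINT32_C(0xfffffffe))
                return SG_SHARE_MASTER;
        if (master_fd != NULL)
                *master_fd = (int)share_fd;
        return SG_SHARE_SLAVE;
}
```

The caller would pass sei.share_fd from a successful ioctl(SG_SET_GET_EXTENDED) in which only sei_rd_mask had SG_SEIM_SHARE_FD set.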


An example follows of changing the scatter gather list element size to 64 KB and reading prior value:

        sei.sei_wr_mask |= SG_SEIM_SGAT_ELEM_SZ;
        sei.sei_rd_mask |= SG_SEIM_SGAT_ELEM_SZ;
        sei.sgat_elem_sz = 64 * 1024;
        if (ioctl(sg_fd, SG_SET_GET_EXTENDED, &sei) < 0) {
                err = errno;
                goto error_processing;
        }
        prev_sgat_elem_sz = sei.sgat_elem_sz;
        /* success */
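
Because the element size must be a power of 2 and no smaller than the page size, an application might check a candidate value before handing it to the driver. The following is a minimal sketch of such a check. The function name sgat_elem_sz_ok is hypothetical, and how the driver itself reacts to an unsuitable value is not covered here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pre-check for a value to be written to sgat_elem_sz.
 * page_sz would typically come from sysconf(_SC_PAGESIZE). */
static bool sgat_elem_sz_ok(uint32_t elem_sz, uint32_t page_sz)
{
        /* a power of 2 has exactly one bit set */
        bool is_pow2 = (elem_sz != 0) && ((elem_sz & (elem_sz - 1)) == 0);

        return is_pow2 && (elem_sz >= page_sz);
}
```

For example, with a 4 KB page size both 32 KB (the default) and 64 KB pass this check while 2 KB and 48 KB do not.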

The ctl_flags field can be viewed as 32 boolean (i.e. 1 bit) fields. If both the ctl_flags_wr_mask and ctl_flags_rd_mask fields are zero then ioctl(SG_SET_GET_EXTENDED) does nothing with the ctl_flags field. All three fields should contain one or more of the following mask values OR-ed together:

SG_SET_GET_EXTENDED

ctl_flags, ctl_flags_wr_mask + ctl_flags_rd_mask values

Type


Notes

SG_CTL_FLAGM_IS_MASTER [0x40]

read-only [ro]

when read, if set implies this fd is part of a file share and this is the master side

SG_CTL_FLAGM_IS_SHARE [0x20]

read-only [ro]

when read, if set implies this fd is part of a file share.

SG_CTL_FLAGM_MASTER_ERR [0x200]

[ro]

when read, if set implies the master's request has completed with a non-zero SCSI status or other driver error. In this case the shared request state is terminated (i.e. the slave side will not be able to issue an associated slave request). This may be used either on the master's or slave's fd

SG_CTL_FLAGM_MASTER_FINI [0x100]

[ro]

when read, if set implies the master's request has completed and is waiting for the slave request to start. This may be used either on the master's or slave's fd

SG_CTL_FLAGM_MORE_ASYNC [0x800]

[rbw]

The blk_get_request() call can still block in standard async mode. When this is written to 1 (true) that call is made non-blocking and SG_IOSUBMIT will yield EBUSY in the case where it would have blocked.

SG_CTL_FLAGM_NO_DURATION [0x400]

[rbw]

when written to 1 (true) instructs the driver not to calculate command duration. This saves two ktime_get_boottime() calls per command. The default (and when 0 is written) is to always calculate command/request duration.

SG_CTL_FLAGM_ORPHANS [0x8]

[ro]

when read, if set implies there is one or more orphaned commands/requests associated with this fd.

SG_CTL_FLAGM_OTHER_OPENS [0x4]

[ro]

when read, if set implies there are other sg driver open(2)s active on this sg device.

SG_CTL_FLAGM_Q_TAIL [0x10]

read-after-write [raw]

when written, set causes the following commands/requests on this fd to be queued to the block layer at the tail of its queue; clear causes them to be queued at head (the default). Each v3 and v4 command can use the SG_FLAG_Q_AT_TAIL or SG_FLAG_Q_AT_HEAD OR-ed into the flags field to override this setting.

SG_CTL_FLAGM_TAG_FOR_PACK_ID [0x2]

[raw]

when written, set causes the following commands/requests on this fd to use the tag field rather than pack_id (or sg_io_v4::request_extra)

SG_CTL_FLAGM_TIME_IN_NS [0x1]

[raw]

when written, set causes the duration of the following commands/requests on this fd to be calculated in nanoseconds; clear causes duration calculations to be done in milliseconds which is the default.

SG_CTL_FLAGM_UNSHARE [0x80]

[w, rd-->0]

this will undo the share relationship between a master fd and a slave fd. It can be sent to either fd. If a shared command/request is active using either fd then this ioctl(2) will fail with an errno of EBUSY. If no share relationship exists for the given fd this ioctl(2) will return 0 and do nothing.



For example to set command duration time to nanoseconds, the following snippet of code could be used. It is assumed that sei is an object of type struct sg_extended_info and that it has been zeroed out:

        sei.sei_wr_mask |= SG_SEIM_CTL_FLAGS;
        sei.ctl_flags_wr_mask |= SG_CTL_FLAGM_TIME_IN_NS;
        sei.ctl_flags |= SG_CTL_FLAGM_TIME_IN_NS;
        if (ioctl(sg_fd, SG_SET_GET_EXTENDED, &sei) < 0) {
                err = errno;
                goto error_processing;
        }
        /* success */
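
Once the time unit can differ between file descriptors, code that reports durations needs to know which unit applies. The following sketch normalizes a duration value (as found in the duration field of a v3 or v4 response) to nanoseconds. The function name sg_duration_to_ns is hypothetical, and the caller is assumed to remember whether SG_CTL_FLAGM_TIME_IN_NS was set on the fd:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: convert a response's duration to nanoseconds.
 * time_in_ns should be true if SG_CTL_FLAGM_TIME_IN_NS is set on the fd,
 * false for the default unit of milliseconds. */
static uint64_t sg_duration_to_ns(uint32_t duration, bool time_in_ns)
{
        if (time_in_ns)
                return (uint64_t)duration;
        return (uint64_t)duration * UINT64_C(1000000);
}
```

Note that a 32 bit count of nanoseconds wraps after a little over 4 seconds, so an application timing long running commands may prefer to leave the default unit in place.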

Finally it was noticed that there are many more "interesting" values to read from the driver (e.g. about its state) than values to write to the driver. So rather than potentially fill struct sg_extended_info with 32 bit values that are only read, the read_value field was introduced. One of the following constants is written to the read_value field, then the associated value can be read from the same field when the ioctl(2) finishes successfully.

SG_SET_GET_EXTENDED

value written to read_value

scope

Notes

SG_SEIRV_BOOL_MASK [0x1]

fd

with read_value set to SG_SEIRV_BOOL_MASK, after ioctl(SG_SET_GET_EXTENDED{SG_SEIM_READ_VAL}) read_value has a 32 bit mask of bit positions that are used in ctl_flags (and ctl_flags_wr_mask and ctl_flags_rd_mask). That value is currently 0xfff .

SG_SEIRV_DEV_FL_RQS [0x4]

SCSI_device

sum of the number of free list requests on each fd belonging to the SCSI device (e.g. an SSD) that owns the given fd.

SG_SEIRV_DEV_SUBMITTED [0x6]

SCSI_device

sum of number of active list elements, excluding those associated with synchronous (blocking) invocation, on each fd belonging to the SCSI device that owns fd given as the first argument to the ioctl(2)

SG_SEIRV_FL_RQS [0x3]

fd

number of "inactive" request objects currently on this fd's free list. When there are no active commands/requests, this value should be 1 and that entry should be this fd's reserved request (waiting for a user request to commence).

SG_SEIRV_INT_MASK [0x0]

fd

after ioctl(2) read_value has a 32 bit mask of bit positions that are used in sei_wr_mask and sei_rd_mask . That value is currently 0xff .

SG_SEIRV_SUBMITTED [0x5]

fd

after ioctl(2), read_value has a 32 bit integer which is the number of requests on the active list; this includes all submitted non-blocking requests that have not yet been completed and read (and hence placed on the free list). So this includes requests that are inflight. ioctl(SG_GET_NUM_WAITING) is similar but it does not include inflight requests. This ioctl(2) holds no locks in the sg driver and accesses an atomic integer. So it is fast and should never block, making it suitable for polling. In the presence of other producers or consumers the number submitted may change before a user has time to act on the result of this call.

SG_SEIRV_VERS_NUM [0x2]

driver

after ioctl(2) read_value has a 32 bit integer whose latter digits when seen in decimal are in the form [x]xyyzz . [x] means blank (space) if zero. This is usually expressed as an ASCII string as '[x]x.[y]y.zz' .



For example, to find out the number of commands/requests submitted (but not yet finished) on the device (e.g. /dev/sg3) associated with file descriptor sg_fd:

        sei.sei_wr_mask |= SG_SEIM_READ_VAL;
        sei.sei_rd_mask |= SG_SEIM_READ_VAL;
        sei.read_value = SG_SEIRV_DEV_SUBMITTED;
        if (ioctl(sg_fd, SG_SET_GET_EXTENDED, &sei) < 0) {
                err = errno;
                goto error_processing;
        }
        tot_num_submitted = sei.read_value;

Note that this number only counts non-blocking requests submitted through the sg driver. If, for example, /dev/sdc and /dev/sg3 were the same device then it doesn't count any requests that might be submitted by the sd driver through /dev/sdc .
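
The integer returned for SG_SEIRV_VERS_NUM can be turned into the usual dotted string with a little arithmetic. The following is a minimal sketch assuming the [x]xyyzz layout described in the table above; the function name sg_vers_to_str is hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a version number of the form [x]xyyzz as
 * '[x]x.[y]y.zz'. Leading zeros are dropped from the first two parts but
 * kept in the last part. Returns what snprintf() returns. */
static int sg_vers_to_str(uint32_t vers, char *buf, size_t buf_len)
{
        unsigned int major = (unsigned int)(vers / 10000);
        unsigned int minor = (unsigned int)((vers / 100) % 100);
        unsigned int rev = (unsigned int)(vers % 100);

        return snprintf(buf, buf_len, "%u.%u.%02u", major, minor, rev);
}
```

With this layout 30536 becomes "3.5.36" and 40031 becomes "4.0.31".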

15 Downloads and testing

This tarball: sgv4_20190629 contains one directory: lk5.2 . This directory contains a patchset with 32 patches. Patches 0001 through 0018 were posted to the linux-scsi list on 20190616; its cover letter was titled "[PATCH 00/18] sg: add v4 interface". The remaining patches (i.e. 0019 through 0032) are the extension alluded to in that post. That directory also contains the 3 files that represent the sg v4 driver in the kernel: drivers/scsi/sg.c and include/scsi/sg.h and include/uapi/scsi/sg.h . The last file is new (i.e. it is not found in the production (v3) sg driver). If those 3 files are copied into the corresponding locations in a kernel source tree then a subsequent kernel build will generate the sg v4 driver. It might be a good idea to take a copy of drivers/scsi/sg.c and include/scsi/sg.h before copying those files to simplify reverting to the sg v3 driver currently in the kernel.

The patches are against Martin Petersen's 5.3/scsi_queue branch (the part under lk5.2). If these patches are to be applied to earlier kernels, the following recent changes to the sg driver need to be taken into account:

203cd55914857 (Christoph Hellwig 2019-05-01) a small patch to add a SPDX-License-Identifier; appeared in lk 5.2-rc1

96d4f267e40f9 (Linus Torvalds 2019-01-03 18:57:57 -0800) access_ok() [3 -->2 function arguments] appeared in lk 5.0-rc1

92bc5a24844ad (Jens Axboe 2018-10-24 13:52:28 -0600) remove double underscore version of blk_put_request(), appeared in lk 5.0-rc1

abaf75dd610cc (Jens Axboe 2018-10-16 08:38:47 -0600) blk_put_request(srp->rq) addition, first appeared in lk 4.20-rc1

The sg driver patch prior to that was 8e4a4189ce02f (Tony Battersby 2018-07-12), first appeared in v4.18-rc8

The sg3_utils package was originally written to test the v3 sg driver interface when it was introduced, circa 2000. So where better to put sg v4 test code? Since sg3_utils is well established, the author sees no benefit in introducing a sg4_utils package in which less than an estimated 5% of the code would change; it is much easier to incorporate that code change/addition in the existing package. The latest sg3_utils beta on the main page (revision 827 (a beta of version 1.45) as this is written) contains utilities for testing the sg v4 interface. The underlying support library has been using the sg v4 header for many years as a common (i.e. intermediate) format (API). If the given device was a bsg device node then the sg v4 interface was used; otherwise (e.g. for sg and block devices) the sg v4 header was translated down to a v3 header and forwarded on. In the current beta, sg3_utils will use ioctl(SG_GET_VERSION_NUM) on sg devices and if it is a v4 driver then it will send a v4 header, otherwise it will do as it does now. [That v4 interface usage can be defeated by './configure --disable-linux-sgv4' .]

The presence of the environment variable SG3_UTILS_LINUX_NANO (typically with 1 assigned into it) in the shell executing sg3_utils package utilities will cause the elapsed time of SCSI commands to be calculated in nanoseconds if the v4 sg driver is active. Typically command times are only shown when the --verbose option is given (or several of them). The duration is measured starting from the point the sg driver sends a command to the block layer to the point when the sg driver receives a (soft) interrupt indicating that command has finished. Note that user space measures of a command duration should always be greater than the duration the sg driver calculates. Most of the test utilities in the next paragraph also act on SG3_UTILS_LINUX_NANO .

In the testing directory of that beta are several utilities that are "v4" driver aware:

These test utilities are not built by default since they are not part of the automake setup; instead an old school Makefile in the testing directory is used. And sg_tst_async and sgh_dd are C++ programs and can be built with 'make -f Makefile.cplus' . Prior to building these test utilities the sg3_utils library needs to be built. That can be done with 'cd <root_of_sg3_utils> ; ./configure ; cd lib ; make ; cd ../testing' . There is a 'make install' which will place the C test utilities in /usr/local/bin ; there is also a 'make -f Makefile.cplus install' for placing the C++ utilities in /usr/local/bin .

The fio utility has an ioengine for the sg driver (e.g. '--ioengine=sg --filename=/dev/sg1'). It supports the sg v3 interface, both async (via write(2) and read(2)) and sync (i.e. blocking via ioctl(SG_IO)). A patch was sent to the linux-block and linux-scsi lists titled: "fio: add sgv4 engine" on 20190706 to add a new io engine called "sgv4". It is modelled on the existing "sg" engine but unlike the sg engine, the "sgv4" engine uses the sg v4 interface. The "sgv4" engine can be used with the bsg device nodes (because the bsg driver uses the sg v4 interface) and sg device nodes if the patchset described above is applied. Only patches 0001 through 0018 (as sent to the linux-scsi list on 20190616) are needed.

Since the sg v4 driver may or may not be present in the kernel that the above utilities are built and run in, a local copy of the new <kernel_src>/include/uapi/scsi/sg.h header needed for the sg v4 driver is kept in the testing directory. It has the name 'uapi_sg.h' so it won't collide with the "real" header if it is present.

16 Sg driver and the block layer

One might think that a SCSI pass-through such as the sg driver would inject user supplied SCSI commands and associated data into the SCSI mid-level which would be routed down through a SCSI low level driver (LLD) to the (virtual) SCSI Host Bus adapter (HBA) and onto a SCSI device (e.g. a SAS disk). So that path doesn't involve the Linux block layer, right? Wrong, in Linux those commands are injected into the block layer as pass-through commands and of course the block layer won't interfere with them before forwarding them to the SCSI mid-level, surely? Wrong again. If the SCSI device being accessed has reached its queue limit or the HBA that the command passes through has run out of resources then a SCSI pass-through driver should tell this to the user in the author's opinion. Not in Linux. The block layer treats the SCSI pass-through as another disk user (even if the SCSI device is a tape unit which is not a block device) and queues up the injected commands assuming the resource problem is temporary. Well that temporary problem may need a SCSI administrative command or task management function issued from the user space to fix. However in Linux a user program may need to resort to desperate methods such as resetting the logical unit (LU), target, HBA or even rebooting the machine to clear those other commands in the block layer's stalled queue.

When using the sg_tst_async test utility with an option favouring submissions over completions (e.g. 'sg_tst_async --qfav=2'), the block layer's queue can cause problems. For long running tests the number of non-blocking requests a thread has outstanding will grow without bound, finally invoking the OOM ("out of memory") killer. And the OOM killer isn't particularly accurate, in tests it only killed the culprit process about 60% of the time. The OOM killer can be configured to bring down (i.e. reboot) the machine and that may lead to more predictable outcomes than letting it kill the process it thinks is the culprit. And it can't kill the block layer :-) The '--override=OVN' option has been added to the sg_tst_async utility in order to put an upper limit on how large a queue can become. Note that this was not a problem with the V3 sg driver since it limited the number of outstanding commands to 16 on each file descriptor.
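
An application that favours submissions over completions can protect itself by capping its own number of outstanding requests. The following is a minimal, single threaded sketch of such a cap. It is not how sg_tst_async implements '--override=OVN'; all names are hypothetical and a multi-threaded program would need atomics or a lock around the counter:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user space gate on the number of outstanding requests */
struct submit_gate {
        unsigned int in_flight;  /* submitted but not yet received */
        unsigned int limit;      /* upper bound chosen by the application */
};

/* Call before ioctl(SG_IOSUBMIT); returns false if the cap is reached, in
 * which case the caller should receive a response before submitting.
 * If the submit itself then fails, gate_complete() undoes the count. */
static bool gate_try_submit(struct submit_gate *g)
{
        if (g->in_flight >= g->limit)
                return false;
        g->in_flight++;
        return true;
}

/* Call after a successful ioctl(SG_IORECEIVE) */
static void gate_complete(struct submit_gate *g)
{
        assert(g->in_flight > 0);
        g->in_flight--;
}
```

Setting limit to 16 would approximate, from user space, the per file descriptor bound that the v3 driver imposed.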

17 Other documents

The original sg driver documentation is here: SCSI-Generic-HOWTO and a more recent discussion of ioctl(SG_IO) is here: sg_io .

18 Conclusion

The sg v4 driver is designed to be backwardly compatible with the v3 driver. The simplest way for an application to find which driver version it has is with ioctl(SG_GET_VERSION_NUM). Removing a restriction such as 16 outstanding commands per file descriptor can catch out programs that rely on hitting that limit. If the need arises, driver parameters to re-impose that limit and any other differing behaviour can be added. The best way to test backward compatibility is to place this new driver "under" existing apps that use sg driver nodes and check their functionality.
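
A version check of that kind can be reduced to one comparison. This is a sketch that assumes the version number has already been fetched with ioctl(SG_GET_VERSION_NUM) and that v4 drivers report 40000 or higher, in line with the 4.0.x numbering used on this page; the function name sg_driver_is_v4 is hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper. version_num is the int filled in by
 *     ioctl(sg_fd, SG_GET_VERSION_NUM, &version_num)
 * Assumption: 3.5.36 is reported as 30536 and 4.0.x as 400xx. */
static bool sg_driver_is_v4(int version_num)
{
        return version_num >= 40000;
}
```

An application could use the result to choose between sending a v4 header and falling back to the v3 interface.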


Douglas Gilbert

Last updated: 16th July 2019 11:00