The Linux SG driver version 4.0



1 Introduction

2 Changes to sg driver between version 3.5.36 and 4.0

3 Architecture of the sg driver

4 Synchronous usage

5 Sharing file descriptors

6 Async usage in v4

7 Request sharing

7.1 Slave waiting

8 Sharing design considerations

9 Multiple requests

9.1 Single/multiple (non-)blocking requests

10 pack_id or tag

11 Bi-directional command support

12 SG interface support changes

13 IOCTLs

14 Downloads and testing

15 Sg driver and the block layer

16 Other documents

17 Conclusion


1 Introduction

The SCSI Generic (sg) driver has been present in Linux since version 1.0 of the kernel in 1992. In the 26 years since then the driver has had three interfaces to the user space and now a fourth is being added. The first and second interfaces (v1 and v2) use the same header, 'struct sg_header', with only v2 now fully supported. The "v3" interface is based on 'struct sg_io_hdr'. Both these structures are defined in include/scsi/sg.h, the bulk of whose contents will move to include/uapi/scsi/sg.h as part of this upgrade. Prior to the changes now proposed, the "v4" interface was only implemented in the block layer's bsg driver ("block SCSI generic", which is around 15 years old). The bsg driver's user interface is found in include/uapi/linux/bsg.h. These changes propose adding support for the "v4" interface via the SG_IO ioctl(2) for synchronous use, and the new SG_IOSUBMIT and SG_IORECEIVE ioctl(2)s for asynchronous/non-blocking use. The plan is to deprecate and finally remove (or severely restrict) the write(2)/read(2) based asynchronous interface currently used by the v1, v2 and v3 interfaces. The v3 asynchronous interface is supported by the SG_IOSUBMIT_V3 and SG_IORECEIVE_V3 ioctl(2)s.

If the driver changes are accepted, the driver version, which is visible via an ioctl(2), will be bumped from 3.5.36 (in lk 5.0) to 4.0.x. The opportunity is being taken to clean up the driver after 20 years of piecemeal patches. Those patches have left the driver with misleading variable names and nonsensical comments. Plus there are new kernel facilities that the driver can take advantage of. Also of note is that much of the low level code once in the sg driver (remnants remain) has been moved to the block layer and SCSI mid-level. This upgrade has been done as a two stage process: first clean the driver up, remove some restrictions and re-instate some features that had been accidentally lost. Three versions of a patchset were sent to the linux-scsi list in October 2018; that patchset took the sg driver to version 3.9.01. Now the v4 interface is supported as described here, so the sg driver version number has been bumped to 4.0.0811.

Note that the Linux block layer implements the synchronous sg v3 interface via ioctl(SG_IO) on all block devices that use the SCSI subsystem, either directly or via translation (e.g. SATA disks use libata which implements the SAT T10 standard). So, in pseudocode, an example like 'ioctl(open("/dev/sdc"), SG_IO, ptr_to_sg_io_hdr)' works as expected. That implementation is not the sg driver's, so it is important that the sg driver's implementation of ioctl(SG_IO) remains consistent with the other implementations (mainly the one found in the block/scsi_ioctl.c kernel source file).

A tarball with patches and driver source files for recent Linux kernel versions can be found in the Downloads section.

2 Changes to sg driver between version 3.5.36 and 4.0

A summary is given as bullet points:

There are still some things to do:

3 Architecture of the sg driver

Nothing much has changed in the overall architecture of the sg driver between version 3 (v3) and version 4 (v4). Having a pictorial summary of the driver's object tree may help later explanations:


[diagram: the sg driver's object tree: the driver at the top, sg_device objects below it, then sg_fd objects, then sg_request objects]

The sg driver is shown as a laptop at the top of the object tree. The arrow end of each solid line shows an object that is created automatically or by actions outside the user interface to the sg driver. So the disk-like objects created at the second level come from the device scanning logic undertaken by the SCSI mid-level driver in Linux. Note that there are SCSI devices other than disks, such as tape units and SCSI enclosures. Also note that not all storage devices in Linux use the SCSI subsystem; examples are NVMe SSDs and SD cards that are not attached via USB. The type of the SCSI device objects is sg_device (in the driver code they appear as objects of C type 'struct sg_device'). Even though the sg driver's implementation is closely associated with the block subsystem, the sg driver's device nodes are character devices in Linux (e.g. /dev/sg1), also known as character special devices.

At the third level are file descriptors which the user creates via the open(2) system call (e.g. 'fd = open("/dev/sg1", O_RDWR);'). Various other system calls such as close(2), write(2), read(2), ioctl(2) and mmap(2) can use the file descriptor made by open(2). The file descriptor will stay in existence until the process containing the code that opened it exits or the user closes it (e.g. 'close(fd);'). A dotted line is shown from the "owning" device to each file descriptor to indicate that it was created by direct user action via the sg interface. The type of the file descriptor objects is sg_fd. BTW most system calls have "man pages" and the form open(2) indicates that there is a manpage in section 2, which is for system calls. Other common manpage sections are "1" for commands and utilities (e.g. 'man 1 cp' explaining the copy command), "3" for system libraries (e.g. 'man 3 snprintf') and "8" for system administration commands.

At the lowest level are the sg_request objects, each of which carries a user-provided SCSI command to the target device which is its grandparent in the object tree. These requests are sent via the block layer and SCSI mid-level to a Low Level Driver (LLD) and then across the transport (with iSCSI that can be a long way) to the target device (e.g. an SSD). User data that moves in the same direction as the request is termed "data-out"; the SCSI WRITE command is an example. In nearly all cases (an exception is a command timeout) a response traverses the same route as the request, but in the reverse direction. Optionally it may be accompanied by user data, termed "data-in"; the SCSI READ command is an example. Notice that a heavy (thicker) line is associated with the first request of each file descriptor; it points to a reserve request (in earlier sg documentation this was referred to as the reserve buffer). That reserve request is built after each file descriptor is created, before the user has a chance to send a SCSI command/request on that file descriptor. The reserve request was originally created to make sure CD writing programs didn't run out of kernel memory in the middle of a "burn". That is no longer a major concern but the reserve buffer has found other uses: for mmap-ed and direct IO. So when the mmap(2) system call is used on a sg device, it is the associated file descriptor's reserve request's buffer that is being mapped into the user space.

The lifetime of sg_request objects is worth noting. When a sg_request object is active ("inflight" is the term used in the driver) it has both an associated block request and a SCSI mid-level object. They have similar roles and overlap somewhat. However once the response is received (and typically before the user has seen that response or any "data-in") the block request and the SCSI mid-level objects are freed up. The sg_request object lives on, along with the data carrying part of the block request (called the bio), as that may be carrying "data-in" that has yet to be delivered to the user space. That is because the default user data handling (termed "indirect IO") is a two stage process: for data-in, the data is first DMA-ed from the target device into kernel memory, typically under the control of the LLD; the second stage copies from that kernel memory to the user space, under the control of this driver. Even after the user has fetched the response and any data-in, the sg_request continues to live. [However once any data-in has been fetched the block request's bio is freed.]

The sg_request object is then marked "inactive" and placed on a sg_request free list, one of which is maintained for each file descriptor. So each sg file descriptor contains two request lists: one for commands that are active and a free list for inactive requests (there is an exception). The next time a user tries to send a SCSI command through that file descriptor, its free list will be checked to see if any inactive sg_request object has a large enough data buffer suitable for the new request; if so, that object will be (re-)used for the new request. Only when the user calls close(2) on that file descriptor will all the requests on the free list be truly freed.

Note that in Unix, and thus Linux, the OS guarantees that close(2) (called release() in the kernel and sg_release() in this driver) will be invoked for every file descriptor that a process has opened, irrespective of what the code in that process does. This is important because processes can be shut down by signals from other processes or drivers, segmentation violations (i.e. bad code) or the kernel's OOM (out-of-memory) killer.

The above description is setting the stage for a newly added feature called "sharing" introduced in the sg v4 driver. It also uses the reserve request.

4 Synchronous usage

These two forms, ioctl(sg_fd, SG_IO, ptr_to_v3_obj) and ioctl(sg_fd, SG_IO, ptr_to_v4_obj), can be used for submitting SCSI commands (requests) and waiting for the response before returning to the calling thread. This action is termed synchronous or blocking in this driver. In Linux most block devices that use or can translate the SCSI command set also support the first form (i.e. the ioctl(2) that takes a pointer to a v3 interface object as its third argument). So the pseudocode ioctl(open("/dev/sdc"), SG_IO, ptr_to_v3_obj) will work, but not if the third argument is a ptr_to_v4_obj. Some storage related character devices (e.g. /dev/st2 and /dev/ses3) will also accept the first form. A minimal example of the first form is sketched below.
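The following sketch issues a 6 byte INQUIRY command through the v3 interface; the device name /dev/sg0 and the 96 byte allocation length are only examples:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(void)
{
    unsigned char cdb[6] = {0x12, 0, 0, 0, 96, 0};  /* INQUIRY cdb */
    unsigned char inq_buf[96], sense_buf[32];
    struct sg_io_hdr hp;
    int sg_fd = open("/dev/sg0", O_RDWR);   /* example device name */

    if (sg_fd < 0) { perror("open"); return 1; }
    memset(&hp, 0, sizeof(hp));
    hp.interface_id = 'S';                  /* marks this as a v3 object */
    hp.cmd_len = sizeof(cdb);
    hp.cmdp = cdb;
    hp.dxfer_direction = SG_DXFER_FROM_DEV; /* data-in command */
    hp.dxfer_len = sizeof(inq_buf);
    hp.dxferp = inq_buf;
    hp.mx_sb_len = sizeof(sense_buf);
    hp.sbp = sense_buf;
    hp.timeout = 20000;                     /* in milliseconds */
    if (ioctl(sg_fd, SG_IO, &hp) < 0)       /* blocks until completion */
        perror("ioctl(SG_IO)");
    else if (hp.info & SG_INFO_CHECK)
        fprintf(stderr, "non-zero SCSI/driver/transport status\n");
    close(sg_fd);
    return 0;
}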

Only two drivers currently support the second form (i.e. whose third argument is a ptr_to_v4_obj): this driver and the bsg driver.

It is important to understand that the use of ioctl(SG_IO) is only synchronous seen from the perspective of the calling thread/task/process. It is only the calling thread that waits for completion of the request. Any other thread or process submitting requests to the same or other devices associated with the sg driver will not be impeded by that wait. This assumes that the underlying devices can queue SCSI commands which most current SCSI devices are capable of doing. As an example: a large copy between two storage devices can be broken down into multiple copy segments, with each copy segment copying a comfortable amount of data (say 1 MByte); then multiple threads can each take a copy segment from a pool and fulfil them by doing a READ then a WRITE SCSI command. Each READ/WRITE pair of commands seems synchronous but overall the threads are doing asynchronous READs and WRITEs with respect to one another.

Apart from some special cases (one shown below), it isn't generally useful to mix synchronous and asynchronous commands/requests on the same thread. An asynchronous (i.e. non-blocking) command/request could be submitted, followed by a second, synchronous command which will go through to completion before it returns; then the first command's completion can be fetched. Care is taken within the driver so that an asynchronous completion, even if it is pending, will not be incorrectly supplied as the result of a synchronous command.

The simplest way to issue SCSI commands to any device is with a synchronous ioctl(SG_IO). Asynchronous commands have some advantages (mainly performance) but those come at the expense of more complexity for the user application. When a program is juggling multiple asynchronous submissions and completions it needs to track either a pack_id, a tag or a user pointer to correctly match completions with submissions. Since the sg driver maintains strong per file descriptor context, one way to simplify the matching problem is to have one file descriptor per submission/completion pair. However then multiple file descriptors need to be juggled, which is not too onerous.


[diagram: a synchronous ioctl(SG_IO), showing the inflight sg_request object and the waiting user thread]

In the diagram above a synchronous (i.e. blocking) ioctl(SG_IO) is shown. As a general rule the ioctl(2) will return -1, with errno set to a positive value, if there is a problem creating the object of type sg_request shown in the top left of the diagram. Examples of this are syntax errors or contradictory information in the v3 or v4 interface object. Another cause could be running out of resources. Once the sg_request object is "inflight", any errors will be reported via the v3 or v4 interface object. As noted in the diagram the user thread is placed in an interruptible wait state, awaiting command/request completion. If the command takes some time, the user may use a keyboard interrupt (e.g. control-C) or "kill" the containing process from another terminal (e.g. with kill(1)). Another abnormal situation is the kernel's OOM ("out of memory") killer acting. Any of these will cause the shown sg_request object to become an orphan. The default action is to remove orphan sg_request objects as soon as practical. However, if the file descriptor has the "keep orphan" flag set (see ioctl(SG_SET_KEEP_ORPHAN) below), a further read(2) or ioctl(SG_IORECEIVE) will fetch the response information from the orphan, which will then be freed.

The main context that a user space application controls in this driver is the file descriptor, shown as a sg_fd object in the earlier object tree diagram. Roughly speaking, a file descriptor object is created when sg_fd=open(<sg_device_name>) succeeds and is destroyed by close(sg_fd). Again roughly speaking, a file descriptor is confined to a user process. In multi-threaded programs it is often a good idea to have separate sg file descriptors in each thread. Some exceptions to these generalizations are discussed in the next section.

Another feature of the file descriptor object in the sg driver is that each one has a reserve request created at the same time as the file descriptor. This reserve request is immediately placed on the new sg file descriptor's free list. Any new command/request on that file descriptor will use that reserve request if:

When a command request is completed, its sg_request object is placed (or replaced) on the free list. So no sg_request objects are actually deleted until the owning file descriptor is close(2)-d; in the case where there are copies of the file descriptor (e.g. in a forked process) that happens when the last close(2) is done.

5 Sharing file descriptors

First a rationale. Copying data between storage devices is a relatively common operation. It can be both time and resource consuming. The best approach is to avoid copying altogether. Another approach is to defer copies (or part of them) until they are really necessary, which is the basis of COW (i.e. copy on write). Then there are offloaded copies: for example, where the source and destination are disks in the same array, a "third party copy" program (e.g. based on SCSI EXTENDED COPY and its related commands) can tell the array to do the copy itself and report whether it finished successfully. However in many cases copies are unavoidable.

If the dd(1) program is considered, copying one part of a normal block storage device to another storage device involves a surprising number of copies. Copies of large amounts of data are typically done in a staggered fashion to lessen the impact on other things the system may be doing. So typically 1 megabyte (say) is read from the source device into a buffer, followed by a write of that buffer to the destination device; if no error occurs, this is repeated until finished. Transfers between a storage device and kernel memory are typically done by DMA (direct memory access) controlled by the LLDs owning the storage devices. So a further copy is needed on each side to get the data between those kernel buffers and the user space. Moving data between a user space process and the kernel space has a little extra overhead to deal with situations like the process being killed while data is being copied to or from it. So a reasonable implementation of dd(1) has three buffers (2 in kernel space) and performs 2 DMAs then 2 copies between the user space and the kernel space. As storage devices and transports get quicker, the time taken to do those copies may become significant compared to the device access time.

Another aspect of the sharing being proposed is security. Often a user has the right to copy data but not to see it. This is usually accomplished by encrypting the data. Another approach might be to make sure the copied data is kept in kernel buffers and thus hidden from the user who is copying it. While the v4 sg driver can do this, the sg driver is not written with a view to security, since it offers a pass-through interface which, by definition, is a method to circumvent an Operating System. Those building a highly secure computer system might consider removing the sg driver or restricting its access to highly privileged users.

Sharing is a new technique added to the sg v4 driver to speed up copy operations. The user first sets up a sharing relationship between two sg file descriptors: one that will be used for SCSI READ commands (more generally, any data-in SCSI command), and the other that will be used for SCSI WRITE commands using the data received by the previous READ. Any data-out command can be used, so for example the SCSI WRITE command could be replaced by WRITE AND VERIFY or WRITE SCATTERED. The file descriptor that does the READ is called the master side by the driver, and the file descriptor that does the WRITE is called the slave side. The following diagram shows how one share between two file descriptors is set up.


[diagram: a file descriptor share between /dev/sg1 (master side) and /dev/sg2 (slave side)]

Here the master side is /dev/sg1, which has 4 open file descriptors (fds 1 through 4). The slave side is /dev/sg2, which has 3 open file descriptors (fds 5 through 7). The share shown is set up when the thread or process containing fd5 calls the "EXTENDED" ioctl on the fd5 file descriptor (i.e. the ioctl's first argument) with a pointer to an integer containing fd1 as the ioctl's third argument. The C code is a little more complicated than that, as the sketch below suggests.
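In outline that C code might look like the following. It is a sketch only: it assumes the v4 header defines 'struct sg_extended_info' with a write mask bit named SG_SEIM_SHARE_FD and a share_fd field (names inferred from the EXTENDED{SHARE_FD} pseudocode used later on this page); check include/uapi/scsi/sg.h of the v4 driver for the exact spelling.

/* sketch: sei_wr_mask, SG_SEIM_SHARE_FD and share_fd are assumed names */
struct sg_extended_info sei;

memset(&sei, 0, sizeof(sei));
sei.sei_wr_mask = SG_SEIM_SHARE_FD;     /* request a "write" of share_fd */
sei.share_fd = fd1;                     /* the master (READ) side fd */
if (ioctl(fd5, SG_SET_GET_EXTENDED, &sei) < 0)   /* invoked on the slave fd */
    perror("ioctl(SG_SET_GET_EXTENDED) SHARE_FD");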

How does the thread or process containing fd5 know about fd1? That is up to the design of the user space application. If they are both in the same thread then it should be obvious. If they are in different threads within the same process then it should be relatively simple to find out. The interesting case is when they are in different processes. A child process inherits all open file descriptors (including those belonging to the sg driver) from its parent in the Linux fork() system call. For processes that don't have a parent-child relationship, Unix domain sockets can be used to "send" an open file descriptor from one process to another. Note that in this case the file descriptor number might differ (e.g. because the receiving side is already using the same file descriptor number as the sender's) but both will still logically refer to the same thing. Also, the statement above about process termination leading to sg_release() being called for any sg file descriptors open(2)-ed in that process needs qualification: when a file descriptor is held by several processes, it is the termination of the last process holding it that causes the driver's sg_release() to be called. In short, the last close(2) on a file descriptor causes sg_release() to be called.

The sg driver's file descriptors can only be part of one share (pair) each. Given this restriction, in the above diagram, fd5 cannot also be in a share with fd4. fd6 may be in a share with fd7; that would imply that the share could be used for a copy from /dev/sg2 to /dev/sg2. The master side of the share monopolizes that file descriptor's reserve request, hence there can only be one outstanding shared request per pair of shared file descriptors. Given this restriction, one way to do a copy using queued commands is to use POSIX threads. As an example from the above diagram: if 3 copy worker threads were used, the first thread could utilize fd1 and fd5, the second thread could utilize fd3 and fd6, while the last thread could utilize fd4 and fd7. This is what the sgh_dd test utility does (see below).

After a share of two file descriptors is established, command requests can still be sent to both file descriptors in the normal fashion. Only when the new flag SGV4_FLAG_SHARE is given, or OR-ed in with other flags, is request sharing performed. See the 7 Request sharing section below.

6 Async usage in v4

The terms asynchronous and non-blocking are generally used as synonyms in this description. Those terms are related to the Unix file descriptor flags O_ASYNC and O_NONBLOCK, which have more precise meanings and are set in either the open(2) or fcntl(2) system calls. In Unix the O_NONBLOCK flag on a regular file descriptor causes read(2) to return promptly with an EAGAIN errno if there is no data available to be read. This driver's ioctl(SG_IORECEIVE) and read(2) react in the same fashion. However this driver's ioctl(SG_IO) ignores the O_NONBLOCK flag. The O_ASYNC file descriptor flag causes signals to be sent to the process owning the file descriptor whenever something 'interesting' happens (e.g. data arriving) on that file descriptor. When the term asynchronous is used in this description it is more likely referring to non-blocking behaviour rather than enabling signals.

The asynchronous interface in the context of the sg driver means issuing a SCSI command in one operation then at some later time a second operation retrieves the status of that SCSI command. Any data being transferred associated with the SCSI command is guaranteed to have occurred before that second operation succeeds. The synchronous interface can be viewed as combining these two operations into a single system call (e.g. ioctl(SG_IO) ).

The asynchronous interface starts with a call to ioctl(SG_IOSUBMIT) which takes a pointer to the sg v4 interface object. This object includes the SCSI command with data transfer information for either data-in (from device) or data-out (to device). Depending on the storage device accessed (identified by the sg file descriptor given as the first argument to the ioctl() system call) the SCSI command will take milliseconds or microseconds to complete. Chances are the ioctl(SG_IOSUBMIT) will complete in a sub-microsecond timescale (on a modern processor) and that will be done before the SCSI command completes. If further processing depends on the result of that SCSI command then the program must wait until that SCSI command is complete. When that completion occurs, the data-out is guaranteed to be on the nominated storage device (or in its cache). And if a data-in transfer was specified, that data is guaranteed to be in the user space as directed. How does the program find out when that SCSI command has completed?

The exact timing of the data-out and data-in transfers can be thought of as a negotiation between the HBA (Host Bus Adapter, controlled by the LLD) and the storage device. The essential point is that the data transfer and the completion are asynchronous to the program that requested the SCSI command. Since the completion is guaranteed to follow any associated data transfer, the completion event is what we will concentrate on. Detecting asynchronous events depends on Operating System features such as signals and polling. Polling is the simpler technique. However the simplest approach is to call the final step in the process, ioctl(SG_IORECEIVE), as soon as possible. In the likely case that the SCSI command completion has not occurred, the ioctl(2) can do one of two things: wait until the completion does occur, or yield an "error" called EAGAIN. Similar to SCSI sense data, a UNIX errno doesn't always imply a hard error. EAGAIN is not a hard error; it tells the program that the operation didn't occur but may happen later, so try again, preferably not immediately. What determines whether the ioctl() waits or returns EAGAIN is the presence of the O_NONBLOCK flag on the file descriptor.

Two file descriptor flags are important to the asynchronous interface of the sg driver: O_NONBLOCK and O_ASYNC. The file descriptor flags are defined in such a way that they can be OR-ed together. The normal place to set flags is in the open(2) system call (its second argument) but they can be changed (and added to) later with the fcntl(2) system call. If O_NONBLOCK is used then it will typically be given in the open(2). The O_ASYNC flag is a bit more difficult to handle because it arms the SIGIO (also known as SIGPOLL) signal which, if it occurs before a program has set up a handler for it, will cause the program to exit. Further, Linux ignores O_ASYNC in the open(2) call (see 'man 2 open' in the BUGS section), so fcntl(2) is the only way to set it. Below is a simplified example of adding the O_ASYNC flag to a file descriptor (sg_fd) that is already open:

flags = fcntl(sg_fd, F_GETFL);           /* fetch current file descriptor flags */

fcntl(sg_fd, F_SETFL, flags | O_ASYNC);  /* add O_ASYNC to the existing flags */

It is possible to replace the classic Unix SIGIO signal with a POSIX real-time signal by making an additional call:

fcntl(sg_fd, F_SETSIG, SIGRTMIN + 1);

After that call the SIGRTMIN+1 real time signal will be used instead of SIGIO. Even though you could use hard numbers for the real-time signals, the advice is to always use an offset from SIGRTMIN or SIGRTMAX (a negative offset in the SIGRTMAX case) because the C library can (and does, for its POSIX threads implementation) steal some of the lower real time signals and adjusts the SIGRTMIN value that the application program sees. Real time signals have improved semantics compared to the classic Unix signals (e.g. multiple instances of the same real time signal can be queued against a process where classic Unix signals would meld into one signal event in a similar situation).
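Putting the above together, here is a sketch (standard Linux/POSIX calls; the handler body is only illustrative) that installs a handler for SIGRTMIN+1 and then arms it on an open sg file descriptor:

#define _GNU_SOURCE            /* for F_SETSIG and F_SETOWN */
#include <signal.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

static void sg_rt_handler(int sig, siginfo_t *si, void *uctx)
{
    /* illustrative only: real code would note si->si_fd and wake a loop */
}

void arm_rt_signal(int sg_fd)
{
    struct sigaction sa;
    int flags;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sg_rt_handler;
    sa.sa_flags = SA_SIGINFO;              /* handler receives a siginfo_t */
    sigaction(SIGRTMIN + 1, &sa, NULL);    /* install before arming O_ASYNC */

    fcntl(sg_fd, F_SETOWN, getpid());      /* direct signals to this process */
    fcntl(sg_fd, F_SETSIG, SIGRTMIN + 1);  /* use RT signal instead of SIGIO */
    flags = fcntl(sg_fd, F_GETFL);
    fcntl(sg_fd, F_SETFL, flags | O_ASYNC);   /* now arm signal generation */
}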

In the diagram below the lifetime of an active sg_request object is shown, from when it is created or retrieved from the free list in the top left, to when the SCSI command has completed and the user space has been informed in the bottom right. It assumes that either the O_NONBLOCK flag is set on the file descriptor (assumed to be the same in all the system call boxes shown with the blue band at the top), or that ioctl(SG_IORECEIVE) has SGV4_FLAG_IMMED OR-ed into its flags. When the first ioctl(SG_IORECEIVE) is called the SCSI command has not completed so it gets rejected with EAGAIN. The first poll(2) system call indicates with POLLOUT that another SCSI command can be issued, but that there are no SCSI commands waiting for an ioctl(SG_IORECEIVE) on this file descriptor. Note that the poll(2) description refers to a file descriptor, not this particular sg_request object, but for simplicity we will assume there is only one outstanding SCSI command on this file descriptor. At some future time, preferably long before the command approaches its timeout (often 60 seconds or more), the storage device via its LLD informs the sg driver that a SCSI command belonging to this file descriptor has completed. If O_ASYNC has been set on this file descriptor then the sg driver will issue a SIGIO signal to the owning process. A poll(2) system call after the internal completion point yields (POLLIN | POLLOUT) [IOWs both POLLIN and POLLOUT]. That tells us that the next ioctl(SG_IORECEIVE) will be successful, as is indicated in the diagram.


[diagram: lifetime of an asynchronous request: ioctl(SG_IOSUBMIT), poll(2) calls, then ioctl(SG_IORECEIVE)]

While it is useful to think of, and illustrate, the above mentioned ioctl(2)s and poll(2)s as being in reference to a single sg_request object, they all actually act on the file descriptor that is the parent of that sg_request object. This distinction matters when multiple sg_request objects are outstanding. In the absence of any selection information (e.g. a pack_id or a tag) the ioctl(SG_IORECEIVE) will fetch the oldest sg_request object, since the active (and completed) command list acts as a FIFO. Instead of poll(2) the user may call ioctl(SG_GET_NUM_WAITING), which yields the number of sg_request objects belonging to a file descriptor that have completed internally but are yet to have ioctl(SG_IORECEIVE) [or read(2) for the async v3 interface] called on them. A sketch of the whole asynchronous sequence follows.
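Below is a sketch of that submit/poll/receive sequence using the v4 interface object (struct sg_io_v4 from include/uapi/linux/bsg.h). SG_IOSUBMIT and SG_IORECEIVE are the ioctls proposed by this driver; the function name and its error codes are illustrative:

#include <stdint.h>
#include <string.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <linux/bsg.h>      /* struct sg_io_v4 */
#include <scsi/sg.h>        /* SG_IOSUBMIT, SG_IORECEIVE (proposed v4 header) */

/* submit a data-in command, poll for its completion, then fetch the result */
int issue_async(int sg_fd, unsigned char *cdb, int cdb_len,
                unsigned char *buf, int buf_len)
{
    struct sg_io_v4 h4;
    struct pollfd pfd = { .fd = sg_fd, .events = POLLIN };

    memset(&h4, 0, sizeof(h4));
    h4.guard = 'Q';                          /* marks a v4 interface object */
    h4.request_len = cdb_len;
    h4.request = (uint64_t)(uintptr_t)cdb;
    h4.din_xfer_len = buf_len;               /* data-in (e.g. a READ) */
    h4.din_xferp = (uint64_t)(uintptr_t)buf;
    h4.timeout = 20000;                      /* milliseconds */
    if (ioctl(sg_fd, SG_IOSUBMIT, &h4) < 0)  /* returns almost immediately */
        return -1;
    if (poll(&pfd, 1, -1) < 0)               /* POLLIN: a completion awaits */
        return -1;
    if (ioctl(sg_fd, SG_IORECEIVE, &h4) < 0) /* fetch the completion */
        return -1;
    return (h4.info & SG_INFO_CHECK) ? -2 : 0;   /* -2: check the statuses */
}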

7 Request sharing

Request sharing refers to two requests, usually belonging to different storage devices (e.g. two disks), sharing the same in-kernel data buffer. Before request sharing can take place a share of two file descriptors belonging to those two storage devices needs to be set up. This is discussed in the previous Sharing file descriptors section.

The diagram below shows the synchronous sg driver interface using ioctl(SG_IO) which can take either the v3 or v4 interface. The synchronous interface can be seen as the combination of the various calls that make up the asynchronous interface discussed in the previous section. The time that the synchronous ioctl(SG_IO) takes is directly related to the access time of the underlying storage device. To stress that point the system call rectangles (with a blue band at the top) in the diagram below are shown as elongated rectangles with a beginning component to the left and a completion component to the right. The elongated system call boxes span the access time of the associated storage device.

A request share only takes place when a command request is issued with the SGV4_FLAG_SHARE flag (OR-ed with any other flags). This should be done first on the master side with a READ (like) command request. Other flags that might be combined with it are SG_FLAG_NO_DXFER or SG_FLAG_MMAP_IO (but not both). The SG_FLAG_NO_DXFER flag stops the copy from the in-kernel data buffer to the user space. The SG_FLAG_MMAP_IO flag maps the in-kernel data buffer into the user space; that user space area is made available via a mmap(2) system call preceding the command request being sent. The diagram below shows the simpler case where the minimum number of flags is set. For brevity the leading SGV4_ is removed from the flag values in the following diagrams.


[diagram: request sharing with synchronous ioctl(SG_IO) calls: a master READ followed by a slave WRITE using one in-kernel data buffer]


The slave may continue to send normal command requests but at some stage it should send the corresponding WRITE (like) command request with both the SGV4_FLAG_SHARE and SG_FLAG_NO_DXFER flags set. That will use the in-kernel data buffer from the preceding master share command request and send that data (i.e. data-out) to the slave's device. So a single, in-kernel data buffer is used for a master share request followed by a slave share request.

In the terminology of the block subsystem, both the master and slave share requests have their own request object, each with its own bio object. However the sg driver provides the data storage for those bios and arranges for the slave share request to use the same data buffer as the preceding master request's bio. This is the reason that the slave request must use the SG_FLAG_NO_DXFER flag: otherwise a transfer from the user space, as usually associated with a WRITE (like) command, would overwrite the in-kernel data buffer.

Once the slave request has successfully completed, another master share request may be issued. Sanity checks ensure that using the SGV4_FLAG_SHARE flag on a non-shared file descriptor causes an error, as does trying to send a master share request before a prior master share request is complete (which means its matching slave request has finished). Once a pair of file descriptors is shared, the master side's reserve request will only be used for command requests that have the SGV4_FLAG_SHARE flag set.

If the master share request fails (i.e. gives back any non-zero status, or fails or warns at some other level) then the master request, on completion, will go to the "rs_inactive" state (i.e. not "rs_swap"). Even if the master request succeeds, it is also possible that the application wants to stop the copy (e.g. because the user wants to abort the copy, or there is something wrong with the data copied to the user space near the location marked "***" in the above diagram). The call ioctl(master_fd, EXTENDED{MASTER_FINI}) manipulates a boolean which can be used to finish a shared request after the master request has completed. What is needed here is setting this boolean to 1 (true), which changes the "rs_swap" state to "rs_inactive". The inverse operation, setting that boolean to 0 (false), changes "rs_inactive" to "rs_swap", which is used in the single read, multiple write case below.

The brown arrowed lines in the above diagram show the movement of the "dataset", which is usually an integral number of logical blocks (e.g. each containing 512 or 4096 bytes). The vertical and horizontal brown arrowed lines do not involve copying (or DMA-ing) of that dataset. That leaves three brown arrowed lines at an angle: the DMA from the device being read, the DMA to the device being written, and an optional in-kernel to user space copy (annotated with "***").

A practical single READ, multiple WRITE solution needs the ability to have multiple slaves, each associated with a different disk. Looking at the diagram above, two things need to happen to the master: it needs to adopt a new slave and it needs to get back into the "rs_swap" state. A variant of the above mentioned ioctl(slave_fd, EXTENDED{SHARE_FD},), called ioctl(master_fd, EXTENDED{CHG_SHARE_FD},), has been added. As long as the new slave file descriptor meets requirements (e.g. it is not part of a file descriptor share already) it will replace the existing slave file descriptor. To get back into the "rs_swap" state, writing the value 0 (false) to the MASTER_FINI boolean in the EXTENDED ioctl will do what is needed. The EXTENDED ioctl is a little tricky to use (because it essentially replaces many ioctls) but a side benefit is that multiple actions can be taken by a single EXTENDED ioctl call. So both the actions required to switch to another slave, ready to do another WRITE, can be done with a single invocation of the EXTENDED ioctl.

Here is a sequence of user space system calls to READ from /dev/sg1 (the master) and WRITE that same data to /dev/sg5, /dev/sg6 and /dev/sg7 (the slaves). Assume that fd1 is a file descriptor associated with /dev/sg1, fd5 with /dev/sg5, etc. In pseudocode that might be:

ioctl(fd5, EXTENDED{SHARE_FD}, fd1);
ioctl(fd1, SG_IO, FLAG_SHARE + READ);
ioctl(fd5, SG_IO, FLAG_SHARE|NO_DXFER + WRITE);
ioctl(fd1, EXTENDED{CHG_SHARE_FD=fd6 + MASTER_FINI=false});
ioctl(fd6, SG_IO, FLAG_SHARE|NO_DXFER + WRITE);
ioctl(fd1, EXTENDED{CHG_SHARE_FD=fd7 + MASTER_FINI=false});
ioctl(fd7, SG_IO, FLAG_SHARE|NO_DXFER + WRITE);

So four ioctls move data (one READ and three WRITEs) and three are "housekeeping" ioctls. Notice that the WRITEs are done sequentially; they could theoretically be done in parallel but that would add complexity. Also note that a second READ cannot be done until the final WRITE from the previous sequence has completed; there is no easy way around that since only one in-kernel buffer is being used (and a second READ would overwrite it). To make this sequence slightly faster (and hide the data from the user space) the flags in the second ioctl (the READ) can be expanded to FLAG_SHARE|NO_DXFER.

The sgh_dd utility in the sg3_utils testing directory (rev 803 or later) has been expanded to test the single READ, multiple WRITE feature. It has two extra "of" (output file) operands: "of2=" and "ofreg=". The "of2=" is for a second WRITE sg device, while "ofreg=" takes a regular file or a pipe and uses the data that comes from the READ operation, marked with "***" in the above diagram. If "ofreg=" is present among sgh_dd's operands then the READ's flags will be FLAG_SHARE; if "ofreg=" is not present its flags will be FLAG_SHARE|NO_DXFER. The latter should be slightly faster, and that difference can be reduced with "iflag=mmap". The "of2=" operand shares "oflag=" and "seek=" with "of=".

7.1 Slave waiting

A simple analysis of an asynchronous copy segment cycle, based on the previous section, starts with a READ command being sent to the master's file descriptor [one user-to-kernel space context swap], followed by a signal or a sequence of polls [one or more context swaps], followed by a read(2) or ioctl(SG_IORECEIVE) to get the result [another context swap]. Assuming the response is good, the same sequence is repeated, this time on the slave's file descriptor doing a WRITE. So that is at least six context swaps and, importantly, they must occur in that order. This is what the diagram in the previous section shows, but with synchronous rather than asynchronous calls.

An enhancement has been added to relax the strict ordering outlined in the previous paragraph. The slave's WRITE command can be sent to the driver in advance of its paired master READ command completing. Again the diagram below shows a copy segment: a READ from one disk followed by a WRITE of the data fetched to a second disk.


[diagram: slave waiting: the slave WRITE is issued before the master READ completes, with synchronization points S1, S2 and S3]

The important feature of this diagram is that the slave WRITE is started before the prior master READ has completed. Three synchronization points are shown: S1, S2 and S3. The S1 point is when the slave becomes aware that the master request (the READ) has been issued, but not necessarily completed. The slave request can be issued at any time after S1. If the slave request is in another thread or process, the application needs a way of signalling to the slave thread/process that it can now issue the slave WRITE. The S2 synchronization point is purely internal (i.e. no code is needed by the application); S2 is when the driver gets notification that the READ has finished. Assuming the READ was successful, and that S1 is before S2, the slave WRITE, which has been held internally, can now be issued to the device [/dev/sg2]. Notice that the slave request is in the rs_swait state between S1 and S2, indicating that it is being held. The S3 synchronization point is when the slave WRITE has finished and the master transitions from the rs_slave to the rs_inactive state. After S3 the next copy segment can be started.

Why show the master as an asynchronous request and the slave as a synchronous request? As a practical matter, the application needs to know when the master READ request has been issued so it can then issue the slave WRITE request. The simplest way to do that is to make the master READ asynchronous (a timer is another technique, but it may be too quick (e.g. occurring before S1) or too slow, wasting time). As for the slave WRITE request we are not interested in it until it has completed, hopefully successfully, hence the use of a synchronous request.

So this "slave waiting" approach decouples the strict ordering outlined in the first paragraph of this section into two loosely coupled sequences, the first for the master, the second for the slave. the only addition to application complexity is making the master request asynchronous. Notice that all completions (e.g. the ioctl(master_fd, SG_IORECEIVE,)) must still be processed and checks made for errors.

What about errors? Code would be simpler without error processing, but it would be a lot less interesting. The simpler case is the slave WRITE request failing, in which case the error is conveyed in the WRITE's completion in the normal manner. Then the application can decide whether to repeat the WRITE, or to WRITE somewhere else, or to abort the copy. The more interesting case is when the master READ request fails, as the notification of that may occur after the application has issued the slave WRITE request. In that case a decision is made at the S2 synchronization point not to issue the WRITE request to /dev/sg2. Instead the ioctl(slave_fd, SG_IO,) completes just after S2 with a return value of 0 (so there is no error value in errno) but with sg_io_hdr::driver_status or sg_io_v4::driver_status set to DRIVER_SOFT. And whenever ::driver_status, ::device_status or ::transport_status is non-zero, the SG_INFO_CHECK flag is OR-ed into the ::info field; so that field is always worth checking on completion. The actual error is given in the master's completion in the normal fashion.

Can the master's call to ioctl(master_fd, SG_IORECEIVE) be after S3? Yes it can. That allows a single thread to do the following pseudo-code sequence:

ioctl(master_fd, SG_IOSUBMIT, <ptr_to_READ_pt_object>);

ioctl(slave_fd, SG_IO, <ptr_to_WRITE_pt_object>);

ioctl(master_fd, SG_IORECEIVE, <ptr_to_READ_pt_object>);

There are only three context swaps in that sequence, with only the ioctl(slave_fd, SG_IO,) taking the time required to actually do the READ followed by the WRITE. In real code those three calls should have their return values checked plus, at the very least, a check that ::info does not have the SG_INFO_CHECK flag OR-ed into it.
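For example, after ioctl(slave_fd, SG_IO,) returns 0, the WRITE's v4 object (called wr_obj here purely for illustration) could be checked along these lines:

if (wr_obj.info & SG_INFO_CHECK) {          /* some status was non-zero */
    if (DRIVER_SOFT == wr_obj.driver_status)
        ;   /* paired master READ failed; fetch its completion for details */
    else
        ;   /* inspect device_status, transport_status and any sense data */
}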

The testing/sgh_dd.cpp utility (in sg3_utils-1.45 rev 811 or later, see main page) has an oflag=swait command line operand for exercising this feature.

Details: "error" can be a bit difficult to define in SCSI. The interesting ones are like: that READ worked but in the firmware's opinion this storage device will soon fail! You can ignore that if this is the final copy of data on that medium to something safer, but otherwise it is probably more serious than that READ failing. Anyway, when this driver is deciding internally whether a request has failed (e.g. that other requests are queued on), then any non-zero value in the SCSI status, or the driver or transport status is regarded as an error with the queued commands that have not been sent to the device getting DRIVER_SOFT as indicated above.

As the naming suggests, the SG_IOSUBMIT and SG_IOSUBMIT_V3 ioctls are closely related. The same is true of SG_IORECEIVE and SG_IORECEIVE_V3. The '_V3' versions take a pointer to a v3 interface object (i.e. struct sg_io_hdr) as their third argument. These ioctls have been separated to simplify 32 bit to 64 bit compatibility handling. The v3 and v4 interface objects have different sizes. Further, the v4 interface object is the same size in both 32 and 64 bit environments (by design) while the v3 interface object's size differs between 32 and 64 bit environments (due to embedded pointers).

8 Sharing design considerations

The primary application of sharing is likely to be copying from one storage device to another where both are SCSI devices (or have the SCSI command set translated for them, as SATA disks do in Linux). Let's assume the copy is large enough that it needs to be cut up into segments, implemented by READ (from source) and WRITE (to destination) commands, each pair of which shares the same data. Even with modern SSDs, maximum performance is usually obtained by queuing commands to storage devices. However the design of sharing in the sg driver requires sequential READ and WRITE commands on a pair of shared file descriptors, in a way that precludes queuing on those file descriptors. Worse still, the storage device that does the READ (i.e. the master side of the share) must wait, effectively doing nothing, while its paired WRITE command is being done; it could be doing the next READ while it is waiting.

One relatively simple solution is to take advantage of threading, which is well supported by the Linux kernel. A multi-threaded program typically has multiple threads of execution running in a single process, in which all threads share the same memory and other resources such as file descriptors. In the case of a copy using request sharing in the sg driver, a good approach would be to have one management thread and multiple worker threads. Each worker thread goes to a distribution centre where information about the next segment offsets to be copied is fetched; the worker thread then does that copy segment using those offsets and returns to the distribution centre for the next segment offsets, or is told there is nothing more to do, in which case the thread can exit. The distribution centre needs to be stateful, which in this context means that it needs to remember which copy segment offsets it has given out and not give them out again (unless the original thread reports an error). One way to protect this distribution centre from two worker threads accessing it at the same time is with a mutex shared between all worker threads, as the sketch below shows. Finer grained mechanisms such as atomic integers may be able to provide this protection in place of a mutex.
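A sketch of such a mutex-protected distribution centre follows; the structure, function names and fixed segment size are illustrative only:

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define SEG_BLKS 2048                 /* example: 1 MB segments of 512 byte LBs */

struct dist_centre {
    pthread_mutex_t lock;
    uint64_t next_lba;                /* next segment's starting LBA */
    uint64_t end_lba;                 /* one past the last LBA to copy */
};

/* Each worker calls this; returns false when the copy is finished */
static bool get_next_segment(struct dist_centre *dc, uint64_t *lba,
                             uint32_t *num_blks)
{
    bool more = false;

    pthread_mutex_lock(&dc->lock);    /* one worker at a time in here */
    if (dc->next_lba < dc->end_lba) {
        *lba = dc->next_lba;
        *num_blks = (uint32_t)((dc->end_lba - dc->next_lba < SEG_BLKS) ?
                               (dc->end_lba - dc->next_lba) : SEG_BLKS);
        dc->next_lba += *num_blks;    /* remember what has been given out */
        more = true;
    }
    pthread_mutex_unlock(&dc->lock);
    return more;
}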

With the sg driver there is no limit (modulo memory availability) to the number of file descriptors that can refer to a single storage device. So for this segmented copy using sg driver sharing, a good approach would be to do a separate open(2) system call on the source and another on the destination in each worker thread. Then each worker thread can set up a file descriptor share with the master being the copy source file descriptor and the slave being the copy destination file descriptor. The number of worker threads should be no more than the maximum queue depth the two devices can comfortably handle. That said, having hundreds of worker threads may consume a lot of the machine's resources. An advantage of this approach is that each worker thread can use the sg driver's simpler synchronous interface (i.e. ioctl(SG_IO)). The reader might then wonder: is there any command queuing taking place? The answer is yes, because one way of viewing the sg driver is that under the covers it is always accessing the SCSI devices asynchronously. So even when one thread is blocked in an ioctl(SG_IO), another thread can call ioctl(SG_IO) and that command will be forwarded to the device.

There is a big "gotcha" with this design (and almost any other design for segmented copy that isn't completely single threaded). The gotcha does not apply when the destination device is a SCSI device, or uses the pwrite(2) or writev(2) system calls but does apply to the write(2) system call, often used to write to a pipe or socket. The problem is that if a read is issued by one thread (or any asynchronous mechanism) called R1 and before it completes another thread issues a read called R2, then there is no guarantee that R1 will complete before R2. And if R2 does complete before R1 and the write(2) system call is called for W2 (i.e. the pair of R2) before W1 then those writes will be out of order. Detecting out-of-order writes when gigabytes are being copied can be painful. If the source and shuffled destination are available as files then a utility like sha1sum will show them as different (because they are) but an old school sum(1) (like from 'sum -s') will give the same value for both. There is a related issue associated with the atomicity of the Linux write(2) command. There is no corresponding atomicity issue with the SCSI WRITE command.

To save time and resources the master side shared READ request should be issued with SG_FLAG_NO_DXFER flag OR-ed with its other flags. That is assuming that the copy program does not need to "see" the data as it flies past. As a counter example, a copy program might want to do a sha256sum(1) on the data being copied in which case that program needs to "see" the inflight data.

The above design can be extended to the single reader, multiple writer case. In other words each worker thread would open file descriptors to the READ storage device and every WRITE storage device. Code to demonstrate these techniques can be found in the sg3_utils package's testing/sgh_dd.cpp utility. That code uses ioctl(SG_SET_GET_EXTENDED, {SG_SEIM_CHG_SHARE_FD}) to change the slave side of an existing share to the next writer.

SCSI storage devices optionally report a "Block limits" Vital Product Data (VPD) page which contains a field called "Optimal transfer length" whose unit is logical blocks (usually either 512 or 4096 bytes each). There is also a "Maximum transfer length" with the same unit. If that VPD page is present (fetched via the SCSI INQUIRY command) but those fields are 0, then no guidance is provided. Otherwise the segment size chosen for a copy should probably be the minimum of the source and destination "Optimal transfer length" fields (e.g. if the source reports 1024 blocks of 512 bytes and the destination reports 128 blocks of 4096 bytes, both suggest 512 KB segments). However, if that implies a segment size in the megabyte range (say over 4 MB), the Linux kernel may object.

Other copy designs are possible that, instead of using threads, use separate processes. One practical problem with this is the ioctl(2) that sets up the share between a destination file descriptor (fd) and a source fd: it will be called in the process containing the destination fd, but how does that process find out about the source fd? One way is for the process containing the source file descriptor to use the Unix fork(2) system call to spawn a new process. The child process shares the same file descriptors as its parent. So if the child then goes on to open the destination storage device, it has the two file descriptors it needs to set up the share. While that solution may look good on paper, it may require a radical rewrite of existing code to implement. Perhaps a better solution is to pass an open file descriptor from one process to another using a Unix domain socket; a blog by Keith Packard outlines the technique, and a condensed sketch is shown below. Code based on both techniques can be found in the sg3_utils package's testing/sg_tst_ioctl.c (with the '-f' option).
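Here is the sending half of that technique (standard SCM_RIGHTS ancillary data over a connected Unix domain socket; the receiving process mirrors this with recvmsg(2) and reads the new descriptor out of the control message):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* send the open descriptor fd_to_pass over the connected Unix socket sock_fd */
int send_fd(int sock_fd, int fd_to_pass)
{
    struct msghdr msg;
    struct iovec iov;
    char data = 'F';                          /* must send at least 1 byte */
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    iov.iov_base = &data;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;             /* pass file descriptor(s) */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));
    return (sendmsg(sock_fd, &msg, 0) < 0) ? -1 : 0;
}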

9 Multiple requests

The bsg write(2) based asynchronous interface (removed from the kernel around lk 4.15) supported multiple sg_io_v4 objects in a single invocation. That is harder to do with an ioctl(2) based interface as the kernel favours pointers to fixed size objects passed as the third argument. Multiple requests (in one invocation) have been implemented in this driver using an extra level of indirection, a common technique for solving software challenges.

A new sg v4 interface flag, SGV4_FLAG_MULTIPLE_REQS, has been added to sg_io_v4::flags. An instance of a sg_io_v4 object with the SGV4_FLAG_MULTIPLE_REQS flag set is termed a controlling object, abbreviated to ctl_obj below. A pointer to a ctl_obj can be given as the third argument to ioctl(SG_IO), ioctl(SG_IOSUBMIT) or ioctl(SG_IORECEIVE). The members of a controlling object are interpreted a little differently from those of a normal sg v4 interface object:

Fields of the controlling object (flags are written without the leading SGV4_FLAG_ for brevity; <<output>> marks fields written by the driver):

guard
    input: 'Q'
    notes: the associated ctl_obj.protocol and ctl_obj.subprotocol fields must both be 0, implying the SCSI command protocol. This is the same as for a normal v4 interface object.

request
    input: 0 or a pointer to an array of cdbs
    notes: if 0 then the ctl_obj.request_len field must be 0. If non-zero then it is a pointer to an array of cdbs (SCSI command descriptor blocks). The number of elements ('n') in this array is ctl_obj.dout_xfer_len divided by the size of a request object (sg_io_v4_sz). The actual length of each cdb in this array is given by the request_len field in the corresponding request array element. All actual cdb lengths must be less than or equal to ctl_obj.request_len divided by n.

request_len
    input: 0 or the length of the array of cdbs
    notes: if 0 then the ctl_obj.request field must be 0. If non-zero then it is the length in bytes of the array of cdbs pointed to by ctl_obj.request.

dout_xferp
    input: pointer to the request array
    notes: the request array is provided by the user space and copied into the driver for processing. In the case of ioctl(SG_IORECEIVE) it may be 0. The ioctl(2) fails with E2BIG if the size of the request array exceeds 2 MB.

dout_xfer_len
    input: length of the request array
    notes: length in bytes of the array pointed to by ctl_obj.dout_xferp. It must be an integer multiple of the size of a request object (sg_io_v4).

din_xferp
    input: pointer to space to receive the response array
    notes: pointer to space that will have the response array written out to it. May be the same value as dout_xferp. In the case of ioctl(SG_IOSUBMIT), when the MULTIPLE_REQS and IMMED flags are given, it may be zero. The size cannot exceed 2 MB.

din_xfer_len
    input: length of the response array
    notes: length in bytes, which must be an integer multiple of the size of a response object (the same size as a request object).

response
    input: pointer to space for sense data
    notes: this and the next field will be used to "stuff" (overwrite) any element in the request array that has zero in both corresponding fields. It is for SCSI command sense data.

max_response_len
    input: 18 to 256
    notes: this relies on the assumption that it is unlikely that more than one of the multiple requests will yield sense data.

flags
    input: MULTIPLE_REQS
    notes: plus optionally the IMMED or STOP_IF flags.

dout_resid
    <<output>>
    notes: the number of requests implied by dout_xfer_len less the number of requests submitted; 0 is the expected value. Note: the unit is v4 requests, not bytes.

din_resid
    <<output>>
    notes: the number of responses implied by din_xfer_len less the number actually written to din_xferp.

info
    <<output>>
    notes: for ioctl(SG_IO) or ioctl(SG_IOSUBMIT) the number of requests submitted is written. For ioctl(SG_IORECEIVE) the number of responses output to din_xferp is written.

<<all other input fields>>
    input: 0
    notes: for example, the ioctl(2) fails with ERANGE if either din_iovec_count or dout_iovec_count is non-zero.

<<all other output fields>>
    <<output>>
    notes: 0 is written.



Note that 'din' and 'dout' maintain their data transfer direction sense, which is with respect to the user space. The response array is the request array with the output fields written into it. However with ioctl(SG_IORECEIVE) the request array is not available, so its response array has zeroed 'in' fields. Further, in that case the response array's elements are in completion order, which may differ from the request array's order, which dictates the submission order. The size in bytes of the version 4 interface object (i.e. in C: sizeof(struct sg_io_v4)) is shown as sg_io_v4_sz. Notice the controlling object can optionally provide an array of cdbs; if given, the elements in that array of cdbs will override the cdbs pointed to by each request array element.
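To make that concrete, below is a sketch of building and issuing a controlling object for the ordered blocking variety described further down; N, req_arr and do_mrq are illustrative names, while SGV4_FLAG_MULTIPLE_REQS and struct sg_io_v4 are as described above:

#include <string.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/bsg.h>  /* struct sg_io_v4 */
#include <scsi/sg.h>    /* SGV4_FLAG_MULTIPLE_REQS (proposed v4 header) */

#define N 4             /* example: four requests in one invocation */

int do_mrq(int sg_fd, struct sg_io_v4 req_arr[N])
{
    struct sg_io_v4 ctl_obj;

    /* each element of req_arr is a normal v4 object: guard 'Q', a cdb
     * pointed to by ->request, its data buffer, timeout, etc. */
    memset(&ctl_obj, 0, sizeof(ctl_obj));
    ctl_obj.guard = 'Q';
    ctl_obj.flags = SGV4_FLAG_MULTIPLE_REQS;    /* marks a controlling object */
    ctl_obj.dout_xferp = (uint64_t)(uintptr_t)req_arr;   /* request array in */
    ctl_obj.dout_xfer_len = N * sizeof(struct sg_io_v4);
    ctl_obj.din_xferp = ctl_obj.dout_xferp;     /* response array out, in place */
    ctl_obj.din_xfer_len = ctl_obj.dout_xfer_len;
    if (ioctl(sg_fd, SG_IO, &ctl_obj) < 0)      /* ordered blocking variety */
        return -1;
    return (int)ctl_obj.dout_resid;             /* 0 => all N were submitted */
}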

The benefit of multiple requests is to lessen the number of context switches and bulk up some transfers of meta-information so more information is transferred in fewer transfers. Three use cases were considered:

A table summarizing four different varieties of multiple requests follows, with a more in-depth explanation after the table:


The four varieties (ctl_obj and req_arr element flags are written without the leading SGV4_FLAG_; MULTIPLE_REQS is excluded from all req_arr element flags):

ordered blocking
    first call: ioctl(sg_fd, SG_IO, &ctl_obj)
    ctl_obj flags: required MULTIPLE_REQS; optional STOP_IF; excluded IMMED
    req_arr element flags: optional SHARE, DO_ON_OTHER, NO_DXFER
    second call: <<everything completed in first call>>

variable blocking
    first call: ioctl(sg_fd, SG_IOSUBMIT, &ctl_obj)
    ctl_obj flags: required MULTIPLE_REQS; optional STOP_IF; excluded IMMED
    req_arr element flags: optional SHARE, DO_ON_OTHER, NO_DXFER, SIG_ON_OTHER, COMPLETE_B4
    second call: <<everything completed in first call>>

submit non-blocking
    first call: ioctl(sg_fd, SG_IOSUBMIT, &ctl_obj)
    ctl_obj flags: required MULTIPLE_REQS, IMMED; excluded STOP_IF
    req_arr element flags: optional SIG_ON_OTHER; excluded SHARE, DO_ON_OTHER, COMPLETE_B4
    second call: ioctl(sg_fd, SG_IORECEIVE, &ctl_obj)
    second call ctl_obj flags: required MULTIPLE_REQS; excluded STOP_IF, IMMED
    second call req_arr element flags: optional SIG_ON_OTHER; excluded SHARE, DO_ON_OTHER, COMPLETE_B4

full non-blocking
    first call: ioctl(sg_fd, SG_IOSUBMIT, &ctl_obj)
    ctl_obj flags: required MULTIPLE_REQS, IMMED; excluded STOP_IF
    req_arr element flags: optional SIG_ON_OTHER; excluded SHARE, DO_ON_OTHER
    second call: ioctl(sg_fd, SG_IORECEIVE, &ctl_obj)
    second call ctl_obj flags: required MULTIPLE_REQS, IMMED; excluded STOP_IF
    second call req_arr element flags: optional SIG_ON_OTHER; excluded SHARE, DO_ON_OTHER, COMPLETE_B4



The ordered blocking multiple request method submits every command found in req_arr (read into the driver via ctl_obj.dout_xferp), waiting for each request to complete before moving to the next request in req_arr. It exits when all the requests have completed or an error occurs. After (partial) success, the updated req_arr is written out to ctl_obj.din_xferp. Each completed request will have SG_INFO_MRQ_FINI OR-ed into its req.info field. The updated ctl_obj is written out to the location indicated by the ioctl(SG_IO)'s third argument. The ctl_obj.dout_resid field will contain the number of requests in ctl_obj.dout_xferp less the number successfully submitted, so zero is the expected value. The order in which requests appear in req_arr will be the same as the order of the response array written out on completion. The DO_ON_OTHER flag on a request instructs the driver to submit that request on the shared file descriptor rather than the one given as the first argument of the ioctl(2). If no file descriptor share has been established then the ioctl(2) fails with an errno of ERANGE. Most syntax violations in multiple request handling will yield an ERANGE error. The DO_ON_OTHER flag is only permitted with multiple requests; using it on single request methods will cause the ioctl(2) to fail with ERANGE.

The variable blocking multiple request method is similar to ordered blocking but by default requests are submitted without waiting for the previous submission to complete. This can be overridden on a request-by-request basis with either the SHARE or COMPLETE_B4 flag. With either of these flags given, the current request will complete before the next request (if any) is submitted. After the submission loop, all outstanding completions are fetched before ioctl(SG_IOSUBMIT) returns to the user. The same information is copied back to the user space as outlined in the previous paragraph.

Both of these blocking multiple request methods can optionally take the STOP_IF flag on the controlling object. That causes a check at the completion of each request for driver, transport or device (SCSI) errors or warnings. If any errors or warnings are detected then no more requests are submitted. Notice that the STOP_IF flag has no effect in variable blocking if there are no SHARE or COMPLETE_B4 flags, since all requests have already been submitted before any completions are checked. The action of the STOP_IF flag has been designed this way so as not to orphan requests that are inflight due to an error occurring on some other request.

The submit non-blocking and full non-blocking multiple request methods are the same on the submission side (i.e. the first call). Both call ioctl(SG_IOSUBMIT) with the MULTIPLE_REQS and IMMED flags set on the ctl_obj. All requests are submitted (which should not block, but could run out of resources) after which control is returned to the caller. Notice that most flags are now "excluded"; the main exception is SIG_ON_OTHER ("signal on other"). Any command in the request array using an excluded flag will cause the ioctl(2) to fail with an errno of ERANGE and no requests will be submitted. File descriptor sharing may be used but this is not request sharing; rather it allows some of the multiple requests to use the SIG_ON_OTHER flag. When SIG_ON_OTHER is given on a request then, after that request completes, the response array (in its current state) is flushed out (i.e. written to where ctl_obj.din_xferp points); then on the other file descriptor poll(2) will have POLLIN set and a signal will be issued if one has been set up. The other file descriptor is just a convenient auxiliary on which selected requests can trigger poll(2) and/or a signal. The file descriptor given as the first argument to the ioctl(2) will have POLLIN set, and optionally signal traffic, for every completed request.

The second half of the submit non-blocking multiple request method is performed by calling ioctl(SG_IORECEIVE) with the MULTIPLE_REQS flag set on the control object. The ctl_obj.din_xferp and ctl_obj.din_xfer_len fields are expected to be non-zero. The ctl_obj.din_xfer_len field divided by sg_io_v4_sz is the number of request completions this ioctl(2) will attempt to yield. As an example: if that division yields 5 and 3 requests have already completed then this ioctl(2) will wait for the other two requests to complete before returning with all 5 responses. And if the number already completed was 6 then the first 5 would be written out to ctl_obj.din_xferp and the ioctl(2) would return without blocking, leaving one completed request for another ioctl(SG_IORECEIVE) invocation to "pick up". If there are no requests waiting (i.e. completed) and no submitted requests are pending completion, then this ioctl(2) fails with an errno of ENODATA. The response array output to ctl_obj.din_xferp is zero-filled, with only the output fields (including the usr_ptr field) filled in for those requests that have completed. They are (roughly) in the order that the completions occurred, which may differ from the order in which they were submitted. Each completed request will have SG_INFO_MRQ_FINI OR-ed into its req.info field. The ctl_obj.din_resid field is set to ctl_obj.din_xfer_len / sg_io_v4_sz less the number of completions reported.

The second half of the full non-blocking multiple request method is performed by calling ioctl(SG_IORECEIVE) with the MULTIPLE_REQS and IMMED flags set on the control object. The ctl_obj.din_xferp and din_xfer_len fields are expected to be non-zero. The ctl_obj.din_xfer_len field divided by sg_io_v4_sz is the maximum number of request completions this ioctl(2) will yield. As an example: if that number is 5 and 3 requests have already completed then the ioctl(2) will only yield those 3 completed requests and then return to the caller. If there are no requests waiting (i.e. completed) and no submitted requests are pending completion, then this ioctl(2) fails with an errno of ENODATA. If there are no requests waiting (i.e. completed) but one or more submitted requests are still inflight, then the response array output to ctl_obj.din_xferp will be all zeros.
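
The following sketch shows the submit non-blocking pairing, under the same assumptions as the previous example; num_reqs and the already-prepared req_arr and rsp_arr arrays are assumptions for the example. Since responses arrive in completion order, the usr_ptr field is the natural way to match them back to their requests:

        memset(&ctl_obj, 0, sizeof(ctl_obj));
        ctl_obj.guard = 'Q';
        ctl_obj.flags = SGV4_FLAG_MULTIPLE_REQS | SGV4_FLAG_IMMED;
        ctl_obj.dout_xferp = (uint64_t)(uintptr_t)req_arr;
        ctl_obj.dout_xfer_len = num_reqs * sizeof(struct sg_io_v4);
        if (ioctl(sg_fd, SG_IOSUBMIT, &ctl_obj) < 0)
                goto error_processing;  /* nothing was submitted */

        /* ... other work while the requests are inflight ... */

        memset(&ctl_obj, 0, sizeof(ctl_obj));
        ctl_obj.guard = 'Q';
        ctl_obj.flags = SGV4_FLAG_MULTIPLE_REQS;    /* no IMMED: wait for all */
        ctl_obj.din_xferp = (uint64_t)(uintptr_t)rsp_arr;
        ctl_obj.din_xfer_len = num_reqs * sizeof(struct sg_io_v4);
        if (ioctl(sg_fd, SG_IORECEIVE, &ctl_obj) < 0)
                goto error_processing;  /* e.g. ENODATA: nothing outstanding */
        /* rsp_arr elements are in completion order; match via usr_ptr */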

Only the ordered blocking and variable blocking multiple request methods (and not the two non-blocking methods) can additionally use request sharing, with the following modification. Since all multiple request methods use a single file descriptor (i.e. the first argument of the ioctl(2)), there needs to be another way of indicating that a particular request should use the other (i.e. shared) file descriptor. This is done with the DO_ON_OTHER flag. File descriptor sharing can be used with all four multiple request methods, either to support request sharing or to nominate another file descriptor to which some POLLIN and signal indications are sent, triggered by the SIG_ON_OTHER flag.

With the non-blocking multiple request methods, rather than use the poll(2) system call or signals, ioctl(sg_fd, SG_GET_NUM_WAITING, &an_integer) can be used. It places the number of requests that have completed but have not been "picked up" into an_integer with little overhead and it won't block. The user can also find out how many requests are active on the given file descriptor; this includes those requests that are inflight plus those that are waiting to be "picked up". That number can be found with ioctl(SG_SET_GET_EXTENDED, {SG_SEIRV_SUBMITTED}). With the non-blocking multiple request methods there is no ability to fetch the response of a particular request using a pack_id or tag. However with a normal ioctl(SG_IORECEIVE) a request submitted via a multiple request ioctl(SG_IOSUBMIT) can be found by pack_id or tag.
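
For instance, a polling loop might look like this sketch (again with ctl_obj and rsp_arr assumed from the earlier examples):

        int num_waiting = 0;

        /* cheap, non-blocking query of the completed-but-not-picked-up count */
        if (ioctl(sg_fd, SG_GET_NUM_WAITING, &num_waiting) < 0)
                goto error_processing;
        if (num_waiting > 0) {
                memset(&ctl_obj, 0, sizeof(ctl_obj));
                ctl_obj.guard = 'Q';
                ctl_obj.flags = SGV4_FLAG_MULTIPLE_REQS | SGV4_FLAG_IMMED;
                ctl_obj.din_xferp = (uint64_t)(uintptr_t)rsp_arr;
                ctl_obj.din_xfer_len = num_waiting * sizeof(struct sg_io_v4);
                if (ioctl(sg_fd, SG_IORECEIVE, &ctl_obj) < 0)
                        goto error_processing;
        }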

The O_NONBLOCK flag can be set on a sg driver file descriptor with the open(2) or fcntl(2) system calls. [Note that the related O_ASYNC file descriptor flag for enabling signals can only be set with the fcntl(2) system call.] If the O_NONBLOCK flag is set on the sg_fd given as the first argument of ioctl(SG_IOSUBMIT) or ioctl(SG_IORECEIVE) then it has a similar effect to setting the IMMED flag in ctl_obj.flags . If the O_NONBLOCK flag is set on the sg_fd given as the first argument of ioctl(SG_IO) then the O_NONBLOCK flag is ignored and ioctl(SG_IO) is fully blocking as described above.

Typically few SCSI commands yield sense data and when they do, it is not necessarily related directly to the command response that it is attached to. For example, after a WRITE command an SSD may decide to yield sense data indicating that it has run out of resources to do further WRITEs and that the SSD will soon become a read-only device! So it is never a good idea to ignore sense data. On the other hand, allocating (and freeing) buffers for each command's possible sense data can be burdensome and error prone. To simplify this a little, the controlling object can be given a sense data pointer and its length in bytes (ctl_obj.response and ctl_obj.max_response_len respectively) and that buffer will be used for any command request in the request array that has zero in those two fields. The downside of doing this is that if two or more commands yield sense data, only the last one will be seen.
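
A sketch of that arrangement, under the same assumptions as the earlier multiple request examples:

        unsigned char sense_b[64];      /* shared by the requests in req_arr */

        ctl_obj.response = (uint64_t)(uintptr_t)sense_b;
        ctl_obj.max_response_len = sizeof(sense_b);
        /* any req_arr element with 0 in its own response and
         * max_response_len fields will use the controlling object's sense
         * buffer; if two or more commands yield sense data, only the last
         * one is seen */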

The following diagram illustrates some transactions in part of the ordered blocking method that is also using request sharing. Notice that the master share file descriptor is the one given to ioctl(SG_IO) and that requests (i.e. WRITE commands) for the slave file descriptor use the DO_ON_OTHER flag.




The sequence points, shown as blue circles in the above diagram, are where the driver notionally changes its attention from one file descriptor to the other, with the prime (i.e. the trailing quote) showing the receiving end of that attention. SQ1 is where the share between the two file descriptors is established; that does not necessarily need to happen immediately before the main multiple request ioctl(2). SQ2 on the master is at the completion of the first READ (i.e. 'ind 0') and at this point the driver starts the first WRITE (i.e. 'ind 1') on the other file descriptor, which is the slave. The performance win here is that there is no return to the user space to check the just completed command and issue the next command. At SQ3 the WRITE has completed and this causes the second READ (i.e. 'ind 2') to start. If the SGV4_FLAG_STOP_IF flag has been OR-ed into the ctl_obj.flags field then at SQ2, SQ3 and SQ4 an additional check is made to see if an error or warning has been issued by the storage device, the transport to it, or the LLD (and its associated HBA); if so ioctl(SG_IO) will exit.

Note that multiple requests are not available using the v3 interface object: neither with ioctl(SG_IO) nor ioctl(SG_IOSUBMIT_V3)+ioctl(SG_IORECEIVE_V3) .

9.1 Single/multiple (non-)blocking requests

Almost all interactions between a user space program and the sg driver involve using a sg driver file descriptor. Each sg driver file descriptor belongs to a sg device. [And optionally each file descriptor may be paired (shared) with another sg file descriptor which may belong to the same or a different sg device.] More precisely within the sg driver a file descriptor corresponds to a kernel object of type 'struct file'. Using the terminology found in 'man 2 dup' (i.e. the manpage of the dup system call) that kernel object is an open file description containing a set of flags and a file offset, among other things. In a user space process an open(2) system call returns an integer (zero or greater) which refers to that open file description. That integer is often termed a file descriptor. The dup(2) system call creates a second reference to the same open file description, as does passing a file descriptor to another process using Unix sockets. Since such operations are relatively uncommon, an open file description in this driver and a file descriptor created by using open(2) on a sg device will be regarded as the same thing.

Each sg driver file descriptor has one active request list (and an associated free list). All commands/requests are issued by this driver to lower levels (i.e. levels that are closer to the storage devices) using a non-blocking, asynchronous interface, with completion flagged using a software interrupt mechanism. So blocking requests are managed by this driver. Given one file descriptor, if the execution of any two commands overlaps then at some point both commands will have entries on that file descriptor's active request list. It is important to match the correct response with each request. If both requests were blocking then this matching is relatively simple since the identity of each request is known and can be searched for on the active request list. If one request was blocking and the other non-blocking then handling the active request list is still relatively simple. Non-blocking request completions are processed in FIFO order (first (completion) in becomes first out (to the user space)). If two or more non-blocking requests are on the same request list then the problem of matching the responses with their corresponding requests is left up to the user space! To aid the user space in doing this matching, the pack_id, tag and usr_ptr fields are provided. The driver does do some work in this regard: all blocking requests on an active request list are marked so that they will never be seen by non-blocking mechanisms such as poll(2), ioctl(SG_IORECEIVE), ioctl(SG_GET_NUM_WAITING), or ioctl(SG_SET_GET_EXTENDED, {SG_SEIRV_SUBMITTED}).

No other distinction (other than between requests submitted as blocking or non-blocking) is made on a file descriptor's active request list. This means single non-blocking requests and multiple non-blocking requests can be submitted on the same file descriptor and they are all treated the same way on the active queue. Their responses can be fetched (in FIFO order of completion) by any combination of single and multiple request calls, using SIGPOLL (or RT signals), poll(2) and ioctl(SG_GET_NUM_WAITING) to detect completion, and either read(2) or ioctl(SG_IORECEIVE) to fetch the response once a completion has occurred.

When there are no active requests on a sg file descriptor, its associated free list will have at least one entry: the inactive reserve request created when that file descriptor was open(2)-ed. There may be other entries on the free list, reflecting that at some earlier time (in the lifetime of that file descriptor) a newly issued request found the reserve request busy, its data buffer not big enough, or otherwise unavailable. On the master side of a file descriptor share, the reserve request is only used for requests that have the SGV4_FLAG_SHARE flag set, so the reserve request is unavailable for new requests that don't use the share flag. The number of inactive requests on a file descriptor's free list can be found with ioctl(SG_SET_GET_EXTENDED, {SG_SEIRV_FL_RQS}). The total number of inactive requests on the given file descriptor and all file descriptors that have the same owning sg device can be found with ioctl(SG_SET_GET_EXTENDED, {SG_SEIRV_DEV_FL_RQS}).

10 pack_id or tag

When doing asynchronous IO with the sg driver there needs to be a way to wait for a particular response, not just the response that is the oldest. [By oldest is meant the command request in the active queue (a per file descriptor queue) whose callback occurred at the earliest time; this will usually be the first one in the active queue.] A common example would be a multi-thread application where each worker thread shares the same file descriptor and issues one command request and waits for the response to that request before issuing another command request.

Historically the way to do this with the sg driver is with a pack_id (short for packet identifier) which is a 32 bit integer. The pack_id is generated by the user application and passed into the interface structure (in the v4 interface the pack_id is placed in request_extra). The pack_id doesn't have to be unique (per file descriptor) but it is practical for it to be unique (the sg driver does not check its uniqueness). The user application should then call ioctl(SG_SET_FORCE_PACK_ID, 1) which alerts the sg driver to read (from the user space) the pack_id given to ioctl(SG_IORECEIVE) or read(2) and then get the (first) matching request on the active queue, or wait for it to arrive. The pack_id value -1 (or 0xffffffff if viewed as an unsigned integer) is used as a wildcard or to report that nothing is available, depending on the context. The pack_id method has worked well and generated few errors or queries over the years and will continue to be supported in the sg v4 driver.
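
A sketch of a single non-blocking request matched by pack_id, using the v4 interface; sg_fd is an open sg file descriptor and cdb/cdb_len, which are assumed to describe some SCSI command, are placeholders for the example:

        struct sg_io_v4 hdr;
        int one = 1;

        /* match responses on pack_id rather than taking the oldest */
        if (ioctl(sg_fd, SG_SET_FORCE_PACK_ID, &one) < 0)
                goto error_processing;
        memset(&hdr, 0, sizeof(hdr));
        hdr.guard = 'Q';
        hdr.request = (uint64_t)(uintptr_t)cdb;
        hdr.request_len = cdb_len;
        hdr.request_extra = 1234;       /* pack_id chosen by application */
        if (ioctl(sg_fd, SG_IOSUBMIT, &hdr) < 0)
                goto error_processing;
        /* ... later, wait for that particular response ... */
        memset(&hdr, 0, sizeof(hdr));
        hdr.guard = 'Q';
        hdr.request_extra = 1234;       /* which response is wanted */
        if (ioctl(sg_fd, SG_IORECEIVE, &hdr) < 0)
                goto error_processing;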

So what is a tag in this context? It is also a 32 bit integer but instead of being generated by the user application, it is generated by the block subsystem. So instead of being given via the v4 interface structure to ioctl(SG_IOSUBMIT), it is returned in the interface structure at the completion of ioctl(SG_IOSUBMIT) in the request_tag field (which is a 64 bit integer). Notice that the tag is only available in the v4 interface structure and via the two new async ioctl(2)s: SG_IOSUBMIT and SG_IORECEIVE. Using the tag to find a command response is very similar to the way it is done with the pack_id described above. As currently implemented the tag logic does not work all the time; its reliability will most likely depend on the SCSI host (HBA driver (LLD)) that the target device belongs to. There seems to be no reliable way for this driver to fetch the tag from the block infrastructure. Currently this driver simply asks for it after forwarding the command request to the block code. However 3 cases have been observed: it gets a tag; it doesn't get the tag (it is too early); it doesn't get the tag (it is too late) because the request has already finished! The third case may only occur with the scsi_debug driver which can complete requests in a microsecond or less (that is configurable). The tag wildcard is also -1 (or all "f"s in hex when viewed as an unsigned integer) so again the logic is very similar to pack_id.

So given the above, the default remains what it was in v3 of the sg driver, namely using the pack_id unless another indication is given. To use tags to choose a response, ioctl(SG_SET_FORCE_PACK_ID, 1) is needed first on the file descriptor. Then the v4 interface object given to ioctl(SG_IOSUBMIT) should OR SGV4_FLAG_YIELD_TAG with the other flags in that interface object. After that ioctl(2) has finished successfully, the request_tag field in that object should be set. If it is -1 then no tag was found (as discussed in the previous paragraph). The matching ioctl(SG_IORECEIVE) call should make sure the request_tag field is set as appropriate and the SGV4_FLAG_FIND_BY_TAG flag should be OR-ed with the other flags.
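
That sequence might look like the following sketch, continuing the previous example (ioctl(SG_SET_FORCE_PACK_ID, &one) already done on sg_fd, hdr prepared as before):

        struct sg_io_v4 rcv;

        hdr.flags |= SGV4_FLAG_YIELD_TAG;   /* ask for the block layer tag */
        if (ioctl(sg_fd, SG_IOSUBMIT, &hdr) < 0)
                goto error_processing;
        if (hdr.request_tag == (uint64_t)-1)
                goto use_pack_id_instead;   /* no tag was obtained */
        /* ... later, fetch that specific response by tag ... */
        memset(&rcv, 0, sizeof(rcv));
        rcv.guard = 'Q';
        rcv.flags = SGV4_FLAG_FIND_BY_TAG;
        rcv.request_tag = hdr.request_tag;  /* which response is wanted */
        if (ioctl(sg_fd, SG_IORECEIVE, &rcv) < 0)
                goto error_processing;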

11 Bi-directional command support

N.B. Support for SCSI bidirectional commands was removed from the Linux kernel in version 5.1 . To allow the driver to merge post lk 5.1, bidi support has been removed from this driver. That bidi support is available as a separate patch if the driver is used with kernels prior to bidi support being removed.

One of the main reasons for designing the sg V4 interface was to handle SCSI (or other storage protocol) bi-directional commands (abbreviated here to bidi). In the SCSI command sets, bidi commands are mainly found in block commands that support RAID (e.g. XDWRITEREAD(10)) and many of the Object Storage Device (OSD) commands. Linux contains an "osd" upper level driver (ULD) and an object based file system called exofs. New SCSI commands are being considered, such as READ GATHERED, which would most likely be a bidi command. The NVMe command set (NVM) extends the bidi command concept to "quad-di": data-in and data-out plus metadata-in and metadata-out.

Synchronous SCSI bidi commands have been available in the bsg driver for more than 12 years using ioctl(<bsg_dev_fd>, SG_IO) with the sg V4 interface (i.e. struct sg_io_v4) and are now available with the sg V4 driver, where <bsg_dev_fd> is replaced by <sg_dev_fd>. Asynchronous SCSI bidi commands were available for the same period but were withdrawn around Linux kernel 4.15 due to problems with the bsg driver. Those asynchronous commands were submitted via the Unix write(2) call and the response was received using a Unix read(2) call. In the sg v4 driver the submitted and received object structure remains the same but the Unix write(2) and read(2) system calls can no longer be used. Instead two new ioctl(2)s have been introduced, called SG_IOSUBMIT and SG_IORECEIVE, to replace write(2) and read(2) respectively. The functionality is almost identical; read on for details.

In the sg driver the direct IO flag has the effect of letting the block layer manage the data buffers associated with a command. The effect of indirect IO in the sg driver is to let the sg driver manage the data buffers. Indirect IO is the default for the sg driver, with the other options being mmap IO (memory mapped IO) and direct IO. Indirect IO is the most flexible with the sg driver: it can be used by both uni-directional and bidi commands and has no alignment requirements on the user space buffers. Request sharing discussed above cannot be used with direct IO (because the sg driver needs control of the data buffers to implement the share) while mmap IO is not implemented for bidi commands. Also a user space scatter gather list cannot be used for either the data-out or data-in transfers associated with a bidi command.

Other than the exclusions in the previous paragraph, all other capabilities of the sg driver are available to bidi commands. The completion is sent when the second transfer (usually a data-in transfer) has completed. pack_id and/or tags can be used as discussed in the previous section. Signal on completion, polling for completion and multi-threading should also work on bidi commands without issues.

12 SG interface support changes

In the following table, a comparison is made between the supported interfaces of the sg driver found in lk 4.20 (V3.5.36) and the proposed V4 sg driver. The movement of the main header file from the include/scsi directory to the include/uapi/scsi directory should not impact user space programs since modern Linux distributions should check both and the stub header now in include/scsi/sg.h includes the other one. There is a chance the GNU libc maintainers don't pick up this change/addition, but if so the author would expect that to be a transient problem. The sg3_utils/testing directory in the sg3_utils package gets around this problem with a local copy of the "real" new sg header in a file named uapi_sg.h .


 Table 1. sg interfaces supported by various sg drivers

v1+v2 interfaces, non-blocking (struct sg_header):
    sg driver V3.5.36 (lk 2.6, 3, 4 and 5.0): write(2)+read(2); header: include/scsi/sg.h
    sg driver V4.0.x (lk ?): write(2)+read(2) ****; header: include/uapi/scsi/sg.h

v3 interface, non-blocking (struct sg_io_hdr):
    sg driver V3.5.36: write(2)+read(2); header: include/scsi/sg.h
    sg driver V4.0.x: ioctl(SG_IOSUBMIT_V3)+ioctl(SG_IORECEIVE_V3) or write(2)+read(2); header: include/uapi/scsi/sg.h

v3 interface, blocking (struct sg_io_hdr):
    sg driver V3.5.36: ioctl(SG_IO); header: include/scsi/sg.h
    sg driver V4.0.x: ioctl(SG_IO); header: include/uapi/scsi/sg.h

v4 interface, non-blocking (struct sg_io_v4 from bsg.h):
    sg driver V3.5.36: not available ^^^
    sg driver V4.0.x: ioctl(SG_IOSUBMIT)+ioctl(SG_IORECEIVE); headers: include/uapi/scsi/sg.h + include/uapi/linux/bsg.h

v4 interface, blocking (struct sg_io_v4 from bsg.h):
    sg driver V3.5.36: not available ***
    sg driver V4.0.x: ioctl(SG_IO); headers: include/uapi/scsi/sg.h + include/uapi/linux/bsg.h

*** available via the bsg driver; ^^^ removed from the bsg driver in lk 4.15; **** the plan is to deprecate the write(2)/read(2) based interfaces which would leave the v1+v2 interfaces unsupported.

Note that there is no v1+v2 blocking interface. Rather than completely drop the write(2)+read(2) interface, it could be kept alive only for the v1+v2 interfaces. Applications based on the v1+v2 interfaces would have been written around 20 years ago and would need a low level re-write to use the v3 or v4 non-blocking interfaces. So what might be dropped is the ability of the v3 interface to use the write(2)+read(2) interface, as the only code change required should be to change the write(2) to an ioctl(SG_IOSUBMIT_V3) and the read(2) to an ioctl(SG_IORECEIVE_V3), as shown below.
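
For a v3-based application the conversion is mechanical; a sketch, assuming h3 is a struct sg_io_hdr fully prepared as it would have been for the write(2)+read(2) interface and sg_fd is an open sg file descriptor:

        struct sg_io_hdr h3;    /* prepared as for write(2)+read(2) */

        /* formerly: write(sg_fd, &h3, sizeof(h3)) */
        if (ioctl(sg_fd, SG_IOSUBMIT_V3, &h3) < 0)
                goto error_processing;
        /* ... */
        /* formerly: read(sg_fd, &h3, sizeof(h3)) */
        if (ioctl(sg_fd, SG_IORECEIVE_V3, &h3) < 0)
                goto error_processing;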

13 IOCTLs

Traditionally character device drivers in Unix have had an open(2), close(2), read(2), write(2), ioctl(2) interface to the user space. As well as those system calls this driver supports mmap(2), poll(2) and fasync(). The fasync() driver call is related to the fcntl(2) system call when the file descriptor flags are being changed to add O_ASYNC (e.g. fcntl(sg_fd, F_SETFL, flags | O_ASYNC) ) .
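
For example, enabling completion signals on an sg file descriptor might look like this sketch (standard POSIX/Linux fcntl(2) calls; SIGIO is the default signal unless fcntl(F_SETSIG) is used to choose a realtime signal):

        int flags = fcntl(sg_fd, F_GETFL);

        /* direct this fd's SIGIO/SIGPOLL at the current process ... */
        if (fcntl(sg_fd, F_SETOWN, getpid()) < 0)
                goto error_processing;
        /* ... then enable signal generation on completions */
        if (fcntl(sg_fd, F_SETFL, flags | O_ASYNC) < 0)
                goto error_processing;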

It may help in understanding this driver to add a little history. This driver was present in Linux kernel 1.0.0 released in 1992. It supported just two ioctl(2)s at the time: SG_SET_TIMEOUT and SG_GET_TIMEOUT, plus some "pass-through" ioctl(2)s starting with "SCSI_IOCTL_" that were in common with other ULDs (e.g. sd and st) and implemented by the Linux SCSI mid-level. The only method of sending a SCSI command with this driver was the async write(2) and read(2) system calls (neglecting the pass-through's own pass-through ioctl(2): SCSI_IOCTL_SEND_COMMAND). Over time there has been a transfer of functionality from the write(2) and read(2) system calls to the various ioctl(2)s listed below. Using the write(2) and read(2) system calls in the way this driver does is frowned upon by the Linux kernel architects, as is adding new ioctl(2)s! Only 4 new ioctl(2)s have been added in the sg v4 driver as noted in the status entries of the table below. Two of those ioctl(2)s were proposed in this post by a Linux architect (L. Torvalds). However a lot of extra information needs to be exchanged between the user space and the driver to support the new functionality added in v4 of this driver. That is nearly all done via one new omnibus ioctl(2): SG_SET_GET_EXTENDED, using a 96 byte structure and the flags listed in the second table below.

Another historical note, the v1 SCSI pass-through interface was based on this structure in Linux kernel 1.0.0:

struct sg_header
 {
  int pack_len;    /* length of incoming packet <4096 (including header) */
  int reply_len;   /* maximum length <4096 of expected reply */
  int pack_id;     /* id number of packet */
  int result;      /* 0==ok, otherwise refer to errno codes */
  /* command follows then data for command */
 };

Only the pack_id field is found in all versions of the sg driver interface and its semantics remain the same.

The following table lists the ioctl(2)s that the sg v4 driver processes. They are in alphabetical order of the name of the second ioctl(2) argument. In most cases the scope of the action of the ioctl(2) is that of the file descriptor given as the first argument, referred to below as the current file descriptor. If the scope is other than the current file descriptor, that is noted with the status. Note that there is a "fall-through" in the last entry of this table, so any ioctl(2)s not processed by this driver will be passed to the SCSI mid-level and, if it doesn't process them, thence to the LLD (SCSI low level driver) that owns the "host" that the file descriptor's device is connected to. If no driver processes an ioctl(2) then it should return -1 with an errno of ENOTTY (according to POSIX) but sometimes other error codes are given, depending on the LLD.

Each entry below gives: the ioctl name [hex value] (the name is the second argument to the ioctl(2) call); its status (with scope where that is other than the current file descriptor; output is via the 3rd argument pointer unless noted); then notes.

BLKSECTGET [0x1267]
    active; scope: host (HBA)
    This ioctl value replicates what a block layer device file (e.g. /dev/sda) will do with the same value. It calls the queue_max_sectors() helper on the owning device's command queue. The resulting number is multiplied by 512 to get a count in bytes and output where the third argument points, assumed to be a pointer to int (so a maximum of about 2 GB). It represents the maximum data size of a single request that the block layer will accept.

BLKTRACESETUP [0xc0481273]
    active; scope: device
    The third argument of the ioctl(2) is a pointer to a struct blk_user_trace_setup object. Needs a kernel with CONFIG_BLK_DEV_IO_TRACE=y . This ioctl(2) and its siblings are passed through to the block layer which implements them: a pass-through inside a pass-through.

BLKTRACESTART [0x1274]
    active; scope: device
    Ignores the third argument of the ioctl(2). See the blktrace and blkparse utilities in the blktrace package.

BLKTRACESTOP [0x1275]
    active; scope: device
    Ignores the third argument of the ioctl(2). Part of blktrace support.

BLKTRACETEARDOWN [0x1276]
    active; scope: device
    Ignores the third argument of the ioctl(2). Part of blktrace support.

SCSI_IOCTL_GET_BUS_NUMBER [0x5386]
    active, deprecated; scope: host
    Implemented by the SCSI mid-level. Assumes the third argument is a pointer to int (32 bit) and places a field called 'host_no' in it. host_no is an index of SCSI HBAs (host bus adapters) in the system; in this case it will be the host number that the SCSI device is connected to. That SCSI device has been open(2)-ed to yield the file descriptor that this ioctl(2) uses. In modern Linux usage, this information is better obtained from sysfs. Alternatively ioctl(SG_GET_SCSI_ID) can be used (see below).

SCSI_IOCTL_GET_IDLUN [0x5382]
    active, deprecated; scope: device
    Implemented by the SCSI mid-level. Assumes the third argument is a pointer to int (32 bit) and places a packed integer (with 4 components) in it. The lower 8 bits are a target device number, the next 8 bits are the LUN, the next 8 bits are the channel number, and the top 8 bits are the host_no mentioned in the previous item. There are many things wrong with this from a modern SCSI perspective. In modern Linux usage, this information is better obtained from sysfs.

SCSI_IOCTL_PROBE_HOST [0x5385]
    active, deprecated; scope: host
    Implemented by the SCSI mid-level. Yields an identifying string associated with the host. Assumes the third argument is a pointer to a byte array whose length is placed in a (32 bit) int in the first 4 bytes. That length will be overwritten by the ASCII byte array output. This information can also be obtained from sysfs.

SCSI_IOCTL_SEND_COMMAND [0x1]
    active, deprecated
    This is the SCSI mid-level pass-through which is very old (of lk 1.0 and sg v1 interface vintage) and even worse. Please do not use.

SG_EMULATED_HOST [0x2203]
    seems to be "dead"
    Originally indicated a host that emulated SCSI (e.g. ATAPI) but libata does not seem to set this value in the host template provided by each LLD.

SG_GET_ACCESS_COUNT [0x2289]
    not supported
    Returns 1 [unless the owning sg device is missing in which case 0 is returned, very unlikely].

SG_GET_COMMAND_Q [0x2270]
    active
    See the SG_SET_COMMAND_Q notes below. Yields the current state of the COMMAND_Q flag held by this file descriptor.

SG_GET_KEEP_ORPHAN [0x2288]
    active
    When a synchronous ioctl(SG_IO) is interrupted (e.g. by a signal from another process) the default action (depending on the signal) may be to terminate the ioctl(2) with an errno of EINTR. The driver terms such an inflight command/request an "orphan". The default action is to "throw away" the response from the device and clean up the request's resources. This loses information such as whether the command succeeded. This ioctl(2) returns 0 (the default) or 1 depending on whether requests belonging to this file descriptor will throw away (when 0) or keep (when 1) the response to interrupted requests. Note that closing a sg file descriptor will clean up any outstanding request resources this file descriptor is using at the time of the close(2) [in reality that takes place a little later (when the last response "lands") because nothing is permitted to suspend a close(2)].

SG_GET_LOW_DMA [0x227a]
    active, deprecated; scope: host
    Yields the host's unchecked_isa_dma flag (0 or 1) via the third argument. The 'host' is typically the host bus adapter (HBA) that this sg device (the parent of the current file descriptor) is connected to.

SG_GET_NUM_WAITING [0x227d]
    active
    Number of non-blocking requests on the active list that are waiting to be read. That "read" can be done with either an ioctl(SG_IORECEIVE) or a read(2) system call. Requests that are inflight are not counted. If there are any blocking requests waiting on the list, they are not counted either. Similar to ioctl(SG_SET_GET_EXTENDED, {SG_SEIRV_SUBMITTED}) which additionally counts (non-blocking) inflight requests. When using non-blocking multiple requests this will be the expected number of responses that ioctl(SG_IORECEIVE, FLAG_MULTIPLE_REQS | FLAG_IMMED) will receive. This ioctl(2) holds no locks in the sg driver and accesses an atomic integer, so it is fast and should never block, making it suitable for polling. In the presence of other producers or consumers the number waiting may change before a user has time to act on the result of this call.

SG_GET_PACK_ID [0x227c]
    active
    The third argument is expected to be a pointer to int. By default it will set that int to the pack_id of the first (oldest) command that has completed internally but still awaits an ioctl(SG_IORECEIVE) or read(2) to finish. If no requests are waiting, -1 (i.e. the wildcard value) is placed in that int. This ioctl(2) yields the pack_id by default, unless the SG_CTL_FLAGM_TAG_FOR_PACK_ID boolean has been set on this file descriptor.

SG_GET_REQUEST_TABLE [0x2286]
    active
    The third argument is assumed to point to an array of 16 struct sg_req_info objects (that struct is defined in include/uapi/scsi/sg.h). First the array is zeroed, making all req_state fields zero which corresponds to the INACTIVE state. Then any requests that are active have their fields placed in the sg_req_info elements. Then, if there is still room, requests from the free list are placed in sg_req_info elements. This action stops when either 16 elements are filled or there are no more requests associated with the current file descriptor to transfer.

SG_GET_RESERVED_SIZE [0x2272]
    active
    This is the size, in bytes, that the reserve request associated with this file descriptor currently has. The third argument is assumed to be a pointer to an int that receives this value.

SG_GET_SCSI_ID [0x2276]
    active, enhanced in v4
    The third argument should be a pointer to an object of type struct sg_scsi_id . This ioctl(2) fills the fields in that structure. The extension in v4 is to use two 'unused' 32 bit integers at the end of that struct as an array of 8 bytes to which the SCSI LUN is written. This is the preferred LUN format from t10.org . This extension does not change the size of struct sg_scsi_id . For those looking for the corresponding HCTL tuple of the device this file descriptor belongs to, this ioctl(2) is one way: H --> sg_scsi_id::host_no; C --> sg_scsi_id::channel; T --> sg_scsi_id::scsi_id; and L --> sg_scsi_id::scsi_lun[8] . Another way is to use 'lsscsi -g' which data-mines in sysfs, or the user can write their own sysfs data-mining code.

SG_GET_SG_TABLESIZE [0x227F]
    active
    Yields the maximum number of scatter gather elements that the associated host (HBA) supports. That is the host through which the sg device that "owns" the given file descriptor is attached. The third argument is assumed to point to an int.

SG_GET_TIMEOUT [0x2202]
    active, deprecated; timeout in seconds is the return value
    The v1 and v2 interfaces did not contain a command timeout field so this was a substitute. Both the v3 and v4 interfaces have a command timeout field which is better than using this ioctl(2).

SG_GET_TRANSFORM [0x2205]
    seems to be "dead"
    This driver passes this ioctl value through to the SCSI mid-level which seems to do nothing with it. Testing reveals that it yields an errno of EINVAL.

SG_GET_VERSION_NUM [0x2282]
    active
    Uses the third argument as a pointer to write out a 32 bit integer which, when seen in decimal, is in the form [x]xyyzz where [x] means blank (space) if zero. This is usually expressed as an ASCII string as '[x]x.[y]y.zz' .

SG_IO [0x2285]
    active, added functionality in v4 driver
    Both v3 and v4 interface blocking commands can be issued with this ioctl(2). It only returns -1 and sets errno when the preparation for submitting the command/request encounters a problem. Thereafter any problems encountered set the out fields in the v3 or v4 interface object, so both should be checked.

SG_IOABORT [0x40a02243]
    new in v4
    Only the v4 interface can use this ioctl(2) to abort a command in progress, using either the pack_id (in the request_extra field) or the tag. The pack_id is used by default, unless the SG_CTL_FLAGM_TAG_FOR_PACK_ID boolean has been set on this file descriptor. If no corresponding request is found (capable of being aborted) then errno is set to ENODATA. The completion of an aborted command will have DRIVER_SOFT set in the driver_status field.

SG_IORECEIVE [0xc0a02242]
    new in v4 driver
    Only the v4 interface can use this ioctl(2) to complete a command/request started with the asynchronous ioctl(SG_IOSUBMIT) on the same file descriptor. If multiple requests are outstanding on the same file descriptor, then setting ioctl(SG_SET_FORCE_PACK_ID) indicates that subsequent requests on this file descriptor should take account of the pack_id (in the ::request_extra field) or the tag (in the ::request_tag field) to choose a matching response.

SG_IORECEIVE_V3 [0xc0582246]
    new in v4 driver
    Only the v3 interface can use this ioctl(2) to complete a command/request started with the asynchronous ioctl(SG_IOSUBMIT_V3) on the same file descriptor. If multiple requests are outstanding on the same file descriptor, then setting ioctl(SG_SET_FORCE_PACK_ID) indicates that subsequent requests on this file descriptor should take account of the pack_id field to choose a matching response.

SG_IOSUBMIT [0xc0a02241]
    new in v4 driver
    Only the v4 interface can use this ioctl(2) to issue (submit) new commands. This ioctl(2) will return relatively quickly, potentially well before the command has completed. Each call to ioctl(SG_IOSUBMIT) needs to be paired with a call to ioctl(SG_IORECEIVE) using the same (sg) file descriptor. This call is part of the v4 asynchronous (non-blocking) interface.

SG_IOSUBMIT_V3 [0xc0582245]
    new in v4 driver
    Only the v3 interface can use this ioctl(2) to issue (submit) new commands. This ioctl(2) will return relatively quickly, potentially well before the command has completed. Each call to ioctl(SG_IOSUBMIT_V3) needs to be paired with a call to ioctl(SG_IORECEIVE_V3) using the same (sg) file descriptor. This call is part of the v3 asynchronous (non-blocking) interface.

SG_NEXT_CMD_LEN [0x2283]
    active, deprecated
    Only applies to the v2 interface which does not include a command (cdb) length field; that assumes the driver can work out what the cdb length is. While that works for standard cdbs (from T10) it may not work for vendor specific commands, hence this ioctl(2).

SG_SET_COMMAND_Q [0x2271]
    active
    In the v1 and v2 drivers the default was 0 (so no command queuing on this file descriptor). In the v3 driver it was 0 until a v3 interface structure was presented, in which case it was turned on (1) for this file descriptor. In the v4 driver it is on (1) by default. 0 --> only allow one command per fd; 1 --> allow command queuing. When command queuing is off, if a second command is presented before the previous one has finished, an errno of EDOM will result.

SG_SET_DEBUG [0x227e]
    active; scope: device
    0 --> turn off (def); 1 --> turn on. Currently the only impact of setting this is to print out sense data (to the log) of any request on all fds that belong to the current device. Typically only requests that yield a SCSI status of "Check Condition" provide sense data.

SG_SET_FORCE_LOW_DMA [0x2279]
    does nothing
    Users of modern Linux systems should not concern themselves with "low DMA"; this comes from the ISA era. 0 --> use adapter setting (def); 1 --> force "low dma". However this ioctl(2) has since been neutered and does nothing.

SG_SET_FORCE_PACK_ID [0x227b]
    active
    When activated, a non-blocking response is only accepted if it has a matching pack_id (or tag). A pack_id (or tag) of -1 is treated as a wildcard. In the v4 interface the request_extra field is used for the pack_id. A non-blocking request is finished with either ioctl(SG_IORECEIVE[_V3]) or read(2). The third argument to this ioctl(2) is assumed to be a pointer to a 32 bit integer. 0 --> take the oldest available response (def); 1 --> match on the pack_id (or tag) given in each subsequent request on this fd. Even though the third argument is a pointer to int, this ioctl(2) is effectively boolean. The default is to use the pack_id rather than the tag unless SG_SET_GET_EXTENDED{SG_CTL_FLAGM_TAG_FOR_PACK_ID} is active on this file descriptor.

SG_SET_GET_EXTENDED [0xc0602251]
    new in v4
    Takes a pointer to a 96 byte sg_extended_info structure; it can set and get 32 bit values and it can set and get boolean values. Each ioctl(2) can perform more than one action. Explained below.

SG_SET_KEEP_ORPHAN [0x2287]
    active
    How to treat a SCSI response when an ioctl(SG_IO), read(2) or ioctl(SG_IORECEIVE) that is waiting is interrupted. 0 --> drop it (def); 1 --> hold it so the response can be fetched with either another read(2) or ioctl(SG_IORECEIVE) call.

SG_SET_RESERVED_SIZE [0x2275]
    active
    Sets or resets the size of the reserve request data buffer of this file descriptor to the given value (in bytes). If this file descriptor is in use (i.e. sending a SCSI command) then this ioctl(2) will fail with an errno of EBUSY.

SG_SET_TIMEOUT [0x2201]
    active, deprecated
    Command timeout in seconds (pointed to by the third argument). See the "_GET_" notes above.

SG_SET_TRANSFORM [0x2204]
    seems to be "dead"
    This driver passes this ioctl value through to the SCSI mid-level which seems to do nothing with it. Testing reveals that it yields an errno of EINVAL.

<< any others >>
    ??
    Sent through to the SCSI mid-level (and then to the LLD associated with the device the fd belongs to) for further processing.



The third argument to ioctl(SG_SET_GET_EXTENDED) is a pointer to an object of type struct sg_extended_info . That structure is found in the <scsi/sg.h> header and is shown here:

struct sg_extended_info {
        uint32_t sei_wr_mask;       /* OR-ed SG_SEIM_* user->driver values */
        uint32_t sei_rd_mask;       /* OR-ed SG_SEIM_* driver->user values */
        uint32_t ctl_flags_wr_mask; /* OR-ed SG_CTL_FLAGM_* values */
        uint32_t ctl_flags_rd_mask; /* OR-ed SG_CTL_FLAGM_* values */
        uint32_t ctl_flags;         /* bit values OR-ed, see SG_CTL_FLAGM_* */
        uint32_t read_value;        /* write SG_SEIRV_*, read back related */
        uint32_t reserved_sz;       /* data/sgl size of pre-allocated request */
        uint32_t tot_fd_thresh;     /* total data/sgat for this fd, 0: no limit */
        uint32_t minor_index;       /* rd: kernel's sg device minor number */
        uint32_t share_fd;          /* SHARE_FD and CHG_SHARE_FD use this */
        uint32_t sgat_elem_sz;      /* sgat element size (must be power of 2) */
        uint32_t pad_to_96[52];     /* pad so struct is 96 bytes long */
};

If both the sei_wr_mask and sei_rd_mask fields are zero then ioctl(SG_SET_GET_EXTENDED) does nothing. If those fields are non-zero then they should contain one or more of the following mask values OR-ed together. The associated field names of struct sg_extended_info are shown with each entry:



SG_SET_GET_EXTENDED: sei_wr_mask and sei_rd_mask values, each with its associated field(s) and notes [fd: the file descriptor given as the 1st argument to ioctl(2); ro: read only; raw: read after write; rbw: read before write]

SG_SEIM_CHG_SHARE_FD [0x40]; associated field: share_fd [rbw]
    When written, this is only valid if fd is the master side of a share. If so, share_fd replaces the prior slave fd (which is the value read back) so that share_fd becomes the new slave side of the fd share.

SG_SEIM_CTL_FLAGS [0x1]; associated fields: ctl_flags, ctl_flags_wr_mask and ctl_flags_rd_mask
    Three fields in a sg_extended_info object are associated with this variant of the ioctl(2): a value mask, a write mask and a read mask. The mask values are the SG_CTL_FLAGM_* values shown in a following table.

SG_SEIM_MINOR_INDEX [0x10]; associated field: minor_index [ro]
    When read, places the minor number of the sg device that this fd is associated with in minor_index . For example after open(2)-ing "/dev/sg3" that fd should place 3 in the minor_index field.

SG_SEIM_READ_VAL [0x2]; associated field: read_value [raw]
    When a known value (see the SG_SEIRV_* entries in the table below) is written to read_value then after this ioctl(2) the corresponding value will be in the read_value field. For this action, SG_SEIM_READ_VAL should be OR-ed into both the sei_wr_mask and sei_rd_mask fields.

SG_SEIM_RESERVED_SIZE [0x4]; associated field: reserved_sz [raw]
    When written, this fd's reserve request's data buffer will be resized to reserved_sz bytes. The given value may be trimmed down by system limits. When read, the actual size of this fd's (resized) data buffer will be placed in reserved_sz when this ioctl(2) completes. So when both written and read, this ioctl(2) is very similar to ioctl(SG_SET_RESERVED_SIZE) combined with ioctl(SG_GET_RESERVED_SIZE) .

SG_SEIM_SGAT_ELEM_SZ [0x80]; associated field: sgat_elem_sz [rbw]
    When the driver builds a scatter gather list for a request's data buffer, a fixed element size is used which is a power of 2 and greater than or equal to the machine's page size (often 4 KB). The default size is currently 32 KB (2**15). When written, sgat_elem_sz will replace the prior element size. When read, the prior element size is placed in sgat_elem_sz . Affects future requests on this fd that use data-in or data-out.

SG_SEIM_SHARE_FD [0x20]; associated field: share_fd [rbw]
    When written, a shared fd relationship is set up by this ioctl(2). The fd that is the first argument of the ioctl(2) should be the future slave (i.e. the WRITE side of a copy) and share_fd identifies the future master. Neither fd can already be part of a share. When read (read before write), if successful share_fd should yield 0xffffffff which indicates (internally) that both fds were not previously part of a share.
    When read, but not written, share_fd will yield: 0xffffffff (-1) if the first argument is not part of a share; 0xfffffffe (-2) if the first argument is the master side of a share; or the master's fd if the first argument is the slave side of a share.

SG_SEIM_TOT_FD_THRESH [0x8]; associated field: tot_fd_thresh [raw]
    By default, a limit on the sum of all data buffers that can be active on a fd is set at 16 MB. A request that tries to exceed this will be rejected with an errno of E2BIG. The default can be changed by writing to tot_fd_thresh . A value of 0 is taken as unlimited.


An example follows of changing the scatter gather list element size to 64 KB and reading back the prior value. It is assumed that sei is an object of type struct sg_extended_info that has been zeroed out:

        sei.sei_wr_mask |= SG_SEIM_SGAT_ELEM_SZ;
        sei.sei_rd_mask |= SG_SEIM_SGAT_ELEM_SZ;
        sei.sgat_elem_sz = 64 * 1024;   /* 64 KB */
        if (ioctl(sg_fd, SG_SET_GET_EXTENDED, &sei) < 0) {
                err = errno;
                goto error_processing;
        }
        prev_sgat_elem_sz = sei.sgat_elem_sz;   /* prior value read back */
        /* success */

The ctl_flags field can be viewed as 32 boolean (i.e. 1 bit) fields. If both the ctl_flags_wr_mask and ctl_flags_rd_mask fields are zero then ioctl(SG_SET_GET_EXTENDED) does nothing with the ctl_flags field. All three fields should contain one or more of the following mask values OR-ed together:

SG_SET_GET_EXTENDED: ctl_flags, ctl_flags_wr_mask + ctl_flags_rd_mask values, each with its access type and notes [ro: read-only; raw: read-after-write; rbw: read-before-write; w: write]

SG_CTL_FLAGM_IS_MASTER [0x40]  [ro]
    when read, if set implies this fd is part of a file share and this is the master side.

SG_CTL_FLAGM_IS_SHARE [0x20]  [ro]
    when read, if set implies this fd is part of a file share.

SG_CTL_FLAGM_MASTER_ERR [0x200]  [ro]
    when read, if set implies the master's request has completed with a non-zero SCSI status or other driver error. In this case the shared request state is terminated (i.e. the slave side will not be able to issue an associated slave request). This may be used on either the master's or the slave's fd.

SG_CTL_FLAGM_MASTER_FINI [0x100]  [ro]
    when read, if set implies the master's request has completed and is waiting for the slave request to start. This may be used on either the master's or the slave's fd.

SG_CTL_FLAGM_MORE_ASYNC [0x400]  [rbw]
    blk_get_request() can still block in standard async mode. When this is written to 1 (true) that call is made non-blocking and SG_IOSUBMIT will yield EBUSY in the case where it would otherwise block.

SG_CTL_FLAGM_NO_DURATION [0x400]  [rbw]
    when written to 1 (true) instructs the driver not to calculate command duration. This saves two ktime_get_boottime() calls per command. The default (and when 0 is written) is to always calculate command/request duration.

SG_CTL_FLAGM_ORPHANS [0x8]  [ro]
    when read, if set implies there are one or more orphaned commands/requests associated with this fd.

SG_CTL_FLAGM_OTHER_OPENS [0x4]  [ro]
    when read, if set implies there are other sg driver open(2)s active on this sg device.

SG_CTL_FLAGM_Q_TAIL [0x10]  [raw]
    when written, set causes the following commands/requests on this fd to be queued to the block layer at the tail of its queue; clear causes them to be queued at the head (the default). Each v3 and v4 command can use SG_FLAG_Q_AT_TAIL or SG_FLAG_Q_AT_HEAD to override this setting.

SG_CTL_FLAGM_TAG_FOR_PACK_ID [0x2]  [raw]
    when written, set causes the following commands/requests on this fd to use the tag field rather than the pack_id (i.e. sg_io_v4::request_extra) field.

SG_CTL_FLAGM_TIME_IN_NS [0x1]  [raw]
    when written, set causes command/request duration calculations for the following commands/requests on this fd to be done in nanoseconds; clear causes duration calculations to be done in milliseconds, which is the default.

SG_CTL_FLAGM_UNSHARE [0x80]  [w, rd --> 0]
    this will undo the share relationship between a master fd and a slave fd. It can be sent to either fd. If a shared command/request is active on either fd then this ioctl(2) will fail with an errno of EBUSY. If no share relationship exists for the given fd this ioctl(2) will return 0 and do nothing.



For example to set command duration time to nanoseconds, the following snippet of code could be used. It is assumed that sei is an object of type struct sg_extended_info and that it has been zeroed out:

        sei.sei_wr_mask |= SG_SEIM_CTL_FLAGS;
        sei.ctl_flags_wr_mask |= SG_CTL_FLAGM_TIME_IN_NS;
        sei.ctl_flags |= SG_CTL_FLAGM_TIME_IN_NS;
        if (ioctl(sg_fd, SG_SET_GET_EXTENDED, &sei) < 0) {
                err = errno;
                goto error_processing;
        }
        /* success */
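
Reading boolean values works the same way through the read masks. The following sketch checks whether sg_fd is part of a file descriptor share and, if so, which side it is (sei zeroed beforehand as above):

        sei.sei_rd_mask |= SG_SEIM_CTL_FLAGS;
        sei.ctl_flags_rd_mask = SG_CTL_FLAGM_IS_SHARE | SG_CTL_FLAGM_IS_MASTER;
        if (ioctl(sg_fd, SG_SET_GET_EXTENDED, &sei) < 0) {
                err = errno;
                goto error_processing;
        }
        if (sei.ctl_flags & SG_CTL_FLAGM_IS_SHARE) {
                if (sei.ctl_flags & SG_CTL_FLAGM_IS_MASTER)
                        ;       /* master side of the share */
                else
                        ;       /* slave side of the share */
        }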

Finally it was noticed that there are many more "interesting" values to read from the driver (e.g. about its state) than values to write to the driver. So rather than potentially filling struct sg_extended_info with 32 bit values that are only ever read, the read_value field was introduced. One of the following constants is written to the read_value field, then the associated value can be read from the same field when the ioctl(2) finishes successfully.

SG_SET_GET_EXTENDED: values written to read_value, each with its scope and notes

SG_SEIRV_BOOL_MASK [0x1]; scope: fd
    with read_value set to SG_SEIRV_BOOL_MASK, after ioctl(SG_SET_GET_EXTENDED{SG_SEIM_READ_VAL}) read_value holds a 32 bit mask of the bit positions that are used in ctl_flags (and ctl_flags_wr_mask and ctl_flags_rd_mask). That value is currently 0xfff .

SG_SEIRV_DEV_FL_RQS [0x4]; scope: SCSI device
    sum of the number of free list requests on each fd belonging to the SCSI device (e.g. a SSD) that owns the given fd.

SG_SEIRV_DEV_SUBMITTED [0x6]; scope: SCSI device
    sum of the number of active list elements, excluding those associated with synchronous (blocking) invocations, on each fd belonging to the SCSI device that owns the fd given as the first argument to the ioctl(2).

SG_SEIRV_FL_RQS [0x3]; scope: fd
    number of "inactive" request objects currently on this fd's free list. When there are no active commands/requests, this value should be 1 and that entry should be this fd's reserve request (waiting for a user request to commence).

SG_SEIRV_INT_MASK [0x0]; scope: fd
    after the ioctl(2), read_value holds a 32 bit mask of the bit positions that are used in sei_wr_mask and sei_rd_mask . That value is currently 0xff .

SG_SEIRV_SUBMITTED [0x5]; scope: fd
    after the ioctl(2), read_value holds a 32 bit integer which is the number of requests on the active list; this includes all submitted non-blocking requests that have not yet been completed and read (and hence placed on the free list). So this includes requests that are inflight. ioctl(SG_GET_NUM_WAITING) is similar but it does not include inflight requests. This ioctl(2) holds no locks in the sg driver and accesses an atomic integer, so it is fast and should never block, making it suitable for polling. In the presence of other producers or consumers the number submitted may change before a user has time to act on the result of this call.

SG_SEIRV_VERS_NUM [0x2]; scope: driver
    after the ioctl(2), read_value holds a 32 bit integer which, when seen in decimal, is in the form [x]xyyzz where [x] means blank (space) if zero. This is usually expressed as an ASCII string as '[x]x.[y]y.zz' .



For example, to find out the number of commands/requests submitted (but not yet finished) on the device (e.g. /dev/sg3) associated with file descriptor sg_fd:

        sei.sei_wr_mask |= SG_SEIM_READ_VAL;
        sei.sei_rd_mask |= SG_SEIM_READ_VAL;
        sei.read_value = SG_SEIRV_DEV_SUBMITTED;
        if (ioctl(sg_fd, SG_SET_GET_EXTENDED, &sei) < 0) {
                err = errno;
                goto error_processing;
        }
        tot_num_submitted = sei.read_value;

Note that this number only counts non-blocking requests submitted through the sg driver. If, for example, /dev/sdc and /dev/sg3 were the same device then it doesn't count any requests that might be submitted by the sd driver through /dev/sdc .

14 Downloads and testing

This tarball sgv4_20190423 has three parts. Directories are named lk5.1, lk5.0 and lk_le4.20 (for less than or equal to lk 4.20). The difference in lk5.1 is the removal of SCSI bidi support (a tree wide regression). The difference in lk5.0 is a kernel wide patch by Linus Torvalds changing the number of parameters to the access_ok() function. Since the sg driver uses that call over 10 times, it broke a lot of patches, making it difficult to maintain a single set of patches. Those directories have a sub-directory called sgv4_20190412[bidi] which contains a series of 24 or 25 patches. Those directories also contain the 3 files that represent the sg v4 driver in the kernel: drivers/scsi/sg.c, include/scsi/sg.h and include/uapi/scsi/sg.h . The last file is new (i.e. it is not found in the production (v3) sg driver). If those 3 files are copied into the corresponding locations in a kernel source tree then a subsequent kernel build will generate the sg v4 driver. It might be a good idea to take a copy of drivers/scsi/sg.c and include/scsi/sg.h before copying those files, to simplify reverting to the sg v3 driver currently in the kernel.

The patches are against Martin Petersen's 5.2/scsi-queue branch (the part under lk5.1), his 5.1/scsi-queue branch (the part under lk5.0) and his 4.21/scsi-queue branch (the part under lk_le4.20). They should apply against lk 4.18 and later. The recent patches on the sg driver that might interfere (or cause fuzz) are:

96d4f267e40f9 (Linus Torvalds 2019-01-03 18:57:57 -0800) access_ok() [3 -->2 function arguments] appeared in lk 5.0-rc1

92bc5a24844ad (Jens Axboe 2018-10-24 13:52:28 -0600) remove double underscore version of blk_put_request(), appeared in lk 5.0-rc1

abaf75dd610cc (Jens Axboe 2018-10-16 08:38:47 -0600) blk_put_request(srp->rq) addition, first appeared in lk 4.20-rc1

The sg driver patch prior to that was 8e4a4189ce02f (Tony Battersby 2018-07-12), first appeared in v4.18-rc8

The sg3_utils package was originally written to test the v3 sg driver interface when it was introduced, circa 2000. So where better to put sg v4 test code? Since sg3_utils is well established, the author sees no benefit in introducing a sg4_utils package in which less than an estimated 5% of the code would change; it is much easier to incorporate that code change/addition in the existing package. The latest sg3_utils beta on the main page (revision 818 (a beta of version 1.45) as this is written) contains utilities for testing the sg v4 interface. The underlying support library has been using the sg v4 header for many years as a common (i.e. intermediate) format (API). If the given device was a bsg device node then the sg v4 interface was used; otherwise (e.g. for sg and block devices) the sg v4 header was translated down to a v3 header and forwarded on. In the current beta, sg3_utils will use ioctl(SG_GET_VERSION_NUM) on sg devices and, if it is a v4 driver, will send a v4 header; otherwise it will do as it does now. [That v4 interface usage can be defeated by './configure --disable-linux-sgv4' .]

The presence of the environment variable SG3_UTILS_LINUX_NANO (typically with 1 assigned to it) in the shell executing sg3_utils package utilities will cause the elapsed time of SCSI commands to be calculated in nanoseconds if the v4 sg driver is active. Typically command times are only shown when the --verbose option is given (or several of them). The duration is measured from the point the sg driver sends a command to the block layer to the point when the sg driver receives a (soft) interrupt indicating that the command has finished. Note that user space measures of a command duration should always be greater than the duration the sg driver calculates. Most of the test utilities in the next paragraph also act on SG3_UTILS_LINUX_NANO .

In the testing directory of that beta are several utilities that are "v4" driver aware:

These test utilities are not built by default since they are not part of the automake setup; instead an old school Makefile in the testing directory is used. And sg_tst_async and sgh_dd are C++ programs and can be built with 'make -f Makefile.cplus' . Prior to building these test utilities the sg3_utils library needs to be built. That can be done with 'cd <root_of_sg3_utils> ; ./configure ; cd lib ; make ; cd ../testing' . There is a 'make install' which will place the C test utilities in /usr/local/bin ; there is also a 'make -f Makefile.cplus install' for placing the C++ utilities in /usr/local/bin .

Since the sg v4 driver may or may not be present in the kernel that the above utilities are built and run in, a local copy of the new <kernel_src>/include/uapi/scsi/sg.h header needed for the sg v4 driver is kept in the testing directory. It has the name 'uapi_sg.h' so it won't collide with the "real" header if it is present.

15 Sg driver and the block layer

One might think that a SCSI pass-through such as the sg driver would inject user supplied SCSI commands and associated data into the SCSI mid-level, which would route them down through a SCSI low level driver (LLD) to the (virtual) SCSI Host Bus Adapter (HBA) and on to a SCSI device (e.g. a SAS disk). So that path doesn't involve the Linux block layer, right? Wrong: in Linux those commands are injected into the block layer as pass-through commands. And of course the block layer won't interfere with them before forwarding them to the SCSI mid-level, surely? Wrong again. If the SCSI device being accessed has reached its queue limit, or the HBA that the command passes through has run out of resources, then in the author's opinion a SCSI pass-through driver should tell this to the user. Not in Linux. The block layer treats the SCSI pass-through as another disk user (even if the SCSI device is a tape unit, which is not a block device) and queues up the injected commands, assuming the resource problem is temporary.

Well, that temporary problem may need a SCSI administrative command or task management function issued from the user space to fix it. However in Linux a user program may need to resort to desperate methods, such as resetting the logical unit (LU), target or HBA, or even rebooting the machine, to clear those other commands in the block layer's stalled queue.

This behaviour of the block layer's queue has been observed with the sg_tst_async test utility using an option favouring submissions over completions (e.g. 'sg_tst_async --qfav=2'). For long running tests the number of non-blocking requests a thread has outstanding will grow without bound, finally invoking the OOM ("out of memory") killer. And the OOM killer isn't particularly accurate: in tests it only killed the culprit process about 60% of the time. The OOM killer can be configured to bring down (i.e. reboot) the machine and that may lead to more predictable outcomes than letting it kill the process it thinks is the culprit. And it can't kill the block layer :-) The '--override=OVN' option has been added to the sg_tst_async utility in order to put an upper limit on how large a queue can become. Note that this was not a problem with the v3 sg driver since it limited the number of outstanding commands to 16 on each file descriptor.

16 Other documents

The original sg driver documentation is here: SCSI-Generic-HOWTO and a more recent discussion of ioctl(SG_IO) is here: sg_io .

17 Conclusion

The sg v4 driver is designed to be backwardly compatible with the v3 driver. The simplest way for an application to find out which driver it has is with ioctl(SG_GET_VERSION_NUM). Removing a restriction such as the limit of 16 outstanding commands per file descriptor can catch out programs that rely on hitting that limit. If the need arises, driver parameters to re-impose that limit and any other differing behaviour can be added. The best way to test backward compatibility is to place this new driver "under" existing apps that use sg driver nodes and check their functionality.

Return to main page.

Douglas Gilbert

Last updated: 16th May 2019 20:00