The Linux SG driver version 4.0


1 Introduction

2 Changes to sg driver between version 3.5.36 and 4.0

3 Architecture of the sg driver

4 Synchronous usage

5 Sharing file descriptors

6 Async usage in v4

7 Request sharing

7.1 Slave waiting

8 Sharing design considerations

9 Multiple requests

10 pack_id or tag

11 Bi-directional command support

12 SG interface support changes

13 IOCTLs

14 Downloads

15 Other documents

16 Conclusion


1 Introduction

The SCSI Generic (sg) driver in Linux has been present since version 1.0 of the kernel in 1992. In the 26 years since then the driver has had 3 interfaces to the user space and now a fourth is going to be added. The first and second interfaces (v1 and v2) use the same header: 'struct sg_header', with only v2 now fully supported. The "v3" interface is based on 'struct sg_io_hdr'. Both these structures are defined in include/scsi/sg.h, the bulk of whose contents will move to include/uapi/scsi/sg.h as part of this upgrade. Prior to the changes now proposed, the "v4" interface is only implemented in the block layer's bsg driver ("block SCSI generic", which is around 15 years old). The bsg driver's user interface is found in include/uapi/linux/bsg.h. These changes propose adding support for the "v4" interface via the SG_IO ioctl(2) for synchronous use, and new SG_IOSUBMIT and SG_IORECEIVE ioctl(2)s for asynchronous use. The plan is to deprecate and finally remove (or severely restrict) the write(2)/read(2) based asynchronous interface used currently by the v1, v2 and v3 interfaces. The v3 asynchronous interface is also supported by SG_IOSUBMIT and SG_IORECEIVE.

If the driver changes are accepted the driver version, which is visible via an ioctl(2), will be bumped from 3.5.36 (in lk 5.0) to 4.0.x. The opportunity is being taken to clean up the driver after 20 years of piecemeal patches. Those patches have left the driver with misleading variable names and nonsensical comments. Plus there are new kernel facilities that the driver can take advantage of. Also of note is that much of the low level code once in the sg driver (remnants remain) has been moved to the block layer and SCSI mid-level. This upgrade has been done as a two stage process: first clean the driver up, remove some restrictions and re-instate some features that had been accidentally lost. Three versions of a patchset were sent to the linux-scsi list in October 2018. That patchset took the sg driver to version 3.9.01. Now the v4 interface is supported as described here, so the sg driver version number has been bumped to 4.0.06.

Note that the Linux block layer implements the synchronous sg v3 interface via ioctl(SG_IO) on all block devices that use the SCSI subsystem, directly or via translation (e.g. SATA disks use libata which implements the SAT T10 standard). In pseudocode an example like 'ioctl(open("/dev/sdc"), SG_IO, ptr_to_sg_io_hdr)' works as expected. That case is not handled by the sg driver (since /dev/sdc is a block device node), so it is important that the sg driver's implementation of ioctl(SG_IO) remains consistent with the other implementations (mainly the one found in the block/scsi_ioctl.c kernel source code).

2 Changes to sg driver between version 3.5.36 and 4.0

A summary is given as bullet points:

There are still some things to do:

3 Architecture of the sg driver

Nothing much has changed in the overall architecture of the sg driver between version 3 (v3) and version 4. It may help later explanations if a pictorial summary of the driver's object tree is given:




The sg driver is shown as a laptop at the top of the object tree. The arrow end of solid lines shows objects that are created automatically or by actions outside the user interface to the sg driver. So the disk-like objects created at the second level come from the device scanning logic undertaken by the SCSI mid-level driver in Linux. Note that there are SCSI devices other than disks such as tape units and SCSI enclosures. Also note that not all storage devices in Linux use the SCSI subsystem, examples of these are NVME SSDs and SD cards that are not attached via USB. The type of SCSI device objects is sg_device (and in the driver code they appear as objects of C type 'struct sg_device'). Even though the sg driver's implementation is closely associated with the block subsystem, the sg driver's device nodes are character devices in Linux (e.g. /dev/sg1). The nodes are also known as character special devices.

At the third level are file descriptors which the user creates via the open(2) system call (e.g. 'fd = open("/dev/sg1", O_RDWR);') . Various other system calls such as close(2), write(2), read(2), ioctl(2) and mmap(2) can use that file descriptor made by open(2). The file descriptor will stay in existence until the process containing the code that opened it exits or the user closes it (e.g. 'close(fd);'). A dotted line is shown from the "owning" device to each file descriptor in order to indicate that it was created by direct user action via the sg interface. The type of file descriptor objects is sg_fd. BTW most system calls have "man pages" and the form open(2) indicates that there is a manpage in section 2 which is for system calls. Other common manpage sections are "1" for commands and utilities (e.g. 'man 1 cp' explaining the copy command); "3" for system libraries (e.g. 'man 3 snprintf') and "8" for system administration commands.
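To make the above concrete, here is a minimal sketch (not from the driver patchset) of creating a sg file descriptor and confirming that the node really belongs to the sg driver. The SG_GET_VERSION_NUM ioctl has been supported since the v2 driver and yields an integer such as 30536 for version 3.5.36; a value of 40000 or more would indicate a v4 driver.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int open_sg(const char *dev_name)
{
    int ver_num = 0;
    int sg_fd = open(dev_name, O_RDWR);    /* O_RDONLY is enough for many commands */

    if (sg_fd < 0) {
        perror(dev_name);
        return -1;
    }
    if (ioctl(sg_fd, SG_GET_VERSION_NUM, &ver_num) < 0 || ver_num < 30000) {
        fprintf(stderr, "%s is not a sg device, or the driver is too old\n", dev_name);
        close(sg_fd);
        return -1;
    }
    printf("%s: sg driver version %d.%d.%d\n", dev_name, ver_num / 10000,
           (ver_num / 100) % 100, ver_num % 100);
    return sg_fd;           /* caller close(2)s this when finished */
}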

At the lowest level are the sg_request objects each of which carry a user provided SCSI command to the target device which is its grandparent in the object tree. These requests are then sent via the block and SCSI mid-level to a Low Level Driver (LLD) and then across the transport (with iSCSI that can be a long way) to the target device (e.g. a SSD). User data that moves in the same direction as the request is termed as "data-out" and the SCSI WRITE command is an example. In nearly all cases (an exception is a command timeout) a response traverses the same route as the request, but in the reverse direction. Optionally it may be accompanied by user data which is termed as "data-in" and the SCSI READ command is an example. Notice that a heavy (thicker) line is associated with the first request of each file descriptor; it points to a reserve request (in earlier sg documentation this was referred to as the reserve buffer). That reserve request is built after each file descriptor is created and before the user has a chance to send a SCSI command/request on that file descriptor. This reserve request was originally created to make sure CD writing programs didn't run out of kernel memory in the middle of a "burn". That is no longer a major concern but the reserve buffer has found other uses: for mmap-ed and direct IO. So when the mmap(2) system call is used on a sg device, it is the associated file descriptor's reserve request's buffer that is being mapped into the user space.

The lifetime of sg_request objects is worth noting. When a sg_request object is active ("inflight" is the term used in the driver) it has both an associated block request and a SCSI mid-level object. They have similar roles and overlap somewhat. However once the response is received (and typically before the user has seen that response or any "data-in") the block request and the SCSI mid-level objects are freed up. The sg_request object lives on, along with the data carrying part of the block request called the bio as that may be carrying "data-in" that has yet to be delivered to the user space. That is because the default user data handling (termed as "indirect IO") is a two stage process. For data-in, the data will first be DMA-ed from the target device into kernel memory, typically under the control of the LLD; the second stage is copying from that kernel memory to user space, under the control of this driver. Even after the user has fetched the response and any data-in, the sg_request continues to live. [However once any data-in has been fetched the block request bio is freed.] The sg_request object is then marked "inactive" and placed on a sg_request object free list, one of which is maintained for each file descriptor. So each sg file descriptor contains two request lists: one for any command that is active and the other one is a free list for inactive requests (there is an exception). The next time a user tries to send a SCSI command through that file descriptor, its free list will be checked to see if any inactive sg_request object has a large enough data buffer suitable for the new request; if so that object will be (re-)used for the new request. Only when the user calls close(2) on that file descriptor will all the requests on the free list be truly freed. Note that in Unix, and thus Linux, the OS guarantees that it will call the release function (called release() in the kernel and sg_release() in this driver) for every file descriptor that the user has opened in a process, irrespective of what the code in that process does. This is important because processes can be shut down by signals from other processes or drivers, segmentation violations (i.e. bad code) or the kernel's OOM (out-of-memory) killer.

The above description is setting the stage for a newly added feature called "sharing" introduced in the sg v4 driver. It also uses the reserve request.

4 Synchronous usage

These two forms: ioctl(sg_fd, SG_IO, ptr_to_v3_obj) and ioctl(sg_fd, SG_IO, ptr_to_v4_obj) can be used for submitting SCSI commands (requests) and waiting for the response before returning to the calling thread. This action is termed as synchronous in this driver. Most block devices that use or can translate the SCSI command set also support the first form (i.e. the ioctl(2) that takes a pointer to a v3 interface object as its third argument). So this pseudo code will work: ioctl(open("/dev/sdc"), SG_IO, ptr_to_v3_obj) but not if the third argument is a ptr_to_v4_obj. Some storage related character devices (e.g. /dev/st2 and /dev/ses3) will also accept the first form.
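The first form has been stable for a long time, so a minimal sketch of a synchronous SCSI INQUIRY using the v3 interface (struct sg_io_hdr) may be helpful here; this is an illustration rather than code from the patchset:

#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

/* resp_len must be <= 255 for this 6 byte INQUIRY cdb */
int do_inquiry(int sg_fd, unsigned char *resp, int resp_len)
{
    unsigned char inq_cdb[6] = {0x12, 0, 0, 0, (unsigned char)resp_len, 0};
    unsigned char sense_b[32];
    struct sg_io_hdr io_hdr;

    memset(&io_hdr, 0, sizeof(io_hdr));
    io_hdr.interface_id = 'S';                   /* marks this as the v3 interface */
    io_hdr.cmd_len = sizeof(inq_cdb);
    io_hdr.cmdp = inq_cdb;
    io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;  /* data-in (from the device) */
    io_hdr.dxferp = resp;
    io_hdr.dxfer_len = resp_len;
    io_hdr.sbp = sense_b;
    io_hdr.mx_sb_len = sizeof(sense_b);
    io_hdr.timeout = 20000;                      /* milliseconds */

    if (ioctl(sg_fd, SG_IO, &io_hdr) < 0)        /* caller waits here until completion */
        return -1;
    if ((io_hdr.info & SG_INFO_OK_MASK) != SG_INFO_OK)
        return -2;   /* inspect status, host_status, driver_status and sense_b */
    return 0;
}

The same call works on /dev/sdc style block device nodes because the block layer implements ioctl(SG_IO) with the v3 interface object as noted in the Introduction.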

Only two drivers currently support the second form (i.e. whose third argument is a ptr_to_v4_obj): this driver and the bsg driver.

It is important to understand that the use of ioctl(SG_IO) is only synchronous seen from the perspective of the calling thread/task/process. It is only the calling thread that waits for completion of the request. Any other thread or process submitting requests to the same or other devices associated with the sg driver will not be impeded by that wait. This assumes that the underlying devices can queue SCSI commands which most current SCSI devices are capable of doing. As an example: a large copy between two storage devices can be broken up into multiple copy segments, with each copy segment copying a comfortable amount of data (say 1 MByte); then multiple threads can each take a copy segment from a pool and fulfill them by doing a READ then a WRITE SCSI command. Each READ/WRITE pair of commands seems synchronous but overall the threads are doing asynchronous READs and WRITEs with respect to one another.

Apart from some special cases (one shown below), it isn't generally useful to mix synchronous and asynchronous commands/requests on the same thread. An asynchronous command/request could be submitted followed by a second synchronous command which will go through to completion before it returns; then the first command's completion can be fetched. Care is taken within the driver so that an asynchronous completion, even if it is pending will not be incorrectly supplied as the result of a synchronous command.

The simplest way to issue SCSI commands to any device is with a synchronous ioctl(SG_IO). Asynchronous commands have some advantages (mainly performance) but that comes at the expense of more complexity for the user application. When a program is juggling multiple asynchronous submissions and completions it needs to track either pack_id, tag or a user pointer to correctly match completions with submissions. Since the sg driver maintains strong per file descriptor context, one way to simplify the matching problem is to have one file descriptor per submission/completion. However then multiple file descriptors need to be juggled, which is not so onerous.




In the diagram above a synchronous ioctl(SG_IO) is shown. As a general rule the ioctl(2) will return -1 with a positive errno value if there is a problem creating the object of type sg_request in the top left of the diagram. Examples of this are syntax errors or contradictory information in the v3 or v4 interface object. Another cause could be running out of resources. Once the sg_request object is "inflight" any errors will be reported via the v3 or v4 interface object. As noted in the diagram the user thread is placed in an interruptible wait state, awaiting command/request completion. If the command takes some time the user may use a keyboard interrupt (e.g. control-C) or "kill" the containing process from another terminal (e.g. with kill(1)). Another abnormal situation is the kernel OOM killer ("out of memory"). Any of these will cause the shown sg_request object to become an orphan. The default action is to remove orphan sg_request objects as soon as practical. However if the file descriptor has the "keep orphan" flag set (see ioctl(SG_SET_KEEP_ORPHAN) below) a further read(2) or ioctl(SG_IORECEIVE) will fetch the response information from the orphan which will then be freed.

The main context that a user space application controls in this driver is the file descriptor, shown as a sg_fd object in the earlier object tree diagram. Roughly speaking a file descriptor object is created when sg_fd=open(<sg_device_name>) succeeds and is destroyed by a close(sg_fd). Again, roughly speaking a file descriptor is confined to a user process. In multi-threaded programs it is often a good idea to have separate sg file descriptors in each thread. Some exceptions to these generalizations are discussed in the next section.

Another feature of the file descriptor object in the sg driver is that each one has a reserve request created at the same time as the file descriptor. This reserve request is immediately placed on the new sg file descriptor's free list. Any new command/request on that file descriptor will use that reserve request if:

When a command request is completed, its sg_request object is placed (or replaced) on the free list. So no sg_request objects are actually deleted until the owning file descriptor is close(2)d. In the case where there are copies of the file descriptor (e.g. after a fork(2)) that happens when the last close(2) is done.

5 Sharing file descriptors

First a rationale. Copying data between storage devices is a relatively common operation. It can be both time and resource consuming. The best approach is to avoid copying altogether. Another approach is to defer copies (or parts of them) until they are really necessary, which is the basis of COW (i.e. copy on write). Then there are offloaded copies: for example, where the source and destination are disks in the same array, a "third party copy" program (e.g. based on SCSI EXTENDED COPY and its related commands) can tell the array to do the copy itself and report whether it finished successfully or not. However in many cases copies are unavoidable.

If the dd program is considered, copying one part of a normal block storage device to another storage device involves a surprising number of copies. Copies of large amounts of data are typically done in a staggered fashion to lessen the impact on other things the system may be doing. So typically 1 MegaByte (say) is read from the source device into a buffer, followed by a write of that buffer to the destination device; if no error occurs, repeat until finished. Copies between a target device and kernel memory are typically done by DMA (direct memory access) controlled by the LLDs owning the storage devices. So another copy is needed on each side of the copy to get the data in and out of kernel buffers to the user space. Moving data between a user space process and the kernel space has a little extra overhead to deal with situations like the process being killed while data is being copied to and from it. So a reasonable implementation of dd has three buffers (2 in the kernel space) and performs 2 DMAs then 2 copies between the user space and the kernel space. As storage devices and transports get quicker, the time taken to do those copies may become significant compared to the device access time.

Another aspect of the sharing being proposed is security. Often a user has the right to copy data but not see it. This is usually accomplished by encrypting the data. Another approach might be to make sure the copy's data is kept in kernel buffers and thus hidden from the user who is copying it. While the v4 sg driver can do this, the sg driver is not written with a view to security, since it offers a pass-through interface which, by definition, is a method to circumvent an Operating System. Those building highly secure computer systems might consider removing the sg driver or restricting its access to highly privileged users.

Sharing is a new technique added to the sg v4 driver to speed copy operations. The user first sets up a sharing relationship between two sg file descriptors, one that will be used for doing SCSI READ commands (more generally any data-in SCSI command), and the other that will be used for doing SCSI WRITE commands using the data received by the previous READ. Any data-out command can be used so, for example, the SCSI WRITE command could be replaced by WRITE AND VERIFY or WRITE SCATTERED. The file descriptor that does the READ is called the master side by the driver and the file descriptor that does the WRITE is called the slave side. The following diagram shows how one share between two file descriptors is set up.




Here the master side is /dev/sg1 and has 4 open file descriptors (fd_s 1 through 4). The slave side is /dev/sg2 which has 3 open file descriptors (fd_s 5 through 7). The share shown is set up when the thread or process containing fd5 calls the "EXTENDED" ioctl on the fd5 file descriptor (i.e. the ioctl's first parameter) with a pointer to an integer containing fd1 as the ioctl's third parameter. The C code is a little more complicated than that.
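Below is a hedged sketch of that "little more complicated" C code. The ioctl name (SG_SET_GET_EXTENDED), 'struct sg_extended_info' and its SG_SEIM_SHARE_FD / share_fd members are assumptions based on the proposed v4 header; check the installed uapi sg.h for the exact spelling before using it.

#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int set_fd_share(int slave_fd /* e.g. fd5 on /dev/sg2 */,
                 int master_fd /* e.g. fd1 on /dev/sg1 */)
{
    struct sg_extended_info sei;

    memset(&sei, 0, sizeof(sei));
    sei.sei_wr_mask = SG_SEIM_SHARE_FD;    /* we are writing the share_fd field */
    sei.sei_rd_mask = SG_SEIM_SHARE_FD;    /* and want it read back, as a check */
    sei.share_fd = (unsigned int)master_fd;
    /* the ioctl is issued on the slave (WRITE) side, naming the master fd */
    return ioctl(slave_fd, SG_SET_GET_EXTENDED, &sei);
}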

How does the thread or process containing fd5 know about fd1? That is up to the design of the user space application. If they are both in the same thread then it should be obvious. If they are in different threads within the same process then it should be relatively simple to find out. The interesting case is when they are in different processes. A child process inherits all open file descriptors (including those belonging to the sg driver) from its parent in the Linux fork() system call. For processes that don't have a parent child relationship, UNIX domain sockets can be used to "send" an open file descriptor from one process to another. Note that in this case the file descriptor number might differ (e.g. because the receiver side already is using the same file descriptor number as the sender's number) but they will still logically refer to the same thing. Also that statement above about process termination leading to sg_release() being called for any sg file descriptors open(2)-ed in that process needs qualification: in this case the last process to hold an open file descriptor being terminated causes the driver's sg_release() to be called. In short the last close(2) on a file descriptor causes sg_release() to be called.

The sg driver's file descriptors can only be part of one share (pair). Given this restriction, in the above diagram, fd5 cannot also be in a share with fd4. fd6 may be in a share with fd7; that would imply that the share could be used for a copy from /dev/sg2 to /dev/sg2. The master side of the share monopolizes that file descriptor's reserve request, hence there can only be one outstanding share request per pair of shared file descriptors. Given this restriction one way to do a copy using queued commands is to use POSIX threads. As an example from the above diagram, if 3 copy worker threads were used then the first thread could utilize fd1 and fd5, the second thread could utilize fd3 and fd6 while the last thread could utilize fd4 and fd7. This is what the sgh_dd test utility does (see below).

After a share of two file descriptors is established command requests can still be sent to both file descriptors in the normal fashion. Only when the new flag SGV4_FLAG_SHARE is given, or OR-ed in with other flags, is request sharing performed. See the 7 Request sharing section below.

6 Async usage in v4

The asynchronous interface in the context of the sg driver means issuing a SCSI command in one operation then at some later time a second operation retrieves the status of that SCSI command. Any data being transferred associated with the SCSI command is guaranteed to have occurred before that second operation succeeds. The synchronous interface can be viewed as combining these two operations into a single system call (e.g. ioctl(SG_IO) ).

The asynchronous interface starts with a call to ioctl(SG_IOSUBMIT) which takes a pointer to the sg v3 or v4 interface object. This object includes the SCSI command with data transfer information for either data-in (from device) or data-out (to device). Depending on the storage device accessed (identified by the sg file descriptor given as the first argument to the ioctl() system call) the SCSI command will take milliseconds or microseconds to complete. Chances are the ioctl(SG_IOSUBMIT) will complete in a sub-microsecond timescale (on a modern processor) and that will be done before the SCSI command completes. If further processing depends on the result of that SCSI command then the program must wait until that SCSI command is complete. When that completion occurs, the data-out is guaranteed to be on the nominated storage device (or in its cache). And if a data-in transfer was specified, that data is guaranteed to be in the user space as directed. How does the program find out when that SCSI command has completed?

The exact timing of the data-out and data-in transfers can be thought of as a negotiation between the HBA (Host Bus Adapter controlled by the LLD) and the storage device. The essential point is that the data transfer and the completion are asynchronous to the program that requested the SCSI command. Since the completion is guaranteed to follow any associated data transfer then the completion event is what we will concentrate on. Detecting asynchronous events depends on Operating System features such as signals and polling. Polling is the simpler technique. However the simplest approach is to call the final step in the process which is ioctl(SG_IORECEIVE) as soon as possible. In the likely case that the SCSI command completion has not occurred, then the ioctl() can do one of two things: it can wait until the completion does occur or yield an "error" called EAGAIN. Similar to SCSI sense data, a UNIX errno doesn't always imply a hard error. So EAGAIN is not a hard error, but it tells the program that the operation didn't occur but may happen later, so try again, but preferably don't retry immediately. What determines whether the ioctl() waits or returns EAGAIN is the presence of the O_NONBLOCK flag on the file descriptor.

Two file descriptor flags are important to the asynchronous interface of the sg driver: O_NONBLOCK and O_ASYNC. The file descriptor flags are defined in such a way that they can be OR-ed together. The normal place to define flags is in the open(2) system call (its second argument) but they can be changed (and added to) later with the fcntl(2) system call. If O_NONBLOCK is given then it will typically be given in the open(2). The O_ASYNC flag is a bit more difficult to handle because it arms the SIGIO (also known as SIGPOLL) signal; if that signal occurs before the program has set up a handler for it, the program will exit. Actually Linux ignores O_ASYNC in the open(2) call (see 'man 2 open' in the BUGS section), so fcntl(2) is the only way to set it. Below is a simplified example of adding the O_ASYNC flag to a file descriptor (sg_fd) that is already open:

flags = fcntl(sg_fd, F_GETFL, NULL);

fcntl(sg_fd, F_SETFL, flags | O_ASYNC);

It is possible to replace the classic Unix SIGIO signal with a POSIX real-time signal by making an additional call:

fcntl(sg_fd, F_SETSIG, SIGRTMIN + 1);

After that call the SIGRTMIN+1 real time signal will be used instead of SIGIO. Even though you could use hard numbers for the real-time signals the advice is to always use an offset from SIGRTMIN or SIGRTMAX (a negative offset in the MAX case) because the C library can (and does for its POSIX threads implementation) steal some of the lower real time signals and adjust the SIGRTMIN value that the application program sees. Real time signals have improved semantics compared to the classic Unix signals (e.g. multiple instances of the same real time signal can be queued against a process where Unix signals would meld into one signal event in a similar situation).
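A minimal sketch of installing a handler for that real time signal follows; it should be done before the fcntl(F_SETSIG) call above so that an early signal cannot terminate the process. This is ordinary POSIX signal handling, not sg specific code.

#define _GNU_SOURCE
#include <signal.h>

static void rt_sig_handler(int sig, siginfo_t *si, void *ucontext)
{
    /* With F_SETSIG and SA_SIGINFO, si->si_fd identifies the file descriptor
     * that completed (see fcntl(2)). Keep the handler short: set a flag or
     * write to a pipe and let the main loop call ioctl(SG_IORECEIVE). */
    (void)sig; (void)si; (void)ucontext;
}

int install_rt_handler(void)
{
    struct sigaction sa;

    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;        /* deliver siginfo_t to the handler */
    sa.sa_sigaction = rt_sig_handler;
    return sigaction(SIGRTMIN + 1, &sa, NULL);
}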

In the diagram below the lifetime of an active sg_request object is shown from when it is created or retrieved from the free list in the top left to when the SCSI command has completed and the user space has been informed on the bottom right. It assumes that either the O_NONBLOCK flag is set on the file descriptor (assumed to be the same in all the system call boxes shown with the blue band at the top), or ioctl(SG_IORECEIVE) has SGV4_FLAG_IMMED OR-ed into its flags. When the first ioctl(SG_IORECEIVE) is called the SCSI command has not completed so it gets rejected with EAGAIN. The first poll(2) system call indicates with POLLOUT that another SCSI command can be issued but there are no SCSI commands waiting for an ioctl(SG_IORECEIVE) on this file descriptor. Note that the poll(2) description refers to a file descriptor, not to this particular sg_request object, but for simplicity we will assume there is only one outstanding SCSI command on this file descriptor. At some future time, preferably long before the command approaches its timeout (often 60 seconds or more) the storage device via its LLD informs the sg driver that a SCSI command belonging to this file descriptor has completed. If O_ASYNC has been set on this file descriptor then the sg driver will issue a SIGIO signal to the owning process. A poll(2) system call after the internal completion point yields (POLLIN | POLLOUT) [IOWs both POLLIN and POLLOUT]. That tells us that the next ioctl(SG_IORECEIVE) will be successful as is indicated in the diagram.




While it is useful to think and illustrate the above mentioned ioctl(2)s and poll(2)s as being in reference to a single sg_request object, they are all actually against the file descriptor that is the parent of that sg_request object. This distinction matters when multiple sg_request objects are outstanding. In the absence of any selection information (e.g. a pack_id or a tag) the ioctl(SG_IORECEIVE) will fetch the oldest sg_request object since the active (and completed) command list acts as a FIFO. Instead of poll(2) the user may call the ioctl(SG_GET_NUM_WAITING) which yields the number of sg_request objects belonging to a file descriptor that have completed internally but are yet to have ioctl(SG_IORECEIVE) [or read(2) for the async v3 interface] called on them.
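Putting the pieces of this section together, here is a hedged sketch of the submit, poll, receive pattern with the v4 interface object. SG_IOSUBMIT and SG_IORECEIVE are the new ioctls described above, so a v4 capable sg driver and header are assumed; the sg_io_v4 object is assumed to have been filled in beforehand (guard 'Q', cdb via ::request, data buffer via ::din_xferp, etc.).

#include <poll.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>
#include <linux/bsg.h>          /* struct sg_io_v4 */

int submit_and_wait(int sg_fd, struct sg_io_v4 *hdr)
{
    struct pollfd pfd = { .fd = sg_fd, .events = POLLIN };

    if (ioctl(sg_fd, SG_IOSUBMIT, hdr) < 0)   /* returns almost immediately */
        return -1;
    /* POLLIN: at least one completed request is waiting on this file descriptor */
    if (poll(&pfd, 1, 60000 /* ms */) <= 0)
        return -1;
    /* with O_NONBLOCK (or SGV4_FLAG_IMMED) this would yield EAGAIN if it were
     * called before the internal completion point */
    return ioctl(sg_fd, SG_IORECEIVE, hdr);
}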

7 Request sharing

Request sharing refers to two requests, usually belonging to different storage devices (e.g. two disks), sharing the same in-kernel data buffer. Before request sharing can take place a share of two file descriptors belonging to those two storage devices needs to be set up. This is discussed in the earlier 5 Sharing file descriptors section.

The diagram below shows the synchronous sg driver interface using ioctl(SG_IO) which can take either the v3 or v4 interface. The synchronous interface can be seen as the combination of the various calls that make up the asynchronous interface discussed in the previous section. The time that the synchronous ioctl(SG_IO) takes is directly related to the access time of the underlying storage device. To stress that point the system call rectangles (with a blue band at the top) in the diagram below are shown as elongated rectangles with a beginning component to the left and a completion component to the right. The elongated system call boxes span the access time of the associated storage device.

A request share only takes place when a command request is issued and a SGV4_FLAG_SHARE flag is used (OR-ed with any other flags). This should be done first on the master side with a READ (like) command request. Other flags that might be combined with this are SG_FLAG_NO_DXFER or SG_FLAG_MMAP_IO flag (but not both). The SG_FLAG_NO_DXFER flag stops the in-kernel data buffer to user space copy. The SG_FLAG_MMAP_IO flag maps the in-kernel data buffer into the user space; that user space area is made available via a mmap(2) system call preceding the command request being sent. The diagram below shows the simpler case where the minimum number of flags are set. For brevity the leading SGV4_ is removed from the flags values in the following diagrams.





The slave may continue to send normal command requests but at some stage after this point it should send a WRITE (like) command request with both the SGV4_FLAG_SHARE and SG_FLAG_NO_DXFER flags set. That will use the in-kernel data buffer from the preceding master share command request and send that data (i.e. data-out) to the slave's device. So a single, in-kernel data buffer is used for a master share request followed by a slave share request.
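For illustration, here is a hedged sketch of one copy segment done that way with the v4 interface and synchronous ioctl(SG_IO): a READ(16) on the master fd, then a WRITE(16) on the slave fd, both flagged for sharing. It assumes the file descriptor share from the previous section is already established, and that the proposed v4 header defines SGV4_FLAG_SHARE and SGV4_FLAG_NO_DXFER (the latter mirrors the existing SG_FLAG_NO_DXFER). The master READ here also uses NO_DXFER so the data stays in the kernel; drop that flag if the application needs to see the data.

#include <string.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>
#include <linux/bsg.h>

static void build_rw16(unsigned char *cdb, int is_write, uint64_t lba, uint32_t blocks)
{
    memset(cdb, 0, 16);
    cdb[0] = is_write ? 0x8a : 0x88;            /* WRITE(16) : READ(16) */
    for (int k = 0; k < 8; ++k)                 /* big endian LBA in bytes 2-9 */
        cdb[2 + k] = (unsigned char)(lba >> (8 * (7 - k)));
    for (int k = 0; k < 4; ++k)                 /* transfer length in bytes 10-13 */
        cdb[10 + k] = (unsigned char)(blocks >> (8 * (3 - k)));
}

int copy_segment(int master_fd, int slave_fd, uint64_t lba, uint32_t blocks, uint32_t blk_sz)
{
    unsigned char r_cdb[16], w_cdb[16];
    struct sg_io_v4 h;

    build_rw16(r_cdb, 0, lba, blocks);
    memset(&h, 0, sizeof(h));
    h.guard = 'Q';
    h.request = (uint64_t)(uintptr_t)r_cdb;
    h.request_len = sizeof(r_cdb);
    h.din_xfer_len = blocks * blk_sz;           /* sizes the in-kernel buffer ... */
    h.flags = SGV4_FLAG_SHARE | SGV4_FLAG_NO_DXFER;   /* ... data stays in the kernel */
    h.timeout = 60000;
    if (ioctl(master_fd, SG_IO, &h) < 0 || (h.info & SG_INFO_CHECK))
        return -1;                              /* master READ failed */

    build_rw16(w_cdb, 1, lba, blocks);
    memset(&h, 0, sizeof(h));
    h.guard = 'Q';
    h.request = (uint64_t)(uintptr_t)w_cdb;
    h.request_len = sizeof(w_cdb);
    h.dout_xfer_len = blocks * blk_sz;          /* data-out comes from the shared buffer */
    h.flags = SGV4_FLAG_SHARE | SGV4_FLAG_NO_DXFER;
    h.timeout = 60000;
    return (ioctl(slave_fd, SG_IO, &h) < 0 || (h.info & SG_INFO_CHECK)) ? -1 : 0;
}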

In the terminology of the block subsystem both the master and slave share requests have their own request object, each with their own bio object. However the sg driver provides the data storage for those bios and arranges for the slave share request to use the same data buffer as the preceding master request's bio. And this is the reason that the slave request must use the SG_FLAG_NO_DXFER flag, otherwise a transfer from the user space usually associated with a WRITE (like) command would overwrite the in-kernel data buffer.

Once the slave request is successfully completed another master share request may be issued. Sanity checks ensure that using the SGV4_FLAG_SHARE flag on a file descriptor that is not part of a share will cause an error, as will trying to send a master share request before a prior master share request is complete (which means its matching slave request is finished). Also using the SGV4_FLAG_SHARE flag on a slave request will fail if there is no master request 'waiting' for it (as shown in the diagram above, the master must be in "rs_swap" state). Once a pair of file descriptors is shared, the master side's reserve request will only be used for command requests that have the SGV4_FLAG_SHARE flag set.

If the master share request fails (i.e. gives back any non zero status, or fails or warns at some other level) then the master request on completion will go to state "rs_inactive" (i.e. not "rs_swap"). It is also possible that the application wants to stop the request share after the master request (e.g. because the user wants to abort the copy or there is something wrong with the data copied to the user space near the location marked "***" in the above diagram). The EXTENDED ioctl has a MASTER_FINI boolean for that: writing 1 (true) changes the "rs_swap" to "rs_inactive" state while writing 0 (false) does the reverse of that (see below as to why).

The brown arrow-ed lines in the above diagram show the movement of the "dataset" which is usually an integral number of logical blocks (e.g. each containing 512 or 4096 bytes). The brown arrow-ed lines that are vertical and horizontal do not involve copying (or DMA-ing) of that dataset. That leaves three brown arrow-ed lines at an angle: the DMA from the device being read, the DMA to the device being written, and an optional in-kernel to user space copy (annotated with "***"). The vertical brown arrow-ed lines are performed by swapping pointers to scatter-gather lists within the kernel space.

The sgh_dd utility in the sg3_utils/testing directory uses both POSIX threads and sg driver sharing as discussed in this section (if the sg driver running on the target system is recent enough). sgh_dd has help (with 'sgh_dd -h') but no man page, like other test programs (its code is its documentation and an example of use).

A reasonable single READ, multiple WRITE solution needs the ability to have multiple slaves each associated with a different disk. Looking at the diagram above, two things need to happen to the master: it needs to adopt a new slave and it needs to get back into "rs_swap" state. A variant of the above mentioned ioctl(slave_fd, EXTENDED{SHARE_FD},) called ioctl(master_fd, EXTENDED{CHG_SHARE_FD},) has been added. As long as the new slave file descriptor meets requirements (e.g. it is not part of a file descriptor share already) then it will replace the existing slave file descriptor. To get back into "rs_swap" state, writing the value 0 (false) to the MASTER_FINI boolean in the EXTENDED ioctl will do what is needed. The EXTENDED ioctl is a little tricky to use (because it essentially replaces many ioctls) but a side benefit is that multiple actions can be taken by a single EXTENDED ioctl call. So both the actions required to switch to another slave, ready to do another WRITE, can be done with a single invocation of the EXTENDED ioctl.

Here is a sequence of user space system calls to READ from /dev/sg1 (the master) and WRITE that same data to /dev/sg5, /dev/sg6 and /dev/sg7 (the slaves). Assume that fd1 is a file descriptor associated with /dev/sg1, fd5 with /dev/sg5, etc. In pseudocode that might be: ioctl(fd5, EXTENDED{SHARE_FD}, fd1); ioctl(fd1, SG_IO, FLAG_SHARE + READ); ioctl(fd5, SG_IO, FLAG_SHARE|NO_DXFER + WRITE); ioctl(fd1, EXTENDED{CHG_SHARE_FD=fd6 + MASTER_FINI=false}); ioctl(fd6, SG_IO, FLAG_SHARE|NO_DXFER + WRITE); ioctl(fd1, EXTENDED{CHG_SHARE_FD=fd7 + MASTER_FINI=false}); and ioctl(fd7, SG_IO, FLAG_SHARE|NO_DXFER + WRITE). So four ioctls to move data (one READ and three WRITEs) and three "housekeeping" ioctls. Notice that the WRITEs are done sequentially, they could theoretically be done in parallel but that would add complexity. Also note that a second READ cannot be done until the final WRITE from the previous sequence has completed, there is no easy way around that since only one, in-kernel buffer is being used (and a second READ would overwrite it). To make this sequence slightly faster (and hide the data from the user space) the flag in the second ioctl (the READ) can be expanded to FLAG_SHARE|NO_DXFER .

The sgh_dd utility in the sg3_utils testing directory (rev 803 or later) has been expanded to test the single READ, multiple WRITE feature. It has two extra "of" (output file) parameters: "of2=" and "ofreg=". The "of2=" is for a second WRITE sg device and the "ofreg=" takes a regular file or a pipe and will use the data that comes from the READ operation marked with "***" in the above diagram. If "ofreg=" is present among sgh_dd's operands then the READ's flag will be FLAG_SHARE, if "ofreg=" is not present its flags will be FLAG_SHARE|NO_DXFER . The latter should be slightly faster, and that difference can be reduced with "iflag=mmap". The "of2=" operand shares "oflag=" and "seek=" with "of=".

7.1 Slave waiting

A simple analysis of an asynchronous copy segment cycle based on the previous section starts with a READ command being sent to the master's file descriptor [one user space to kernel space context swap], followed by a signal or a sequence of polls [one or more context swaps], followed by a read(2) or ioctl(SG_IORECEIVE) to get the result [another context swap]. Assuming the response is good, then the same sequence is repeated, this time on the slave's file descriptor doing a WRITE. So that is at least six context swaps and importantly they must occur in that order. This is what the diagram in the previous section shows, but with synchronous rather than asynchronous calls.

An enhancement has been added to relax the strict ordering outlined in the previous paragraph. The slave's WRITE command can be sent to the driver in advance of its paired master READ command completing. Again the diagram below shows a copy segment: a READ from one disk followed by a WRITE of the data fetched to a second disk.




The important feature of this diagram is that the slave WRITE is started before the prior master READ has completed. Three synchronization points are shown: S1, S2 and S3. The S1 point is when the slave becomes aware that the master request (the READ) has been issued, but not necessarily completed. The slave request can be issued at any time following S1. If the slave request is in another thread or process, then the application needs a way of signalling to the slave thread/process that it can now issue the slave WRITE. The S2 synchronization point is purely internal (i.e. there is no code needed by the application). S2 is when the driver gets notification that the READ has finished. Assuming the READ was successful, and that S1 is before S2 then the slave WRITE, which has been held internally, can now be issued to the device [/dev/sg2]. Notice that the slave request is in rs_swait state between S1 and S2, indicating that it is being held. The S3 synchronization point is when the slave WRITE has finished and the master transitions from rs_slave to rs_inactive state. After S3 the next copy segment can be started.

Why show the master as an asynchronous request and the slave as a synchronous request? As a practical matter, the application needs to know when the master READ request has been issued so it can then issue the slave WRITE request. The simplest way to do that is to make the master READ asynchronous (a timer is another technique, but it may be too quick (e.g. occurring before S1) or too slow, wasting time). As for the slave WRITE request we are not interested in it until it has completed, hopefully successfully, hence the use of a synchronous request.

So this "slave waiting" approach decouples the strict ordering outlined in the first paragraph of this section into two loosely coupled sequences, the first for the master, the second for the slave. the only addition to application complexity is making the master request asynchronous. Notice that all completions (e.g. the ioctl(master_fd, SG_IORECEIVE,)) must still be processed and checks made for errors.

What about errors? Wouldn't code be simpler without error processing? Yes, but it would be a lot less interesting. The simpler case is the slave WRITE request failing, in which case the error is conveyed in the WRITE's completion in the normal manner. Then the application can decide whether to repeat the WRITE, or to WRITE somewhere else, or abort the copy. The more interesting case is when the master READ request fails as the notification of that may occur after the application has issued the slave WRITE request. In that case, a decision is made at the S2 synchronization point, not to issue the WRITE request to /dev/sg2. Instead the ioctl(slave_fd, SG_IO,) completes just after S2 with a return value of 0 (so there is no error value in errno) but with sg_io_hdr::driver_status or sg_io_v4::driver_status set to DRIVER_SOFT. And whenever ::driver_status, ::device_status or ::transport_status are non-zero then SG_INFO_CHECK is OR-ed into the ::info field. So that field is always worth checking on completion. The actual error is given in the master's completion in the normal fashion.
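A small sketch of that completion check follows, using the v4 object. SG_INFO_CHECK comes from the existing sg.h; the DRIVER_SOFT value shown is an assumption taken from the kernel's driver byte codes and should be verified against the installed headers (or sg3_utils' definitions).

#include <stdio.h>
#include <linux/bsg.h>
#include <scsi/sg.h>        /* SG_INFO_CHECK */

#ifndef DRIVER_SOFT
#define DRIVER_SOFT 0x02    /* assumed value of the kernel's DRIVER_SOFT code */
#endif

int check_slave_completion(const struct sg_io_v4 *hp)
{
    if (hp->info & SG_INFO_CHECK) {
        if (hp->driver_status == DRIVER_SOFT)
            fprintf(stderr, "WRITE withheld: paired master READ failed\n");
        else
            fprintf(stderr, "device/transport/driver status: 0x%x/0x%x/0x%x\n",
                    (unsigned int)hp->device_status,
                    (unsigned int)hp->transport_status,
                    (unsigned int)hp->driver_status);
        return -1;
    }
    return 0;       /* the WRITE really did complete without error */
}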

Can the master's completion be after S3? Yes it can. That allows a single thread to do the following pseudo-code sequence:

ioctl(master_fd, SG_IOSUBMIT, <ptr_to_READ_pt_object>);

ioctl(slave_fd, SG_IO, <ptr_to_WRITE_pt_object>);

ioctl(master_fd, SG_IORECEIVE, <ptr_to_READ_pt_object>);

There are only three context swaps in that sequence with only ioctl(slave_fd, SG_IO,) taking the time required to actually do the READ and the WRITE. In real code, those three calls should have their return values checked plus, at the very least, a check that ::info does not have SG_INFO_CHECK OR-ed into it.

The testing/sgh_dd.c utility (in sg3_utils-1.45 rev 811, see main page) has an oflag=swait command line operand for exercising this feature.

Details: "error" can be a bit difficult to define in SCSI. The interesting ones are like: that READ worked but in the firmware's opinion this storage device will soon fail! You can ignore that if this is the final copy of data on that medium to something safer, but otherwise it is probably more serious than that READ failing. Anyway, when this driver is deciding internally whether a request has failed (e.g. that other requests are queued on), then any non-zero value in the SCSI status, or the driver or transport status is regarded as an error with the queued commands that have not been sent to the device getting DRIVER_SOFT as indicated above.

8 Sharing design considerations

The primary application of sharing is likely to be copying from one storage device to another storage device where both are SCSI devices (or use the SCSI command set as SATA disks do in Linux). Let's assume the copy is large enough that it needs to be cut up into segments, implemented by READ (from source) and WRITE (to destination) commands, each pair of which share the same data. Even with modern SSDs, maximum performance is usually obtained by queuing commands to storage devices. However the design of sharing in the sg driver requires sequential READ, WRITE commands on a pair of shared file descriptors in a way that precludes queuing on those file descriptors. Worse still, the storage device that does the READ (i.e. the master side of the share) must wait, effectively doing nothing while its paired WRITE command is being done; it could be doing the next READ while it's waiting.

One relatively simple solution is to take advantage of threading which is well supported by the Linux kernel. Multi-threaded programs are typically multiple threads of execution running in a single process in which all threads share the same memory and other resources such as file descriptors. In the case of copying using sharing in the sg driver, a good approach would be to have one management thread and multiple worker threads. Each worker thread would go to a distribution centre where information about the next segment offsets to be copied would be fetched; then the worker thread could go and do that copy segment using those offsets and return to the distribution centre for information on the next segment offsets to be copied, or be told there is nothing more to do in which case the thread could exit. The distribution centre needs to be stateful, which in this context means that it needs to remember when it has given out copy segment offsets and not give them out again (unless the original thread reports an error). One way to protect this distribution centre from two worker threads accessing it at the same time is with a mutex shared between all worker threads. Finer grained threading mechanisms such as atomic integers may be able to provide this protection in place of a mutex.
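As an illustration of that "distribution centre" (not code from sgh_dd), here is a minimal mutex protected sketch: each worker thread calls next_segment() to claim the offset of its next copy segment, or is told (by a false return) that the copy is finished.

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

struct dist_centre {
    pthread_mutex_t lock;
    uint64_t next_lba;        /* next logical block to hand out */
    uint64_t end_lba;         /* one past the last block of the copy */
    uint32_t blocks_per_seg;  /* e.g. 2048 x 512 byte blocks = 1 MB segment */
};

static bool next_segment(struct dist_centre *dc, uint64_t *lba, uint32_t *blocks)
{
    bool more = false;

    pthread_mutex_lock(&dc->lock);
    if (dc->next_lba < dc->end_lba) {
        *lba = dc->next_lba;
        *blocks = (uint32_t)((dc->end_lba - dc->next_lba < dc->blocks_per_seg)
                             ? (dc->end_lba - dc->next_lba) : dc->blocks_per_seg);
        dc->next_lba += *blocks;      /* remember what has been given out */
        more = true;
    }
    pthread_mutex_unlock(&dc->lock);
    return more;    /* false: nothing left, the worker thread can exit */
}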

With the sg driver there is no limit (in the driver, modulo memory availability) to the number of file descriptors that can refer to a single storage device. So for this segmented copy using sg driver sharing, a good approach would be to do a separate open(2) system call on the source and another on the destination in each worker thread. Then each worker thread could set up a file descriptor share with the master being the source file descriptor. The number of worker threads should be no more than the maximum queue depth the two devices can comfortably handle. That said, having hundreds of worker threads may consume a lot of the machine's resources. An advantage of this approach is that each worker thread can use the sg driver's simpler synchronous interface (i.e. ioctl(SG_IO) ). Then the reader might wonder, is there any command queuing taking place? The answer is yes, because one way of viewing the sg driver is that under the covers it is always asynchronously accessing the SCSI devices. So even when one thread is blocked on an ioctl(SG_IO), another thread can call ioctl(SG_IO) and that command will be forwarded to the device.

There is a big "gotcha" with this design (and almost any other design for segmented copy that isn't completely single threaded). The gotcha does not apply when the destination device is a SCSI device, or uses the pwrite(2) or writev(2) system calls but does apply to the write(2) system call, often used to write to a pipe or socket. The problem is that if a read is issued by one thread (or any asynchronous mechanism) called R1 and before it completes another thread issues a read called R2 then there is no guarantee that R1 will complete before R2. And if R2 does complete before R1 and the write(2) system call is called for W2 (i.e. the pair of R2) before W1 then those writes will be out of order. Detecting out-of-order writes when gigabytes are being copied can be a pain. If the source and shuffled destination are available as files then a utility like sha1sum will show them as different (because they are) but an old school sum (like from 'sum -s') will give the same value for both. There is a related issue associated with the atomicity of the Linux write(2) command. There is no corresponding atomicity issue with the SCSI WRITE command.

To save time and resources the master side shared READ request should be issued with SG_FLAG_NO_DXFER flag OR-ed with its other flags. That is assuming that the copy program does not need to "see" the data as it flies past. As a counter example, a copy program might want to do a sha256sum on the data being copied in which case that program needs to "see" the inflight data.

The above design can be extended to the single reader, multiple writer case. In other words each worker thread would open file descriptors to the READ storage device and every WRITE storage device. Code to demonstrate these techniques can be found in the sg3_utils package's testing/sgh_dd.c . That can be built into a utility, yet another dd variant called sgh_dd .

SCSI storage devices optionally report a "Block limits" Vital Product Data (VPD) page which contains a field called "Optimal transfer length" whose units are Logical blocks (e.g. usually either 512 or 4096 bytes). There is also a "Maximum transfer length" whose units are the same. If that VPD page is present (fetched via the SCSI INQUIRY command) but those fields are 0 then no guidance is provided. Otherwise the segment size chosen for a copy should probably be the minimum of the source and destination Optimal transfer length fields. However if that implies a segment size in the Megabyte range (say over 4 MB) then the Linux kernel may object.
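The following small helper shows one way to apply that guidance; it assumes the Block Limits VPD page (0xB0) has already been fetched for both devices with an INQUIRY (EVPD=1). The byte offsets (12 to 15 for the Optimal transfer length, in logical blocks) follow the SBC standard.

#include <stdint.h>

static uint32_t get_opt_xfer_len(const unsigned char *vpd_b0)
{
    /* bytes 12-15 of the Block Limits VPD page: Optimal transfer length */
    return ((uint32_t)vpd_b0[12] << 24) | ((uint32_t)vpd_b0[13] << 16) |
           ((uint32_t)vpd_b0[14] << 8) | (uint32_t)vpd_b0[15];
}

/* returns a segment size in logical blocks, falling back to def_blocks
 * (e.g. 2048 for 1 MB with 512 byte blocks) when neither device gives guidance */
uint32_t choose_segment_blocks(const unsigned char *src_b0,
                               const unsigned char *dst_b0, uint32_t def_blocks)
{
    uint32_t s = get_opt_xfer_len(src_b0);
    uint32_t d = get_opt_xfer_len(dst_b0);

    if (s == 0 && d == 0)
        return def_blocks;
    if (s == 0)
        return d;
    if (d == 0)
        return s;
    return (s < d) ? s : d;     /* minimum of the two optimal lengths */
}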

Other copy designs are possible that, instead of using threads, use separate processes. One practical problem with this is the ioctl(2) that sets up the share between a destination file descriptor (fd) and a source fd. That will be done in the process containing the destination fd but how does it find out about the source fd? One way is, in a process containing the source file descriptor, to use the Unix fork(2) system call to spawn a new process. The child process will share the same file descriptors as its parent. So if the child then goes on to open the destination storage device then it has the two file descriptors it needs to set up the share. While that solution may look good on paper, it may require a radical rewrite of existing code to implement. Perhaps a better solution is to pass an open file descriptor from one process to another process using a Unix socket. The blog by Keith Packard outlines the technique. Code based on both techniques can be found in the sg3_utils package's testing/sg_tst_ioctl.c (with the '-f' option).

9 Multiple requests

The bsg write(2) asynchronous interface (removed from the kernel around lk 4.15) supported multiple sg_io_v4 objects in a single invocation. That is harder to do with an ioctl(2) based interface as the kernel favours pointers to fixed size objects passed as the third argument. Multiple requests have been implemented in this driver using an extra level of indirection [a common technique for solving software problems].

A new sg v4 interface flag, SGV4_FLAG_MULTIPLE_REQS, has been added to sg_io_v4::flags. It can be given in a sg_io_v4 object (i.e. a pointer as the third argument) to either ioctl(SG_IO) or ioctl(SG_IOSUBMIT). This is termed the controlling sg_io_v4 object and is expected to have no ::request and ::request_len fields (i.e. NULL and 0 respectively) but have both din and dout fields defined. The ::din_xferp field is expected to be a pointer to an array of sg_io_v4 objects containing SCSI command information (i.e. the multiple requests). An error is returned (ERANGE) if the ::din_xfer_len field is not a whole multiple of the (byte) size of the sg_io_v4 object. Near the completion of both ioctl(SG_IO) and ioctl(SG_IOSUBMIT) the array of sg_io_v4 objects including any error or warning information is written back to where the ::dout_xferp field points. The ioctl(SG_IO) is always synchronous, including this case when it is used for multiple requests. Hence the ioctl(SG_IO) doesn't return to the user space until all the submitted SCSI command/requests are completed including error and any sense data being sent back to the user space. However ioctl(SG_IOSUBMIT) is also synchronous when used for multiple requests, unlike its normal usage where it is the first half of an asynchronous SG_IOSUBMIT, SG_IORECEIVE pair. As outlined in the next paragraph there is an operational difference between the two ioctl(2)s.
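A hedged sketch of that layout follows: 'cmds' is an array of n fully prepared sg_io_v4 objects and 'ctl' is the controlling object just described. SGV4_FLAG_MULTIPLE_REQS is one of the new v4 flags, so its exact spelling should be checked against the installed uapi sg.h.

#include <string.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>
#include <linux/bsg.h>

int do_mrq(int sg_fd, struct sg_io_v4 *cmds, int n)
{
    struct sg_io_v4 ctl;

    memset(&ctl, 0, sizeof(ctl));
    ctl.guard = 'Q';                        /* ::protocol and ::subprotocol stay 0 */
    ctl.flags = SGV4_FLAG_MULTIPLE_REQS;
    ctl.din_xferp = (uint64_t)(uintptr_t)cmds;        /* the request array ... */
    ctl.din_xfer_len = (uint32_t)(n * sizeof(struct sg_io_v4));
    ctl.dout_xferp = ctl.din_xferp;         /* ... written back here on completion */
    ctl.dout_xfer_len = ctl.din_xfer_len;
    /* ioctl(SG_IO) gives complete-before-next mode: each command finishes
     * before the next one in the array is started */
    return ioctl(sg_fd, SG_IO, &ctl);
}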

When ioctl(SG_IO) is used for multiple requests, each command request is submitted and completed before the next one is started. Each command is processed in the order that it appears in the array given to ::din_xferp of the controlling sg_io_v4 object. This is important if the data output from one command (dout) is needed as input (din) to a following command. This will be the case when copies are done. When ioctl(SG_IOSUBMIT) is used for multiple requests, all command requests are submitted, unless otherwise indicated (see below), before completions are checked. Completions are then processed in the time order they arrive, so there can be a high amount of queuing to the underlying storage device(s) and that often improves performance. So at this level of multiple requests, ioctl(SG_IOSUBMIT) is more asynchronous than ioctl(SG_IO). Rather than overload the terms synchronous and asynchronous, let us call these two modes: complete-before-next and submit-most-before.

This multiple request interface can also use two file descriptors via the shared file descriptor facility described above. Sending a command to the other (shared) file descriptor is indicated by OR-ing the SGV4_FLAG_DO_ON_OTHER flag on the commands that need it. There is no requirement in this usage for WRITE like commands (i.e. with dout) to be sent on the master or slave file descriptor; likewise for READ like command (i.e. with din). If the SGV4_FLAG_DO_ON_OTHER flag is given on a file descriptor that is not shared then an ERANGE error will result. As a general rule, all multiple request, initial processing errors are reported back with the ERANGE errno value which is not used for anything else in the driver. There are a few exceptions, for example out-of-memory errors yield ENOMEM, as they do in other situations.

In submit-most-before mode, the "unless otherwise indicated" qualification is explained here. If the SGV4_FLAG_COMPLETE_B4 flag is OR-ed into a request's flags field, then that command will wait until that request is completed before submitting the next request in the command array. And since the SGV4_FLAG_SHARE flag requires two commands to act in lockstep (one on each file descriptor) that flag implies SGV4_FLAG_COMPLETE_B4 (i.e. the SGV4_FLAG_COMPLETE_B4 flag does not need to be given with SGV4_FLAG_SHARE flag, it is assumed).

The SGV4_FLAG_SHARE flag can also be used with multiple requests as outlined in the above sections. If so, the master and slave indications now need to be honoured. This means that the multiple request starting with a READ (with SGV4_FLAG_SHARE, without SGV4_FLAG_DO_ON_OTHER) should be sent to the master fd (i.e. ioctl(2)'s first parameter); followed by a WRITE (with SGV4_FLAG_SHARE and with SGV4_FLAG_DO_ON_OTHER) which will indicate a WRITE on the slave fd. That pattern should be replicated until the copy is finished. And since this buffer sharing needs a strict ordering we should use ioctl(SG_IO). The shared buffer being used belongs to the master's reserve request. Even if ioctl(SG_IOSUBMIT) is used in this case, the pre-scan will notice the SGV4_FLAG_SHARE flags and will switch to complete-before-next mode. The complete-before-next mode is illustrated (using ioctl(SG_IO)) below:




The sequence points, shown as blue circles in the above diagram, are where the driver notionally changes its attention from one file descriptor to the other, with the prime (i.e. the trailing quote) showing the receiving end of that attention. SQ1 is where the share between the two file descriptors is established and that does not necessarily need to be immediately before the main multiple request ioctl(2). SQ2 on the master is at the completion of the first READ (i.e. 'ind 0') and at this point the driver starts the first WRITE (i.e. 'ind1') on the other file descriptor which is the slave. The performance win here is that there is no return to the user space to check the just completed command and issue the next command. At SQ3 the WRITE has completed and this causes the second READ (i.e. 'ind 2') to start. If the SGV4_FLAG_STOP_IF flag has been OR-ed into the controlling object's ::flag field then at SQ2, SQ3 and SQ4 an additional check is made to see if an error or warning has been issued by the storage device, the transport to it, or the LLD (and its associated HBA); if so ioctl(SG_IO) will exit.

The array of sg_io_v4 objects pointed to by ::din_xferp in the controlling object must not exceed 2 MBytes in size (E2BIG is placed in errno if the array is too large). The array is copied into the driver and is scanned up to four times. The first scan (a pre-scan) checks for syntax errors and contradictions (e.g. using SGV4_FLAG_SHARE but there is no file share established on the file descriptor given as the first argument to the ioctl(2) ). The next scan submits the commands and in complete-before-next mode, waits for each command to complete before submitting the next command. The final two scans are only needed in submit-most-before mode. They are not really list scans but counting back the number of commands submitted. First, all completions for the file descriptor given as the first argument to ioctl(SG_IOSUBMIT) are processed. Note that this is not necessarily in the order they were submitted. The final scan is for all completions on the shared file descriptor (i.e. those commands issued with the SGV4_FLAG_DO_ON_OTHER flag). To distinguish between SCSI commands that have been completed successfully and those that have not been submitted (e.g. due to some error condition such as disk offline prior to submission), those commands that have been completed have SG_INFO_MRQ_FINI OR-ed into each completed command's ::info field.

The SGV4_FLAG_SIG_ON_OTHER flag causes two actions when the command/request that it is set on completes. First it will flush (i.e. write out where the controlling object's ::dout_xferp field points) the current state of the array of command sg_io_v4 objects to user memory. Next it will send a SIGPOLL signal (or a real time signal if fcntl(F_SETSIG) has been used) to the other file descriptor with POLLIN set. The other file descriptor should be prepared to receive that signal because the default action of that signal is to terminate the process owning the other file descriptor. A possible scenario for using this flag is given below in the paragraph on READ GATHERED.

As always, the error processing is a challenge. Whenever the multiple request ioctl returns zero, the whole array of sg_io_v4 objects is copied back to the user space using the ::dout_xferp pointer in the controlling object, assuming ::dout_xfer_len is large enough. It can be, but probably shouldn't be, shorter than ::din_xfer_len. The output fields in each sg_io_v4 object are populated as normal. Errors can be categorized into four groups: OS errors (e.g. ENOMEM), driver errors (e.g. from the LLD or from the associated HBA), transport errors (e.g. SAS connectivity issue) and those from the device (e.g. SCSI status and associated sense data). OS errors will stop further submissions, however an attempt will be made to complete all commands that have already been submitted. This is important to stop or reduce resource leaks that could ultimately cause the kernel's "OOM killer" (out-of-memory) to be invoked. The other error categories will not stop further processing unless the SGV4_FLAG_STOP_IF flag has been given in complete-before-next mode (i.e. via ioctl(SG_IO)) in which case no further submissions are made. The controlling sg_io_v4 object is also copied back to the user space. Two output fields in the controlling object may be written: the ::info field has the index (origin zero) of the last element from the array of sg_io_v4 commands that was submitted; if all is well this should equal the number of elements in the sg_io_v4 array. If the ::info field has an integer less than the number of elements in that array then the ::driver_status field will be the errno value that stopped the submission. If this is the complete-before-next mode and the SGV4_FLAG_STOP_IF flag has been set then ::driver_status will probably be zero (since an OS errno was not the cause of stopping further submissions) and the various *_status fields (and sense data) of the array element corresponding to the index in the ::info field should hold information about the cause.

As a convenience, the ::response and ::max_response_len fields can be given in the controlling object and they will be used in any element of the command array that doesn't have its own response and max_response_len fields (i.e. they are NULL and 0 respectively). Also ::din_xferp and ::dout_xferp can hold the same (pointer) value. The controlling object and every object in the array must have their ::guard field set to 'Q' and their ::protocol and ::subprotocol fields set to zero. [All unused fields should be set to zero to avoid future unpleasant surprises.] The controlling object must also have the SGV4_FLAG_MULTIPLE_REQS flag set in the ::flags field and may have the SGV4_FLAG_STOP_IF flag OR-ed into the ::flags field. The controlling object's ::din_xferp and ::dout_xferp fields must be set to valid pointers and their associated lengths should be non-zero.

There is no diagram of the submit-most-before mode as it would be rather cluttered. A possible scenario for this mode is to simulate the missing READ GATHERED SCSI command. A few years back the T10 committee added a WRITE SCATTERED command but felt there was no need for READ GATHERED, which would have been a bit more difficult to implement. An end user requirement could be loading a large file that has been fragmented on disk so that multiple READ commands would be needed. These READs could be completed in any order; the end user is only interested in knowing when they have all been loaded (or whether there has been a problem). So this is a good candidate for submit-most-before mode, allowing the storage device(s) to complete the submitted READs in whatever order is easiest and possibly fastest. These READs could be placed in an array of sg_io_v4 objects passed to ioctl(SG_IOSUBMIT). Note that the SGV4_FLAG_STOP_IF flag is ignored in this mode. If there is an error and partial information may be useful to the end user, then the returned array of sg_io_v4 objects needs to be checked carefully to determine which READs succeeded, which failed and possibly which were never submitted. The not-submitted case may be due to an out-of-resources (kernel) error (e.g. ENOMEM), but hopefully that will be rare. If there is a particularly important READ amongst those READs then it could have the SGV4_FLAG_SIG_ON_OTHER flag set. Additionally the application would need to set up a file share because a signal is going to be generated on that other file descriptor when the important READ completes. That additional file descriptor can be associated with the same sg device as the primary file descriptor.

Here are a few notes on multiple requests. Both ::din_iovec_count and ::dout_iovec_count in the controlling object must be zero. The ::timeout value in the controlling object is ignored and all output values in the controlling object are set to zero apart from ::info and ::driver_status (as discussed above). The submit-most-before mode can consume a lot of kernel resources as many SCSI requests may be outstanding between the submit and completion phases. If hundreds or even thousands of commands are to be submitted in this mode, it is best to split them into many ioctl(SG_IOSUBMIT) calls. The complete-before-next mode does not have an issue with a large number of requests.
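
To pull the above together, here is a rough sketch (not taken from the driver or sg3_utils; the cdbs, buffer sizes and error handling are placeholders) of issuing a small group of READ-like commands in submit-most-before mode. It assumes the proposed include/uapi/scsi/sg.h (or the uapi_sg.h copy in the sg3_utils/testing directory) is on the include path so that SG_IOSUBMIT, SGV4_FLAG_MULTIPLE_REQS and SG_INFO_MRQ_FINI are defined, and that sg_fd is an open sg device file descriptor.

  /* Sketch only: a small multiple requests (mrq) submission. Assumes the
     proposed sg v4 header (SG_IOSUBMIT, SGV4_FLAG_MULTIPLE_REQS and
     SG_INFO_MRQ_FINI); cdb contents and error handling are placeholders. */
  #include <stdio.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <scsi/sg.h>          /* or the local uapi_sg.h copy */

  #define NUM_CMDS 4

  static unsigned char cdbs[NUM_CMDS][10];   /* e.g. READ(10) cdbs, filled in elsewhere */
  static unsigned char bufs[NUM_CMDS][4096]; /* data-in buffers */

  int submit_mrq(int sg_fd)
  {
      struct sg_io_v4 cmds[NUM_CMDS];   /* the command array */
      struct sg_io_v4 ctl;              /* the controlling object */
      int k;

      memset(cmds, 0, sizeof(cmds));
      for (k = 0; k < NUM_CMDS; ++k) {
          cmds[k].guard = 'Q';
          cmds[k].request_len = sizeof(cdbs[k]);
          cmds[k].request = (unsigned long)cdbs[k];
          cmds[k].din_xfer_len = sizeof(bufs[k]);
          cmds[k].din_xferp = (unsigned long)bufs[k];
          cmds[k].timeout = 20000;              /* milliseconds */
      }
      memset(&ctl, 0, sizeof(ctl));
      ctl.guard = 'Q';
      ctl.flags = SGV4_FLAG_MULTIPLE_REQS;      /* mrq; via SG_IOSUBMIT --> submit-most-before */
      ctl.din_xferp = (unsigned long)cmds;      /* command array read from here ... */
      ctl.din_xfer_len = sizeof(cmds);
      ctl.dout_xferp = (unsigned long)cmds;     /* ... and written back here */
      ctl.dout_xfer_len = sizeof(cmds);

      if (ioctl(sg_fd, SG_IOSUBMIT, &ctl) < 0) {
          perror("ioctl(SG_IOSUBMIT, mrq)");
          return -1;
      }
      /* ctl.info reports how far the submission scan got; each completed
         element has SG_INFO_MRQ_FINI OR-ed into its ::info field */
      for (k = 0; k < NUM_CMDS; ++k) {
          if (!(cmds[k].info & SG_INFO_MRQ_FINI))
              fprintf(stderr, "command %d not completed\n", k);
          else if (cmds[k].device_status || cmds[k].driver_status ||
                   cmds[k].transport_status)
              fprintf(stderr, "command %d completed with errors\n", k);
      }
      return 0;
  }

Pointing ::din_xferp and ::dout_xferp at the same array, as permitted above, keeps the results in place so the post-ioctl scan for SG_INFO_MRQ_FINI can be done directly on the submitted array.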

10 pack_id or tag

When doing asynchronous IO with the sg driver there needs to be a way to wait for a particular response, not just the response that is the oldest. [By oldest is meant the command request in the active queue (a per file descriptor queue) whose callback occurred at the earliest time; this will usually be the first one in the active queue.] A common example would be a multi-threaded application where each worker thread shares the same file descriptor, issues one command request and waits for the response to that request before issuing another command request.

Historically the way to do this with the sg driver is with a pack_id (short for packet identifier) which is a 32 bit integer. The pack_id is generated by the user application and passed into the interface structure (in the v4 interface the pack_id is placed in request_extra). The pack_id doesn't have to be unique (per file descriptor) but it is practical to keep it unique (the sg driver does not check its uniqueness). The user application should then call ioctl(SG_SET_FORCE_PACK_ID, 1) which tells the sg driver to read (from the user space) the pack_id given to ioctl(SG_IORECEIVE) or read(2) and then fetch the (first) matching request on the active queue, or wait for it to arrive. The pack_id value -1 (or 0xffffffff if viewed as an unsigned integer) is used as a wildcard or to report that nothing is available, depending on the context. The pack_id method has worked well, generating few errors or queries over the years, and will continue to be supported in the sg v4 driver.

So what is a tag in this context? It is also a 32 bit integer but instead of being generated by the user application, it is generated by the block layer. So instead of being given via the v4 interface structure to SG_IOSUBMIT, it is returned in the interface structure at the completion of ioctl(SG_IOSUBMIT) in the request_tag field (which is a 64 bit integer). Notice that the tag is only available in the v4 interface structure and via the two new async ioctls: SG_IOSUBMIT and SG_IORECEIVE. Using the tag to find a command response is very similar to the pack_id method described above. As currently implemented the tag logic does not work all the time; its reliability will most likely depend on the SCSI host (the HBA driver, or LLD) that the target device belongs to. There seems to be no reliable way for this driver to fetch the tag from the block infrastructure. Currently this driver simply asks for it after forwarding the command request to the block code. However three cases have been observed: it gets a tag; it doesn't get the tag (it is too early); or it doesn't get the tag (it is too late) because the request has already finished! The third case may only occur with the scsi_debug driver which can complete requests in a microsecond or less (that is configurable). The tag wildcard is also -1 (or all "f"s in hex when viewed as an unsigned integer) so again the logic is very similar to pack_id.

Given the above, the default remains what it was in v3 of the sg driver, namely using pack_id unless another indication is given. To use tags to choose a response, ioctl(SG_SET_FORCE_PACK_ID, 1) is needed first on the file descriptor. Then the v4 interface object given to ioctl(SG_IOSUBMIT) should have SGV4_FLAG_YIELD_TAG OR-ed with any other flags in that interface object. After that ioctl(2) has finished successfully, the request_tag field in that object should be set; if it is -1 then no tag was found (as discussed in the previous paragraph). The matching ioctl(SG_IORECEIVE) call should set the request_tag field as appropriate and have the SGV4_FLAG_FIND_BY_TAG flag OR-ed with other flags.
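
As an illustration of the default pack_id method (the tag variant differs only in the SGV4_FLAG_YIELD_TAG and SGV4_FLAG_FIND_BY_TAG flags and the request_tag field just described), a worker thread might do something like the following sketch. It assumes the proposed header supplying SG_IOSUBMIT and SG_IORECEIVE; the cdb and data buffer come from the caller and error handling is abbreviated.

  /* Sketch: asynchronous submit/receive matched by pack_id using the v4
     interface. Assumes the proposed sg v4 header (SG_IOSUBMIT and
     SG_IORECEIVE). */
  #include <string.h>
  #include <sys/ioctl.h>
  #include <scsi/sg.h>

  int send_and_wait(int sg_fd, unsigned char *cdb, int cdb_len,
                    unsigned char *din, int din_len, int my_pack_id)
  {
      struct sg_io_v4 hdr;
      int one = 1;

      /* tell the driver to match responses on the caller supplied pack_id;
         typically done once per file descriptor */
      if (ioctl(sg_fd, SG_SET_FORCE_PACK_ID, &one) < 0)
          return -1;

      memset(&hdr, 0, sizeof(hdr));
      hdr.guard = 'Q';
      hdr.request_len = cdb_len;
      hdr.request = (unsigned long)cdb;
      hdr.din_xfer_len = din_len;
      hdr.din_xferp = (unsigned long)din;
      hdr.request_extra = my_pack_id;   /* pack_id lives here in the v4 interface */
      hdr.timeout = 20000;              /* milliseconds */
      if (ioctl(sg_fd, SG_IOSUBMIT, &hdr) < 0)
          return -1;

      /* ... other threads may submit and receive on the same fd here ... */

      memset(&hdr, 0, sizeof(hdr));
      hdr.guard = 'Q';
      hdr.request_extra = my_pack_id;   /* wait for this response only */
      if (ioctl(sg_fd, SG_IORECEIVE, &hdr) < 0)
          return -1;
      return hdr.device_status;         /* SCSI status of the matched response */
  }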

11 Bi-directional command support

One of the main reasons for designing the sg v4 interface was to handle SCSI (or other storage protocol) bi-directional commands (abbreviated here to bidi). In the SCSI command sets, bidi commands are mainly found in block commands that support RAID (e.g. XDWRITEREAD(10)) and in many of the Object Storage Device (OSD) commands. Linux contains an "osd" upper level driver (ULD) and an object based file system called exofs. New SCSI commands are being considered, such as READ GATHERED, which would most likely be a bidi command. The NVMe command set (NVM) extends the bidi concept to "quad-di": data-in and data-out plus metadata-in and metadata-out.

Synchronous SCSI bidi commands have been available in the bsg driver for more than 12 years via ioctl(<bsg_dev_fd>, SG_IO) with the sg v4 interface (i.e. struct sg_io_v4) and are now available with the sg v4 driver where <bsg_dev_fd> is replaced by <sg_dev_fd>. Asynchronous SCSI bidi commands were available for the same period but were withdrawn around Linux kernel 4.15 due to problems with the bsg driver. Those asynchronous commands were submitted with the Unix write(2) call and the response was received with the Unix read(2) call. In the sg v4 driver the submitted and received object structure remains the same but the Unix write(2) and read(2) system calls can no longer be used. Instead two new ioctl(2)s, SG_IOSUBMIT and SG_IORECEIVE, replace write(2) and read(2) respectively. The functionality is almost identical; read on for details.

In the sg driver the direct IO flag has the effect of letting the block layer manage the data buffers associated with a command. The effect of indirect IO in the sg driver is to let the sg driver manage the data buffers. Indirect IO is the default for the sg driver, with the other options being mmap IO (memory mapped IO) and direct IO. Indirect IO is the most flexible with the sg driver: it can be used by both uni-directional and bidi commands and has no alignment requirements on the user space buffers. Request sharing, discussed above, cannot be used with direct IO (because the sg driver needs control of the data buffers to implement the share) while mmap IO is not implemented for bidi commands. Also a user space scatter gather list cannot be used for either the data-out or data-in transfer associated with a bidi command.

Other than the exclusions in the previous paragraph, all other capabilities of the sg driver are available to bidi commands. The completion is sent when the second transfer (usually a data-in transfer) has completed. pack_id and/or tags can be used as discussed in the previous section. Signal on completion, polling for completion and multi-threading should also work on bidi commands without issues.
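
For illustration, a synchronous bidi pass-through using indirect IO might look like the sketch below; the cdb contents are up to the caller and are not shown, and the example assumes the proposed sg v4 driver and its header are in place.

  /* Sketch: a synchronous bi-directional command through ioctl(SG_IO) using
     the v4 interface and indirect IO. The cdb is supplied by the caller. */
  #include <string.h>
  #include <sys/ioctl.h>
  #include <scsi/sg.h>

  int bidi_cmd(int sg_fd, unsigned char *cdb, int cdb_len,
               unsigned char *dout, int dout_len,
               unsigned char *din, int din_len)
  {
      struct sg_io_v4 hdr;
      unsigned char sense[64];

      memset(&hdr, 0, sizeof(hdr));
      hdr.guard = 'Q';
      hdr.request_len = cdb_len;
      hdr.request = (unsigned long)cdb;
      hdr.dout_xfer_len = dout_len;             /* data-out transfer */
      hdr.dout_xferp = (unsigned long)dout;
      hdr.din_xfer_len = din_len;               /* data-in transfer */
      hdr.din_xferp = (unsigned long)din;
      hdr.max_response_len = sizeof(sense);     /* sense buffer */
      hdr.response = (unsigned long)sense;
      hdr.timeout = 20000;                      /* milliseconds */

      if (ioctl(sg_fd, SG_IO, &hdr) < 0)
          return -1;        /* problem preparing or submitting; check errno */
      if (hdr.driver_status || hdr.transport_status || hdr.device_status)
          return -2;        /* command level problem; sense data may be present */
      return 0;
  }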

12 SG interface support changes

In the following table, a comparison is made between the supported interfaces of the sg driver found in lk 4.20 (V3.5.36) and the proposed v4 sg driver. The movement of the main header file from the include/scsi directory to include/uapi/scsi/sg.h should not impact user space programs since modern Linux distributions should check both and the stub header now in include/scsi/sg.h includes the other one. There is a chance that the GNU libc maintainers don't pick up this change/addition, but if so the author would expect that to be a transient problem. The sg3_utils/testing directory in the sg3_utils package gets around this problem with a local copy of the "real" new sg header in a file named uapi_sg.h .


 Table 1. sg interfaces supported by various sg drivers

sg driver V3.5.36 (lk 2.6, 3, 4 and 5.0); interface header: include/scsi/sg.h
 - v1+v2 interfaces, Async (struct sg_header): write(2)+read(2)
 - v3 interface, Async (struct sg_io_hdr): write(2)+read(2)
 - v3 interface, Sync (struct sg_io_hdr): ioctl(SG_IO)
 - v4 interface, Async (struct sg_io_v4, bsg.h): not available ^^^
 - v4 interface, Sync (struct sg_io_v4, bsg.h): not available ***

sg driver V4.0.x (lk ?); interface header: include/uapi/scsi/sg.h (plus include/uapi/linux/bsg.h for the v4 interface)
 - v1+v2 interfaces, Async (struct sg_header): write(2)+read(2) ****
 - v3 interface, Async (struct sg_io_hdr): ioctl(SG_IOSUBMIT)+ioctl(SG_IORECEIVE) or write(2)+read(2)
 - v3 interface, Sync (struct sg_io_hdr): ioctl(SG_IO)
 - v4 interface, Async (struct sg_io_v4, bsg.h): ioctl(SG_IOSUBMIT)+ioctl(SG_IORECEIVE)
 - v4 interface, Sync (struct sg_io_v4, bsg.h): ioctl(SG_IO)


*** available via the bsg driver; ^^^ removed from the bsg driver in lk 4.15; **** the plan is to deprecate the write(2)/read(2) based interfaces which would leave v1+v2 interfaces unsupported.

Note that there is no v1+v2 sync interface. Rather than completely drop the write(2)+read(2) interface, it could be kept alive for the v1+v2 interfaces only. Applications based on the v1+v2 interfaces would have been written around 20 years ago and would need a low level re-write to use the v3 or v4 async interfaces. So what might be dropped is the ability of the v3 interface to use write(2)+read(2), as the only code change required should be to change the write(2) to an ioctl(SG_IOSUBMIT) and the read(2) to an ioctl(SG_IORECEIVE).
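
In code, that conversion for a v3 application is small. The sketch below (hypothetical helper names, error handling reduced to a pass/fail result) shows the old write(2)/read(2) pairing next to the ioctl(2) based replacement; note the allocation size caveat for the v3 header discussed after the ioctl(2) table in section 13.

  /* Sketch: moving a v3 asynchronous caller from write(2)/read(2) to the new
     ioctl(2)s. 'hp' points to an already prepared struct sg_io_hdr. */
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <scsi/sg.h>          /* proposed header: SG_IOSUBMIT, SG_IORECEIVE */

  /* old style, to be deprecated */
  int submit_old(int sg_fd, struct sg_io_hdr *hp)
  {
      return (write(sg_fd, hp, sizeof(*hp)) < 0) ? -1 : 0;
  }
  int receive_old(int sg_fd, struct sg_io_hdr *hp)
  {
      return (read(sg_fd, hp, sizeof(*hp)) < 0) ? -1 : 0;
  }

  /* new style: same struct sg_io_hdr object, different system calls */
  int submit_new(int sg_fd, struct sg_io_hdr *hp)
  {
      return (ioctl(sg_fd, SG_IOSUBMIT, hp) < 0) ? -1 : 0;
  }
  int receive_new(int sg_fd, struct sg_io_hdr *hp)
  {
      return (ioctl(sg_fd, SG_IORECEIVE, hp) < 0) ? -1 : 0;
  }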

13 IOCTLs

Traditionally character device drivers in Unix have had an open(2), close(2), read(2), write(2), ioctl(2) interface to the user space. As well as those system calls this driver supports mmap(2), poll(2) and fasync(). The fasync() driver call is related to the fcntl(2) system call when the file descriptor flags are being changed to add O_ASYNC (e.g. fcntl(sg_fd, F_SETFL, flags | O_ASYNC) ).
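
For reference, the fcntl(2) sequence that exercises the driver's fasync() support typically looks like the following sketch; the optional F_SETSIG step selects a real time signal instead of the default SIGIO/SIGPOLL. These are standard Linux fcntl(2) operations, not additions made by this driver.

  /* Sketch: asking for SIGIO/SIGPOLL delivery on an sg file descriptor, with
     an optional real time signal chosen via the Linux specific F_SETSIG. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <signal.h>
  #include <unistd.h>

  int enable_sigio(int sg_fd, int use_rt_signal)
  {
      int flags = fcntl(sg_fd, F_GETFL);

      if (flags < 0)
          return -1;
      if (fcntl(sg_fd, F_SETOWN, getpid()) < 0)         /* who gets the signal */
          return -1;
      if (use_rt_signal && fcntl(sg_fd, F_SETSIG, SIGRTMIN + 1) < 0)
          return -1;
      return fcntl(sg_fd, F_SETFL, flags | O_ASYNC);    /* triggers fasync() */
  }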

It may help in understanding this driver to add a little history. This driver was present in Linux kernel 1.0.0 released in 1992. It supported just two ioctl(2)s at the time, SG_SET_TIMEOUT and SG_GET_TIMEOUT, plus a bunch of "pass-through" ioctl(2)s starting with "SCSI_IOCTL_" that were common to other ULDs (e.g. sd and st) and implemented by the Linux SCSI mid-level. The only method of sending a SCSI command through this driver was with the async write(2) and read(2) system calls (neglecting the pass-through ioctl(2) SCSI_IOCTL_SEND_COMMAND). Over time there has been a transfer of functionality from the write(2) and read(2) system calls to various ioctl(2)s which are listed below. Using the write(2) and read(2) system calls in the way that this driver does is frowned upon by the Linux kernel architects, as is adding new ioctl(2)s! Only 4 new ioctl(2)s have been added in the sg v4 driver, as noted in the status column of the table below. Two of those ioctl(2)s were proposed in this post by a Linux architect (L. Torvalds). However a lot of extra information needs to be exchanged between the user space and the driver to support the new functionality added in v4 of this driver. That is nearly all done via one new omnibus ioctl(2): SG_SET_GET_EXTENDED, using a 96 byte structure and the flags listed in the second table below.

Another historical note: the v1 SCSI pass-through interface was based on this structure in Linux kernel 1.0.0:

struct sg_header
 {
  int pack_len;    /* length of incoming packet <4096 (including header) */
  int reply_len;   /* maximum length <4096 of expected reply */
  int pack_id;     /* id number of packet */
  int result;      /* 0==ok, otherwise refer to errno codes */
  /* command follows then data for command */
 };

Only the pack_id field is found in all versions of the sg driver interface and its semantics remain the same.

The following table lists the ioctl(2)s that the sg v4 driver processes. They are in alphabetical order of the name of the second ioctl(2) argument. In most cases the scope of the action of the ioctl(2) is that of the file descriptor given as the first argument, referred to below as the current file descriptor. If the scope is other than the current file descriptor, that is noted. Note that there is a "fall-through" in the last row of this table, so any ioctl(2)s not processed by this driver will be passed to the SCSI mid-level and, if it doesn't process them, thence onto the LLD (SCSI low level driver) that owns the "host" that the file descriptor's device is connected to. If no driver processes an ioctl(2) then it should return -1 with an errno of ENOTTY (according to POSIX) but sometimes other error codes are given, depending on the LLD.

Each entry in the table below gives: the ioctl(2) name (i.e. the second argument to the ioctl(2) call) [with its hex value]; its status and, if other than the current file descriptor, its scope; then notes. Output is via the pointer given as the third argument to the ioctl(2) unless noted otherwise.

BLKSECTGET [0x1267]

active

scope: host (HBA)

this ioctl value replicates what a block layer device file (e.g. /dev/sda) will do with the same value. It calls the queue_max_sectors() helper on the owning device's command queue. The resulting number is multiplied by 512 to get a count in bytes which is output where the third argument points, assumed to be a pointer to int (so a maximum of about 2 GB). It represents the maximum data size of a single request that the block layer will accept.

BLKTRACESETUP [0xc0481273]

active

scope: device

third argument of ioctl(2) is pointer to a struct blk_user_trace_setup object. Needs a kernel with CONFIG_BLK_DEV_IO_TRACE=y . This ioctl(2) and its siblings are passed through to the block layer which implements them: a pass-through inside a pass-through

BLKTRACESTART [0x1274]

active

scope: device

ignores third argument of ioctl(2). See blktrace and blkparse utilities in the blktrace package.

BLKTRACESTOP [0x1275]

active

scope: device

ignores third argument of ioctl(2). Part of blktrace support.

BLKTRACETEARDOWN [0x1276]

active

scope: device

ignores third argument of ioctl(2). Part of blktrace support.

SCSI_IOCTL_GET_BUS_NUMBER

active, deprecated

scope: host

implemented by the SCSI mid-level. Assumes the third argument is a pointer to int (32 bit) and places a field called 'host_no' in it. host_no is an index of SCSI HBAs (host bus adapters) in the system. In this case it will be the host number that the SCSI device is connected to. That SCSI device has been open(2)-ed to yield the file descriptor that this ioctl(2) uses. In modern Linux usage, this information is better obtained from sysfs. Alternatively ioctl(SG_GET_SCSI_ID) can be used (see below).

SCSI_IOCTL_GET_IDLUN

active, deprecated

scope: device

implemented by the SCSI mid-level. Assumes the third argument is a pointer to int (32 bit) and places a packed integer (with 4 components) in it. The lower 8 bits are a target device number, the next 8 bits are the LUN, the next 8 bits are the channel number, and the top 8 bits are the host_no mentioned in the previous item. There are many things wrong with this from a modern SCSI perspective. In modern Linux usage, this information is better obtained from sysfs.

SCSI_IOCTL_PROBE_HOST

active, deprecated

scope: host

implemented by the SCSI mid-level. Yields an identifying string associated with the host. Assumes the third argument is a pointer to a byte array whose length is placed in a (32 bit) int in the first 4 bytes. That length will be overwritten by the ASCII byte array output. This information can also be obtained from sysfs.

SCSI_IOCTL_SEND_COMMAND [0x1]

active, deprecated

this is the SCSI mid-level pass-through which is very old (lk 1.0, sg v1 interface vintage) and even worse than that interface. Please do not use.

SG_EMULATED_HOST [0x2203]

seems to be "dead"

originally indicated a host that emulated SCSI (e.g. ATAPI) but libata does not seem to set this value in the host template provided by each LLD.

SG_GET_ACCESS_COUNT [0x2289]

not supported

returns 1 [unless the owning sg device is missing in which case 0 is returned, very unlikely]

SG_GET_COMMAND_Q [0x2270]

active

see the SG_SET_COMMAND_Q notes below. Yields the current state of the COMMAND_Q flag held by this file descriptor.

SG_GET_KEEP_ORPHAN [0x2288]

active

when a synchronous ioctl(SG_IO) is interrupted (e.g. by a signal from another process) the default action (depending on the signal) may be to terminate the ioctl(2) with an errno of EINTR. The driver terms such an inflight command/request an "orphan". The default action is to "throw away" the response from the device and clean up the request's resources. This loses information such as whether the command succeeded. This ioctl returns 0 (the default) or 1 depending on whether a request belonging to this file descriptor will throw away (when 0) or keep (when 1) the response to an interrupted request. Note that closing a sg file descriptor will clean up any outstanding request resources this file descriptor is using at the time of the close(2) [in reality that takes place a little later (when the last response "lands") because nothing is permitted to suspend a close(2)].

SG_GET_LOW_DMA [0x227a]

active, deprecated

scope: host

Yields the host's unchecked_isa_dma flag (0 or 1) via the third argument. The 'host' is typically the host bus adapter (HBA) that this sg device (the parent of the current file descriptor) is connected to.

SG_GET_NUM_WAITING [0x227d]

active

Number of "inflight" commands that have "landed" but have not been read. Only applies to the sg device file descriptor given to the ioctl(2) and synchronous commands (e.g. ioctl(SG_IO) being processed on another thread) are not counted. The "read" is via ioctl(SG_IORECEIVE) or the read(2) system call.

SG_GET_PACK_ID [0x227c]

active

the third argument is expected to be a pointer to int. By default it will set that int to the pack_id of the first (oldest) command that has completed internally but still awaits ioctl(SG_IORECEIVE) or read(2) to finish. If no requests are waiting, -1 (i.e. the wildcard value) is placed in that int. This ioctl(2) yields the pack_id by default, unless the SG_CTL_FLAGM_TAG_FOR_PACK_ID boolean has been set on this file descriptor.

SG_GET_REQUEST_TABLE [0x2286]

active

The third argument is assumed to point to an array of 16 struct sg_req_info objects (that struct is defined in include/uapi/scsi/sg.h). First the array is zeroed, making all req_state fields zero which corresponds to the INACTIVE state. Then any requests that are active have their fields placed in sg_req_info elements. Then, if there is still room, requests from the free list are placed in sg_req_info elements. This action stops when either 16 elements are filled or there are no more requests associated with the current file descriptor to transfer.

SG_GET_RESERVED_SIZE [0x2272]

active

this is the size, in bytes, that the reserve request associated with this file descriptor currently has. The third argument is assumed to be a pointer to an int that receives this value.

SG_GET_SCSI_ID [0x2276]

active, enhanced in v4

the third argument should be a pointer to an object of type struct sg_scsi_id . This ioctl(2) fills the fields in that structure. The extension in v4 is to use two 'unused' 32 bit integers at the end of that struct as an array of 8 bytes to which the SCSI LUN is written. This is the preferred LUN format from t10.org . This extension does not change the size of struct sg_scsi_id . For those looking for the corresponding HCTL tuple for the device this file descriptor belongs to, this ioctl(2) is one way: H --> sg_scsi_id::host_no; C --> sg_scsi_id::channel; T --> sg_scsi_id::scsi_id; and L --> sg_scsi_id::scsi_lun[8] . Another way is to use 'lsscsi -g' which datamines in sysfs, or the user can write their own sysfs datamining code.

SG_GET_SG_TABLESIZE [0x227F]

active

yields the maximum number of scatter gather elements that the associated host (HBA) supports. That is the host through which the sg device is attached, that "owns" the given file descriptor. The third argument is assumed to point to an int.

SG_GET_TIMEOUT [0x2201]

active, deprecated; timeout in seconds as return value

the v1 and v2 interfaces did not contain a command timeout field so this was a substitute. Both the v3 and v4 interfaces have a command timeout field which is better than using this ioctl(2).

SG_GET_TRANSFORM [0x2205]

seems to be "dead"

this driver passes this ioctl value through to the SCSI mid-level which seems to do nothing with it. Testing reveals that it yields an errno of EINVAL

SG_GET_VERSION_NUM [0x2282]

active

uses the third argument as a pointer to write out a 32 bit integer whose digits when seen in decimal are in the form [x]xyyzz . [x] means blank (space) if zero. This is usually expressed as an ASCII string as '[x]x.[y]y.zz' .

SG_IO [0x2285]

active, added functionality in v4 driver

both v3 and v4 interface synchronous commands can be issued with this ioctl(2). It only returns -1 and sets errno when the preparation for submitting the command/request encounters a problem. Thereafter any problems encountered set the output fields in the v3 or v4 interface object, so both the ioctl(2) result and those output fields should be checked.

SG_IOABORT [0x40a02243]

***

new in v4

only the v4 interface can use this ioctl(2) to abort a command in progress, using either the pack_id (in the request_extra field) or the tag. The pack_id is used by default, unless the SG_CTL_FLAGM_TAG_FOR_PACK_ID boolean has been set on this file descriptor. If no corresponding request is found (capable of being aborted) then errno is set to ENODATA. The completion of an aborted command will have DRIVER_SOFT set in the driver_status field.

SG_IORECEIVE [0xc0a02242]

***

new in v4

both the v3 and v4 interfaces can use this ioctl(2) to complete a command/request started with asynchronous ioctl(SG_IOSUBMIT) on the same file descriptor. If multiple requests are outstanding on the same file descriptor, then ioctl(SG_SET_FORCE_PACK_ID) can be used with the ::pack_id field in the v3 and v4 interfaces (or the ::request_tag in the v4 interface) to choose one response matching a given pack_id .

SG_IOSUBMIT [0xc0a02241]

***

new in v4

both the v3 and v4 interfaces can use this ioctl(2) to issue (submit) new commands. This ioctl(2) will return relatively quickly, potentially well before the command has completed. Each call to ioctl(SG_IOSUBMIT) needs to be paired with a call to ioctl(SG_IORECEIVE) using the same (sg) file descriptor. This call is part of the v3 and v4 asynchronous interface.

SG_NEXT_CMD_LEN [0x2283]

active, deprecated

only applies to the v2 interface which does not include a command (cdb) length field. That interface assumes the driver can work out what the cdb length is. While that works for standard cdbs (from T10) it may not work for vendor specific commands, hence this ioctl(2).

SG_SET_COMMAND_Q [0x2271]

active

in the v1 and v2 drivers the default was 0 (so no command queuing on this file descriptor). In the v3 driver it was 0 until a v3 interface structure was presented, in which case it was turned on (1) for this file descriptor. In the v4 driver it is on (1) by default. 0 --> only allow one command per fd; 1 --> allow command queuing. When command queuing is off, if a second command is presented before the previous one has finished, an errno of EDOM results.

SG_SET_DEBUG [0x227e]

active, scope=device

0 --> turn off (def), 1 --> turn on . Currently the only impact of setting this is to print out sense data (to the log) of any request on all fds that belong to the current device. Typically only requests that yield a SCSI status of "Check condition" provide sense data.

SG_SET_GET_EXTENDED [0xc0602251]

new in v4

takes pointer to 96 byte structure, can set and get 32 bit values, can set and get boolean values. Each ioctl(2) can perform more than one action. See next table.

SG_SET_KEEP_ORPHAN [0x2287]

active

how to treat a SCSI response when an ioctl(SG_IO), read(2) or ioctl(SG_IORECEIVE) that is waiting is interrupted. 0 --> drop it (def); 1 --> hold it so the response can be fetched with either another read(2) or ioctl(SG_IORECEIVE) call

SG_SET_FORCE_LOW_DMA [0x2279]

does nothing

users of modern Linux systems should not concern themselves with "low DMA", this comes from the ISA era. 0 --> use adapter setting (def); 1 --> force "low dma". However this ioctl(2) has since been neutered and does nothing.

SG_SET_FORCE_PACK_ID [0x227b]

active

when receiving an async response only accept a response with a matching pack_id (or tag). A pack_id (or tag) of -1 is treated as a wildcard. In the v4 interface the request_extra field is used for the pack_id. Async receiving is done with ioctl(SG_IORECEIVE) or read(2). 0 --> take the oldest available response (def); 1 --> take the matching response. The default is to use the pack_id unless SG_SET_GET_EXTENDED{SG_CTL_FLAGM_TAG_FOR_PACK_ID} is given.

SG_SET_RESERVED_SIZE [0x2275]

active

sets or resets the size of the reserve request data buffer size of this file descriptor to the given value (in bytes). If this file descriptor is in use (i.e. sending a SCSI command) then this ioctl(2) will fail with an errno of EBUSY.

SG_SET_TIMEOUT [0x2201]

active, deprecated

command timeout in seconds (pointed to by third argument). See "_GET_" notes.

SG_SET_TRANSFORM [0x2204]

seems to be "dead"

this driver passes this ioctl value through to the SCSI mid-level which seems to do nothing with it. Testing reveals that it yields an errno of EINVAL

<< any others>>

??

sent through to the SCSI mid-level (and then to the LLD associated with the device the fd belongs to) for further processing.



*** These ioctls have the size of struct sg_io_v4 (i.e. the v4 interface structure) encoded into them. It is the 'a0' byte when the ioctl(2) value is viewed as hexadecimal and for 64 bit Linux it is 160 bytes. The Linux kernel might check that that length is readable (or writable) before passing control to the sg driver. So if the v3 interface (which is 136 bytes long) is being used, it is best to embed it in a 160 byte allocation.
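
One way to honour that advice is to overlay the v3 header on a v4 sized allocation, for example with a union, as in this sketch (assuming the proposed scsi/sg.h plus linux/bsg.h for struct sg_io_v4; the helper name is hypothetical):

  /* Sketch: submitting a v3 struct sg_io_hdr (136 bytes on 64 bit Linux)
     through the new ioctls while keeping the passed object at least
     sizeof(struct sg_io_v4) (160) bytes long. */
  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/bsg.h>        /* struct sg_io_v4 */
  #include <scsi/sg.h>          /* proposed header: SG_IOSUBMIT */

  union sg_any_hdr {
      struct sg_io_hdr v3;
      struct sg_io_v4 v4;       /* forces a 160 byte object */
  };

  int submit_v3_async(int sg_fd, const struct sg_io_hdr *hp)
  {
      union sg_any_hdr u;

      memset(&u, 0, sizeof(u));
      u.v3 = *hp;
      return (ioctl(sg_fd, SG_IOSUBMIT, &u) < 0) ? -1 : 0;
  }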

The following table only applies to ioctl(SG_SET_GET_EXTENDED). As its name suggests a single call to this ioctl(2) can set and/or get one or more parameters. The ioctl(SG_SET_GET_EXTENDED) takes a pointer to an object of 'struct sg_extended_info' type. That structure is 96 bytes long (a number which is hard coded into part of SG_SET_GET_EXTENDED's numerical value) and is left purposely half full for future extensions. The object manipulates 32 bit integers and boolean flags as dictated by its *_mask fields. Currently it has 5 integer carrying fields and one field holding 11 boolean flags (::ctl_flags). The C++ notation of double colon is used with the structure name to the left and the field name within that structure to the right (e.g. sg_extended_info::share_fd). Sometimes the left hand structure name is omitted, other times only the field name is given (e.g. ctl_flags).

The following table is also in alphabetical order, which is a little unfortunate as the SG_SEIM_* rows in the middle are the most important ones.

Each entry in the table below gives: the SG_SET_GET_EXTENDED sub name [with its mask value]; the type (the sg_extended_info field(s) it acts on, together with an access annotation such as [ro] 'read-only', [raw] 'read after write' or [rbw] 'read before write'); its scope if other than the file descriptor; then notes. In the notes, "fd" means the file descriptor given as the first argument to ioctl(2).

SG_CTL_FLAGM_ALL_BITS [0x7ff *]

* this is a mask, not an ioctl(2) second argument

ctl_flags

This define in <scsi/sg.h> should agree with the value derived from SG_SEIM_READ_VAL{SG_SEIRV_BOOL_MASK}. If not, use the latter. It is a bit mask for sg_extended_info::ctl_flags_wr_mask and ::ctl_flags_rd_mask

SG_CTL_FLAGM_CHECK_FOR_MORE [0x400]

ctl_flags [raw] 'read after write'

when this flag is set, subsequent async ioctl(SG_IORECEIVE) and read(2) calls that are successful (i.e. find a completed (matching) request/command) will do an additional check to see if another completed request/command is awaiting one of these system calls. If so SG_INFO_ANOTHER_WAITING is OR-ed into the v3 and v4 interfaces' ::info field. This flag is clear (0) by default as the extra check takes a little time.

SG_CTL_FLAGM_IS_MASTER [0x40]

ctl_flags [ro] 'read-only'

set implies this fd is part of a file share and this is the master side

SG_CTL_FLAGM_IS_SHARE [0x20]

ctl_flags [ro]

set implies this fd is part of a file share.

SG_CTL_FLAGM_MASTER_ERR [0x200]

ctl_flags [ro]

set implies the master's request has completed with a non-zero SCSI status or other driver error. In this case the shared request state is terminated (i.e. the slave side will not be able to issue an associated slave request). This may be used on either the master's or the slave's fd

SG_CTL_FLAGM_MASTER_FINI [0x100]

ctl_flags [ro]

set implies the master's request has completed and is waiting for the slave request to start. This may be used either on the master's or slave's fd

SG_CTL_FLAGM_ORPHANS [0x8]

ctl_flags [ro]

set implies there are one or more orphaned commands/requests associated with this fd.

SG_CTL_FLAGM_OTHER_OPENS [0x4]

ctl_flags [ro]

set implies there are other sg driver open(2)s active on this sg device.

SG_CTL_FLAGM_Q_TAIL [0x10]

ctl_flags [raw]

when written, set causes the following commands/requests on this fd to be queued to the block layer at the tail of its queue; clear causes them to be queued at the head (the default). Each v3 and v4 command can use the SG_FLAG_Q_AT_TAIL or SG_FLAG_Q_AT_HEAD flag to override this setting.

SG_CTL_FLAGM_TAG_FOR_PACK_ID [0x2]

ctl_flags [raw]

when written, set causes the following commands/requests on this fd to use the tag field rather than pack_id (or sg_io_v4::request_extra)

SG_CTL_FLAGM_TIME_IN_NS [0x1]

ctl_flags [raw]

when written, set causes duration calculations for the following commands/requests on this fd to be done in nanoseconds; clear causes duration calculations to be done in milliseconds.

SG_CTL_FLAGM_UNSHARE [0x80]

ctl_flags [w, rd-->0]

this will break the share relationship between a master fd and a slave fd. It can be sent to either fd. If a shared command/request is active using either fd then this ioctl(2) will fail with an errno of EBUSY. If no share relationship exists for the given fd this ioctl(2) will return 0 and do nothing.

SG_SEIM_ALL_BITS [0x1ff *]

* this is a mask, not an ioctl(2) second argument

sg_extended_info::sei_wr_mask and

sg_extended_info::sei_rd_mask

This define in <scsi/sg.h> should agree with the value derived from SG_SEIM_READ_VAL{SG_SEIRV_INT_MASK}. If not, use the latter. It is a bit mask for sg_extended_info::sei_wr_mask and ::sei_rd_mask

SG_SEIM_CHG_SHARE_FD [0x80]

sg_extended_info::share_fd [rbw]

'read before write'

when written, this is only valid if the fd is the master side of a share. If so, ::share_fd replaces the prior slave fd (which is the value read back) so that ::share_fd becomes the new slave side of the fd share.

SG_SEIM_CTL_FLAGS [0x4]

sg_extended_info::ctl_flags, ::ctl_flags_wr_mask and ::ctl_flags_rd_mask

three fields in a sg_extended_info object are associated with this variant of the ioctl(2): a value field, a write mask and a read mask. The mask values are the SG_CTL_FLAGM_* values shown in the upper section of this table.

SG_SEIM_MINOR_INDEX [0x8]

sg_extended_info::minor_index [ro]

when read places the minor number of the sg device that this fd is associated with in ::minor_index . For example after open(2)-ing "/dev/sg3" that fd should yield 3 in ::minor_index .

SG_SEIM_READ_VAL [0x10]

sg_extended_info::read_value

when a known value (see SG_SEIRV_* entries below in this table) is written to ::read_value then after this ioctl(2) the corresponding value will be in that field.

SG_SEIM_RESERVED_SIZE [0x1]

sg_extended_info::reserved_sz [raw]

when written, this fd's reserve request's data buffer will be resized to ::reserved_sz . The given value may be trimmed down by system limits. When read, the actual size of this fd's (resized) data buffer will be placed in ::reserved_sz when this ioctl(2) completes. So when both written and read, this ioctl(2) is very similar to ioctl(SG_SET_RESERVED_SIZE) combined with ioctl(SG_GET_RESERVED_SIZE) .

SG_SEIM_SGAT_ELEM_SZ [0x40]

sg_extended_info::sgat_elem_sz [rbw]

when the driver builds a scatter gather list for a request's data buffer a fixed element size is used which is a power of 2 and greater than or equal to the machine's page size (often 4 KB). The default size is currently 32 KB (2**15). When written, ::sgat_elem_sz will replace the default element size. When read, the prior element size is placed in ::sgat_elem_sz . Only affects future requests on this fd.

SG_SEIM_SHARE_FD [0x20]

sg_extended_info::share_fd [rbw]

when written, a shared fd relationship is set up by this ioctl(2). The fd that is the first argument of the ioctl(2) should be the future slave (i.e. the WRITE side of a copy) and ::share_fd identifies the future master. Neither fd can already be part of a share. When read (read before write), if successful ::share_fd should yield 0xffffffff which indicates (internally) both fds were not previously part of a share.

When read but not written, ::share_fd will yield: 0xffffffff (-1) if the first argument is not part of a share; 0xfffffffe (-2) if the first argument is the master side of a share; or the master's fd if the first argument is the slave side of a share.

SG_SEIM_TOT_FD_THRESH [0x2]

sg_extended_info::tot_fd_thresh [raw]

By default, a limit on the total size of all data buffers that can be active on a fd is set at 16 MB. A request that tries to exceed this will be rejected with an errno of E2BIG. The default can be changed by writing to ::tot_fd_thresh . A value of 0 is taken as unlimited.

SG_SEIRV_BOOL_MASK [0x1]

-->sg_extended_info::read_value

with ::read_value set to SG_SEIRV_BOOL_MASK, after ioctl(SG_SET_GET_EXTENDED{SG_SEIM_READ_VAL}) ::read_value holds a 32 bit mask of the bit positions that are used in sg_extended_info::ctl_flags (and ::ctl_flags_wr_mask and ::ctl_flags_rd_mask). That value is currently 0x7ff .

SG_SEIRV_DEV_FL_RQS [0x4]

-->sg_extended_info::read_value

scope=SCSI_device

sum of number of free list requests on each fd belonging to the SCSI device (e.g. a SSD) that owns the given fd.

SG_SEIRV_FL_RQS [0x3]

-->sg_extended_info::read_value

number of "inactive" request objects currently on this fd's free list. When there are no active command/requests, this value should be 1 and that entry should be this fd's reserved request (waiting for a user request to commence).

SG_SEIRV_INT_MASK [0x0]

-->sg_extended_info::read_value

after ioctl(2) ::read_value has a 32 bit mask of bit positions that are used in sg_extended_info::sei_wr_mask and sg_extended_info::sei_rd_mask . That value is currently 0xff .

SG_SEIRV_TRC_MAX_SZ [0x6]

-->sg_extended_info::read_value

after ioctl(2) ::read_value has a 32 bit integer which is the maximum number of bytes in the trace ring buffer. This is a driver compile time constant

SG_SEIRV_TRC_SZ [0x5]

-->sg_extended_info::read_value

after ioctl(2) ::read_value has a 32 bit integer which is the number of bytes currently in the trace ring buffer.

SG_SEIRV_VERS_NUM [0x2]

-->sg_extended_info::read_value

scope=driver

after ioctl(2) ::read_value holds a 32 bit integer whose digits when seen in decimal are in the form [x]xyyzz . [x] means blank (space) if zero. This is usually expressed as an ASCII string as '[x]x.[y]y.zz' .

Further notes: ctl_flags are manipulated by the sg_extended_info::ctl_flags, sg_extended_info::ctl_flags_rd_mask and sg_extended_info::ctl_flags_wr_mask fields

Note that for ioctl(SG_SET_GET_EXTENDED) to do anything either sg_extended_info::sei_wr_mask or sg_extended_info::sei_rd_mask must be non-zero. Likewise to access or change the boolean flags either sg_extended_info::ctl_flags_wr_mask or sg_extended_info::ctl_flags_rd_mask must be non-zero.
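
As a sketch of how those masks are used (assuming the proposed header supplying struct sg_extended_info plus the SG_SEIM_ and SG_SEIRV_ defines), the call below resizes this fd's reserve buffer and reads back the driver version number in one ioctl(2); the particular mix of fields is only an example.

  /* Sketch: one ioctl(SG_SET_GET_EXTENDED) call that resizes this fd's
     reserve buffer and reads back the driver version number. */
  #include <string.h>
  #include <sys/ioctl.h>
  #include <scsi/sg.h>          /* proposed header */

  int tweak_fd(int sg_fd, unsigned int new_reserve_sz, unsigned int *vers_nump)
  {
      struct sg_extended_info sei;

      memset(&sei, 0, sizeof(sei));
      /* write the new reserve size and the read_value selector ... */
      sei.sei_wr_mask = SG_SEIM_RESERVED_SIZE | SG_SEIM_READ_VAL;
      /* ... and read back the granted size and the selected value */
      sei.sei_rd_mask = SG_SEIM_RESERVED_SIZE | SG_SEIM_READ_VAL;
      sei.reserved_sz = new_reserve_sz;
      sei.read_value = SG_SEIRV_VERS_NUM;

      if (ioctl(sg_fd, SG_SET_GET_EXTENDED, &sei) < 0)
          return -1;
      if (vers_nump)
          *vers_nump = sei.read_value;  /* e.g. 40006 for version 4.0.06 */
      /* sei.reserved_sz now holds the size actually granted */
      return 0;
  }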

14 Downloads

This tarball sgv4_20190211 has two parts. One directory is named lk5.0 and targets lk 5.0-rc<n>; the other is named lk_le4.20 and targets lk 4.20 and earlier kernels. The difference is that in lk 5.0-rc1 a kernel wide patch by Linus Torvalds changed the number of parameters to the access_ok() function. Since the sg driver uses that call over 10 times, it broke a lot of patches, making it difficult to maintain a single set of patches, hence the split. Both of those directories have a sub-directory called sgv4_20190211 which contains a series of 19 patches. Both of those directories contain the 3 files that represent the sg v4 driver in the kernel: drivers/scsi/sg.c, include/scsi/sg.h and include/uapi/scsi/sg.h . The last file is new (i.e. not in the production sg driver). If those 3 files are copied into the corresponding locations in a kernel source tree then a subsequent kernel build will generate the sg v4 driver. It might be a good idea to take a copy of drivers/scsi/sg.c and include/scsi/sg.h before copying those files, to simplify reverting to the sg v3 driver currently in the kernel.

The patches are against Martin Petersen's 5.1/scsi_queue branch (the part under lk5.0) and his 4.21/scsi-queue branch (the part under lk_le4.20). They should apply against lk 4.18 and later (and perhaps earlier; to be tested). The recent patches on the sg driver that might interfere (or cause fuzz) are:

96d4f267e40f9 (Linus Torvalds 2019-01-03 18:57:57 -0800) access_ok() [3 --> 2 function arguments] appeared in lk 5.0-rc1

92bc5a24844ad (Jens Axboe 2018-10-24 13:52:28 -0600) remove double underscore version of blk_put_request(), appeared in lk 5.0-rc1

abaf75dd610cc (Jens Axboe 2018-10-16 08:38:47 -0600) blk_put_request(srp->rq) addition, first appeared in lk 4.20-rc1

The sg driver patch prior to that was 8e4a4189ce02f (Tony Battersby 2018-07-12), first appeared in v4.18-rc8

The sg3_utils package was originally written to test the sg v3 interface when it was introduced. So where better to put sg v4 test code? Since sg3_utils is well established, the author sees no benefit in introducing an sg4_utils in which less than an estimated 5% of the code would change; it is much easier to incorporate that change/addition in the existing package. The latest sg3_utils beta on the main page (revision 811, a beta of version 1.45, as this is written) contains utilities for testing the sg v4 interface. The underlying support library has been using the sg v4 header for many years as a common format. If the given device was a bsg device node then the sg v4 interface was used; otherwise (e.g. for sg and block devices) the sg v4 header was translated down to a v3 header and forwarded on. In the current beta, sg3_utils will use ioctl(SG_GET_VERSION_NUM) on sg devices and, if it finds a v4 driver, it will send a v4 header; otherwise it will do as it does now. [That v4 interface usage can be defeated by './configure --disable-linux-sgv4' .] The testing directory of that beta contains several utilities that are "v4" driver aware.

15 Other documents

The original sg driver documentation is here: SCSI-Generic-HOWTO and a more recent discussion of ioctl(SG_IO) is here: sg_io .

16 Conclusion

The sg v4 driver is designed to be backwardly compatible with the v3 driver. The simplest way for an application to find which driver it has is with ioctl(SG_GET_VERSION_NUM). Removing a restriction such as 16 outstanding commands per file descriptor can catch out programs that rely on hitting that limit. Adding a driver parameter to re-impose that limit, or any other differing behaviour, can be done if the need arises. The best way to test backward compatibility is to place this new driver "under" existing apps that use sg driver nodes and check the functionality.
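
Such a run time check might look like the following minimal sketch, which relies only on the long-standing ioctl(SG_GET_VERSION_NUM):

  /* Sketch: run time check for the sg v4 driver via SG_GET_VERSION_NUM. */
  #include <sys/ioctl.h>
  #include <scsi/sg.h>

  int is_sg_v4(int sg_fd)
  {
      int vers = 0;

      if (ioctl(sg_fd, SG_GET_VERSION_NUM, &vers) < 0)
          return -1;            /* probably not an sg device node */
      return (vers >= 40000);   /* 40000 corresponds to version 4.0.00 */
  }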


Douglas Gilbert

Last updated: 13th March 2018 15:00