The Linux SG driver version 4.0


1 Introduction

2 Changes to sg driver between version 3.5.36 and 4.0

3 Architecture of the sg driver

4 Sharing file descriptors

5 Async usage in v4

6 Request sharing

7 Sharing design considerations

8 pack_id or tag

9 Bi-directional command support

10 SG interface support changes

11 Downloads

12 Other documents

13 Conclusion


1 Introduction

The SCSI Generic (sg) driver in Linux has been present in the kernel since version 1.0 of the kernel in 1992. In the 26 years since then the driver has had 3 interfaces to the user space and now a fourth is going to be added. The first and second interfaces (v1 and v2) use the same header: 'struct sg_header', with only v2 now fully supported. The "v3" interface is based on 'struct sg_io_hdr'. Both these structures are defined in include/scsi/sg.h, the bulk of whose contents will move to include/uapi/scsi/sg.h as part of this upgrade. Prior to the changes now proposed, the "v4" interface is only implemented in the block layer's bsg driver ("block SCSI generic", which is around 15 years old). The bsg driver's user interface is found in include/uapi/linux/bsg.h . These changes propose adding support for the "v4" interface via the SG_IO ioctl(2) for synchronous use, and new SG_IOSUBMIT and SG_IORECEIVE ioctl(2)s for asynchronous use. The plan is to deprecate and finally remove (or severely restrict) the write(2)/read(2) based asynchronous interface used currently by the v1, v2 and v3 interfaces. The v3 asynchronous interface is also supported by the two new ioctl(2)s.

If the driver changes are accepted the driver version, which is visible via an ioctl(2), will be bumped from 3.5.36 (in lk 5.0) to 4.0.x . The opportunity is being taken to clean up the driver after 20 years of piecemeal patches. Those patches have left the driver with misleading variable names and nonsensical comments. Plus there are new kernel facilities that the driver can take advantage of. Also of note is that much of the low level code once in the sg driver (remnants remain) has been moved to the block layer and SCSI mid-level. This upgrade has been done as a two stage process: first clean the driver up, remove some restrictions and re-instate some features that had been accidentally lost. Three versions of a patchset were sent to the linux-scsi list in October 2018. That patchset took the sg driver to version 3.9.01 . Now the v4 interface is supported as described here, so the sg driver version number has been bumped to 4.0.04 .

Note that the Linux block layer implements the synchronous sg v3 interface via ioctl(SG_IO) on all block devices that use the SCSI subsystem, directly or via translation (e.g. SATA disks use libata which implements the SAT T10 standard). In pseudocode, an example like 'ioctl(open("/dev/sdc"), SG_IO, ptr_to_sg_io_hdr)' works as expected. That example is handled by the block layer rather than by the sg driver, so it is important that the sg driver's implementation of ioctl(SG_IO) remains consistent with the other implementations (mainly the one found in the block/scsi_ioctl.c kernel source file).
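Expressed as real C, that pseudocode might look like the following sketch (most error handling omitted; the device node, the 6 byte INQUIRY command and the buffer sizes are only illustrative):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <scsi/sg.h>

    int inquiry_via_block_sg_io(void)
    {
        unsigned char cdb[6] = {0x12, 0, 0, 0, 96, 0};  /* INQUIRY, 96 byte allocation length */
        unsigned char resp[96];
        unsigned char sense[32];
        struct sg_io_hdr io;
        int fd = open("/dev/sdc", O_RDWR);   /* any block device using the SCSI subsystem */

        if (fd < 0)
            return -1;
        memset(&io, 0, sizeof(io));
        io.interface_id = 'S';               /* marks this as the v3 interface */
        io.dxfer_direction = SG_DXFER_FROM_DEV;
        io.cmd_len = sizeof(cdb);
        io.cmdp = cdb;
        io.dxfer_len = sizeof(resp);
        io.dxferp = resp;
        io.mx_sb_len = sizeof(sense);
        io.sbp = sense;
        io.timeout = 20000;                  /* milliseconds */
        return ioctl(fd, SG_IO, &io);        /* returns when the INQUIRY has completed */
    }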

2 Changes to sg driver between version 3.5.36 and 4.0

A summary is given as bullet points:

There are still some things to do:

3 Architecture of the sg driver

Nothing much has changed in the overall architecture of the sg driver. It may help later explanations if a pictorial summary is given. See the following diagram:




The sg driver is shown as a laptop at the top of the object tree. The arrow ends of the solid lines show objects that are created automatically or by actions outside the user interface to the sg driver. So the disk-like objects created at the second level come from the device scanning logic undertaken by the SCSI mid-level driver in Linux. Note that there are SCSI devices other than disks, such as tape units and SCSI enclosures. Also note that not all storage devices in Linux use the SCSI subsystem; examples are NVMe SSDs and SD cards that are not attached via USB. The type of these SCSI device objects is sg_device (and in the driver code they appear as objects of C type 'struct sg_device'). Even though the sg driver's implementation is closely associated with the block subsystem, the sg driver's device nodes are character devices in Linux (e.g. /dev/sg1). The nodes are also known as character special devices.

At the third level are file descriptors which the user makes via the open(2) system call (e.g. 'fd = open("/dev/sg1", O_RDWR);') . Various other system calls such as close(2), write(2), read(2), ioctl(2) and mmap(2) can use that file descriptor made by open(2). The file descriptor will stay in existence until the process containing the code that opened it exits or the user closes it (e.g. 'close(fd);'). A dotted line is shown from the "owning" device to each file descriptor in order to indicate that it was created by direct user action on the sg interface. The type of file descriptor objects is sg_fd. BTW most system calls have "man pages" and the form open(2) indicates that there is a manpage for that system call and it is in section 2.
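As a small sketch (the helper name is illustrative), opening a sg device node and confirming that the sg driver really is behind it could look like this; ioctl(SG_GET_VERSION_NUM) yields 30536 for driver version 3.5.36 and 40000 or more for a v4 driver:

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <scsi/sg.h>

    int open_sg_node(const char *dev)        /* e.g. "/dev/sg1" */
    {
        int ver = 0;
        int fd = open(dev, O_RDWR);

        if (fd < 0)
            return -1;
        if (ioctl(fd, SG_GET_VERSION_NUM, &ver) < 0 || ver < 30000) {
            close(fd);                       /* probably not a sg device node */
            return -1;
        }
        return fd;
    }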

At the lowest level are the sg_request objects, each of which carries a user provided SCSI command to the target device which is its grandparent in the above diagram. These requests are then sent via the block layer and SCSI mid-level to a Low Level Driver (LLD) and then across the transport (with iSCSI that can be a long way) to the target device (e.g. an SSD). User data that moves in the same direction as the request is termed "data-out"; the SCSI WRITE command is an example. In all cases (unless there is a command timeout) a response traverses the same route as the request, but in the reverse direction. Optionally it may be accompanied by user data, which is termed "data-in"; the SCSI READ command is an example. Notice that a heavy (thicker) line is associated with the first request of each file descriptor; it points to a reserve request (in earlier sg documentation this was referred to as the "reserve buffer"). That reserve request is created after each file descriptor is created and before the user has a chance to send a SCSI command/request on that file descriptor. This reserve request was originally created to make sure CD writing programs didn't run out of kernel memory in the middle of a "burn". That is no longer a major concern but the reserve buffer has found other uses: for mmap-ed and direct IO. So when the mmap(2) system call is used on a sg device, it is the associated file descriptor's reserve buffer (i.e. the memory portion of the reserve request) that is being mapped into the user space. The type of the request object is sg_request.

The lifetime of sg_request objects is worth noting. When a sg_request object is active ("inflight" is the term used in the code) it has both an associated block layer request and a SCSI mid-level object. They all have similar roles and overlap somewhat. However once the response is received (and typically before the user has seen that response or any "data-in") the block layer request and the SCSI mid-level objects are freed up. The sg_request object lives on, along with the data carrying part of the block request called the bio, as that may be carrying "data-in" that has yet to be delivered to the user space. That is because the default user data handling (termed "indirect IO") is a two stage process. For data-in, the data is first DMA-ed from the target device into kernel memory, typically under the control of the LLD; the second stage is copying from that kernel memory to the user space, under the control of this driver. Even after the user has fetched the response and any data-in, the sg_request continues to live. [However once any data-in has been fetched the block request's bio is freed.] The sg_request object is then marked "inactive" and placed on a free list of requests, one of which is maintained for each file descriptor. So each sg file descriptor contains two request lists: one for commands that are active and a free list for inactive requests (there is an exception). The next time a user tries to send a SCSI command through that file descriptor, its free list will be checked to see if any inactive sg_request object has a large enough data buffer for the new request; if so it will be (re-)used for the new request. Only when the user calls close(2) on that file descriptor will all the requests on the free list be truly freed. Note that in Unix, and thus Linux, the OS guarantees that the equivalent of close(2) (called release() in the kernel and sg_release() in this driver) will be called for every file descriptor that the user has opened in a process, irrespective of what the code in that process does. This is important because processes can be shut down by signals from other processes or drivers, segmentation violations (i.e. bad code) or the kernel's OOM (out-of-memory) killer.

The above description is setting the stage for a newly added feature called "sharing" introduced in the sg v4 driver. It also uses the reserve request.

4 Sharing file descriptors

First a rationale. Copying data between storage devices is a relatively common operation. It can be both time and resource consuming. The best approach is to avoid copying altogether. Another approach is to defer copies (or part of them) until they are really necessary, which is the basis of COW (i.e. copy on write). Then there are offloaded copies: for example, where the source and destination are disks in the same array, a "third party copy" program (e.g. based on SCSI EXTENDED COPY and its related commands) can tell the array to do the copy itself and report whether or not it finished successfully. However often copies are unavoidable.

If the dd program is considered, copying part of one normal block storage device to another storage device involves a surprising number of copies. Copies of large amounts of data are typically done in a staggered fashion to lessen the impact on other things the system may be doing. So typically 1 MByte (say) is read from the source device into a buffer, followed by a write of that buffer to the destination device; if no error occurs, repeat until finished. Copies between a target device and kernel memory are typically done by DMA (direct memory access) controlled by the LLDs owning the storage devices. So another copy is needed on each side to move the data between those kernel buffers and the user space. Moving data between a user space process and the kernel space has a little extra overhead to deal with situations like the process being killed while data is being copied to or from it. So a reasonable implementation of dd has three buffers (2 in the kernel space) and performs 2 DMAs plus 2 copies between the user space and the kernel space. As storage devices and transports get quicker, the time taken to do those copies may become more significant compared to the device access time.

Another aspect of the sharing being proposed is security. Often a user has the right to copy data but not see it. This is usually accomplished by encrypting the data. Another approach might be to make sure the copy's data is kept in kernel buffers and thus hidden from the user who is copying it. While the v4 sg driver can do this, the sg driver is not written with a view to security, since it offers a pass-through interface which, by definition, is a method to circumvent an Operating System. Those building a highly secure computer system might consider removing the sg driver or restricting its access to highly privileged users.

Sharing is a new technique added to the sg v4 driver to speed copy operations. The user first sets up a sharing relationship between two sg file descriptors, one that will be used for doing SCSI READ commands (more generally any data-in SCSI command), and the other that will be used for doing SCSI WRITE commands using the data received by the previous READ. Any data-out command can be used so, for example, the SCSI WRITE command could be replaced by WRITE AND VERIFY or WRITE SCATTERED. The file descriptor that does the READ is called the master side by the driver and the file descriptor that does the WRITE is called the slave side. The following diagram shows how one share between two file descriptors is set up.




Here the master side is /dev/sg1 which has 4 open file descriptors (fd_s 1 through 4). The slave side is /dev/sg2 which has 3 open file descriptors (fd_s 5 through 7). The share shown is set up when the thread or process containing fd5 calls the "EXTENDED" ioctl on the fd5 file descriptor (i.e. the ioctl's first parameter) with a pointer to an integer containing fd1 as the ioctl's third parameter. The C code is a little more complicated than that.

How does the thread or process containing fd5 know about fd1? That is up to the design of the user space application. If they are both in the same thread then it should be obvious. If they are in different threads within the same process then it should be relatively simple to find out. The interesting case is when they are in different processes. A child process inherits all open file descriptors (including those to character special devices such as the sg driver's) from its parent in the Linux fork() system call. For processes that don't have a parent child relationship, UNIX domain sockets can be used to "send" an open file descriptor from one process to another. Note that in this case the file descriptor number might differ (e.g. because the receiving side is already using the same file descriptor number as the sender) but they will still logically refer to the same thing. Also the statement above about process termination leading to sg_release() being called for any sg file descriptors open(2)-ed in that process needs qualification: in this case it is the termination of the last process holding an open file descriptor that causes the driver's sg_release() to be called. In short, the last close(2) on a file descriptor causes sg_release() to be called.
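As an illustration of that socket technique (standard SCM_RIGHTS ancillary data; the function name is made up for this sketch), the sending side might look like:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send the open file descriptor 'fd' over a connected Unix domain socket 'sock'.
     * The receiving process gets a (possibly different) descriptor number that refers
     * to the same open file description, so it can be named in a sg fd share. */
    int send_fd(int sock, int fd)
    {
        struct msghdr msg;
        struct iovec iov;
        char dummy = 0;
        char cbuf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr *cm;

        memset(&msg, 0, sizeof(msg));
        memset(cbuf, 0, sizeof(cbuf));
        iov.iov_base = &dummy;               /* at least 1 byte of normal data must flow */
        iov.iov_len = 1;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cbuf;
        msg.msg_controllen = sizeof(cbuf);
        cm = CMSG_FIRSTHDR(&msg);
        cm->cmsg_level = SOL_SOCKET;
        cm->cmsg_type = SCM_RIGHTS;
        cm->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cm), &fd, sizeof(int));
        return (sendmsg(sock, &msg, 0) < 0) ? -1 : 0;
    }

The receiving side does the matching recvmsg(2) and extracts the new descriptor number from the SCM_RIGHTS control message.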

The sg driver's file descriptors can only be part of one share (pair). Given this restriction, in the above diagram, fd5 cannot also be in a share with fd4. fd6 may be in a share with fd7; that would imply that the share could be used for a copy from /dev/sg2 to /dev/sg2 . The master side of the share monopolizes that file descriptor's reserve request, hence there can only be one outstanding share request per pair of shared file descriptors. Given this restriction, one way to do a copy using queued commands is to use POSIX threads. As an example from the above diagram, if 3 copy worker threads were used then the first thread could utilize fd1 and fd5, the second thread could utilize fd3 and fd6 while the last thread could utilize fd4 and fd7. This is what the sgh_dd test utility does (see below).

After a share of two file descriptors is established, command requests can still be sent to both file descriptors in the normal fashion. Only when the new flag SGV4_FLAG_SHARE is given, or OR-ed in with other flags, is request sharing performed. See the '6 Request sharing' section below.

5 Async usage in v4

The asynchronous interface in the context of the sg driver means issuing a SCSI command in one operation, then at some later time a second operation retrieves the status of that SCSI command. Any data transfer associated with the SCSI command is guaranteed to have occurred before that second operation succeeds. The synchronous interface can be viewed as combining these two operations into a single system call (e.g. ioctl(SG_IO) ).

The asynchronous interface starts with a call to ioctl(SG_IOSUBMIT) which takes a pointer to the sg v3 or v4 interface object. This object includes the SCSI command with data transfer information for either data-in (from device) or data-out (to device). Depending on the storage device accessed (identified by the sg file descriptor given as the first argument to the ioctl() system call) the SCSI command will take milliseconds or microseconds to complete. Chances are the ioctl(SG_IOSUBMIT) will complete on a sub-microsecond timescale (on a modern processor) and that will be done before the SCSI command completes. If further processing depends on the result of that SCSI command then the program must wait until that SCSI command is complete. When that completion occurs, the data-out is guaranteed to be on the nominated storage device (or in its cache). And if a data-in transfer was specified, that data is guaranteed to be in the user space as directed. How does the program find out when that SCSI command has completed?

The exact timing of the data-out and data-in transfers can be thought of as a negotiation between the HBA (Host Bus Adapter controlled by the LLD) and the storage device. The essential point is that the data transfer and the completion are asynchronous to the program that requested the SCSI command. Since the completion is guaranteed to follow any associated data transfer then the completion event is what we will concentrate on. Detecting asynchronous events depends on Operating System features such as signals and polling. Polling is the simpler technique. However the simplest approach is to call the final step in the process which is ioctl(SG_IORECEIVE) as soon as possible. In the likely case that the SCSI command completion has not occurred, then the ioctl() can do one of two things: it can wait until the completion does occur or yield an "error" called EAGAIN. Similar to SCSI sense data, a UNIX errno doesn't always imply a hard error. So EAGAIN is not a hard error, but it tells the program that the operation didn't occur but may happen later, so try again, but preferably don't retry immediately. What determines whether the ioctl() waits or returns EAGAIN is the presence of the O_NONBLOCK flag on the file descriptor.

Two file descriptor flags are important to the asynchronous interface of the sg driver: O_NONBLOCK and O_ASYNC. The file descriptor flags are defined in such a way that they can be OR-ed together. The normal place to define flags is in the open(2) system call (its second argument) but they can be changed (and added to) later with the fcntl(2) system call. If O_NONBLOCK is used then it will typically be given in the open(2) call. The O_ASYNC flag is a bit more difficult to handle because it arms the SIGIO (also known as SIGPOLL) signal which, if it occurs before a program has set up a handler for it, will cause the program to exit. Actually Linux ignores O_ASYNC in the open(2) call (see 'man 2 open' in the BUGS section), so fcntl(2) is the only way to set it. Below is a simplified example of adding the O_ASYNC flag to a file descriptor (sg_fd) that is already open:
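(In this sketch the SIGIO handler name, sg_sigio_handler, is an assumption; the application would supply it. Installing the handler before arming avoids the exit-on-unhandled-SIGIO problem mentioned above.)

    #include <fcntl.h>
    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    extern void sg_sigio_handler(int sig);       /* defined elsewhere by the application */

    void arm_sigio(int sg_fd)
    {
        int flags;
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = sg_sigio_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGIO, &sa, NULL);             /* install handler before arming */

        fcntl(sg_fd, F_SETOWN, getpid());        /* direct SIGIO at this process */
        flags = fcntl(sg_fd, F_GETFL);
        fcntl(sg_fd, F_SETFL, flags | O_ASYNC);  /* open(2) ignores O_ASYNC, so use fcntl(2) */
    }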





It is possible to replace the classic Unix SIGIO signal with a POSIX real-time signal by making an additional call:
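(A sketch: F_SETSIG is Linux specific, and SIGRTMIN+1 is just an example choice.)

    #define _GNU_SOURCE                 /* needed for F_SETSIG */
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>

    void use_rt_signal(int sg_fd)
    {
        /* deliver SIGRTMIN+1 (queued, with siginfo) instead of SIGIO for this descriptor */
        if (fcntl(sg_fd, F_SETSIG, SIGRTMIN + 1) < 0)
            perror("fcntl(F_SETSIG)");
    }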



After that call the SIGRTMIN+1 real time signal will be used instead of SIGIO. Even though you could use hard numbers for the real-time signals, the advice is to always use an offset from SIGRTMIN or SIGRTMAX (a negative offset in the MAX case) because the library can (and does, for its POSIX threads implementation) steal some of the lower real time signals and adjust the SIGRTMIN value that the application program sees. Real time signals have improved semantics compared to the classic Unix signals (e.g. multiple instances of the same real time signal can be queued against a process where Unix signals would meld into one signal event in a similar situation).

In the diagram below the lifetime of an active sg_request object is shown, from when it is created or retrieved from the free list in the top left, to when the SCSI command has completed and the user space has been informed in the bottom right. It assumes that either the O_NONBLOCK flag is set on the file descriptor (assumed to be the same in all the system call boxes shown with the blue band at the top), or ioctl(SG_IORECEIVE) has SGV4_FLAG_IMMED OR-ed into its flags. When the first ioctl(SG_IORECEIVE) is called the SCSI command has not completed so it gets rejected with EAGAIN. The first poll(2) system call indicates with POLLOUT that another SCSI command can be issued but there are no SCSI commands waiting for an ioctl(SG_IORECEIVE) on this file descriptor. Note that the poll(2) description refers to a file descriptor, not this particular sg_request object, but for simplicity we will assume there is only one outstanding SCSI command on this file descriptor. At some future time, preferably long before the command approaches its timeout (often 60 seconds or more), the storage device via its LLD informs the sg driver that a SCSI command belonging to this file descriptor has completed. If O_ASYNC has been set on this file descriptor then the sg driver will issue a SIGIO signal to the owning process. A poll(2) system call after the internal completion point yields (POLLIN | POLLOUT) [in other words both POLLIN and POLLOUT]. That tells us that the next ioctl(SG_IORECEIVE) will be successful, as is indicated in the diagram.




While it is useful to think of, and illustrate, the above mentioned ioctl(2)s and poll(2)s as being in reference to a single sg_request object, they are all actually against the file descriptor that is the parent of that sg_request object. This distinction matters when multiple sg_request objects are outstanding. In the absence of any selection information (e.g. a pack_id or a tag) the ioctl(SG_IORECEIVE) will fetch the oldest sg_request object since the active (and completed) command list acts as a FIFO. Instead of poll(2) the user may call ioctl(SG_GET_NUM_WAITING) which yields the number of sg_request objects belonging to a file descriptor that have completed internally but are yet to have ioctl(SG_IORECEIVE) [or read(2) for the async v3 interface] called on them.
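To make the above concrete, here is a sketch of one submit/poll/receive cycle using the v4 interface. SG_IOSUBMIT and SG_IORECEIVE come from the proposed sg header (included here via the uapi_sg.h copy carried in the sg3_utils testing directory, see section 10), struct sg_io_v4 comes from bsg.h, and the READ(10) CDB and sizes are only illustrative; most error handling is omitted:

    #include <poll.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/bsg.h>          /* struct sg_io_v4 */
    #include "uapi_sg.h"            /* SG_IOSUBMIT, SG_IORECEIVE (proposed include/uapi/scsi/sg.h) */

    /* Submit a READ(10) of 'blocks' blocks starting at 'lba', then wait for it with poll(2). */
    int async_read10(int sg_fd, void *buf, unsigned int lba, unsigned short blocks, int blk_sz)
    {
        unsigned char cdb[10] = {0x28, 0, lba >> 24, lba >> 16, lba >> 8, lba,
                                 0, blocks >> 8, blocks, 0};
        unsigned char sense[32];
        struct sg_io_v4 h4;
        struct pollfd pfd = { .fd = sg_fd, .events = POLLIN };

        memset(&h4, 0, sizeof(h4));
        h4.guard = 'Q';                           /* marks this as the v4 interface */
        h4.request_len = sizeof(cdb);
        h4.request = (unsigned long)cdb;
        h4.max_response_len = sizeof(sense);
        h4.response = (unsigned long)sense;
        h4.din_xfer_len = blocks * blk_sz;
        h4.din_xferp = (unsigned long)buf;
        h4.timeout = 20000;                       /* milliseconds */

        if (ioctl(sg_fd, SG_IOSUBMIT, &h4) < 0)   /* returns long before the READ completes */
            return -1;
        poll(&pfd, 1, -1);                        /* POLLIN: a completed request is waiting */
        if (ioctl(sg_fd, SG_IORECEIVE, &h4) < 0)  /* fetch status; data-in already in 'buf' */
            return -1;
        return h4.device_status;                  /* 0 is the SCSI GOOD status */
    }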

6 Request sharing

Request sharing refers to two requests, usually belonging to different storage devices (e.g. two disks), sharing the same in-kernel data buffer. Before request sharing can take place a share of two file descriptors belonging to those two storage devices needs to be set up. This is discussed in the earlier section '4 Sharing file descriptors'.

The diagram below shows the synchronous sg driver interface using ioctl(SG_IO) which can take either the v3 or v4 interface. The synchronous interface can be seen as the combination of the various calls that make up the asynchronous interface discussed in the previous section. The time that the synchronous ioctl(SG_IO) takes is directly related to the access time of the underlying storage device. To stress that point the system call rectangles (with a blue band at the top) in the diagram below are shown as elongated rectangles with a beginning component to the left and a completion component to the right. The elongated system call boxes span the access time of the associated storage device.

A request share only takes place when a command request is issued with the SGV4_FLAG_SHARE flag (OR-ed with any other flags). This should be done first on the master side with a READ (like) command request. Other flags that might be combined with this are the SG_FLAG_NO_DXFER or SG_FLAG_MMAP_IO flags (but not both). The SG_FLAG_NO_DXFER flag stops the in-kernel data buffer to user space copy. The SG_FLAG_MMAP_IO flag maps the in-kernel data buffer into the user space; that user space area is made available via a mmap(2) system call preceding the command request being sent. The diagram below shows the simpler case where the minimum number of flags are set.






The slave may continue to send normal command requests but at some stage after this point it should send a WRITE (like) command request with both the SGV4_FLAG_SHARE and SG_FLAG_NO_DXFER flags set. That will use the in-kernel data buffer from the preceding master share command request and send that data (i.e. data-out) to the slave's device. So a single, in-kernel data buffer is used for a master share request followed by a slave share request.
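In code, one such master/slave pair might look like the following sketch. It uses the synchronous v4 ioctl(SG_IO); the file descriptor share between read_fd (the master) and write_fd (the slave) is assumed to have already been established with the EXTENDED ioctl described in section 4; the READ(10)/WRITE(10) CDBs and sizes are only illustrative and most error handling is omitted:

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/bsg.h>          /* struct sg_io_v4 */
    #include "uapi_sg.h"            /* SG_IO, SGV4_FLAG_SHARE, SG_FLAG_NO_DXFER (proposed header) */

    /* Copy 'blocks' blocks at 'lba' from the master (read_fd) to the slave (write_fd)
     * without the data ever visiting the user space. */
    int copy_segment(int read_fd, int write_fd, unsigned int lba, unsigned short blocks, int blk_sz)
    {
        unsigned char r_cdb[10] = {0x28, 0, lba >> 24, lba >> 16, lba >> 8, lba,
                                   0, blocks >> 8, blocks, 0};   /* READ(10) */
        unsigned char w_cdb[10] = {0x2a, 0, lba >> 24, lba >> 16, lba >> 8, lba,
                                   0, blocks >> 8, blocks, 0};   /* WRITE(10) */
        struct sg_io_v4 h4;

        memset(&h4, 0, sizeof(h4));
        h4.guard = 'Q';
        h4.request_len = sizeof(r_cdb);
        h4.request = (unsigned long)r_cdb;
        h4.din_xfer_len = blocks * blk_sz;        /* data lands in the master's reserve request */
        h4.flags = SGV4_FLAG_SHARE | SG_FLAG_NO_DXFER;   /* keep the data in the kernel */
        h4.timeout = 60000;
        if (ioctl(read_fd, SG_IO, &h4) < 0 || h4.device_status)
            return -1;                            /* master READ failed: no slave WRITE follows */

        memset(&h4, 0, sizeof(h4));
        h4.guard = 'Q';
        h4.request_len = sizeof(w_cdb);
        h4.request = (unsigned long)w_cdb;
        h4.dout_xfer_len = blocks * blk_sz;       /* re-uses the master's in-kernel buffer */
        h4.flags = SGV4_FLAG_SHARE | SG_FLAG_NO_DXFER;   /* NO_DXFER is mandatory on the slave */
        h4.timeout = 60000;
        return ioctl(write_fd, SG_IO, &h4);
    }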

In the terminology of the block subsystem both the master and slave share requests have their own request object, each with their own bio object. However the sg driver provides the data storage for those bios and arranges for the slave share request to use the same data buffer as the preceding master request's bio. And this is the reason that the slave request must use the SG_FLAG_NO_DXFER flag, otherwise a transfer from the user space usually associated with a WRITE (like) command would overwrite the in-kernel data buffer.

Once the slave request is successfully completed another master share request may be issued. Sanity checks ensure that using the SGV4_FLAG_SHARE flag on a file descriptor that is not part of a share will cause an error, as will trying to send a master share request before a prior master share request is complete (which means its matching slave request is finished). Also using the SGV4_FLAG_SHARE flag on a slave request will fail if there is no master request 'waiting' for it (as shown in the diagram above, the master must be in the "rs_swap" state). Once a pair of file descriptors are shared, the master side's reserve request will only be used for command requests that have the SGV4_FLAG_SHARE flag set.

If the master share request fails (i.e. gives back any non zero status, or fails or warns at some other level) then the master request on completion will go to state "rs_inactive" (i.e. not "rs_swap"). It is also possible that the application wants to stop the request share after the master request (e.g. because the user wants to abort the copy or there is something wrong with the data copied to the user space near the location marked "***" in the above diagram). The EXTENDED ioctl has a MASTER_FINI boolean for that: writing 1 (true) changes the "rs_swap" to "rs_inactive" state while writing 0 (false) does the reverse of that (see below as to why).

The brown arrow-ed lines in the above diagram show the movement of the "dataset" which is usually an integral number of logical blocks (e.g. each containing 512 or 4096 bytes). The brown arrow-ed lines that are vertical and horizontal do not involve copying (or DMA-ing) of that dataset. That leaves three brown arrow-ed lines at an angle: the DMA from the device being read, the DMA to the device being written, and an optional in-kernel to user space copy (annotated with "***"). The vertical brown arrow-ed lines are performed by swapping pointers to scatter-gather lists within the kernel space.

The sgh_dd utility in the sg3_utils/testing directory uses both POSIX threads and sg driver sharing as discussed in this section (if the sg driver running on the target system is recent enough). sgh_dd has help (with 'sgh_dd -h') but no man page, like other test programs (its code is its documentation and an example of use).

A reasonable single READ, multiple WRITE solution needs the ability to have multiple slaves, each associated with a different disk. Looking at the diagram above, two things need to happen to the master: it needs to adopt a new slave and it needs to get back into the "rs_swap" state. A variant of the above mentioned ioctl(slave_fd, EXTENDED{SHARE_FD},) called ioctl(master_fd, EXTENDED{CHG_SHARE_FD},) has been added. As long as the new slave file descriptor meets requirements (e.g. it is not part of a file descriptor share already) then it will replace the existing slave file descriptor. To get back into the "rs_swap" state, writing the value 0 (false) to the MASTER_FINI boolean in the EXTENDED ioctl will do what is needed. The EXTENDED ioctl is a little tricky to use (because it essentially replaces many ioctls) but a side benefit is that multiple actions can be taken by a single EXTENDED ioctl call. So both of the actions required to switch to another slave, ready to do another WRITE, can be done with a single invocation of the EXTENDED ioctl.

Here is a sequence of user space system calls to READ from /dev/sg1 (the master) and WRITE that same data to /dev/sg5, /dev/sg6 and /dev/sg7 (the slaves). Assume that fd1 is a file descriptor associated with /dev/sg1, fd5 with /dev/sg5, etc. In pseudocode that might be: ioctl(fd5, EXTENDED{SHARE_FD}, fd1); ioctl(fd1, SG_IO, FLAG_SHARE + READ); ioctl(fd5, SG_IO, FLAG_SHARE|NO_XFER + WRITE); ioctl(fd1, EXTENDED{CHG_SHARE_FD=fd6 + MASTER_FINI=false}); ioctl(fd6, SG_IO, FLAG_SHARE|NO_XFER + WRITE); ioctl(fd1, EXTENDED{CHG_SHARE_FD=fd7 + MASTER_FINI=false}); and ioctl(fd7, SG_IO, FLAG_SHARE|NO_XFER + WRITE). So that is four ioctls to move data (one READ and three WRITEs) and three "housekeeping" ioctls. Notice that the WRITEs are done sequentially; they could theoretically be done in parallel but that would add complexity. Also note that a second READ cannot be done until the final WRITE from the previous sequence has completed; there is no easy way around that since only one in-kernel buffer is being used (and a second READ would overwrite it). To make this sequence slightly faster (and hide the data from the user space) the flag in the second ioctl (the READ) can be expanded to FLAG_SHARE|NO_XFER .

The sgh_dd utility in the sg3_utils testing directory (rev 803 or later) has been expanded to test the single READ, multiple WRITE feature. It has two extra "of" (output file) parameters: "of2=" and "ofreg=". The "of2=" is for a second WRITE sg device while "ofreg=" takes a regular file or a pipe and is fed with the data that comes from the READ operation marked with "***" in the above diagram. If "ofreg=" is present among sgh_dd's operands then the READ's flag will be FLAG_SHARE; if "ofreg=" is not present its flags will be FLAG_SHARE|NO_XFER . The latter should be slightly faster, and that difference can be reduced with "iflag=mmap". The "of2=" operand shares "oflag=" and "seek=" with "of=".

7 Sharing design considerations

The primary application of sharing is likely to be copying from one storage device to another storage device where both are SCSI devices (or use the SCSI command set, as SATA disks do in Linux). Let's assume the copy is large enough that it needs to be cut up into segments, implemented by READ (from source) and WRITE (to destination) commands, each pair of which shares the same data. Even with modern SSDs, maximum performance is usually obtained by queuing commands to storage devices. However the design of sharing in the sg driver requires sequential READ then WRITE commands on a pair of shared file descriptors, in a way that precludes queuing on those file descriptors. Worse still, the storage device that does the READ (i.e. the master side of the share) must wait, effectively doing nothing, while its paired WRITE command is being done; it could be doing the next READ while it's waiting.

One relatively simple solution is to take advantage of threading, which is well supported by the Linux kernel. A multi-threaded program typically has multiple threads of execution running in a single process in which all threads share the same memory and other resources such as file descriptors. In the case of copying using sharing in the sg driver, a good approach would be to have one management thread and multiple worker threads. Each worker thread would go to a distribution centre where information about the next segment offsets to be copied would be fetched; the worker thread could then go and copy that segment using those offsets and return to the distribution centre for information on the next segment offsets to be copied, or be told there is nothing more to do, in which case the thread could exit. The distribution centre needs to be stateful, which in this context means that it needs to remember which copy segment offsets it has given out and not give them out again (unless the original thread reports an error). One way to protect this distribution centre from two worker threads accessing it at the same time is with a mutex shared between all worker threads. Finer grained threading mechanisms such as atomic integers may be able to provide this protection in place of a mutex.
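A minimal sketch of such a distribution centre, protected by a mutex (the structure and function names are illustrative):

    #include <pthread.h>

    /* Hands out each segment of the copy exactly once. */
    struct dist_centre {
        pthread_mutex_t lock;
        long long next_lba;          /* next logical block to hand out */
        long long end_lba;           /* first block past the end of the copy */
        int seg_blocks;              /* segment size in blocks */
    };

    /* Returns the starting LBA of the segment this worker should copy and sets *num
     * to its length in blocks, or returns -1 when there is nothing left to do. */
    long long next_segment(struct dist_centre *dc, int *num)
    {
        long long lba = -1;

        pthread_mutex_lock(&dc->lock);
        if (dc->next_lba < dc->end_lba) {
            lba = dc->next_lba;
            *num = (dc->end_lba - lba < dc->seg_blocks) ?
                   (int)(dc->end_lba - lba) : dc->seg_blocks;
            dc->next_lba += *num;
        }
        pthread_mutex_unlock(&dc->lock);
        return lba;
    }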

With the sg driver there is no limit (in the driver, modulo memory availability) on the number of file descriptors that can refer to a single storage device. So for this segmented copy using sg driver sharing, a good approach would be to do a separate open(2) system call on the source and another on the destination in each worker thread. Then each worker thread could set up a file descriptor share with the master being the source file descriptor. The number of worker threads should be no more than the maximum queue depth the two devices can comfortably handle. That said, having hundreds of worker threads may consume a lot of the machine's resources. An advantage of this approach is that each worker thread can use the sg driver's simpler synchronous interface (i.e. ioctl(SG_IO) ). The reader might then wonder, is there any command queuing taking place? The answer is yes, because one way of viewing the sg driver is that under the covers it is always asynchronously accessing the SCSI devices. So even when one thread is blocked on an ioctl(SG_IO), another thread can call ioctl(SG_IO) and that command will be forwarded to the device.

There is a big "gotcha" with this design (and almost any other design for segmented copy that isn't completely single threaded). The gotcha does not apply when the destination device is a SCSI device, or uses the pwrite(2) or writev(2) system calls but does apply to the write(2) system call, often used to write to a pipe or socket. The problem is that if a read is issued by one thread (or any asynchronous mechanism) called R1 and before it completes another thread issues a read called R2 then there is no guarantee that R1 will complete before R2. And if R2 does complete before R1 and the write(2) system call is called for W2 (i.e. the pair of R2) before W1 then those writes will be out of order. Detecting out-of-order writes when gigabytes are being copied can be a pain. If the source and shuffled destination are available as files then a utility like sha1sum will show them as different (because they are) but an old school sum (like from 'sum -s') will give the same value for both. There is a related issue associated with the atomicity of the Linux write(2) command. There is no corresponding atomicity issue with the SCSI WRITE command.

To save time and resources the master side's shared READ request should be issued with the SG_FLAG_NO_DXFER flag OR-ed with its other flags. That is assuming the copy program does not need to "see" the data as it flies past. As a counter example, a copy program might want to do a sha256sum on the data being copied, in which case that program needs to "see" the inflight data.

The above design can be extended to the single reader, multiple writer case. In other words each worker thread would open file descriptors to the READ storage device and every WRITE storage device. Code to demonstrate these techniques can be found in the sg3_utils package's testing/sgh_dd.c . That can be built into a utility, yet another dd variant called sgh_dd .

SCSI storage devices optionally report a "Block limits" Vital Product Data (VPD) page which contains a field called "Optimal transfer length" whose units are logical blocks (usually either 512 or 4096 bytes each). There is also a "Maximum transfer length" field whose units are the same. If that VPD page is present (fetched via the SCSI INQUIRY command) but those fields are 0 then no guidance is provided. Otherwise the segment size chosen for a copy should probably be the minimum of the source and destination "Optimal transfer length" fields. However if that implies a segment size in the megabyte range (say over 4 MB) then the Linux kernel may object.

Other copy designs are possible that, instead of using threads, use separate processes. One practical problem with this is the ioctl(2) that sets up the share between a destination file descriptor (fd) and a source fd. That will be done in the process containing the destination fd, but how does it find out about the source fd? One way is, in the process containing the source file descriptor, to use the Unix fork(2) system call to spawn a new process. The child process will share the same file descriptors as its parent. So if the child then goes on to open the destination storage device it has the two file descriptors it needs to set up the share. While that solution may look good on paper, it may require a radical rewrite of existing code to implement. Perhaps a better solution is to pass an open file descriptor from one process to another process using a Unix socket. The blog by Keith Packard outlines the technique. Code based on both techniques can be found in the sg3_utils package's testing/sg_tst_ioctl.c (with the '-f' option).

8 pack_id or tag

When doing asynchronous IO with the sg driver there needs to be a way to wait for a particular response, not just the response that is the oldest. [By oldest is meant the command request in the active queue (a per file descriptor queue) whose callback occurred at the earliest time; this will usually be the first one in the active queue.] A common example would be a multi-thread application where each worker thread shares the same file descriptor and issues one command request and waits for the response to that request before issuing another command request.

Historically the way to do this with the sg driver is with a pack_id (short for packet identifier) which is a 32 bit integer. The pack_id is generated by the user application and passed into the interface structure (in the v4 interface the pack_id is placed in request_extra). The pack_id doesn't have to be unique (per file descriptor) but it is practical for it to be unique (the sg driver does not check its uniqueness). The user application should then call ioctl(SG_SET_FORCE_PACK_ID, 1) which alerts the sg driver to read (from the user space) the pack_id given to ioctl(SG_IORECEIVE) or read(2) and then get the (first) matching request on the active queue, or wait for it to arrive. The pack_id value -1 (or 0xffffffff if viewed as an unsigned integer) is used as a wildcard or to report that nothing is available, depending on the context. The pack_id method has worked well and generated few errors or queries over the years and will continue to be supported in the sg v4 driver.
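A sketch of both steps follows (the helper names are illustrative; on older sg drivers the final read(2) is the only choice, while the v4 driver also accepts ioctl(SG_IORECEIVE) carrying the same v3 header):

    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <scsi/sg.h>

    /* Once per file descriptor: make the driver match responses by pack_id. */
    void enable_pack_id_matching(int sg_fd)
    {
        int one = 1;

        ioctl(sg_fd, SG_SET_FORCE_PACK_ID, &one);
    }

    /* Wait for, and fetch, the response whose pack_id is 'wanted'.  The pack_id
     * is read by the driver from the header passed in; -1 would mean "any". */
    int fetch_by_pack_id(int sg_fd, int wanted, struct sg_io_hdr *hp)
    {
        memset(hp, 0, sizeof(*hp));
        hp->interface_id = 'S';          /* v3 interface marker */
        hp->pack_id = wanted;
        return read(sg_fd, hp, sizeof(*hp));
    }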

So what is a tag in this context? It is also a 32 bit integer but instead of being generated by the user application, it is generated by the block subsystem. So instead of being given via the v4 interface structure to SG_IOSUBMIT, it is returned in the interface structure at the completion of ioctl(SG_IOSUBMIT), in the request_tag field (which is a 64 bit integer). Notice that the tag is only available in the v4 interface structure and via the two new async ioctls: SG_IOSUBMIT and SG_IORECEIVE. Using the tag to find a command response is very similar to the pack_id method described above. As currently implemented the tag logic does not work all the time; its reliability will most likely depend on the SCSI host (the HBA driver, an LLD) that the target device belongs to. There seems to be no reliable way for this driver to fetch the tag from the block infrastructure. Currently this driver simply asks for it after forwarding the command request to the block code. However 3 cases have then been observed: it gets a tag; it doesn't get the tag (it is too early); or it doesn't get the tag (it is too late: the request has already finished!). The third case may only occur with the scsi_debug driver which can complete requests in a microsecond or less (that is configurable). The tag wildcard is also -1 (or all "f"s in hex when viewed as an unsigned integer) so again the logic is very similar to pack_id.

So given the above, the default remains what it was in v3 of the sg driver, namely using pack_id unless another indication is given. To use tags to choose a response, ioctl(SG_SET_FORCE_PACK_ID, 1) is needed first on the file descriptor. Then the v4 interface object given to ioctl(SG_IOSUBMIT) should have SGV4_FLAG_YIELD_TAG OR-ed with the other flags in that interface object. Then after that ioctl has finished successfully, the request_tag field in that object should be set. If it is -1 then no tag was found (as discussed in the previous paragraph). The matching ioctl(SG_IORECEIVE) call should make sure the request_tag field is set as appropriate and the SGV4_FLAG_FIND_BY_TAG flag should be OR-ed with the other flags.

9 Bi-directional command support

One of the main reasons for designing the sg V4 interface was to handle SCSI (or other storage protocol) bi-directional commands (abbreviated here to bidi). In the SCSI command sets, bidi commands are mainly found in the block commands that support RAID (e.g. XDWRITEREAD(10)) and in many of the Object Storage Device (OSD) commands. Linux contains an "osd" upper level driver (ULD) and an object based file system called exofs. New SCSI commands are being considered, such as READ GATHERED, which would most likely be a bidi command. The NVMe command set (NVM) extends the bidi command concept to "quad-di": data-in and data-out plus metadata-in and metadata-out.

Synchronous SCSI bidi commands have been available in the bsg driver for more than 12 years via ioctl(<bsg_dev_fd>, SG_IO) with the sg V4 interface (i.e. struct sg_io_v4) and are now available with the sg V4 driver, where <bsg_dev_fd> is replaced by <sg_dev_fd>. Asynchronous SCSI bidi commands were available for the same period but were withdrawn around Linux kernel 4.15 due to problems with the bsg driver. Those asynchronous commands were submitted via the Unix write(2) call and the response was received using a Unix read(2) call. In the sg v4 driver the submitted and received object structure remains the same but the Unix write(2) and read(2) system calls can no longer be used. Instead two new ioctl(2)s have been introduced, called SG_IOSUBMIT and SG_IORECEIVE, to replace write(2) and read(2) respectively. The functionality is almost identical; read on for details.

In the sg driver the direct IO flag has the effect of letting the block layer manage the data buffers associated with a command. The effect of indirect IO in the sg driver is to let the sg driver manage the data buffers. Indirect IO is the default for the sg driver, with the other options being mmap IO (memory mapped IO) and direct IO. Indirect IO is the most flexible with the sg driver: it can be used by both uni-directional and bidi commands and has no alignment requirements on the user space buffers. Request sharing discussed above cannot be used with direct IO (because the sg driver needs control of the data buffers to implement the share) while mmap IO is not implemented for bidi commands. Also a user space scatter gather list cannot be used for either the data-out or data-in transfers associated with a bidi command.

Other than the exclusions in the previous paragraph, all other capabilities of the sg driver are available to bidi commands. The completion is sent when the second transfer (usually a data-in transfer) has completed. pack_id and/or tags can be used as discussed in the previous section. Signal on completion, polling for completion and multi-threading should also work on bidi commands without issues.

10 SG interface support changes

In the following table, a comparison is made between the supported interfaces of the sg driver found in lk 4.20 (V3.5.36) and the proposed V4 sg driver. The movement of the main header file from the include/scsi directory to the include/uapi/scsi directory should not impact user space programs since modern Linux distributions should check both, and the stub header now in include/scsi/sg.h includes the other one. There is a chance that the GNU libc maintainers don't pick up this change/addition, but if so the author would expect that to be a transient problem. The sg3_utils/testing directory in the sg3_utils package gets around this problem with a local copy of the "real" new sg header in a file named uapi_sg.h .


Table 1. sg interfaces supported by various sg drivers

sg driver V3.5.36 (lk 2.6, 3 and 4):
  - v1+v2 interfaces, async, struct sg_header:
        write(2)+read(2)  [header: include/scsi/sg.h]
  - v3 interface, async, struct sg_io_hdr:
        write(2)+read(2)  [header: include/scsi/sg.h]
  - v3 interface, sync, struct sg_io_hdr:
        ioctl(SG_IO)  [header: include/scsi/sg.h]
  - v4 interface, async, struct sg_io_v4 (bsg.h):
        not available ^^^
  - v4 interface, sync, struct sg_io_v4 (bsg.h):
        not available ***

sg driver V4.0.x (lk ?):
  - v1+v2 interfaces, async, struct sg_header:
        write(2)+read(2) ****  [header: include/uapi/scsi/sg.h]
  - v3 interface, async, struct sg_io_hdr:
        ioctl(SG_IOSUBMIT)+ioctl(SG_IORECEIVE) or write(2)+read(2)  [header: include/uapi/scsi/sg.h]
  - v3 interface, sync, struct sg_io_hdr:
        ioctl(SG_IO)  [header: include/uapi/scsi/sg.h]
  - v4 interface, async, struct sg_io_v4 (bsg.h):
        ioctl(SG_IOSUBMIT)+ioctl(SG_IORECEIVE)  [headers: include/uapi/scsi/sg.h + include/uapi/linux/bsg.h]
  - v4 interface, sync, struct sg_io_v4 (bsg.h):
        ioctl(SG_IO)  [headers: include/uapi/scsi/sg.h + include/uapi/linux/bsg.h]

*** available via the bsg driver; ^^^ removed from the bsg driver in lk 4.15; **** the plan is to deprecate the write(2)/read(2) based interfaces, which would leave the v1+v2 interfaces unsupported.

Note that there is no v1+v2 sync interface. Rather than completely drop the write(2)+read(2) interface, it could be kept alive for the v1+v2 interfaces only. Applications based on the v1+v2 interfaces would have been written around 20 years ago and would need a low level re-write to use the v3 or v4 async interfaces. So what might be dropped is the ability of the v3 interface to use the write(2)+read(2) interface, as the only code change required should be to change the write(2) to an ioctl(SG_IOSUBMIT) and the read(2) to an ioctl(SG_IORECEIVE).

11 Downloads

This tarball sgv4_20190118 has two parts. One directory is named lk5.0 and targets lk 5.0-rc<n>; the other is named lk_le4.20 and targets lk 4.20 and earlier kernels. The difference is that in lk 5.0-rc1 a kernel wide patch by Linus Torvalds changed the number of parameters to the access_ok() function. Since the sg driver uses that call over 10 times, it broke a lot of patches, making it difficult to maintain a single set of patches, hence the split. Both of those directories have a sub-directory called sgv4_20190116 which contains a series of 17 patches. Both of those directories contain the 3 files that represent the sg v4 driver in the kernel: drivers/scsi/sg.c, include/scsi/sg.h and include/uapi/scsi/sg.h . The last file is new (i.e. not in the production sg driver). If those 3 files are copied into the corresponding locations in a kernel source tree then a subsequent kernel build will generate the sg v4 driver. It might be a good idea to take a copy of drivers/scsi/sg.c and include/scsi/sg.h before copying those files, to simplify reverting to the sg v3 driver currently in the kernel.

The patches are against Martin Petersen's 5.1/scsi-queue branch (the part under lk5.0) and his 4.21/scsi-queue branch (the part under lk_le4.20). They should apply against lk 4.18 and later (and perhaps before; to be tested). The recent patches on the sg driver that might interfere (or cause fuzz) are:

96d4f267e40f9 (Linus Torvalds 2019-01-03 18:57:57 -0800) access_ok() (3 parameters to 2 parameters), appeared in lk 5.0-rc1

92bc5a24844ad (Jens Axboe 2018-10-24 13:52:28 -0600) remove double underscore version of blk_put_request(), appeared in lk 5.0-rc1

abaf75dd610cc (Jens Axboe 2018-10-16 08:38:47 -0600) blk_put_request(srp->rq) addition, first appeared in lk 4.20-rc1

The sg driver patch prior to that was 8e4a4189ce02f (Tony Battersby 2018-07-12), first appeared in v4.18-rc8

The sg3_utils package was originally written to test the sg v3 interface when it was introduced. So where better to put sg v4 test code? Since sg3_utils is well established, the author sees no benefit in introducing a sg4_utils in which less than an estimated 5% of the code would change; it is much easier to incorporate that code change/addition in the existing package. The latest sg3_utils beta on the main page (revision 807 (a beta of version 1.45) as this is written) contains utilities for testing the sg v4 interface. The underlying support library has been using the sg v4 header for many years as a common format. If the given device was a bsg device node then the sg v4 interface was used; otherwise (e.g. for sg and block devices) the sg v4 header was translated down to a v3 header and forwarded on. In the current beta, sg3_utils will use ioctl(SG_GET_VERSION_NUM) on sg devices and if it is a v4 driver then it will send a v4 header; otherwise it will do as it does now. [That v4 interface usage can be defeated by './configure --disable-linux-sgv4' .] In the testing directory of that beta are 5 utilities that are "v4" driver aware: sg_tst_ioctl, sg_tst_async, sg_tst_bidi, sgh_dd and sgs_dd. sgh_dd is yet another dd clone, with POSIX thread support and the 'sharing' support discussed above (in the 20181217 version this utility was called sgs_dd, see the next sentence for why the name changed). Retrieved from an archive is sgs_dd which exercises SIGIO and real-time signals plus polling. These test utilities are not built by default and are not part of the automake setup; instead an old school Makefile is used. sg_tst_async is a C++ program and can be built with 'make -f Makefile.cplus sg_tst_async' . Prior to building these test utilities the sg3_utils library needs to be built. That can be done with 'cd <root_of_sg3_utils> ; ./configure ; cd lib ; make ; cd ../testing' . There is a 'make install' which will place the test utilities in /usr/local/bin ; there is also a 'make -f Makefile.cplus install' .

12 Other documents

The original sg driver documentation is here: SCSI-Generic-HOWTO and a more recent discussion of ioctl(SG_IO) is here: sg_io .

13 Conclusion

The sg v4 driver is designed to be backwardly compatible with the v3 driver. The simplest way for an application to find out which driver it has is with ioctl(SG_GET_VERSION_NUM). Removing a restriction such as the 16 outstanding commands per file descriptor limit can catch out programs that rely on hitting that limit. Adding a driver parameter to re-impose that limit, and any other differing behaviour, can be done if the need arises.


Douglas Gilbert

Last updated: 18th January 2019 11:00