The Linux SG_IO ioctl in the 2.6 series

  1. The Linux SG_IO ioctl in the 2.6 series
    1. Introduction
    2. SCSI and related command sets
    3. SG_IO ioctl overview
    4. SG_IO ioctl in the sg driver
    5. SG_IO ioctl differences
    6. open() considerations
    7. SCSI command permissions
    8. CAP_SYS_RAWIO from a user process
    9. SG_IO and the st driver
    10. Maximum transfer size per command
    11. Conclusion

Introduction

The SG_IO ioctl permits user applications to send SCSI commands to a device. In the linux 2.4 series this ioctl was only available via the SCSI generic (sg) driver. In the linux 2.6 series the SG_IO ioctl is additionally available for block devices and SCSI tape (st) devices. There are therefore multiple implementations of this ioctl within the kernel, each with slightly different characteristics; describing those differences is the purpose of this document.

The information in this page is valid for linux kernel 2.6.16.

SCSI and related command sets

All SCSI devices should respond to an INQUIRY command and part of their response is the so-called peripheral device type. This is used by the linux kernel to decide which upper level driver controls the device. There are also devices that belong to other (i.e. not considered SCSI) transports that use SCSI command sets; the primary examples are (S-)ATAPI CD and DVD drives. Not all peripheral device types map to upper level drivers; devices of these types are usually accessed via the SCSI generic (sg) driver.

SCSI (draft) standards are found at www.t10.org . SCSI commands common to all SCSI devices are found in SPC-4 while those specific to block devices are found in SBC-2, those for CD/DVD drives are found in MMC-5 and those for SCSI tape drives are found in SSC-3.

The major non-SCSI command set in the storage area is for ATA non-packet devices which are typically disks. ATA packet devices use ATAPI which in the vast majority of cases carries a SCSI command set. The most recent draft ATA command set standard is ATA8-ACS and can be found at www.t13.org . To complicate things, (non-packet) ATA devices may have their native command set translated into SCSI. This can happen in the kernel (e.g. libata in linux) or in an intermediate device (e.g. in a USB external disk enclosure). Yet another possibility is disks whose firmware can be changed to allow them to use either the SCSI or ATA command set; this may happen in the SAS/SATA area since the physical (cabling) and phy (electrical signalling) levels are so similar.

SG_IO ioctl overview

The third argument given to the SG_IO ioctl is a pointer to an instance of the sg_io_hdr structure which is defined in the <scsi/sg.h> header file. The execution of the SG_IO ioctl can be viewed as going through three phases:
  1. do sanity checks on the metadata in the sg_io_hdr instance; read the input fields and the data pointed to by some of those fields; build a SCSI command and issue it to the device
  2. wait for either a response from the device, the command to time out or the user to terminate the process (or thread) that invoked the SG_IO ioctl
  3. write the output fields and in some cases write data to locations pointed to by some fields, then return
Only phase 1 returns an ioctl error (i.e. a return value of -1 and a value set in errno). In phase 2, command timeouts should be used sparingly as the device (and some others on the same interconnect) may end up being reset. If the user terminates the process or thread that invoked the SG_IO ioctl then obviously phase 3 never occurs but the command execution runs to completion (or timeout) and the kernel "throws away" the results. If the command yields a SCSI status of CHECK CONDITION (in field "status") then sense data is written out in phase 3.

Now we will assume that the SCSI command involves user data being transferred to or from the device. The SCSI subsystem does not support true bidirectional data transfers to a device. All data DMA transfers (assuming the hardware supports DMA) occur in phase 2. However, if indirect IO is being used (i.e. neither direct IO nor mmap-ed transfers) then either data to be written to the device is copied from user space in phase 1, or data read from the device is copied to user space in phase 3.

When direct IO or mmap-ed transfers are being used then all user data is moved in phase 2. If a process is terminated during such a data transfer then the kernel gracefully handles this (by pinning the associated memory pages until the transfer is complete).
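The difference between indirect and direct IO can be observed from user space with the sg driver. The following is a minimal sketch, assuming the flag and mask names found in <scsi/sg.h> (SG_FLAG_DIRECT_IO, SG_INFO_DIRECT_IO_MASK, SG_INFO_DIRECT_IO); the two helper function names are hypothetical and chosen for illustration only.

```c
#include <scsi/sg.h>

/* Hypothetical helper: describe a data-in transfer and ask the sg
 * driver to attempt direct IO. The driver may still fall back to
 * indirect IO (e.g. if the buffer is not suitably aligned). */
void request_direct_io(struct sg_io_hdr *hdr, void *buf, unsigned int len)
{
    hdr->dxfer_direction = SG_DXFER_FROM_DEV;
    hdr->dxferp = buf;
    hdr->dxfer_len = len;
    hdr->flags |= SG_FLAG_DIRECT_IO;
}

/* Hypothetical helper: after the SG_IO ioctl has returned, report
 * whether the "info" output field says direct IO was really used. */
int used_direct_io(const struct sg_io_hdr *hdr)
{
    return (hdr->info & SG_INFO_DIRECT_IO_MASK) == SG_INFO_DIRECT_IO;
}
```

A caller would zero the sg_io_hdr, fill in the command fields, call request_direct_io(), invoke the ioctl and then consult used_direct_io(). Since the block layer implementation ignores the flags field, this check is only meaningful on sg device nodes.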

The sg_io_hdr structure has 22 fields (members) but typically only a small number of them need to be set. The following code fragment shows the setup for a simple TEST UNIT READY SCSI command which has no associated data transfers:
   
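    /* needs <scsi/sg.h>, <sys/ioctl.h>, <string.h> and <stdio.h> */
    /* assumed values for the two macros this fragment relies on */
    #define TUR_CMD 0x00          /* TEST UNIT READY opcode */
    #define DEF_TIMEOUT 30000     /* milliseconds (30 seconds) */

    unsigned char sense_b[32];
    unsigned char turCmbBlk[] = {TUR_CMD, 0, 0, 0, 0, 0};
    struct sg_io_hdr io_hdr;

    memset(&io_hdr, 0, sizeof(struct sg_io_hdr));   /* unused input fields become zero */
    io_hdr.interface_id = 'S';                      /* guard field */
    io_hdr.cmd_len = sizeof(turCmbBlk);
    io_hdr.mx_sb_len = sizeof(sense_b);
    io_hdr.dxfer_direction = SG_DXFER_NONE;         /* no data transfer */
    io_hdr.cmdp = turCmbBlk;
    io_hdr.sbp = sense_b;
    io_hdr.timeout = DEF_TIMEOUT;

    if (ioctl(fd, SG_IO, &io_hdr) < 0) {
        perror("SG_IO ioctl");    /* phase 1 failure, errno is set */
        return -1;                /* assumes an enclosing function returning int */
    }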

The memset() call is important: it sets unused input fields to safe values. Setting the timeout field to zero is not a good idea; 30,000 (for 30 seconds) is a reasonable default for most SCSI commands. As always, good error processing consumes a lot more code. This is especially the case with SCSI commands that yield "sense data" when something goes wrong. For example, if there is a medium error during a disk read, the sense data will contain the logical block address (lba) of the failure. Another error processing example is a SCSI command that the device considers an "illegal request": the sense data may show the byte and bit position of the field in the command block (usually referred to as a "cdb") that it objects to. For examples of error processing please refer to the sg3_utils package, its "examples" directory and its library components: sg_lib.c (SCSI error processing and tables) and sg_cmds.c (common SCSI commands).
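As a rough illustration of what the first layer of such error processing might look like, the sketch below classifies a completed sg_io_hdr and pulls the sense key, additional sense code and qualifier out of the sense buffer. It is not taken from sg3_utils. The enum, struct, macro and function names are hypothetical; the numeric values (CHECK CONDITION status 0x02, the DRIVER_SENSE bit 0x08, and the fixed and descriptor sense format layouts) are assumed from SPC and the linux scsi.h header.

```c
#include <scsi/sg.h>
#include <string.h>

#define SCSI_CHECK_CONDITION 0x02   /* assumed: SCSI status value */
#define DRV_SENSE_BIT        0x08   /* assumed: DRIVER_SENSE in scsi.h */

enum sg_result { SG_RES_GOOD, SG_RES_SENSE, SG_RES_TRANSPORT, SG_RES_OTHER };

struct sense_summary { unsigned char key, asc, ascq; };

/* Classify the output fields written in phase 3. */
enum sg_result classify_result(const struct sg_io_hdr *hdr)
{
    if (hdr->host_status != 0)
        return SG_RES_TRANSPORT;             /* a DID_* error */
    if (hdr->status == SCSI_CHECK_CONDITION ||
        (hdr->driver_status & DRV_SENSE_BIT))
        return SG_RES_SENSE;                 /* look at the sense buffer */
    if (hdr->status != 0 || hdr->driver_status != 0)
        return SG_RES_OTHER;                 /* e.g. BUSY */
    return SG_RES_GOOD;
}

/* Decode fixed (0x70, 0x71) or descriptor (0x72, 0x73) format sense
 * data. Returns 1 on success, 0 if the buffer is too short or the
 * response code is not recognised. */
int decode_sense(const unsigned char *sb, int len, struct sense_summary *out)
{
    memset(out, 0, sizeof(*out));
    if (len < 1)
        return 0;
    switch (sb[0] & 0x7f) {
    case 0x70: case 0x71:                    /* fixed format */
        if (len < 14)
            return 0;
        out->key = sb[2] & 0x0f;
        out->asc = sb[12];
        out->ascq = sb[13];
        return 1;
    case 0x72: case 0x73:                    /* descriptor format */
        if (len < 4)
            return 0;
        out->key = sb[1] & 0x0f;
        out->asc = sb[2];
        out->ascq = sb[3];
        return 1;
    default:
        return 0;
    }
}
```

When classify_result() indicates sense data, decode_sense() would be called with the sense buffer and the sb_len_wr output field as the length. A sense key of 0x3 indicates a medium error and 0x5 an illegal request, the two cases mentioned above.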

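Below is a grouping of important sg_io_hdr structure fields with brief summaries:
Command block (historically referred to as the "cdb"): cmdp (pointer to the SCSI command) and cmd_len (its length in bytes).
Data transfer: dxfer_direction (direction of the data transfer, or SG_DXFER_NONE), dxferp (pointer to the user data), dxfer_len (number of bytes to transfer) and resid (dxfer_len less the number of bytes actually transferred).
Error indication: status (SCSI command status, zero implies GOOD), host_status (error reported by the initiator), driver_status (error and suggestion bit mask from the low level driver) and info (bit mask indicating what was done and whether any error was detected).
Sense data (only used when 'status' is CHECK CONDITION or (driver_status & DRIVER_SENSE) is true): sbp (pointer to the sense data buffer), mx_sb_len (maximum number of sense bytes that may be written) and sb_len_wr (number of sense bytes actually written).

The fields in the sg_io_hdr structure are defined in more detail in the SCSI-Generic-HOWTO document.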
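To show the data transfer fields in use, here is a hedged sketch that prepares a standard INQUIRY (a data-in command). The helper name prepare_inquiry and the macro names are hypothetical; the INQUIRY opcode (0x12) and the position of the allocation length in byte 4 of the cdb are assumed from SPC, and the 30 second timeout follows the default suggested earlier.

```c
#include <scsi/sg.h>
#include <string.h>

#define INQUIRY_OPCODE 0x12     /* assumed from SPC */
#define INQUIRY_CDB_LEN 6

/* Hypothetical helper: fill in a cdb and an sg_io_hdr for a standard
 * INQUIRY whose response is placed in "reply". Nothing is sent to a
 * device here; the caller still has to invoke the SG_IO ioctl. */
void prepare_inquiry(struct sg_io_hdr *hdr, unsigned char *cdb,
                     unsigned char *reply, unsigned char reply_len,
                     unsigned char *sense, unsigned char sense_len)
{
    /* command block */
    memset(cdb, 0, INQUIRY_CDB_LEN);
    cdb[0] = INQUIRY_OPCODE;
    cdb[4] = reply_len;                 /* allocation length */

    memset(hdr, 0, sizeof(*hdr));
    hdr->interface_id = 'S';
    hdr->cmdp = cdb;
    hdr->cmd_len = INQUIRY_CDB_LEN;

    /* data transfer: device to user memory */
    hdr->dxfer_direction = SG_DXFER_FROM_DEV;
    hdr->dxferp = reply;
    hdr->dxfer_len = reply_len;

    /* sense data */
    hdr->sbp = sense;
    hdr->mx_sb_len = sense_len;

    hdr->timeout = 30000;               /* milliseconds */
}
```

After a successful ioctl(fd, SG_IO, &hdr) the number of valid bytes in the reply buffer would be dxfer_len less resid, provided the low level driver sets resid.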

SG_IO ioctl in the sg driver

Linux kernel 2.4.0 was the first production kernel in which the SG_IO ioctl appeared in the SCSI generic (sg) driver. The sg driver itself has been in linux since around 1993. In the sg driver an instance of the sg_io_hdr structure can be passed either as the third argument of the SG_IO ioctl or via the driver's write() and read() system calls.

The SCSI-Generic-HOWTO document describes the sg driver in the lk 2.4 series including its use of the SG_IO ioctl. Prior to the lk 2.4 series the sg driver only had the sg_header structure. It was used as an asynchronous command interface in which command, metadata and optionally user data were sent via a Unix write() system call. The corresponding response which included error information (e.g. sense data) or optionally user data was received via a Unix read() system call. Two major additions were made to the sg driver at the beginning of the lk 2.4 series: the sg_io_hdr structure, offered as an alternative to sg_header, and the SG_IO ioctl, which uses that structure to issue a command and wait for its response in a single call.

The sg_io_hdr only contains metadata in the sense that it contains pointers to locations of where data will come from (command or data in) or go to (sense data or data out). These pointers have caused problems in mixed 32/64 bit environments, especially when the user application (e.g. cdrecord) is built for 32 bits and the kernel is 64 bits. The lk 2.6 series has a compatibility layer to cope with this via code specialized for the SG_IO ioctl. Unfortunately this problem was not foreseen when the sg_io_hdr structure was designed.
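The effect of the embedded pointers on the structure layout can be made concrete. The sketch below declares a stand-in for the layout a 32 bit application would use, with each pointer replaced by a 32 bit integer. The struct name sg_io_hdr_32bit and the helper functions are hypothetical and are not the kernel's compatibility code; they only illustrate why a translation step is needed.

```c
#include <scsi/sg.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of sg_io_hdr as seen by a 32 bit program:
 * same field order, but every pointer occupies 4 bytes. */
struct sg_io_hdr_32bit {
    int32_t  interface_id;
    int32_t  dxfer_direction;
    uint8_t  cmd_len;
    uint8_t  mx_sb_len;
    uint16_t iovec_count;
    uint32_t dxfer_len;
    uint32_t dxferp;        /* void * in the native structure */
    uint32_t cmdp;          /* unsigned char * */
    uint32_t sbp;           /* unsigned char * */
    uint32_t timeout;
    uint32_t flags;
    int32_t  pack_id;
    uint32_t usr_ptr;       /* void * */
    uint8_t  status;
    uint8_t  masked_status;
    uint8_t  msg_status;
    uint8_t  sb_len_wr;
    uint16_t host_status;
    uint16_t driver_status;
    int32_t  resid;
    uint32_t duration;
    uint32_t info;
};

/* Size of the structure as the running program sees it. */
size_t native_hdr_size(void)
{
    return sizeof(struct sg_io_hdr);
}

/* Size of the structure as a 32 bit program sees it. */
size_t compat_hdr_size(void)
{
    return sizeof(struct sg_io_hdr_32bit);
}
```

The 32 bit layout occupies 64 bytes, while a typical 64 bit build of the native structure is larger and places fields such as timeout at different offsets. A 64 bit kernel therefore cannot simply reinterpret the bytes handed to it by a 32 bit application.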

A significant feature of the SG_IO ioctl in the sg driver is that it is user interruptible. This means that between issuing a command (e.g. a long duration command like a disk format) and its response arriving, a user could hit control-C on the associated application. The kernel would remain stable and resources would be cleared up at the appropriate time. The sg driver does not attempt to abort such a command that is "in flight", it simply throws away the response and cleans up. Naturally the user has no direct way of finding out whether an interrupted command succeeded or not, but there may be indirect ways.

A warning may also be in order here: a long duration command such as format would typically be given a long timeout value. If the user interrupted the application that sent the format command then the device may remain busy doing the format (especially if the IMMED bit is not set). So if the user then sent a short duration command such as TEST UNIT READY or REQUEST SENSE to see what the device was doing, these commands may time out. This would invoke the SCSI subsystem error handler which would most likely send a device reset, thus aborting the format, to get the device's attention. This is probably not what the user had in mind!

SG_IO ioctl differences

In the following table, sg_io_hdr structure fields are listed in the order they appear in that structure. Basically the "in" fields appear at the top of the structure and are read in phase 1. The latter fields are termed "out" and are written by the SG_IO implementation in phase 3. The "different" column marks fields whose handling differs between the sg driver and the block layer implementations.
 
Table 1. sg_io_hdr structure summary and implementation differences

| sg_io_hdr field | in or out | type | different | brief description including differences between implementations |
|---|---|---|---|---|
| interface_id | in | int |  | guard field. Current implementations only accept (int)'S'. If not set, the sg driver sets errno to ENOSYS while the block layer sets it to EINVAL |
| dxfer_direction | in | (-ve) int | minor | direction of data transfer. SG_DXFER_NONE and friends are defined as negative integers so the sg driver can discriminate between sg_io_hdr instances and those of sg_header. This nuance is irrelevant to non-sg driver usage of SG_IO. See below. |
| cmd_len | in | unsigned char |  | limits command length to 255 bytes. No SCSI commands (even variable length ones in OSD) are this long (yet) |
| mx_sb_len | in | unsigned char |  | maximum number of bytes of sense data that the driver can output via the sbp pointer |
| iovec_count | in | unsigned short | yes | if not sg driver and greater than zero then the SG_IO ioctl fails with errno set to EOPNOTSUPP; sg driver treats dxferp as a pointer to an array of struct sg_iovec when this field is greater than zero |
| dxfer_len | in | unsigned int | minor | number of bytes of data to transfer to or from the device. Upper limit for block devices related to /sys/block/<device>/queue/max_sectors_kb |
| dxferp | in [*in or *out] | void * | minor | pointer to (user space) data to transfer to (if reading from device) or transfer from (if writing to device). Further level of indirection in the sg driver when iovec_count is greater than 0. |
| cmdp | in [*in] | unsigned char * |  | pointer to SCSI command. The SG_IO ioctl in the sg driver fails with errno set to EMSGSIZE if cmdp is NULL and EFAULT if it is invalid; the block layer sets errno to EFAULT in both cases. |
| sbp | in [*out] | unsigned char * |  | pointer to user data area where no more than mx_sb_len bytes of sense data from the device will be written if the SCSI status is CHECK CONDITION. |
| timeout | in | unsigned int | yes (if = 0) | time in milliseconds that the SCSI mid-level will wait for a response. If that timer expires before the command finishes, then the command may be aborted, the device (and maybe others on the same interconnect) may be reset depending on error handler settings. Dangerous stuff, the SG_IO ioctl has no control (through this interface) of exactly what happens. In the sg driver a timeout value of 0 means 0 milliseconds, in the block layer (currently) it means 60 seconds. |
| flags | in | unsigned int | yes | Block layer SG_IO ioctl ignores this field; the sg driver uses it to request special services like direct IO or mmap-ed transfers. It is a bit mask. |
| pack_id | in -> out | int |  | unused (for user space program tag) |
| usr_ptr | in -> out | void * |  | unused (for user space pointer tag) |
| status | out | unsigned char |  | SCSI command status, zero implies GOOD |
| masked_status | out | unsigned char |  | Logically: masked_status == ((status & 0x3e) >> 1). Old linux SCSI subsystem usage, deprecated. |
| msg_status | out | unsigned char |  | SCSI parallel interface (SPI) message status (very old, deprecated) |
| sb_len_wr | out | unsigned char |  | actual length of sense data (in bytes) output via sbp pointer. |
| host_status | out | unsigned short |  | error reported by the initiator (port). These are the "DID_*" error codes in scsi.h |
| driver_status | out | unsigned short |  | bit mask: error and suggestion reported by the low level driver (LLD). These are the "DRIVER_*" error codes in scsi.h |
| resid | out | int |  | (dxfer_len - number_of_bytes_actually_transferred). Typically only set when there is a shortened DMA transfer from the device. Not necessarily an error. Older LLDs always yield zero. |
| duration | out | unsigned int |  | number of milliseconds that elapsed between when the command was injected into the SCSI mid level and the corresponding "done" callback was invoked. Roughly the duration of the SCSI command in milliseconds. |
| info | out | unsigned int | minor | bit mask indicating what was done (or not) and whether any error was detected. Block layer SG_IO ioctl only sets SG_INFO_CHECK if an error was detected |

The DID_* and DRIVER_* error and suggestion codes (associated with host_status and driver_status) are discussed in more detail in the SCSI-Generic-HOWTO document.
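Two of the "out" fields in Table 1 are defined by simple arithmetic, which the following sketch spells out. The function names are hypothetical. The treatment of an out-of-range resid value is a defensive assumption of this sketch, not documented driver behaviour.

```c
#include <scsi/sg.h>

/* The deprecated masked_status field, derived from status as given
 * in Table 1. */
unsigned char masked_from_status(unsigned char status)
{
    return (unsigned char)((status & 0x3e) >> 1);
}

/* Number of bytes actually moved: dxfer_len less resid. If resid is
 * negative or larger than dxfer_len (assumption: treat it as not
 * trustworthy) report zero bytes rather than a wrapped value. */
unsigned int bytes_transferred(const struct sg_io_hdr *hdr)
{
    if (hdr->resid < 0 || (unsigned int)hdr->resid > hdr->dxfer_len)
        return 0;
    return hdr->dxfer_len - (unsigned int)hdr->resid;
}
```

For example a status of 0x02 (CHECK CONDITION) corresponds to a masked_status of 0x01, and a 512 byte request that comes back with a resid of 12 moved 500 bytes.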

open() considerations

Various drivers have different characteristics when a device node is opened. One problem with the ioctl system call is that a user only needs read permissions to execute it but may, with ioctls like SG_IO, write to a device (e.g. format it). Command (operation code) sniffing logic is used to overcome this security problem. Also, users of the SG_IO ioctl need to be aware that when they "share" a device with the sd, st or a cdrom driver, state machines within those drivers may be tricked. This may be unavoidable but the users of the SG_IO ioctl should take appropriate care.

Opening a file in linux with flags of zero implies the O_RDONLY flag and hence read only access. All open() system calls can yield ENOENT (no such file or directory), ENODEV (no such device) if the file exists but there is no attached device, and EACCES (permission denied) if the user doesn't have appropriate permissions.

A user with CAP_SYS_RAWIO capability (normally associated with the "root" user) bypasses all command sniffing and other access controls that would otherwise lead to EACCES or EPERM errors. With the sg driver such a user may still need to open() a device node with O_RDWR (rather than O_RDONLY) to use all SCSI commands.

Table 2. open() flags for SG_IO ioctl usage

| open() flags | sg notes | sd notes | st notes | cdrom notes | Comments |
|---|---|---|---|---|---|
| <none> or O_RDONLY | 1,2 | 3,4 | 3,5 | 3,6 | best to add O_NONBLOCK. For a device with removable media (e.g. tape drive) that depends on whether the drive or its media is being accessed. |
| O_RDONLY \| O_NONBLOCK | 1,7 | 3 | 3,13 | 3 | recommended when SCSI commands are recognized as reading information from the device |
| O_RDWR | 2 | 4,8,9 | 5,8,9 | 6,8,9 | again, could be better to add O_NONBLOCK |
| O_RDWR \| O_NONBLOCK | 7 | 8,9 | 8,9,13 | 8,9 | recommended when arbitrary (including vendor specific) SCSI commands are to be sent |
| << interaction with O_EXCL >> | 10 | 11 | 12 | 11 | only use when sure that no other application may want to access the device (or partition). A surprising number of applications do "poke around" devices. |
| << interaction with O_DIRECT >> | - | --> | - | --> | requires sector alignment on data transfers (ignored by sg and st) |

Notes:
  1. on subsequent SG_IO ioctl calls, the sg driver will only allow SCSI commands in its allow_ops array, others result in EPERM (operation not permitted) in errno. See the "SCSI command permissions" section below.
  2. if previous open() of this sg device node still holds O_EXCL then this open() waits until it clears.
  3. on subsequent SG_IO ioctl calls, the block layer will only allow SCSI commands listed as "safe_for_read" in the verify_command() function in the drivers/block/scsi_ioctl.c file; others result in EPERM (operation not permitted) in errno. See the "SCSI command permissions" section below.
  4. if removable media and it is not present then yields ENOMEDIUM (no medium found)
  5. if a tape is not present in drive then yields EIO (input/output error), if tape is "in use" then yields EBUSY (resource busy). Only one open file descriptor is allowed per st device node at a time (although dup() can be used).
  6. if tray closed and media is not present then yields ENOMEDIUM (no medium found); if tray open then tries to close it and if no media present then yields ENOMEDIUM
  7. if previous open() of this sg device node still holds O_EXCL then yields EBUSY (resource busy).
  8. on subsequent SG_IO ioctl calls, the block layer will allow SCSI commands listed as either "safe_for_read" or "safe_for_write". For other SCSI commands the user requires the CAP_SYS_RAWIO capability (usually associated with the "root" user); if not, the call yields EPERM (operation not permitted). The first instance of such a command since boot sends an annoying "scsi: unknown opcode" message to the log.
  9. if the media or drive is marked as not writable then yields EROFS (read-only file system).
  10. if sg device node already has exclusive lock then a subsequent attempt to open(O_EXCL) will wait unless O_NONBLOCK is given in which case it yields EBUSY (resource busy)
  11. implemented at block device level (which knows about partitions within devices). If a previous open(O_EXCL) is active then a subsequent open(O_EXCL) yields EBUSY (resource busy). Mounted file systems typically open a device/partition with O_EXCL; as long as an application using the SG_IO ioctl does not also try and use the O_EXCL flag then it will be allowed access to the device.
  12. the st driver does not support (i.e. ignores) the O_EXCL flag. However the fact that it only permits one active open() per tape device is similar functionality.
  13. if tape is "in use" then yields EBUSY (resource busy). Only one open file descriptor is allowed per st device node at a time.
The O_EXCL flag has a different effect in the sg driver and the block layer. In the sg driver, once O_EXCL is held on a device, all subsequent open() attempts will either wait or yield EBUSY (irrespective of whether they attempt to use the O_EXCL flag). Once a partition/device is opened successfully in the block layer (with the sd or cdrom driver) only subsequent open() attempts that also use the O_EXCL flag are rejected (with EBUSY). An O_EXCL lock held on a device in the block layer has no effect on accessing the same device via the sg driver (and vice versa).

The first successful open on a sd or a cdrom device node that has removable media will send a PREVENT ALLOW MEDIUM REMOVAL (prevent) SCSI command to the device. If successful, this will inhibit a subsequent START STOP UNIT (eject) SCSI command and de-activate the eject button on the drive. In emergencies, the SG_IO ioctl can be used to defeat this action; an example of this is the sdparm utility, specifically "sdparm --command=unlock".
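For readers wondering what such an "unlock" looks like at the command level, here is a small sketch that builds the 6 byte cdb. It is not sdparm's code. The function and macro names are hypothetical, and the opcode (0x1e) and the position of the PREVENT field (low bits of byte 4) are assumed from SPC.

```c
#include <string.h>

#define PREVENT_ALLOW_OPCODE 0x1e   /* assumed from SPC */
#define PREVENT_ALLOW_CDB_LEN 6

/* Hypothetical helper: build a PREVENT ALLOW MEDIUM REMOVAL cdb.
 * prevent != 0 asks the device to lock the medium in place,
 * prevent == 0 allows removal again (the "unlock" case). */
void build_prevent_allow_cdb(unsigned char *cdb, int prevent)
{
    memset(cdb, 0, PREVENT_ALLOW_CDB_LEN);
    cdb[0] = PREVENT_ALLOW_OPCODE;
    cdb[4] = prevent ? 0x01 : 0x00;
}
```

The command carries no data, so it would be sent with dxfer_direction set to SG_DXFER_NONE in the same way as the TEST UNIT READY fragment shown earlier. According to Table 3 it needs a file descriptor opened with O_RDWR in both implementations.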

The open() flag O_NDELAY has the same value and meaning as O_NONBLOCK. Other flags such as O_DIRECT, O_TRUNC and O_APPEND have no effect on the SG_IO ioctl.
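The recommendations in the Comments column of Table 2 can be captured in a couple of lines. This is a sketch only: the function names are hypothetical, and the single needs_write_access argument is a simplification of the real decision, which depends on the commands to be sent.

```c
#include <fcntl.h>

/* Hypothetical helper: pick open() flags for later SG_IO use.
 * O_NONBLOCK is always added, as Table 2 recommends; O_RDWR is
 * chosen when commands beyond the "read" class will be sent. */
int sg_io_open_flags(int needs_write_access)
{
    return (needs_write_access ? O_RDWR : O_RDONLY) | O_NONBLOCK;
}

/* Hypothetical helper: open a device node with those flags.
 * Returns a file descriptor, or -1 with errno set (for example
 * ENOENT, ENODEV, EACCES or EBUSY as described above). */
int open_for_sg_io(const char *path, int needs_write_access)
{
    return open(path, sg_io_open_flags(needs_write_access));
}
```

A utility that only issues INQUIRY or MODE SENSE commands would call open_for_sg_io(path, 0), while one that sends vendor specific commands would pass 1.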

SCSI command permissions

In linux a user only needs read permissions on a file descriptor to execute an ioctl() system call. In the case of the SG_IO ioctl, a SCSI command could be sent that obviously changes the state of a device (e.g. WRITE to a disk). So both implementations of the SG_IO ioctl require more than read permissions for some commands, especially those that are known to change the state of a device or those that have some unknown action (e.g. vendor specific commands).

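Here is a table showing the minimum permission each implementation requires for a range of SCSI commands: O_RDONLY where write permissions are not needed, O_RDWR where they are, or in some cases the CAP_SYS_RAWIO capability (which usually equates to the "root" user):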
Table 3. SCSI command minimum permission requirements

| SCSI command | (draft) standard | sg driver requires | block layer SG_IO requires (except st) | Comments |
|---|---|---|---|---|
| BLANK | MMC-4 | O_RDWR | O_RDWR |  |
| CLOSE TRACK/SESSION | MMC-4 | O_RDWR | O_RDWR |  |
| ERASE | MMC-4 | O_RDWR | O_RDWR |  |
| FLUSH CACHE | SBC-3, MMC-4 | O_RDWR | O_RDWR | Really SYNCHRONIZE CACHE command |
| FORMAT UNIT | SBC-3, MMC-4 | O_RDWR | O_RDWR | default command timeout may not be long enough |
| GET CONFIGURATION | MMC-4 | O_RDWR | O_RDONLY | reads CD/DVD metadata |
| GET EVENT STATUS NOTIFICATION | MMC-4 | O_RDWR | O_RDONLY |  |
| GET PERFORMANCE | MMC-4 | O_RDWR | O_RDONLY |  |
| INQUIRY | SPC-4 | O_RDONLY | O_RDONLY | All SCSI devices should respond to this command |
| LOAD UNLOAD MEDIUM | MMC-4 | O_RDWR | O_RDWR | MEDIUM may be replaced by CD, DVD or nothing |
| LOG SELECT | SPC-4 | O_RDWR | O_RDWR | used to change logging or clear logged data |
| LOG SENSE | SPC-4 | O_RDONLY | O_RDONLY | used to fetch logged data |
| MAINTENANCE COMMAND IN | SPC-4 | O_RDONLY | CAP_SYS_RAWIO | various "REPORT ..." commands such as REPORT SUPPORTED OPERATION CODES in here |
| MODE SELECT (6+10) | SPC-4 | O_RDWR | O_RDWR | Used to change SCSI device metadata |
| MODE SENSE (6+10) | SPC-4 | O_RDONLY | O_RDONLY | Used to read SCSI device metadata |
| PAUSE RESUME | MMC-4 | O_RDWR | O_RDONLY |  |
| PLAY AUDIO (10) | MMC-4 | O_RDWR | O_RDONLY |  |
| PLAY AUDIO MSF | MMC-4 | O_RDWR | O_RDONLY |  |
| PLAY AUDIO TI | ?? | O_RDWR | O_RDONLY | opcode 0x48, unassigned to any spec in SPC-4 |
| PLAY CD | MMC-2 | O_RDWR | O_RDONLY | old, now SPARE IN in SPC-4 |
| PREVENT ALLOW MEDIUM REMOVAL | SPC-4, MMC-4 | O_RDWR | O_RDWR | sd, st and cdrom drivers use this internally |
| READ (6+10+12+16) | SBC-3 | O_RDONLY | O_RDONLY | READ(16) requires O_RDWR with the sg driver before lk2.6.11 |
| READ BUFFER | SPC-4 | O_RDONLY | O_RDONLY |  |
| READ BUFFER CAPACITY | MMC-4 | O_RDWR | O_RDONLY |  |
| READ CAPACITY(10) | SBC-3, MMC-4 | O_RDONLY | O_RDONLY |  |
| READ CAPACITY(16) | SBC-3, MMC-4 | O_RDONLY | CAP_SYS_RAWIO | within SERVICE ACTION IN command. Needed for RAIDs larger than 2 TB |
| READ CD | MMC-4 | O_RDWR | O_RDONLY |  |
| READ CD MSF | MMC-4 | O_RDWR | O_RDONLY |  |
| READ CDVD CAPACITY | SBC-3, MMC-4 | O_RDONLY | O_RDONLY | Strange (old ?) name from cdrom.h. Actually is READ CAPACITY. |
| READ DEFECT (10) | SBC-3 | O_RDWR | O_RDONLY |  |
| READ DISC INFO | MMC-4 | O_RDWR | O_RDONLY |  |
| READ DVD STRUCTURE | MMC-4 | O_RDWR | O_RDONLY |  |
| READ FORMAT CAPACITIES | MMC-4 | O_RDWR | O_RDONLY |  |
| READ HEADER | MMC-2 | O_RDWR | O_RDONLY |  |
| READ LONG (10) | SBC-3 | O_RDONLY | O_RDONLY | but not READ LONG (16) |
| READ SUB-CHANNEL | MMC-4 | O_RDWR | O_RDONLY |  |
| READ TOC/PMA/ATIP | MMC-4 | O_RDWR | O_RDONLY |  |
| READ TRACK (RZONE) INFO | MMC-4 | O_RDWR | O_RDONLY | In MMC-4 called READ TRACK INFO |
| RECEIVE DIAGNOSTIC | SPC-4 | O_RDONLY | CAP_SYS_RAWIO | the SES command set uses this command a lot. An SES device is only accessible via an sg device node |
| REPAIR (RZONE) TRACK | MMC-4 | O_RDWR | O_RDWR |  |
| REPORT KEY | MMC-4 | O_RDWR | O_RDONLY |  |
| REPORT LUNS | SPC-4 | O_RDONLY | CAP_SYS_RAWIO | mandatory since SPC-3 |
| REQUEST SENSE | SPC-4 | O_RDONLY | O_RDONLY | has uses other than those displaced by autosense |
| RESERVE (RZONE) TRACK | MMC-4 | O_RDWR | O_RDWR |  |
| SCAN | MMC-4 | O_RDWR | O_RDONLY |  |
| SEEK | MMC-4 | O_RDWR | O_RDONLY |  |
| SEND CUE SHEET | MMC-4 | O_RDWR | O_RDWR |  |
| SEND DVD STRUCTURE | MMC-4 | O_RDWR | O_RDWR |  |
| [SEND EVENT] | MMC-2 |  | O_RDWR | cdrom.h associates opcode 0xa2 but MMC-2 uses opcode 0x5d ?? |
| SEND KEY | MMC-4 | O_RDWR | O_RDWR |  |
| SEND OPC INFORMATION | MMC-4 | O_RDWR | O_RDWR |  |
| SERVICE ACTION IN | SPC-4, SBC-3 | O_RDONLY | CAP_SYS_RAWIO | READ CAPACITY (16) service action in here |
| SET CD SPEED | MMC-4 | O_RDWR | O_RDWR | cdrom.h calls this SET SPEED |
| SET STREAMING | MMC-4 | O_RDWR | O_RDWR |  |
| START STOP UNIT | SBC-3, MMC-4 | O_RDWR | O_RDONLY | hmm |
| STOP PLAY/SCAN | MMC-4 | O_RDWR | O_RDONLY |  |
| SYNCHRONIZE CACHE | SBC-3, MMC-4 | O_RDWR | O_RDWR | cdrom.h calls this FLUSH CACHE |
| TEST UNIT READY | SPC-4 | O_RDONLY | O_RDONLY | All SCSI devices should respond to this command |
| VERIFY (10+16) | SBC-3, MMC-4 | O_RDWR | O_RDONLY |  |
| WRITE (6+10+12+16) | SBC-3 | O_RDWR | O_RDWR |  |
| WRITE LONG (10+16) | SBC-3 | O_RDWR | O_RDWR |  |
| WRITE VERIFY (10+16) | SBC-3, MMC-4 | O_RDWR | O_RDWR | only WRITE VERIFY(10) is in MMC-4 |

Any other SCSI command (opcode) not mentioned for the sg driver needs O_RDWR. Any other SCSI command (opcode) not mentioned for the block layer SG_IO ioctl needs a user with CAP_SYS_RAWIO capability. All "block" SG_IO ioctl calls on st device nodes need a user with CAP_SYS_RAWIO capability. If a user does not have sufficient permissions to execute a SCSI command via the SG_IO ioctl then the system call fails (i.e. no SCSI command is sent) and errno is set to EPERM (operation not permitted).

Both the sg driver and the block layer SG_IO code use internal tables to enforce the permissions shown in the above table (allow_ops and cmd_type [safe_for_read and safe_for_write] respectively). This technique doesn't scale well, since more advanced command sets (e.g. OSD) use service actions (and one opcode: 0x7f in the case of OSD). There may also be overlap in opcode usage between command sets, for example between SBC, MMC and SSC.
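The table-driven approach, and its weakness, can be illustrated with a short sketch. This is not the kernel's allow_ops or cmd_type code. The list below is a hand-picked subset of the commands that Table 3 shows the sg driver accepting with O_RDONLY, the opcode values are assumed from the SCSI standards, and the function names are hypothetical.

```c
#include <stddef.h>

/* Opcodes (assumed from SPC/SBC) that this sketch treats as
 * acceptable on a file descriptor opened O_RDONLY. */
static const unsigned char read_only_opcodes[] = {
    0x00,   /* TEST UNIT READY */
    0x03,   /* REQUEST SENSE */
    0x08,   /* READ(6) */
    0x12,   /* INQUIRY */
    0x1a,   /* MODE SENSE(6) */
    0x1c,   /* RECEIVE DIAGNOSTIC */
    0x25,   /* READ CAPACITY(10) */
    0x28,   /* READ(10) */
    0x3c,   /* READ BUFFER */
    0x3e,   /* READ LONG(10) */
    0x4d,   /* LOG SENSE */
    0x5a,   /* MODE SENSE(10) */
    0x88,   /* READ(16) */
    0x9e,   /* SERVICE ACTION IN */
    0xa0,   /* REPORT LUNS */
    0xa3,   /* MAINTENANCE COMMAND IN */
    0xa8    /* READ(12) */
};

/* Return 1 if the opcode appears in the read-only list. */
int opcode_ok_read_only(unsigned char opcode)
{
    size_t i;

    for (i = 0; i < sizeof(read_only_opcodes); i++)
        if (read_only_opcodes[i] == opcode)
            return 1;
    return 0;
}

/* Only the first cdb byte is examined. Everything selected by a
 * service action (cdb byte 1 for opcodes 0x9e and 0xa3, or the
 * variable length cdbs under opcode 0x7f) is invisible to this
 * check, which is the scaling problem described above. */
int command_permitted(const unsigned char *cdb, int opened_rdwr)
{
    return opened_rdwr || opcode_ok_read_only(cdb[0]);
}
```

With this scheme every service action under an allowed opcode is let through and every service action under a disallowed opcode is refused, regardless of what the individual service action does.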

CAP_SYS_RAWIO from a user process

While root processes usually have CAP_SYS_RAWIO, processes running under a user's ID (i.e. non-root) typically don't. Hence non-root processes may not be able to use SG_IO to send SCSI commands that require CAP_SYS_RAWIO. This may occur even if the permission bits of the device node file allow for read or write access: user processes will still receive EPERM when using SG_IO.

By default the capability to assign capabilities to other processes (CAP_SETPCAP) is limited to very few processes, such as certain kernel threads. Changing this default would require changing and recompiling the kernel.

Processes which are forked by a root process and call setuid later will lose the CAP_SYS_RAWIO capability the parent root process (and the child before the setuid) had. However, the child can preserve the capabilities of the root process in the permitted set and raise it after the call of setuid:

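/* needs <sys/prctl.h>, <sys/capability.h> and <unistd.h>; link with -lcap */
/* ... in child after fork(), still running as root ... */
prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0);    /* keep the permitted set across setuid() */
setuid(...);
cap_t caps = cap_from_text("cap_sys_rawio+ep");
cap_set_proc(caps);                    /* raise CAP_SYS_RAWIO in the effective set again */
cap_free(caps);                        /* release the libcap working storage */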

This way a user process with a parent root process can 'get back' the required capabilities to directly send SCSI commands to a device via SG_IO.

The above technique may be of use to daemons that are started with root permissions (most are) and then change to another user after a fork(). It is not obvious to the author how utilities that use the SG_IO ioctl on device nodes that require CAP_SYS_RAWIO for some or all SCSI commands (e.g. nodes associated with the sd and st drivers) can use the above technique.

SG_IO and the st driver

In order to implement its user space API, the st driver has to maintain information about where the read head is with respect to the structural elements of the tape (filemarks, beginning of tape, end of data). Because the streaming device SCSI commands don't have addresses, the st driver has to know what commands have been sent. When reading, the filemarks are noticed when a read fails and sense data is fetched. If SG_IO is mixed with tape commands, the st driver may lose information (it does not look at the SG_IO commands and results). Because of this, the st driver may not implement the semantics the user expects. If the user accepts this or knows when using SG_IO does not cause information loss, then using SG_IO is OK.

So mixing st driver read, write and ioctl commands with SCSI commands sent via SG_IO that change the state of the tape is not recommended. This applies whether the SG_IO SCSI commands are sent via st or sg device nodes.

Maximum transfer size per command

The largest amount of data that can be transferred by a single SCSI command is often a concern. Various SCSI command sets (e.g. SBC-3 for disk READs and WRITEs, SSC-3 for tape READs and WRITEs, and SPC-4 for READ+WRITE BUFFER) allow very large data transfer sizes but Linux is not so accommodating. The Host Bus Adapter (HBA) could have transfer size limits as could the transport and finally the SCSI device itself. In the latter case SBC-3 defines a "Block Limits" Vital Product Data (VPD) page while SSC has the READ BLOCK LIMITS SCSI command. SBC-3's optional Block Limits VPD page contains both maximum and optimal counts. In the author's opinion that latter distinction is very important: the block subsystem should try and use optimal sizes while pass through users should only be constrained by maximum sizes. Also if a pass through user exceeds a maximum transfer size imposed by a SCSI device, then the device can report an error. There is an underlying assumption that the applications using a pass through interface know what they are doing, or at least know more than the various kernel subsystems. On the other hand, the kernel has the responsibility to allocate critical shared resources such as memory.

In the past, Linux used a single, "big-enough", block of memory for the source or destination of large data transfers. Then scatter-gather lists were added to break transfers up into smaller (often "page" size (4 KB on i386 architecture)) chunks which made memory management easier for the kernel. Now, in the lk 2.6 series, the single block of memory option is being phased out.

The Linux SCSI subsystem imposes a 128 element limit on scatter gather lists via its SCSI_MAX_PHYS_SEGMENTS define. The way various memory pools are allocated by the linux SCSI subsystem, SCSI_MAX_PHYS_SEGMENTS could be increased to 256. Associated with each type of HBA there is normally a low level driver (LLD). Each LLD can further limit the maximum number of elements with the scsi_host_template::sg_tablesize field. Prior to lk 2.6.16 the sg and st drivers used the .sg_tablesize field only, since lk 2.6.16 those drivers are also constrained by SCSI_MAX_PHYS_SEGMENTS. This leads to a potential halving of the maximum transfer size. Many LLDs set the .sg_tablesize field to SG_ALL (which is 255) but they may as well set that field to 256 unless the HBA hardware has a constraint.

User space memory may be allocated as the source and/or destination for DMA transfers from the HBA (i.e. direct IO). Even if the user space allocated a large amount of memory with a single malloc(), the HBA DMA element typically has a different view of memory. This view may well contain many "page" size discontinuous pieces. This has the effect of using up, or perhaps exhausting, scatter-gather elements.
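A quick calculation shows how fast direct IO can consume scatter-gather elements. The sketch below counts the pages a user buffer touches, which is the worst case number of elements needed if no two pages happen to be physically contiguous. The function name is hypothetical and the one-element-per-page assumption is a deliberate simplification.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of pages touched by a buffer that starts at address "addr"
 * and is "len" bytes long. In the worst case each page needs its
 * own scatter-gather element. */
size_t pages_spanned(uintptr_t addr, size_t len, size_t page_size)
{
    uintptr_t first, last;

    if (len == 0)
        return 0;
    first = addr / page_size;
    last = (addr + len - 1) / page_size;
    return (size_t)(last - first + 1);
}
```

With 4 KB pages a page aligned 512 KB buffer touches 128 pages, exactly filling a 128 element list, while the same buffer starting 100 bytes into a page touches 129 pages and would not fit in the worst case.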

The sg driver attempts to build scatter gather lists with each element up to SG_SCATTER_SZ bytes large. This define is found in include/scsi/sg.h and has been set to 32 KB for some years. That is 8 times the page size (of 4 KB) on the i386 architecture. Some users who need really large transfers increase this define (and it is best to keep it a power of 2). However since lk 2.6.16 another limit comes into play: the MAX_SEGMENT_SIZE define which is set to 64 KB. MAX_SEGMENT_SIZE is a default and can be overridden by the LLD calling blk_queue_max_segment_size().

In lk 2.6.16 two further LLD parameters come into play even when the sg (and st) driver is used. These are scsi_host_template::max_sectors and scsi_host_template::use_clustering.

The .max_sectors setting in the LLD is the maximum number of 512 byte sectors allowed in a single SCSI command's scatter gather lists (for data transfers). Yes, that is a strange limit when trying to send a SCSI WRITE BUFFER command to upload firmware. Sysfs makes the LLD's .max_sectors setting visible (converted to kilobytes) in /sys/block/sd<x>/queue/max_hw_sectors_kb . The maximum allowable value in a LLD's .max_sectors seems to be 65535 (0xffff in hexadecimal). This limits the maximum transfer size to (32*1024*1024 - 512) bytes, assuming other limitations have been overcome. [The 65535 sector limit is because Scsi_Host::max_sectors has type "unsigned short". Hopefully this type is expanded to "int" in the future (or removed).]
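The limits discussed in this section combine roughly as sketched below. This is a back-of-envelope estimate, not kernel code: the struct and function names are hypothetical, and the estimate optimistically assumes that every scatter gather element can be filled to its maximum size.

```c
#include <stdint.h>

/* Hypothetical container for the limits named in the text. */
struct xfer_limits {
    unsigned int sg_tablesize;        /* LLD: scsi_host_template::sg_tablesize */
    unsigned int max_phys_segments;   /* SCSI_MAX_PHYS_SEGMENTS */
    unsigned int scatter_elem_bytes;  /* sg driver: SG_SCATTER_SZ */
    unsigned int max_segment_bytes;   /* MAX_SEGMENT_SIZE or the LLD override */
    unsigned int max_sectors;         /* LLD: scsi_host_template::max_sectors */
};

static uint64_t min_u64(uint64_t a, uint64_t b)
{
    return a < b ? a : b;
}

/* Estimated upper bound, in bytes, on one command's data transfer. */
uint64_t max_transfer_bytes(const struct xfer_limits *lim)
{
    uint64_t elems = min_u64(lim->sg_tablesize, lim->max_phys_segments);
    uint64_t elem_bytes = min_u64(lim->scatter_elem_bytes,
                                  lim->max_segment_bytes);
    uint64_t by_list = elems * elem_bytes;
    uint64_t by_sectors = (uint64_t)lim->max_sectors * 512;

    return min_u64(by_list, by_sectors);
}
```

With the values quoted in the text (sg_tablesize 255, 128 segments, 32 KB elements, a 64 KB segment limit and max_sectors at 65535) the scatter gather list is the binding constraint: 128 * 32 KB = 4 MB. A LLD with max_sectors of 1024 would instead cap transfers at 512 KB.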

The .use_clustering field should be set to ENABLE_CLUSTERING . If not, the block subsystem rebuilds the scatter gather list it gets from the sg driver with page size (e.g. 4 KB) elements. [Actually it does that anyway, but when ENABLE_CLUSTERING is set, it coalesces them again!]

Conclusion

In some situations, sending commands via the SG_IO ioctl may interfere with a higher level driver's use of a device. Users of the SG_IO ioctl should be aware that they are using a powerful, but low level facility, and write code accordingly. An example of this would be a utility to perform self tests on a disk: "background" self tests should be preferred over "foreground" self tests if there is a chance the computer may be using a file system on that disk at the time. Even a short foreground self test may take up to two minutes which is a long time to lock out a file system.


Last updated: 26th July 2008