NAME
RDS-rdma - Zerocopy Interface for RDMA over RDS
DESCRIPTION
This manual page describes the zerocopy interface of RDS, which was
added in RDSv3. For a description of the basic RDS interface, please
refer to rds(7).
The principal mode of operation for RDS zerocopy is like this: one
participant (the client) wishes to initiate a direct transfer to or
from some area of memory in its process address space. This memory
does not have to be aligned.
The client obtains a handle for this region of memory, and passes it to
the other participant (the server). This is called the RDMA cookie. To
the application, the cookie is an opaque 64-bit data type.
The client sends this handle to the server application, along with
other details of the RDMA request (such as which data to transfer to
that memory area). Throughout the following discussion, we will refer
to this message as the RDMA request.
The server uses this RDMA cookie to initiate the requested RDMA
transfer. The RDMA transfer is combined atomically with a normal RDS
message, which is delivered to the client. This message is called the
RDMA ACK throughout the following. Atomic in this context means that
either both the RDMA succeeds and the RDMA ACK is delivered, or neither
succeeds.
Thus, when the client receives the RDMA ACK, it knows that the RDMA has
completed successfully. It can then release the RDMA cookie for this
memory region, if it wishes to.
RDMA operations are not reliable, in the sense that unlike normal RDS
messages, RDS RDMA operations may fail, and get dropped.
INTERFACE
The interface is currently based on control messages (ancillary data)
sent or received via the sendmsg(2) and recvmsg(2) system calls.
Optionally, an older interface can be used that is based on the
setsockopt(2) system call. However, we recommend using control
messages, as this reduces the number of system calls required.
Control message interface
With the control message interface, the RDMA cookie is passed to the
server out-of-band, included in an extension header attached to the RDS
message.
The following outlines the mode of operation; the data types used will
be specified in detail in a subsequent section.
Initially, the client will send RDMA requests along with a
RDS_CMSG_RDMA_MAP control message. The control message contains the
address and length of the memory region for which to obtain a handle,
some flags, and a pointer to a memory location (in the caller’s address
space) where the kernel will store the RDMA cookie.
Alternatively, if the application has already obtained a RDMA cookie
for the memory range it wants to RDMA to/from, it can hand this cookie
to the kernel using the RDS_CMSG_RDMA_DEST control message.
Either way, the kernel will include the resulting RDMA cookie in an
extension header that is transmitted as part of the RDMA request to the
server.
When the server receives the RDMA request, the kernel will deliver the
cookie wrapped inside a RDS_CMSG_RDMA_DEST control message.
The server then initiates the data transfer by sending the RDMA ACK
message along with a RDS_CMSG_RDMA_ARGS control message. This message
contains the RDMA cookie, and the local memory to copy to or from.
The server process may request a notification when an RDMA operation
completes. Notifications are delivered as RDS_CMSG_RDMA_STATUS
control messages. When an application calls recvmsg(2), it will either
receive a regular RDS message (possibly with other RDMA related control
messages), or an empty message with one or more status control
messages.
In addition, when an RDMA operation fails for some reason and is
discarded, the application can ask to receive notifications for
failed messages as well, regardless of whether it asked for success
notification of an individual message or not. This behavior is turned
on by setting the RDS_RECVERR socket option.
Setsockopt interface
In addition to the control message interface, RDS allows a process to
register and release memory ranges for RDMA through calls to
setsockopt(2).
RDS_GET_MR
To obtain a RDMA cookie for a given memory range, the
application can use setsockopt with RDS_GET_MR. This operates
essentially the same way as the RDS_CMSG_RDMA_MAP control
message: the argument contains the address and length of the
memory range to be registered, and a pointer to a RDMA cookie
variable, in which the system call will store the cookie for the
registered range.
RDS_FREE_MR
Memory ranges can be released by calling setsockopt with
RDS_FREE_MR, giving the RDMA cookie and additional flags as
arguments.
RDS_RECVERR
This is a boolean option which can be set as well as queried
(using getsockopt). When enabled, RDS will send RDMA
notification messages to the application for any RDMA operation
that fails. This option defaults to off.
For all of these calls, the level argument to setsockopt is SOL_RDS.
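The two setsockopt calls above can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper names rds_register and rds_set_recverr are hypothetical, the numeric values of RDS_GET_MR and RDS_RECVERR are assumptions that real code should take from the system's RDS header, and the struct layouts are copied from the RDMA MACROS AND TYPES section (with uint64_t standing in for u_int64_t).

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>

/* Assumed values: real code should take these from the system's RDS
 * header rather than defining them by hand. */
#ifndef SOL_RDS
#define SOL_RDS      276
#endif
#define RDS_GET_MR   2
#define RDS_RECVERR  5

typedef uint64_t rds_rdma_cookie_t;

struct rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

struct rds_get_mr_args {
    struct rds_iovec vec;
    uint64_t cookie_addr;
    uint64_t flags;
};

/* Hypothetical helper: register [buf, buf + len) and let the kernel
 * store the resulting cookie in *cookie.
 * Returns 0 on success, -1 with errno set on failure. */
static int rds_register(int fd, void *buf, size_t len, uint64_t flags,
                        rds_rdma_cookie_t *cookie)
{
    struct rds_get_mr_args args = {
        .vec = { .addr = (uint64_t)(uintptr_t)buf, .bytes = len },
        .cookie_addr = (uint64_t)(uintptr_t)cookie,
        .flags = flags,
    };
    return setsockopt(fd, SOL_RDS, RDS_GET_MR, &args, sizeof(args));
}

/* Hypothetical helper: switch failure notifications on (nonzero)
 * or off (zero). */
static int rds_set_recverr(int fd, int on)
{
    return setsockopt(fd, SOL_RDS, RDS_RECVERR, &on, sizeof(on));
}
```

Both helpers simply forward the return value of setsockopt, so the caller inspects errno as usual.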
RDMA MACROS AND TYPES
RDMA cookie
typedef u_int64_t rds_rdma_cookie_t;
This encapsulates a memory location in the client process. In
the current implementation, it contains the R_Key of the remote
memory region, and the offset into it (so that the application
does not have to worry about alignment).
The RDMA cookie is used in several struct types described below.
The RDS_CMSG_RDMA_DEST control message contains a
rds_rdma_cookie_t all by itself as payload.
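On the receiving side, the cookie can be pulled out of the ancillary data roughly as sketched below. The helper name rds_find_cookie is hypothetical, the value of RDS_CMSG_RDMA_DEST is an assumption, and so is the expectation that the control message arrives at level SOL_RDS; check the system's RDS header for the real definitions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Assumed values; take them from the system's RDS header. */
#ifndef SOL_RDS
#define SOL_RDS             276
#endif
#define RDS_CMSG_RDMA_DEST  2

typedef uint64_t rds_rdma_cookie_t;

/* Hypothetical helper: scan the control messages of a received
 * message. Returns 1 and fills *cookie if an RDS_CMSG_RDMA_DEST
 * message is present, 0 otherwise. */
static int rds_find_cookie(struct msghdr *msg, rds_rdma_cookie_t *cookie)
{
    struct cmsghdr *cmsg;

    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_RDS &&
            cmsg->cmsg_type == RDS_CMSG_RDMA_DEST &&
            cmsg->cmsg_len >= CMSG_LEN(sizeof(*cookie))) {
            /* The payload is the cookie all by itself. */
            memcpy(cookie, CMSG_DATA(cmsg), sizeof(*cookie));
            return 1;
        }
    }
    return 0;
}
```

The memcpy avoids any assumption about the alignment of the control buffer payload.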
Mapping arguments
The following data type is used with RDS_CMSG_RDMA_MAP control
messages and with the RDS_GET_MR socket option:
struct rds_iovec {
    u_int64_t addr;
    u_int64_t bytes;
};
struct rds_get_mr_args {
    struct rds_iovec vec;
    u_int64_t cookie_addr;
    u_int64_t flags;
};
The cookie_addr field specifies the memory location where the
kernel stores the RDMA cookie.
The flags value is a bitwise OR of any of the following flags:
RDS_RDMA_USE_ONCE
This tells the kernel that the allocated RDMA cookie is
to be used exactly once. When the RDMA ACK message
arrives, the kernel will automatically unbind the memory
area and release any resources associated with the
cookie.
If this flag is not set, it is the application’s
responsibility to release the memory region at a later
time using the RDS_FREE_MR socket option.
RDS_RDMA_INVALIDATE
Normally, RDMA memory mappings are invalidated lazily, as
this requires some relatively costly synchronization with
the HCA. However, this means that the server application
can continue to access the registered memory for some
indeterminate amount of time. If this flag is set, the
RDS code will invalidate the mapping at the time it is
released (either upon arrival of the RDMA ACK, if
USE_ONCE was specified; or when the application destroys
it using FREE_MR).
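As an illustration of the mapping arguments, the sketch below attaches an RDS_CMSG_RDMA_MAP control message to a struct msghdr before it is handed to sendmsg(2). The helper name rds_attach_map is hypothetical, and the numeric values of RDS_CMSG_RDMA_MAP and RDS_RDMA_USE_ONCE are assumptions; real code should use the definitions from the system's RDS header.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Assumed values; take them from the system's RDS header. */
#ifndef SOL_RDS
#define SOL_RDS            276
#endif
#define RDS_CMSG_RDMA_MAP  3
#define RDS_RDMA_USE_ONCE  0x0008

typedef uint64_t rds_rdma_cookie_t;

struct rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

struct rds_get_mr_args {
    struct rds_iovec vec;
    uint64_t cookie_addr;
    uint64_t flags;
};

/* Hypothetical helper: make ctl (ctl_len bytes, aligned for a
 * struct cmsghdr) the control buffer of msg and fill it with one
 * RDS_CMSG_RDMA_MAP message describing [buf, buf + len).
 * Returns 0 on success, -1 if ctl is too small. */
static int rds_attach_map(struct msghdr *msg, void *ctl, size_t ctl_len,
                          void *buf, size_t len, uint64_t flags,
                          rds_rdma_cookie_t *cookie)
{
    struct rds_get_mr_args args;
    struct cmsghdr *cmsg;

    if (ctl_len < CMSG_SPACE(sizeof(args)))
        return -1;

    memset(&args, 0, sizeof(args));
    args.vec.addr = (uint64_t)(uintptr_t)buf;
    args.vec.bytes = len;
    args.cookie_addr = (uint64_t)(uintptr_t)cookie;
    args.flags = flags;

    memset(ctl, 0, CMSG_SPACE(sizeof(args)));
    msg->msg_control = ctl;
    msg->msg_controllen = CMSG_SPACE(sizeof(args));

    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = SOL_RDS;
    cmsg->cmsg_type = RDS_CMSG_RDMA_MAP;
    cmsg->cmsg_len = CMSG_LEN(sizeof(args));
    memcpy(CMSG_DATA(cmsg), &args, sizeof(args));
    return 0;
}
```

The caller still has to fill in msg_name and msg_iov with the destination and the RDMA request itself before calling sendmsg(2). Passing RDS_RDMA_USE_ONCE in flags avoids a later RDS_FREE_MR call.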
RDMA Operation
RDMA operations are initiated by the server using the
RDS_CMSG_RDMA_ARGS control message, which takes the following
data as payload:
struct rds_rdma_args {
    rds_rdma_cookie_t cookie;
    struct rds_iovec remote_vec;
    u_int64_t local_vec_addr;
    u_int64_t nr_local;
    u_int64_t flags;
    u_int32_t user_token;
};
The cookie argument contains the RDMA cookie received from the
client. The local memory is given via an array of rds_iovecs.
The array address is given in local_vec_addr, and its number of
elements is given in nr_local.
The struct member remote_vec specifies a location relative to
the memory area identified by the cookie: remote_vec.addr is an
offset into that region, and remote_vec.bytes is the length of
the memory window to copy to/from. This length must match the
size of the local memory area, i.e. the sum of bytes in all
members of the local iovec.
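The length rule can be checked in user space before the request is submitted. The following is a small sketch with a hypothetical helper name (rds_lengths_match); it only restates the rule above and does not reproduce what the kernel does internally.

```c
#include <stddef.h>
#include <stdint.h>

struct rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

/* Hypothetical helper: returns 1 if the local segments add up to
 * exactly remote->bytes, 0 otherwise (including on overflow). */
static int rds_lengths_match(const struct rds_iovec *remote,
                             const struct rds_iovec *local,
                             size_t nr_local)
{
    uint64_t total = 0;

    for (size_t i = 0; i < nr_local; i++) {
        if (local[i].bytes > UINT64_MAX - total)
            return 0;               /* sum would overflow */
        total += local[i].bytes;
    }
    return total == remote->bytes;
}
```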
The flags field contains the bitwise OR of any of the following
flags:
RDS_RDMA_READWRITE
If set, an RDMA WRITE is initiated from the server’s
memory to the client’s. If not set, RDS will do an RDMA
READ from the client’s memory to the server’s memory.
RDS_RDMA_FENCE
By default, InfiniBand makes no guarantee about the
ordering of an RDMA READ with respect to subsequent SEND
operations. Setting this flag asks that the RDMA READ
be fenced off from the subsequent RDMA ACK message.
Setting this flag requires an additional round-trip of
the IB fabric, but it is a good idea to set this flag
by default, unless you are really sure you do not want
it.
RDS_RDMA_NOTIFY_ME
This flag requests a notification upon completion of the
RDMA operation (successful or otherwise). The notification
will contain the value of the user_token field passed in
by the application. This allows the application to
release resources (such as buffers) associated with the
RDMA transfer.
The user_token can be used to pass an application-specific
identifier to the kernel. This token is returned to the
application when a status notification is generated (see the
following section).
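Putting the fields and flags of this section together, a server might fill in the RDMA arguments along the lines of the sketch below. The helper name rds_make_args is hypothetical, the flag values are assumptions that should come from the system's RDS header, and the struct layout is copied from the text above (with uint64_t and uint32_t standing in for the u_int types). The sketch always sets RDS_RDMA_FENCE and RDS_RDMA_NOTIFY_ME, which is one possible policy, not a requirement.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed flag values; take them from the system's RDS header. */
#define RDS_RDMA_READWRITE  0x0001
#define RDS_RDMA_FENCE      0x0002
#define RDS_RDMA_NOTIFY_ME  0x0020

typedef uint64_t rds_rdma_cookie_t;

struct rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

/* Layout copied from the text above. */
struct rds_rdma_args {
    rds_rdma_cookie_t cookie;
    struct rds_iovec remote_vec;
    uint64_t local_vec_addr;
    uint64_t nr_local;
    uint64_t flags;
    uint32_t user_token;
};

/* Hypothetical helper: describe a transfer between the local
 * segments and the window starting at `offset` inside the region
 * named by `cookie`. The window length is derived from the local
 * segments, so the two sizes match by construction. */
static struct rds_rdma_args rds_make_args(rds_rdma_cookie_t cookie,
                                          uint64_t offset,
                                          const struct rds_iovec *local,
                                          size_t nr_local,
                                          int is_write, uint32_t token)
{
    struct rds_rdma_args args = {0};
    uint64_t total = 0;

    for (size_t i = 0; i < nr_local; i++)
        total += local[i].bytes;

    args.cookie = cookie;
    args.remote_vec.addr = offset;
    args.remote_vec.bytes = total;
    args.local_vec_addr = (uint64_t)(uintptr_t)local;
    args.nr_local = nr_local;
    args.flags = RDS_RDMA_FENCE | RDS_RDMA_NOTIFY_ME;
    if (is_write)
        args.flags |= RDS_RDMA_READWRITE;   /* server -> client */
    args.user_token = token;
    return args;
}
```

The resulting struct is the payload of an RDS_CMSG_RDMA_ARGS control message attached to the RDMA ACK. The local array must stay valid until sendmsg(2) returns.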
RDMA Notification
The RDS kernel code is able to notify the server application
when an RDMA operation completes. These notifications are
delivered via RDS_CMSG_RDMA_STATUS control messages.
By default, no notifications are generated. There are two ways
an application can request them. On one hand, status
notifications can be enabled on a per-operation basis by setting
the RDS_RDMA_NOTIFY_ME flag in the RDMA arguments. On the other
hand, the application can request notifications for all RDMA
operations that fail by setting the RDS_RECVERR socket option
(see above). In both cases, the format of the notification is
the same; and at most one notification will be sent per
completed operation.
The message format is this:
struct rds_rdma_notify {
    u_int32_t user_token;
    int32_t status;
};
The user_token field contains the value previously given to the
kernel in the RDS_CMSG_RDMA_ARGS control message. The status
field contains a status value, with 0 indicating success, and
non-zero indicating an error.
The following status codes are currently defined:
RDS_RDMA_SUCCESS
The RDMA operation succeeded.
RDS_RDMA_REMOTE_ERROR
The RDMA operation failed due to a remote access error.
This is usually due to an invalid R_key, offset or
transfer size.
RDS_RDMA_CANCELED
The RDMA operation was canceled by the application.
(This error code is not yet generated).
RDS_RDMA_DROPPED
RDMA operations were discarded after the connection broke
and was re-established. The RDMA operation may have been
processed partially.
RDS_RDMA_OTHER_ERROR
Any other failure.
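A receiver might collect and decode status notifications as sketched below. The helper names rds_status_name and rds_collect_status are hypothetical. The values of RDS_CMSG_RDMA_STATUS and of the non-zero status codes are assumptions (only 0 for success is stated above), as is the expectation that the control messages arrive at level SOL_RDS. The struct layout follows the text above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Assumed values; take them from the system's RDS header. */
#ifndef SOL_RDS
#define SOL_RDS                276
#endif
#define RDS_CMSG_RDMA_STATUS   4
#define RDS_RDMA_SUCCESS       0
#define RDS_RDMA_REMOTE_ERROR  1
#define RDS_RDMA_CANCELED      2
#define RDS_RDMA_DROPPED       3
#define RDS_RDMA_OTHER_ERROR   4

/* Layout copied from the text above. */
struct rds_rdma_notify {
    uint32_t user_token;
    int32_t status;
};

/* Hypothetical helper: human-readable name for a status code. */
static const char *rds_status_name(int32_t status)
{
    switch (status) {
    case RDS_RDMA_SUCCESS:      return "success";
    case RDS_RDMA_REMOTE_ERROR: return "remote access error";
    case RDS_RDMA_CANCELED:     return "canceled";
    case RDS_RDMA_DROPPED:      return "dropped";
    case RDS_RDMA_OTHER_ERROR:  return "other error";
    default:                    return "unknown";
    }
}

/* Hypothetical helper: copy up to max status notifications found
 * in the control data of msg into out. Returns the number copied. */
static size_t rds_collect_status(struct msghdr *msg,
                                 struct rds_rdma_notify *out, size_t max)
{
    struct cmsghdr *cmsg;
    size_t n = 0;

    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL && n < max;
         cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level != SOL_RDS ||
            cmsg->cmsg_type != RDS_CMSG_RDMA_STATUS ||
            cmsg->cmsg_len < CMSG_LEN(sizeof(out[0])))
            continue;               /* some other control message */
        memcpy(&out[n++], CMSG_DATA(cmsg), sizeof(out[0]));
    }
    return n;
}
```

The application can then use each user_token to find the buffers belonging to the completed operation and release them.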
RDMA setsockopt arguments
When using the RDS_GET_MR socket option to register a memory
range, the application passes a pointer to a struct
rds_get_mr_args variable, described above.
The RDS_FREE_MR call takes an argument of type struct
rds_free_mr_args:
struct rds_free_mr_args {
    rds_rdma_cookie_t cookie;
    u_int64_t flags;
};
cookie specifies the RDMA cookie to be released. RDMA access to
the memory range will usually not be revoked instantly, because
the operation is rather costly. However, if the flags argument
contains RDS_RDMA_INVALIDATE, RDS will invalidate the indicated
mapping immediately, as described in section Mapping arguments
above.
If the cookie argument is 0, and RDS_RDMA_INVALIDATE is set, RDS
will invalidate old memory mappings on all devices.
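Releasing a mapping can be sketched as follows. The helper names rds_release and rds_flush_stale_mappings are hypothetical, and the numeric values of RDS_FREE_MR and RDS_RDMA_INVALIDATE are assumptions that real code should take from the system's RDS header.

```c
#include <stdint.h>
#include <sys/socket.h>

/* Assumed values; take them from the system's RDS header. */
#ifndef SOL_RDS
#define SOL_RDS              276
#endif
#define RDS_FREE_MR          3
#define RDS_RDMA_INVALIDATE  0x0004

typedef uint64_t rds_rdma_cookie_t;

struct rds_free_mr_args {
    rds_rdma_cookie_t cookie;
    uint64_t flags;
};

/* Hypothetical helper: release the range named by cookie. If
 * invalidate_now is nonzero, ask for immediate invalidation.
 * Returns 0 on success, -1 with errno set on failure. */
static int rds_release(int fd, rds_rdma_cookie_t cookie, int invalidate_now)
{
    struct rds_free_mr_args args = {
        .cookie = cookie,
        .flags = invalidate_now ? RDS_RDMA_INVALIDATE : 0,
    };
    return setsockopt(fd, SOL_RDS, RDS_FREE_MR, &args, sizeof(args));
}

/* Hypothetical helper: cookie 0 plus RDS_RDMA_INVALIDATE asks RDS to
 * invalidate old mappings on all devices, as described above. */
static int rds_flush_stale_mappings(int fd)
{
    return rds_release(fd, 0, 1);
}
```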
ERRORS
In addition to the usual error codes returned by sendmsg, recvmsg and
setsockopt, RDS returns the following error codes:
EAGAIN RDS was unable to map a memory range because the limit was
exceeded (returned by RDS_CMSG_RDMA_MAP and RDS_GET_MR).
EINVAL When sending a message, there were conflicting control
messages (e.g. two RDMA_MAP messages, or a RDMA_MAP and a
RDMA_DEST message).
In a RDS_CMSG_RDMA_MAP or RDS_GET_MR operation, the application
specified a memory range greater than the maximum size supported.
When setting up an RDMA operation with RDS_CMSG_RDMA_ARGS, the
size of the local memory (given in the rds_iovec) did not match
the size of the remote memory range.
EBUSY RDS was unable to obtain a DMA mapping for the indicated memory.
LIMITS
Currently, the following limits apply:
· The maximum size of a zerocopy transfer is 1MB. This can be
adjusted via the fmr_message_size module parameter.
· The maximum number of memory ranges that can be mapped is
limited to 2048 at the moment. This can be adjusted via the
fmr_pool_size module parameter. However, the actual limit
imposed by the hardware may in fact be lower.
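Because a single zerocopy transfer is bounded, an application moving a larger buffer has to split it into several RDMA operations. The sketch below computes how many are needed. The macro RDS_ASSUMED_MAX_XFER and the helper rds_transfers_needed are hypothetical names, and the 1MB figure is only the default quoted above; a system with a different fmr_message_size setting needs a different value.

```c
#include <stdint.h>

/* Default limit quoted in the text; an assumption for any given
 * system, since it can be changed by a module parameter. */
#define RDS_ASSUMED_MAX_XFER  (1024u * 1024u)

/* Hypothetical helper: number of RDMA operations needed to move
 * len bytes when each operation carries at most max_xfer bytes.
 * Returns 0 if max_xfer is 0. */
static uint64_t rds_transfers_needed(uint64_t len, uint64_t max_xfer)
{
    if (max_xfer == 0)
        return 0;
    return len / max_xfer + (len % max_xfer != 0);
}
```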
AUTHORS
RDS was written and is Copyright (C) 2007-2008 by Oracle, Inc.
RDS zerocopy(7)