Support for ibverbs
Receiver performance can be significantly improved by using the InfiniBand Verbs API instead of the BSD sockets API. This is currently only tested on Linux with ConnectX® NICs. It depends on device-managed flow steering (DMFS).
There are a number of limitations in the current implementation:
- Only IPv4 is supported.
- VLAN tagging, IP optional headers, and IP fragmentation are not supported.
- For sending, only multicast is supported.
Within these limitations, it is quite easy to take advantage of this faster
code path. The main difficulties are that one must specify the IP address of
the interface that will send or receive the packets, and that the CAP_NET_RAW
capability may be needed. The netifaces2 module can help find the IP address
for an interface by name, and the spead2_net_raw tool simplifies the process
of getting the CAP_NET_RAW capability.
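As an illustration of the interface lookup mentioned above, the following is a minimal sketch. It assumes that netifaces2 installs a module named netifaces which keeps the classic ifaddresses() interface (a mapping from address family to a list of dictionaries with an "addr" key). The helper names first_ipv4 and interface_address are made up for this example.

```python
import socket


def first_ipv4(addresses, family=socket.AF_INET):
    """Pick the first IPv4 address from a netifaces-style mapping.

    `addresses` is assumed to map an address family to a list of
    dictionaries, each holding an "addr" entry.
    """
    entries = addresses.get(family, [])
    if not entries:
        raise ValueError("interface has no IPv4 address")
    return entries[0]["addr"]


def interface_address(name):
    """Return the IPv4 address of the interface called `name`."""
    # Imported lazily so that first_ipv4 can be used without netifaces2.
    import netifaces  # assumption: provided by the netifaces2 package

    return first_ipv4(netifaces.ifaddresses(name))
```

The returned string can then be passed as the interface_address of the configuration classes described below.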
System configuration
ConnectX®-3
Add the following to /etc/modprobe.d/mlnx.conf:
options ib_uverbs disable_raw_qp_enforcement=1
options mlx4_core fast_drop=1
options mlx4_core log_num_mgm_entry_size=-1
Note
Setting log_num_mgm_entry_size to -7 instead of -1 will activate faster
static device-managed flow steering. This has some limitations (refer to the
manual for details), but can improve performance when capturing a large
number of multicast groups.
ConnectX®-4+, MLNX OFED up to 4.9
Add the following to /etc/modprobe.d/mlnx.conf:
options ib_uverbs disable_raw_qp_enforcement=1
All other cases
No system configuration is needed, but the CAP_NET_RAW capability is
required. Running as root will achieve this; a full discussion of Linux
capabilities is beyond the scope of this manual. The spead2_net_raw
utility can also be used to give users access to this capability without
exposing full root access.
For more information, see the libvma documentation.
Multicast loopback
By default, multicast traffic sent using ibverbs can also be received on the
same port. While convenient, this is a slow path in the NIC, and can limit
performance. To disable this loopback, write 1 to
/sys/class/net/interface/settings/force_local_lb_disable, where interface is
the name of the network interface (note that the setting does not persist
across reboots).
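The write described above could be scripted roughly as follows. This is only a sketch: the function names and the sysfs_root parameter (which exists so that the code can be exercised against a scratch directory) are inventions of this example, and writing to the real file normally requires root privileges.

```python
from pathlib import Path


def loopback_setting_path(interface, sysfs_root="/sys/class/net"):
    """Build the path of the force_local_lb_disable setting for `interface`."""
    return Path(sysfs_root) / interface / "settings" / "force_local_lb_disable"


def disable_multicast_loopback(interface, sysfs_root="/sys/class/net"):
    """Write 1 to the setting, which disables multicast loopback."""
    path = loopback_setting_path(interface, sysfs_root)
    path.write_text("1\n")
    return path
```

Because the setting is lost on reboot, a call such as disable_multicast_loopback("eth0") (with the real interface name substituted) would need to be repeated after each boot.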
Receiving
The ibverbs API can be used programmatically by using an extra method of
spead2.recv.Stream. The configuration is specified using a
spead2.recv.UdpIbvConfig.
- class spead2.recv.UdpIbvConfig(*, endpoints=[], interface_address='', buffer_size=DEFAULT_BUFFER_SIZE, max_size=DEFAULT_MAX_SIZE, comp_vector=0, max_poll=DEFAULT_MAX_POLL)
- Parameters:
interface_address (str) – Hostname/IP address of the interface which will be subscribed
buffer_size (int) – Requested memory allocation for work requests. It may be adjusted to an integer number of packets.
max_size (int) – Maximum packet size that will be accepted
comp_vector (int) – Completion channel vector (interrupt) for asynchronous operation, or a negative value to poll continuously. Polling should not be used if there are other users of the thread pool. If a non-negative value is provided, it is taken modulo the number of available completion vectors. This allows a number of streams to be assigned sequential completion vectors and have them load-balanced, without concern for the number available.
max_poll (int) – Maximum number of times to poll in a row, without waiting for an interrupt (if comp_vector is non-negative) or letting other code run on the thread (if comp_vector is negative).
The constructor arguments are also instance attributes. Note that they are implemented as properties that return copies of the state, which means that mutating endpoints (for example, with append()) will not have any effect as only the copy will be modified. The entire list must be assigned to update it.
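The copy semantics can be surprising, so the following sketch mimics them with a small stand-in class. CopyingConfig is hypothetical and is not spead2 code; it only reproduces the behaviour described above (a property that returns a copy of the stored list). The addresses are placeholders.

```python
class CopyingConfig:
    """Hypothetical stand-in whose `endpoints` property returns a copy."""

    def __init__(self, endpoints=()):
        self._endpoints = list(endpoints)

    @property
    def endpoints(self):
        # A new list is returned on every access.
        return list(self._endpoints)

    @endpoints.setter
    def endpoints(self, value):
        self._endpoints = list(value)


config = CopyingConfig([("239.255.88.88", 8888)])

# This appends to a temporary copy, so the configuration is unchanged.
config.endpoints.append(("239.255.88.89", 8888))
print(len(config.endpoints))  # prints 1

# Assigning a whole list replaces the stored state.
config.endpoints = config.endpoints + [("239.255.88.89", 8888)]
print(len(config.endpoints))  # prints 2
```

The same assignment pattern applies to the endpoints attribute of the real configuration class.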
- spead2.recv.Stream.add_udp_ibv_reader(config)
Feed data from IPv4 traffic.
If supported by the NIC and the drivers, the receive code will automatically use a “multi-packet receive queue”, which allows each packet to consume only the amount of space needed in the buffer. This is currently only supported on ConnectX®-4+ with MLNX OFED drivers 5.0 or later (or upstream rdma-core). When in use, the max_size parameter has little impact on performance, and is used only to reject larger packets.
When multi-packet receive queues are not supported, performance can be improved by making max_size as small as possible for the intended data stream. This will increase the number of packets that can be buffered (because the buffer is divided into fixed-size slots), and also improve memory efficiency by keeping data more-or-less contiguous.
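To tie the pieces together, here is a sketch of attaching an ibverbs reader to a stream. The helper names (ibv_reader_kwargs, add_ibv_reader) are made up for this example. Using the stream index as comp_vector follows the load-balancing suggestion in the parameter description, and max_size is only passed when the caller chooses a value. Running add_ibv_reader requires spead2 built with ibverbs support and a suitable NIC.

```python
def ibv_reader_kwargs(groups, port, interface_address, stream_index=0, max_size=None):
    """Collect keyword arguments for spead2.recv.UdpIbvConfig.

    `groups` is a list of multicast group addresses that share one port.
    """
    kwargs = {
        "endpoints": [(group, port) for group in groups],
        "interface_address": interface_address,
        # Sequential values are taken modulo the number of available vectors.
        "comp_vector": stream_index,
    }
    if max_size is not None:
        # Keep this small when multi-packet receive queues are unavailable.
        kwargs["max_size"] = max_size
    return kwargs


def add_ibv_reader(stream, **kwargs):
    """Attach an ibverbs reader to an existing spead2.recv.Stream."""
    # Imported lazily so that the helper above works without spead2.
    import spead2.recv

    stream.add_udp_ibv_reader(spead2.recv.UdpIbvConfig(**kwargs))
```

A caller would create a stream in the usual way, for example spead2.recv.Stream(spead2.ThreadPool()), and then call add_ibv_reader(stream, **ibv_reader_kwargs(["239.255.88.88"], 8888, "10.0.0.1")), substituting real group, port and interface addresses for these placeholders.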
Environment variables
An existing application can be forced to use ibverbs for all IPv4
readers, by setting the environment variable SPEAD2_IBV_INTERFACE to the IP
address of the interface to receive the packets. Note that calls to
spead2.recv.Stream.add_udp_reader() that pass an explicit interface
will use that interface, overriding SPEAD2_IBV_INTERFACE; in this case,
SPEAD2_IBV_INTERFACE serves only to enable the override.
It is also possible to specify SPEAD2_IBV_COMP_VECTOR to override the
completion channel vector from the default.
Note that this environment variable currently has no effect on senders.
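When the application is launched from another Python program, the variables can be placed in the child's environment. The sketch below only builds that environment; the function name ibv_environment is made up for this example, and the address is a placeholder.

```python
import os


def ibv_environment(interface_address, comp_vector=None, base=None):
    """Return a copy of the environment with the ibverbs overrides set."""
    env = dict(os.environ if base is None else base)
    env["SPEAD2_IBV_INTERFACE"] = interface_address
    if comp_vector is not None:
        # Environment values must be strings.
        env["SPEAD2_IBV_COMP_VECTOR"] = str(comp_vector)
    return env
```

The result can be passed as the env argument of subprocess.run() when starting the existing receiver, for example env=ibv_environment("10.0.0.1", comp_vector=2).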
Sending
Sending is done by using the class spead2.send.UdpIbvStream instead
of spead2.send.UdpStream. It has a different constructor, but the
same methods. There is also a spead2.send.asyncio.UdpIbvStream
class, analogous to spead2.send.asyncio.UdpStream.
There is an additional configuration class for ibverbs-specific configuration:
- class spead2.send.UdpIbvConfig(*, endpoints=[], interface_address='', buffer_size=DEFAULT_BUFFER_SIZE, ttl=1, comp_vector=0, max_poll=DEFAULT_MAX_POLL, memory_regions=[])
- Parameters:
endpoints (List[Tuple[str, int]]) – Peer endpoints (one per substream)
interface_address (str) – Hostname/IP address of the interface that will send the packets
buffer_size (int) – Requested memory allocation for work requests. It may be adjusted to an integer number of packets.
ttl (int) – Multicast TTL
comp_vector (int) – Completion channel vector (interrupt) for asynchronous operation, or a negative value to poll continuously. Polling should not be used if there are other users of the thread pool. If a non-negative value is provided, it is taken modulo the number of available completion vectors. This allows a number of streams to be assigned sequential completion vectors and have them load-balanced, without concern for the number available.
max_poll (int) – Maximum number of times to poll in a row, without waiting for an interrupt (if comp_vector is non-negative) or letting other code run on the thread (if comp_vector is negative).
memory_regions (List[object]) – Objects implementing the buffer protocol that will be used to hold item data. This is not required, but data stored in these buffers may be transmitted directly without requiring a copy, yielding higher performance. There may be platform-specific limitations on the size and number of these buffers.
The constructor arguments are also instance attributes. Note that they are implemented as properties that return copies of the state, which means that mutating endpoints or memory_regions (for example, with append()) will not have any effect as only the copy will be modified. The entire list must be assigned to update it.
- class spead2.send.UdpIbvStream(thread_pool, config, udp_ibv_config)
Create a multicast IPv4 UDP stream using the ibverbs API
- Parameters:
thread_pool (spead2.ThreadPool) – Thread pool handling the I/O
config (spead2.send.StreamConfig) – Stream configuration
udp_ibv_config (spead2.send.UdpIbvConfig) – Additional stream configuration
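A sender might be constructed along the following lines. This is a sketch rather than a definitive recipe: the helper names are made up for this example, and the endpoint check merely reflects the limitations listed at the top of this page (IPv4 only, multicast only for sending). The check assumes that endpoints are given as literal IP addresses rather than hostnames. Creating the stream requires spead2 built with ibverbs support and a suitable NIC.

```python
import ipaddress


def check_multicast_ipv4(endpoints):
    """Return `endpoints` as a list, rejecting non-IPv4 or non-multicast hosts."""
    for host, port in endpoints:
        address = ipaddress.ip_address(host)  # raises ValueError for hostnames
        if address.version != 4 or not address.is_multicast:
            raise ValueError(f"{host} is not an IPv4 multicast address")
    return list(endpoints)


def make_ibv_sender(endpoints, interface_address, ttl=1):
    """Create a spead2.send.UdpIbvStream with default stream configuration."""
    # Imported lazily so that the check above works without spead2.
    import spead2
    import spead2.send

    ibv_config = spead2.send.UdpIbvConfig(
        endpoints=check_multicast_ipv4(endpoints),
        interface_address=interface_address,
        ttl=ttl,
    )
    return spead2.send.UdpIbvStream(
        spead2.ThreadPool(), spead2.send.StreamConfig(), ibv_config
    )
```

For example, make_ibv_sender([("239.255.88.88", 8888)], "10.0.0.1") would create a stream with a single substream; both addresses are placeholders. The resulting object has the same methods as spead2.send.UdpStream.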