Paravirtualized RDMA Device (PVRDMA)
====================================

1. Description
==============
PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
It works with its Linux kernel driver AS IS; no special guest modifications
are needed.

While it complies with the VMware device, it can also communicate with bare
metal RDMA-enabled machines. It does not require an RDMA HCA in the host;
it can work with Soft-RoCE (rxe).

It does not require the whole guest RAM to be pinned, allowing memory
over-commit, and, although not implemented yet, migration support will be
possible with some HW assistance.

A project presentation accompanies this document:
- http://events.linuxfoundation.org/sites/events/files/slides/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf


2. Setup
========

2.1 Guest setup
===============
Fedora 27+ kernels work out of the box; older distributions
require updating the kernel to 4.14 to include the pvrdma driver.

However, the libpvrdma library needed by user-level software is still
not available as part of the distributions, so the rdma-core library
needs to be compiled and optionally installed.

Please follow the instructions at:
  https://github.com/linux-rdma/rdma-core.git


2.2 Host setup
==============
The pvrdma backend is an ibdevice interface that can be exposed
either by a Soft-RoCE (rxe) device on machines with no RDMA device,
or by an HCA SR-IOV function (VF/PF).
Note that ibdevice interfaces can't be shared between pvrdma devices;
each one requires a separate instance (rxe or SR-IOV VF).

2.2.1 Soft-RoCE backend (rxe)
=============================
A stable version of rxe is required; Fedora 27+ or a Linux
kernel 4.14+ is preferred.

The rdma_rxe module is part of the Linux kernel but not loaded by default.
Install the user-level library (librxe) following the instructions from:
https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home

Associate an ETH interface with rxe by running:
   rxe_cfg add eth0
An rxe0 ibdevice interface will be created and can be used as the pvrdma backend.


2.2.2 RDMA device Virtual Function backend
==========================================
Nothing special is required; the pvrdma device can work not only with
Ethernet links, but also with InfiniBand links.
All that is needed is an ibdevice with an active port; for Mellanox cards
it will be something like mlx5_6, which can then be used as the backend.
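
To find a suitable ibdevice and check that its port is active, the
ibv_devinfo tool can be used; alternatively, here is a minimal C sketch
over libibverbs (checking only port 1 is an assumption of the sketch):

    #include <stdio.h>
    #include <infiniband/verbs.h>

    /* Minimal sketch: list local ibdevices and report whether their
     * first port is active, to find a pvrdma backend candidate
     * (e.g. rxe0 or mlx5_6). */
    int main(void)
    {
        int num, i;
        struct ibv_device **list = ibv_get_device_list(&num);

        if (!list) {
            return 1;
        }
        for (i = 0; i < num; i++) {
            struct ibv_context *ctx = ibv_open_device(list[i]);
            struct ibv_port_attr port;

            if (ctx && !ibv_query_port(ctx, 1, &port)) {
                printf("%s: port 1 is %sactive\n",
                       ibv_get_device_name(list[i]),
                       port.state == IBV_PORT_ACTIVE ? "" : "not ");
            }
            if (ctx) {
                ibv_close_device(ctx);
            }
        }
        ibv_free_device_list(list);
        return 0;
    }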


2.2.3 QEMU setup
================
Configure QEMU with the --enable-rdma flag after installing
the required RDMA libraries.


3. Usage
========
Currently the device works only with memory-backed RAM,
and it must be marked as "shared":
   -m 1G \
   -object memory-backend-ram,id=mb1,size=1G,share \
   -numa node,memdev=mb1 \

The pvrdma device is composed of two functions:
 - Function 0 is a vmxnet Ethernet device which is redundant in the guest
   but is required to pass the ibdevice GID using its MAC.
   Examples:
     For an rxe backend using the eth0 interface, it will use eth0's MAC:
       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
     For an SR-IOV VF, we take the MAC of the Ethernet interface exposed by it:
       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
 - Function 1 is the actual device:
       -device pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port>
   where the ibdevice can be rxe or an RDMA VF (e.g. mlx5_4)
 Note: Pay special attention that the GID at backend-gid-idx matches the
 vmxnet device's MAC.
 The rules of conversion are part of the RoCE spec, but since manual conversion
 is not required, spotting problems is not hard:
    Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
             MAC: 7c:fe:90:cb:74:3a
    Note the difference between the first byte of the MAC (7c) and the
    corresponding GID byte (7e): the universal/local bit is flipped by the
    modified EUI-64 mapping.
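
For illustration, the mapping can be expressed in a few lines of C (a
hypothetical helper following the modified EUI-64 rules for link-local
RoCE v1 GIDs):

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical helper: derive the link-local RoCE v1 GID from a MAC
     * using the modified EUI-64 mapping: flip the universal/local bit of
     * the first byte and insert 0xff,0xfe in the middle. */
    static void mac_to_gid(const uint8_t mac[6], uint8_t gid[16])
    {
        memset(gid, 0, 16);
        gid[0]  = 0xfe;          /* fe80::/64 link-local prefix */
        gid[1]  = 0x80;
        gid[8]  = mac[0] ^ 0x02; /* U/L bit flip: 7c -> 7e */
        gid[9]  = mac[1];
        gid[10] = mac[2];
        gid[11] = 0xff;          /* inserted by the EUI-64 mapping */
        gid[12] = 0xfe;
        gid[13] = mac[3];
        gid[14] = mac[4];
        gid[15] = mac[5];
    }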


4. Implementation details
=========================

4.1 Overview
============
The device acts like a proxy between the Guest Driver and the host
ibdevice interface.
On the configuration path:
 - For every hardware resource request (PD/QP/CQ/...) the pvrdma device will
   request a resource from the backend interface, maintaining a 1-1 mapping
   between the guest and the host.
On the data path:
 - Every post_send/receive received from the guest will be converted into
   a post_send/receive for the backend. The buffer data will not be touched
   or copied, resulting in near bare-metal performance for large enough buffers.
 - Completions from the backend interface will result in completions for
   the pvrdma device.


4.2 PCI BARs
============
PCI BARs layout:
 BAR 0 - MSI-X
        MSI-X vectors:
                (0) Command - used when execution of a command is completed.
                (1) Async - not in use.
                (2) Completion - used when a completion event is placed in
                    the device's CQ ring.
 BAR 1 - Registers
        -----------------------------------------------------
        | VERSION | DSR | CTL | REQ | ERR | ICR | IMR | MAC |
        -----------------------------------------------------
        DSR - Address of driver/device shared memory used
              for the command channel, used for passing:
                    - General info such as driver version
                    - Address of 'command' and 'response'
                    - Address of async ring
                    - Address of device's CQ ring
                    - Device capabilities
        CTL - Device control operations (activate, reset etc.)
        IMR - Interrupt mask
        REQ - Command execution register
        ERR - Operation status

 BAR 2 - UAR
        -----------------------------------------------------
        | QP_NUM | SEND/RECV Flag || CQ_NUM | ARM/POLL Flag |
        -----------------------------------------------------
        - Offset 0 used for QP operations (send and recv)
        - Offset 4 used for CQ operations (arm and poll)


4.3 Major flows
===============

4.3.1 Create CQ
===============
    Guest:
        - Allocates pages for the CQ ring
        - Creates a page directory (pdir) to hold the CQ ring's pages
        - Initializes the CQ ring
        - Initializes the 'Create CQ' command object (cqe, pdir etc.)
        - Copies the command to the 'command' address
        - Writes 0 into the REQ register
    Device:
        - Reads the request object from the 'command' address
        - Allocates a CQ object and initializes the CQ ring based on the pdir
        - Creates the backend CQ
        - Writes the operation status to the ERR register
        - Posts a command-interrupt to the guest
    Guest:
        - Reads the HW response code from the ERR register
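
The 'Creates the backend CQ' step maps to the libibverbs API; a minimal
sketch, assuming a completion channel shared with the event thread of
section 4.3.4 (the function name is hypothetical):

    #include <infiniband/verbs.h>

    /* Hypothetical sketch: create a backend CQ mirroring the guest's
     * 'Create CQ' request; comp_channel lets a dedicated thread sleep
     * until the backend delivers a completion event (see 4.3.4). */
    struct ibv_cq *create_backend_cq(struct ibv_context *ctx,
                                     struct ibv_comp_channel *comp_channel,
                                     int cqe)
    {
        struct ibv_cq *cq = ibv_create_cq(ctx, cqe, NULL /* cq_context */,
                                          comp_channel, 0 /* comp_vector */);
        if (cq) {
            /* Ask for an event on the next completion so the event
             * thread gets notified. */
            ibv_req_notify_cq(cq, 0);
        }
        return cq;
    }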

4.3.2 Create QP
===============
    Guest:
        - Allocates pages for the send and receive rings
        - Creates a page directory (pdir) to hold the rings' pages
        - Initializes the 'Create QP' command object (max_send_wr,
          send_cq_handle, recv_cq_handle, pdir etc.)
        - Copies the object to the 'command' address
        - Writes 0 into the REQ register
    Device:
        - Reads the request object from the 'command' address
        - Allocates the QP object and initializes
            - the send and recv rings based on the pdir
            - the send and recv ring state
        - Creates the backend QP
        - Writes the operation status to the ERR register
        - Posts a command-interrupt to the guest
    Guest:
        - Reads the HW response code from the ERR register
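
Again the backend step corresponds to a libibverbs call; a minimal sketch
that builds the QP from the command's fields (the function name and the
SGE limits are assumptions):

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Hypothetical sketch: create a backend RC QP from the guest's
     * 'Create QP' command, reusing the backend CQs created earlier. */
    struct ibv_qp *create_backend_qp(struct ibv_pd *pd,
                                     struct ibv_cq *send_cq,
                                     struct ibv_cq *recv_cq,
                                     uint32_t max_send_wr,
                                     uint32_t max_recv_wr)
    {
        struct ibv_qp_init_attr attr = {
            .send_cq = send_cq,
            .recv_cq = recv_cq,
            .qp_type = IBV_QPT_RC,
            .cap = {
                .max_send_wr  = max_send_wr,
                .max_recv_wr  = max_recv_wr,
                .max_send_sge = 1,   /* assumption for the sketch */
                .max_recv_sge = 1,
            },
        };

        return ibv_create_qp(pd, &attr);
    }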

4.3.3 Post receive
==================
    Guest:
        - Initializes a wqe and places it on the recv ring
        - Writes qpn|qp_recv_bit (31) to the QP offset in the UAR
    Device:
        - Extracts the qpn from the UAR
        - Walks through the ring and does the following for each wqe:
            - Prepares the backend CQE context to be used when
              receiving a completion from the backend (wr_id, op_code, emu_cq_num)
            - For each sge, prepares a backend sge
            - Calls the backend's post_recv
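
The last step is an ibv_post_recv call; a minimal sketch in which the
saved CQE context doubles as the wr_id (struct comp_ctx and the function
name are hypothetical):

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Hypothetical per-wqe context saved by the device
     * (wr_id, op_code, emu_cq_num). */
    struct comp_ctx;

    /* Hypothetical sketch: convert one guest wqe into a backend receive
     * request; the context pointer is recovered from the backend CQE's
     * wr_id when the completion arrives (see 4.3.4). */
    static int post_backend_recv(struct ibv_qp *qp, struct comp_ctx *ctx,
                                 struct ibv_sge *sge_list, int num_sge)
    {
        struct ibv_recv_wr wr = {
            .wr_id   = (uintptr_t)ctx,
            .sg_list = sge_list,
            .num_sge = num_sge,
        };
        struct ibv_recv_wr *bad_wr;

        return ibv_post_recv(qp, &wr, &bad_wr);
    }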

4.3.4 Process backend events
============================
    - Done by a dedicated thread that processes backend events;
      at initialization it is attached to the device and creates
      the communication channel.
    - Thread main loop:
        - Polls for completions
        - Extracts emu_cq_num, wr_id and op_code from the context
        - Writes the CQE to the CQ ring
        - Writes the CQ number to the device's CQ ring
        - Sends a completion-interrupt to the guest
        - Deallocates the context
        - Acks the event to the backend
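
A minimal sketch of such a loop over the libibverbs completion-channel
API (the guest-facing steps are left as comments; the function name is
hypothetical):

    #include <stddef.h>
    #include <infiniband/verbs.h>

    /* Hypothetical sketch of the event thread's main loop: block on the
     * completion channel, re-arm and drain the CQ, ack the event. */
    static void *comp_handler_thread(void *arg)
    {
        struct ibv_comp_channel *ch = arg;
        struct ibv_cq *cq;
        void *cq_ctx;
        struct ibv_wc wc;

        for (;;) {
            /* Sleeps until the backend delivers a completion event */
            if (ibv_get_cq_event(ch, &cq, &cq_ctx)) {
                break;
            }
            ibv_req_notify_cq(cq, 0);   /* re-arm before polling */
            while (ibv_poll_cq(cq, 1, &wc) > 0) {
                /* wc.wr_id holds the saved context (emu_cq_num, wr_id,
                 * op_code); here the device writes a CQE to the guest CQ
                 * ring, posts the CQ number to the device's CQ ring,
                 * raises the completion interrupt and frees the context. */
            }
            ibv_ack_cq_events(cq, 1);   /* ack the event to the backend */
        }
        return NULL;
    }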


5. Limitations
==============
- The device is limited by the set of VMware device API features that the
  Guest Linux Driver implements.
- The memory registration mechanism requires an mremap for every page in the
  buffer in order to map it to a contiguous virtual address range. Since this
  is not on the data path it should not matter much. If the default max mr
  size is increased, be aware that memory registration can take up to
  0.5 seconds for 1GB of memory.
- The device requires the target page size to be the same as the host page
  size, otherwise it will fail to init.
- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is
  attached, so it can't work with huge pages. This limitation will be
  addressed in the future; however, QEMU allocates guest RAM with
  MADV_HUGEPAGE, so if there are enough huge pages available, QEMU will use
  them. QEMU will fail to init if these requirements are not met.


6. Performance
==============
By design the pvrdma device exits on each post-send/receive, so for small
buffers the performance is affected; however, for medium buffers it gets
close to bare metal, and from 1MB buffers and up it reaches bare-metal
performance.
(Tested with 2 VMs, with the pvrdma devices connected to 2 VFs of the same
device.)

All the above assumes no memory registration is done on the data path.