RDMA Basic Concepts

I. Basic Concepts

1. QP — Queue Pair

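The foundation of RDMA communication. Each QP contains one RQ and one SQ. A connection-based QP can only talk to a single peer QP, whereas connectionless, datagram-based communication lets one QP talk to many.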

2. RQ — Receive Queue

Holds the WQEs posted for receiving.

3. SQ — Send Queue

Holds the WQEs posted for sending.

4. CQ — Completion Queue

Holds the notifications that signal completed operations. Each entry in it is called a CQE (Completion Queue Entry).

5. WQE — Work Queue Element

Used to hand an operation to the NIC. It is normally placed in an RQ or an SQ, and its size depends on several factors.

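6. MR — Memory Region

Describes a piece of host memory that has been set aside and registered for RDMA. Once registered, it can be used as the source of outgoing data and the destination of incoming data.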
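How the pieces above relate to each other can be sketched with a toy model. This is not the libibverbs API: every `toy_` name and both depth constants are hypothetical, and the "NIC" is just a function call. It only shows that a QP bundles an SQ and an RQ, that WQEs go into a work queue, and that a finished WQE shows up as a CQE in the CQ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_WQ_DEPTH 4  /* hypothetical depths, chosen for illustration */
#define TOY_CQ_DEPTH 8

/* Toy WQE: an application-chosen ID plus the buffer it refers to. */
struct toy_wqe { uint64_t wr_id; void *buf; size_t len; };

/* Toy CQE: says which WQE finished and with what status (0 = success). */
struct toy_cqe { uint64_t wr_id; int status; };

/* Work queue (used for both SQ and RQ) and completion queue as ring buffers. */
struct toy_wq { struct toy_wqe slots[TOY_WQ_DEPTH]; size_t head, count; };
struct toy_cq { struct toy_cqe slots[TOY_CQ_DEPTH]; size_t head, count; };

/* A QP is one SQ plus one RQ; completions are reported to a CQ. */
struct toy_qp { struct toy_wq sq; struct toy_wq rq; struct toy_cq *cq; };

/* Append a WQE to a work queue. Returns 0 on success, -1 if the queue is full. */
static int toy_wq_push(struct toy_wq *wq, struct toy_wqe wqe)
{
    if (wq->count == TOY_WQ_DEPTH)
        return -1;
    wq->slots[(wq->head + wq->count) % TOY_WQ_DEPTH] = wqe;
    wq->count++;
    return 0;
}

/* Stand-in for the NIC: take the oldest SQ entry and emit a matching CQE.
 * Returns 1 if one WQE was processed, 0 if the SQ is empty or the CQ is full. */
static int toy_qp_process_send(struct toy_qp *qp)
{
    assert(qp->cq != NULL);
    struct toy_wq *sq = &qp->sq;
    struct toy_cq *cq = qp->cq;
    if (sq->count == 0 || cq->count == TOY_CQ_DEPTH)
        return 0;
    struct toy_wqe wqe = sq->slots[sq->head];
    sq->head = (sq->head + 1) % TOY_WQ_DEPTH;
    sq->count--;
    cq->slots[(cq->head + cq->count) % TOY_CQ_DEPTH] =
        (struct toy_cqe){ .wr_id = wqe.wr_id, .status = 0 };
    cq->count++;
    return 1;
}

/* Poll one completion. Returns 1 and fills *out if a CQE was available, else 0. */
static int toy_cq_poll(struct toy_cq *cq, struct toy_cqe *out)
{
    if (cq->count == 0)
        return 0;
    *out = cq->slots[cq->head];
    cq->head = (cq->head + 1) % TOY_CQ_DEPTH;
    cq->count--;
    return 1;
}
```

The `wr_id` field is what lets the application match a polled CQE back to the WQE it posted earlier.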

II. Operation Primitives (Verbs)

1. Memory-access primitives

Read

Write

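These primitives require knowing in advance the address of the target memory on the remote host (expressed as a virtual address) together with the rkey of the region it belongs to. They do not involve the remote host's CPU.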
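The requirement to know the remote address up front can be illustrated with a small simulation. This is a sketch, not real verbs code: `toy_remote_mr`, `toy_rdma_write` and `toy_rdma_read` are hypothetical names, and both "hosts" live in one process. The functions play the role of the remote NIC, which checks the key and the address range and then touches memory without any remote application code running.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy view of a region registered on the remote host. */
struct toy_remote_mr {
    uint8_t *base;    /* start of the registered buffer */
    size_t   length;  /* size of the registered buffer */
    uint32_t rkey;    /* key the initiator must present */
};

/* Check that [remote_addr, remote_addr + len) lies inside the region
 * and that the presented key matches. */
static int toy_access_ok(const struct toy_remote_mr *mr, uint64_t remote_addr,
                         uint32_t rkey, size_t len)
{
    uint64_t start = (uint64_t)(uintptr_t)mr->base;
    if (rkey != mr->rkey || remote_addr < start || len > mr->length)
        return 0;
    return remote_addr - start <= mr->length - len;
}

/* One-sided write: copy local data into remote memory. 0 on success, -1 if rejected. */
static int toy_rdma_write(struct toy_remote_mr *mr, uint64_t remote_addr,
                          uint32_t rkey, const void *src, size_t len)
{
    assert(src != NULL);
    if (!toy_access_ok(mr, remote_addr, rkey, len))
        return -1;
    memcpy((void *)(uintptr_t)remote_addr, src, len);
    return 0;
}

/* One-sided read: copy remote memory into a local buffer. 0 on success, -1 if rejected. */
static int toy_rdma_read(const struct toy_remote_mr *mr, uint64_t remote_addr,
                         uint32_t rkey, void *dst, size_t len)
{
    assert(dst != NULL);
    if (!toy_access_ok(mr, remote_addr, rkey, len))
        return -1;
    memcpy(dst, (const void *)(uintptr_t)remote_addr, len);
    return 0;
}
```

In a real program the initiator learns `remote_addr` and `rkey` from the peer during connection setup, which is the address-exchange step these notes skip.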

2. Messaging primitives

Send

Receive

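These primitives do not need a remote address. A Send writes the message into memory that the remote side has declared through a Receive, and declaring that memory does require the remote CPU.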
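The pairing of Send with a previously declared Receive can be sketched as follows. This is a toy model with hypothetical names (`toy_rq`, `toy_post_recv`, `toy_deliver_send`), not libibverbs. The receiver's CPU posts a buffer first; an incoming Send then consumes the oldest posted buffer. The error handling is simplified: real hardware reports these cases through error completions, while the sketch just returns -1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_RQ_DEPTH 4  /* hypothetical depth, chosen for illustration */

/* A buffer the receiver has declared as available for one incoming message. */
struct toy_recv_buf { void *addr; size_t length; };

/* Receive queue: FIFO of posted buffers. */
struct toy_rq { struct toy_recv_buf posted[TOY_RQ_DEPTH]; size_t head, count; };

/* Receiver side (needs the CPU): declare a buffer. 0 on success, -1 if the RQ is full. */
static int toy_post_recv(struct toy_rq *rq, void *addr, size_t length)
{
    assert(addr != NULL);
    if (rq->count == TOY_RQ_DEPTH)
        return -1;
    rq->posted[(rq->head + rq->count) % TOY_RQ_DEPTH] =
        (struct toy_recv_buf){ addr, length };
    rq->count++;
    return 0;
}

/* Arrival of a Send: consume the oldest posted buffer and copy the message in.
 * Returns the number of bytes delivered, or -1 if no buffer was posted or
 * the consumed buffer is too small (the buffer is used up either way). */
static long toy_deliver_send(struct toy_rq *rq, const void *msg, size_t len)
{
    if (rq->count == 0)
        return -1;
    struct toy_recv_buf dst = rq->posted[rq->head];
    rq->head = (rq->head + 1) % TOY_RQ_DEPTH;
    rq->count--;
    if (len > dst.length)
        return -1;
    memcpy(dst.addr, msg, len);
    return (long)len;
}
```

Note that the sender never names an address; the order in which the receiver posted its buffers decides where each message lands.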

III. Transport Types

1. Connection-based

RC — Reliable Connected

The NIC uses acknowledgements to guarantee that messages are delivered, and delivered in order.

UC — Unreliable Connected

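An unreliable connection uses no acknowledgements, so the transport itself does not guarantee delivery or ordering. InfiniBand does provide link-layer flow control, however, so packets are very rarely lost.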

2. Connectionless

UD — Unreliable Datagram

Used less often, but it makes one-to-many delivery (broadcast or multicast) possible.

3. Verb support
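The body of this subsection is missing from the notes. As a reminder only, the commonly documented support matrix for the three transports and the verbs named in section II can be written as a lookup function. The enum and function names are hypothetical, and the matrix is stated from general InfiniBand documentation, not recovered from the original text.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_transport { TOY_RC, TOY_UC, TOY_UD };
enum toy_verb { TOY_SEND_RECV, TOY_WRITE, TOY_READ };

/* Returns true if the given transport type can carry the given verb. */
static bool toy_transport_supports(enum toy_transport t, enum toy_verb v)
{
    switch (t) {
    case TOY_RC:  /* reliable connected: Send/Recv, Write and Read */
        return true;
    case TOY_UC:  /* unreliable connected: Send/Recv and Write, no Read */
        return v == TOY_SEND_RECV || v == TOY_WRITE;
    case TOY_UD:  /* unreliable datagram: Send/Recv only */
        return v == TOY_SEND_RECV;
    }
    return false;
}
```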

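IV. Transfer Flow

Send and Recv are used as the example here. Regular RDMA programming also requires establishing a connection and exchanging addresses beforehand; those steps are omitted. For details, see the usage of RDMA_CA.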

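1. Set up memory

An RDMA operation starts with getting hold of memory. Registering a piece of memory tells the kernel that it is taken and that its owner is your application. So the first step is to register a Memory Region (MR). Once the MR is registered, the memory can be used for any RDMA operation.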


struct ibv_mr {
    struct ibv_context     *context;
    struct ibv_pd          *pd;
    void                   *addr;
    size_t                  length;
    uint32_t                handle;
    uint32_t                lkey; // key for local access
    uint32_t                rkey; // key for remote access
};

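2. Create the WQE (send / recv)

With memory in place, the transfer can begin. The sender first creates a WQE, also called a Work Request, for the SQ and uses sg_list to point at the buffers to be transmitted.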


struct ibv_send_wr {
    uint64_t                wr_id;       /* User defined WR ID */
    struct ibv_send_wr     *next;        /* Pointer to next WR in list, NULL if last WR */
    struct ibv_sge         *sg_list;     /* Pointer to the s/g array */
    int                     num_sge;     /* Size of the s/g array */
    enum ibv_wr_opcode      opcode;      /* Operation type */
    int                     send_flags;  /* Flags of the WR properties */
    uint32_t                imm_data;    /* Immediate data (in network byte order) */
    union {
        struct {
            uint64_t        remote_addr;    /* Start address of remote memory buffer */
            uint32_t        rkey;           /* Key of the remote Memory Region */
        } rdma;
        struct {
            uint64_t        remote_addr;    /* Start address of remote memory buffer */
            uint64_t        compare_add;    /* Compare operand */
            uint64_t        swap;           /* Swap operand */
            uint32_t        rkey;           /* Key of the remote Memory Region */
        } atomic;
        struct {
            struct ibv_ah  *ah;             /* Address handle (AH) for the remote node address */
            uint32_t        remote_qpn;     /* QP number of the destination QP */
            uint32_t        remote_qkey;    /* Q_Key number of the destination QP */
        } ud;
    } wr;
};

The receiver likewise has to create a WQE to receive the message.


struct ibv_recv_wr {
    uint64_t                wr_id;
    struct ibv_recv_wr     *next;
    struct ibv_sge         *sg_list;
    int                     num_sge;
};

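In RDMA programming, the SGL (Scatter/Gather List) is the most basic way of organizing data. An SGL is an array whose elements are called SGEs (Scatter/Gather Elements), and each SGE describes one data segment. RDMA supports scatter/gather operations: with scatter, one contiguous buffer can be spread across several non-contiguous buffers on the destination host; with gather, several non-contiguous buffers can be collected into one contiguous buffer on the destination host.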


struct ibv_sge {
    uint64_t  addr;    // Start virtual address of the data segment (i.e. the buffer)
    uint32_t  length;  // Length of the data segment
    uint32_t  lkey;    // L_Key of the local Memory Region that contains the segment
};
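What gather and scatter do to the bytes can be shown with a small sketch. `toy_sge` mirrors the three fields of `ibv_sge`, but everything else is hypothetical: the functions are plain memory copies standing in for the NIC, and the `lkey` field is carried along without being checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same three fields as ibv_sge; lkey is ignored in this toy model. */
struct toy_sge { uint64_t addr; uint32_t length; uint32_t lkey; };

/* Gather: concatenate all segments of an SGL into one contiguous buffer.
 * Returns the total number of bytes, or -1 if `out` is too small. */
static long toy_gather(const struct toy_sge *sgl, int num_sge,
                       void *out, size_t out_len)
{
    size_t used = 0;
    assert(num_sge >= 0);
    for (int i = 0; i < num_sge; i++) {
        if (sgl[i].length > out_len - used)
            return -1;
        memcpy((uint8_t *)out + used,
               (const void *)(uintptr_t)sgl[i].addr, sgl[i].length);
        used += sgl[i].length;
    }
    return (long)used;
}

/* Scatter: spread one contiguous message across the segments of an SGL,
 * filling them in order. Returns msg_len, or -1 if the segments are too small. */
static long toy_scatter(const struct toy_sge *sgl, int num_sge,
                        const void *msg, size_t msg_len)
{
    size_t done = 0;
    assert(num_sge >= 0);
    for (int i = 0; i < num_sge && done < msg_len; i++) {
        size_t chunk = msg_len - done;
        if (chunk > sgl[i].length)
            chunk = sgl[i].length;
        memcpy((void *)(uintptr_t)sgl[i].addr,
               (const uint8_t *)msg + done, chunk);
        done += chunk;
    }
    return done == msg_len ? (long)msg_len : -1;
}
```

Because the NIC walks the list itself, the application avoids copying scattered pieces into a staging buffer before sending.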

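3. Post the work request

The work request then has to be added to the corresponding QP so that the NIC will execute it. This is done with the ibv_post_send() function.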


int ibv_post_send(struct ibv_qp *qp, 
                  struct ibv_send_wr *wr,
                  struct ibv_send_wr **bad_wr);
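The role of the `next` pointer and of `bad_wr` can be sketched with a simplified stand-in. The names `toy_send_wr`, `toy_sq` and `toy_post_send` are hypothetical and the queue depth is an arbitrary assumption. The sketch follows the documented behaviour of ibv_post_send(): it walks the linked list, and on failure it returns an errno-style value and sets `*bad_wr` to the first request that could not be posted, while the requests before it stay posted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SQ_DEPTH 2  /* hypothetical depth, kept tiny to make overflow easy to show */

/* Reduced work request: only the ID and the list link. */
struct toy_send_wr { uint64_t wr_id; struct toy_send_wr *next; };

/* Toy send queue that just remembers the IDs it accepted. */
struct toy_sq { uint64_t ids[TOY_SQ_DEPTH]; size_t count; };

/* Post a linked list of work requests.
 * Returns 0 if all were accepted. Otherwise returns ENOMEM and stores the
 * first rejected request in *bad_wr; earlier requests remain in the queue. */
static int toy_post_send(struct toy_sq *sq, struct toy_send_wr *wr,
                         struct toy_send_wr **bad_wr)
{
    assert(bad_wr != NULL);
    for (; wr != NULL; wr = wr->next) {
        if (sq->count == TOY_SQ_DEPTH) {
            *bad_wr = wr;
            return ENOMEM;
        }
        sq->ids[sq->count++] = wr->wr_id;
    }
    return 0;
}
```

A caller that gets a non-zero return can resume from `*bad_wr` once completions have freed up space in the queue.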

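4. NIC execution

From the sg_list field the NIC knows which memory addresses it has to operate on and can then access that memory. To keep a host from touching memory that does not belong to it, the concept of a PD (Protection Domain) was introduced. A PD ties MRs and QPs together, so that incorrect memory accesses are blocked at the NIC driver and hardware level.

Once the PD check passes, the NIC assembles the memory described by the SGL and transfers the data.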
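The PD check described above can be sketched as a predicate. All `guard_` names are hypothetical, and real NICs perform this validation in hardware with more state than shown here. The sketch only captures the rule from the text: a segment is usable on a QP only if the MR it names belongs to the same PD as the QP, the key matches, and the segment lies inside the registered range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct guard_pd  { int id; };
struct guard_mr  { const struct guard_pd *pd; uint64_t addr; size_t length; uint32_t lkey; };
struct guard_qp  { const struct guard_pd *pd; };
struct guard_sge { uint64_t addr; uint32_t length; uint32_t lkey; };

/* Returns true if `sge` may be used by a work request posted on `qp`,
 * given that `mr` is the region the segment claims to belong to. */
static bool guard_sge_allowed(const struct guard_qp *qp,
                              const struct guard_mr *mr,
                              const struct guard_sge *sge)
{
    if (qp->pd != mr->pd)          /* QP and MR must share a protection domain */
        return false;
    if (sge->lkey != mr->lkey)     /* the segment must present the MR's local key */
        return false;
    if (sge->addr < mr->addr || sge->length > mr->length)
        return false;
    /* the whole segment must fit inside the registered range */
    return sge->addr - mr->addr <= mr->length - sge->length;
}
```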