RDMA over Commodity Ethernet at Scale – Paper Overview

The previous article, RDMA Overview, briefly introduced RDMA. This post focuses on Microsoft's SIGCOMM 2016 paper, RDMA over Commodity Ethernet at Scale, which describes their large-scale deployment of RDMA in their data centers, and gives a brief overview of it.

Abstract

Over the past one and a half years, we have been using RDMA over commodity Ethernet (RoCEv2) to support some of Microsoft’s highly-reliable, latency-sensitive services. This paper describes the challenges we encountered during the process and the solutions we devised to address them.

This paper mainly introduces the challenges and solutions Microsoft encountered while using RDMA in its infrastructure over the past year and a half.

In order to scale RoCEv2 beyond VLAN, we have designed a DSCP-based priority flow-control (PFC) mechanism to ensure large-scale deployment.

To enable better scalability of RoCEv2, Microsoft designed a DSCP-based PFC to address large-scale deployment.

We have addressed the safety challenges brought by PFC-induced deadlock (yes, it happened!), RDMA transport livelock, and the NIC PFC pause frame storm problem. We have also built the monitoring and management systems to make sure RDMA works as expected.

This section discusses some issues caused by PFC and presents Microsoft’s solutions, along with the design of a comprehensive monitoring and management system.

Our experiences show that the safety and scalability issues of running RoCEv2 at scale can all be addressed, and RDMA can replace TCP for intra-data center communications and achieve low latency, low CPU overhead, and high throughput.

Microsoft’s experience demonstrates that scalability and safety issues with RoCEv2 can be resolved, and RDMA can replace TCP as the primary transport protocol within data centers, delivering low latency, low CPU utilization, and high throughput.

Introduction

In recent years, driven by the rising demands of network services and edge computing, data center networks have required ever higher speeds and lower latencies. TCP/IP remains the dominant transport protocol in data centers today, but a growing number of applications show that it can no longer meet these requirements, for two major reasons:

  1. High CPU overhead
  2. High latency caused by kernel processing and packet drops

This paper summarizes the challenges and solutions Microsoft encountered while deploying large-scale RoCEv2, including the following key points:

  • DSCP-based PFC for Layer-3 (IP) Scaling
  • RDMA Transport Livelock
  • PFC Deadlock
  • NIC PFC Pause Frame Storm
  • Slow-receiver Symptom

DSCP-based PFC

PFC is enabled through the PCP (Priority Code Point) field in the VLAN tag, but in their data centers a VLAN-based Layer-2 design is not feasible for two major reasons:

  1. Servers are provisioned via PXE boot; during PXE boot the NIC has no VLAN configured and therefore cannot receive packets carrying VLAN tags.
  2. The switches use Layer-3 forwarding instead of Layer-2 bridging, since Layer-3 routing offers better scalability, security, and easier management.

A PFC pause frame itself does not carry a VLAN tag, so the only purpose of the VLAN tag here is to carry the packet's priority. The DSCP field in the IP header serves the same function, so Microsoft developed a DSCP-based PFC scheme to solve the two problems above.
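
To make the idea concrete, here is a minimal Python sketch of how a packet priority can be encoded in the IPv4 DSCP field (the upper six bits of the ToS byte) instead of in the VLAN PCP field. The traffic-class names and DSCP values are illustrative assumptions, not the code points Microsoft actually uses.

```python
# Illustrative sketch only: carries priority in the IPv4 DSCP field rather
# than the VLAN PCP field. Class names and DSCP values are assumptions.

DSCP_BY_CLASS = {
    "rdma": 26,   # lossless class that PFC protects (assumed value)
    "tcp": 0,     # best-effort traffic
}

def tos_byte(traffic_class: str, ecn_bits: int = 0) -> int:
    """Build the IPv4 ToS byte: DSCP occupies the upper 6 bits, ECN the lower 2."""
    dscp = DSCP_BY_CLASS[traffic_class]
    return (dscp << 2) | (ecn_bits & 0b11)

def dscp_from_tos(tos: int) -> int:
    """Recover the DSCP value a switch would match on for its PFC priority mapping."""
    return tos >> 2

if __name__ == "__main__":
    tos = tos_byte("rdma")
    assert dscp_from_tos(tos) == 26
    print(f"ToS byte 0x{tos:02x} carries DSCP {dscp_from_tos(tos)}")
```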

Safety Challenges

RDMA Transport Livelock

RDMA was designed on the assumption that the network never loses packets due to congestion. RoCEv2 relies on PFC to satisfy this condition, but packets can still be dropped by other causes such as FCS errors or bugs in hardware and software. Microsoft wanted to minimize the performance impact when such drops occur.

In practice, however, even a very low drop rate could render RDMA applications completely unusable. The reason is that the NIC's retransmission algorithm was go-back-0: if any packet in a message is dropped, the entire message is retransmitted from the beginning.

A mechanism as complex as TCP's SACK cannot be implemented in the NIC, so Microsoft worked with the NIC vendor to implement a go-back-N algorithm, which retransmits starting from the first dropped packet and greatly reduces the chance of livelock.
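
The effect of the two retransmission strategies can be shown with a toy simulation. The sketch below is my own illustration under simplified assumptions (independent drops, the sender stops at the first detected loss), not the NIC vendor's implementation.

```python
# Toy comparison: packets sent to deliver one N-packet message under
# go-back-0 (restart the whole message on any loss) versus go-back-N
# (resume from the first lost packet). Simplified model, assumed numbers.
import random

def send_message(num_packets: int, drop_prob: float, go_back_zero: bool) -> int:
    sent = 0
    next_seq = 0                          # first packet not yet delivered in order
    while next_seq < num_packets:
        start = 0 if go_back_zero else next_seq
        lost_at = None
        for seq in range(start, num_packets):
            sent += 1
            if random.random() < drop_prob:
                lost_at = seq
                break
        if lost_at is None:
            next_seq = num_packets        # whole tail delivered
        else:
            next_seq = 0 if go_back_zero else lost_at
    return sent

if __name__ == "__main__":
    random.seed(0)
    n, p, trials = 1000, 0.001, 100
    for mode, gb0 in (("go-back-0", True), ("go-back-N", False)):
        avg = sum(send_message(n, p, gb0) for _ in range(trials)) / trials
        print(f"{mode}: ~{avg:.0f} packets sent to deliver {n}")
```

Even at a 0.1% drop rate the go-back-0 strategy wastes a large fraction of the link on retransmissions, and with longer messages or higher loss it may never finish, which is the livelock behavior described above.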

PFC Deadlock

PFC deadlock essentially arises when the MAC address table and the ARP table do not match. The switch then tries to learn the new MAC address by flooding, but with PFC enabled and one receiver unable to accept packets, this can deadlock the entire network. This is only a brief summary; the paper explains the scenario in more detail.

When a receiver is in an abnormal state and does not properly receive packets, the switch floods packets to every port, which can congest other ports and trigger PFC pause frames; these pause frames keep propagating through the data center's leaf-spine network and can deadlock the entire network.

Microsoft's solution is that, when the MAC address table does not match the ARP table, the lossless classes do not forward packets whose destination MAC cannot be resolved; instead of flooding, the switch drops them directly, preventing buffer exhaustion on other ports and avoiding PFC deadlock.
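
The forwarding rule can be sketched roughly as follows. This is an illustration of the idea rather than actual switch code, and the priority value marked as lossless is an assumption.

```python
# Simplified sketch: in a PFC-protected (lossless) class, a packet with an
# unresolved destination MAC is dropped instead of flooded, so a single
# misbehaving receiver cannot trigger flooding, buffer buildup, and
# cascading PFC pauses. Priority value is assumed.

LOSSLESS_PRIORITIES = {3}        # assumed PFC-enabled priority class

def forward(packet, mac_table, all_ports):
    """Return the list of egress ports for a packet, or [] to drop it."""
    port = mac_table.get(packet["dst_mac"])
    if port is not None:
        return [port]                                 # normal unicast forwarding
    if packet["priority"] in LOSSLESS_PRIORITIES:
        return []                                     # drop: never flood lossless traffic
    return [p for p in all_ports if p != packet["in_port"]]  # flood lossy traffic only

if __name__ == "__main__":
    mac_table = {"aa:bb:cc:dd:ee:01": 1}
    ports = [1, 2, 3, 4]
    unknown_lossless = {"dst_mac": "aa:bb:cc:dd:ee:99", "priority": 3, "in_port": 2}
    print(forward(unknown_lossless, mac_table, ports))   # [] -> dropped, not flooded
```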

NIC PFC Pause Frame Storm

Microsoft observed that NICs may continuously send PFC pause frames under abnormal conditions, which can lead to the following situations:

  1. Some NICs continuously send pause frames to their ToR switch
  2. The ToR switch therefore halts all traffic, including the leaf uplinks
  3. The leaves also halt all traffic to the spines
  4. The spines halt all traffic to the other leaves
  5. The other leaves halt all traffic to their ToRs
  6. The ToRs halt all traffic to the servers

The solution is relatively simple. Microsoft designed two watchdog mechanisms, one on the server side and one on the switch side. On the server side, when the watchdog detects that the NIC keeps sending pause frames, it blocks the NIC from sending further pause frames. On the switch side, a similar watchdog monitors each port; when a port keeps receiving pause frames, the switch temporarily disables that port's lossless mode and re-enables it shortly after the pause frame storm has subsided.
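
A rough sketch of the switch-side watchdog logic is shown below. The thresholds, interval handling, and API are my assumptions, since the paper describes the mechanism only at a high level.

```python
# Sketch of a per-port pause-storm watchdog (assumed thresholds and API).
# A port that keeps receiving pause frames has its lossless mode disabled,
# and lossless mode is restored once the storm has subsided.

class PauseStormWatchdog:
    def __init__(self, storm_threshold=1000, calm_intervals=3):
        self.storm_threshold = storm_threshold   # pause frames per interval (assumed)
        self.calm_intervals = calm_intervals     # quiet intervals before re-enabling
        self.disabled = {}                       # port -> consecutive calm intervals seen

    def on_interval(self, port, pause_frames_received, set_lossless):
        """Called once per monitoring interval with the port's pause-frame count."""
        if port in self.disabled:
            if pause_frames_received < self.storm_threshold:
                self.disabled[port] += 1
                if self.disabled[port] >= self.calm_intervals:
                    del self.disabled[port]
                    set_lossless(port, True)     # storm over: restore lossless mode
            else:
                self.disabled[port] = 0          # still stormy: stay disabled
        elif pause_frames_received >= self.storm_threshold:
            self.disabled[port] = 0
            set_lossless(port, False)            # storm detected: drop instead of pausing

if __name__ == "__main__":
    wd = PauseStormWatchdog()
    log = lambda port, on: print(f"port {port}: lossless {'enabled' if on else 'disabled'}")
    for count in (5000, 4000, 10, 10, 10):       # one storm, then calm intervals
        wd.on_interval(1, count, log)
```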

Slow-receiver Symptom

By design, the NIC keeps most of its data structures in host memory and caches only a small portion on the NIC itself, using a Memory Translation Table (MTT) to store the mapping from virtual to physical addresses. If the configured page size is too small, the MTT grows large and cache misses occur easily, which degrades performance and lets buffers build up beyond the PFC threshold, ultimately causing the NIC to send large volumes of PFC pause frames.

The solution, in brief: use a larger page size, which shrinks the MTT and lowers the cache miss rate, avoiding the performance degradation.
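
A quick back-of-the-envelope calculation shows why the page size matters: each MTT entry maps one page, so larger pages shrink the table that the small on-NIC cache has to cover. The numbers below are assumed for illustration, not measurements from the paper.

```python
# Assumed numbers: how many MTT entries are needed to map the same amount
# of registered memory at different page sizes.

def mtt_entries(registered_bytes: int, page_size: int) -> int:
    """Number of MTT entries needed to map a registered memory region."""
    return (registered_bytes + page_size - 1) // page_size

if __name__ == "__main__":
    registered = 4 * 2**30                        # 4 GiB of registered memory (assumed)
    for page_size in (4 * 2**10, 2 * 2**20):      # 4 KiB vs 2 MiB pages
        n = mtt_entries(registered, page_size)
        print(f"page size {page_size // 1024:>5} KiB -> {n:,} MTT entries")
```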

RDMA in Production

This section describes how Microsoft uses RDMA in production environments, including how to configure and monitor PFC pause frames, as well as tools for measuring PFC pause frame quantities and latency.

The final part of this section presents some RDMA performance tests, showing that RDMA offers significant improvements over TCP in both latency and throughput. However, it also notes that RDMA cannot simultaneously provide high throughput and low latency.

Experiences

This section primarily discusses the issues Microsoft encountered with RDMA over Ethernet in production and their resolution approaches, along with insights on RoCEv2 technology and its challenges, concluding with a brief exploration of other RDMA technologies such as InfiniBand and iWARP.

Conclusion

This paper details Microsoft's practical implementation of RDMA technology in data centers and their methods for resolving encountered issues. It also outlines future research directions, such as low-latency paths within data centers, deadlock resolution strategies across different architectures, and how to simultaneously provide high throughput and low latency.

Final Thoughts

I found this paper well worth reading, not only because it clarifies how RDMA operates over Ethernet links, but also because Microsoft shares insights and methods that are highly relevant for network engineers and infrastructure specialists.


