Introduction
Does low latency, high throughput & CPU offloading require RDMA? What? Blasphemy, how dare we even question this? In my defense, I’m not questioning anything. I am merely being curious. I’m the inquisitive kind.
The need for RDMA is the premise we have been working with ever since RDMA became available outside of HPC InfiniBand fabrics. For those of us working in the Windows ecosystem, that happened with SMB Direct. Windows Server 2012 was the OS version that introduced us to SMB Direct, which leverages RDMA.
Over time more and more features in Windows Server came to support SMB Direct / RDMA. The list is now quite long and impressive.
- Storage Replica
- S2D Storage Bus
- CSV redirected traffic
- Live Migration, Storage Live Migration and Shared Nothing Live Migration
- SMB 3 file sharing
There are many more use cases for RDMA or RDMA-like capabilities: iSER, NFS, NVMe over Fabrics, Ceph, …
For now, and as far as I can judge, most of the ecosystem, including myself, is leveraging RDMA. Mostly RoCE actually, for a number of reasons, over the past 8 years. But if I’m not the infidel questioning the need, who is? Well, companies like SolarFlare, Pavilion Data and Lightbits are. They are active in the NVMe over Fabrics world and offer an alternative to NVMe/FC and NVMe/RDMA.
Their point of view pours some fuel on the raging RDMA wars between RoCE and iWarp. In that war a couple of interesting fronts are forming. One, driven by the companies above, is the idea that RDMA is overrated, not needed and only adds complexity and overhead for very little benefit. On top of that, it requires the workloads to be RDMA capable. The second is that iWarp got one idea right, namely that the (RDMA) NIC should not have a hard requirement for lossless fabrics. But to achieve that goal it took on the overhead of running TCP/IP with RDMA, which is complex and costly.
There is a 3rd point of discussion, which is something many seem to conveniently forget. The notion of simplicity with TCP/IP, with or without iWarp, only lasts as long as you have overprovisioned network fabrics with no serious congestion. This is true for RoCE as well. But at least until today you had to consider and take care of the network, as providing a lossless network is a hard requirement for RoCE to work properly. At least for now. But whether this requirement is a prerequisite or just needed under certain conditions, it remains something you need to think about. Your fast car isn’t that impressive on a congested highway either.
All the above should have gotten your attention, at least that’s what I hope. Let’s dive into the wonderful world of RDMA once more and try to make sense of all this.
The big 3 RDMA technologies
There are 3 big technologies in RDMA: InfiniBand, RoCE and iWarp. InfiniBand went out of style and focus as an option for SMB Direct. It still works but, in all honesty, it was never the primary fabric for this use case. That was Ethernet, which rules over datacenters big and small. That leaves two RDMA options: RoCE and iWarp. When Intel dumped their old iWarp card, Chelsio was the only game in town for quite a while. Mellanox ruled and still rules the RoCE seas. They leave most other vendors far behind as far as I can judge. Emulex tried to do RoCE before but had so many issues it wasn’t funny anymore. Cisco also has some NICs. Other vendors that did support DCB were not really aligned with the SMB Direct effort. Things are slowly changing and improving, but it’s not a tidal wave. From my perspective this technology should have been bigger by now. But I understand the reasons behind this apparent inertia. Some of those reasons, often associated with RoCE, don’t seem to have given iWarp a huge advantage either.
Meanwhile, between 2012 and 2017, a series of sales & acquisitions in the network world made everybody confused about who owns what now and under what brand name. I won’t go into that, but let’s look at the results. After 6 years we finally have a few iWarp vendors, with Cavium being the one in the spotlight right now. Intel is in the iWarp game (again) but is not very clear on the what, why and when of its road map. They’re a bit preoccupied with competing against InfiniBand it seems. They do not believe in RoCE, that’s clear. The lure of iWarp is easy to understand: just plug in the cards and RDMA just works. Or at least that’s the idea. We also have Mellanox, Cavium, Cisco and Emulex competing in the RoCE ecosystem.
So, there is choice in RDMA. Choice is good, right?
Frustration & support calls
A large number of support calls leads to increased costs, uncomfortable talks with superiors and smaller bonuses or fewer career prospects. RoCE, or rather its dependency on DCB for PFC (and ETS), created a surge in such support calls. That made no one happy.
Some pundits say I could be making a killing by configuring networks correctly for RoCE, but that is not the case. The reality is that the gear was bad or wrong so often that this was a bit of a showstopper. On top of that, people don’t have any intention of paying for expertise. Not unless they have to pay in order to prevent jail time or premature death.
For RoCE, Mellanox is and remains the dominant player, but it has some headwinds due to the above. The lack of skills and knowledge in the DCB segment of networking bit many S2D implementations in the proverbial behind. As said, this led to many support calls with OEMs and Microsoft, who became ever more frustrated with this and now, for S2D, recommend iWarp. Do note that Azure today leverages RoCEv2. Why is that? Well, maybe the scale is very different and requires RoCE? Or maybe network engineers just love pain?
So, while MSFT supports RoCE 100%, it does recommend iWarp for S2D deployments. Meanwhile Mellanox is trying to address the need for a lossless network on several fronts. But except for some (very) small deployments with direct cabling, it’s not yet a plug-and-play solution today. Do note the RoCE camp isn’t throwing in the towel at all; they are working on improving the experience.
Another reason I see for the iWarp love is RDMA to the guest. Think about the headaches RoCE caused Microsoft on the hosts. Now extend that to a multitude of VMs … right. Maybe they’d rather not deal with that on a per-customer basis.
iWarp with UDP, Improved RoCE NICs (IRN) and plain TCP/IP
With all the above in mind, over the last 18 months we’ve seen two interesting trends that are perhaps at odds with each other. Some years ago there was even a 3rd idea, mentioned only once, around improving iWarp, so let’s start there.
iWarp
At some point the idea came up to improve iWarp by having it leverage UDP instead of TCP/IP. You can read about one such experiment in Energy-Efficient Data Transfers in Radio Astronomy with Software UDP RDMA. Another paper is RDMA Capable iWARP over Datagrams. The idea is to get rid of the overhead of dragging the entire TCP/IP stack into the RDMA solution. Other than that, I have not seen many ideas for change except for optimizing iWarp for ever bigger bandwidths. I guess that means iWarp is considered just great as it is and that issues outside of the NIC’s scope are ignored. But I have little to share here. One could argue that it’s RDMA only and as such cannot optimize non-RDMA traffic. If this is a concern, it shares that drawback with RoCE anyway, so it is not a competitive disadvantage.
Improved RoCE NICs
What is gaining some genuine attention is Improved RoCE NICs (IRN). It’s an idea Mellanox could run with to offer the best possible RDMA solution. The concept is all about improving RoCE cards to a level where they can handle lossy networks themselves and do not depend on the network fabric to do so for them. That would take the wind out of the sails (and sales) of iWarp, which needs to rely on the overhead/complexity of TCP/IP on the NIC to do this.
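To give a feel for why smarter loss recovery on the NIC matters, here is a minimal, purely illustrative Python sketch (not any vendor’s implementation) comparing how many packets a go-back-N style recovery retransmits with what a selective retransmission approach resends when a single packet in a window is lost. A card that only has to resend what was actually dropped tolerates a lossy fabric far more gracefully.

```python
# Toy comparison of retransmission cost after a single loss.
# Purely illustrative; real NIC/transport behaviour is far more involved.

WINDOW = 64          # packets in flight
LOST_INDEX = 3       # position of the single dropped packet in the window

def go_back_n_retransmits(window: int, lost_index: int) -> int:
    """Go-back-N resends the lost packet and everything after it."""
    return window - lost_index

def selective_retransmits(window: int, lost_index: int) -> int:
    """Selective retransmission resends only the packet that was lost."""
    return 1

if __name__ == "__main__":
    gbn = go_back_n_retransmits(WINDOW, LOST_INDEX)
    sel = selective_retransmits(WINDOW, LOST_INDEX)
    print(f"Go-back-N resends {gbn} of {WINDOW} packets for one loss.")
    print(f"Selective retransmission resends {sel} packet.")
```

With a 64-packet window, one early drop costs 61 retransmissions in the go-back-N case versus a single one with selective retransmission, which is why loss recovery on the card is such a big lever.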
What’s also important to note is that the mechanism to make the fabric lossless doesn’t have to be PFC. Progress is being made there as well, and you can leverage it when and where you so desire. Just like you would do today with RoCEv2 and ECN, where you can add PFC as a fail-safe.
Non-RDMA TCP/IP that does what RDMA does
Let’s look at what’s being talked about and promoted right now by some. Basically, it comes down to not needing RDMA at all. The players here seem to be SolarFlare, Lightbits and Pavilion Data Systems.
These vendors are all heavily focused on NVMe/TCP fabrics and storage. Fabrics, yes, so decoupled storage and compute. Not exclusively perhaps, but we know that in a hyperconverged world separating compute and storage is blasphemy and some players don’t believe in it. In the battle for the big truth and for sales, infidels (people who disagree or think there is no one-size-fits-all) are not given much attention, let alone a voice or airtime. Anyway, there is no reason why cards from SolarFlare and the like could not be leveraged for other use cases like SMB 3. I need to ask them, actually.
The NVMe over Fabrics effort is gaining more traction and it’s actually cool to see StarWind pushing this forward as well. Look at some of what StarWind is working on in these presentations from Storage Field Day 17.
The message from SolarFlare, Pavilion Data and Lightbits is to forget about RDMA. The overhead in operations, compatibility and processing is not worth the effort. So, take that, iWarp: you’re putting in all this effort to get that entire TCP/IP stack working with RDMA and it isn’t even necessary?! To add insult to injury, not only do you burden RDMA with the full TCP/IP stack, but even RDMA itself might be overkill. So, with one big swing they slap RoCE in the face as well as iWarp.
Now the fun part is that they seem to focus on competing with RoCE. Why? Because they leverage TCP/IP and can hardly bash it, as it’s leveraged in iWarp? But if the overhead on the card for RoCE RDMA is too big and needless, doesn’t this, perhaps even to a bigger extent, apply to iWarp RDMA as well? As iWarp “works on any switch” they cannot play that card too hard either, I guess.
The question this leaves is how these other players can not know this and not have changed their approach. Is what they do good enough and is the rest just technical vanity? Or, vice versa, are SolarFlare, Pavilion Data and Lightbits wrong, or do they just not know all the issues that are going to bite them (yet)?
What about the network fabric?
What am I concerned about?
Well, problems at the network fabric level tend to be ignored in the discussion. That’s something abstract for many people and it only comes into view when things go wrong.
I’m pretty sure the network engineers at Azure, AWS, Google and other demanding players (high-performance computing, high-frequency traders, …) have a slightly different or more holistic point of view. But sometimes the business and application owners forget about those network people when they can reduce their own support issues. They’ll remember them when things go wrong. But hey, until that happens, it’s not their part of the stack that’s causing the issues.
Sure, with IRN, PFC might no longer be the prime feature to ensure lossless traffic under congestion, as ECN/DCQCN/DCTCP and other techniques take care of that better. On top of that, smart FPGA components will open up new and better approaches to this challenge.
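As a concrete (and admittedly Linux-side, non-Windows) illustration of that ECN-based direction, the sketch below shows how an application can request the DCTCP congestion control algorithm on a plain TCP socket. This assumes a Linux host with the dctcp module available and ECN marking enabled on the switches; it’s only meant to show the idea, not an RDMA or S2D configuration.

```python
import socket

def make_dctcp_socket() -> socket.socket:
    """Create a TCP socket that requests DCTCP congestion control.

    Assumes a Linux kernel with the 'dctcp' module available and ECN
    marking enabled on the fabric; falls back to the system default
    congestion control otherwise.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # TCP_CONGESTION lets an application pick the congestion control
        # algorithm per socket (Linux only).
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"dctcp")
    except (AttributeError, OSError):
        # Not on Linux, or dctcp not loaded/allowed: keep the default.
        pass
    return s

if __name__ == "__main__":
    sock = make_dctcp_socket()
    if hasattr(socket, "TCP_CONGESTION"):
        algo = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
        print("Congestion control in use:", algo.split(b"\x00", 1)[0].decode())
    sock.close()
```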
The point remains that you cannot ignore the network fabric just like that, certainly not for every use case. Questions still remain for any solution. What I like about RoCE and IRN is that they at least always discuss the need, the benefits & the challenges. The beauty of RoCE is that, because the protocol requires a lossless network or it just doesn’t function well, they have to address it up front.
What if the small difference in latency between RDMA and TCP/IP does matter for certain use cases? What happens with head-of-line blocking, delayed ACKs, incast, etc.? Basically, what happens when (serious) congestion occurs? This question is and remains valid at scale for RoCE/IRN, iWarp and TCP/IP. We must realize that we cannot solve congestion when it’s due to a total lack of capacity. That’s a given. Still, managing occurrences of congestion optimally does help to make sure the experience isn’t a total catastrophe. This is tied into the question of how far you can rely on the cards at the source and target to handle it by themselves.
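To make the incast worry a bit more tangible, here is a deliberately simplified Python sketch of many synchronized senders bursting into one switch egress port with a shallow buffer. All the numbers are invented; the only point is that once the fan-in exceeds buffer plus drain capacity, something has to give: drops on a lossy fabric, pauses with PFC, or rate reduction with ECN-based schemes.

```python
# Toy incast model: N synchronized senders burst into one egress port.
# All numbers are invented for illustration only.

SENDERS = 32            # servers replying at the same time
BURST_PKTS = 16         # packets each sender puts on the wire at once
BUFFER_PKTS = 100       # shallow egress buffer on the switch port
DRAIN_PER_TICK = 40     # packets the egress link drains per time tick

def simulate_incast() -> None:
    queue = 0
    arriving = SENDERS * BURST_PKTS
    dropped = 0
    tick = 0
    while arriving > 0 or queue > 0:
        # Arrivals for this tick (the whole burst lands in tick 0 here).
        space = BUFFER_PKTS - queue
        accepted = min(arriving, space)
        dropped += arriving - accepted
        queue += accepted
        arriving = 0
        # Egress link drains what it can.
        queue = max(0, queue - DRAIN_PER_TICK)
        tick += 1
    print(f"{SENDERS * BURST_PKTS} packets offered, "
          f"{dropped} dropped, drained in {tick} ticks.")

if __name__ == "__main__":
    simulate_incast()
```

Run as-is it reports 412 of 512 packets dropped. A lossless fabric would instead pause the senders (with the head-of-line blocking that brings), and an ECN-based scheme would slow them down before the buffer fills; either way the congestion has to be handled somewhere.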
Am I being too cautious?
Maybe I’m a bit too cautious on this front? Lossless or not, when congestion becomes bad enough, latency spikes and perhaps things fall apart anyway, as the efforts to keep the network lossless slow things down beyond usability. Anyone who has seen the primary and secondary effects of storage latency on a high-availability cluster and its data can attest to the fact that it becomes pretty much useless. So, is the answer only to be found in overprovisioning or not?
For the use case of SMB Direct in all its applications, that approach seems straightforward as long as the deployment stamps are small enough. Use redundant, highly capable switches that won’t require exotic congestion management for normal S2D use cases, i.e. 4 to 16 nodes. Even with the bigger ones this could still go well. At least until we start having 16 * 1 TB NVDIMM storage devices in a single server, perhaps. Don’t laugh: look at NVMe, Optane, NVDIMM, etc. These have characteristics that make any existing OS, storage stack and file system struggle to keep up and leverage them.
Another reason why the network fabric might come into focus a bit more is the uplinks. With Windows Server 2019 Cluster Sets this will perhaps become an even more common attention point, as a lot of the inter-cluster operations become easier and more transparent, which might lead to uplinks being used more than ever. It’s worth taking into consideration at the design phase.
And there are many more use cases for low latency, high throughput & CPU offloading, so a valid solution for one use case might not be the best one for another.
What to make of all this?
If Mellanox is right, the future will look bright for easier-to-deploy RoCE. Or should I call that IRN? Most certainly if it can be combined with efforts to manage congestion more easily, hop to hop as well as end to end. At least someone is thinking about the network fabric and I have to applaud that. If those Improved RoCE NICs become reality, they might have all bases covered: the ease of iWarp, the tooling & plans to manage congestion on the network fabric, and the benefits of RDMA at its best.
Now, if iWarp is right, that would mean RoCE is overcomplicating things for no good reason. Don’t worry, be happy: RDMA on the NICs is where it is all at. That sounds great, but it would mean those network fabric people worrying about congestion might need to worry a bit more. There is no limit to what people can throw on the network with iWarp, and they don’t care.
If iWarp is right about TCP/IP and we let others worry about the network, that’s not just good news for iWarp. It’s also great news for the pure TCP/IP point of view. They can do what iWarp can do, but they don’t need RDMA to achieve it. So why limit yourself to RDMA workloads, with the associated cost and overhead of iWarp, if that’s potentially not needed? Some pundits will say there are only 3 players claiming this is possible, so is everybody else wrong or are they missing something? But judging this on the number of vendors alone isn’t a great idea, as 3 isn’t bad compared to the number of serious iWarp and RoCE players. I’m pretty sure current consumers of RDMA are at least intrigued by the idea that they could just use TCP/IP and forget all about the RDMA wars.
Conclusion
As with most things, the future will show us what happens. For now, to me, it seems that future will not be black and white. There will be many shades of gray. What technology is used depends on intrinsic scale and scalability, both scaling up and scaling out. It will also depend heavily on the segment of the market (cloud players versus server rooms) in combination with the nature of the workloads (NVMe fabrics, persistent memory, Live Migrations, Storage Replica, etc.).
Vendors will choose technologies based on their requirements and supportability. A prime example of this is Microsoft now recommending iWarp over RoCE. This doesn’t mean RoCE doesn’t work or is not needed. It means that Microsoft has too many support issues with badly implemented DCB (PFC/ETS). It means that iWarp delivers what they need and avoids the support calls caused by badly implemented RoCE. They’re happy.
The question is whether TCP/IP cards might really be a better choice in the future. Some companies seem to think so. And if RDMA is not needed to achieve high throughput, low latency and CPU offload, why bother? SMB Multichannel will still work and aggregate multiple NICs where appropriate if the workload is SMB. But even with non-SMB 3 workloads the benefits could become available. Interesting at least, no?
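Purely to illustrate the aggregation idea (this is not the SMB 3 protocol, and the endpoints and port below are made up), a sketch like the one here stripes a payload across several TCP connections, one per NIC or path. SMB Multichannel does something conceptually similar, plus a great deal more, inside the protocol.

```python
import socket
from typing import List, Tuple

# Hypothetical endpoints, one per NIC/path; replace with real addresses.
ENDPOINTS: List[Tuple[str, int]] = [
    ("192.0.2.10", 5001),
    ("192.0.2.11", 5001),
]
CHUNK = 64 * 1024  # 64 KiB per stripe

def send_striped(payload: bytes) -> None:
    """Round-robin chunks of 'payload' over one connection per endpoint.

    Illustrates aggregating multiple NICs with plain TCP connections;
    it is not an implementation of SMB Multichannel itself.
    """
    conns = [socket.create_connection(ep) for ep in ENDPOINTS]
    try:
        for i in range(0, len(payload), CHUNK):
            conn = conns[(i // CHUNK) % len(conns)]
            conn.sendall(payload[i:i + CHUNK])
    finally:
        for c in conns:
            c.close()

if __name__ == "__main__":
    send_striped(b"x" * (4 * 1024 * 1024))  # 4 MiB of dummy data
```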
As we get into a world of storage driven by persistent memory, it might turn out that RoCE or iWarp (RDMA) is still required to get the job done. All these technologies, as I see it today, will complement each other. Just like FC is not dying off but giving up part of the market to other storage fabrics that also create new markets. This is what happened with iSCSI. They all (still) have their role to play, maybe not for you personally, but within the greater ecosystem there is rarely one size that fits all.
Last but not least, if the workload and scale are big enough, don’t ignore congestion control on the network fabric. Interesting times lie ahead! You can always count on people taking the easy way, like water. They’ll gravitate towards effective results with little or no effort and minimal costs. The best advice I can give is to keep in mind your use case(s) and the context in which you work. Complexity is OK as long as you can manage it and you need it to achieve your goals. Remember that there is no free lunch.
Also, realize that technology evolves and so do the solutions to challenges. I for one keep a very interested eye on this technology and market. Expect me to report back with new insights when I get them. While technology developments move fast, it takes many “sales cycles” to see changes in the field. Hardware, infrastructure (storage, network, memory, compute) and software designs need to align to leverage new technologies.
Finally
If you like these sorts of discussions, I’d like to remind you that I do present on the subject of SMB Direct. By extension this also means RDMA and all things related to achieving what RDMA sets out to achieve. So, keep an eye out for my conference speaking engagements, or let me know if you’d like me to come talk at your conference or event. Educated customers can buy the solutions that fit them best. Uninformed customers are sold what suits the vendors’ interests. You choose!