
DRBD/LINSTOR vs Ceph vs StarWind VSAN: Proxmox HCI Performance Comparison

  • May 14, 2024
  • 44 min read
StarWind DevOps Team Lead. Volodymyr possesses broad expertise in virtualization, storage, and networking, with exceptional experience in architecture planning, storage protocols, hardware sourcing, and research.

Introduction

Hyperconverged infrastructure (HCI) has revolutionized the way organizations manage their IT environments. By blending compute, storage, and networking into a single platform, HCI offers simplicity, scalability, and cost-effectiveness.

But with so many HCI storage options available, choosing the right one can feel like navigating a maze. That’s where we come in. In this article, we’re going to take a deep dive into three popular HCI storage solutions, all running on a 2-node Proxmox cluster.

Here’s what we’ll be looking at:

  • StarWind Virtual SAN (VSAN) NVMe-oF over RDMA/TCP: This solution promises fast performance with its NVMe technology, coupled with host mirroring and MDRAID5.
  • Linstor/DRBD over TCP: Using DRBD technology, this solution offers robust host mirroring and MDRAID5, promising reliability and efficiency.
  • Ceph: While not intended for production use in a 2-node setup, we’ll still explore Ceph’s capabilities for the sake of understanding its performance in this scenario.

Our goal? To give you the inside scoop on how these solutions perform in real-world situations. We’ll be running tests, crunching numbers, and providing practical insights to help you make informed decisions about your HCI storage needs.

So, whether you’re an IT pro, a system architect, or just curious about HCI storage, buckle up as we dive into the world of hyperconvergence and uncover the secrets behind these HCI storage solutions.

Solutions Overview

StarWind VSAN NVMe-oF over RDMA:


StarWind VSAN NVMe-oF over RDMA leverages Remote Direct Memory Access (RDMA) technology for high-speed data transfers between nodes. In this scenario, each Proxmox node is equipped with 5 NVMe drives, ensuring fast and efficient storage access. To enable RDMA compatibility for the Mellanox Network Interface Cards (NICs) on the Proxmox nodes, Single Root I/O Virtualization (SR-IOV) is configured. This allows the NICs to create Virtual Functions (VFs), which are then passed through to the StarWind Controller Virtual Machine (CVM). Within the StarWind CVM, the 5 NVMe drives from each Proxmox node are assembled into a single MDRAID5 array using the mdadm utility with the following creation parameters: RAID level 5, a 4K chunk size, an internal bitmap, and 5 RAID devices.
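
To make this concrete, here is a minimal sketch of the two steps described above: exposing SR-IOV Virtual Functions on the host and assembling the MDRAID5 array inside the CVM. The interface and device names, and the VF count, are assumptions for illustration rather than the exact commands used in this lab.

  # On the Proxmox host: expose Virtual Functions on the Mellanox NIC
  # (interface name and VF count are assumptions)
  echo 4 > /sys/class/net/ens1f0np0/device/sriov_numvfs

  # Inside the StarWind CVM: assemble the 5 passed-through NVMe drives into MDRAID5
  # with the creation parameters listed above (RAID 5, 4K chunk, internal bitmap, 5 devices)
  mdadm --create /dev/md0 --level=5 --chunk=4K --bitmap=internal --raid-devices=5 \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1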

On top of the MDRAID5 array, two StarWind High Availability (HA) devices are created. This enhances fault tolerance and ensures continuous availability of data by replicating data between nodes.

StarWind VSAN NVMe-oF over TCP:

The configuration of this scenario is the same as StarWind VSAN NVMe-oF over RDMA; the only difference is the network setup. Unlike the RDMA setup, where SR-IOV is used to enable RDMA compatibility on the Mellanox NICs, the TCP configuration does not use SR-IOV. Instead, network communication runs over TCP using Virtio network adapters. The StarWind CVM uses Virtio adapters for both data and replication traffic, configured with multiqueue support (8 queues) to ensure efficient network utilization and enhance overall performance.
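
For reference, attaching such a multiqueue Virtio NIC to the CVM from the Proxmox side could look roughly like this; the VM ID and bridge name are assumptions:

  # Attach a Virtio NIC with 8 queues to the StarWind CVM (VM ID and bridge are assumptions)
  qm set 100 --net1 virtio,bridge=vmbr1,queues=8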

Linstor/DRBD over TCP:

Similar to other configurations, the Linstor/DRBD over TCP scenario incorporates 5 NVMe drives on each Proxmox node, assembled into MDRAID5 arrays. On top of the MDRAID5 arrays, Linstor establishes a storage pool and resource group, configuring them with the “--place-count=2” parameter. This parameter determines the number of replicas for data redundancy and fault tolerance.
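
As an illustration, the LINSTOR side of this layout could be defined roughly as follows, assuming an LVM volume group created on top of each MDRAID5 array; the node, pool, and resource-group names are hypothetical:

  # Register a storage pool on each node, backed by the MDRAID5-based volume group
  linstor storage-pool create lvm pve-node1 nvme_pool vg_md5
  linstor storage-pool create lvm pve-node2 nvme_pool vg_md5

  # Resource group with two replicas, matching the --place-count=2 parameter above
  linstor resource-group create sw_rg --storage-pool nvme_pool --place-count 2
  linstor volume-group create sw_rg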

A Quorum Device Node was used as a diskless node with the “quorum tiebreaker” feature to ensure reliable cluster decision-making in case of network partitioning or node failure. Since the Linstor diskless node (the Quorum Device Node) is connected to the Proxmox nodes via the 1GbE management network and has no RDMA-capable NIC, we couldn’t configure Linstor with RDMA transport.

Based on published performance recommendations, we applied several tweaks, including adjustments to parameters such as max-buffers, sndbuf-size, and al-extents. These tweaks fine-tune the DRBD configuration for better efficiency and throughput.
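
The exact values were not published here, but such DRBD options are typically applied through LINSTOR’s drbd-options commands. The sketch below is purely hypothetical: the values and the split between drbd-options and drbd-peer-options are assumptions worth verifying against the linstor client’s --help output.

  # Hypothetical tuning values for illustration only
  linstor resource-group drbd-options sw_rg --al-extents 6007
  linstor resource-group drbd-peer-options sw_rg --max-buffers 36864 --sndbuf-size 10485760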

Ceph:

In this scenario, the focus is on evaluating the performance of a 2-node Ceph cluster, keeping in mind that such a setup is not meant for production. The aim is simply to gain insight into Ceph’s potential performance in this specific context, purely “for science purposes”.

Based on this blog, for setting up the Ceph cluster, each Proxmox node is configured to host two OSDs per NVMe drive, resulting in a total of 20 OSDs across the cluster (10 per node). This configuration is established by running the command pveceph osd create /dev/nvme0n1 --osds-per-device=2 for each drive.

Furthermore, a pool for RBD (RADOS Block Device) devices is created with a replicated rule, ensuring data redundancy and fault tolerance. The size and min_size parameters are set to 2 and 1 respectively, defining the replication factor. Additionally, the placement group count is set to 256 to optimize data distribution and balancing within the cluster.

For performance tuning, only one parameter, osd_mclock_profile, is adjusted. This parameter, set to high_client_ops, is intended to optimize Ceph’s handling of client operations, potentially enhancing overall system responsiveness and throughput.
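
Taken together, the pool layout and the single tuning change described above map to something like the following native Ceph commands; the pool name is hypothetical:

  # Replicated RBD pool with 256 placement groups, size=2 and min_size=1
  ceph osd pool create vm-pool 256 256 replicated
  ceph osd pool set vm-pool size 2
  ceph osd pool set vm-pool min_size 1
  rbd pool init vm-pool

  # The only tuning parameter changed for this test
  ceph config set osd osd_mclock_profile high_client_ops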

Testbed overview:

The evaluation is performed on Supermicro SYS-220U-TNR nodes with Intel Xeon Platinum CPUs and Mellanox ConnectX-6 EN 200GbE NICs. Each node is equipped with 5x NVMe Micron 7450 MAX drives. Proxmox VE 8.1.3 serves as the hypervisor, with specific versions of StarWind VSAN, Linstor/DRBD, and Ceph.

Hardware:

Proxmox Node-{1..2} Supermicro SYS-220U-TNR
CPU Intel(R) Xeon(R) Platinum 8352Y @2.2GHz
Sockets 2
Cores/Threads 64/128
RAM 256GB
NICs 2x Mellanox ConnectX®-6 EN 200GbE (MCX613106A-VDA)
Storage 5x NVMe Micron 7450 MAX: U.3 3.2TB

Software:

Proxmox VE 8.1.3 (kernel 6.5.13-3-pve)
StarWind VSAN Version V8 (build 15260, CVM 20231016)
Linstor/DRBD 9.2.8
Ceph reef 18.2.2

StarWind CVM parameters:

CPU 32 vCPU
RAM 32GB
NICs 1x e1000 network adapter for management
     4x Mellanox ConnectX-6 Virtual Function network adapters (SR-IOV) for the NVMe-oF over RDMA scenario
     4x virtio network adapters for the NVMe-oF over TCP scenario
Storage MDRAID5 (5x NVMe Micron 7450 MAX: U.3 3.2TB)

Testing methodology

For our tests, we used a tool called FIO, which helps us measure performance in client/server mode. Here’s how we set things up: we had 16 virtual machines, each with 4 virtual CPUs, 8GB of RAM, and 3 raw virtual disks, each 100GB in size. Before running our tests, we made sure to fill these virtual disks with random data.

We tested various patterns to see how well our setup performs:

  • 4k random read;
  • 4k random read/write with a 70/30 ratio;
  • 4k random write;
  • 64k random read;
  • 64k random write;
  • 1M read;
  • 1M write.

It’s important to note that before running some of our tests, we warmed up the virtual disks to ensure consistent results. For example, we ran the 4k random write pattern for 4 hours before testing the 4k random read/write 70/30 and 4k random write patterns. Similarly, we ran the 64k random write pattern for 2 hours before testing the 64k random write pattern.

We repeated each test 3 times to get an average result, and each test had a duration of 600 seconds for read tests and 1800 seconds for write tests. This way, we could be sure our results were reliable and representative.
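
For reference, a standalone FIO invocation equivalent to one of these patterns (4k random read, numjobs=3, iodepth=32, 600 seconds) might look like the sketch below. In our runs the same job parameters were driven in client/server mode across the 16 VMs, and the in-guest device paths shown here are assumptions:

  # 4k random read against the three raw test disks of one VM
  fio --name=4k-randread --ioengine=libaio --direct=1 --rw=randread --bs=4k \
      --numjobs=3 --iodepth=32 --time_based --runtime=600 --group_reporting \
      --filename=/dev/vdb:/dev/vdc:/dev/vdd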

Benchmarking local NVMe performance

Before starting our evaluation, we wanted to make sure the NVMe drive lived up to its vendor’s promises, so we ran a series of tests to check whether its performance matched the vendor-claimed figures for the Micron 7450 MAX.

Using the FIO utility in client/server mode, we checked how well the NVMe drive handled various workload patterns. The following results have been achieved:

1x NVMe Micron 7450 MAX: U.3 3.2TB
Pattern Numjobs IOdepth IOPS MiB/s Latency (ms)
4k random read 6 32 997,000 3,894 0.192
4k random read/write 70/30 6 16 531,000 2,073 0.142
4k random write 4 4 385,000 1,505 0.041
64k random read 8 8 92,900 5,807 0.688
64k random write 2 1 27,600 1,724 0.072
1M read 1 8 6,663 6,663 1.200
1M write 1 2 5,134 5,134 0.389

The NVMe drive performed just as the vendor promised across all our tests. Whether it was handling small 4k reads or large 1M writes, it delivered on speed and consistency.

Tables with benchmark results

To provide a comprehensive analysis of the performance of different storage configurations, we conducted a series of benchmarks across various workload patterns. In addition to evaluating IOPS (Input/Output Operations per Second) and throughput, we also examined the performance dependency on CPU usage by introducing an additional parameter for 4k random read/write patterns: “IOPs per 1% CPU usage.”

This parameter was calculated using the formula:

IOPS per 1% CPU usage = IOPS / Proxmox Node count / Node CPU usage

Where:

  • IOPS represents the number of I/O operations per second for each pattern.
  • Proxmox Node count is 2 in our case (indicating the number of Proxmox nodes).
  • Node CPU usage denotes the CPU usage of one Proxmox node during the test.
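
For example, the first StarWind VSAN NVMe-oF over RDMA row below (2,276,000 IOPS at 47% node CPU usage) works out to 2,276,000 / 2 / 47 ≈ 24,213 IOPS per 1% CPU usage, which matches the value shown in the table.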

By incorporating this additional metric, we aimed to provide deeper insights into how CPU usage correlates with IOPS, offering a more nuanced understanding of performance characteristics.

Now let’s delve into the detailed benchmark results for each storage configuration.

StarWind VSAN NVMe-oF over RDMA

For the StarWind VSAN NVMe-oF over RDMA configuration, the performance was evaluated across multiple workload patterns. Notably, the 4k random read pattern showcased significant scalability with increasing IOdepth, reaching up to 4.8 million IOPS with an IOdepth of 64. Similarly, the 4k random read/write (70%/30%) pattern exhibited considerable performance, achieving up to 1.2 million IOPS with an IOdepth of 64. These results indicate efficient utilization of resources and high throughput capabilities in demanding scenarios.

Pattern Numjobs IOdepth IOPS MiB/s Latency (ms) Node CPU usage % IOPS per 1% CPU usage
4k random read 3 8 2,276,000 8,891 0.168 47.00% 24,213
4k random read 3 16 3,349,000 13,082 0.227 53.00% 31,594
4k random read 3 32 4,220,000 16,484 0.363 55.00% 38,364
4k random read 3 64 4,840,000 18,906 0.630 56.00% 43,214
4k random read 3 128 4,664,000 18,218 1.334 56.00% 41,643
4k random read/write (70%/30%) 3 8 1,066,000 4,164 0.477 40.00% 13,325
4k random read/write (70%/30%) 3 16 1,182,000 4,617 0.942 42.00% 14,071
4k random read/write (70%/30%) 3 32 1,193,000 4,660 1.900 41.00% 14,549
4k random read/write (70%/30%) 3 64 1,199,000 4,684 3.294 38.00% 15,776
4k random write 3 4 319,000 1,246 0.600 28.00% 5,696
4k random write 3 8 384,000 1,500 0.999 29.00% 6,621
4k random write 3 16 390,000 1,523 1.965 29.50% 6,610
4k random write 3 32 386,000 1,508 3.975 28.50% 6,772
64k random read 3 2 320,000 20,000 0.291 24.00%
64k random read 3 4 462,000 28,875 0.415 26.00%
64k random read 3 8 509,000 31,812 0.752 27.00%
64k random read 3 16 507,000 31,687 1.519 26.00%
64k random write 3 1 68,400 4,274 0.702 23.00%
64k random write 3 2 70,700 4,416 1.358 24.00%
64k random write 3 4 68,100 4,256 2.817 24.00%
1024k read 1 1 18,400 18,400 0.868 19.00%
1024k read 1 2 27,600 27,600 1.157 20.00%
1024k read 1 4 31,700 31,700 2.019 20.00%
1024k read 1 8 31,800 31,800 4.027 20.00%
1024k write 1 1 4,214 4,214 3.793 23.00%
1024k write 1 2 4,197 4,197 7.622 23.00%
1024k write 1 4 4,168 4,168 15.287 23.00%

StarWind VSAN NVMe-oF over TCP

In the StarWind VSAN NVMe-oF over TCP setup, the performance remained competitive, but with slightly lower IOPS compared to the RDMA configuration. However, the system demonstrated robust scalability, with the 4k random read pattern achieving over 2.3 million IOPS at an IOdepth of 32. The 4k random read/write (70%/30%) pattern also exhibited impressive performance, surpassing 887,000 IOPS at an IOdepth of 64. These results indicate the adaptability of the TCP protocol for NVMe-oF communication, providing high throughput and low latency.

Pattern Numjobs IOdepth IOPS MiB/s Latency (ms) Node CPU usage % IOPS per 1% CPU usage
4k random read 3 8 1,126,000 4,399 0.337 40.00% 14,075
4k random read 3 16 1,682,000 6,572 0.458 46.00% 18,283
4k random read 3 32 2,349,000 9,178 0.644 50.00% 23,490
4k random read 3 64 2,808,000 10,969 1.070 51.00% 27,529
4k random read 3 128 2,944,000 11,500 2.182 52.00% 28,308
4k random read/write (70%/30%) 3 8 463,000 1,808 1.068 32.00% 7,234
4k random read/write (70%/30%) 3 16 559,000 2,183 1.804 34.00% 8,221
4k random read/write (70%/30%) 3 32 710,000 2,773 2.814 37.00% 9,595
4k random read/write (70%/30%) 3 64 887,000 3,465 4.353 38.00% 11,671
4k random write 3 4 203,000 793 0.944 25.00% 4,060
4k random write 3 8 221,000 863 1.729 27.00% 4,093
4k random write 3 16 253,000 989 3.048 28.00% 4,518
4k random write 3 32 301,000 1,176 5.102 30.00% 5,017
64k random read 3 2 230,000 14,375 0.418 25.00%
64k random read 3 4 323,000 20,188 0.587 29.00%
64k random read 3 8 371,000 23,188 1.025 31.00%
64k random read 3 16 381,000 23,813 1.995 31.00%
64k random write 3 1 51,700 3,229 0.928 23.00%
64k random write 3 2 62,000 3,876 1.551 25.00%
64k random write 3 4 65,500 4,096 2.931 26.00%
1024k read 1 1 16,200 16,200 0.987 21.00%
1024k read 1 2 20,200 20,200 1.550 23.00%
1024k read 1 4 21,700 21,700 2.863 24.00%
1024k read 1 8 22,900 22,900 5.573 25.00%
1024k write 1 1 3,567 3,567 4.469 23.00%
1024k write 1 2 4,050 4,050 7.924 24.00%
1024k write 1 4 3,528 3,528 18.040 21.00%

Linstor/DRBD over TCP

In the Linstor/DRBD over TCP configuration, the system showcased commendable performance across various workload patterns. The 4k random read pattern demonstrated consistent scalability, achieving up to 1.5 million IOPS with an IOdepth of 32. Additionally, the 4k random read/write (70%/30%) pattern exhibited substantial performance, surpassing 1 million IOPS with an IOdepth of 64. These results highlight the effectiveness of the DRBD solution over TCP for storage replication, delivering high throughput and low latency.

Pattern Numjobs IOdepth IOPS MiB/s Latency (ms) Node CPU usage % IOPS per 1% CPU usage
4k random read 3 8 1,454,000 5,679 0.263 49.00% 14,837
4k random read 3 16 1,494,000 5,836 0.513 38.00% 19,658
4k random read 3 32 1,500,000 5,859 1.022 38.00% 19,737
4k random read 3 64 1,479,000 5,777 2.076 44.00% 16,807
4k random read 3 128 1,483,000 5,792 4.291 44.00% 16,852
4k random read/write (70%/30%) 3 8 646,000 2,523 0.889 32.00% 10,094
4k random read/write (70%/30%) 3 16 821,000 3,207 1.449 40.00% 10,263
4k random read/write (70%/30%) 3 32 918,000 3,586 2.670 47.00% 9,766
4k random read/write (70%/30%) 3 64 1,088,000 4,250 4.436 70.00% 7,771
4k random write 3 4 159,000 621 1.205 17.00% 4,676
4k random write 3 8 221,000 864 1.734 21.00% 5,262
4k random write 3 16 272,000 1,061 2.823 28.00% 4,857
4k random write 3 32 320,000 1,251 4.795 29.00% 5,517
64k random read 3 2 327,000 20,437 0.292 9.50%
64k random read 3 4 480,000 30,000 0.399 15.00%
64k random read 3 8 588,000 36,750 0.652 19.00%
64k random read 3 16 622,000 38,875 1.234 20.00%
64k random write 3 1 53,000 3,312 0.904 11.00%
64k random write 3 2 69,200 4,325 1.383 17.00%
64k random write 3 4 77,100 4,821 2.488 20.00%
1024k read 1 1 12,500 12,500 1.273 4.00%
1024k read 1 2 20,800 20,800 1.534 8.00%
1024k read 1 4 30,200 30,200 2.116 10.00%
1024k read 1 8 33,100 33,100 3.867 10.00%
1024k write 1 1 4,763 4,763 3.329 11.00%
1024k write 1 2 5,026 5,026 6.444 12.00%
1024k write 1 4 4,984 4,984 12.452 12.00%

Ceph

In the Ceph configuration, the performance was evaluated across different workload patterns, demonstrating the system’s capability to handle diverse workloads. Benchmark results show solid performance, though slightly lower compared to other solutions in some scenarios. For example, in the 4k random read pattern, IOPS ranged from 381,000 to 406,000, depending on the number of jobs and IOdepth.

Pattern Numjobs IOdepth IOPS MiB/s Latency (ms) Node CPU usage % IOPS per 1% CPU usage
4k random read 3 8 381,000 1,488 1.005 66.00% 2,886
4k random read 3 16 385,000 1,505 1.992 67.00% 2,873
4k random read 3 32 394,000 1,537 3.873 67.00% 2,940
4k random read 3 64 396,000 1,549 7.660 67.00% 2,955
4k random read 3 128 406,000 1,586 12.238 66.00% 3,076
4k random read/write (70%/30%) 3 8 253,100 989 1.797 77.00% 1,644
4k random read/write (70%/30%) 3 16 264,400 1,033 3.232 78.00% 1,695
4k random read/write (70%/30%) 3 32 269,000 1,050 6.074 80.00% 1,681
4k random read/write (70%/30%) 3 64 270,100 1,055 11.749 80.00% 1,688
4k random write 3 4 109,000 426 1.756 55.00% 991
4k random write 3 8 141,000 551 2.712 72.00% 979
4k random write 3 16 163,000 638 4.696 80.00% 1,019
4k random write 3 32 181,000 707 8.489 85.00% 1,065
64k random read 3 2 133,000 8,312 0.720 25.00%
64k random read 3 4 201,000 12,562 0.950 43.00%
64k random read 3 8 219,000 13,687 1.747 48.00%
64k random read 3 16 220,000 13,750 3.485 47.50%
64k random write 3 1 36,200 2,262 1.323 19.50%
64k random write 3 2 55,900 3,494 1.712 30.00%
64k random write 3 4 76,300 4,769 2.516 47.00%
1024k read 1 1 8,570 8,570 1.864 5.00%
1024k read 1 2 15,100 15,100 2.120 10.00%
1024k read 1 4 22,600 22,600 2.822 17.50%
1024k read 1 8 29,100 29,100 4.387 26.50%
1024k write 1 1 5,627 5,627 2.807 9.00%
1024k write 1 2 9,045 9,045 3.486 16.00%
1024k write 1 4 12,500 12,500 5.134 26.00%

Visualizing benchmarking results in charts

With all benchmarks completed and data collected, let’s use graphs to easily compare the results.

 


Figure 1: 4K RR (IOPS)

Figure 1 shows the number of Input/Output Operations Per Second (IOPS) achieved during 4k random read operations with Numjobs = 3. You can see that the performance of StarWind VSAN NVMe-oF over RDMA and StarWind VSAN NVMe-oF over TCP increases steadily as the queue depth increases, up to a point. This suggests that StarWind VSAN can benefit from queuing additional I/O requests to achieve better performance. Meanwhile, both Linstor/DRBD over TCP and Ceph reach their peak performance straight away. Increasing the queue depth for them only results in higher latency without performance growth.

StarWind VSAN NVMe-oF over RDMA raises the bar, showing a significant performance increase starting from queue depth = 8. At an IOdepth of 128, StarWind VSAN NVMe-oF over RDMA achieves 4,664,000 IOPS, while StarWind VSAN NVMe-oF over TCP achieves 2,944,000 IOPS. The gap widens further against Linstor/DRBD over TCP, whose 1,483,000 IOPS StarWind VSAN NVMe-oF over RDMA outperforms by around 214.1%. Against Ceph, the difference is even bigger, with StarWind VSAN NVMe-oF over RDMA surpassing Ceph’s 406,000 IOPS by more than 10 times.

 


Figure 2: 4K RR (Latency)

Figure 2 compares the latency when reading 4k data blocks. Not surprisingly, StarWind VSAN NVMe-oF over RDMA exhibits the lowest latency among the tested solutions, with an average of 0.5 milliseconds (ms) and a minimum of 0.168 ms. Moreover, in both the RDMA and TCP configurations, StarWind VSAN NVMe-oF maintains lower latency than the other solutions across all queue depths. In contrast, Linstor/DRBD over TCP and Ceph show a steeper latency increase, with Ceph being the slowest of all contenders.

 


Figure 3: 4K RR (IOPS per 1% CPU Usage)

CPU utilization efficiency is also a very important factor, as it is directly related to the cost-efficiency of the resulting solution, especially in a hyperconverged infrastructure (HCI).

Figure 3 shows IOPS per 1% CPU usage during a 4k random read test. StarWind VSAN NVMe-oF over RDMA and TCP demonstrate the most efficient CPU utilization, followed by Linstor/DRBD over TCP. Ceph lags behind the other solutions in terms of CPU efficiency, exhibiting the lowest IOPS per 1% CPU usage values, ranging from 2,886 to 3,076, and suggesting room for improvement in CPU resource utilization. At the same time, StarWind VSAN NVMe-oF over RDMA demonstrates notable performance improvements over the other configurations when comparing their highest values: outperforming its TCP counterpart by approximately 52.67%, Linstor/DRBD over TCP by about 118.92%, and Ceph by an impressive 13 times.

 


Figure 4: 4K RR/RW 70%/30% (IOPS)

Pure read or write workloads don’t occur all the time in real production scenarios, which is why benchmarking storage performance with a mixed 70/30 read-write pattern is important to get the full picture.

Figure 4 demonstrates the number of IOPS achieved during 70%/30% 4k random read/write operations with Numjobs = 3. Here, StarWind VSAN NVMe-oF over RDMA again takes the lead, achieving the maximum result of 1,199,000 IOPS and outperforming other storage solutions, especially in tests with lower IOdepths.

Interestingly, Linstor/DRBD handled mixed workloads remarkably well, taking a strong second place. When compared head-to-head with StarWind VSAN NVMe-oF over TCP, it scored 1,088,000 IOPS, a solid 22.6% advantage over the maximum 887,000 IOPS achieved by StarWind over TCP.

Ceph, on the other hand, consistently showed lower performance across the board, achieving a maximum of 270,000 IOPS.

 


Figure 5: 4K RR/RW 70%/30% (Latency)

Delving into latency dynamics, Figure 5 scrutinizes the delay times during 4k random read/write operations at a 70%/30% ratio with Numjobs = 3. Again, StarWind VSAN NVMe-oF over RDMA stands out with notably lower latency throughout the test, with Linstor/DRBD taking second place.

 


Figure 6: 4K RR/RW 70%/30% (IOPS per 1% CPU Usage)

Figure 6 explores the efficiency of IOPS relative to CPU utilization during the same mixed workload. Here, StarWind VSAN NVMe-oF over RDMA demonstrates superior efficiency during the ‘heaviest’ IOdepth 64 pattern, delivering better IOPS per 1% CPU usage by 35.17% over its TCP version, 103.0% over Linstor/DRBD over TCP, and a striking 834.5% over Ceph, marking a significant performance advantage.

Interestingly, when comparing the ‘lighter’ IOdepth patterns of 8 and 16, Linstor/DRBD takes the lead over StarWind’s NVMe-oF over TCP, with an average 25% advantage in CPU utilization. However, this advantage is completely mitigated during the ‘heavier’ tests within the IOdepth 32 pattern. Moreover, StarWind VSAN NVMe-oF over TCP regains the lead, achieving 50.17% better CPU utilization than Linstor/DRBD during IOdepth 64 tests. Fundamentally, this means that StarWind VSAN NVMe-oF over TCP maintains the same level of CPU usage efficiency across all patterns, while Linstor/DRBD starts wasting more CPU resources on heavier workloads.

 


Figure 7: 4K RW (IOPS)

Figure 7 presents the IOPS achieved during 4k random write operations, revealing that StarWind VSAN NVMe-oF over RDMA not only tops its TCP configuration by about 29% but also surpasses Linstor/DRBD by 22% and Ceph by 115%, showcasing the benefits of RDMA in write-intensive scenarios.

Once again, the close competition between StarWind VSAN NVMe-oF over TCP and Linstor/DRBD continues here, with StarWind VSAN showing a 27.6% advantage over DRBD at IOdepth=4, running neck and neck at IOdepth=8, and finally falling behind at heavier workloads, scoring 5.9% lower than DRBD. What a race!

 


Figure 8: 4K RW (Latency)

In assessing write latencies, Figure 8 confirms the previous IOPS results, with StarWind VSAN NVMe-oF over RDMA leading in minimizing delays and outperforming other solutions across the board.

The close race between Linstor/DRBD and StarWind VSAN’s TCP configuration is also prominent here.

 


Figure 9: 4K RW (IOPS per 1% CPU Usage)

Efficiency takes center stage in Figure 9, which measures IOPS per 1% CPU usage during a 4k random write test. StarWind VSAN NVMe-oF over RDMA demonstrates remarkable CPU utilization, significantly outstripping other configurations and illustrating its ability to deliver higher performance with less resource consumption. It is more efficient by approximately 34.95% over StarWind VSAN NVMe-oF over TCP, 22.75% over Linstor/DRBD over TCP, and a dramatic 536.07% over Ceph in 32 queue depth tests.

What’s even more interesting, Linstor/DRBD takes a respectable second place, outperforming the StarWind VSAN NVMe-oF over TCP configuration by 9.96% in IOdepth=32 tests.

 


Figure 10: 64K RR (Throughput)

Turning our focus to throughput, Figure 10 highlights how Linstor/DRBD over TCP excels at handling 64KB blocks, leveraging its local read policy, which serves reads from the local replica and avoids the TCP stack, to outperform the others. During this test, Linstor/DRBD over TCP emerges as a standout performer, achieving about 22.17% higher throughput than StarWind VSAN NVMe-oF over RDMA, 63.23% more than StarWind VSAN NVMe-oF over TCP, and a whopping 182.27% more than Ceph.

 


Figure 11: 64K RR (Latency)

Figure 11 compares latency during 64K random reads, where Linstor/DRBD over TCP (surprise surprise) consistently shows the lowest latency, emphasizing its efficiency in reading large-sized data blocks. In IOdepth=16 tests, Linstor/DRBD over TCP demonstrates 18.76% lower latency compared to StarWind VSAN NVMe-oF over RDMA, about 38.17% lower than StarWind VSAN NVMe-oF over TCP, and approximately 64.57% lower than Ceph.

 


Figure 12: 64K RR (CPU Usage)

CPU usage during 64K random reads is analyzed in Figure 12, which shows Linstor/DRBD over TCP with the most efficient CPU utilization. During IOdepth=2 tests, it uses on average 60.41% less CPU than the other solutions, highlighting its efficiency on lighter workloads. Interestingly, the gap narrows with heavier workloads, where Linstor/DRBD is 23.07% more efficient than StarWind VSAN NVMe-oF over RDMA, 35.48% more efficient than StarWind VSAN NVMe-oF over TCP, and 57.89% more efficient than Ceph.

 


Figure 13: 64K RW (Throughput)

The throughput for 64K random writes is presented in Figure 13. While Linstor/DRBD over TCP and Ceph demonstrate higher throughput at higher IOdepths, StarWind VSAN NVMe-oF over RDMA provides better throughput in the lower-IOdepth tests. At an IOdepth of 4, Linstor/DRBD surpasses StarWind VSAN NVMe-oF over RDMA by 13.3% and its TCP configuration by 17.7%. At IOdepth=1, by contrast, StarWind VSAN NVMe-oF over RDMA shows 29.04% better throughput than DRBD and 88.94% better than Ceph. StarWind VSAN in the TCP configuration, for its part, shows 2.50% lower throughput than DRBD over TCP at IOdepth=1 and a 15.03% deficit at IOdepth=4.

It’s also interesting how Ceph’s throughput ramps up as the IOdepth increases. This suggests that Ceph can handle more large-sized requests in parallel, reducing individual queuing times and leading to higher overall throughput.

 


Figure 14: 64K RW (Latency)

In Figure 14, the latency of 64K random writes presents the same pattern as seen in previous throughput figures. StarWind VSAN NVMe-oF over RDMA offers the fastest response times at lower IOdepths. However, at an IOdepth of 4, Linstor/DRBD over TCP takes the lead, showing a lower latency of 2.49 ms. Ceph also does a great job at higher IOdepths, showing similar results to DRBD.

 


Figure 15: 64K RW (CPU usage)

Let’s move on to Figure 15, which details CPU usage during 64K random writes. Linstor/DRBD over TCP comes out on top in this race, demonstrating the most efficient use of CPU resources across all tested configurations. Even at higher IOdepths, it showcases up to 16.66% lower CPU utilization compared to StarWind VSAN NVMe-oF over RDMA, 23.07% lower than VSAN over TCP, and 57.44% lower than Ceph.

 


Figure 16: 1024K R (Throughput)

The throughput results for 1024k reads, depicted in Figure 16, show that StarWind VSAN NVMe-oF over RDMA significantly outperforms all competitors from IOdepth 1 to 4. However, at IOdepth 8, Linstor/DRBD takes a slight edge, outperforming StarWind VSAN NVMe-oF over RDMA by 4.08%. StarWind VSAN’s TCP configuration shows 29.6% better performance than Linstor/DRBD at IOdepth=1, but at higher IOdepths DRBD regains the edge and surpasses StarWind VSAN NVMe-oF over TCP by 44.54% at IOdepth 8. Ceph, once again, starts much slower than the others but shows remarkable results at IOdepth 8, surpassing StarWind VSAN’s TCP configuration while trailing DRBD and StarWind VSAN NVMe-oF over RDMA by 13.74% and 9.27% respectively.

 


Figure 17: 1024K R (Latency)

Latency during the 1024k read test is explored in Figure 17, which reflects the same pattern observed in previous throughput figures. StarWind VSAN NVMe-oF over RDMA offers the fastest response times at IOdepths 1 to 4. However, at IOdepth 8, Linstor/DRBD over TCP takes the lead, demonstrating lower latency across the board.

 


Figure 18: 1024K R (CPU Usage)

CPU usage during 1024k reads is recorded in Figure 18, where both StarWind configurations exhibit higher resource use compared to competitors. Remarkably, Linstor/DRBD over TCP uses considerably less CPU, even at the maximum IOdepth—up to 50% less compared to StarWind VSAN NVMe-oF over RDMA. Ceph’s average CPU utilization is 20%, placing it around the middle of the pack when compared to the other three configurations, taking into account the respective throughput figures.

 


Figure 19: 1024K W (Throughput)

Figure 19 examines 1024k sequential write throughput, highlighting Ceph’s superior performance, especially as IOdepth increases. It surpasses Linstor/DRBD over TCP by approximately 148.95% and StarWind VSAN NVMe-oF over RDMA by about 196.67%, showcasing its effectiveness in handling large write operations. Similarly, against StarWind VSAN NVMe-oF over TCP, Ceph exhibits a substantial lead of roughly 208.64%.

It’s important to note that this level of Ceph’s performance is largely due to its operation in a mirror (RAID 1) configuration over the network in our case, while StarWind and Linstor/DRBD operate in RAID 51 (local RAID 5 combined with RAID 1 over the network).

 


Figure 20: 1024K W (Latency)

Figure 20 presents the same pattern, with Ceph leading at all IOdepths and showing lower latency than other competitors.

 


Figure 21: 1024K W (CPU Usage)

Lastly, Figure 21 focuses on CPU usage during 1024k writes. Linstor/DRBD over TCP stands out for its minimal CPU demand, demonstrating an improvement of about 34.04% in CPU efficiency over StarWind VSAN NVMe-oF over RDMA. Interestingly, both the RDMA and TCP configurations of StarWind VSAN show similar CPU usage figures.

Conclusion

Our in-depth benchmarking analysis has shed light on how different storage setups perform across various tasks. We’ve seen that StarWind VSAN NVMe-oF over RDMA delivers impressive performance in 4k workloads, significantly outperforming competitors in terms of both performance and lower CPU utilization.

Meanwhile, StarWind VSAN NVMe-oF over TCP also shows commendable performance in 4k workloads, although it slightly trails its RDMA counterpart and begins to lag behind Linstor in some mixed and large-sized, write-intensive tests.

Linstor/DRBD over TCP holds its own as well, particularly in mixed and large-size workloads such as 64K and 1024K. In some write tests, Linstor even outperformed StarWind VSAN over RDMA, which is a notable achievement for a TCP-based solution.

On the other hand, Ceph shows mixed results. While it excels at writing 1024K blocks, it falls short in almost all other scenarios, particularly in read workloads, and it lags behind in CPU efficiency.

The simple conclusion is that StarWind VSAN is perfect for running most kinds of virtualized workloads: general-use VMs, VDI, databases, and transaction-intensive tasks. Linstor/DRBD is also a good choice for most virtualization workloads, especially in environments where larger sequential reads and writes are typical. Ceph, meanwhile, is a good option for running data warehousing and analytics applications where 4K-64K IOPS are less relevant than high throughput during large sequential reads and writes. It’s also well-suited for environments dealing with video streaming and large file processing.

Ultimately, storage performance is not the sole factor to consider when building an HCI infrastructure. Ease of configuration and maintenance, necessary data protection and optimization features, commercial support, and other factors are also very important. Stay tuned, and we will help you get a bigger picture in our upcoming articles.

Hey! Found Volodymyr’s insights useful? Looking for a cost-effective, high-performance, and easy-to-use hyperconverged platform?
Taras Shved, StarWind HCI Appliance Product Manager
Look no further! StarWind HCI Appliance (HCA) is a plug-and-play solution that combines compute, storage, networking, and virtualization software into a single easy-to-use hyperconverged platform. It's designed to significantly trim your IT costs and save valuable time. Interested in learning more? Book your StarWind HCA demo now to see it in action!