Introduction
Hyperconverged infrastructure (HCI) has revolutionized the way organizations manage their IT environments. By blending compute, storage, and networking into a single platform, HCI offers simplicity, scalability, and cost-effectiveness.
But with so many HCI storage options available, choosing the right one can feel like navigating a maze. That’s where we come in. In this article, we’re going to take a deep dive into three popular HCI storage solutions, all running on a 2-node Proxmox cluster.
Here’s what we’ll be looking at:
- StarWind Virtual SAN (VSAN) NVMe-oF over RDMA/TCP: This solution promises fast performance with its NVMe technology, coupled with host mirroring and MDRAID-5.
- Linstor/DRBD over TCP: Using DRBD technology, this solution offers robust host mirroring and MDRAID-5, promising reliability and efficiency.
- Ceph: While not intended for production use in a 2-node setup, we’ll still explore Ceph’s capabilities for the sake of understanding its performance in this scenario.
Our goal? To give you the inside scoop on how these solutions perform in real-world situations. We’ll be running tests, crunching numbers, and providing practical insights to help you make informed decisions about your HCI storage needs.
So, whether you’re an IT pro, a system architect, or just curious about HCI storage, buckle up as we dive into the world of hyperconvergence and uncover the secrets behind these HCI storage solutions.
Solutions Overview
StarWind VSAN NVMe-oF over RDMA:
StarWind VSAN NVMe-oF over RDMA leverages Remote Direct Memory Access (RDMA) technology for high-speed data transfers between nodes. In this scenario, each Proxmox node is equipped with 5 NVMe drives, ensuring fast and efficient storage access. To enable RDMA compatibility for the Mellanox Network Interface Cards (NICs) on the Proxmox nodes, Single Root I/O Virtualization (SR-IOV) is configured. This allows the NICs to create Virtual Functions (VFs), which are then passed through to the StarWind Controller Virtual Machine (CVM). Within the StarWind CVM, the 5 NVMe drives from each Proxmox node are assembled into a single MDRAID5 array. This is achieved using the mdadm utility with specific creation parameters, including RAID level 5, a chunk size of 4K, an internal bitmap, and 5 RAID devices.
#MDRAID5 creation parameters
mdadm --create /dev/md0 --level=5 --chunk=4K --bitmap=internal --raid-devices=5 /dev/nvme0n1p1 /dev/nvme1n1p1 /dev/nvme2n1p1 /dev/nvme3n1p1 /dev/nvme4n1p1

#udev rule for performance tuning
cat /etc/udev/rules.d/60-mdraid5.rules:
SUBSYSTEM=="block", KERNEL=="md*", ACTION=="add|change", ATTR{md/group_thread_cnt}="6"
SUBSYSTEM=="block", KERNEL=="md*", ACTION=="add|change", ATTR{md/stripe_cache_size}="512"
On top of the MDRAID5 array, two StarWind High Availability (HA) devices are created. This enhances fault tolerance and ensures continuous availability of data by replicating data between nodes.
StarWind VSAN NVMe-oF over TCP:
The configuration of this scenario is the same as StarWind VSAN NVMe-oF over RDMA; the only difference is in the network setup. Unlike the RDMA setup, where SR-IOV is used to enable RDMA compatibility on the Mellanox NICs, the TCP configuration does not use SR-IOV. Instead, network communication runs over TCP using Virtio network adapters. The StarWind CVM uses Virtio adapters for both data and replication traffic, configured with multiqueue enabled (8 queues) to ensure efficient network utilization and enhance overall performance.
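For reference, multiqueue is set per Virtio NIC in the VM configuration; a minimal sketch of attaching such an adapter to the CVM could look like this (the VM ID, MAC address, and bridge name are placeholder assumptions, not the values used in our lab):

# Attach a Virtio data NIC with 8 queues to the StarWind CVM (values are examples)
qm set 100 --net1 virtio=DE:AD:BE:EF:00:01,bridge=vmbr1,queues=8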
Linstor/DRBD over TCP:
Similar to other configurations, the Linstor/DRBD over TCP scenario incorporates 5 NVMe drives on each Proxmox node, assembled into MDRAID5 arrays. On top of the MDRAID5 arrays, Linstor establishes a storage pool and resource group, configuring them with the "--place-count=2" parameter. This parameter determines the number of replicas for data redundancy and fault tolerance.
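As an illustration, the LINSTOR side of such a setup might look roughly like the following; the node, pool, volume group, and resource names are placeholder assumptions, and an LVM volume group is assumed on top of the MDRAID5 array:

# Register an LVM-backed storage pool on both nodes (names are examples)
linstor storage-pool create lvm pve-node1 nvme_pool vg_md0
linstor storage-pool create lvm pve-node2 nvme_pool vg_md0
# Resource group with two replicas, plus a volume group and a spawned test resource
linstor resource-group create rg_nvme --storage-pool nvme_pool --place-count 2
linstor volume-group create rg_nvme
linstor resource-group spawn-resources rg_nvme vm-disk-1 100G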
The Quorum Device Node was used as the diskless node with the “Quorum tiebreaker” feature for ensuring reliable cluster decision-making in case of network partitioning or node failure. Since the Linstor Diskless Node (Quorum Device Node) is connected to the Proxmox nodes via the 1GbE management network and doesn’t have RDMA compatibility, we couldn’t configure Linstor with RDMA transport.
Based on the performance recommendations, we applied several tweaks, including adjustments to parameters such as max-buffers, sndbuf-size, and al-extents. These tweaks fine-tune the DRBD configuration for better efficiency and throughput.
linstor controller drbd-options --max-buffers 8000 --max-epoch-size 8000
linstor controller drbd-options --sndbuf-size 2M
linstor controller drbd-options --al-extents 6007
linstor controller drbd-options --disk-barrier no --disk-flushes no
Ceph:
In this scenario, the focus is on evaluating the performance of a 2-node Ceph cluster, keeping in mind that such a setup is not intended for production. The aim is simply to gain insight into Ceph's potential performance in this specific context, purely "for science purposes".
Based on this blog, for setting up the Ceph cluster, each Proxmox node is configured to host two OSDs per NVMe drive, resulting in a total of 20 OSDs across the cluster (10 per node). This configuration is established using the command pveceph osd create /dev/nvme0n1 --osds-per-device=2.
Furthermore, a pool for RBD (RADOS Block Device) devices is created with a replicated rule, ensuring data redundancy and fault tolerance. The size and min_size parameters are set to 2 and 1 respectively, defining the replication factor. Additionally, the placement group count is set to 256 to optimize data distribution and balancing within the cluster.
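For illustration, the pool described above could be created on Proxmox roughly as follows (the pool name is a placeholder assumption):

# Two OSDs per NVMe drive, repeated for each of the 5 drives on both nodes
pveceph osd create /dev/nvme0n1 --osds-per-device 2
# Replicated RBD pool: size 2, min_size 1, 256 placement groups
pveceph pool create vm-pool --size 2 --min_size 1 --pg_num 256 --application rbd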
For performance tuning, only one parameter, osd_mclock_profile, is adjusted. This parameter, set to high_client_ops, is intended to optimize Ceph’s handling of client operations, potentially enhancing overall system responsiveness and throughput.
ceph config set osd osd_mclock_profile high_client_ops
Testbed overview:
The evaluation is performed on Supermicro SYS-220U-TNR nodes with Intel Xeon Platinum CPUs and Mellanox ConnectX-6 EN 200GbE NICs. Each node is equipped with 5x NVMe Micron 7450 MAX drives. Proxmox VE 8.1.3 serves as the hypervisor, with specific versions of StarWind VSAN, Linstor/DRBD, and Ceph.
Hardware:
Proxmox Node-{1..2} | Supermicro SYS-220U-TNR |
---|---|
CPU | Intel(R) Xeon(R) Platinum 8352Y @2.2GHz |
Sockets | 2 |
Cores/Threads | 64/128 |
RAM | 256GB |
NICs | 2x Mellanox ConnectX®-6 EN 200GbE (MCX613106A-VDA) |
Storage | 5x NVMe Micron 7450 MAX: U.3 3.2TB |
Software:
Proxmox VE | 8.1.3 (kernel 6.5.13-3-pve) |
---|---|
StarWind VSAN | Version V8 (build 15260, CVM 20231016) |
Linstor/DRBD | 9.2.8 |
Ceph | reef 18.2.2 |
StarWind CVM parameters:
CPU | 32 vCPU |
---|---|
RAM | 32GB |
NICs | 1x e1000 network adapter for management; 4x Mellanox ConnectX-6 Virtual Function network adapters (SR-IOV) for the NVMe-oF over RDMA scenario; 4x virtio network adapters for the NVMe-oF over TCP scenario |
Storage | MDRAID5 (5x NVMe Micron 7450 MAX: U.3 3.2TB) |
Testing methodology
For our tests, we used a tool called FIO, which helps us measure performance in client/server mode. Here’s how we set things up: we had 16 virtual machines, each with 4 virtual CPUs, 8GB of RAM, and 3 raw virtual disks, each 100GB in size. Before running our tests, we made sure to fill these virtual disks with random data.
We tested various patterns to see how well our setup performs:
- 4k random read;
- 4k random read/write with a 70/30 ratio;
- 4k random write;
- 64k random read;
- 64k random write;
- 1M read;
- 1M write.
It’s important to note that before running some of our tests, we warmed up the virtual disks to ensure consistent results. For example, we ran the 4k random write pattern for 4 hours before testing the 4k random read/write 70/30 and 4k random write patterns. Similarly, we ran the 64k random write pattern for 2 hours before testing the 64k random write pattern.
We repeated each test 3 times to get an average result, and each test had a duration of 600 seconds for read tests and 1800 seconds for write tests. This way, we could be sure our results were reliable and representative.
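To make the workload definitions more concrete, below is a rough sketch of how one of these patterns could be launched with FIO in client/server mode. The hostname, device paths, and job layout are illustrative assumptions, not the exact job files used in these tests:

# Inside each test VM: start the FIO server
fio --server

# From the control host: 4k random read against the VM's 3 raw virtual disks
# (one such client invocation per VM, 16 in parallel)
fio --client=vm01 --name=4k-randread --rw=randread --bs=4k \
    --ioengine=libaio --direct=1 --numjobs=3 --iodepth=32 \
    --time_based --runtime=600 --group_reporting \
    --filename=/dev/vdb:/dev/vdc:/dev/vdd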
Benchmarking local NVMe performance
Before starting our evaluation, we wanted to make sure the NVMe drive lived up to its vendor's promises, so we ran a series of tests and compared the results against the vendor-claimed performance figures.
Using the FIO utility in client/server mode, we checked how well the NVMe drive handled various workload patterns. The following results have been achieved:
1x NVMe Micron 7450 MAX: U.3 3.2TB | |||||
---|---|---|---|---|---|
Pattern | Numjobs | IOdepth | IOPS | MiB/s | Latency (ms) |
4k random read | 6 | 32 | 997,000 | 3,894 | 0.192 |
4k random read/write 70/30 | 6 | 16 | 531,000 | 2,073 | 0.142 |
4k random write | 4 | 4 | 385,000 | 1,505 | 0.041 |
64k random read | 8 | 8 | 92,900 | 5,807 | 0.688 |
64k random write | 2 | 1 | 27,600 | 1,724 | 0.072 |
1M read | 1 | 8 | 6,663 | 6,663 | 1.200 |
1M write | 1 | 2 | 5,134 | 5,134 | 0.389 |
The NVMe drive performed just as the vendor promised across all our tests. Whether it was handling small 4k reads or large 1M writes, it delivered on speed and consistency.
Tables with benchmark results
To provide a comprehensive analysis of the performance of different storage configurations, we conducted a series of benchmarks across various workload patterns. In addition to evaluating IOPS (Input/Output Operations per Second) and throughput, we also examined the performance dependency on CPU usage by introducing an additional parameter for 4k random read/write patterns: “IOPs per 1% CPU usage.”
This parameter was calculated using the formula:
IOPS per 1% CPU usage = IOPS / Proxmox Node count / Node CPU usage
Where:
- IOPS represents the number of I/O operations per second for each pattern.
- Proxmox Node count is 2 in our case (indicating the number of Proxmox nodes).
- Node CPU usage denotes the CPU usage of one Proxmox node during the test.
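For example, the peak 4k random read result for StarWind VSAN NVMe-oF over RDMA works out to 4,840,000 / 2 / 56 ≈ 43,214 IOPS per 1% CPU usage, which matches the value reported in the corresponding table below.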
By incorporating this additional metric, we aimed to provide deeper insights into how CPU usage correlates with IOPS, offering a more nuanced understanding of performance characteristics.
Now let’s delve into the detailed benchmark results for each storage configuration.
StarWind VSAN NVMe-oF over RDMA
For the StarWind VSAN NVMe-oF over RDMA configuration, the performance was evaluated across multiple workload patterns. Notably, the 4k random read pattern showcased significant scalability with increasing IOdepth, reaching up to 4.8 million IOPS with an IOdepth of 64. Similarly, the 4k random read/write (70%/30%) pattern exhibited considerable performance, achieving up to 1.2 million IOPS with an IOdepth of 64. These results indicate efficient utilization of resources and high throughput capabilities in demanding scenarios.
Pattern | Numjobs | IOdepth | IOPS | MiB/s | Latency (ms) | Node CPU usage % | IOPS per 1% CPU usage |
---|---|---|---|---|---|---|---|
4k random read | 3 | 8 | 2,276,000 | 8,891 | 0.168 | 47.00% | 24,213 |
4k random read | 3 | 16 | 3,349,000 | 13,082 | 0.227 | 53.00% | 31,594 |
4k random read | 3 | 32 | 4,220,000 | 16,484 | 0.363 | 55.00% | 38,364 |
4k random read | 3 | 64 | 4,840,000 | 18,906 | 0.630 | 56.00% | 43,214 |
4k random read | 3 | 128 | 4,664,000 | 18,218 | 1.334 | 56.00% | 41,643 |
4k random read/write (70%/30%) | 3 | 8 | 1,066,000 | 4,164 | 0.477 | 40.00% | 13,325 |
4k random read/write (70%/30%) | 3 | 16 | 1,182,000 | 4,617 | 0.942 | 42.00% | 14,071 |
4k random read/write (70%/30%) | 3 | 32 | 1,193,000 | 4,660 | 1.900 | 41.00% | 14,549 |
4k random read/write (70%/30%) | 3 | 64 | 1,199,000 | 4,684 | 3.294 | 38.00% | 15,776 |
4k random write | 3 | 4 | 319,000 | 1,246 | 0.600 | 28.00% | 5,696 |
4k random write | 3 | 8 | 384,000 | 1,500 | 0.999 | 29.00% | 6,621 |
4k random write | 3 | 16 | 390,000 | 1,523 | 1.965 | 29.50% | 6,610 |
4k random write | 3 | 32 | 386,000 | 1,508 | 3.975 | 28.50% | 6,772 |
64k random read | 3 | 2 | 320,000 | 20,000 | 0.291 | 24.00% | |
64k random read | 3 | 4 | 462,000 | 28,875 | 0.415 | 26.00% | |
64k random read | 3 | 8 | 509,000 | 31,812 | 0.752 | 27.00% | |
64k random read | 3 | 16 | 507,000 | 31,687 | 1.519 | 26.00% | |
64k random write | 3 | 1 | 68,400 | 4,274 | 0.702 | 23.00% | |
64k random write | 3 | 2 | 70,700 | 4,416 | 1.358 | 24.00% | |
64k random write | 3 | 4 | 68,100 | 4,256 | 2.817 | 24.00% | |
1024k read | 1 | 1 | 18,400 | 18,400 | 0.868 | 19.00% | |
1024k read | 1 | 2 | 27,600 | 27,600 | 1.157 | 20.00% | |
1024k read | 1 | 4 | 31,700 | 31,700 | 2.019 | 20.00% | |
1024k read | 1 | 8 | 31,800 | 31,800 | 4.027 | 20.00% | |
1024k write | 1 | 1 | 4,214 | 4,214 | 3.793 | 23.00% | |
1024k write | 1 | 2 | 4,197 | 4,197 | 7.622 | 23.00% | |
1024k write | 1 | 4 | 4,168 | 4,168 | 15.287 | 23.00% |
StarWind VSAN NVMe-oF over TCP
In the StarWind VSAN NVMe-oF over TCP setup, the performance remained competitive, but with slightly lower IOPS compared to the RDMA configuration. However, the system demonstrated robust scalability, with the 4k random read pattern achieving over 2.3 million IOPS at an IOdepth of 32. The 4k random read/write (70%/30%) pattern also exhibited impressive performance, reaching 887,000 IOPS at an IOdepth of 64. These results indicate the adaptability of the TCP protocol for NVMe-oF communication, providing high throughput and low latency.
Pattern | Numjobs | IOdepth | IOPS | MiB/s | Latency (ms) | Node CPU usage % | IOPS per 1% CPU usage |
---|---|---|---|---|---|---|---|
4k random read | 3 | 8 | 1,126,000 | 4,399 | 0.337 | 40.00% | 14,075 |
4k random read | 3 | 16 | 1,682,000 | 6,572 | 0.458 | 46.00% | 18,283 |
4k random read | 3 | 32 | 2,349,000 | 9,178 | 0.644 | 50.00% | 23,490 |
4k random read | 3 | 64 | 2,808,000 | 10,969 | 1.070 | 51.00% | 27,529 |
4k random read | 3 | 128 | 2,944,000 | 11,500 | 2.182 | 52.00% | 28,308 |
4k random read/write (70%/30%) | 3 | 8 | 463,000 | 1,808 | 1.068 | 32.00% | 7,234 |
4k random read/write (70%/30%) | 3 | 16 | 559,000 | 2,183 | 1.804 | 34.00% | 8,221 |
4k random read/write (70%/30%) | 3 | 32 | 710,000 | 2,773 | 2.814 | 37.00% | 9,595 |
4k random read/write (70%/30%) | 3 | 64 | 887,000 | 3,465 | 4.353 | 38.00% | 11,671 |
4k random write | 3 | 4 | 203,000 | 793 | 0.944 | 25.00% | 4,060 |
4k random write | 3 | 8 | 221,000 | 863 | 1.729 | 27.00% | 4,093 |
4k random write | 3 | 16 | 253,000 | 989 | 3.048 | 28.00% | 4,518 |
4k random write | 3 | 32 | 301,000 | 1,176 | 5.102 | 30.00% | 5,017 |
64k random read | 3 | 2 | 230,000 | 14,375 | 0.418 | 25.00% | |
64k random read | 3 | 4 | 323,000 | 20,188 | 0.587 | 29.00% | |
64k random read | 3 | 8 | 371,000 | 23,188 | 1.025 | 31.00% | |
64k random read | 3 | 16 | 381,000 | 23,813 | 1.995 | 31.00% | |
64k random write | 3 | 1 | 51,700 | 3,229 | 0.928 | 23.00% | |
64k random write | 3 | 2 | 62,000 | 3,876 | 1.551 | 25.00% | |
64k random write | 3 | 4 | 65,500 | 4,096 | 2.931 | 26.00% | |
1024k read | 1 | 1 | 16,200 | 16,200 | 0.987 | 21.00% | |
1024k read | 1 | 2 | 20,200 | 20,200 | 1.550 | 23.00% | |
1024k read | 1 | 4 | 21,700 | 21,700 | 2.863 | 24.00% | |
1024k read | 1 | 8 | 22,900 | 22,900 | 5.573 | 25.00% | |
1024k write | 1 | 1 | 3,567 | 3,567 | 4.469 | 23.00% | |
1024k write | 1 | 2 | 4,050 | 4,050 | 7.924 | 24.00% | |
1024k write | 1 | 4 | 3,528 | 3,528 | 18.040 | 21.00% |
Linstor/DRBD over TCP
In the Linstor/DRBD over TCP configuration, the system showcased commendable performance across various workload patterns. The 4k random read pattern demonstrated consistent scalability, achieving up to 1.5 million IOPS with an IOdepth of 32. Additionally, the 4k random read/write (70%/30%) pattern exhibited substantial performance, surpassing 1 million IOPS with an IOdepth of 64. These results highlight the effectiveness of the DRBD solution over TCP for storage replication, delivering high throughput and low latency.
Pattern | Numjobs | IOdepth | IOPS | MiB/s | Latency (ms) | Node CPU usage % | IOPS per 1% CPU usage |
---|---|---|---|---|---|---|---|
4k random read | 3 | 8 | 1,454,000 | 5,679 | 0.263 | 49.00% | 14,837 |
4k random read | 3 | 16 | 1,494,000 | 5,836 | 0.513 | 38.00% | 19,658 |
4k random read | 3 | 32 | 1,500,000 | 5,859 | 1.022 | 38.00% | 19,737 |
4k random read | 3 | 64 | 1,479,000 | 5,777 | 2.076 | 44.00% | 16,807 |
4k random read | 3 | 128 | 1,483,000 | 5,792 | 4.291 | 44.00% | 16,852 |
4k random read/write (70%/30%) | 3 | 8 | 646,000 | 2,523 | 0.889 | 32.00% | 10,094 |
4k random read/write (70%/30%) | 3 | 16 | 821,000 | 3,207 | 1.449 | 40.00% | 10,263 |
4k random read/write (70%/30%) | 3 | 32 | 918,000 | 3,586 | 2.670 | 47.00% | 9,766 |
4k random read/write (70%/30%) | 3 | 64 | 1,088,000 | 4,250 | 4.436 | 70.00% | 7,771 |
4k random write | 3 | 4 | 159,000 | 621 | 1.205 | 17.00% | 4,676 |
4k random write | 3 | 8 | 221,000 | 864 | 1.734 | 21.00% | 5,262 |
4k random write | 3 | 16 | 272,000 | 1,061 | 2.823 | 28.00% | 4,857 |
4k random write | 3 | 32 | 320,000 | 1,251 | 4.795 | 29.00% | 5,517 |
64k random read | 3 | 2 | 327,000 | 20,437 | 0.292 | 9.50% | |
64k random read | 3 | 4 | 480,000 | 30,000 | 0.399 | 15.00% | |
64k random read | 3 | 8 | 588,000 | 36,750 | 0.652 | 19.00% | |
64k random read | 3 | 16 | 622,000 | 38,875 | 1.234 | 20.00% | |
64k random write | 3 | 1 | 53,000 | 3,312 | 0.904 | 11.00% | |
64k random write | 3 | 2 | 69,200 | 4,325 | 1.383 | 17.00% | |
64k random write | 3 | 4 | 77,100 | 4,821 | 2.488 | 20.00% | |
1024k read | 1 | 1 | 12,500 | 12,500 | 1.273 | 4.00% | |
1024k read | 1 | 2 | 20,800 | 20,800 | 1.534 | 8.00% | |
1024k read | 1 | 4 | 30,200 | 30,200 | 2.116 | 10.00% | |
1024k read | 1 | 8 | 33,100 | 33,100 | 3.867 | 10.00% | |
1024k write | 1 | 1 | 4,763 | 4,763 | 3.329 | 11.00% | |
1024k write | 1 | 2 | 5,026 | 5,026 | 6.444 | 12.00% | |
1024k write | 1 | 4 | 4,984 | 4,984 | 12.452 | 12.00% |
Ceph
In the Ceph configuration, the performance was evaluated across different workload patterns, demonstrating the system's capability to handle diverse workloads. Benchmark results show solid performance, though slightly lower compared to other solutions in some scenarios. For example, in the 4k random read pattern, IOPS ranged from 381,000 to 406,000, depending on the IOdepth.
Pattern | Numjobs | IOdepth | IOPS | MiB/s | Latency (ms) | Node CPU usage % | IOPS per 1% CPU usage |
---|---|---|---|---|---|---|---|
4k random read | 3 | 8 | 381,000 | 1,488 | 1.005 | 66.00% | 2,886 |
4k random read | 3 | 16 | 385,000 | 1,505 | 1.992 | 67.00% | 2,873 |
4k random read | 3 | 32 | 394,000 | 1,537 | 3.873 | 67.00% | 2,940 |
4k random read | 3 | 64 | 396,000 | 1,549 | 7.660 | 67.00% | 2,955 |
4k random read | 3 | 128 | 406,000 | 1,586 | 12.238 | 66.00% | 3,076 |
4k random read/write (70%/30%) | 3 | 8 | 253,100 | 989 | 1.797 | 77.00% | 1,644 |
4k random read/write (70%/30%) | 3 | 16 | 264,400 | 1,033 | 3.232 | 78.00% | 1,695 |
4k random read/write (70%/30%) | 3 | 32 | 269,000 | 1,050 | 6.074 | 80.00% | 1,681 |
4k random read/write (70%/30%) | 3 | 64 | 270,100 | 1,055 | 11.749 | 80.00% | 1,688 |
4k random write | 3 | 4 | 109,000 | 426 | 1.756 | 55.00% | 991 |
4k random write | 3 | 8 | 141,000 | 551 | 2.712 | 72.00% | 979 |
4k random write | 3 | 16 | 163,000 | 638 | 4.696 | 80.00% | 1,019 |
4k random write | 3 | 32 | 181,000 | 707 | 8.489 | 85.00% | 1,065 |
64k random read | 3 | 2 | 133,000 | 8,312 | 0.720 | 25.00% | |
64k random read | 3 | 4 | 201,000 | 12,562 | 0.950 | 43.00% | |
64k random read | 3 | 8 | 219,000 | 13,687 | 1.747 | 48.00% | |
64k random read | 3 | 16 | 220,000 | 13,750 | 3.485 | 47.50% | |
64k random write | 3 | 1 | 36,200 | 2,262 | 1.323 | 19.50% | |
64k random write | 3 | 2 | 55,900 | 3,494 | 1.712 | 30.00% | |
64k random write | 3 | 4 | 76,300 | 4,769 | 2.516 | 47.00% | |
1024k read | 1 | 1 | 8,570 | 8,570 | 1.864 | 5.00% | |
1024k read | 1 | 2 | 15,100 | 15,100 | 2.120 | 10.00% | |
1024k read | 1 | 4 | 22,600 | 22,600 | 2.822 | 17.50% | |
1024k read | 1 | 8 | 29,100 | 29,100 | 4.387 | 26.50% | |
1024k write | 1 | 1 | 5,627 | 5,627 | 2.807 | 9.00% | |
1024k write | 1 | 2 | 9,045 | 9,045 | 3.486 | 16.00% | |
1024k write | 1 | 4 | 12,500 | 12,500 | 5.134 | 26.00% |
Visualizing benchmarking results in charts
With all benchmarks completed and data collected, let’s use graphs to easily compare the results.
Figure 1 shows the number of Input/Output Operations Per Second (IOPS) achieved during 4k random read operations with Numjobs = 3. You can see that the performance of StarWind VSAN NVMe-oF over RDMA and StarWind VSAN NVMe-oF over TCP increases steadily as the queue depth increases, up to a point. This suggests that StarWind VSAN can benefit from queuing additional I/O requests to achieve better performance. Meanwhile, both Linstor/DRBD over TCP and Ceph reach their peak performance straight away. Increasing the queue depth for them only results in higher latency without performance growth.
StarWind VSAN NVMe-oF over RDMA raises the bar, showing a significant performance increase starting from queue depth = 8. At an IOdepth of 128, StarWind VSAN NVMe-oF over RDMA achieves 4,664,000 IOPS, while StarWind VSAN NVMe-oF over TCP achieves 2,944,000 IOPS. The performance gap widens even further against Linstor/DRBD over TCP: StarWind VSAN NVMe-oF over RDMA outperforms Linstor's 1,483,000 IOPS by around 214.1%. Against Ceph, the difference is even bigger, with StarWind VSAN NVMe-oF over RDMA surpassing Ceph's 406,000 IOPS by more than 10 times.
Figure 2 compares the latency when reading 4k data blocks. Not surprisingly, StarWind VSAN NVMe-oF over RDMA exhibits the lowest latency among the tested solutions, with an average result of 0.5 milliseconds (ms) and a minimum of 0.168 ms. Moreover, for both RDMA and TCP configurations, StarWind VSAN NVMe-oF maintains lower latency compared to other solutions across all queue depths. In contrast, Linstor/DRBD over TCP and Ceph show a steeper rise in average latency, with Ceph being the slowest of all contenders.
CPU utilization efficiency is also a very important factor, as it is directly related to the cost-efficiency of the resulting solution, especially in a hyperconverged infrastructure (HCI).
Figure 3 shows IOPS per 1% CPU usage during a 4k random read test. StarWind VSAN NVMe-oF over RDMA and TCP demonstrate the most efficient CPU utilization, followed by Linstor/DRBD over TCP. Ceph lags behind the other solutions in terms of CPU efficiency, exhibiting the lowest IOPS per 1% CPU usage values, ranging from 2,886 to 3,076, and suggesting room for improvement in CPU resource utilization. At the same time, StarWind VSAN NVMe-oF over RDMA demonstrates notable performance improvements over other configurations when comparing their highest values: outperforming its TCP counterpart by approximately 52.67%, Linstor/DRBD over TCP by about 118.92%, and Ceph by an impressive 13 times.
Pure read or write workloads don’t occur all the time in real production scenarios, which is why benchmarking storage performance with a mixed 70/30 read-write pattern is important to get the full picture.
Figure 4 demonstrates the number of IOPS achieved during 70%/30% 4k random read/write operations with Numjobs = 3. Here, StarWind VSAN NVMe-oF over RDMA again takes the lead, achieving the maximum result of 1,199,000 IOPS and outperforming other storage solutions, especially in tests with lower IOdepths.
Interestingly, Linstor/DRBD handled mixed workloads remarkably well. When compared head-to-head with StarWind VSAN NVMe-oF over TCP, it scored 1,088,000 IOPS, a solid 22.6% advantage over the maximum 887,000 IOPS achieved by StarWind over TCP.
Ceph, on the other hand, consistently showed lower performance across the board, achieving a maximum of 270,000 IOPS.
Delving into latency dynamics, Figure 5 scrutinizes the delay times during 4k random read/write operations at a 70%/30% ratio with Numjobs = 3. Again, StarWind VSAN NVMe-oF over RDMA stands out with notably lower latency throughout the test, with Linstor/DRBD taking second place.
Figure 6 explores the efficiency of IOPS relative to CPU utilization during the same mixed workload. Here, StarWind VSAN NVMe-oF over RDMA demonstrates superior efficiency during the ‘heaviest’ IOdepth 64 pattern, delivering better IOPS per 1% CPU usage by 35.17% over its TCP version, 103.0% over Linstor/DRBD over TCP, and a striking 834.5% over Ceph, marking a significant performance advantage.
Interestingly, when comparing the ‘lighter’ IOdepth patterns of 8 and 16, Linstor/DRBD takes the lead over StarWind’s NVMe-oF over TCP, with an average 25% advantage in CPU utilization. However, this advantage is completely mitigated during the ‘heavier’ tests within the IOdepth 32 pattern. Moreover, StarWind VSAN NVMe-oF over TCP regains the lead, achieving 50.17% better CPU utilization than Linstor/DRBD during IOdepth 64 tests. Fundamentally, this means that StarWind VSAN NVMe-oF over TCP maintains the same level of CPU usage efficiency across all patterns, while Linstor/DRBD starts wasting more CPU resources on heavier workloads.
Figure 7 presents the IOPS achieved during 4k random write operations, revealing that StarWind VSAN NVMe-oF over RDMA not only tops its TCP configuration by about 29% but also surpasses Linstor/DRBD by 22% and Ceph by 115%, showcasing the benefits of RDMA in write-intensive scenarios.
Once again, the close competition between StarWind VSAN NVMe-oF over TCP and Linstor/DRBD continues here, with StarWind VSAN showing a 27.6% advantage over DRBD at IOdepth=4, running neck and neck at IOdepth=8, and finally falling behind at heavier workloads, scoring 5.9% lower than DRBD. What a race!
In assessing write latencies, Figure 8 confirms the previous IOPS results, with StarWind VSAN NVMe-oF over RDMA leading in minimizing delays and outperforming other solutions across the board.
The close race between Linstor/DRBD and StarWind VSAN’s TCP configuration is also prominent here.
Efficiency takes center stage in Figure 9, which measures IOPS per 1% CPU usage during a 4k random write test. StarWind VSAN NVMe-oF over RDMA demonstrates remarkable CPU utilization, significantly outstripping other configurations and illustrating its ability to deliver higher performance with less resource consumption. It is more efficient by approximately 34.95% over StarWind VSAN NVMe-oF over TCP, 22.75% over Linstor/DRBD over TCP, and a dramatic 536.07% over Ceph in 32 queue depth tests.
What’s even more interesting, Linstor/DRBD takes a respectable second place, outperforming StarWind VSAN NVMe-oF over TCP configuration by 9.96% in IOdepth=32 tests.
Figure 10 illustrates the throughput for random reads in 64KB blocks. During this test, Linstor/DRBD over TCP emerges as a standout performer, achieving about 22.17% higher throughput than StarWind VSAN NVMe-oF over RDMA, 63.23% over StarWind VSAN NVMe-oF over TCP, and a whopping 182.27% over Ceph.
Linstor/DRBD's high performance is attributed to DRBD's read-balancing feature, available since version 8.4.1, which balances read requests between the primary and secondary nodes and lets you define the read policy. In our testing, we used the default policy, prefer-local, as it provided maximum performance by reading locally and avoiding the TCP stack. In your environment, another policy, such as read striping, round-robin, least-pending, or when-congested-remote, might yield better results. If you want to explore this topic in more detail, refer to the DRBD documentation on read balancing.
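For illustration, assuming the same drbd-options pass-through used for the tuning commands earlier, switching to a different policy might look like the line below; the exact option name should be verified against your LINSTOR/DRBD version:

# Example only: change the DRBD read-balancing policy cluster-wide
linstor controller drbd-options --read-balancing round-robin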
Figure 11 compares latency during 64K random reads, where Linstor/DRBD over TCP (surprise surprise) consistently shows the lowest latency, emphasizing its efficiency in reading large-sized data blocks. In IOdepth=16 tests, Linstor/DRBD over TCP demonstrates 18.76% lower latency compared to StarWind VSAN NVMe-oF over RDMA, about 38.17% lower than StarWind VSAN NVMe-oF over TCP, and approximately 64.57% lower than Ceph.
CPU usage during 64K random reads is analyzed in Figure 12, showcasing Linstor/DRBD over TCP with the most efficient CPU utilization. Compared to every other solution, Linstor/DRBD over TCP maintains CPU usage reductions by an average of 60.41% during IOdepth=2 tests, highlighting its efficiency on lighter workloads. Interestingly, the difference in efficiency decreases with heavier workloads, with Linstor/DRBD being more efficient than StarWind VSAN NVMe-oF over RDMA by 23.07%, 35.48% more efficient than StarWind VSAN NVMe-oF over TCP, and 57.89% more efficient than Ceph.
The throughput for 64K random writes is presented in Figure 13. While Linstor/DRBD over TCP and Ceph demonstrate higher throughput values at higher IOdepths, StarWind VSAN NVMe-oF over RDMA provides better throughput at lower IOdepth tests. At an IOdepth of 4, Linstor/DRBD surpasses StarWind VSAN NVMe-oF over RDMA by 13.3%, and its TCP configuration by 17.7%. Meanwhile, at IOdepth = 1, StarWind VSAN NVMe-oF over RDMA shows 29.04% better throughput compared to DRBD and 88.94% better than Ceph. Meanwhile, StarWind VSAN in TCP configuration shows 2.50% lower throughput than DRBD over TCP in IOdepth=1 tests and a 15.03% decrease in IOdepth=4 tests.
It’s also interesting how Ceph’s throughput ramps up as the IOdepth increases. This suggests that Ceph can handle more large-sized requests in parallel, reducing individual queuing times and leading to higher overall throughput.
In Figure 14, the latency of 64K random writes presents the same pattern as seen in previous throughput figures. StarWind VSAN NVMe-oF over RDMA offers the fastest response times at lower IOdepths. However, at an IOdepth of 4, Linstor/DRBD over TCP takes the lead, showing a lower latency of 2.49 ms. Ceph also does a great job at higher IOdepths, showing similar results to DRBD.
Let’s move on to Figure 15, which details CPU usage during 64K random writes. Linstor/DRBD over TCP comes out on top in this race, demonstrating the most efficient use of CPU resources across all tested configurations. Even at higher IOdepths, it showcases up to 16.66% lower CPU utilization compared to StarWind VSAN NVMe-oF over RDMA, 23.07% lower than VSAN over TCP, and 57.44% lower than Ceph.
The throughput results for 1024k reads, depicted in Figure 16, show that StarWind VSAN NVMe-oF over RDMA significantly outperforms all competitors from IOdepth 1 to 4. However, at IOdepth 8, Linstor/DRBD takes a slight edge, outperforming StarWind VSAN NVMe-oF over RDMA by 4.08%. StarWind VSAN's TCP configuration shows 29.6% better performance than Linstor/DRBD at IOdepth=1, but at higher IOdepths, DRBD regains the edge and surpasses StarWind VSAN NVMe-oF over TCP by 44.54% at IOdepth 8. Once again, DRBD's performance lead is attributed to its read-balancing feature and the 'prefer-local' policy chosen for our testing. Ceph, on the other hand, starts much slower than the others, but shows remarkable results at IOdepth 8, surpassing StarWind VSAN's TCP configuration and falling behind DRBD and StarWind VSAN NVMe-oF over RDMA by 13.74% and 9.27% respectively.
Latency during the 1024k read test is explored in Figure 17, which reflects the same pattern observed in previous throughput figures. StarWind VSAN NVMe-oF over RDMA offers the fastest response times at IOdepths 1 to 4. However, at IOdepth 8, Linstor/DRBD over TCP takes the lead, demonstrating lower latency across the board.
CPU usage during 1024k reads is recorded in Figure 18, where both StarWind configurations exhibit higher resource use compared to competitors. Remarkably, Linstor/DRBD over TCP uses considerably less CPU, even at the maximum IOdepth—up to 50% less compared to StarWind VSAN NVMe-oF over RDMA. Ceph’s average CPU utilization is 20%, placing it around the middle of the pack when compared to the other three configurations, taking into account the respective throughput figures.
Figure 19 examines 1024k sequential write throughput, highlighting Ceph’s superior performance, especially as IOdepth increases. It surpasses Linstor/DRBD over TCP by approximately 148.95% and StarWind VSAN NVMe-oF over RDMA by about 196.67%, showcasing its effectiveness in handling large write operations. Similarly, against StarWind VSAN NVMe-oF over TCP, Ceph exhibits a substantial lead of roughly 208.64%.
It’s important to note that this level of Ceph’s performance is largely due to its operation in a mirror (RAID 1) configuration over the network in our case, while StarWind and Linstor/DRBD operate in RAID 51 (local RAID 5 combined with RAID 1 over the network).
Figure 20 presents the same pattern, with Ceph leading at all IOdepths and showing lower latency than other competitors.
Lastly, Figure 21 focuses on CPU usage during 1024k writes. Linstor/DRBD over TCP stands out for its minimal CPU demand, demonstrating an improvement of about 34.04% in CPU efficiency over StarWind VSAN NVMe-oF over RDMA. Interestingly, both the RDMA and TCP configurations of StarWind VSAN show similar CPU usage figures.
Additional benchmarking: 1 VM 1 numjobs 1 iodepth
This section provides a closer look at the performance metrics of storage solutions under a single VM scenario with numjobs = 1 and an IO depth of 1. The aim is to evaluate each configuration’s efficiency in handling random 4k read/write and 4k random write (synchronous) workloads. The benchmarks highlight how each solution performs in terms of IOPS, throughput (MiB/s), and latency (ms).
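For reference, a minimal sketch of what the synchronous 4k random write invocation might look like with FIO (the device path and runtime are illustrative assumptions, not the exact job definition used here):

# 4k random write with each I/O issued with O_SYNC semantics
fio --name=4k-randwrite-sync --rw=randwrite --bs=4k --sync=1 \
    --ioengine=libaio --direct=1 --numjobs=1 --iodepth=1 \
    --time_based --runtime=600 --filename=/dev/vdb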
Benchmark results in a table:
StarWind VSAN NVMe-oF over RDMA – Host mirroring + MDRAID5 (1 VM) | |||||
---|---|---|---|---|---|
Pattern | Numjobs | IOdepth | IOPS | MiB/s | Latency (ms) |
4k random read | 1 | 1 | 8,296 | 32 | 0.120 |
4k random write | 1 | 1 | 5,091 | 20 | 0.195 |
4k random write (synchronous) | 1 | 1 | 3,330 | 13 | 0.298 |
StarWind VSAN NVMe-oF over TCP – Host mirroring + MDRAID5 (1 VM) | |||||
Pattern | Numjobs | IOdepth | IOPS | MiB/s | Latency (ms) |
4k random read | 1 | 1 | 6,701 | 26 | 0.148 |
4k random write | 1 | 1 | 3,586 | 14 | 0.277 |
4k random write (synchronous) | 1 | 1 | 2,170 | 9 | 0.459 |
Linstor (DRBD) over TCP – Host mirroring + MDRAID5 (1 VM) | ||||||
---|---|---|---|---|---|---|
Pattern | Numjobs | IOdepth | IOPS | MiB/s | Latency (ms) | |
4k random read | 1 | 1 | 10,600 | 42 | 0.094 | |
4k random write | 1 | 1 | 5,254 | 21 | 0.189 | |
4k random write (synchronous) | 1 | 1 | 4,273 | 17 | 0.233 | |
Ceph – Host mirroring (1 VM) | ||||||
Pattern | Numjobs | IOdepth | IOPS | MiB/s | Latency (ms) | |
4k random read | 1 | 1 | 5,145 | 20 | 0.193 | |
4k random write | 1 | 1 | 2,184 | 9 | 0.456 | |
4k random write (synchronous) | 1 | 1 | 1,972 | 8 | 0.506 |
Benchmark results in graphs:
This section presents visual comparisons of the performance and latency metrics across configurations under research.
4k random read:
Figure 1 highlights the performance in IOPS for the 4K random read test, where Linstor (DRBD) takes the lead, outperforming StarWind VSAN NVMe-oF over RDMA by 28% and the TCP version by 58%.
This significant advantage is due to Linstor’s local read capability and host-level operation, while StarWind’s deployment inside a VM results in a longer IO datapath, impacting its performance. Ceph lags behind with 5,145 IOPS, trailing Linstor by 106%.
Figure 2 illustrates the latency results for the same test.
Linstor (DRBD) again shows its superiority with the lowest latency at 0.094 ms due to local read capability and host-level execution, making it 22% faster than StarWind VSAN NVMe-oF over RDMA and 36% faster than the TCP version of StarWind. Ceph records a latency of 0.193 ms, which is 105% higher than Linstor.
4k random write:
Now let’s move to Figure 3 which demonstrates the performance results for the 4K random write test.
StarWind VSAN NVMe-oF over RDMA achieves 5,091 IOPS, closely matching Linstor (DRBD) at 5,254 IOPS, with only a 3% difference. The TCP version of StarWind trails Linstor by 47%, delivering 3,586 IOPS.
The lower performance of StarWind stems from the VM-based architecture, leading to a longer IO datapath.
Ceph, with 2,184 IOPS, is 140% slower than Linstor, reinforcing the trend observed in previous tests.
Latency for 4K random writes, as depicted in Figure 4, further supports these findings. Linstor (DRBD) registers the lowest latency at 0.189 ms, edging out StarWind VSAN NVMe-oF over RDMA by 3%.
The TCP version of StarWind, at 0.277 ms, is 35% slower than Linstor.
Ceph again records the highest latency at 0.456 ms, making it 141% slower than Linstor.
4k random write (synchronous):
In the 4K random write synchronous performance test, shown in Figure 5, Linstor (DRBD) achieves 4,273 IOPS, surpassing StarWind VSAN NVMe-oF over RDMA by 28%.
The TCP version of StarWind trails further behind with 2,170 IOPS, putting it 97% behind Linstor.
Ceph, at 1,972 IOPS, falls 117% behind Linstor. The performance gap here mirrors the trend seen in the 4K random write results, driven by similar underlying factors.
Finally, in terms of latency for synchronous writes, as displayed in Figure 6, Linstor (DRBD) continues to lead with 0.233 ms, 22% faster than StarWind VSAN NVMe-oF over RDMA at 0.298 ms.
StarWind VSAN NVMe-oF over TCP, at 0.459 ms, lags behind Linstor by 97%. Ceph, with a latency of 0.506 ms, is 117% slower than Linstor.
These latency results align with the pattern observed in the 4K random write test, as the same factors contribute to the differences in performance.
Conclusion
Our in-depth benchmarking analysis has shed light on how different storage setups perform across various tasks. We’ve seen that StarWind VSAN NVMe-oF over RDMA delivers impressive performance in 4k workloads, significantly outperforming competitors in terms of both performance and lower CPU utilization.
Meanwhile, StarWind VSAN NVMe-oF over TCP also shows commendable performance in 4k workloads, although it slightly trails its RDMA counterpart and begins to lag behind Linstor in some mixed and large-sized, write-intensive tests.
Linstor/DRBD over TCP holds its own as well, particularly in mixed and large-size workloads such as 64K and 1024K. In some write tests, Linstor even outperformed StarWind VSAN over RDMA, which is a notable achievement for a TCP-based solution.
On the other hand, Ceph shows mixed results. While it excels at writing 1024K blocks, it falls short in almost all other scenarios, particularly in read workloads, and it lags behind in CPU efficiency.
The simple conclusion is that StarWind VSAN is perfect for running most kinds of virtualized workloads: general-use VMs, VDI, databases, and transaction-intensive tasks. Linstor/DRBD is also a good choice for most virtualization workloads, especially in environments where larger sequential reads and writes are typical. Ceph, meanwhile, is a good option for running data warehousing and analytics applications where 4K-64K IOPS are less relevant than high throughput during large sequential reads and writes. It’s also well-suited for environments dealing with video streaming and large file processing.
Ultimately, storage performance is not the sole factor to consider when building an HCI infrastructure. Ease of configuration and maintenance, necessary data protection and optimization features, commercial support, and other factors are also very important. Stay tuned, and we will help you get a bigger picture in our upcoming articles.