StarWind VSAN vs Microsoft S2D: NVMe-oF RDMA Performance

Introduction

In the fast-paced world of hyperconverged infrastructure (HCI), performance and efficiency aren’t just buzzwords – they’re essential. As organizations push the boundaries of what their IT infrastructure can deliver, selecting the most effective solution becomes a critical decision. In this context, StarWind Virtual SAN (VSAN) and Microsoft Storage Spaces Direct (S2D) are two software-defined storage products that offer distinct approaches to leveraging NVMe and RDMA for high-performance HCI storage.

This article is the second in a series exploring the performance of StarWind VSAN and Microsoft S2D in a 2-node Hyper-V cluster setup. In the first article, we compared these two solutions using NVMe-oF over TCP, exploring their performance, capacity efficiency, and practical application. If you missed it, you can catch up here. Now, we’re turning our attention to RDMA-based configurations to give you an even clearer picture of which solution might be your ideal fit.

In this article, we’ll evaluate how these solutions perform in a 2-node Hyper-V cluster across two key scenarios:

StarWind VSAN NVMe over RDMA
- Host Mirroring + MDRAID-5.
Microsoft Storage Spaces Direct over RDMA
- Nested mirror-accelerated parity, workload placed in the mirror tier.
- Nested mirror-accelerated parity, workload placed in both tiers – mirror and parity.

By examining these configurations, we aim to provide insights into how each solution performs under varying workloads and how these performance characteristics translate into real-world benefits. In the sections that follow, we’ll walk you through our testbed setup, benchmarking methodology, and the results of our performance tests.

Solutions overview

StarWind Virtual SAN NVMe over RDMA scenario:

The StarWind Virtual SAN (VSAN) setup was designed to leverage the full potential of NVMe drives and RDMA for high-performance storage. Here’s how it was configured:

NVMe drives: Each Hyper-V node was equipped with 5x NVMe drives, which were directly passed through to the StarWind VSAN Controller Virtual Machine (CVM). This direct pass-through ensures that the drives can fully leverage the speed and performance benefits of NVMe technology.
RDMA: To enable RDMA (Remote Direct Memory Access) and achieve ultra-low latency communication between the nodes, Mellanox NICs were used. These NICs were configured with SR-IOV (Single Root I/O Virtualization), allowing their Virtual Functions to be passed through to the StarWind VSAN CVM. This setup provides the necessary RDMA compatibility for high-speed data transfer.
MDRAID-5 array creation: Inside the StarWind VSAN CVM, the 5x NVMe drives were assembled into an MDRAID-5 array. This RAID configuration balances performance, capacity, and redundancy.
High Availability (HA): On top of the MDRAID-5 array, we created two StarWind High Availability (HA) devices. These HA devices replicate data between the two nodes, ensuring continuous availability even in the event of a node failure.
NVMe-oF connectivity: The StarWind HA devices were connected to the nodes using StarWind NVMe-oF Initiator. The NVMe initiator plays a key role in establishing the high-speed NVMe-oF connection across the RDMA network, which is critical for maintaining low-latency and high-throughput operations.
Cluster Shared Volumes: Finally, Cluster Shared Volumes (CSVs) were created on top of the connected HA devices. These CSVs allow both nodes to access the same storage simultaneously, enabling efficient load balancing and resource utilization.

It’s worth noting that we used StarWind NVMe-oF Initiator because, currently, Microsoft does not offer a native NVMe-oF initiator. Microsoft has announced plans to release an NVMe initiator for Windows Server 2025, but it will support NVMe over TCP only, with no confirmation yet regarding RDMA support.

Microsoft Storage Spaces Direct over RDMA scenario – Nested mirror-accelerated parity:

DISCLAIMER: We’re aware that disk witness isn’t officially supported with S2D. However, for the sake of our benchmarking and to speed up deployment, we chose to proceed with it. That said, do not use disk witness in your production S2D cluster.

For the S2D setup, we implemented a Nested mirror-accelerated parity configuration, which offers an optimal balance between performance and capacity efficiency. This setup allows us to evaluate how well S2D handles different workloads, particularly in scenarios where the workload is either fully placed in the high-performance mirror tier or spread across both the mirror and parity tiers.

Here’s how we structured the solution:

Storage tiers: We created two distinct storage tiers, each configured to optimize specific aspects of data handling:
- NestedPerformance tier: Configured with the mirror resiliency setting, this tier uses SSDs and ensures high data redundancy by storing four copies of each piece of data. The command used to create this tier was:

New-StorageTier -StoragePoolFriendlyName s2d-pool -FriendlyName NestedPerformance -ResiliencySettingName Mirror -MediaType SSD -NumberOfDataCopies 4

- NestedCapacity tier: This tier focuses on capacity efficiency, using a parity resiliency setting. It stores two copies of each piece of data with one parity stripe, configured using the following command:

New-StorageTier -StoragePoolFriendlyName s2d-pool -FriendlyName NestedCapacity -ResiliencySettingName Parity -MediaType SSD -NumberOfDataCopies 2 -PhysicalDiskRedundancy 1 -NumberOfGroups 1 -FaultDomainAwareness StorageScaleUnit -ColumnIsolation PhysicalDisk -NumberOfColumns 4

Volumes setup: Following Microsoft’s recommendations, two volumes were created across these tiers:
- Volume01 and Volume02: Both volumes were configured with 20% of their capacity in the high-performance mirror tier and the remaining 80% in the capacity-focused parity tier. This setup allows us to observe how the system handles data as it moves between tiers, particularly when the mirror tier reaches its capacity limits. The commands used to create these volumes were:

New-Volume -StoragePoolFriendlyName s2d-pool -FriendlyName Volume01 -StorageTierFriendlyNames NestedPerformance, NestedCapacity -StorageTierSizes 820GB, 3276GB

New-Volume -StoragePoolFriendlyName s2d-pool -FriendlyName Volume02 -StorageTierFriendlyNames NestedPerformance, NestedCapacity -StorageTierSizes 820GB, 3276GB

ReFS data movement: The Resilient File System (ReFS) is configured to automatically move data between the tiers when the mirror tier reaches 85% capacity. This threshold was left at its default setting to simulate a typical production environment.
Testing Scenarios:
- Scenario 1: Workload in the mirror tier: Here, the entire workload was placed within the mirror tier, leveraging its high performance and redundancy.
- Scenario 2: Workload spilling into the parity tier: In the second scenario, we explored the performance impact when the workload exceeds the mirror tier’s capacity, forcing ReFS to start moving data to the slower parity tier. We also simulated conditions where writes were directed straight to the parity tier, representing a worst-case scenario in terms of performance.

In real-world applications, performance would likely fall somewhere between these two scenarios, depending on the specific workload and how much data resides in each tier. This dual-tier approach provides valuable insights into how S2D manages different types of data and how it balances performance with capacity efficiency.

Capacity efficiency

In evaluating the capacity efficiency of these configurations, it’s essential to understand how each solution optimizes storage use while balancing performance and resiliency.

StarWind Virtual SAN
Achieves a capacity efficiency of 40%, thanks to its combination of host mirroring and MDRAID-5.
Microsoft S2D Nested mirror-accelerated parity
Delivers a capacity efficiency of 35.7% (20% mirror, 80% parity), though this can vary depending on the percentage of the volume allocated to the mirror tier. For more details on how to calculate capacity efficiency for Nested mirror-accelerated parity, please refer to the provided link.
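Both figures can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the per-tier efficiencies (25% for a four-copy nested mirror, 40% for nested parity on five drives per node) are assumptions chosen because they are consistent with the 35.7% figure quoted above.

```python
def starwind_efficiency(drives_per_node: int, mirror_copies: int = 2) -> float:
    """RAID-5 keeps (n-1)/n of raw capacity; host mirroring divides that by the copy count."""
    raid5 = (drives_per_node - 1) / drives_per_node
    return raid5 / mirror_copies


def blended_efficiency(mirror_share: float, mirror_eff: float, parity_eff: float) -> float:
    """Usable/raw ratio for a volume split between two tiers of different efficiency."""
    parity_share = 1.0 - mirror_share
    # Raw capacity consumed per unit of usable capacity, summed over both tiers.
    raw_per_usable = mirror_share / mirror_eff + parity_share / parity_eff
    return 1.0 / raw_per_usable


vsan = starwind_efficiency(drives_per_node=5)
# Assumed tier efficiencies: 0.25 (four data copies) and 0.40 (nested parity, 5 drives).
s2d = blended_efficiency(mirror_share=0.20, mirror_eff=0.25, parity_eff=0.40)

print(f"StarWind VSAN (host mirror + MDRAID-5): {vsan:.1%}")
print(f"S2D nested mirror-accelerated parity:   {s2d:.1%}")
```

Raising `mirror_share` in this sketch lowers the blended result, which is why the S2D figure depends on how much of each volume is allocated to the mirror tier.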

Microsoft also recommends keeping some storage capacity unallocated, about 20% of the total pool size, to enable “in-place” repairs if drives fail. In our case, this reserve is 5.82 TB. It allows repairs to start immediately, in parallel, and automatically, so your data remains safe and the system stays resilient even if a drive fails. It’s an added layer of protection that helps maintain uptime and performance.

So, when you’re planning your storage solution, it’s definitely something to keep in mind.
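As a rough check on the reserve figure, here is a minimal sketch that assumes the pool consists of all ten 3.2 TB drives and that the 5.82 figure is expressed in binary units (TiB). Neither assumption is stated in the article.

```python
DRIVES_PER_NODE = 5
NODES = 2
DRIVE_TB = 3.2          # decimal terabytes, as printed on the drive label
RESERVE_SHARE = 0.20    # roughly 20% of the pool left unallocated

raw_bytes = DRIVES_PER_NODE * NODES * DRIVE_TB * 10**12
raw_tib = raw_bytes / 2**40            # convert decimal TB to binary TiB
reserve_tib = raw_tib * RESERVE_SHARE

print(f"raw pool: {raw_tib:.2f} TiB, reserve: {reserve_tib:.2f} TiB")
```

Under these assumptions the reserve works out to about 5.82 TiB, which matches the number quoted above.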

Testbed overview

Our testbed setup is designed to push the limits of both StarWind VSAN and Microsoft S2D in a high-performance environment.

Hardware:

Server model	Supermicro SYS-220U-TNR
CPU	Intel(R) Xeon(R) Platinum 8352Y @2.2GHz
Sockets	2
Cores/Threads	64/128
RAM	256GB
NICs	2x Mellanox ConnectX®-6 EN 200GbE (MCX613106A-VDA)
Storage	5x NVMe Micron 7450 MAX: U.3 3.2TB

Software:

Windows Server	Windows Server 2022 Datacenter 21H2 OS build 20348.2527
StarWind VSAN	Version V8 (build 15469, CVM 20240530) (kernel – 5.15.0-113-generic)
StarWind NVMe-oF Initiator	StarWind NVMe-oF Initiator.2.0.0.672(rev 674).Setup.486

StarWind VSAN CVM parameters:

CPU	24 vCPU
RAM	32GB
NICs	1x network adapter for management; 4x Mellanox ConnectX-6 Virtual Function network adapters (SR-IOV)
Storage	MDRAID-5 (5x NVMe Micron 7450 MAX: U.3 3.2TB)

Testing methodology

To accurately assess the performance of both StarWind VSAN and Microsoft S2D, we conducted a series of benchmarks using the FIO utility in client/server mode. Here’s a breakdown of the testing setup and methodology:

Virtual Machine Configuration:

Total VMs: 20 (10 per host)
VM Specs:
- vCPUs: 4 per VM
- RAM: 8GB per VM
- Disks: 3x RAW virtual disks per VM, each connected to a separate SCSI controller

Virtual Disk Sizes:

For Microsoft S2D (Nested mirror-accelerated parity):
- Mirror-only: 10GB per virtual disk
- Both tiers: 100GB per virtual disk
For StarWind VSAN NVMe-oF: 100GB per virtual disk

Preparation:

Virtual disks were pre-filled with random data to simulate real-world usage conditions before running the tests.

Test Patterns: We evaluated the performance using the following I/O patterns:

4k random read
4k random read/write (70/30)
4k random write
64k random read
64k random write
1M read
1M write

Warm-Up Procedures:

4k random read/write (70/30) and 4k random write patterns: VM disks were warmed up using the 4k random write pattern for 4 hours.
64k random write pattern: VM disks were warmed up using the 64k random write pattern for 2 hours.

Test Execution:

Each test was conducted three times, and the average result was used as the final performance metric.
Duration:
- Read tests: 600 seconds
- Write tests: 1800 seconds
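To make the patterns and durations concrete, here is a hedged sketch of how an fio command line for one test point could be assembled. It is not the authors' job file: the helper name is hypothetical, `--direct=1` and `--group_reporting` are assumptions, and the target file, I/O engine, and client/server options are omitted because the article does not specify them.

```python
from typing import List, Optional

READ_RUNTIME_S = 600    # read tests, per the article
WRITE_RUNTIME_S = 1800  # write tests, per the article


def fio_args(name: str, rw: str, bs: str, numjobs: int, iodepth: int,
             mix_read: Optional[int] = None, write_test: bool = False) -> List[str]:
    """Build an fio argument list for a single pattern / queue-depth combination."""
    runtime = WRITE_RUNTIME_S if write_test else READ_RUNTIME_S
    args = [
        "fio",
        f"--name={name}",
        f"--rw={rw}",              # randread, randwrite, randrw, read, write
        f"--bs={bs}",
        f"--numjobs={numjobs}",
        f"--iodepth={iodepth}",
        "--direct=1",              # assumption: bypass the guest page cache
        "--time_based",
        f"--runtime={runtime}",
        "--group_reporting",       # assumption: one aggregated result per job group
    ]
    if mix_read is not None:
        args.append(f"--rwmixread={mix_read}")  # read share of a mixed workload
    return args


# 4k random read at numjobs=3, iodepth=4
print(" ".join(fio_args("4k-randread", "randread", "4k", 3, 4)))
# 4k random read/write 70/30; treating it as a write test is an assumption
print(" ".join(fio_args("4k-randrw", "randrw", "4k", 3, 16, mix_read=70, write_test=True)))
```

Looping this helper over the queue depths listed in the result tables would produce one command per table row.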

Microsoft S2D Specifics:

Following Microsoft’s recommendations, the testing VMs were placed on the node that owns the volume. With this placement, reads are served from local disks without going through the network stack, which minimizes network utilization and, in turn, reduces latency during write operations.
Each VHDX file was placed in different subdirectories, which helps optimize ReFS performance by minimizing metadata operation size and allowing parallel execution, reducing overall application latency.

StarWind VSAN Specifics:

VMs were evenly distributed across both hosts without being pinned to the node that owns the volume, which ensures a balanced load.
Similar to the S2D setup, each VHDX file was placed in different subdirectories to optimize performance.

Benchmarking local NVMe performance

Before diving into our performance verification, we took a moment to set the stage with vendor-claimed performance figures for the NVMe drives.

[Image: vendor-claimed performance figures for the Micron 7450 MAX]

Using the FIO utility in client/server mode, we conducted a series of tests on a single Micron 7450 MAX U.3 3.2TB NVMe drive. The following results were observed:

	1x NVMe Micron 7450 MAX: U.3 3.2TB
Pattern	Numjobs	IOdepth	IOPs	MiB/s	Latency (ms)
4k random read	6	32	997,000	3,894	0.192
4k random read/write 70/30	6	16	531,000	2,073	0.142
4k random write	4	4	385,000	1,505	0.041
64k random read	8	8	92,900	5,807	0.688
64k random write	2	1	27,600	1,724	0.072
1M read	1	8	6,663	6,663	1.200
1M write	1	2	5,134	5,134	0.389

Our tests confirmed that the NVMe drive’s performance is fully in line with the vendor’s claims. This validation step is crucial for ensuring that our subsequent benchmarks are based on accurate and trustworthy hardware performance.
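The columns in the table above are related to each other, which gives a quick sanity check. Throughput is IOPS multiplied by block size, and with the queues kept full, Little's law gives latency as outstanding I/Os divided by IOPS. The sketch below applies both to the pure read and write rows. The mixed 70/30 row is left out because the article does not say how its latency was aggregated.

```python
def throughput_mib_s(iops: float, block_kib: int) -> float:
    """Throughput implied by an IOPS figure at a given block size."""
    return iops * block_kib / 1024


def expected_latency_ms(iops: float, numjobs: int, iodepth: int) -> float:
    """Little's law: outstanding I/Os = IOPS x latency, assuming full queues."""
    return numjobs * iodepth / iops * 1000


ROWS = [
    # pattern, block KiB, numjobs, iodepth, IOPS, reported MiB/s, reported latency (ms)
    ("4k random read",   4,    6, 32, 997_000, 3_894, 0.192),
    ("4k random write",  4,    4,  4, 385_000, 1_505, 0.041),
    ("64k random read",  64,   8,  8,  92_900, 5_807, 0.688),
    ("64k random write", 64,   2,  1,  27_600, 1_724, 0.072),
    ("1M read",          1024, 1,  8,   6_663, 6_663, 1.200),
    ("1M write",         1024, 1,  2,   5_134, 5_134, 0.389),
]

for pattern, kib, jobs, depth, iops, mib, lat in ROWS:
    print(f"{pattern:17s} "
          f"MiB/s: computed {throughput_mib_s(iops, kib):8.0f}, reported {mib:6d} | "
          f"latency: computed {expected_latency_ms(iops, jobs, depth):.3f} ms, reported {lat:.3f} ms")
```

The computed values land within rounding distance of the reported ones, which suggests the reported latency is the average per-I/O completion time at the stated queue depth.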

Benchmark results in a table

The benchmarking results are presented in tables to illustrate performance metrics such as IOPS, throughput (MiB/s), latency (ms), and CPU usage. An additional metric, “IOPS per 1% CPU usage,” highlights the performance dependency on the CPU usage for 4k random read/write patterns. This parameter is calculated using the following formula:

IOPS per 1% CPU usage = IOPS / Node count / Node CPU usage

Where:

IOPS represents the number of I/O operations per second for each pattern.
Node count is 2 nodes in our case.
Node CPU usage denotes the CPU usage of one node during the test.

By incorporating this additional metric, we aimed to provide deeper insights into how CPU usage correlates with IOPS, offering a more nuanced understanding of performance characteristics.
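As a minimal illustration of the formula, the sketch below recomputes the metric for the first StarWind VSAN row in the results that follow (893,000 IOPS at 44% node CPU usage). The function name is ours, not part of any tool.

```python
def iops_per_cpu_percent(iops: float, node_count: int, node_cpu_percent: float) -> float:
    """IOPS per 1% CPU usage = IOPS / node count / node CPU usage (in percent)."""
    return iops / node_count / node_cpu_percent


# 4k random read, numjobs=3, iodepth=4: 893,000 IOPS on 2 nodes at 44% CPU each
print(round(iops_per_cpu_percent(893_000, 2, 44)))  # 10148
```

Note that CPU usage enters the formula as a percentage value (44), not as a fraction (0.44).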

Now let’s delve into the detailed benchmark results for each storage configuration.

StarWind VSAN NVMe over RDMA scenario

The table provides a detailed breakdown of StarWind VSAN’s performance under the Hyper-V NVMe over RDMA scenario, focusing on various workload patterns and configurations.

For 4k random reads, the IOPS ranges from 893,000 at lower queue depths to 1,624,000 at higher depths.

In mixed 4k random read/write (70%/30%) scenarios, the solution delivers up to 856,000 IOPS, maintaining strong performance even under mixed workloads.

For larger workloads, such as the 64k random read pattern, StarWind VSAN achieves up to 19,062 MiB/s while maintaining consistent latency and CPU utilization. In write-heavy scenarios like the 1024k write pattern, the throughput peaks at 4,479 MiB/s, with latency increasing as queue depth rises, yet the CPU usage remains stable between 16% and 19%.

VM count	Pattern	Numjobs	IOdepth	IOPs	MiB/s	Latency (ms)	Node CPU usage %	IOPs per 1% CPU usage
20	4k random read	3	4	893,000	3,488	0.267	44.00%	10,148
	4k random read	3	8	1,092,000	4,266	0.438	45.00%	12,133
	4k random read	3	16	1,399,000	5,465	0.683	50.00%	13,990
	4k random read	3	32	1,624,000	6,344	1.172	53.00%	15,321
	4k random read	3	64	1,558,000	6,086	2.461	53.00%	14,698
	4k random read	3	128	1,551,000	6,059	4.967	52.00%	14,913
	4k random read/write (70%/30%)	3	2	396,000	1,547	0.355	32.00%	6,188
	4k random read/write (70%/30%)	3	4	596,000	2,328	0.487	41.00%	7,268
	4k random read/write (70%/30%)	3	8	756,000	2,953	0.785	47.00%	8,043
	4k random read/write (70%/30%)	3	16	856,000	3,344	1.346	48.00%	8,917
	4k random read/write (70%/30%)	3	32	854,000	3,336	2.656	47.00%	9,085
	4k random read/write (70%/30%)	3	64	736,000	2,875	6.001	41.00%	8,976
	4k random write	3	2	201,000	785	0.595	25.00%	4,020
	4k random write	3	4	288,000	1,126	0.826	31.00%	4,645
	4k random write	3	8	341,000	1,332	1.406	34.00%	5,015
	4k random write	3	16	330,000	1,290	2.906	32.00%	5,156
	4k random write	3	32	196,000	766	9.818	21.00%	4,667
	64k random read	3	2	243,000	15,187	0.493	25.00%
	64k random read	3	4	280,000	17,500	0.856	26.00%
	64k random read	3	8	297,000	18,562	1.613	27.00%
	64k random read	3	16	302,000	18,875	3.182	28.00%
	64k random read	3	32	305,000	19,062	6.292	28.00%
	64k random write	3	1	42,200	2,638	1.420	17.00%
	64k random write	3	2	48,800	3,050	2.459	18.00%
	64k random write	3	4	52,900	3,306	4.532	18.00%
	64k random write	3	8	57,800	3,613	8.312	19.00%
	64k random write	3	16	62,300	3,894	15.389	19.00%
	64k random write	3	32	67,100	4,194	28.611	21.00%
	1024k read	1	1	13,800	13,800	1.451	15.00%
	1024k read	1	2	16,200	16,200	2.433	16.00%
	1024k read	1	4	17,600	17,600	4.551	17.00%
	1024k read	1	8	18,300	18,300	8.759	18.00%
	1024k read	1	16	18,900	18,900	16.976	18.00%
	1024k write	1	1	3,703	3,703	5.399	16.00%
	1024k write	1	2	3,744	3,744	10.636	17.00%
	1024k write	1	4	3,853	3,853	20.747	18.00%
	1024k write	1	8	4,479	4,479	35.707	19.00%

Overall, StarWind VSAN shows great performance at 4k random read/write patterns, consistent read and write performance regardless of VM location, and good capacity efficiency at 40%.

Microsoft Storage Spaces Direct over RDMA scenario (Mirror tier only)

The next table presents S2D’s performance with a Nested mirror-accelerated parity configuration, focusing on workloads in the mirror tier.

For 4k random read patterns, IOPS ranges from 858,000 at lower queue depths to 2,615,000 at higher depths, with corresponding latencies between 0.278 ms and 2.921 ms.

In the 4k random read/write (70%/30%) scenarios, IOPS ranges from 58,200 to 941,000, with latency fluctuating from 0.305 ms to 8.247 ms as queue depth increases. The node CPU usage varies from 3% to 52%, reflecting how the system manages mixed workloads.

For larger data patterns like the 64k random read and 1024k write, S2D demonstrates robust throughput, reaching up to 10,500 MiB/s in the 1024k write pattern. Latency remains relatively low at the lower queue depths but increases significantly as the queue depth rises. CPU utilization stays within a range of 5% to 38% across the 64k and 1024k patterns, showing the system’s ability to handle high-throughput tasks efficiently.

VM count	Pattern	Numjobs	IOdepth	IOPs	MiB/s	Latency (ms)	Node CPU usage %	IOPs per 1% CPU usage
20	4k random read	3	4	858,000	3,352	0.278	28.00%	15,321
	4k random read	3	8	782,000	3,055	0.620	21.00%	18,619
	4k random read	3	16	1,079,000	4,216	0.888	29.00%	18,603
	4k random read	3	32	1,615,000	6,308	1.189	41.00%	19,695
	4k random read	3	64	2,306,000	9,008	1.663	54.00%	21,352
	4k random read	3	128	2,615,000	10,215	2.921	67.00%	19,515
	4k random read/write (70%/30%)	3	2	410,000	1,602	0.305	29.00%	7,069
	4k random read/write (70%/30%)	3	4	113,400	443	2.112	7.00%	8,100
	4k random read/write (70%/30%)	3	8	58,200	227	8.247	3.00%	9,700
	4k random read/write (70%/30%)	3	16	667,000	2,605	1.607	38.00%	8,776
	4k random read/write (70%/30%)	3	32	908,000	3,547	2.791	48.00%	9,458
	4k random read/write (70%/30%)	3	64	941,000	3,676	6.017	52.00%	9,048
	4k random write	3	2	102,000	398	1.171	13.00%	3,923
	4k random write	3	4	50,100	196	4.794	7.00%	3,579
	4k random write	3	8	34,300	134	13.994	5.00%	3,430
	4k random write	3	16	66,100	258	14.504	8.00%	4,131
	4k random write	3	32	294,000	1,149	6.527	34.00%	4,324
	64k random read	3	2	319,000	19,938	0.374	17.00%
	64k random read	3	4	504,000	31,500	0.475	26.00%
	64k random read	3	8	439,000	27,438	1.081	22.00%
	64k random read	3	16	611,000	38,187	1.572	27.00%
	64k random read	3	32	851,000	53,187	2.252	38.00%
	64k random write	3	1	120,000	7,475	0.500	19.00%
	64k random write	3	2	130,000	8,153	0.919	20.00%
	64k random write	3	4	51,150	3,197	4.696	7.00%
	64k random write	3	8	38,700	2,419	12.334	6.00%
	64k random write	3	16	46,500	2,906	20.895	6.00%
	64k random write	3	32	161,000	10,063	11.905	26.00%
	1024k read	1	1	19,900	19,900	1.004	5.00%
	1024k read	1	2	31,800	31,800	1.257	7.00%
	1024k read	1	4	44,000	44,000	1.815	11.00%
	1024k read	1	8	50,300	50,300	3.176	14.00%
	1024k read	1	16	52,300	52,300	6.114	16.00%
	1024k write	1	1	9,887	9,887	2.022	8.00%
	1024k write	1	2	10,150	10,150	3.912	8.00%
	1024k write	1	4	10,200	10,200	7.841	9.00%
	1024k write	1	8	10,500	10,500	15.250	10.00%

Microsoft Storage Spaces Direct over RDMA scenario (Mirror + Parity tiers)

The performance metrics for the dual-tier configuration in S2D highlight workload management across both mirror and parity tiers.

In 4k random read patterns, IOPS ranges from 803,000 to 2,450,000, with latencies increasing from 0.297 ms to 3.133 ms as queue depth rises. Node CPU usage scales from 26% to 68%, with IOPS per 1% CPU usage showing efficient resource utilization, peaking at 19,773.

For the 4k random read/write (70%/30%) pattern, IOPS spans from 102,600 to 298,700, and latency escalates from 1.035 ms to 20.281 ms as queue depths increase. Node CPU usage varies between 20% and 50%, highlighting the system’s ability to manage mixed workloads, although the efficiency, measured by IOPS per 1% CPU usage, peaks at a more modest 3,075.

In the large-block patterns, read throughput is substantial, reaching 48,500 MiB/s for 64k random reads and 49,600 MiB/s for 1024k reads. Write performance, however, declines significantly: in the 1024k write pattern, throughput peaks at 2,341 MiB/s and latency climbs to 68.424 ms at a queue depth of 8. Despite the high node CPU efficiency in read scenarios, write performance shows noticeable degradation across tiers.

VM count	Pattern	Numjobs	IOdepth	IOPs	MiB/s	Latency (ms)	Node CPU usage %	IOPs per 1% CPU usage
20	4k random read	3	4	803,000	3,137	0.297	27.00%	14,870
	4k random read	3	8	774,000	3,023	0.620	26.00%	14,885
	4k random read	3	16	977,000	3,816	0.982	29.00%	16,845
	4k random read	3	32	1,531,000	5,980	1.252	42.00%	18,226
	4k random read	3	64	2,175,000	8,496	1.764	55.00%	19,773
	4k random read	3	128	2,450,000	9,570	3.133	68.00%	18,015
	4k random read/write (70%/30%)	3	2	152,700	598	1.035	32.00%	2,386
	4k random read/write (70%/30%)	3	4	157,200	614	1.924	32.00%	2,456
	4k random read/write (70%/30%)	3	8	102,600	400	4.926	20.00%	2,565
	4k random read/write (70%/30%)	3	16	260,200	1,016	4.759	45.00%	2,891
	4k random read/write (70%/30%)	3	32	298,700	1,167	9.019	50.00%	2,987
	4k random read/write (70%/30%)	3	64	282,900	1,105	20.281	46.00%	3,075
	4k random write	3	2	57,500	225	2.085	29.00%	991
	4k random write	3	4	70,600	276	3.398	33.00%	1,070
	4k random write	3	8	83,300	326	5.761	37.00%	1,126
	4k random write	3	16	89,000	348	10.774	41.00%	1,085
	4k random write	3	32	86,800	339	22.360	39.00%	1,113
	64k random read	3	2	312,000	19,500	0.383	18.00%
	64k random read	3	4	470,000	29,375	0.510	26.00%
	64k random read	3	8	386,000	24,125	1.259	22.00%
	64k random read	3	16	555,600	34,725	1.728	27.00%
	64k random read	3	32	776,000	48,500	2.474	38.00%
	64k random write	3	1	14,100	881	4.258	13.00%
	64k random write	3	2	13,700	856	8.771	14.00%
	64k random write	3	4	14,300	894	16.719	14.00%
	64k random write	3	8	15,400	962	31.095	16.00%
	64k random write	3	16	14,800	925	64.890	19.00%
	64k random write	3	32	14,800	925	129.896	18.00%
	1024k read	1	1	19,700	19,700	1.015	5.00%
	1024k read	1	2	31,000	31,000	1.256	8.00%
	1024k read	1	4	41,800	41,800	1.914	11.00%
	1024k read	1	8	47,600	47,600	3.358	13.00%
	1024k read	1	16	49,600	49,600	6.452	16.00%
	1024k write	1	1	1,904	1,904	10.707	4.00%
	1024k write	1	2	1,810	1,810	22.290	5.00%
	1024k write	1	4	1,981	1,981	40.353	5.00%
	1024k write	1	8	2,341	2,341	68.424	5.00%

Overall, S2D shows exceptional performance in both test cases. However, its storage capacity efficiency is about 35.7% and could be even lower if additional space is set aside for in-place repairs.

Benchmarking results in graphs

With all benchmarks completed and data collected, we can now compare the results using graphical charts for a clearer understanding.

4k random read:

Figure 1: 4K RR (IOPS)

Let’s start with the 4K random read test, where Figure 1 demonstrates the performance in IOPS.

StarWind VSAN NVMe over RDMA starts off strong, delivering 893,000 IOPS at a 4-depth queue and climbing to an impressive 1,624,000 IOPS at a 32-depth queue, and then slightly declining.

Microsoft Storage Spaces Direct (S2D) in both configurations (“mirror-only” and “mirror + parity”) showed significant variability. The “mirror-only” setup achieved a peak of 2,615,000 IOPS at a 128-depth queue, while “mirror + parity” peaked slightly lower at 2,450,000 IOPS. StarWind’s peak performance at 32-depth was about 62% of S2D “mirror-only” and 66% of S2D both tiers at their respective peaks.

This significant variability in S2D’s performance can be traced back to its sophisticated use of Cluster Shared Volumes (CSV). The CSV architecture enables multiple hosts to share access to the same disk, effectively coordinating read and write operations through the SMB 3.0 multichannel protocol. This approach is what gives S2D its impressive peak performance, especially in scenarios where the VM runs on the node that owns the volume. In this case, it can read data directly from the local disk, bypassing the network stack. This local read path minimizes latency and maximizes performance, leading to impressive IOPS numbers (if you want to explore this topic in more detail, please read here or check this article).

However, the very nature of CSV that boosts performance also introduces complexity. S2D’s architecture demands careful monitoring to ensure that VMs are optimally placed, as any deviation can lead to performance dips.

Figure 2: 4K RR (Latency)

Latency is a critical factor, and in Figure 2 we analyze latency metrics for the 4K random read test.

We can see that latency increased with queue depth across all configurations. StarWind began with a low latency of 0.267 ms, rising to 4.967 ms at maximum queue depth.

The S2D “mirror-only” configuration had a low starting latency of 0.278 ms but escalated to 2.921 ms at 128 depth. The “mirror + parity” setup followed a similar trend, starting at 0.297 ms and ending at 3.133 ms. At maximum queue depth, StarWind’s latency was approximately 70% higher than S2D “mirror-only” and about 59% higher than “mirror + parity”.

The latency advantage of S2D is again attributed to local reads. While S2D enjoys lower latency, StarWind VSAN’s performance remains unaffected by VM location, offering simplicity at the cost of slightly higher latency.

Figure 3: 4K RR (IOPS per 1% CPU Usage)

Figure 3 showcases the results of the 4K random read test with numjobs=3, measuring IOPS per 1% CPU usage.

StarWind demonstrated a steady increase in IOPS per 1% CPU usage, peaking at 15,321 IOPS at 32-depth before a slight drop.

S2D “mirror-only” showed the highest efficiency, reaching 21,352 IOPS per 1% CPU at 64-depth. The “mirror + parity” configuration had a similar peak efficiency of 19,773 IOPS per 1% CPU at the same depth. StarWind’s peak efficiency was around 72% of S2D “mirror-only” and 77% of “mirror + parity” at their most efficient points.

4k random read/write 70/30:

Figure 4: 4K RR/RW 70%/30% (IOPS)

In virtualized environments, the mixed 4K random read/write workload serves as the backbone of daily operations. The ability to maintain high performance with mixed I/O across varied queue depths is critical. Figure 4 shows IOPS for the 4K random read/write (70%/30%) pattern.

Interestingly, with Storage Spaces Direct, there’s a noticeable drop in performance at queue depths 4 and 8. This performance drop is not observed in StarWind VSAN tests. StarWind maintains consistent performance, hitting 596,000 IOPS at queue depth 4 and 756,000 IOPS at queue depth 8.

StarWind holds its ground well and demonstrates impressive stability, achieving 856,000 IOPS at a 16-depth queue before experiencing a slight dip. In comparison, the S2D “mirror-only” configuration reached a higher peak of 941,000 IOPS at a 64-depth queue, while “mirror + parity” setup lagged behind with a peak of 298,700 IOPS.

StarWind’s peak performance was about 91% of the S2D “mirror-only” configuration, but it significantly outshined “mirror + parity” setup, delivering nearly three times the IOPS.

The main reason for the lower performance in the S2D “mirror + parity” scenario is the overhead of ReFS, which has to move new data from the mirror to the parity tier, leading to performance degradation. As a result, S2D records 152,700 IOPS at queue depth 2, drops to a low of 102,600 IOPS at QD=8, and then peaks at 298,700 IOPS at queue depth 32. In contrast, StarWind’s more consistent performance makes it a strong contender, especially in virtualization environments where mixed workloads are common.

Figure 5: 4K RR/RW 70%/30% (Latency)

Figure 5 reveals the latency associated with the 4K random read/write (70%/30%) workload.

Here, the picture is the same: S2D’s Nested mirror-accelerated parity setup struggles, especially when the workload spans both mirror and parity tiers, causing data movement delays.

StarWind’s consistent latency, starting with 0.355 ms at queue depth 2 and rising to 6.001 ms at queue depth 64, ensures smoother operations without the need for complex configurations.

StarWind’s latency at maximum depth was almost identical to S2D “mirror-only” but 70% lower compared to the S2D “mirror + parity” configuration.

Figure 6: 4K RR/RW 70%/30% (IOPS per 1% CPU Usage)

In Figure 6, the IOPS per 1% CPU usage for the 4K random read/write (70%/30%) pattern is depicted.

StarWind VSAN shows strong efficiency with 9,085 IOPS per 1% CPU usage at 32 IO depth, nearing the performance of S2D’s “mirror-only” setup, while far surpassing “mirror + parity” configuration. StarWind’s efficiency was approximately 96% of S2D “mirror-only” and three times better than S2D in the “dual-tier” scenario.

For IOPS per 1% CPU usage, S2D’s performance is uneven, fluctuating with workload intensity, whereas StarWind provides steady and reliable results.

4k random write:

Figure 7: 4K RW (IOPS)

The 4K random write performance pattern, as shown in Figure 7, further highlights the disparities between Microsoft Storage Spaces Direct and StarWind VSAN.

S2D’s performance varies greatly depending on the workload’s placement within the mirror or parity tier, with significant drops at intermediate queue depths (4 to 16) in the “mirror-only” case. StarWind, meanwhile, is unaffected by workload placement and scales steadily up to a queue depth of 16 before declining at 32.

In pure 4K random write operations, StarWind stands out, achieving 341,000 IOPS at an 8-depth queue, which is 16% higher than S2D mirror-only’s peak of 294,000 IOPS at QD=32.

The S2D “mirror + parity” configuration struggles even more, peaking at only 89,000 IOPS at QD=16. Here, StarWind delivers a remarkable 283% higher peak write performance than S2D in the “dual-tier” scenario, making it an obvious choice for environments where write speed is critical.
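The comparisons in this section mix two kinds of statements: "X% of" and "X% higher than". A small sketch (the helper names are ours) shows how the 4K random write figures above follow from the peak IOPS values in the tables.

```python
def percent_of(a: float, b: float) -> float:
    """a expressed as a percentage of b."""
    return a / b * 100


def percent_higher(a: float, b: float) -> float:
    """How much higher a is than b, in percent."""
    return (a / b - 1) * 100


STARWIND_PEAK = 341_000      # 4k random write, QD=8
S2D_MIRROR_PEAK = 294_000    # 4k random write, QD=32
S2D_BOTH_PEAK = 89_000       # 4k random write, QD=16

print(f"vs mirror-only:     {percent_higher(STARWIND_PEAK, S2D_MIRROR_PEAK):.0f}% higher")
print(f"vs mirror + parity: {percent_higher(STARWIND_PEAK, S2D_BOTH_PEAK):.0f}% higher")
print(f"mirror + parity is {percent_of(S2D_BOTH_PEAK, STARWIND_PEAK):.0f}% of StarWind's peak")
```

Keep in mind that these peaks occur at different queue depths, so the percentages compare best cases rather than like-for-like test points.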

Figure 8: 4K RW (Latency)

Latency during 4K random write operations, depicted in Figure 8, confirms StarWind VSAN’s dominance in this test pattern. Starting at 0.595 ms, its write latency increases to 9.818 ms, which is still considerably lower than Storage Spaces Direct with the workload in the mirror tier, which begins at 1.171 ms and peaks at 14.504 ms at a 16 IO depth.

When comparing StarWind VSAN to Microsoft S2D with the workload spanning “mirror + parity”, the gap is even more pronounced: S2D’s latency climbs to 22.360 ms at a 32 IO depth. StarWind’s maximum latency was about 68% of the S2D “mirror-only” maximum and 44% of the “mirror + parity” maximum.

In the 4K RW pattern, we see that latency under S2D can spike, particularly when ReFS is forced to shuffle data between tiers, while StarWind VSAN’s latency remains consistently lower.

Figure 9: 4K RW (IOPS per 1% CPU Usage)

Efficiency in 4K random write workloads is measured in IOPS per 1% CPU usage, as shown in Figure 9.

StarWind’s efficiency in write operations is impressive, with 5,156 IOPS per 1% CPU usage at IO depth=32, outpacing Storage Spaces Direct with the workload in the mirror tier by about 19%. The “mirror + parity” configuration, once again, falls short, peaking at 1,126 IOPS per 1% CPU at IO depth=8, about 78% below StarWind’s peak.
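For readers who want to derive the same efficiency metric from their own measurements, the calculation is a plain division of IOPS by CPU utilization. A minimal sketch follows; the function name and the sample inputs are hypothetical and are not measurements from this article.

```python
def iops_per_cpu_percent(iops: float, cpu_percent: float) -> float:
    """IOPS delivered for every 1% of host CPU consumed."""
    if cpu_percent <= 0:
        raise ValueError("cpu_percent must be positive")
    return iops / cpu_percent

# Hypothetical sample values, NOT taken from the benchmark results.
sample_iops = 300_000
sample_cpu_percent = 60.0

efficiency = iops_per_cpu_percent(sample_iops, sample_cpu_percent)
print(f"{efficiency:,.0f} IOPS per 1% CPU")
```

Normalizing by CPU makes it possible to compare runs that reach similar IOPS at very different processor cost.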

64k random read:

Figure 10: 64K RR (Throughput)

As we shift to larger block sizes, Figure 10 presents throughput for the 64K random read test.

StarWind started with a throughput of 15,187 MiB/s, increasing to 19,062 MiB/s at 32 IO depth.

Storage Spaces Direct with the workload in the mirror tier reached a peak throughput of 53,187 MiB/s at IO depth=32, and the “mirror + parity” setup had a slightly lower peak of 48,500 MiB/s at the same IO depth. StarWind’s maximum throughput was approximately 36% of the “mirror-only” peak and 39% of the “mirror + parity” peak.

With 64K random reads, Microsoft S2D shines again by leveraging local data access to push throughput to impressive levels.
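Because the large-block tests are reported in MiB/s while the 4K tests are reported in IOPS, it can help to convert between the two. The sketch below does that for the 64K random read peaks quoted above, assuming “64K” means 64 KiB and that 1 MiB is 1,024 KiB; the function name is illustrative.

```python
def throughput_to_iops(mib_per_s: float, block_kib: int) -> float:
    """Convert throughput in MiB/s to IOPS for a fixed block size in KiB."""
    return mib_per_s * 1024 / block_kib

# Peak 64K random read throughput (MiB/s) quoted in the text.
peaks_mib_s = {
    "StarWind VSAN": 19_062,
    "S2D mirror-only": 53_187,
    "S2D mirror + parity": 48_500,
}

peak_iops = {name: throughput_to_iops(v, 64) for name, v in peaks_mib_s.items()}
for name, iops in peak_iops.items():
    print(f"{name}: ~{iops:,.0f} IOPS at 64K")
```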

Figure 11: 64K RR (Latency)

Figure 11 delves into latency during 64K random reads. The results align with the throughput data discussed earlier.

StarWind’s latency started at 0.493 ms and increased to 6.292 ms at 32 IO depth.

S2D with workload in the mirror tier began with a lower latency of 0.374 ms, peaking at 2.252 ms, while “mirror + parity” configuration began at 0.383 ms, increasing to 2.474 ms at 32 IO depth.

Figure 12: 64K RR (CPU Usage)

In Figure 12, we explore CPU usage during 64K random reads.

StarWind shows stable CPU usage, ranging from 25% to 28%, across various queue depths.

The S2D “mirror-only” scenario started at 17% and increased to 38% at IO depth=32. The “mirror + parity” setup followed a similar trend, starting at 18% and reaching 38%.

64k random write:

Figure 13: 64K RW (Throughput)

Figure 13 vividly illustrates the stark differences in 64K random write throughput between StarWind VSAN and Microsoft Storage Spaces Direct (S2D).

The performance of Storage Spaces Direct with the workload in the mirror tier fluctuates considerably. It starts high, at 7,475 MiB/s at IO depth=1 and 8,153 MiB/s at IO depth=2. At IO depth=4, S2D achieves 3,197 MiB/s, which drops to 2,419 MiB/s at IO depth=8 before recovering slightly to 2,906 MiB/s at IO depth=16 and rebounding to its highest score of 10,063 MiB/s at IO depth=32, outrunning StarWind by 139.9%. This erratic pattern mirrors the behavior observed in other tests, such as the 4K random write operations.

StarWind starts with lower throughput of 2,638 and 3,050 MiB/s at IO depths 1 and 2, but delivers much more consistent performance as the test progresses. At IO depth=4, StarWind clocks in at 3,306 MiB/s, outpacing S2D by 3.4%. The gap widens at IO depth=8, where StarWind reaches 3,613 MiB/s, a 49.4% lead over S2D. At IO depth=16, StarWind is still leading with 3,894 MiB/s, outperforming S2D by about 34%.

A different story unfolds when we examine S2D performance in “mirror + parity” tests. It struggles at IO depth 1, with a low of 881 MiB/s, peaks at 962 MiB/s at IO depth 8, and drops to 925 MiB/s at IO depth 32.
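One way to put a number on “erratic” versus “consistent” is the coefficient of variation (standard deviation divided by mean) of throughput across IO depths. The sketch below applies it to the 64K random write figures quoted above for IO depths 1 through 16; IO depth 32 is left out because StarWind’s value at that depth is not quoted in the text. The choice of this metric is ours, not part of the original methodology.

```python
from statistics import mean, pstdev

def coefficient_of_variation(samples):
    """Population standard deviation divided by the mean (unitless)."""
    return pstdev(samples) / mean(samples)

# 64K random write throughput in MiB/s at IO depths 1, 2, 4, 8, 16.
s2d_mirror = [7_475, 8_153, 3_197, 2_419, 2_906]
starwind = [2_638, 3_050, 3_306, 3_613, 3_894]

cv_s2d = coefficient_of_variation(s2d_mirror)
cv_starwind = coefficient_of_variation(starwind)

print(f"S2D mirror-only CV: {cv_s2d:.2f}")
print(f"StarWind VSAN CV:   {cv_starwind:.2f}")
```

A lower value means throughput stays closer to its average as queue depth changes; on these figures S2D “mirror-only” varies several times more than StarWind.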

Figure 14: 64K RW (Latency)

Latency for 64K random writes is detailed in Figure 14, where StarWind’s performance remains more consistent, avoiding the severe latency spikes observed with S2D at IO depths 4, 8, and 16.

Figure 15: 64K RW (CPU usage)

Let’s move on to Figure 15, which compares CPU usage during 64K random writes.

Here, CPU usage follows a trend similar to the previous 64K random write figures: Microsoft S2D’s efficiency is better only under specific conditions, while StarWind delivers more predictable usage, ranging from 17% to 21%.

1M read:

Figure 16: 1024K R (Throughput)

In Figure 16, we see the throughput results for 1024K read operations.

Microsoft S2D again benefits from local data access, achieving high throughput that significantly outpaces StarWind VSAN. StarWind’s 1024K read throughput ranged from 13,800 MiB/s to 18,900 MiB/s at IO depth=16. The S2D “mirror-only” setup peaked at 52,300 MiB/s, while “mirror + parity” reached a slightly lower peak of 49,600 MiB/s. StarWind’s peak throughput was approximately 36% of the “mirror-only” peak and 38% of the “mirror + parity” peak.

Figure 17: 1024K R (Latency)

Figure 17 shows the latency results during the 1024K read test.

The resulting latency is predictably lower in Microsoft Storage Spaces Direct. StarWind’s latency increased from 1.451 ms to 16.976 ms at a 16 IO depth, whereas “mirror-only” S2D showed lower latency, starting at 1.004 ms and peaking at 6.114 ms at a 16 IO depth.

“Mirror + Parity” followed a similar pattern, starting at 1.015 ms and peaking at 6.452 ms. StarWind’s maximum latency was about 178% higher than “mirror-only” (roughly 2.8 times) and 163% higher than S2D with workloads in both mirror and parity tiers (roughly 2.6 times).

Figure 18: 1024K R (CPU Usage)

Figure 18 highlights CPU usage during 1024K reads.

StarWind’s CPU usage ranged from 15% to 18%, while the S2D “mirror-only” setup started at 5% and increased to 16% at 16 IO depth. Even when workloads span both tiers, S2D maintains almost the same CPU usage levels as in the “mirror-only” benchmarks.

As IO depth increases, the CPU usage gap between StarWind VSAN and S2D narrows. S2D consistently uses less CPU across all IO depths, with the difference being most pronounced at lower IO depths (66.67% less at 1 IO depth than StarWind VSAN) and gradually decreasing to 11.11% less at 16 IO depth.

1M write:

Figure 19: 1024K W (Throughput)

When we shift our focus to 1024K sequential write throughput, Figure 19 underlines some clear distinctions in performance between StarWind VSAN and Storage Spaces Direct (S2D).

At IO depth=1, S2D in nested mirror-accelerated parity mode with the workload in the mirror tier reaches a throughput of 9,887 MiB/s, while StarWind VSAN manages 3,703 MiB/s. This represents 167% higher throughput for S2D.

As the IO depth increases to 8, S2D maintains its lead, achieving 10,500 MiB/s compared to StarWind’s 4,479 MiB/s. This results in a 134% higher throughput for S2D at this IO depth.

However, this performance advantage for S2D is primarily evident when the workload does not spill out of the mirror tier.

If the workload hits both tiers, mirror and parity, the results change significantly. Under these conditions, StarWind VSAN exhibits a more stable performance curve, delivering from 94% higher throughput than S2D in “mirror + parity” at IO depth=1 to 91% higher at IO depth=8.

Figure 20: 1024K W (Latency)

Latency during 1024K writes, as shown in Figure 20, paints the same picture.

StarWind’s latency increases from 5.399 ms at IO depth=1 to 35.707 ms at IO depth=8, while the S2D “mirror-only” configuration has a lower latency peak of 15.250 ms. The “mirror + parity” setup, however, suffers from extremely high latency, peaking at 68.188 ms. Measured peak to peak, StarWind VSAN’s latency is roughly 48% lower than that of the S2D “mirror + parity” configuration (equivalently, the S2D peak is about 91% higher).

Figure 21: 1024K W (CPU Usage)

Lastly, Figure 21 compares CPU usage during 1024K writes, with StarWind being significantly outmatched by both S2D setups.

StarWind VSAN’s CPU utilization increases from 16% at 1 IO depth to 19% as the queue depth rises. The S2D “mirror-only” configuration demonstrates a much lower CPU usage, capping at 10% at its highest throughput at IO depth=8. This efficiency gives S2D “mirror-only” an edge in terms of IOPS per CPU usage.

What’s really interesting is that when the workload spans both tiers, S2D exhibits even lower CPU usage, starting at just 4% at IO depth=1 and rising modestly to 5% at IO depths 2, 4, and 8.

Additional benchmarking: 1 VM, 1 numjobs, 1 iodepth.

To gain a deeper understanding of how StarWind Virtual SAN and Storage Spaces Direct (S2D) perform under specific synthetic conditions, we conducted additional benchmarks focusing on a single-thread scenario, with 1 thread and 1 queue. Typically, this is the most effective way to measure storage access latency in an ideal scenario. The benchmarks focus on 4k random read and write patterns, including synchronous write operations.
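The `numjobs` and `iodepth` terminology suggests a fio-style load generator, so the three patterns could be expressed roughly as below. This is a sketch under assumptions: the I/O engine, runtime, and target path are placeholders, and `--sync=1` is only one way to request synchronous writes, since the article does not say which option was used. The script only prints the command lines and does not run anything.

```python
import shlex

def fio_args(name, rw, sync=False):
    """Build a fio argument list for a 4k, single-job, queue-depth-1 test."""
    args = [
        "fio",
        f"--name={name}",
        f"--rw={rw}",             # randread or randwrite
        "--bs=4k",
        "--numjobs=1",
        "--iodepth=1",
        "--direct=1",             # bypass the page cache
        "--ioengine=libaio",      # assumption: Linux guest
        "--time_based",
        "--runtime=60",           # assumption: placeholder duration in seconds
        "--filename=/dev/sdX",    # assumption: placeholder test device
    ]
    if sync:
        args.append("--sync=1")   # assumption: synchronous writes via O_SYNC
    return args

jobs = [
    fio_args("4k-randread", "randread"),
    fio_args("4k-randwrite", "randwrite"),
    fio_args("4k-randwrite-sync", "randwrite", sync=True),
]

for job in jobs:
    print(shlex.join(job))
```

Pointing a write test at a real block device destroys its contents, so the placeholder path should be replaced with a dedicated test target.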

Benchmark results in a table

StarWind VSAN NVMe-oF HA (RDMA) – Host mirroring + MDRAID5 (1 VM)
Pattern	Numjobs	IOdepth	IOPS	MiB/s	Latency (ms)
4k random read	1	1	2,974	12	0.335
4k random write	1	1	2,379	10	0.419
4k random write (synchronous)	1	1	967	4	1.032

Storage Spaces Direct (RDMA) – Nested mirror accelerated parity – Data in mirror tier (1 VM)
Pattern	Numjobs	IOdepth	IOPS	MiB/s	Latency (ms)
4k random read	1	1	7,231	28	0.137
4k random write	1	1	5,660	22	0.175
4k random write (synchronous)	1	1	2,816	11	0.353

Storage Spaces Direct (RDMA) – Nested mirror accelerated parity – Data in mirror and parity tiers (1 VM)
Pattern	Numjobs	IOdepth	IOPS	MiB/s	Latency (ms)
4k random read	1	1	5,922	23	0.167
4k random write	1	1	2,575	10	0.387
4k random write (synchronous)	1	1	1,754	7	0.568
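With a single job and a queue depth of 1, only one IO is in flight at a time, so Little’s law predicts that mean latency is approximately the reciprocal of IOPS. The sketch below checks the tables above against that relation; the function name is illustrative.

```python
def expected_latency_ms(iops, outstanding_ios=1):
    """Little's law: mean latency = outstanding IOs / throughput."""
    return outstanding_ios / iops * 1000.0

# (IOPS, reported latency in ms) taken from the three tables above.
results = {
    "StarWind read": (2_974, 0.335),
    "StarWind write": (2_379, 0.419),
    "StarWind sync write": (967, 1.032),
    "S2D mirror read": (7_231, 0.137),
    "S2D mirror write": (5_660, 0.175),
    "S2D mirror sync write": (2_816, 0.353),
    "S2D mirror + parity read": (5_922, 0.167),
    "S2D mirror + parity write": (2_575, 0.387),
    "S2D mirror + parity sync write": (1_754, 0.568),
}

estimates = {name: expected_latency_ms(iops) for name, (iops, _) in results.items()}
for name, (iops, reported) in results.items():
    print(f"{name}: estimated {estimates[name]:.3f} ms, reported {reported:.3f} ms")
```

Every estimate lands within a few microseconds of the reported figure, which suggests the IOPS and latency columns are internally consistent.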

Benchmark results in graphs

This section presents visual comparisons of the performance and latency metrics across the storage configurations under test.

4k random read:

Figure 1: 4K RR (IOPS)

Figure 1 shows IOPS for the 4K random read test at IO depth=1 and numjobs=1.

Here, Storage Spaces Direct (S2D) with data in the mirror tier outshines the other configurations. It achieves 7,231 IOPS, which is 143% higher than StarWind VSAN’s 2,974 IOPS.

This superior performance is again due to S2D’s ability to perform local reads at the host level, whereas StarWind VSAN operates within a VM, leading to a longer IO datapath.

Even when data spans both the mirror and parity tiers, S2D still leads with 5,922 IOPS, outperforming StarWind by 99%.

Figure 2: 4K RR (Latency)

Latency metrics for the 4K random read test at IO depth=1, as shown in Figure 2, similarly favor Storage Spaces Direct with the workload in the mirror tier, which records a swift 0.137 ms. S2D’s latency is 59% lower than StarWind’s 0.335 ms.

Even when data spans both tiers, S2D maintains a respectable 0.167 ms, which is still 50% lower than StarWind’s.

4k random write:

Figure 3: 4K RW (IOPS)

Figure 3 shows the results of the 4K random write test at IO depth=1 and numjobs=1.

For 4k random writes, S2D with data in the mirror tier proves its prowess, achieving 5,660 IOPS, which is 138% higher than StarWind’s 2,379 IOPS. The superior performance of S2D is due to the direct writing to the mirror tier, bypassing the need to calculate parity, which is resource-intensive. For a more detailed explanation of how reading and writing occur in a Nested mirror-accelerated parity scenario, please refer to the following link.

In scenarios where data spans both the mirror and parity tiers, S2D’s performance drops to 2,575 IOPS, but it still edges out StarWind by 8%. The additional step of invalidating data in the parity tier in S2D slightly reduces performance compared to when the workload is fully contained within the mirror tier. In contrast, StarWind VSAN writes directly to the MDRAID5 array, resulting in read-modify-write (RMW) operations, which further reduce performance.

Figure 4: 4K RW (Latency)

Moving on to Figure 4, we examine the latency metrics for 4K random writes.

No surprises here. S2D in the mirror tier shows a clear advantage with a latency of 0.175 ms, which is 58% lower than StarWind’s 0.419 ms (put the other way, StarWind’s latency is about 139% higher). This advantage stems from S2D’s direct writing to the mirror tier, bypassing the parity calculations that slow down write operations.

When S2D data is spread across both tiers, the latency increases to 0.387 ms but remains about 8% lower than StarWind’s latency.

These results suggest that S2D can more effectively manage latency in 4K write operations at IO depth=1 and numjobs=1, ensuring quicker data processing, while StarWind’s longer IO datapath from inside a VM increases latency.

4k random write (synchronous):

Figure 5: 4K RW Synchronous (IOPS)

In our synchronous 4K RW single-threaded IO tests, as shown in Figure 5, S2D in the mirror tier reaches 2,816 IOPS, again outperforming StarWind’s 967 IOPS by a significant 191%. This difference is again due to S2D’s ability to write directly to the mirror tier, avoiding the overhead of parity calculations.

When S2D data is distributed across both tiers, the performance drops to 1,754 IOPS but still surpasses StarWind by 81%.

Figure 6: 4K RW Synchronous (Latency)

The latency figures for synchronous 4K RW single-threaded IO, depicted in Figure 6, tell a similar story, with S2D’s mirror tier configuration offering a quick 0.353 ms, which is 66% lower than StarWind’s 1.032 ms (StarWind’s latency is about 192% higher). Even with data in both tiers, S2D’s latency is 0.568 ms, 45% lower than StarWind’s (StarWind’s is about 82% higher).

This consistency in performance highlights S2D’s capability in managing synchronous write operations efficiently, while StarWind’s VM-based operation leads to a longer IO datapath and higher latency.
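Latency gaps can be quoted two ways, “X% lower” for the faster system or “Y% higher” for the slower one, and the two percentages are not interchangeable. A value can never be more than 100% lower than a positive baseline. The sketch below works through both forms for the synchronous 4K write latencies quoted above; the helper names are illustrative.

```python
def pct_lower(value, baseline):
    """How far `value` sits below `baseline`, as a percentage of `baseline`."""
    return (1.0 - value / baseline) * 100.0

def pct_higher(value, baseline):
    """How far `value` sits above `baseline`, as a percentage of `baseline`."""
    return (value / baseline - 1.0) * 100.0

# Synchronous 4K random write latency in ms, from the tables.
starwind = 1.032
s2d_mirror = 0.353
s2d_both = 0.568

for label, s2d in (("mirror-only", s2d_mirror), ("mirror + parity", s2d_both)):
    lower = pct_lower(s2d, starwind)
    higher = pct_higher(starwind, s2d)
    print(f"S2D {label}: {lower:.0f}% lower than StarWind; "
          f"StarWind is {higher:.0f}% higher")
```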

Conclusion

In conclusion, both Storage Spaces Direct and StarWind VSAN bring distinct strengths and weaknesses to the table, each catering to different needs within your IT infrastructure.

Storage Spaces Direct, being a native Microsoft solution, excels in read performance, especially when virtual machines are aligned with their corresponding volume-owning nodes. However, this advantage hinges on careful workload management. If the VMs aren’t perfectly aligned, or if the workload spills over from the mirror tier to the parity tier, you might see a significant dip in write performance as we observed during 4K and 64K random-write tests. Additionally, S2D’s capacity efficiency is somewhat compromised, especially when you factor in the need to reserve extra space for fault tolerance.

On the flip side, StarWind VSAN shines in environments that demand consistent write and mixed IO performance. Its stable read and write performance, regardless of VM placement, and superior capacity efficiency make it a compelling option. However, the absence of local read optimization and the need to deploy an additional VM (StarWind VSAN CVM) are considerations that might tip the scale depending on your specific needs.

Ultimately, if your priority is top-notch read performance and you’re prepared to closely monitor your workloads, Storage Spaces Direct could be your go-to. But if you’re looking for reliable write performance and better capacity efficiency, StarWind VSAN might be the better fit.
