
Microsoft Storage Spaces Direct (S2D): Is there any Data Locality?

  • September 26, 2024
  • 10 min read
StarWind DevOps Team Lead. Volodymyr possesses broad expertise in virtualization, storage, and networking, with exceptional experience in architecture planning, storage protocols, hardware sourcing, and research.

Introduction

In our previous articles, we explored how Microsoft Storage Spaces Direct (S2D) and StarWind Virtual SAN (VSAN) performed in various configurations within a 2-node Hyper-V cluster. The first article focused on non-RDMA scenarios with StarWind VSAN NVMe/TCP, compared to S2D with SMB3/TCP. In the second article, we looked at StarWind VSAN NVMe/RDMA alongside S2D configured with SMB Direct, which is SMB3 running over RDMA.

During benchmarking, in both cases (TCP and RDMA), we noticed unexpectedly strong read performance from S2D, both in bandwidth and in latency, which led us to a theory… Might it be that S2D is reading data locally, rather than getting any “help” from the partner node in transferring at least part of the data over the network? In our research below, we aim to validate that hypothesis.

We’ll also run tests for write operations, where, of course, we expect data replication between the nodes to occur as intended.

Let’s quickly revisit the S2D interconnect diagram (Figure 1) from our previous S2D benchmarking articles:


Figure 1. S2D-based clustering interconnect diagram.

DISCLAIMER: We’re aware that disk witness isn’t officially supported with S2D. However, for the sake of our benchmarking and to speed up deployment, we chose to proceed with it.

That said, do not use disk witness in your production S2D cluster.

Testing methodology

In an S2D Nested Mirror-Accelerated Parity configuration, the volumes are formatted with ReFS, which causes them to operate in File System Redirected Mode. You can learn more about ReFS features in the Microsoft article.

In practical terms, this means that I/O requests are routed over the SMB3 protocol to the Coordinator Node (referred to as the “Owner Node” in Microsoft’s Failover Cluster Manager), where they are processed locally. Simply put, when a non-Owner Node initiates a data request, the request is redirected over SMB3 to the Owner Node for processing.
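For reference, a CSV’s access mode and coordinating node can be checked from PowerShell. Below is a minimal sketch using the standard FailoverClusters cmdlets; note that in your cluster the CSV will likely appear under a name such as “Cluster Virtual Disk (Volume01)”:

    # Minimal sketch: show each CSV's coordinating node and access mode.
    Import-Module FailoverClusters

    # StateInfo reports Direct vs. FileSystemRedirected access;
    # FileSystemRedirectedIOReason explains why redirection is in effect.
    Get-ClusterSharedVolumeState |
        Select-Object Name, Node, StateInfo, FileSystemRedirectedIOReason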

To avoid the I/O redirection and associated overhead, we placed our test virtual machine (Test VM) directly on the Owner Node.

For our test setup, the VM was hosted on “sw-node-01” (see Figure 2), which is also the Owner Node for CSV “Volume01” (Figure 3). Obviously, the VM’s virtual disks were placed on “Volume01”.

Figure 2. The test VM hosted on “sw-node-01”.

 

Figure 3. Owner Node of CSV “Volume01”.
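The placement shown in Figures 2 and 3 can also be verified from PowerShell. A quick sketch (the clustered role name “Test VM” is an assumption; use the name shown in Failover Cluster Manager):

    # Which node currently owns (coordinates) each CSV?
    Get-ClusterSharedVolume | Select-Object Name, OwnerNode, State

    # Which node is the test VM's clustered role running on?
    Get-ClusterGroup -Name "Test VM" | Select-Object Name, OwnerNode, State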

 

We conducted a series of tests using the 1M read and 1M write patterns.

To maintain clarity and control over the variables, we capped test throughput at 1 GiB/s and limited network usage to a single cluster network. The goal was to observe network interface utilization.
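We won’t detail here how the traffic was confined in our lab; as one possible approach (not necessarily the one we used), SMB Multichannel constraints can pin SMB traffic, which carries S2D replication and CSV redirection, to a specific interface. The server and interface names below are assumptions:

    # Possible approach (assumed names): allow SMB traffic toward the partner
    # node only over the NIC with the alias "Cluster1".
    New-SmbMultichannelConstraint -ServerName "sw-node-02" -InterfaceAlias "Cluster1"

    # Review the constraints currently in place.
    Get-SmbMultichannelConstraint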

Within the VM, we used the following FIO patterns to measure performance for both read and write operations (an illustrative sketch of the commands is shown after the list):

  • 1M Read
  • 1M Write
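The exact FIO parameters we used are not reproduced here; for illustration only, a representative pair of invocations might look like the following (file name, size, queue depth, and runtime are assumptions; the 1 GiB/s cap matches the limit described above):

    # Illustrative 1M sequential read, bandwidth-capped at 1 GiB/s (1024 MiB/s).
    fio --name=read-1m --filename=fio-test.bin --size=50G --rw=read --bs=1M `
        --direct=1 --iodepth=16 --numjobs=1 --rate=1024m --time_based --runtime=300

    # Matching 1M sequential write run against the same test file.
    fio --name=write-1m --filename=fio-test.bin --size=50G --rw=write --bs=1M `
        --direct=1 --iodepth=16 --numjobs=1 --rate=1024m --time_based --runtime=300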

During the testing, we closely monitored and recorded the values of the following performance counters (a PowerShell sketch for sampling them follows the list):

  • Read Bytes/sec and Write Bytes/sec at the Cluster Shared Volume level (“Cluster CSVFS” performance counter).
  • Disk Read Bytes/sec and Disk Write Bytes/sec at the Mirror-Accelerated Parity virtual disk level (“PhysicalDisk” performance counter).
  • Bytes Total/sec at the cluster network level (“Network Interface” performance counter).
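For those who prefer the command line over the Performance Monitor GUI, the same counters can be sampled with Get-Counter on the cluster node; the instance wildcards below are assumptions and can be narrowed to specific volumes and NICs:

    # Sample the counters discussed above once per second for one minute.
    Get-Counter -Counter @(
        '\Cluster CSVFS(*)\Read Bytes/sec',
        '\Cluster CSVFS(*)\Write Bytes/sec',
        '\PhysicalDisk(*)\Disk Read Bytes/sec',
        '\PhysicalDisk(*)\Disk Write Bytes/sec',
        '\Network Interface(*)\Bytes Total/sec'
    ) -SampleInterval 1 -MaxSamples 60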

1M Read Test

In the 1M read test, we get a clear picture of how read operations behave.

Figure 4. Performance Monitor counters during the 1M read test.

The Performance Monitor screenshot in Figure 4 shows a read speed of 1 GiB/s both at the Cluster Shared Volume level and on the Mirror-Accelerated Parity volume (“PhysicalDisk” performance counter), while network utilization is negligible. Bytes Total/sec for the network interfaces stays around 90 KiB/s, which is insignificant compared to the disk read speed and indicates that the network is not being used for these reads.

This confirms that the read operations are handled locally, meaning the data is being directly accessed from the local storage on the same node “sw-node-01”, without requiring the network to transfer data between nodes.

1M Write Test

In this test scenario, we are observing 1M sequential write operations.

Figure 5. Performance Monitor counters during the 1M write test.

 

The Performance Monitor screenshot in Figure 5 shows a write speed of 1 GiB/s both at the Cluster Shared Volume level and on the Mirror-Accelerated Parity volume (measured by the “PhysicalDisk” performance counter).

However, at the network level we observe 2 GiB/s of utilization, which is unexpected. We anticipated roughly 1 GiB/s of traffic for data replication to the partner node. So why is twice as much data being sent to the partner node?

What do you think could be causing this? In the meantime, we’ll reach out to our colleagues at Microsoft to investigate further and will share more details in upcoming articles.

Conclusion

Our tests have confirmed the hypothesis: in an S2D Nested Mirror-Accelerated Parity configuration, when a virtual machine is hosted on the Owner Node, data is read locally without involving the network stack; the partner cluster node does not participate in read I/O at all. This behavior is what allows S2D to deliver the strong read performance we observed in our benchmarks: first article, second article.

The write tests confirmed that data is indeed replicated to the partner node. However, it remains unclear why the amount of data sent to the partner node is twice the size of the original data.

These observations are essential to keep in mind when operating production environments. If network bandwidth becomes a limiting factor, especially when VMs run on non-Owner Nodes, it can result in performance bottlenecks. By deliberately placing VMs on the volume Owner Node, you can avoid these issues and ensure more efficient operation.
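As a practical note, VM placement and CSV ownership can be aligned with the standard FailoverClusters cmdlets. A rough sketch, where the CSV resource name and VM role name are assumptions based on our lab naming:

    Import-Module FailoverClusters

    # Option 1: live-migrate the clustered VM role to the node that owns the CSV.
    $owner = (Get-ClusterSharedVolume -Name "Cluster Virtual Disk (Volume01)").OwnerNode.Name
    Move-ClusterVirtualMachineRole -Name "Test VM" -Node $owner -MigrationType Live

    # Option 2: move CSV ownership to the node where the VM already runs.
    Move-ClusterSharedVolume -Name "Cluster Virtual Disk (Volume01)" -Node "sw-node-01"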

Stay tuned for our upcoming articles, where we’ll continue to explore other solutions and provide deeper analysis to help optimize your IT strategy.

Hey! Found Volodymyr’s article helpful? Looking to deploy a new, easy-to-manage, and cost-effective hyperconverged infrastructure?
Alex Bykovskyi, StarWind Virtual HCI Appliance Product Manager
Well, we can help you with this one! Building a new hyperconverged environment is a breeze with StarWind Virtual HCI Appliance (VHCA). It’s a complete hyperconverged infrastructure solution that combines hypervisor (vSphere, Hyper-V, Proxmox, or our custom version of KVM), software-defined storage (StarWind VSAN), and streamlined management tools. Interested in diving deeper into VHCA’s capabilities and features? Book your StarWind Virtual HCI Appliance demo today!