Introduction
In the previous article, we explored the I/O data path in a two-node S2D cluster and confirmed our hypothesis regarding local reads: in an S2D Nested Mirror-Accelerated Parity configuration, when a virtual machine is hosted on the Owner Node, data is read locally without involving the network stack, and the partner cluster node does not participate in read I/O at all.
However, we also encountered unexpected behavior during write operations. Specifically, data was being sent twice to the partner node over the network. We raised questions on community forums to understand why this was happening but did not receive a definitive answer.
After some deliberation, we formulated a new hypothesis, this time for writes.
Hypothesis: the number of data copies S2D sends over the network to the partner node equals the number of copies it writes locally. In other words, Microsoft S2D sends the same data multiple times rather than sending it just once.
The goal of this article is to verify that hypothesis with an experiment you can reproduce yourself.
For this, we prepared the following testbed.
High-level interconnect diagram: 2-node Microsoft S2D cluster, TCP
Disclaimer: For the sake of simplicity, and because this isn’t a production environment, we didn’t use a dedicated hardware cluster witness for proper quorum maintenance. While this setup works fine in a lab, it should be avoided in production.
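For production use, a witness should be configured instead. As a hedged illustration (the share path below is a hypothetical placeholder), a file share witness can be added with a single cmdlet:

```powershell
# Example only: configure a file share witness for the cluster quorum.
# "\\witness-server\s2d-witness" is a hypothetical share path.
Set-ClusterQuorum -FileShareWitness "\\witness-server\s2d-witness"

# A cloud witness (Set-ClusterQuorum -CloudWitness -AccountName ... -AccessKey ...)
# is another common option for 2-node clusters.
```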
To confirm our hypothesis, we will test three scenarios by creating three mirrored volumes that differ only in the number of data copies:
- Mirror Volume (NumberOfDataCopies = 2): A standard two-way mirror.
- Nested Mirror Volume (NumberOfDataCopies = 4): A standard nested mirror.
- Nested Mirror Volume (NumberOfDataCopies = 6): Another nested mirror, but with six data copies.
If our hypothesis is correct, we expect the following observations during write operations (a short sketch of the underlying arithmetic follows the list):
- Mirror Volume (NumberOfDataCopies = 2): 1x traffic.
  - One copy of the data will be written locally (on node sw-node-01) without using the network stack, and one copy will be transferred to the partner node (sw-node-02).
- Nested Mirror Volume (NumberOfDataCopies = 4): 2x traffic.
  - Two copies of the data will be written locally (on node sw-node-01), and two copies will be transferred to the partner node (sw-node-02).
- Nested Mirror Volume (NumberOfDataCopies = 6): 3x traffic.
  - Three copies of the data will be written locally (on node sw-node-01), and three copies will be transferred to the partner node (sw-node-02).
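As a quick sanity check on this arithmetic, here is a minimal PowerShell sketch (the variable names and the 1 GiB/s figure are ours, not anything S2D exposes) that computes the network traffic our hypothesis predicts for a two-node cluster:

```powershell
# Hypothesis: in a 2-node cluster, half of the data copies are written locally
# and the other half are sent to the partner node over the network.
$writeRateGiBps = 1   # the capped write rate used in our tests

foreach ($copies in 2, 4, 6) {
    $remoteCopies = $copies / 2                 # copies that must cross the network
    $expectedNet  = $remoteCopies * $writeRateGiBps
    "NumberOfDataCopies = $copies -> expected network traffic: $expectedNet GiB/s"
}
```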
To create the volumes for these three scenarios, we first created Storage Tiers with the following parameters:
New-StorageTier -StoragePoolFriendlyName s2d-pool -FriendlyName Mirror -ResiliencySettingName Mirror -MediaType SSD -NumberOfDataCopies 2
New-StorageTier -StoragePoolFriendlyName s2d-pool -FriendlyName NestedMirror -ResiliencySettingName Mirror -MediaType SSD -NumberOfDataCopies 4
New-StorageTier -StoragePoolFriendlyName s2d-pool -FriendlyName NestedMirror6DataCopies -ResiliencySettingName Mirror -MediaType SSD -NumberOfDataCopies 6 -FaultDomainAwareness PhysicalDisk
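Before building volumes on top of the tiers, their settings can be double-checked; a small sketch using the standard storage cmdlets:

```powershell
# List the tiers and the properties relevant to this experiment.
Get-StorageTier |
    Select-Object FriendlyName, ResiliencySettingName, NumberOfDataCopies, FaultDomainAwareness |
    Format-Table -AutoSize
```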
The parameters used for creating the volumes are also provided below:
New-Volume -StoragePoolFriendlyName s2d-pool -FriendlyName Volume01 -StorageTierFriendlyNames Mirror -StorageTierSizes 100GB
New-Volume -StoragePoolFriendlyName s2d-pool -FriendlyName Volume02 -StorageTierFriendlyNames NestedMirror -StorageTierSizes 100GB
New-Volume -StoragePoolFriendlyName s2d-pool -FriendlyName Volume03 -StorageTierFriendlyNames NestedMirror6DataCopies -StorageTierSizes 100GB
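Likewise, the resiliency actually applied to each virtual disk can be verified after creation. A minimal sketch (note that for tiered volumes the per-tier settings shown by Get-StorageTier remain the authoritative source):

```powershell
# Confirm the size and resiliency of the newly created virtual disks.
# For tiered virtual disks, some of these fields may be reported at the tier level instead.
Get-VirtualDisk -FriendlyName Volume01, Volume02, Volume03 |
    Select-Object FriendlyName, ResiliencySettingName, NumberOfDataCopies, Size |
    Format-Table -AutoSize
```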
Testbed overview
Hardware:
Windows Server (Hyper-V) Nodes {1..2} | Supermicro SYS-220U-TNR |
---|---|
CPU | Intel(R) Xeon(R) Platinum 8352Y @2.2GHz |
Sockets | 2 |
Cores/Threads | 64/128 |
RAM | 256GB |
NICs | 1x Mellanox ConnectX®-6 EN 200GbE (MCX613106A-VDA) |
Storage | 10x NVMe Micron 7450 MAX: U.3 3.2TB |
Software:
Windows Server | Windows Server 2022 Datacenter 21H2, OS build 20348.2849 |
---|---|
Test VM parameters:
Test VMs | sw-test-{01..03} |
---|---|
OS | Ubuntu Server 22.04 (minimal) |
CPU | 4 vCPU |
RAM | 4 GB |
NICs | 1x network adapter for management |
Storage | 10GB virtual disk for testing purposes |
Testing methodology
In an S2D Mirror/Nested Mirror configuration, volumes are formatted with ReFS, which inherently operates in File System Redirected Mode. In practical terms, this means that I/O requests are routed over the SMB3 protocol to the Coordinator Node (referred to as the “Owner Node” in Microsoft’s Failover Cluster Manager) and processed there: whenever a non-Owner Node initiates a data request, it is redirected via SMB3 to the Owner Node. You can learn more about ReFS features in the Microsoft article.
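The current I/O mode of each CSV can be checked directly with the FailoverClusters module; a small sketch:

```powershell
# Show whether each CSV runs in direct or file-system-redirected I/O mode, and why.
# ReFS-formatted CSVs are expected to report FileSystemRedirected here.
Get-ClusterSharedVolumeState |
    Select-Object Name, Node, StateInfo, FileSystemRedirectedIOReason |
    Format-Table -AutoSize
```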
To avoid the I/O redirection and associated overhead, we placed our test virtual machines (Test VMs) directly on the Owner Node.
For our test setup, the VMs were hosted on “sw-node-01”, which is also the Owner Node for the CSVs “Volume01”, “Volume02”, and “Volume03”.
The virtual disks of the VMs were placed into corresponding Cluster Shared Volumes (CSVs) as follows: “Volume01” for the sw-test-01 VM, “Volume02” for the sw-test-02 VM, and “Volume03” for the sw-test-03 VM.
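Whether the VMs and CSV ownership actually line up can be checked (and corrected) with a few standard cmdlets. A sketch, assuming the names used in this testbed; the CSV display name in the last line follows the default “Cluster Virtual Disk (...)” naming convention and may differ in your environment:

```powershell
# Check which node currently owns each CSV.
Get-ClusterSharedVolume | Select-Object Name, OwnerNode

# Check which node hosts the test VMs.
Get-ClusterGroup | Where-Object GroupType -eq 'VirtualMachine' |
    Select-Object Name, OwnerNode

# If needed, move CSV ownership to the node hosting the VMs.
Move-ClusterSharedVolume -Name "Cluster Virtual Disk (Volume01)" -Node sw-node-01
```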
We carried out a series of tests, one per Test VM, each using a 1 MiB sequential write pattern.
To ensure consistent results and maintain control over variables, we capped the write rate at 1 GiB/s. The primary objective was to closely monitor the utilization of the network interface during these operations.
To measure write performance, we configured the Test VMs with the following FIO parameters (the --rate_iops=1000 limit combined with 1 MiB blocks enforces the ~1 GiB/s cap):
fio --name=write --numjobs=1 --iodepth=16 --bs=1024k --rw=write --ioengine=libaio --direct=1 --group_reporting=1 --filename=/dev/sdb --runtime=60 --time_based=1 --rate_iops=1000
During the testing, we closely monitored and recorded the values of the following performance counters (a Get-Counter sketch for collecting them follows the list):
- Write Bytes/sec at the Cluster Shared Volume level (“Cluster CSVFS” performance counter).
- Disk Write Bytes/sec on the virtual disk level (“PhysicalDisk” performance counter).
- Bytes Total/sec on the cluster network level (“Network Interface” performance counter).
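As a hedged example, the same counters can also be collected from PowerShell on the Owner Node; exact counter paths and instance names may differ slightly between systems:

```powershell
# Sample the three counters once per second for 60 seconds (the FIO run time).
$counters = @(
    '\Cluster CSVFS(*)\Write Bytes/sec',
    '\PhysicalDisk(*)\Disk Write Bytes/sec',
    '\Network Interface(*)\Bytes Total/sec'
)
Get-Counter -Counter $counters -SampleInterval 1 -MaxSamples 60
```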
Tests
1st scenario (NumberOfDataCopies = 2)
The performance monitor shows a write speed of 1 GiB/s at the Cluster Shared Volume level and on the Mirror volume (measured by the “PhysicalDisk” performance counter).
At the network level, we observe 1 GiB/s of utilization: one copy of the data is written locally on sw-node-01, and one copy is replicated to the partner node, sw-node-02.
The first part of the experiment confirms our hypothesis.
2nd scenario (NumberOfDataCopies = 4)
The performance monitor again shows a write speed of 1 GiB/s at the Cluster Shared Volume level and on the Nested Mirror volume (measured by the “PhysicalDisk” performance counter).
However, at the network level, we observe 2 GiB/s of network utilization. This behavior surprised us in the previous article, as we expected to see 1 GiB/s.
This time, however, the result aligns with what we expect to observe: two copies of the data are written locally on sw-node-01, and two copies are replicated to the partner node, sw-node-02.
To further confirm the hypothesis, let’s move on to the third scenario.
3rd scenario (NumberOfDataCopies = 6)
The performance monitor shows a write speed of 1 GiB/s at the Cluster Shared Volume level and on the Nested Mirror volume (measured by the “PhysicalDisk” performance counter), unchanged from the previous scenarios.
However, at the network level, we observed 3 GiB/s of network utilization.
This aligns with the expectation that three copies of the data were replicated to the partner node, sw-node-02, once again providing clear confirmation of our hypothesis.
Conclusion
Our tests have fully confirmed the hypothesis: the number of times S2D transfers data over the network to the partner node equals the number of data copies it writes locally. As a result, instead of sending the data just once, Microsoft S2D sends it over the network as many times as there are local copies, i.e., half the configured NumberOfDataCopies in a two-node cluster.
In practice, this means that in 2-node S2D nested scenarios, identical data is sent to the partner node twice with the default “NumberOfDataCopies = 4”.
Additionally, keep in mind that if your VM is not aligned with the volume’s Owner Node, data will first be redirected over the network to the Owner (Coordinator) Node because ReFS always operates in File System Redirected Mode, further increasing network utilization. This behavior of S2D should be carefully considered when planning your network environment for production.