Microsoft’s Storage Spaces Direct (S2D) has been a core Windows Server feature since its introduction in Windows Server 2016 Datacenter Edition. S2D pools the local drives of clustered servers into scalable, highly available software-defined storage. In this article, I focus on how to replace a failed disk in a Storage Spaces Direct deployment. The process is straightforward, but it requires some familiarity with PowerShell.
S2D Fault Tolerance Levels
Before replacing the disk, it’s important to understand how S2D tolerates failures. Storage Spaces Direct lets you create mirrored volumes using either two-way or three-way mirroring:
Two-way mirroring: This configuration keeps two copies of the data and tolerates one failure (either a disk or a server). It offers 50% storage efficiency: every 1 TB of data requires 2 TB of raw storage.
Three-way mirroring: This configuration keeps three copies of the data and tolerates two simultaneous failures. It offers 33% storage efficiency: every 1 TB of data requires 3 TB of raw storage.
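The efficiency figures above come from simple arithmetic: every copy stores the full data, so raw capacity is the data size multiplied by the number of copies, and a mirror survives one fewer failure than it has copies. The small function below is a hypothetical illustration of that calculation (Get-MirrorFootprint is not part of any Windows Server module).

```powershell
# Hypothetical helper (not a built-in cmdlet): footprint of an N-way mirror.
# Every copy stores the full data, so raw capacity = data x copies,
# efficiency = 1 / copies, and the mirror survives (copies - 1) failures.
function Get-MirrorFootprint {
    param(
        [Parameter(Mandatory)] [double] $DataTB,
        [Parameter(Mandatory)] [ValidateRange(2, 3)] [int] $Copies
    )
    [pscustomobject]@{
        DataTB          = $DataTB
        Copies          = $Copies
        RawTB           = $DataTB * $Copies
        EfficiencyPct   = [math]::Round(100 / $Copies, 1)
        FaultsTolerated = $Copies - 1
    }
}

Get-MirrorFootprint -DataTB 1 -Copies 2   # RawTB 2, EfficiencyPct 50, FaultsTolerated 1
Get-MirrorFootprint -DataTB 1 -Copies 3   # RawTB 3, EfficiencyPct 33.3, FaultsTolerated 2
```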
While other options like parity and mirror-accelerated parity exist, Microsoft recommends using mirroring for most performance-sensitive workloads. For more in-depth information on fault tolerance in S2D, check out Microsoft’s official documentation.
Step 1: Identify the Failed Disk
If a disk’s health status is Unhealthy, or its operational status shows an error such as Transient Error, it’s time to replace it. To see whether the cluster is currently running any storage jobs, such as repairs, use this PowerShell command:
Get-StorageSubSystem *Cluster* | Get-StorageJob
In Failover Cluster Manager, a malfunctioning disk shows up under Storage > Pools, similar to the screenshot below.
You can also list the pool’s physical disks and their health status in PowerShell:
Get-StoragePool *TESTS2D* | Get-PhysicalDisk
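To narrow that listing down to the disks that need attention, you can filter on the HealthStatus and OperationalStatus properties. The function below is a hypothetical helper rather than a built-in cmdlet; it assumes those properties compare as strings, just as the article’s own `-Notlike ok` filter does, and keeps only a few useful columns.

```powershell
# Hypothetical helper: pass through only disks that are not Healthy/OK,
# keeping the columns that matter when deciding what to replace.
function Select-UnhealthyPhysicalDisk {
    param(
        [Parameter(Mandatory, ValueFromPipeline)] [object] $PhysicalDisk
    )
    process {
        # OperationalStatus can hold several values; -notlike returns the
        # non-OK ones, which is truthy whenever at least one is present.
        if ($PhysicalDisk.HealthStatus -ne 'Healthy' -or
            $PhysicalDisk.OperationalStatus -notlike 'OK') {
            $PhysicalDisk |
                Select-Object FriendlyName, SerialNumber, MediaType,
                              HealthStatus, OperationalStatus, Usage
        }
    }
}
```

On a cluster node you would pipe the pool’s disks into it, for example `Get-StoragePool *TESTS2D* | Get-PhysicalDisk | Select-UnhealthyPhysicalDisk | Format-Table -AutoSize`.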
Next, assign the physical disk object to a PowerShell variable for easier manipulation:
$Disk = Get-PhysicalDisk |? OperationalStatus -Notlike ok
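If more than one disk is unhealthy, `$Disk` silently becomes an array and the following steps would act on every disk it holds. A defensive sketch, using a hypothetical helper name, applies the same filter but stops unless exactly one disk matches:

```powershell
# Hypothetical helper: return the single disk whose OperationalStatus is not OK,
# or throw so that the retire/remove steps never run against several disks.
function Get-SingleFailedDisk {
    param(
        [Parameter(Mandatory)] [object[]] $PhysicalDisk
    )
    $failed = @($PhysicalDisk | Where-Object { $_.OperationalStatus -notlike 'OK' })
    if ($failed.Count -ne 1) {
        throw "Expected exactly one failed disk but found $($failed.Count); review Get-PhysicalDisk before continuing."
    }
    $failed[0]
}
```

Usage on a cluster node would be `$Disk = Get-SingleFailedDisk -PhysicalDisk (Get-PhysicalDisk)`; if it throws, inspect the disks manually before retiring anything.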
Step 2: Retire the Failed Disk and Physically Identify It
Before going further, mark the failed disk as Retired so that Storage Spaces stops allocating new data to it, which helps avoid data loss:
Set-PhysicalDisk -InputObject $Disk -Usage Retired
Remove the failed disk from the storage pool:
Get-StoragePool *TESTS2D* | Remove-PhysicalDisk -PhysicalDisks $Disk
You might encounter a warning stating that the device is unresponsive. This is expected, as you are removing a failed disk.
To find the failed disk in the chassis quickly, turn on its identification LED:
Get-PhysicalDisk |? OperationalStatus -Notlike OK | Enable-PhysicalDiskIdentification
NOTE: LED identification is only available in Windows Server 2016 and later, and the server’s drive enclosure must support SCSI Enclosure Services (SES).
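Before heading to the server room, it can help to note the disk’s serial number and the location Windows reports for it, so you can confirm you are pulling the right drive even if the enclosure lacks SES support. The sketch below is again a hypothetical helper that selects the FriendlyName, SerialNumber, PhysicalLocation, EnclosureNumber, and SlotNumber properties; how much detail the location fields carry depends on your hardware.

```powershell
# Hypothetical helper: record where Windows thinks the disk lives so the
# serial number on the drive label can be checked before it is pulled.
function Show-FailedDiskLocation {
    param(
        [Parameter(Mandatory, ValueFromPipeline)] [object] $PhysicalDisk
    )
    process {
        $PhysicalDisk |
            Select-Object FriendlyName, SerialNumber, PhysicalLocation,
                          EnclosureNumber, SlotNumber
    }
}
```

For example, `$Disk | Show-FailedDiskLocation | Format-List` prints these fields for the disk you stored earlier.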
Head over to the server room to locate the failed disk. It will be the one with the LED light on.
Once you have replaced the disk, turn off the identification LED. This command turns it off on every disk whose operational status is OK:
Get-PhysicalDisk |? OperationalStatus -like OK | Disable-PhysicalDiskIdentification
Step 3: Add the New Disk to the Storage Pool
Now, connect the new disk to your server. Make sure it is online and empty: Storage Spaces Direct only adds disks without partitions or volumes to the pool, so clear any disk that has been used before.
To check whether the operating system detects the new disk and whether it’s suitable for joining the S2D pool, run this command:
$Disk = Get-PhysicalDisk | ? CanPool -eq $true
Note: In some cases, you may need to reboot the server before the new disk is detected correctly.
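If the CanPool check returns nothing, the CannotPoolReason property usually explains why. The hypothetical helper below keeps the disks that can join the pool and prints a warning for every other disk, skipping disks that are ineligible only because they already belong to a pool. It assumes that reason reads "In a Pool", as Get-PhysicalDisk typically displays it.

```powershell
# Hypothetical helper: emit disks with CanPool = True; warn about the rest,
# except existing pool members, whose reason is simply that they are in a pool.
function Select-PoolableDisk {
    param(
        [Parameter(Mandatory)] [object[]] $PhysicalDisk
    )
    foreach ($d in $PhysicalDisk) {
        if ($d.CanPool) {
            $d
        }
        elseif ($d.CannotPoolReason -notlike '*In a Pool*') {
            Write-Warning ("{0} ({1}) cannot be pooled: {2}" -f
                $d.FriendlyName, $d.SerialNumber, ($d.CannotPoolReason -join ', '))
        }
    }
}
```

Used as `$Disk = Select-PoolableDisk -PhysicalDisk (Get-PhysicalDisk)`, it fills `$Disk` the same way as the one-liner above while surfacing why a replacement disk was rejected.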
If Storage Spaces Direct has not already added the new disk to the pool automatically, add it with the following command:
Get-StoragePool *TESTS2D* | Add-PhysicalDisk -PhysicalDisks $Disk -Verbose
After adding the new disk, Storage Spaces Direct will automatically start to rebalance the data across all available disks. To check the progress of the rebalancing process, you can use:
Get-StoragePool *TESTS2D* | Get-StorageJob
The output lists each running job and its progress; once all jobs have completed, your system is back to its optimal state.
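If you want a script to wait until the repair and rebalance work has finished, you can poll the storage jobs in a loop. The sketch below is a hypothetical helper; it assumes JobState compares as the string "Running", as it appears in Get-StorageJob output, and takes the job source as a scriptblock so you can point it at a specific pool.

```powershell
# Hypothetical helper: poll storage jobs until none is in the Running state.
function Wait-StorageJobCompletion {
    param(
        # Where jobs come from; defaults to every storage job visible on the node.
        [scriptblock] $GetJob = { Get-StorageJob },
        # Seconds to wait between polls.
        [ValidateRange(0, 3600)] [int] $PollSeconds = 30
    )
    while ($true) {
        $running = @(& $GetJob | Where-Object { $_.JobState -eq 'Running' })
        if ($running.Count -eq 0) { break }
        foreach ($job in $running) {
            Write-Host ("{0}: {1}% complete" -f $job.Name, $job.PercentComplete)
        }
        Start-Sleep -Seconds $PollSeconds
    }
    Write-Host 'No running storage jobs remain.'
}
```

For this pool you might call `Wait-StorageJobCompletion -GetJob { Get-StoragePool *TESTS2D* | Get-StorageJob } -PollSeconds 60`. A job in any other state, such as a suspended one, does not count as running here, so it is worth checking the final Get-StorageJob output as well.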
Conclusion
Replacing a failed physical disk in Storage Spaces Direct on Windows Server 2022 is fairly straightforward, but it takes a handful of PowerShell commands and careful handling of the disks. Following these steps helps keep your data intact and your storage system running smoothly.
Remember, the process comes down to three main steps:
- Identify the failed disk.
- Retire it, remove it from the pool, and locate it physically.
- Add the new disk to the storage pool and let the system rebalance.
With this approach, you should be able to replace failed disks without significant downtime. For more information, check out Microsoft’s official documentation on Storage Spaces Direct.