Introduction
This post was prompted by a recent case in which one of our customers ran into trouble when their SAN switch failed. Their VMs generated an enormous number of errors, caused by the active paths switching during failover.
Problem
A typical fault-tolerant setup consists of one or more server HBAs, each connected to one or more storage processors. The active path, the one the server is currently using, is shown in the properties of the LUN. Path failover occurs when the active path to a LUN is changed from one path to another because a SAN component along that path has failed.
During failover (a scenario you can simulate by pulling out a cable), I/O is likely to pause for 30-60 seconds while the host determines whether the link is available. If you try to access the data, the VM, or its adapter during that time, the operation stalls until failover completes.
If a disaster breaks several links in the LUN path and all connections to the drive are lost, failover itself fails and multiple iSCSI disks report I/O errors.
You can guard against this scenario by eliminating single points of failure along the path, keeping regular backups and snapshots, and increasing the standard disk timeout value in the guest operating systems.
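To make the reasoning concrete, here is a deliberately simplified model in Python. It assumes the guest reports a disk error only when an I/O stall lasts longer than its disk timeout; real guests may behave differently, and the function name and sample values are illustrative rather than taken from any real system.

```python
def stall_causes_errors(stall_seconds: float, disk_timeout_seconds: float) -> bool:
    """Simplified assumption: the guest raises an I/O error only when the
    stall outlasts its disk timeout."""
    return stall_seconds > disk_timeout_seconds


# Hypothetical values: a 45-second failover pause against two guest timeouts.
for timeout in (30, 60):
    outcome = "errors" if stall_causes_errors(45, timeout) else "no errors"
    print(f"stall=45s timeout={timeout}s -> {outcome}")
```

Under this model, a timeout at least as long as the worst expected failover pause is what keeps the guest from reporting errors.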
Solution
Back up the registry first, then increase the TimeOutValue parameter as described below. With the longer timeout, the guest should be able to ride out a path failover without disk errors.
The steps are:
- Right-click Start and select Run.
- Type regedit.exe and click OK.
- In the left-pane tree, go to HKEY_LOCAL_MACHINE -> System -> CurrentControlSet -> Services -> Disk.
- Double-click the TimeOutValue parameter, set the value data to 0x3c (hexadecimal) or 60 (decimal), and click OK.
- Reboot the guest OS for the change to take effect.
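The same change can be scripted instead of made by hand in regedit. The sketch below builds a `reg add` command for the key and value named in the steps above and runs it only when explicitly asked to. The function names and the `--apply` switch are hypothetical, and the script assumes an elevated prompt on a Windows guest. A reboot is still required afterwards.

```python
import subprocess
import sys

DISK_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\Disk"


def build_timeout_command(timeout_seconds: int = 60) -> list[str]:
    """Return the reg.exe arguments that set TimeOutValue (a REG_DWORD)."""
    if timeout_seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return [
        "reg", "add", DISK_KEY,
        "/v", "TimeOutValue",
        "/t", "REG_DWORD",
        "/d", str(timeout_seconds),  # reg.exe accepts decimal DWORD data
        "/f",  # overwrite an existing value without prompting
    ]


def apply_timeout(timeout_seconds: int = 60) -> None:
    """Run the command; only meaningful on Windows with admin rights."""
    subprocess.run(build_timeout_command(timeout_seconds), check=True)


if __name__ == "__main__":
    command = build_timeout_command(60)
    print(" ".join(command))
    # Hypothetical switch: nothing is changed unless --apply is passed.
    if sys.platform == "win32" and "--apply" in sys.argv:
        apply_timeout(60)
```

Run without arguments, the script only prints the command, so you can review it before applying the change.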
Conclusion
After this change, Windows waits up to 60 seconds for delayed disk operations to complete before generating errors.
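To confirm the value after the reboot, you can run `reg query` against the same key and check the result. The helper below is a sketch that parses that output. It assumes `reg query` prints the value as a `TimeOutValue    REG_DWORD    0x3c` line; the function name and the sample text are illustrative.

```python
import re
from typing import Optional


def parse_timeout_value(reg_query_output: str) -> Optional[int]:
    """Return TimeOutValue in seconds, or None if the value is absent."""
    match = re.search(
        r"^\s*TimeOutValue\s+REG_DWORD\s+(0x[0-9a-fA-F]+)\s*$",
        reg_query_output,
        flags=re.MULTILINE | re.IGNORECASE,
    )
    return int(match.group(1), 16) if match else None


# Sample text in the assumed output format of:
#   reg query HKLM\SYSTEM\CurrentControlSet\Services\Disk /v TimeOutValue
sample = (
    "HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\Disk\n"
    "    TimeOutValue    REG_DWORD    0x3c\n"
)
print(parse_timeout_value(sample))  # 60
```

A result of `None` would mean the value was not found in the output, in which case the change was not applied to that guest.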