Unable to bring 2nd node online after disk offline

Software-based VM-centric and flash-friendly VM storage + free version

robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Tue Jul 29, 2014 2:32 pm

More lab work around HA. Two-node v8 SANs with HA between them. Took one of the underlying drives off-line whilst copying a 3GB file to the iSCSI mounted target. 2nd node changed to unsynchronised as expected. Copy carried on but I paused it anyway.

Brought the drive back online but a) it's not re-synchronised and b) I'm unable to manually start the synchronisation.

So what should I do now?

Cheers, Rob.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Tue Jul 29, 2014 2:33 pm

I restarted the service on the 2nd node and it has re-started synchronisation. That's not a very clean recovery from just a drive going offline temporarily?
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Tue Jul 29, 2014 3:42 pm

Hmm, it went part way through synchronisation and now it's started again. I'm sure I read a report of somebody else with the same problem.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Tue Jul 29, 2014 8:50 pm

It's still struggling to re-synchronise a 10GB disk on SSD... the second disk did re-synchronise but now the first one has started re-synchronising the other way!

Some observations:

The CPU use on the node that is replicating (i.e. writing to the SSD) is much higher than I would expect. Okay so this is a desktop PC specification with just a quad-core CPU running the two SAN VMs but it still feels too high. Why is writing to a disk such a CPU intensive operation? According to resource monitor, "System interrupts" (Deferred procedure call & interrupt service routines) is top of the CPU pile with StarWindService.exe close behind.

The other node supplying the data is pretty idle as I kind of expected - disk activity reading F:\Storage3.img

The disk re-synchronising is *only* 10GB big and it's on a SSD (both sides) and it's taking hours to re-synchronise. There are two 1Gbit/s virtual NICs for the sync channel - these are on a VMware workstation virtual NIC, i.e. the traffic isn't actually going across a real network - it's all virtual. This virtual network is running at <10MBit/s - somebody else has reported very low NIC speeds during sync.

Copying a 10GB file from A to B over a twin virtual 1GBit/s NIC from/to SSD should take seconds, not hours.
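
For a rough sanity check on that claim, here is a back-of-the-envelope calculation. It's only a sketch: it assumes decimal units (1 GB = 8000 megabits), ignores iSCSI/TCP overhead and disk throughput, and the function name transfer_seconds is made up for illustration.

```python
def transfer_seconds(size_gb: float, link_mbit_s: float) -> float:
    """Ideal transfer time in seconds, ignoring protocol overhead and disk limits."""
    size_megabits = size_gb * 1000 * 8  # decimal GB -> megabits (assumption)
    return size_megabits / link_mbit_s


# Two 1 Gbit/s sync links used in parallel (best case).
ideal = transfer_seconds(10, 2 * 1000)
# Roughly the rate seen on the virtual network during sync (upper bound).
observed = transfer_seconds(10, 10)

print(f"ideal: {ideal:.0f} s")
print(f"observed: {observed / 3600:.1f} h")
```

Under these assumptions the ideal case works out to about 40 seconds, while 10 Mbit/s gives roughly 2.2 hours, which matches the "seconds, not hours" gap.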

When/if it does finally replicate and settle down, I'll shutdown StarWind service and try copying the 10GB IMG file across the network as a test.

My biggest surprise is the CPU usage on the node that's re-replicating. It's very unresponsive on the console. I've experimented with v6 in the same lab a while back and CPU was never an issue.

Later...
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Wed Jul 30, 2014 10:20 am

Looks pretty weird...
Can I ask you to send the following to support@starwindsoftware.com (please mention the link to this thread in the email):
· StarWind logs from all problematic SAN boxes
· Windows Application and System logs (in *.csv format) from all problematic SAN boxes
· Detailed network diagram of SAN system
· Description of the actions that were performed before/at the time of the issue
· Approximate time frames when the issue happened
 
I'd appreciate it if you would separate the logs from different servers into different folders.
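
One possible way to lay out such a bundle, sketched in Python. Everything here is an assumption for illustration: the helper name bundle_logs, the folder-per-server layout and the zip format are not a StarWind requirement, and the log file paths are whatever you collected by hand.

```python
import shutil
from pathlib import Path
from typing import Dict, List


def bundle_logs(logs_by_server: Dict[str, List[Path]], dest: Path) -> Path:
    """Copy each server's logs into dest/<server>/ and zip the result.

    bundle_logs is a hypothetical helper; server names and file lists
    are supplied by the caller.
    """
    for server, files in logs_by_server.items():
        folder = dest / server
        folder.mkdir(parents=True, exist_ok=True)
        for log_file in files:
            # copy2 keeps timestamps, which helps when matching time frames
            shutil.copy2(log_file, folder / log_file.name)
    # Produces <dest>.zip next to the dest folder
    return Path(shutil.make_archive(str(dest), "zip", root_dir=dest))
```

Calling it with one entry per SAN box gives an archive with one top-level folder per server.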


Thank you
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Wed Jul 30, 2014 10:36 am

I have found one flaw in my lab re: SSD. I'm using VMware Workstation. I have three drives in my lab PC:

Drive C: - SATA-3
Drive E: - SATA-2 & RAID-5
Drive F: - SSD

The virtual machine itself is on drive E: but I then added an extra disk on the SSD, drive F:. But there is, IMO, a bug in VMware Workstation: if you create a snapshot, the snapshot files always go into the folder where the VM itself is stored, i.e. drive E: in my case. I have created snapshots, so the new writes are going to my much slower drive E:.

So this probably explains the speed. I'm removing the snapshots to sort this out.

But doesn't explain the sync issues - so later... will repeat the experiment.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Wed Jul 30, 2014 4:09 pm

Okay, speed is much improved but the recovery after a disk outage needs some further explanation.

Test #1 - Shutdown StarWind service to simulate shutdown of one node

Shut down the StarWind service on one node and restart it a minute later - it automatically re-synchronises.

Test #2 - take disk offline to simulate disk failure

Take the underlying disk (F: drive) offline and wait until it has said "Stopped synchronising". Bring disk back online.

StarWind does not automatically recover even though the F: drive has re-attached.

Attempting to re-synchronise manually fails - it just does nothing when you right click on the node and say "Synchronise".

Restarting the StarWind service causes a re-sync of all storage - it works, but it's rather drastic just to handle the failure of one disk.
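
For anyone wanting to repeat Test #2 without clicking through Disk Management, the steps can be sketched as a script. This is a rough sketch under assumptions: it uses the standard diskpart commands "select disk", "offline disk" and "online disk" via "diskpart /s", must be run from an elevated prompt on Windows, and the service name StarWindService is a guess based on the StarWindService.exe process name, so check it on your own box. The helper names are made up.

```python
import subprocess
import tempfile

# Assumption: service name guessed from the StarWindService.exe process name.
SERVICE_NAME = "StarWindService"


def diskpart_script(disk_number: int, action: str) -> str:
    """Build a diskpart script that takes a disk offline or brings it online."""
    if action not in ("offline", "online"):
        raise ValueError("action must be 'offline' or 'online'")
    return f"select disk {disk_number}\n{action} disk\n"


def run_diskpart(script: str) -> None:
    """Run the script with 'diskpart /s' (Windows, elevated prompt only)."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(script)
    subprocess.run(["diskpart", "/s", f.name], check=True)


def restart_service() -> None:
    """The drastic recovery step: stop and start the service."""
    subprocess.run(["net", "stop", SERVICE_NAME], check=True)
    subprocess.run(["net", "start", SERVICE_NAME], check=True)


# Nothing is executed here; this only shows the script for disk 2.
print(diskpart_script(2, "offline"))
```

The sequence would be run_diskpart(diskpart_script(2, "offline")), wait for the node to report it has stopped synchronising, run_diskpart(diskpart_script(2, "online")), and then restart_service() if nothing recovers on its own.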

I'll do a video to show you...
awedio
Posts: 89
Joined: Sat Sep 04, 2010 5:49 pm

Wed Jul 30, 2014 7:38 pm

Rob,

Check your PM.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Thu Jul 31, 2014 11:38 am

I tried to do a screen video capture via Camstudio but it exceeded the 2GB video limit - oops! Will do a series of screenshots instead or try TinyTake.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Thu Jul 31, 2014 3:34 pm

Okay, I've gone through this in detail and produced a series of screenshots:

sshot-1: diagram of storage set-up

Showing the two nodes UKMAC-SAN90 and UKMAC-SAN91. Tests are going to be done on Storage3 which is mounted on SSD storage.

sshot-2: copying file from SSD to SSD

UKMAC-TEST90 is a Windows 2012 R2 file server with storage3 mounted via twin-channel MPIO to both SAN nodes (so four iSCSI connections). I'm about to interrupt that copy by shutting down the StarWind service on UKMAC-SAN91

sshot-3: StarWind service shut-down

As expected, the connection to the SAN is lost and the remaining node is highlighted as partner node not ready. The file copy above carried on as normal - yeah! HA at its best.

sshot-4: StarWind service restarted

Re-start the service on UKMAC-SAN91 and as expected, synchronisation restarts automatically and after a short while, it finishes.

WISH: it would be nice if the console tried to automatically reconnect to UKMAC-SAN91

So far so good. But this was a clean shutdown of StarWind. That's not really a disaster scenario - more of a planned shutdown of one of the nodes. In the next tests, we're going to pull the rug out from under StarWind.

sshot-5: disk taken offline

Okay, so this is emulating a disk going off-line by taking it off-line in Disk Management. This time, one gets an error message saying state has changed to "Not synchronised". Once again, kind of expected.

sshot-6: current state

An additional screenshot just to show that UKMAC-SAN91 is shown as not synchronised.

sshot-7: drive back online but StarWind not noticed

Disk 2 has been brought back online but StarWind has not noticed. One can leave it in this state for hours and StarWind doesn't appear to re-attach the target or re-start synchronisation. IMO StarWind should have picked itself up here from the drive going offline.

Or is this by design??

sshot-8: restart StarWind

Here I've stopped and started the StarWind service and it's picked itself up. IMO if restarting StarWind sorts everything out, then it should be able to do that automatically itself, i.e. recover from a disk going offline and then coming back.

Restarting StarWind on a HA system is a bit drastic but it's not going to require downtime as the other node carries on. But I think there should be some other way to recover from this situation from within the console.

I wonder if the problem is that StarWind isn't automatically reconnecting to the drive after it's gone down?
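
To illustrate the kind of automatic recovery I mean, here is a minimal watchdog sketch. It is not how StarWind works internally and not an official workaround: watch_for_recovery and image_present are made-up names, the probe simply checks whether the image file path is visible again (assuming an offline disk makes F:\Storage3.img disappear), and what you do on recovery (e.g. restarting the service) is left to the callback.

```python
import os
import time
from typing import Callable


def image_present(path: str = r"F:\Storage3.img") -> bool:
    """Probe: is the image file visible? (assumed False while the disk is offline)"""
    return os.path.exists(path)


def watch_for_recovery(
    probe: Callable[[], bool],
    on_recovered: Callable[[], None],
    polls: int,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll probe() and call on_recovered() on each False -> True transition.

    Returns the number of recoveries seen within the given number of polls.
    """
    recoveries = 0
    was_available = probe()
    for _ in range(polls):
        sleep(interval_s)
        available = probe()
        if available and not was_available:
            on_recovered()  # e.g. trigger a re-sync or a service restart
            recoveries += 1
        was_available = available
    return recoveries
```

The point of the sketch is only that the offline-then-online transition is easy to detect from the outside, so it feels like something the service could react to by itself.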

Cheers, Rob.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Thu Jul 31, 2014 3:35 pm

I wasn't able to upload the screenshot zip file as the board is configured with rather small upload limits :shock:

So here it is in my DropBox:

https://dl.dropboxusercontent.com/u/366 ... 20test.zip

Cheers, Rob.
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Sat Aug 09, 2014 12:34 pm

I'm getting Error 404.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Sat Aug 09, 2014 12:45 pm

I had a Dropbox wobble - fixed now. Link should work.

Cheers, Rob.
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Thu Aug 14, 2014 5:42 pm

WISH: it would be nice if the console tried to automatically reconnect to UKMAC-SAN91
Already noted!

As for the long story with screenshots:
For now, that is the way it works. Nevertheless, I'll discuss it with our developers, because I 100% agree that this should be changed somehow.