You need to change the way you patch S2D Clusters

If you’re running a Storage Spaces Direct (S2D) Cluster, you might have noticed some instability in recent months, specifically when it comes to patching and performing maintenance. Well you’re in luck because 5 days ago, Microsoft released a new KB article that helps explain why you might have seen issues.

The scenario targeted by the Microsoft article is S2D Clusters running May (KB4103723) or later patch levels, where you experience Event ID 5120 during patching or maintenance, leading to things like CSV timeouts, VM pauses, or even VM crashes. On top of this, the CSV crash might trigger a live dump, which can cause the node to drop out of the cluster when under load.

Crazy right? Microsoft’s explanation for this is:

“In the May 8, 2018, cumulative update, a change was introduced to add SMB Resilient Handles for the Storage Spaces Direct intra-cluster SMB network sessions. This was done to improve resiliency to transient network failures and improve how RoCE handles network congestion. Adding these improvements has also inadvertently increased time-outs when SMB connections try to reconnect and waits to time-out when a node is restarted. These issues can affect a system that is under stress. During unplanned downtime, IO pauses of up to 60 seconds have also been observed while the system waits for connections to time-out.”

So how can you avoid this? Well good news is Microsoft has a workaround for planned maintenance, and that’s rolling back to the old Storage Maintenance Mode that was removed in September CU last year!

Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Enable-StorageMaintenanceMode

If you’re managing a larger deployment of S2D, you’re hopefully taking advantage of Cluster Aware Updating (CAU), so that you don’t sit up all night watching your cluster patch. Unfortunately, CAU no longer triggers this storage maintenance mode (see here), so we need to make use of Pre and Post Scripts. CAU Pre and Post scripts are great tools, they can execute scripts on each node before entering maintenance mode, and after leaving maintenance mode, and as such, they are perfect for executing the workaround for us!

Here’s an example of a Pre-Patching Script that we might use:

# Check for any outstanding Storage Jobs
"Waiting for storage jobs to complete"
do {
    Start-Sleep 5
}until(
    # Running Repair Jobs are less than 1
    ((Get-StorageJob |Where-Object {$_.Name -eq 'Repair' -and $_.JobState -ne 'Completed'})|Measure-Object).Count -lt 1
)
"Storage Jobs finished"

# Suspend Cluster Host to prevent chicken-egg scenario
Suspend-ClusterNode -Name $env:COMPUTERNAME -Drain -ForceDrain -Wait

# Enter Storage Maintenance Mode
try {
    Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "$($Env:ComputerName)"} | Enable-StorageMaintenanceMode
}
catch {
    Resume-ClusterNode -Name $Env:COMPUTERNAME -Failback
    throw "Failed to enter storage maintenance mode"
}

This script will skip ahead of CAU by actually putting the host into cluster maintenance mode itself, this is because entering Storage Maintenance mode kicks off storage rebuilds, which will prevent CAU from being able to do it. Now to avoid this issue ourselves, the script first loops through waiting for any outstanding storage jobs to finish, before attempting to enter Cluster Maintenance Mode.

And here is an example of what we might use in a Post-Patch Script:

# Check for any nodes left in a Paused State and Resume them
If ((get-clusternode -Name $Env:COMPUTERNAME).State -eq 'Paused') {
    Resume-ClusterNode -Name $Env:COMPUTERNAME -Failback
}

# Exit Storage Maintenance Mode
Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "$($Env:ComputerName)"} | Disable-StorageMaintenanceMode

Now this script is a lot simpler, this just makes sure the node isn’t in Cluster Maintenance mode, removes it if it is, and then brings the node out of Storage Maintenance mode.

Hopefully this helps a few of the S2D Admins out there with their cluster, seeing as the MS article isn’t very well advertised. Personally I’d like to see them add it as a known issue to the Windows CU Patch notes so that everyone finds it before the hit the issue, but we’ll see if that happens.

Currently there is no ETA on a permanent fix for this, however I’ll update the article if more information on this surfaces.

Original MS Article: https://support.microsoft.com/en-nz/help/4462487/