You need to change the way you patch S2D Clusters

If you’re running a Storage Spaces Direct (S2D) Cluster, you might have noticed some instability in recent months, specifically when it comes to patching and performing maintenance.
Well, you’re in luck: 5 days ago, Microsoft released a new KB article that helps explain why you might have seen issues.

The scenario targeted by the Microsoft article is S2D Clusters running the May 2018 cumulative update (KB4103723) or a later patch level, where you see Event ID 5120 during patching or maintenance, leading to things like CSV timeouts, VM pauses, or even VM crashes.
On top of this, the CSV failure might trigger a live dump, which can cause the node to drop out of the cluster when under load.
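If you want to check whether a node has been logging this event, a query along these lines is one way to look. It is only a sketch: it assumes the event is written to the System log by the Microsoft-Windows-FailoverClustering provider, and the 30-day window is an arbitrary choice.

```powershell
# Sketch: list recent Event ID 5120 entries on the local node.
# Assumptions: System log, FailoverClustering provider, 30-day lookback.
$filter = @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id           = 5120
    StartTime    = (Get-Date).AddDays(-30)
}

# SilentlyContinue keeps the command quiet when no matching events exist.
Get-WinEvent -FilterHashtable $filter -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, MachineName, Message |
    Sort-Object TimeCreated
```

If the timestamps line up with your patching or maintenance windows, you are probably looking at the scenario the KB article describes.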

Crazy, right? Microsoft’s explanation for this is:

“In the May 8, 2018, cumulative update, a change was introduced to add SMB Resilient Handles for the Storage Spaces Direct intra-cluster SMB network sessions. This was done to improve resiliency to transient network failures and improve how RoCE handles network congestion.
Adding these improvements has also inadvertently increased time-outs when SMB connections try to reconnect and waits to time-out when a node is restarted. These issues can affect a system that is under stress. During unplanned downtime, IO pauses of up to 60 seconds have also been observed while the system waits for connections to time-out.”

So how can you avoid this? The good news is that Microsoft has a workaround for planned maintenance: going back to the old Storage Maintenance Mode step that was removed in the September CU last year!

If you’re managing a larger deployment of S2D, you’re hopefully taking advantage of Cluster Aware Updating (CAU), so that you don’t sit up all night watching your cluster patch. Unfortunately, CAU no longer triggers this storage maintenance mode (see here), so we need to make use of Pre and Post Scripts.
CAU Pre and Post scripts are great tools. They run on each node before it enters maintenance mode and after it leaves maintenance mode, which makes them perfect for executing the workaround for us!
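As a rough sketch, this is how the scripts might be attached to an updating run. The cluster name, the share path, and the PostUpdateScript.ps1 file name are placeholders I made up for illustration; the scripts need to live somewhere every node can read.

```powershell
# Sketch: attach Pre and Post scripts to a one-off CAU run.
# Hypothetical values: cluster name, share path and script file names.
$cauParams = @{
    ClusterName      = 'S2D-CLUSTER01'
    PreUpdateScript  = '\\fileserver\CAU\PreUpdateScript.ps1'
    PostUpdateScript = '\\fileserver\CAU\PostUpdateScript.ps1'
    Force            = $true   # skip the confirmation prompt
}

Invoke-CauRun @cauParams
```

If you use the self-updating role instead of one-off runs, `Set-CauClusterRole` takes the same `-PreUpdateScript` and `-PostUpdateScript` parameters.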

Here’s an example of a Pre-Patching Script that we might use:

This script gets ahead of CAU by putting the host into cluster maintenance mode itself. That’s because entering Storage Maintenance Mode kicks off storage rebuilds, which would prevent CAU from being able to do it afterwards.
To avoid hitting the same issue ourselves, the script first loops, waiting for any outstanding storage jobs to finish, before attempting to enter Cluster Maintenance Mode.
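The steps just described could be sketched roughly as follows. This is not the original script, only a minimal reconstruction of the described flow; the 30-second polling interval is an arbitrary assumption, and you may want to add a timeout so the loop can’t wait forever.

```powershell
# PreUpdateScript.ps1 (sketch, not the original script)
$node = $Env:COMPUTERNAME

# 1. Wait for any outstanding storage jobs (rebuilds, repairs) to finish.
#    Assumption: poll every 30 seconds with no upper time limit.
while (Get-StorageJob | Where-Object { $_.JobState -ne 'Completed' }) {
    Start-Sleep -Seconds 30
}

# 2. Put the node into cluster maintenance mode (pause and drain roles)
#    ourselves, before storage maintenance mode starts new storage jobs.
Suspend-ClusterNode -Name $node -Drain -Wait -ErrorAction Stop | Out-Null

# 3. Enter Storage Maintenance Mode for this node's scale unit.
try {
    Get-StorageFaultDomain -Type StorageScaleUnit |
        Where-Object { $_.FriendlyName -eq $node } |
        Enable-StorageMaintenanceMode -ErrorAction Stop
}
catch {
    # Roll back the pause so the node isn't left drained, then fail the run.
    Resume-ClusterNode -Name $node -Failback Immediate
    throw 'Failed to enter storage maintenance mode'
}
```

Throwing from the script makes the failure visible to CAU rather than letting patching continue on a node that never entered Storage Maintenance Mode.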

And here is an example of what we might use in a Post-Patch Script:

This script is a lot simpler. It checks whether the node is still in Cluster Maintenance Mode, resumes it if it is, and then brings the node out of Storage Maintenance Mode.
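A minimal sketch of that post-patch flow, under the same caveat: it is a reconstruction of the described steps, not the original script.

```powershell
# PostUpdateScript.ps1 (sketch, not the original script)
$node = $Env:COMPUTERNAME

# 1. If the node is still paused, resume it and fail roles back immediately.
if ((Get-ClusterNode -Name $node).State -eq 'Paused') {
    Resume-ClusterNode -Name $node -Failback Immediate -ErrorAction Stop | Out-Null
}

# 2. Bring this node's scale unit out of Storage Maintenance Mode.
Get-StorageFaultDomain -Type StorageScaleUnit |
    Where-Object { $_.FriendlyName -eq $node } |
    Disable-StorageMaintenanceMode -ErrorAction Stop
```

The order here follows the description above: cluster maintenance mode first, then storage maintenance mode.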

Hopefully this helps a few of the S2D Admins out there with their clusters, seeing as the MS article isn’t very well advertised. Personally I’d like to see them add it as a known issue to the Windows CU patch notes so that everyone finds it before they hit the issue, but we’ll see if that happens.

Currently there is no ETA on a permanent fix for this, but I’ll update the article if more information surfaces.

Original MS Article: https://support.microsoft.com/en-nz/help/4462487/

5 thoughts on “You need to change the way you patch S2D Clusters”

  1. Hello Ben, perfect post!!

    Only one question. Isn’t it better to disable storage maintenance mode in the post-patch script before you change the Paused state and resume the node?

    Regards

    Ralf

  2. This is a great article! I found a couple issues with the scripts though…
    PreUpdateScript.ps1 is missing -ErrorAction Stop in the try block, and both the Pre and Post scripts are missing Immediate for -Failback (in the catch block, for the Pre script). Here is the updated code below.

    Pre
    # Enter Storage Maintenance Mode
    try {
        Get-StorageFaultDomain -Type StorageScaleUnit | Where-Object { $_.FriendlyName -eq $Env:COMPUTERNAME } | Enable-StorageMaintenanceMode -ErrorAction Stop
    }
    catch {
        Resume-ClusterNode -Name $Env:COMPUTERNAME -Failback Immediate
        throw "Failed to enter storage maintenance mode"
    }

    Post
    # Exit Storage Maintenance Mode
    Get-StorageFaultDomain -Type StorageScaleUnit | Where-Object { $_.FriendlyName -eq $Env:COMPUTERNAME } | Disable-StorageMaintenanceMode
