Best Practices for Patching S2D and Azure Stack HCI Clusters – Part 1

Overview

I spend a lot of time in the Storage Spaces Direct Slack group, and one topic that comes up again and again is patching S2D clusters and the best way to do it.

For this blog series, I’m going to break down the patching best practices into 2 separate scenarios:

  1. Offline Patching
  2. Using Cluster Aware Updating

Offline Patching

Offline patching is a pretty common scenario when patching S2D clusters, and in my mind it is used for 2 reasons: catching up on multiple months of patching where there are known issues, and planned patching in a small window with an outage.

The process is pretty straightforward: shut everything in the cluster down, patch the hosts, and start it all back up again. But this obviously means your highly available platform isn’t that available, so why do it?

Pros

The main advantage of offline patching is that there is no risk of VMs experiencing unexpected reboots, because they’re already powered off. For the same reason, there is no risk to the storage volumes provided by Storage Spaces Direct.

You can also patch and reboot all nodes simultaneously, because no storage jobs need to run while all of the CSVs are offline.

Cons

The obvious disadvantage is that you need to arrange a business outage to shut everything down for patching. However, this outage only needs to be 1-2 hours.

Steps

The steps below are based on the instructions provided by Microsoft on Docs.Microsoft.com.

  1. Plan your maintenance window.
  2. Shut down all VMs on the cluster.
  3. Take the virtual disks offline.
    • Use Failover Cluster Manager to take the Cluster Shared Volumes offline under ‘Storage > Disks’
    • Or use PowerShell to take all disks offline with Get-ClusterSharedVolume -Cluster S2D-Cluster | Stop-ClusterResource
  4. Take the Cluster Pool offline.
    • Use Failover Cluster Manager to take the Cluster Pool offline under ‘Storage > Pools’
    • Or use PowerShell to take the pool offline with Get-ClusterResource -Cluster S2D-Cluster | ?{$_.ResourceType -eq "Storage Pool"} | Stop-ClusterResource
  5. Stop the cluster.
    • Run the Stop-Cluster -Cluster S2D-Cluster command
    • Or use Failover Cluster Manager to stop the cluster.
  6. Disable the cluster service on each node. This prevents the cluster service from starting up while the nodes are being patched and rebooted.
    • Set Cluster Service to Disabled in services.msc
    • Or use Get-Service clussvc -ComputerName Server01 | Set-Service -StartupType Disabled
  7. Apply the Windows Server Cumulative Update and any required Servicing Stack Updates to all nodes. (You can update all nodes at the same time. There is no need to wait, since the cluster is down.)
  8. Restart the nodes, and ensure everything looks good.
  9. Set the cluster service back to Automatic on each node.
    • Set the Cluster Service to Automatic in services.msc
    • Or use Get-Service clussvc -ComputerName Server01 | Set-Service -StartupType Automatic
  10. Start the cluster.
    • Run Start-Cluster -Name S2D-Cluster
  11. Bring the Cluster Pool back online.
    • Use Failover Cluster Manager to bring the Cluster Pool online under ‘Storage > Pools’
    • Or use PowerShell to bring the pool online with Get-ClusterResource -Cluster S2D-Cluster | ?{$_.ResourceType -eq "Storage Pool"} | Start-ClusterResource
  12. Bring the virtual disks back online.
    • Use Failover Cluster Manager to bring the Cluster Shared Volumes online under ‘Storage > Disks’
    • Or use PowerShell to bring all disks online with Get-ClusterSharedVolume -Cluster S2D-Cluster | Start-ClusterResource
  13. Monitor the status of the virtual disks by running the Get-Volume and Get-VirtualDisk cmdlets.
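
Steps 3 to 6 and 9 to 13 above can be chained together roughly as follows. This is a minimal sketch, not the S2D-Maintenance functions covered in the next section: the function names Stop-ClusterForPatching and Start-ClusterAfterPatching are hypothetical, it assumes Windows PowerShell 5.1 with the FailoverClusters module, and it has no error handling or health checks.

```powershell
# Sketch only. Function names are hypothetical; requires the FailoverClusters module.
# Get-Service -ComputerName needs Windows PowerShell 5.1 (it was removed in PowerShell 7).

function Stop-ClusterForPatching {
    param(
        [Parameter(Mandatory)][string]$Cluster
    )

    # Capture the node names first; they cannot be queried once the cluster is stopped.
    $nodes = (Get-ClusterNode -Cluster $Cluster).Name

    # Step 3: take the Cluster Shared Volumes offline.
    Get-ClusterSharedVolume -Cluster $Cluster | Stop-ClusterResource | Out-Null

    # Step 4: take the storage pool offline.
    Get-ClusterResource -Cluster $Cluster |
        Where-Object { $_.ResourceType -eq 'Storage Pool' } |
        Stop-ClusterResource | Out-Null

    # Step 5: stop the cluster (this prompts for confirmation).
    Stop-Cluster -Cluster $Cluster

    # Step 6: disable the cluster service on every node.
    foreach ($node in $nodes) {
        Get-Service -Name clussvc -ComputerName $node |
            Set-Service -StartupType Disabled
    }

    # Return the node list so it can be reused for the startup half.
    $nodes
}

function Start-ClusterAfterPatching {
    param(
        [Parameter(Mandatory)][string]$Cluster,
        [Parameter(Mandatory)][string[]]$Node
    )

    # Step 9: set the cluster service back to Automatic on every node.
    foreach ($name in $Node) {
        Get-Service -Name clussvc -ComputerName $name |
            Set-Service -StartupType Automatic
    }

    # Step 10: start the cluster, as written in the steps above.
    # Assumption: if the cluster name does not resolve while the cluster
    # is down, a node name may need to be passed here instead.
    Start-Cluster -Name $Cluster | Out-Null

    # Step 11: bring the storage pool back online.
    Get-ClusterResource -Cluster $Cluster |
        Where-Object { $_.ResourceType -eq 'Storage Pool' } |
        Start-ClusterResource | Out-Null

    # Step 12: bring the Cluster Shared Volumes back online.
    Get-ClusterSharedVolume -Cluster $Cluster | Start-ClusterResource | Out-Null

    # Step 13: report the state of the virtual disks.
    Get-VirtualDisk -CimSession $Cluster |
        Select-Object FriendlyName, HealthStatus, OperationalStatus
}
```

The node names are captured before the cluster stops because Get-ClusterNode has nothing to query afterwards. Keep the list returned by the first function and pass it to the second one once patching and reboots are done.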

Simplifying the process

Since this is a 13-step process, there are 13 places where human error can occur. To help remove that risk, and because I love automating things with PowerShell, I’ve created some scripts that reduce the number of steps to just 7.

Now stopping the cluster once your VMs are offline is a single command – Stop-S2DCluster.

Stop-S2DCluster will check all your volumes are healthy, and all VMs are shut down before taking any action. It will then stop all CSVs, and the Storage Pool, before stopping the cluster and setting all the cluster services to disabled on the hosts.
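
To illustrate the kind of pre-flight check described here, the sketch below collects anything that should block a shutdown. It is not the actual Stop-S2DCluster code: Find-ShutdownBlocker is a hypothetical helper, and it assumes it is handed the output of Get-VirtualDisk and Get-VM, relying only on their FriendlyName, HealthStatus, Name and State properties.

```powershell
# Sketch only: Find-ShutdownBlocker is a hypothetical helper, not part of S2D-Maintenance.
function Find-ShutdownBlocker {
    param(
        [object[]]$VirtualDisk = @(),  # e.g. output of Get-VirtualDisk
        [object[]]$VM = @()            # e.g. output of Get-VM
    )

    $blockers = @()

    # Any virtual disk that is not reporting Healthy blocks the shutdown.
    foreach ($disk in $VirtualDisk) {
        if ("$($disk.HealthStatus)" -ne 'Healthy') {
            $blockers += "Virtual disk '$($disk.FriendlyName)' is $($disk.HealthStatus)"
        }
    }

    # Any VM that is not fully powered off blocks the shutdown.
    foreach ($machine in $VM) {
        if ("$($machine.State)" -ne 'Off') {
            $blockers += "VM '$($machine.Name)' is $($machine.State)"
        }
    }

    # Emits one string per problem; emits nothing when it is safe to continue.
    $blockers
}

# Example usage (S2D-Cluster is the placeholder cluster name used in the steps above):
#   $disks = Get-VirtualDisk -CimSession S2D-Cluster
#   $vms   = Get-ClusterNode -Cluster S2D-Cluster | ForEach-Object { Get-VM -ComputerName $_.Name }
#   $found = @(Find-ShutdownBlocker -VirtualDisk $disks -VM $vms)
#   if ($found.Count -gt 0) { throw ($found -join '; ') }
```

Gathering every problem before stopping anything means the operator sees the full list in one pass, rather than fixing one VM only to be stopped by the next.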

Starting things up after you’ve patched all the hosts is just as easy with Start-S2DCluster.

Unlike Stop-S2DCluster, Start-S2DCluster needs to be run against a cluster node rather than the cluster, as it will start the cluster service on that node first and then automatically discover all the other nodes in the cluster. It’ll set all the cluster services back to automatic and start them. After the nodes have joined the cluster, it will bring the storage pool and CSVs back online.
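
The "wait for the nodes to join" part can be sketched as a simple polling loop. Again, this is an illustration rather than the real Start-S2DCluster code: Wait-ClusterNodeUp and its parameters are hypothetical, and it assumes that Get-ClusterNode -Cluster will accept the name of a node that is already running the cluster service.

```powershell
# Sketch only: Wait-ClusterNodeUp is a hypothetical helper, not part of S2D-Maintenance.
function Wait-ClusterNodeUp {
    param(
        [Parameter(Mandatory)][string]$Node,
        [int]$TimeoutSeconds = 300,
        [int]$PollSeconds = 5,
        # Overridable so the loop can be exercised without a real cluster.
        [scriptblock]$GetNode = {
            param($n)
            Get-ClusterNode -Cluster $n -ErrorAction SilentlyContinue
        }
    )

    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)

    while ($true) {
        # Ask the given node for the state of every node in its cluster.
        $all   = @(& $GetNode $Node)
        $notUp = @($all | Where-Object { "$($_.State)" -ne 'Up' })

        # Succeed only when at least one node was returned and all of them are Up.
        if ($all.Count -gt 0 -and $notUp.Count -eq 0) { return $true }

        # Give up once the timeout has passed.
        if ((Get-Date) -ge $deadline) { return $false }

        Start-Sleep -Seconds $PollSeconds
    }
}

# Example usage (Server01 is the placeholder node name used in the steps above):
#   if (-not (Wait-ClusterNodeUp -Node Server01)) { throw 'Nodes did not come up in time' }
```

An empty result is treated as "not ready" rather than "all up", so a node that is not answering yet cannot cause the pool and CSVs to be started too early.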

These PowerShell commands are part of my S2D-Maintenance functions, and the latest version can be downloaded from my GitHub repo.

Wrapping up

So in Part 1, we’ve gone over the process for performing offline maintenance on a Storage Spaces Direct or Azure Stack HCI cluster, and automated a number of the steps to simplify the process.

Offline maintenance is always advised when your cluster is 6 months or more behind on patching, as it reduces the risk of hitting known bugs and shortens the window required to catch up on patches.

Next time we’ll cover off using Cluster Aware Updating to make sure you don’t fall behind in patch level in the first place.

