Best Practices for patching S2D and AzureStack HCI Clusters - Part 1

Overview

I spend a lot of time on the Storage Spaces Direct Slack group, and one thing that comes up again and again is patching S2D Clusters and the best way to go about it.

For this blog series, I’m going to break down the patching best practices into 2 separate scenarios:

  1. Offline Patching
  2. Using Cluster Aware Updating

Offline Patching

Offline patching is a pretty common scenario for S2D Clusters, and in my mind it is used for two reasons: catching up on multiple months of patching where there are known issues to avoid, and planned patching in a small window where an outage is acceptable.

The process is pretty straightforward: shut everything in the cluster down, patch the hosts, and start it all back up again. But this means your highly available platform isn’t that available for a while, so why do it?

Pros

The main advantage of offline patching is that there is no risk of VMs experiencing unexpected reboots, because they’re already powered off, and for the same reason there is no risk to the storage volumes provided by Storage Spaces Direct.

You can also patch and reboot all nodes simultaneously, because there are no storage jobs that need to run while all of the CSVs are offline.
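
As a rough sketch of what rebooting everything at once can look like (the node names here are placeholders; once the cluster service is stopped you can’t query them from the cluster, so list them explicitly):

# Reboot every node at the same time after patching, while the cluster is fully stopped
$Nodes = 'S2D-Node01', 'S2D-Node02', 'S2D-Node03', 'S2D-Node04'
Restart-Computer -ComputerName $Nodes -Force -Wait -For WinRM -Timeout 1800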

Cons

The obvious disadvantage, of course, is that you need to arrange a business outage to shut everything down for patching; however, the outage only needs to be 1-2 hours.

Steps

The steps below are based on the instructions provided by Microsoft on Docs.Microsoft.com. The PowerShell from these steps is also collected into a single sketch after the list.

  1. Plan your maintenance window.
  2. Shut down all VMs on the cluster
  3. Take the virtual disks offline.
    1. Use Failover Cluster Manager to take the Cluster Shared Volumes offline under ‘Storage > Disks’
    2. Or use PowerShell to offline all disks with Get-ClusterSharedVolume -Cluster S2D-Cluster | Stop-ClusterResource
  4. Take the Cluster Pool offline
    1. Use Failover Cluster Manager to take the Cluster Pool offline under ‘Storage > Pools’
    2. Or use PowerShell to offline the pool with Get-ClusterResource -Cluster S2D-Cluster | ?{$_.ResourceType -eq "Storage Pool"} | Stop-ClusterResource
  5. Stop the cluster.
    1. Run the Stop-Cluster -Cluster S2D-Cluster command
    2. Or use Failover Cluster Manager to stop the cluster.
  6. Disable the cluster service on each node. This prevents the cluster service from starting back up while the node is being patched.
    1. Set Cluster Service to Disabled in services.msc
    2. Or use Get-Service clussvc -ComputerName Server01 | Set-Service -StartupType Disabled
  7. Apply the Windows Server Cumulative Update and any required Servicing Stack Updates to all nodes. (You can update all nodes at the same time, no need to wait since the cluster is down).
  8. Restart the nodes, and ensure everything looks good.
  9. Set the cluster service back to Automatic on each node.
    1. Set the Cluster Service to Automatic in services.msc
    2. Or use Get-Service clussvc -ComputerName 'Server01' | Set-Service -StartupType Automatic
  10. Start the cluster.
    1. Run Start-Cluster -Name S2D-Cluster
  11. Bring the Cluster Pool back online.
    1. Use Failover Cluster Manager to bring the Cluster Pool online under ‘Storage > Pools’
    2. Or use PowerShell to online the pool with Get-ClusterResource -Cluster S2D-Cluster | ?{$_.ResourceType -eq "Storage Pool"} | Start-ClusterResource
  12. Bring the virtual disks back online.
    1. Use Failover Cluster Manager to bring the Cluster Shared Volumes online under ‘Storage > Disks’
    2. Or use PowerShell to online all disks with Get-ClusterSharedVolume -Cluster S2D-Cluster | Start-ClusterResource
  13. Monitor the status of the virtual disks by running the Get-Volume and Get-VirtualDisk cmdlets.
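
As promised above, here is the PowerShell from those steps strung together into a single sketch, assuming a cluster named S2D-Cluster with nodes Server01 and Server02 (substitute your own names):

# --- Stop phase: run once all VMs are shut down ---
Get-ClusterSharedVolume -Cluster S2D-Cluster | Stop-ClusterResource
Get-ClusterResource -Cluster S2D-Cluster | ?{$_.ResourceType -eq "Storage Pool"} | Stop-ClusterResource
Stop-Cluster -Cluster S2D-Cluster
# Disable the cluster service so it can't start up mid-patch
"Server01","Server02" | ForEach-Object { Get-Service clussvc -ComputerName $_ | Set-Service -StartupType Disabled }

# --- Patch and reboot all nodes here ---

# --- Start phase: run once all nodes are back up ---
"Server01","Server02" | ForEach-Object { Get-Service clussvc -ComputerName $_ | Set-Service -StartupType Automatic }
Start-Cluster -Name S2D-Cluster
Get-ClusterResource -Cluster S2D-Cluster | ?{$_.ResourceType -eq "Storage Pool"} | Start-ClusterResource
Get-ClusterSharedVolume -Cluster S2D-Cluster | Start-ClusterResource
# Keep an eye on volume health and any repair jobs
Get-VirtualDisk -CimSession S2D-Cluster
Get-Volume -CimSession S2D-Cluster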

Simplifying the process

Seeing as this is a 13-step process, that’s 13 opportunities for human error. To help remove human error, and because I love automating things with PowerShell, I’ve created some scripts that reduce the number of steps down to just 7.

Now stopping the cluster once your VMs are offline is a single command - Stop-S2DCluster.

Stop-S2DCluster will check that all your volumes are healthy and all VMs are shut down before taking any action. It will then stop all CSVs and the Storage Pool, before stopping the cluster and setting the cluster service to disabled on each host.
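
For example, you can stop the cluster without being prompted for each resource, or, if you know the volumes are unhealthy and need to shut down anyway, skip the health check:

# Skip the per-resource confirmation prompts
Stop-S2DCluster -Name S2D-Cluster -Confirm:$false
# Or bypass the virtual disk health check and shut down anyway
Stop-S2DCluster -Name S2D-Cluster -SkipVirtualDiskCheck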

Starting things up after you’ve patched all the hosts is just as easy with Start-S2DCluster.

Unlike Stop-S2DCluster, Start-S2DCluster needs to be run against a cluster node rather than the cluster, as it starts the cluster service on that node first and then automatically discovers all the other nodes in the cluster. It sets their cluster services back to automatic and starts them, and once the nodes have rejoined the cluster, it brings the storage pool and CSVs back online.

# Arrange a maintenance window for the outage
...

# Stop the VMs
Get-VM -ComputerName (Get-ClusterNode -Cluster S2D-Cluster).Name | Stop-VM

# Shutdown the S2D Cluster
Stop-S2DCluster -Name S2D-Cluster

# Patch Hosts and Reboot
...

# Start the S2D Cluster
Start-S2DCluster -ComputerName S2DHost01

# Start the VMs
Get-VM -ComputerName (Get-ClusterNode -Cluster S2D-Cluster).Name | Start-VM

These PowerShell commands are part of my S2D-Maintenance functions, and the latest version can be downloaded from my GitHub repo.

# Download location
$FileLocation = "C:\Scripts\S2D-Maintenance.ps1"
# Download link
$URL = "https://github.com/comnam90/bcthomas.com-scripts/raw/master/Powershell/Functions/S2D-Maintenance.ps1"

# Download the file
[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
Invoke-WebRequest -Uri $URL -UseBasicParsing -OutFile $FileLocation

# Import Functions for use
. $FileLocation
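
Once the functions are dot-sourced, a quick check confirms they’re available in the session:

# Confirm the functions were imported
Get-Command -Name Stop-S2DCluster, Start-S2DCluster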

Wrapping up

So in Part 1, we’ve gone over the process for performing offline maintenance on a Storage Spaces Direct or AzureStack HCI Cluster, and automated a number of the steps to simplify the process.

Offline maintenance is always advised when your cluster is 6 months or more behind on patching, as it reduces both the risk of hitting known bugs and the size of the window required to get back up to date on patches.

Next time we’ll cover using Cluster Aware Updating to make sure you don’t fall behind on patch levels in the first place.


Original Code

Function Stop-S2DCluster {

    <#
        .Synopsis
        Used to shutdown an S2D Cluster for Maintenance.
        .Description
        This command can be used to completely shut down an S2D Cluster before
        performing offline maintenance. It can be run against multiple clusters
        remotely. The command will confirm shutting down each component by default
        but this can be skipped by using -Confirm:$false
        .Parameter Name
        The target cluster name that you want to shut down.
        .Parameter SkipVirtualDiskCheck
        When the command executes, it will make sure all volumes are online and
        healthy before shutting anything down. This switch can be used to skip
        these checks if you know things are unhealthy and need to shutdown anyway.
    #>
    [cmdletbinding(SupportsShouldProcess, ConfirmImpact = 'High')]
    param(
        [parameter(Mandatory)]
        [alias('Cluster')]
        [string[]]$Name,
        [switch]$SkipVirtualDiskCheck
    )
    begin {
        $results = @()
    }
    process {
        Foreach ($Cluster in $Name) {
            try {
                Write-Verbose "$Cluster - Gathering required information"
                $ClusterResources = Get-ClusterResource -Cluster $Cluster
                $ClusterPool = $ClusterResources | where-object { $_.ResourceType -eq "Storage Pool" }
                $CSVs = Get-ClusterSharedVolume -Cluster $Cluster
                $ClusterNodes = Get-ClusterNode -Cluster $Cluster
                $VirtualDisks = Get-VirtualDisk -CimSession $Cluster
                # Check Virtual Disks are healthy before shutting down
                Write-Verbose "$Cluster - Checking for unhealthy volumes"
                $UnhealthyDisks = $VirtualDisks | Where-Object {
                    $_.HealthStatus -ine "Healthy" -or $_.OperationalStatus -ine "OK"
                }
                if ($UnhealthyDisks.Count -gt 0 -and $SkipVirtualDiskCheck) {
                    Write-Warning "There are $($UnhealthyDisks.Count) unhealthy volumes on $Cluster"
                }
                elseif ($UnhealthyDisks.Count -gt 0) {
                    Throw "$Cluster has $($UnhealthyDisks.Count) unhealthy disks.`nResolve issues with volume health before continuing`nor use -SkipVirtualDiskCheck and try again."
                }
                # Check there are no running VMs
                Write-Verbose "$Cluster - Checking for running VMs"
                $RunningVMs = $ClusterResources | Where-Object {
                    $_.ResourceType -eq "Virtual Machine" -and $_.State -eq "Online"
                }
                if ($RunningVMs.Count -gt 0) {
                    # Possibly use ShouldProcess here instead to offer stopping VMs
                    Throw "$Cluster cannot be shut down because there are still running VMs`nVMs: $( $RunningVMs.Name -join ", " )"
                }
                # Stop CSVs
                Write-Verbose "$Cluster - Starting shutdown proceedures"
                Foreach ($CSV in $CSVs) {
                    if ($PSCmdlet.ShouldProcess(
                            ("Stopping {0} on {1}" -f $CSV.Name, $Cluster),
                            ("Would you like to stop {0} on {1}?" -f $CSV.Name, $Cluster),
                            "Stop Cluster Shared Volume"
                        )
                    ) {
                        try {
                            $CSV | Stop-ClusterResource -Cluster $Cluster -ErrorAction Stop | Out-Null
                        }
                        catch {
                            Throw "Something went wrong when trying to stop $($CSV.Name)`nRerun the command.`n$($PSItem.ToString())"
                        }
                    }
                }
                # Stop Cluster Pool
                if ($PSCmdlet.ShouldProcess(
                        ("Stopping {0} on {1}" -f $ClusterPool.Name, $Cluster),
                        ("Would you like to stop {0} on {1}?" -f $ClusterPool.Name, $Cluster),
                        "Stop Cluster Pool"
                    )
                ) {
                    try {
                        $ClusterPool | Stop-ClusterResource -Cluster $Cluster -ErrorAction Stop | Out-Null
                    }
                    catch {
                        Throw "Something went wrong when trying to stop $($ClusterPool.Name)`nRerun the command.`n$($PSItem.ToString())"
                    }
                }

                # Stop Cluster
                if ($PSCmdlet.ShouldProcess(
                        ("Stopping {0}" -f $Cluster),
                        ("Would you like to stop {0}?" -f $Cluster),
                        "Stop Cluster"
                    )
                ) {
                    try {
                        # Stop Cluster
                        Write-Verbose "$Cluster - Shutting down Cluster"
                        Stop-Cluster -Cluster $Cluster -Force -Confirm:$false -ErrorAction Stop
                        foreach ($Node in $ClusterNodes.Name) {
                            $Service = Get-Service clussvc -ComputerName $Node
                            # Stop Cluster Service on hosts
                            Write-Verbose "$Cluster - $Node - Stopping Cluster Service"
                            $Service | Stop-Service
                            # Set Cluster Service to disabled on hosts
                            Write-Verbose "$Cluster - $Node - Disabling Cluster Service Startup"
                            $Service | Set-Service -StartupType Disabled
                        }
                    }
                    catch {
                        Throw "Something went wrong when trying to stop $($Cluster)`nRerun the command.`n$($PSItem.ToString())"
                    }
                }
            }
            catch {
                Write-Warning "$($PSItem.ToString())"
                $results += [pscustomobject][ordered]@{
                    Name   = $Cluster
                    Result = "Failed"
                }
                continue
            }
            Write-Verbose "$Cluster - Writing results"
            $results += [pscustomobject][ordered]@{
                Name   = $Cluster
                Result = "Succeeded"
            }
        }
    }
    end {
        Write-Verbose "Returning results"
        $results
    }
}
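
# Usage sketch (not part of the original function): -Name accepts an array, so several
# clusters can be stopped in one call. The cluster names below are examples only.
#   Stop-S2DCluster -Name 'S2D-Cluster01', 'S2D-Cluster02' -Confirm:$false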

Function Start-S2DCluster {
    <#
        .Synopsis
        Used to start an S2D Cluster after maintenance.
        .Description
        This command can be used to start up an S2D Cluster after performing
        offline maintenance. It will run locally or remotely but only against
        a single target.
        .Parameter ComputerName
        The name of a host in the cluster you want to start.
    #>
    [cmdletbinding()]
    param(
        [alias('ClusterNode')]
        [string]$ComputerName = $Env:ComputerName
    )
    begin {
    }
    process {
        try {
            # Force Cluster Online on single Node
            Write-Verbose "$ComputerName - Starting Cluster on a single node"
            $NodeSvc = Get-Service clussvc -ComputerName $ComputerName
            if ($NodeSvc.Status -eq "Running") {
                Write-Verbose "$ComputerName - Cluster Service is already running"
            }
            else {
                Invoke-Command -ComputerName $ComputerName -ErrorAction Stop -ScriptBlock {
                    Get-Service clussvc | Set-Service -StartupType Automatic
                    net start clussvc /forcequorum
                }
            }
            # Get cluster information
            Write-Verbose "Gathering cluster information"
            # Sleep for 5sec to make sure cluster is online
            Start-Sleep -Seconds 5
            $Cluster = Get-Cluster $ComputerName -ErrorAction Stop
            $ClusterName = $Cluster.Name
            $ClusterNodes = Get-ClusterNode -Cluster $ClusterName -ErrorAction Stop
            $ClusterPool = Get-ClusterResource -Cluster $ClusterName -ErrorAction Stop | Where-Object { $_.ResourceType -eq "Storage Pool" }
            $CSVs = Get-ClusterSharedVolume -Cluster $ClusterName -ErrorAction Stop

            # Start other Nodes
            Write-Verbose "$ClusterName - Starting Cluster Service on remaining Nodes"
            Foreach ( $Node in ($ClusterNodes | Where-Object { $_.State -eq "Down" }).Name ) {
                $Service = Get-Service clussvc -ComputerName $Node
                # Set to automatic start
                Write-Verbose "$ClusterName - $Node - Setting Cluster Service back to Automatic Startup"
                $Service | Set-Service -StartupType Automatic -ErrorAction Stop
                # Start Service
                Write-Verbose "$ClusterName - $Node - Starting Cluster Service"
                $Service | Start-Service -ErrorAction Stop
            }
            # Sleep for 5sec to make sure cluster nodes are online
            Write-Verbose "$ClusterName - Wait for Cluster Nodes to join"
            do {
                Start-Sleep -Seconds 5
                $DownNodesCount = Get-ClusterNode -Cluster $ClusterName | Where-Object {
                    $_.State -ne 'Up'
                } | Measure-Object | Select-Object -ExpandProperty Count
            }until(
                $DownNodesCount -eq 0
            )
            # Start Pool
            Write-Verbose "$ClusterName - Starting Cluster Pool $($ClusterPool.Name)"
            $ClusterPool | Start-ClusterResource -ErrorAction Stop | Out-Null

            # Start CSVs
            Write-Verbose "$ClusterName - Starting Cluster Shared Volumes"
            $CSVs | Start-ClusterResource -ErrorAction Stop | Out-Null
        }
        catch {
            throw "Something went wrong when trying to start the cluster back up`n$($PSItem.ToString())"
        }
    }
    end {
        "Successfully started cluster on $ComputerName"
    }
}
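
# Usage sketch (not part of the original function): run it remotely against a single node,
# or locally on a cluster node with no parameters (ComputerName defaults to the local host).
#   Start-S2DCluster -ComputerName S2DHost01
#   Start-S2DCluster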