If you’re running a Storage Spaces Direct (S2D) Cluster, you might have noticed some instability in recent months, specifically when it comes to patching and performing maintenance. Well you’re in luck because 5 days ago, Microsoft released a new KB article that helps explain why you might have seen issues.
The scenario targeted by the Microsoft article is S2D Clusters running May (KB4103723) or later patch levels, where you experience Event ID 5120 during patching or maintenance, leading to things like CSV timeouts, VM pauses, or even VM crashes.
UPDATE(2017-09-19): Microsoft have officially recognized the bug and have a KB describing the symptoms and workaround much like the below. See here: https://support.microsoft.com/en-us/help/4043361/disks-in-maintenance-mode-status-after-september-cumulative-update-kb
I was patching our dev cluster the other day and came across a new issue when applying the latest September Cumulative Update (KB4038782), and it seems others on the internet have hit this issue as well.
Background First, a bit of background on the expected behaviour when performing maintenance:
I’ve been deploying a few Storage Spaces Direct (S2D) clusters lately, and I noticed a slight mis-configuration that can occur on deployment.
Normally when deploying S2D, the disk types in the nodes are detected and the fastest disk (usually NVMe or SSD) is assigned to the cache, while the next fastest is used for the Performance Tier and the slowest being used in the Capacity Tier. So if you have NVMe, SSD and HDD, you would end up with an NVMe Cache, a SSD Performance Tier and a HDD Capacity Tier.