Forecasting Azure Stack HCI Cache Wear

So you’ve set up an Azure Stack HCI Cluster and everything’s running great, but there is this nagging feeling in the back of your mind. It’s a hybrid setup, with some type of flash cache sitting in front of spinning disk, and you start to wonder how hard you’re pushing that cache, and how long it will last.

Thankfully, Windows Server 2019 has plenty of built-in tools and commands to help you work out just that!

In this post, we are going to look at:

  • Identifying your cache disks
  • Querying cache physical disk storage history
  • Querying cache statistics from the Cluster Performance History engine
  • Combining those tools into a script to check your whole environment

Where’s my cache?

Storage Spaces Direct, the technology responsible for that super-fast storage in your Azure Stack HCI deployment, does a great job of hiding away all the boring details and steps when you build a cluster. It simplifies the whole process down to two commands, New-Cluster and Enable-ClusterS2D.
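
For reference, a minimal sketch of those two steps (the cluster name, node names, and address below are placeholders, and Enable-ClusterS2D is the short alias for Enable-ClusterStorageSpacesDirect):

# Placeholder names and address - substitute your own
# -NoStorage keeps eligible disks out of the cluster until S2D claims them
New-Cluster -Name "HCICluster" -Node "Node01", "Node02" -StaticAddress "10.0.0.10" -NoStorage
Enable-ClusterS2D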

But don’t worry, identifying your cache drives is still just as simple once things are up and running. They’re identifiable from the Usage property of a physical disk, and you can find them with a simple command:

Get-PhysicalDisk -Usage Journal

This will return a nice table of all those cache drives. So now we have our cache drives, let’s look at what they’ve been up to.

Storage History commands

One of the many commands added in Windows Server 2019 to make our lives easier is Get-StorageHistory.

This command retrieves several stored statistics, some from the SMART data on the disks and others maintained by the OS.

Retrieving data about a disk is as easy as passing it through to the command!
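
A minimal sketch of that pipeline, run from one of the cluster nodes:

# Feed every cache (journal) drive into Get-StorageHistory
Get-PhysicalDisk -Usage Journal | Get-StorageHistory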

As you can see, the default output of the command is heavily focused on the latency of the disk, but tells us nothing about how many writes are going to our disks, or what timeframe that covers.

Focusing in on a single disk and using Format-List, we get more of a picture of the details hidden away for that disk.
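
Something like this, for example (the exact property names can vary between builds):

# Pick one cache drive and list every property Get-StorageHistory returns for it
Get-PhysicalDisk -Usage Journal |
    Select-Object -First 1 |
    Get-StorageHistory |
    Format-List *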

Hey there we go, those look more interesting. Now we have both a timeframe to work with and a bytes written counter for the disk. From here we can use some simple maths to determine the average amount of data being written every day.

\text{Daily Write} = \frac{\text{TotalWriteBytes}}{(\text{EndTime} - \text{StartTime})\ \text{in days}}

In PowerShell, that calculation would look something like this (a sketch for a single drive, using the StartTime, EndTime and TotalWriteBytes values from Get-StorageHistory):
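
# A sketch for a single cache drive; StartTime, EndTime and TotalWriteBytes
# are the Get-StorageHistory values used in the formula above
$disk    = Get-PhysicalDisk -Usage Journal | Select-Object -First 1
$history = $disk | Get-StorageHistory
$days    = ($history.EndTime - $history.StartTime).TotalDays
$dailyWriteGB = [math]::Round(($history.TotalWriteBytes / 1GB) / $days, 2)
"{0} averages {1} GB of writes per day" -f $disk.FriendlyName, $dailyWriteGB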

Looking back with Cluster Performance History

Another great feature introduced in Windows Server 2019 is the Cluster Performance History, and I could write a whole post just on this. At a high level, it gathers performance counters for a huge number of components in a Storage Spaces Direct cluster and saves them to a database over time, allowing for easy querying via PowerShell.

This is great in our case, as we can drill into the performance data of our cache drives over time without having to worry about whether the right monitoring was set up in the first place.

Just as with the Get-StorageHistory command, the Get-ClusterPerf command can be fed physical disks through the pipeline to find their related data.
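
For example:

# Pipe the cache drives straight into the performance history engine
Get-PhysicalDisk -Usage Journal | Get-ClusterPerf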

The obvious performance counter here is PhysicalDisk.Throughput.Write. While this tells us the write throughput of our cache drives, the more interesting stat here is PhysicalDisk.Cache.Size.Dirty. This counter shows how much data is currently sitting in the write cache portion of the disk; over time it will shrink if no new writes come in and the data is flushed through to the capacity disks behind the cache.

By default, the Get-ClusterPerf command will only return the most recent data point, giving a limited snapshot of what is going on. Using the -Timeframe parameter we can access data for the last hour, day, week, month or even year!

Using a longer period, we can feed the data into Measure-Object to find the average over time.
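
For instance, a sketch that averages the dirty cache size over the last week for one cache drive:

# Average PhysicalDisk.Cache.Size.Dirty over the last week for a single cache drive
Get-PhysicalDisk -Usage Journal |
    Select-Object -First 1 |
    Get-ClusterPerf -PhysicalDiskSeriesName "PhysicalDisk.Cache.Size.Dirty" -TimeFrame LastWeek |
    Measure-Object -Property Value -Average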

Pulling it all together into a new tool

While accessing all this data has been pretty easy so far, looking at it across multiple drives, and multiple servers in a cluster, quickly turns into a lot of manual work.

And so I wrote Get-S2DCacheChurn.ps1, a script that allows you to query a cluster and return this data from all cache disks in all cluster nodes.

Using the commands we’ve already looked at, we can use the size of the cache drives, and the average daily write we calculated, to estimate the Drive Writes per Day (DWPD) stat.
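
Roughly, the estimate looks like this (a sketch reusing $disk and $history from the earlier Get-StorageHistory example, not necessarily the exact logic in the script):

# DWPD estimate: average daily write bytes divided by the drive's capacity
$days       = ($history.EndTime - $history.StartTime).TotalDays
$dailyWrite = $history.TotalWriteBytes / $days
$dwpd       = [math]::Round($dailyWrite / $disk.Size, 3)
"{0} estimated DWPD: {1}" -f $disk.FriendlyName, $dwpd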

So, putting it all together, the script outputs these stats for every cache drive in every node of the cluster.

Now we can compare these stats to the spec sheets provided by the drive manufacturers to see if everything is healthy, or if the drives are going to burn through their expected lifetime of writes before you’re ready to decommission your cluster.

This might seem like something you don’t need to worry about, because you’ve got a warranty after all, but if all of your cache drives have been running for the same amount of time, with similar write usage, it won’t go well for your cluster if they all fail around the same time.

As always, the script is up in my GitHub repo, and you can find it here.

Or if you want to try it out, download it from the repo and run it against your cluster.

The script has the following parameters:

  • Cluster
    • This can be a single Azure Stack HCI Cluster or multiple Clusters
  • LastDay
    • Returns data for only the last 24 hours
  • Anonymize
    • Removes identifiable information from the results, so that they can be shared.
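
For example, a run against a single cluster (the cluster name here is just a placeholder) might look like:

# Query one cluster, limited to the last 24 hours of performance history
.\Get-S2DCacheChurn.ps1 -Cluster "HCI-Cluster01" -LastDay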

So what’s next?

Going back to that shiny new Azure Stack HCI deployment you put in, and how well it’s running, remember the job isn’t done. Check in on it, and use the tools available to monitor how it’s going over time.

Have a look at using tools like Azure Monitor, Grafana, InfluxDB, and other modern tools to extract this data not just ad hoc, but continuously, allowing you to monitor any degradation over time and also alert on major issues.

And come on over to the Azure Stack HCI Slack Community, chat to others running clusters like yours, and hear about what works well for them and the issues they’ve encountered.

And as always, let me know if you have any further questions, on here, Twitter, or Slack.
