Forecasting Azure Stack HCI Cache Wear

So you’ve set up an Azure Stack HCI Cluster and everything’s running great, but there is this nagging feeling in the back of your mind. It’s a hybrid setup, with some type of flash cache sitting in front of spinning disk, and you start to wonder how hard you’re pushing that cache, and how long it will last.

Thankfully with Windows Server 2019, there are many in-built tools and commands to help work out just that!

In this post, we are going to look at:

  • Finding the cache drives in your cluster
  • The new storage history commands
  • Looking back with Cluster Performance History
  • Pulling it all together into a new tool

Where’s my cache?

Storage Spaces Direct, the technology responsible for that super-fast storage in your Azure Stack HCI deployment, does a great job of hiding away all the boring details and steps when you build a cluster. It simplifies the whole process down to two commands: New-Cluster and Enable-ClusterS2D.
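If you haven't seen those before, a minimal build looks something like the sketch below; the cluster name, node names, and IP address are placeholders, and Enable-ClusterS2D is just the short alias for Enable-ClusterStorageSpacesDirect.

# A minimal sketch - names and address are placeholders, not a production config
New-Cluster -Name "Cluster1" -Node "Node1","Node2" -StaticAddress "10.0.0.10"
Enable-ClusterS2D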

But don't worry, identifying your cache drives is still just as simple once things are up and running. They're identifiable from the Usage property of a physical disk, and you can find them with a simple command:

Get-PhysicalDisk -Usage Journal

This will return a nice table of all those cache drives (see below). Now that we have our cache drives, let's look at what they've been up to.

DeviceId FriendlyName         SerialNumber       MediaType CanPool OperationalStatus HealthStatus Usage        Size
-------- ------------         ------------       --------- ------- ----------------- ------------ -----        ----
2001     ATA INTEL SSDSC2BA80 BTHV00000000000OGN SSD       False   OK                Healthy      Journal 745.21 GB
1003     ATA INTEL SSDSC2BA80 BTHV00000000000OGN SSD       False   OK                Healthy      Journal 745.21 GB
1002     ATA INTEL SSDSC2BA80 BTHV00000000000OGN SSD       False   OK                Healthy      Journal 745.21 GB
2003     ATA INTEL SSDSC2BA80 BTHV00000000000OGN SSD       False   OK                Healthy      Journal 745.21 GB
2002     ATA INTEL SSDSC2BA80 BTHV00000000000OGN SSD       False   OK                Healthy      Journal 745.21 GB
1000     ATA INTEL SSDSC2BA80 BTHV00000000000OGN SSD       False   OK                Healthy      Journal 745.21 GB
1001     ATA INTEL SSDSC2BA80 BTHV00000000000OGN SSD       False   OK                Healthy      Journal 745.21 GB
2000     ATA INTEL SSDSC2BA80 BTHV00000000000OGN SSD       False   OK                Healthy      Journal 745.21 GB

Storage History commands

One of the many commands added in Windows Server 2019 to make our lives easier is Get-StorageHistory.

This command retrieves several stored statistics, some from the SMART data on the disks and others maintained by the OS.

Retrieving data about a disk is as easy as piping it through to the command:

PS > Get-PhysicalDisk -Usage Journal | Get-StorageHistory

DeviceNumber FriendlyName         SerialNumber       BusType MediaType  TotalIoCount FailedIoCount AvgIoLatency(us) MaxIoLatency(us) EventCount 256us 1ms 4ms 16ms 64ms 128ms 256ms 2s 6s 10s
------------ ------------         ------------       ------- ---------  ------------ ------------- ---------------- ---------------- ---------- ----- --- --- ---- ---- ----- ----- -- -- ---
2001         ATA INTEL SSDSC2BA80 BTHV00000000000OGN SAS     SSD         645,141,521            84            598.9        513,106.8        246    61   3 110   34   27     1       10
1003         ATA INTEL SSDSC2BA80 BTHV00000000000OGN SAS     SSD       1,317,886,434            73          1,375.2        515,510.1        244    62   1 104   37   16             24
1002         ATA INTEL SSDSC2BA80 BTHV00000000000OGN SAS     SSD       1,326,895,280            76          1,522.3        517,003.1        244    62   2 100   40   18             22
2003         ATA INTEL SSDSC2BA80 BTHV00000000000OGN SAS     SSD         969,169,213           136            710.7        513,710.2        246    61   4  96   45   22     2       16
2002         ATA INTEL SSDSC2BA80 BTHV00000000000OGN SAS     SSD       1,144,926,978           177          1,872.4        514,277.1        246    62   3  95   45   29     1       11
1000         ATA INTEL SSDSC2BA80 BTHV00000000000OGN SAS     SSD       1,171,742,589            71          1,190.9        517,184.0        244    61   3 104   36   20             20
1001         ATA INTEL SSDSC2BA80 BTHV00000000000OGN SAS     SSD       1,112,541,260            65          1,149.3        514,377.9        244    62   2 113   27   19             21
2000         ATA INTEL SSDSC2BA80 BTHV00000000000OGN SAS     SSD       1,079,017,077           157            980.1        513,973.3        246    60   4  92   50   22     1       17

As you can see, the default output of the command focuses heavily on disk latency, but tells us nothing about how much data is being written to our disks, or over what timeframe.

Focusing in on a single disk and using Format-List, we get a much better picture of the details hidden away about our disk.

PS > Get-PhysicalDisk -Usage Journal | Select-Object -First 1 | `
>>   Get-StorageHistory | Format-List *

Version                   : 10
FriendlyName              : ATA INTEL SSDSC2BA80
SerialNumber              : BTHV00000000000OGN
DeviceId                  : {268d880b-33a3-6c8c-bc95-f8361285c068}
DeviceNumber              : 2001
BusType                   : SAS
MediaType                 : SSD
StartTime                 : 2/14/2020 3:18:55 PM
EndTime                   : 2/25/2020 4:03:54 PM
EventCount                : 246
MaxQueueCount             : 36
MaxOutstandingCount       : 32
TotalIoCount              : 645141521
SuccessIoCount            : 645141437
FailedIoCount             : 84
TotalReadBytes            : 10794587081216
TotalWriteBytes           : 8966117642240
TotalIoLatency            : 3864066151996
AvgIoLatency              : 5989
MaxIoLatency              : 5131068
MaxFlushLatency           : 1378
MaxUnmapLatency           : 0
BucketCount               : 12
BucketIoLatency           : {2560, 10000, 40000, 160000...}
BucketSuccessIoCount      : {293181558, 238929981, 109484596, 3536792...}
BucketFailedIoCount       : {84, 0, 0, 0...}
BucketTotalIoCount        : {293181642, 238929981, 109484596, 3536792...}
BucketTotalIoTime         : {346273222942, 1227862537245, 2109283312866, 176980948128...}
BucketIoPercent           : {45, 37, 17, 1...}
BucketHighestLatencyCount : {61, 3, 110, 34...}

Hey, there we go, those look more interesting. Now we have both a timeframe to work with and a bytes-written counter for the disk. From here we can use some simple maths to determine the average amount of data being written every day.

$$\text{Daily Write} = \frac{\text{TotalWriteBytes}}{\text{EndTime} - \text{StartTime}}$$

In PowerShell, this is what it would look like:

# Start by collecting our data
$CacheDisks = Get-PhysicalDisk -Usage Journal
$CacheDisk1 = $CacheDisks | Select-Object -First 1
$StorageHistoryData = $CacheDisk1 | Get-StorageHistory
# Now we need to find the timespan
$Timespan = New-TimeSpan -Start $StorageHistoryData.StartTime `
    -End $StorageHistoryData.EndTime
# Finally we get our daily write, in bytes per day
$StorageHistoryData.TotalWriteBytes / $Timespan.TotalDays
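
Raw bytes per day isn't the friendliest number to read, so it's worth converting the result into gigabytes. A small tweak like this (a sketch using PowerShell's built-in 1GB constant) does the trick:

# Convert bytes per day into GB per day for readability
"{0:N2} GB/day" -f ($StorageHistoryData.TotalWriteBytes / $Timespan.TotalDays / 1GB)

For the sample disk above, the 8,966,117,642,240 bytes written over the roughly 11-day window work out to around 757 GB per day.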

Looking back with Cluster Performance History

Another great feature introduced in Windows Server 2019 is the Cluster Performance History, and I could write a whole post just on this. At a high level, it gathers performance counters for a huge number of components in a Storage Spaces Direct cluster and saves them to a database over time, allowing for easy querying via PowerShell.

This is great in our case, as we can drill into the performance data of our cache drives over time without having to worry about having the right monitoring setup in the first place.

Just as with the Get-StorageHistory command, the Get-ClusterPerf command can be fed physical disks through the pipeline to find their related data.

PS > Get-PhysicalDisk -Usage Journal | Select -First 1 | Get-ClusterPerf

Object Description: PhysicalDisk BTHV00000000000OGN

Series                        Time                 Value Unit
------                        ----                 ----- ----
PhysicalDisk.Cache.Size.Dirty 02/26/2020 18:17:56  24.45 GB
PhysicalDisk.Cache.Size.Total 02/26/2020 18:17:56 709.01 GB
PhysicalDisk.IOPS.Read        02/26/2020 18:18:00      4 /s
PhysicalDisk.IOPS.Total       02/26/2020 18:18:00    116 /s
PhysicalDisk.IOPS.Write       02/26/2020 18:18:00    112 /s
PhysicalDisk.Latency.Average  02/26/2020 18:18:00  99.88 us
PhysicalDisk.Latency.Read     02/26/2020 18:18:00   1.06 ms
PhysicalDisk.Latency.Write    02/26/2020 18:18:00  63.13 us
PhysicalDisk.Throughput.Read  02/26/2020 18:18:00 599.18 KB/S
PhysicalDisk.Throughput.Total 02/26/2020 18:18:00   1.19 MB/S
PhysicalDisk.Throughput.Write 02/26/2020 18:18:00 615.18 KB/S

The obvious performance counter here is PhysicalDisk.Throughput.Write. While this tells us the write throughput of our cache drives, the more interesting stat here is PhysicalDisk.Cache.Size.Dirty. This counter shows how much data is currently in the write-cache portion of the disk; over time it will shrink if no new writes come in and the data is flushed through to the capacity disks behind the cache.

By default, the Get-ClusterPerf command will only return the most recent data point, giving a limited snapshot of what is going on. Using the -TimeFrame parameter, we can access data for the last hour, day, week, month, or even year!

Using a longer period, we can feed the data into Measure-Object to find the average over time, as shown below.
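For example, averaging the dirty cache size over the last day could look something like this sketch. The series name comes straight from the output above; using -PhysicalDiskSeriesName to filter server-side is my assumption about the tidiest approach, and you could equally filter the returned rows yourself.

PS > Get-PhysicalDisk -Usage Journal | Select-Object -First 1 | `
>>   Get-ClusterPerf -PhysicalDiskSeriesName "PhysicalDisk.Cache.Size.Dirty" -TimeFrame LastDay | `
>>   Measure-Object -Property Value -Average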

Pulling it all together into a new tool

While accessing all this data has been pretty easy so far, if you want to start looking at it across multiple drives and multiple servers in a cluster, that's currently a lot of manual work.

And so I wrote Get-S2DCacheChurn.ps1, a script that allows you to query a cluster and return this data from all cache disks in all cluster nodes.

Using the commands we've already looked at, we can combine the size of the cache drives with the average daily write we calculated to estimate the Drive Writes Per Day (DWPD) stat.
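
The estimate itself is just the average daily write divided by the size of the drive:

$$\text{EstDWPD} = \frac{\text{AvgDailyWrite}}{\text{DriveSize}}$$

Taking the first drive in the output below as an example, 1.18 TB written per day against a 745.21 GB drive gives roughly 1.6 drive writes per day.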

So putting it all together, the output looks a little like this:

Cluster  ComputerName Disk   Size      EstDwpd AvgDailyWrite AvgWriteThroughput AvgCacheUsage
-------  ------------ ----   ----      ------- ------------- ------------------ -------------
Cluster1 Node1        Slot 0 745.21 GB 1.6x    1.18 TB       19.71 MB/s         3.35 GB
Cluster1 Node1        Slot 1 745.21 GB 1.0x    756.15 GB     12.30 MB/s         21.51 GB
Cluster1 Node1        Slot 2 745.21 GB 1.8x    1.28 TB       21.25 MB/s         4.45 GB
Cluster1 Node1        Slot 3 745.21 GB 1.4x    1.02 TB       16.92 MB/s         2.44 GB
Cluster1 Node2        Slot 0 745.21 GB 1.3x    1,000.90 GB   16.17 MB/s         2.23 GB
Cluster1 Node2        Slot 1 745.21 GB 1.3x    932.73 GB     15.08 MB/s         2.05 GB
Cluster1 Node2        Slot 2 745.21 GB 1.5x    1.11 TB       18.45 MB/s         2.86 GB
Cluster1 Node2        Slot 3 745.21 GB 1.5x    1.09 TB       18.07 MB/s         2.49 GB

Now we can compare these stats to the spec sheets provided by the drive manufacturers to see if everything is healthy, or if the drives are going to burn through their expected lifetime of writes before you're ready to decommission your cluster.

This might seem like something you don't need to worry about, because you've got a warranty after all, but if all of your cache drives have been running for the same amount of time with similar write usage, it won't go well for your cluster if they all fail around the same time.

As always, the script is up in my GitHub repo, comnam90/bcthomas.com-scripts.

Or if you want to download it and try it out, simply run the commands below:

$URL = "https://raw.githubusercontent.com/comnam90/bcthomas.com-scripts/master/Powershell/Scripts/Get-S2DCacheChurn.ps1"
Invoke-WebRequest -Uri $URL -UseBasicParsing -OutFile Get-S2DCacheChurn.ps1
.\Get-S2DCacheChurn.ps1

The script has the following parameters:

  • Cluster
    • This can be a single Azure Stack HCI cluster, or multiple clusters.
  • LastDay
    • Returns data for only the last 24 hours.
  • Anonymize
    • Removes identifiable information from the results, so that they can be shared.
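
So, for example, to query a single cluster for the last 24 hours and scrub the output before sharing it, you could run something like this (the cluster name is a placeholder):

.\Get-S2DCacheChurn.ps1 -Cluster "Cluster1" -LastDay -Anonymize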

So what’s next?

Going back to that shiny new Azure Stack HCI deployment you put in, and how well it's running, remember the job isn't done. Check in on it, and use the tools available to monitor how it's going over time.

Have a look at using tools like Azure Monitor, Grafana, InfluxDB, and other modern tools to extract this data not just ad hoc, but continuously, allowing you to monitor any degradation over time and also alert on major issues.

And come on over to the Azure Stack HCI Slack Community, chat to others running clusters like yours, and hear about what works well for them and the issues they've encountered.

And as always, let me know if you have any further questions, on here, Twitter, or Slack.

Additional reading