Wednesday, May 19, 2010

Variance to Mean Ratio for finding brief 'cursor: pin S' spikes

Tanel Poder has a recent blog posting that describes technical details about 'cursor: pin S' waits. He mentions "if your problem is measured in few seconds, then your performance data should also be sampled at least every few seconds in order to be fully systematic." In Tanel's example, his "20 minute AWR reports still showed IO wait times as main DB time consumers, but that was averaged over 20 minutes. But [Tanel's] problem happened severely and shortly within few seconds in that 20 minutes, so the averaging and aggregation over long period of time did hide the extreme performance issue that happened in a very short time." I show below a method for quickly finding and characterizing such short but severe problems.

Averaged metrics - AAS 
Most database monitoring systems rely upon averaging some metric of a period of time, or its equivalent of calculating the difference in some metric over a time interval. If the duration of the problem is significantly less that the monitoring time interval, the problem might not be detected.

The Average Active Session metric (AAS) is a very useful general-purpose performance metric. It is an estimate of the number of sessions that are on the CPU, or are in active wait states, as averaged over a specific time interval. In general, if the AAS greatly exceeds the number of database host CPUs, users will experience response time delays. I have previously detailed AAS definition and derivation.

Since AAS is based on overall user response time, it can detect a wide variety of problems, regardless of wait event. However, as an average, it has very limited ability to detect response time spikes that are much shorter than its averaging period.

Variance and Variance to Mean Ratio
As often explained by Robyn Sands, the statistical concept of variance can help to detect a metric's brief deviations from typical values, especially when divided by the average to form the Variance to Mean Ratio (VMR). The example below illustrates this technique with a previously described SQL script for hourly AAS. Previous experience with this particular database showed that user response time would suffer at an AAS above twenty. Looking only at hourly AAS in the example below, it would seem that no response time problems were present. However, by including VMR, a problem around 14:00 is quite evident.

----------------- ----- ------------
2008-04-16 07:00    1.8          0.6
2008-04-16 08:00    2.3          1.0
2008-04-16 09:00    2.8          1.3
2008-04-16 10:00    3.2          2.3
2008-04-16 11:00    4.1          2.3
2008-04-16 12:00    3.0          1.1
2008-04-16 13:00    2.9          1.0
2008-04-16 14:00    6.4         95.4   <== spike in variance
2008-04-16 15:00    3.8          1.9
2008-04-16 16:00    3.6          1.6
2008-04-16 17:00    2.7          0.9
2008-04-16 18:00    2.7          2.6

The problematic hour of 14:00 can be examined in more detail by averaging over one minute rather than one hour, using a previously described SQL script. The very brief spike at around 14:08 is very clearly visible. User response time suffered greatly during this brief spike.

---------------- ------ --------
2008-04-16 13:54    4.0        0
2008-04-16 13:55    3.5      0.7
2008-04-16 13:56    4.5      3.4
2008-04-16 13:57    4.7      0.7
2008-04-16 13:58    7.3      1.6
2008-04-16 13:59    3.8      0.2
2008-04-16 14:00    8.8      1.8
2008-04-16 14:01   12.8      0.5
2008-04-16 14:02    4.2      0.5
2008-04-16 14:03    3.8      1.5
2008-04-16 14:04    4.5      1.3
2008-04-16 14:05    8.8      2.1
2008-04-16 14:06    8.0      1.1
2008-04-16 14:07    7.8      3.7
2008-04-16 14:08  130.3      170  <== spike in AAS
2008-04-16 14:09    7.3     10.3
2008-04-16 14:10    3.0      0.8
2008-04-16 14:11    4.3      1.1
2008-04-16 14:12    6.5      1.6
2008-04-16 14:13    3.3      0.7
2008-04-16 14:14    2.5      0.6
2008-04-16 14:15    5.3      2.2
2008-04-16 14:16    4.2      0.4
2008-04-16 14:17    5.2      1.9

Find brief spikes over a long period
 The following chart shows how the above SQL script for hourly AAS can be used to examine a full month's data to quickly find very brief ASS spikes. None of these spikes were apparent in the hourly AAS data itself, but the variance to mean ratio shows these spikes very dramatically. By the way, each of these spikes turned out to be a 'cursor: pin S' wait pile-up, lasting only a minute or two each, and including up to several hundred sessions.

One-second resolution 
To examine brief ASS spikes like the above in more detail, I would query one-second resolution Active Session History (ASH) data, mapping over each EVENT name to custom categories. In the example below, EVENT 'cursor: pin S' was mapped over to category "MUTEX1", while 'cursor: pin S wait on X' was mapped to "MUTEX2". The count of sessions in each category was then counted within each one second ASH observation sample. These time-series data provided clues about the evolution of this event over its brief one-minute duration. These clues would not have been visible in a simple aggregation over this interval; a high-resolution time-series view was essential for troubleshooting.

Using the Variance-to-Mean Ratio (VMR) for the Average Active Session (AAS) metric allows quick and reliable identification of brief spikes, lasting one minute or less. Such spikes are invisible using only averages.

No comments:

Post a Comment