Wednesday, May 19, 2010

Variance to Mean Ratio for finding brief 'cursor: pin S' spikes

Tanel Poder has a recent blog posting that describes technical details about 'cursor: pin S' waits. He mentions "if your problem is measured in few seconds, then your performance data should also be sampled at least every few seconds in order to be fully systematic." In Tanel's example, his "20 minute AWR reports still showed IO wait times as main DB time consumers, but that was averaged over 20 minutes. But [Tanel's] problem happened severely and shortly within few seconds in that 20 minutes, so the averaging and aggregation over long period of time did hide the extreme performance issue that happened in a very short time." I show below a method for quickly finding and characterizing such short but severe problems.

Averaged metrics - AAS 
Most database monitoring systems rely upon averaging some metric of a period of time, or its equivalent of calculating the difference in some metric over a time interval. If the duration of the problem is significantly less that the monitoring time interval, the problem might not be detected.

The Average Active Session metric (AAS) is a very useful general-purpose performance metric. It is an estimate of the number of sessions that are on the CPU, or are in active wait states, as averaged over a specific time interval. In general, if the AAS greatly exceeds the number of database host CPUs, users will experience response time delays. I have previously detailed AAS definition and derivation.

Since AAS is based on overall user response time, it can detect a wide variety of problems, regardless of wait event. However, as an average, it has very limited ability to detect response time spikes that are much shorter than its averaging period.

Variance and Variance to Mean Ratio
As often explained by Robyn Sands, the statistical concept of variance can help to detect a metric's brief deviations from typical values, especially when divided by the average to form the Variance to Mean Ratio (VMR). The example below illustrates this technique with a previously described SQL script for hourly AAS. Previous experience with this particular database showed that user response time would suffer at an AAS above twenty. Looking only at hourly AAS in the example below, it would seem that no response time problems were present. However, by including VMR, a problem around 14:00 is quite evident.

SAMPLE_HOUR         AAS      VMR-AAS
----------------- ----- ------------
2008-04-16 07:00    1.8          0.6
2008-04-16 08:00    2.3          1.0
2008-04-16 09:00    2.8          1.3
2008-04-16 10:00    3.2          2.3
2008-04-16 11:00    4.1          2.3
2008-04-16 12:00    3.0          1.1
2008-04-16 13:00    2.9          1.0
2008-04-16 14:00    6.4         95.4   <== spike in variance
2008-04-16 15:00    3.8          1.9
2008-04-16 16:00    3.6          1.6
2008-04-16 17:00    2.7          0.9
2008-04-16 18:00    2.7          2.6

The problematic hour of 14:00 can be examined in more detail by averaging over one minute rather than one hour, using a previously described SQL script. The very brief spike at around 14:08 is very clearly visible. User response time suffered greatly during this brief spike.

SAMPLE_MINUTE       AAS  VMR-AAS
---------------- ------ --------
2008-04-16 13:54    4.0        0
2008-04-16 13:55    3.5      0.7
2008-04-16 13:56    4.5      3.4
2008-04-16 13:57    4.7      0.7
2008-04-16 13:58    7.3      1.6
2008-04-16 13:59    3.8      0.2
2008-04-16 14:00    8.8      1.8
2008-04-16 14:01   12.8      0.5
2008-04-16 14:02    4.2      0.5
2008-04-16 14:03    3.8      1.5
2008-04-16 14:04    4.5      1.3
2008-04-16 14:05    8.8      2.1
2008-04-16 14:06    8.0      1.1
2008-04-16 14:07    7.8      3.7
2008-04-16 14:08  130.3      170  <== spike in AAS
2008-04-16 14:09    7.3     10.3
2008-04-16 14:10    3.0      0.8
2008-04-16 14:11    4.3      1.1
2008-04-16 14:12    6.5      1.6
2008-04-16 14:13    3.3      0.7
2008-04-16 14:14    2.5      0.6
2008-04-16 14:15    5.3      2.2
2008-04-16 14:16    4.2      0.4
2008-04-16 14:17    5.2      1.9

Find brief spikes over a long period
 The following chart shows how the above SQL script for hourly AAS can be used to examine a full month's data to quickly find very brief ASS spikes. None of these spikes were apparent in the hourly AAS data itself, but the variance to mean ratio shows these spikes very dramatically. By the way, each of these spikes turned out to be a 'cursor: pin S' wait pile-up, lasting only a minute or two each, and including up to several hundred sessions.


One-second resolution 
To examine brief ASS spikes like the above in more detail, I would query one-second resolution Active Session History (ASH) data, mapping over each EVENT name to custom categories. In the example below, EVENT 'cursor: pin S' was mapped over to category "MUTEX1", while 'cursor: pin S wait on X' was mapped to "MUTEX2". The count of sessions in each category was then counted within each one second ASH observation sample. These time-series data provided clues about the evolution of this event over its brief one-minute duration. These clues would not have been visible in a simple aggregation over this interval; a high-resolution time-series view was essential for troubleshooting.


Conclusion
Using the Variance-to-Mean Ratio (VMR) for the Average Active Session (AAS) metric allows quick and reliable identification of brief spikes, lasting one minute or less. Such spikes are invisible using only averages.

Tuesday, May 11, 2010

Importing ISO 8061 timestamps using sqlldr

Many web service APIs use the ISO 8061 standard for test representations of dates and times. For example, the Salesforce Force.com API uses this standard for its dateTime datatype. Even though Force.com stores dateTime values at only one-second resolution, it represents these values at millisecond resolution. It also uses the UTC timezone, abbreviated as "Z". Here is an example dateTime; notice the "T" that separates the date from the time, the three zero digits to the right of the decimal point, and the "Z" for UTC (aka "Zulu time"):

2010-05-11T18:57:15.000Z

It's fairly easy to create such a representation using Oracle's to_char() function, but you need to use double-quote characters:

SQL> select to_char(sysdate,'YYYY-MM-DD"T"HH24:MI:SS".000Z"') from dual;

TO_CHAR(SYSDATE,'YYYY-MM-DDT
----------------------------
2010-05-11T18:57:15.000Z

However, it's a bit trickier to import a text file using Oracle's sqlldr using a "date mask". These sqlldr masks must be enclosed in double-quote characters. But this obviously confuses sqlldr, since it not expecting double-quotes within the mask. Using single quotes, either within the mask or to delimit the mask, also fails.

It turns out that you can use backslashes to escape the double-quotes within a sqlldr date mask. Here is an example sqlldr control file that demonstrates this mask, used to import a Salesforce-generated CSV file:

OPTIONS(DIRECT=true)
LOAD DATA
 INFILE  'salesforce_export.csv'
 APPEND
INTO TABLE salesforce_export
   FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"' TRAILING NULLCOLS
(
       ID,
       EMAIL,
       CREATEDDATE DATE "YYYY-MM-DD\"T\"HH24:MI:SS\".000Z\"",
       CREATEDBYID,
       YADA__C,
       YADA_YADA__C,
       YADA_YADA_YADA__C
)