Monday, June 13, 2011

Q-Q plots to examine SQL execution time distributions

Introduction

The purpose of this post is to introduce the use of Q-Q plots for comparing distributions of sampled observational data to standard, reference distributions. For example, we often assume that our sampled observations are normally distributed. A Q-Q plot can quickly show whether this assumption is justified. A Q-Q plot can be used with other distributions beside normal. For example, both log-normal and Poisson distributions are demonstrated below. I also show an estimated density plot, which is similar to a histogram, but is "more robust and easier to read" (R in a Nutshell, Joseph Adler, O'Reilly Media 2009, p.238).

Performance data gathered from a production Oracle database are used to illustrate the technique. Since these data fit none of these standard distributions, the Q-Q plots all look rather strange. To illustrate a Q-Q plot for a good fit, I generate random log-normal data, and compare them to normal and log normal distributions.

A secondary purpose of this post is to show how Q-Q plots can be generated quickly and easily using the R programming language, which is free and open source. The R programming language has become the lingua franca of data science and statistics.

The data and R code I used for this analysis are available for download (links below).

Q-Q Plot Overview

The "Q" in Q-Q plot comes from "quantile." You are already familiar with quantiles if you understand medians, quartiles, and percentiles. Assume you have a large set of X values. The median is the specific value of X where half of the other values are less that the median value, and half are greater. Quartiles are similar: three-fourths of the X values are less than the third quartile. Of course, percentiles are the same: only ten percent of the students score greater that the 90th percentile.

The term "quantile" simply generalizes this concept to an arbitrary number of partitions, rather than just two,four, or one hundred partitions assumed by medians, quartiles, and percentiles.

To generate a Q-Q plot, you first divide the sampled observational data into k quantiles. To compare these to a known, reference distribution, you also divide it into k quantiles. You then work your way through each value of k, plotting each quantile value of the observational data as the x-coordinate, and the corresponding reference distribution quantile as the y-coordinate. 

Bottom line: If the distributions are the same, you will end up with a straight line. Deviations from a straight line suggest differences between the sampled and reference distributions. 

A common practice is to standardize each dataset by subtracting its mean and dividing by its standard deviations (i.e., "z score"). With this transformation, the straight line would have a slope of one and pass through the origin; this standardization can simplify the plot.

Much more thorough discussion, and more complete advice on distributional analysis in R is available at the following links. If you really want to learn this stuff, you should read these sources and not this blog!

as linked from here:
In the R programming language, Q-Q plots are prepared using functions like qqplot() and qqnorm(). In Mathematica, they are prepared using QuantilePlot[]

Data source

Earlier this year, Craig Shallahamer blogged about distributions of elapsed times of SQL query executions in the Oracle Database:
Craig noted the difficulty of comparing distributions by visually examining histograms: "Just because a data set visually looks very much like another data set or statistical distribution, does not mean it statistically matches...be careful." I completely agree, and my goal here is to show how distribution differences are much easier to see with Q-Q plots that they are with histograms.

For the analysis below, I used one of Craig's datasets, which he refers to as "Garr 8q"and it is in file Garret8qtkxy0g5d1p3_1.txt within his AnalysisPack.zip, I have have included a copy of these data for download. Here is Craig's histogram:



Unusual distributions common in Oracle database studies

As a side note, many data distributions observed in Oracle database performance studies are highly skewed or bimodal, obviously not normally distributed. In fact, most of the interesting Oracle database performance problems I encounter are in large part caused by strange distributions. Weird distributions are a fact of life for the database professional. I chose this particular dataset since it was less pathological than the others in Craig's study. 

By the way, I am ignoring here the basic question: why should I care about the distribution of SQL query execution times? See Craig's posts for his discussion on this topic. Bill Huber addresses this general question in detail in his PDF presentation.

Results - actual database performance data

I encourage you to look at the R source code to see how easy it is to do this type of analysis. However, I will not include this code within the text of this post.

Craig concluded that his example SQL query execution elapsed time data did not seem to be coming from normal, log-normal, or Poisson distributions. I agree. Let's first look at the estimated density function,which is similar to a histogram.
The next plot (below) is a normal Q-Q plot. We want to graphically compare the sample quantiles to the expected quantiles. If samples were taken from a normal distribution, the points would line up with a slope of 1 and and intercept of zero. Significant deviations from this line indicate a lack of fit to the normal distribution. We see significant deviations below, so we discard the notion that the samples came from a normal distribution. 
Lets try a log transform of these data to see if they came from a log-normal distribution . We again see significant deviations from a straight line (below), so these data were not sampled from a log-normal distribution.
We now compare the Oracle SQL elapsed time data to a randomly generated Poisson distribution. Once again, we see significant deviations from the line, so we conclude that the observations were not sampled from a Poisson distribution.
Results - randomly generated data

The above Q-Q plots are so completely different from straight line good fits. So let's see a Q-Q plot for a good fit. I generated some random log-normal data to illustrate a good fit, and here is what the distribution looks like.
But first, let's compare these random log-normal data to a normal distribution.  Again, not a good fit, as expected:
Now let's take the log() of the  random log-normal data and compare the transformed data to a normal distribution. This is the near perfect fit we expect.

Conclusions

I have demonstrated the use of Q-Q plots to compare the distribution of sampled observations to known distributions. Deviations between sample data distributions and three reference distributions were obvious. The deviations were easier to see in a Q-Q plot that in histograms. Q-Q plots can be generated quickly and easily using the R programming language, which is free and open source. 

Code and data

1 comment: