Saturday, March 5, 2011

Response time analysis based on AWR data

Summary

The basic idea here was to do a database CPU response time analysis using a whole month's worth of hourly observations of a busy production system. Craig Shallahamer provides an online tool that does a curve-fit of a single observation to the basic response time formula (http://filezone.orapub.com/cgi-bin/msolve.cgi). However, it seems like a better model might be obtained if one could do the curve-fitting with hundreds of observations. For fun, I did the analysis twice: once in using R (http://www.r-project.org/), and once using Mathematica (http://www.wolfram.com/); source code and example data for each program are available for download below. I developed an excellent response time model, and had a few other insights:
  • Observational periods where the CPU utilization is much higher than predicted by a linear regression of utilization versus workload (i.e., outliers) are very interesting from a trouble shooting perspective. This suggests the possibility of monitoring for such outliers.
  • Observational periods where the response time is much higher than predicted based on comparable workload are very interesting from a trouble shooting perspective. Again, this suggests the possibility of monitoring for such outliers.

Data source, metrics, and model

I used hourly snapshots of V$SYSSTAT data: foreground CPU utilization, DB time, buffer gets and user calls. I gathered these data for daytime, weekdays only, one a busy OLTP-like production system. I computed deltas for each metric, so these deltas represent hourly averages. I could have written my own tool to gather these data, but found it more convenient to simply use AWR's snapshots. My query is available for download (below); it uses subquery factoring for easy review and maintenance, the analytic function LEAD() to compute deltas without a self-join, and the max-decode denormalization trick.

I modeled this 4-CPU Oracle database as follows:
  • workload = buffer gets per millisecond
  • service time = milliseconds of foreground CPU per buffer get
  • response time = millisecond of DB time per buffer get
Utilization as function of workload

We expect CPU utilization to be a linear function of workload (Gunther 1998/2000 p. 51, Shallahamer 2007 p. 26, Shallahamer 2010 p. 51). As shown below most of the observations fit very well to a linear regression. Yet we do have some apparent outliers, above and to the right.


Outliers - linear model - practical diagnostic value

Although the above scatter plot visually suggests the presence of outliers, it would be nice to have an objective, quantitative approach to identifying them. A common solution is to use a histogram of the residuals (Miles 2009 p. 88, http://www.jeremymiles.co.uk/regressionbook/extras/appendix2/R/).

The outliers are over on the right side of the histogram, with residuals above about 8 or so. The remaining residuals have a nice, normal-like distribution, as expected.


One of the beautiful and convenient features of the R language is the ability to easily combine regression residuals with the raw data, which allows them to be easily managed and reviewed. These steps were much more cumbersome in Mathematica or Excel. My R source code, available for download below, demonstrates this technique in detail. Here is how this was done with a single R command:

db9.c.cleaned <- subset( cbind(db9.b.milliseconds, residuals(db9.b.milliseconds.lm)),
                         residuals(db9.b.milliseconds.lm) < 8.0 )


I want to emphasize that I did not simply remove these outliers so that the rest of the analysis would have prettier, better fitting lines. Instead, these outliers were important from a diagnostic and database management perspective. A review of database activity during the hours represented by the outliers revealed that bulk operations intended to run at night were mistakenly running during business hours. These bulk operations required more CPU time per buffer get than typical for business hour activity. The reason for this difference was not investigated in detail, yet this regression analysis brought it to our attention so we can now manage it. This insight had escaped identification via routine database monitoring. This discovery opens up new ways of monitoring database health!

Not shown here, but available for download, are the regression model, plots, and residual histograms after removal of outliers. As you can imagine, the regression was quite tight: R-squared was 0.945, and the residual histogram was very normal looking.

Response and service time scatter plot

I show below a scatter plot of response time (i.e., circles, DB time) and service time (i.e., crosses, foreground CPU time) as a function of workload (i.e., buffer gets per millisecond), after removal of the above outliers. 


As expected, the service time was nearly constant with workload. This constancy of service time is related to the linear relationship between CPU utilization and workload. If service time per buffer get was not constant, then we would not expect utilization to be linear.

This plot also shows apparent response time outliers, hours with much more wait time than the workload would suggest. As with the utilization outliers above, I objectively identified them by reviewing a histogram of residuals from a regression (although I cheated by using a linear rather than non-linear regression, details in downloadable source code). And as with the utilization outliers above, these had significant diagnostic value. We had occasional trouble with excessive non-idle waits that stand out much more dramatically and obviously than with conventional database monitoring. Again, this discovery opens up yet another potential new way of monitoring database health!


Response time model - nonlinear regression

This database system can be modeled as a single queue (i.e., the CPU run queue) with multiple "servers" (i.e., CPUs).  An approximate formula for this model is shown below (Gunther 1998/2000 p. 62, equation 2-36, Shallahamer 2007 p. 26, Shallahamer 2010 p. 51).


I was able to simplify this model further, by assuming a constant value of service time, using its average value of 0.009295 milliseconds per buffer get. My goal was to solve for the number of effective "servers" on this 4-CPU database host. I gave this formula to R's nonlinear regression function as shown below (download source code to see in it context).

db9.f.cleaned.nls <- nls(DB_TIME_MSPMS/BUFFER_GETS_PMS~0.009295/(1-(((0.009295*BUFFER_GETS_PMS)/num.servers)^num.servers)),
   data = db9.f.cleaned,
   start=list(num.servers=2.5),
   trace=TRUE)


Results and conclusion

A plot of the fitted model is shown below; a pretty good fit was obtained. As you can see, this system was often running pretty close to the "knee in the curve." Little excess CPU capacity was available, and a CPU upgrade was in order (SQL tuning and workload reduction techniques had already been exhausted). 


Regarding CPU upgrades, a question naturally arises: is it better to add more CPUs, or to increase the speed of the CPUs? The following Mathematica plot compares these two alternatives, assuming that we can either double the number of (effective) CPUs or double their speed (i.e., halve their service time).


Now that the CPUs have been upgraded on this host, a future blog post will compare these predictions to post-upgrade observations. Future posts will also describe the results of practical database monitoring based on finding regression model outliers. 

Downloads



References

Gunther, N. J. (1998, 2000). The Practical Performance Expert. Lincoln, NE: Author's Chioce Press (originally published by McGraw Hill).
Miles, J and Shevlin S. (2001, ... 2009). Applying Regression and Correlation. London, Thousand Oaks CA: SAGE publications. See also http://www.jeremymiles.co.uk/regressionbook/extras/appendix2/R/
Shallahamer, C. (2007). Forecasting Oracle Performance. Berkeley, CA: Apress (distributed by Springer-Verlag).
Shallahamer, C. (2010). Orapub's 2020 Performance Seminar, Northern California Oracle Users' Group (NoCOUG) Training Day, August 18 2010.
Institute of Arctic Biology at the University of Alaska Fairbanks. R tutorial, especially non-linear regression: http://mercury.bio.uaf.edu/mercury/R/R.html#nonlinear