Thursday, April 21, 2011

Referential integrity and optimizer shortcuts

I overheard again just the other day that database performance can be improved by eliminating foreign keys and doing integrity constraint checking in the application. This fallacy is based on the notion that there must be some database overhead for enforcing foreign key constraints, so the database will always be faster if it can avoid that overhead. In fact, Oracle's foreign key implementation is very efficient: in real-world cases the overhead of foreign keys is negligible, and they can sometimes even yield slightly faster DML (see, for example, Tom Kyte).

Moreover, referential integrity can provide huge performance gains in situations where Oracle's optimizer can use it to devise better execution plans. I stumbled upon an example of this recently in a production system and thought I'd share it.

The query below involves five tables related by foreign keys. The joins follow these foreign keys.

select count(*)
 from goats g,
      properties p,
      doelings d,
      goat_versions v,
      farm_animals f
where g.type_id = :1
  and g.property_id = p.property_id
  and g.goat_id = d.doeling_id
  and d.status = 'alive'
  and g.goat_id = v.alive_id
  and f.animal_id = g.goat_id
;

The execution plan with predicates is shown below. In this farm animal database, index names start with their table name prefix. Please do not spend too much time studying this plan; I'll walk you through the important parts below.

-----------------------------------------------------------------------
| ID  | OPERATION                       | NAME                        |
-----------------------------------------------------------------------
|   1 |  SORT AGGREGATE                 |                             |
|   2 |   NESTED LOOPS                  |                             |
|   3 |    NESTED LOOPS                 |                             |
|   4 |     NESTED LOOPS                |                             |
|*  5 |      TABLE ACCESS BY INDEX ROWID| GOATS                       |
|*  6 |       INDEX RANGE SCAN          | GOATS_TYPE_IDX              |
|*  7 |      INDEX UNIQUE SCAN          | GOAT_VERSIONS_ALIVE_ID_UNIQ |
|*  8 |     INDEX UNIQUE SCAN           | FARM_ANIMALS_PK             |
|*  9 |    TABLE ACCESS BY INDEX ROWID  | DOELINGS                    |
|* 10 |     INDEX UNIQUE SCAN           | DOELINGS_DOELING_ID_PK      |
-----------------------------------------------------------------------

PREDICATE INFORMATION (IDENTIFIED BY OPERATION ID):
---------------------------------------------------

   5 - FILTER("G"."PROPERTY_ID" IS NOT NULL)
   6 - ACCESS("G"."TYPE_ID"=:1)
   7 - ACCESS("G"."GOAT_ID"="V"."ALIVE_ID")
       FILTER("V"."ALIVE_ID" IS NOT NULL)
   8 - ACCESS("G"."GOAT_ID"="F"."ANIMAL_ID")
   9 - FILTER("D"."STATUS"='alive')
  10 - ACCESS("G"."GOAT_ID"="D"."DOELING_ID")

First, notice that the PROPERTIES table (P) is not accessed, nor are any of its indexes! Second, notice that the predicate for operation #5 (g.property_id is not null) is not in the original SQL above; the Oracle optimizer added it on its own!

What is going on?

The Oracle optimizer knows that the PROPERTY_ID column in child table GOATS (G) is a foreign key to parent table PROPERTIES (P). The query SQL text includes the join condition "G.PROPERTY_ID = P.PROPERTY_ID." But due to referential integrity, all values of G.PROPERTY_ID must also exist in P.PROPERTY_ID. Therefore, if G.PROPERTY_ID is not null, then this join condition must be satisfied.

The Oracle optimizer has used this knowledge to eliminate the need to access parent table PROPERTIES (P) or any of its indexes. This provides a significant cost savings. This shortcut would not have been available if referential integrity had been enforced by the application.
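The optimizer's reasoning can be sanity-checked outside Oracle. The sketch below uses Python and SQLite with hypothetical parent/child tables modeled on PROPERTIES and GOATS (SQLite will not perform the join elimination itself, but the row-count identity the optimizer relies on is easy to verify):

```python
import sqlite3

# Hypothetical parent/child tables mirroring PROPERTIES and GOATS.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE properties (property_id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE goats (
        goat_id     INTEGER PRIMARY KEY,
        property_id INTEGER REFERENCES properties(property_id)
    )""")
conn.executemany("INSERT INTO properties VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO goats VALUES (?, ?)",
                 [(10, 1), (11, 2), (12, None)])  # one goat with a NULL FK

# Because every non-null goats.property_id must exist in properties,
# joining to the parent can neither add nor remove rows beyond what
# the "property_id IS NOT NULL" filter already implies.
with_join = conn.execute("""
    SELECT COUNT(*) FROM goats g
    JOIN properties p ON g.property_id = p.property_id""").fetchone()[0]
without_join = conn.execute("""
    SELECT COUNT(*) FROM goats g
    WHERE g.property_id IS NOT NULL""").fetchone()[0]
assert with_join == without_join == 2
```

This is exactly the equivalence Oracle exploits: with the constraint enforced, the join to the parent is redundant and can be replaced by a NOT NULL filter on the foreign key column.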

Saturday, April 2, 2011

A small data mining library

In my quest to learn data mining, I have studied others' book reviews in order to assemble a small data mining library. This blog post shares the results. I've already finished reading some of these books, am studying others, and will probably never finish a few of them. Enjoy!

This list is divided into several sections: 
  • R
  • Data Mining
  • Regression, ANOVA, and Statistics
  • Applications



R

R in a Nutshell (Joseph Adler. O'Reilly 2010). A basic, general introduction to the R language. It is comprehensive, but necessarily thin in spots. For example, it devotes over one hundred pages to graphics, but less than a page to support vector machines. Those new to R should probably start with this book.

Data Manipulation with R (Phil Spector. Springer 2008). One of the beautiful things about R is how it handles datasets. This book explains all that you need to know about "data frames," and goes into detail about how to use "factors," how to access databases, etc. We often spend more time managing data than analyzing it, and this book shows you how to use R effectively. I especially like the lessons on R's "melt" and "cast" functions: they let you "denormalize" data, much as you can now do with Oracle 11g's pivot() function.
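The melt/cast idea carries over to other tools as well; here is a small sketch of the same long-to-wide reshaping in Python's pandas, with made-up sample data (the column and animal names are illustrative only):

```python
import pandas as pd

# Long ("molten") data: one row per (animal, measurement) pair.
long_df = pd.DataFrame({
    "animal":  ["goat1", "goat1", "goat2", "goat2"],
    "measure": ["weight", "height", "weight", "height"],
    "value":   [55.0, 0.7, 61.0, 0.8],
})

# "Cast" to wide form: one row per animal, one column per measure,
# analogous to cast() in R or PIVOT in Oracle 11g.
wide = long_df.pivot(index="animal", columns="measure", values="value")

# melt() goes the other way, from wide back to long.
long_again = wide.reset_index().melt(id_vars="animal", value_name="value")
```

The round trip preserves every (animal, measure, value) triple; only the layout changes, which is the whole point of the melt/cast pair.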

Modern Applied Statistics with S (W.N. Venables, B.D. Ripley. Springer 2010). This essential book focuses on practical use of R to analyze data. It includes some statistical theory, but only just enough to provide context and clarity. It emphasizes practical examples, and has a very wide scope.

Statistical Models in S (J. M. Chambers, T.J. Hastie. Chapman and Hall/CRC 1991). This reference book on statistical modeling goes into much more implementation detail than Venables and Ripley (above). It provides a firm foundation for understanding R functions and models.

Data Mining

All of Statistics: A Concise Course in Statistical Inference (Larry Wasserman. Springer 2010). This book provides a crash course on statistics. It starts with a basic definition of probability, where the basic problem is "given a data generating process, what are the properties of the outcomes?", contrasted with inference: "given a set of outcomes, what can we say about the process that generated the data?" It provides a firm mathematical foundation: for example, it clearly defines sample spaces, random variables, cumulative distribution functions (CDF), and probability density functions (PDF). The reader is assumed to be very comfortable with calculus and linear algebra, and is expected to follow proofs of its many theorems. Bibliographic notes are provided, as well as homework exercises.

The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Trevor Hastie, Robert Tibshirani and Jerome Friedman. Springer 2009). A complete guide to data mining techniques, with eighteen chapters and seven hundred pages. The first several chapters are considered essential, while the rest can be read independently. It includes computational considerations and extensive bibliographies. Used as the textbook for Machine Learning 201 and 202 taught at the Hacker Dojo.

Data Mining: Practical Machine Learning Tools and Techniques (Ian Witten. Morgan Kaufmann 2011). A basic introduction to data mining. It uses Weka rather than R.

Introduction to Data Mining (Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Addison Wesley 2005). A very accessible, yet comprehensive introduction to data mining. It requires only a minimal mathematical background, and includes data pre- and post-processing. Used as the textbook for Machine Learning 101 and 102 taught at the Hacker Dojo.

Regression, ANOVA, and Statistics

Applying Regression and Correlation: A Guide for Students and Researchers (Dr. Jeremy Miles, Dr. Mark Shevlin. Sage 2000). A lovely introduction to regression, targeted to the non-mathematician. I enjoy how it starts by showing how the mean is a least-squares model: the mean minimizes the sum of squared residuals. Starting with the mean and progressing to simple linear regression makes it easier to understand multiple regression. This approach also makes it easier to see analysis of variance (ANOVA) as a form of regression.
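That opening observation, that the mean is itself a least-squares model, is easy to check numerically. The sketch below uses a small made-up sample and a brute-force grid search over candidate constants:

```python
import numpy as np

# A small made-up sample.
x = np.array([2.0, 3.0, 5.0, 10.0])

def ssr(c):
    """Sum of squared residuals of x around a candidate constant c."""
    return float(np.sum((x - c) ** 2))

# Evaluate SSR over a fine grid of candidate constants; the minimizer
# should land on the sample mean (here, 5.0).
candidates = np.linspace(0.0, 12.0, 1201)
best = candidates[np.argmin([ssr(c) for c in candidates])]
assert abs(best - x.mean()) < 0.01
```

Fitting a single constant by least squares recovers the mean, which is why the book's progression from the mean to simple and then multiple regression is so natural.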

Nonlinear Regression with R (Christian Ritz, Jens Carl Streibig. Springer 2008). This book focuses on nonlinear regression, going into deep practical detail about the R function nls() and model diagnostics. It is not a book on theory. It is designed for self-study by those familiar with, but not yet expert in, regression and R.

Statistical Methods, 8th Edition (George Snedecor, William Cochran. Iowa State University Press 1989). A classic, covering the basics from a very practical perspective.

Analysis of Variance (Henry Scheffé. Wiley 1959). A complete mathematical foundation for the analysis of variance; another classic. It assumes the reader is comfortable with calculus and linear algebra. It is designed for self-study.

Applications

Profiles in Performance: Business Intelligence Journeys and the Roadmap for Change (Howard Dresner. Wiley 2009). While not a book on data mining per se, it outlines the organizational conditions necessary for success with business intelligence technologies. Dresner defines four levels of organizational maturity as measured by six "performance-directed culture criteria," including "transparency and accountability" and "availability and currency of information." The book is based on very detailed, insightful, and inspiring case studies.

Smart Enough Systems: How to Deliver Competitive Advantage by Automating Hidden Decisions (James Taylor, Neil Raden. Prentice Hall 2007). This book focuses on the small decisions made every day by an organization, such as those made by front-line customer service representatives. While each of these decisions may be very small, their cumulative effect is greater than that of the strategic decisions that get most upper-management attention. The authors advocate "enterprise decision management" to optimize the precision, consistency, agility, and speed of these routine operational decisions. This can be accomplished by making suitable predictive models easily available and automatic.

PMML in Action (Alex Guazzelli, Wen-Ching Lin, Tridivesh Jena. CreateSpace 2010). The Predictive Model Markup Language (PMML) is an industry-standard way of sharing predictive models between software applications. This book provides very practical advice for using and understanding PMML, illustrated with thorough examples.