Saturday, April 2, 2011

A small data mining library

In my quest to learn data mining, I have studied others' book reviews in order to assemble a small data mining library. I share the results of this quest in this blog post. I've already finished reading some of these books, am studying others, and will probably never finish some of them. Enjoy!

This list is divided into several sections: 
  • R
  • Data Mining
  • Regression, ANOVA, and Statistics
  • Applications


R in a Nutshell (Joseph Adler. O'Reilly 2010). A basic, general introduction to the R language. It is comprehensive, but necessarily thin in spots. For example, it devotes over one hundred pages to graphics, but less than a page to support vector machines. Those new to R should probably start with this book.

Data Manipulation with R (Phil Spector. Springer 2008). One the beautiful things about R is how it handles datasets. This book explains all that you need to know about "data frames", and goes into detail about how to use "factors," how to access databases, etc. We often spend more time managing data than analyzing it, and this book show you how to use R effectively. I especially like the lessons on R's "melt" and "cast" functions: they let you "denormalize" data, like you can now do with Oracle 11g's pivot() function.

Modern Applied Statistics with S (W.N. Venables, B.D. Ripley. Springer 2010). This essential book focuses on practical use of R to analyze data. It includes some statistical theory, but only just enough to provide context and clarity. It emphasizes practical examples, and has a very wide scope.

Statistical Models in S (J. M. Chambers, T.J. Hastie. Chapman and Hall/CRC 1991) This reference book on statistical modeling goes into much more implementation detail that Venables and Ripley (above). It provides a firm foundation for understanding R functions and models.

Data Mining

All of Statistics: A Concise Course in Statistical Inference (Larry Wasserman. Springer 2010). This book provides a crash course on statistics. It starts with a basic definition of probability, where the basic problem is "given a data generating process, what are the properties of the outcomes?" contrasted with inference: "given a set of outcomes, what can we say about the process that generated the data?" It provides a firm mathematical foundation: for example it clearly defines samples spaces, random variables, cumulative distribution functions (CDF), and probability density functions (PDF). The reader is assumed to be very comfortable with calculus and linear algebra, and is expected to follow proofs of its many theorems. Bibliographic notes are provided, as well as homework exercises.

The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Trevor Hastie, Robert Tibshirani and Jerome Friedman. Springer 2009). A complete guide to data mining techniques, with eighteen chapters and seven hundred pages. The first several chapters are considered essential, while the rest can be read independently. It includes computational considerations and extensive bibliographies. Used as the textbook for Machine Learning 201 and 202 taught at the Hacker Dojo.

Data Mining: Practical Machine Learning Tools and Techniques (Ian Witten. Morgan Kaufmann 2011). A basic introduction to data mining. It uses Weka rather than R.

Introduction to Data Mining (Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Addison Wesley 2005). A very accessible, yet comprehensive introduction to data mining. It requires only a minimal mathematical background, and includes data pre- and post-processing. Used as the textbook for Machine Learning 101 and 102 taught at the Hacker Dojo.

Regression, ANOVA, Statistics

Applying Regression and Correlation: A Guide for Students and Researchers (Dr. Jeremy Miles, Dr. Mark Shevlin. Sage 2000). A lovely introduction to regression, targeted to the non-mathematician. I enjoy how it starts by showing how the mean is a least-squares model: the mean minimizes the sum of squared residuals. Starting with the mean and progressing to simple linear regression makes it easier to understand multiple regression. This approach also makes it easier to see analysis of variance (ANOVA) as a form of regression.

Nonlinear Regression with R (Christian Ritz, Jens Carl Streibig. Springer 2008). This book focuses on nonlinear regression, going into deep practical detail about the R functions nls() and model diagnostics. It is not a book on theory. It is designed for self study for those familiar with, but not yet experts in regression and R.

Statistical Methods, 8th Edition (George Snedecor, William Cochran. Iowa State University Press 1989). A classic, covering the basics from a very practical perspective.

Analysis of Variance (Henry Scheffé. Wiley 1959). A complete mathematical foundation for the analysis of variance; another classic. It assumed the reader is comfortable with calculus and linear algebra. It is designed for self-study.


Profiles in Performance: Business Intelligence Journeys and the Roadmap for Change (Howard Dresner, Wiley 2009). While not a book on data mining per se, it outlines the organizational issues necessary for success with business intelligence technologies. He defines four levels of organizational maturity as measured by six "performance-directed culture criteria." These include "transparency and accountability", and "availability and currency of information." The book is based on very detailed, insightful, and inspiring case studies.

Smart Enough Systems: How to Deliver Competitive Advantage by Automating Hidden Decisions
(James Taylor, Neil Raden. Prentice Hall 2007), This book focuses on the small descions made every day by an organization, such as those made by front-line customer service representatives. While each of these decissins may be very small, their cumulative effect is greater than the strategic descions that get most upper management attention. The author advocates for "enterprise decision management" to optimize precision, consistency, agility, and speed of these routine operational decisions. This can be accomplished by making suitable predictive models easily available and automatic.

PMML in Action (Alex Guazzelli, Wen-Ching Lin, Tridivesh Jena. CreateSpace 2010), The Predictive Model Markup Language (PMML) is an industry standard way of sharing predictive models between software applications. This book provides very practical advice for using and understanding PMML illustrated with thorough examples.

No comments:

Post a Comment