Tuesday, 10 March 2015

A toast to Occam’s razor; Accuracy vs Interpretability

A question I get asked a lot these days is how to trade off model accuracy against model interpretability when selecting a predictive model. The reason is that methods like neural nets and random forests are becoming more popular in predictive analytics. They tend to generate more accurate predictions than traditional statistical methods like logistic regression, but they are much harder to interpret. Some practitioners, following Occam’s razor, prefer simple methods over complex ones in supporting their customers. And I agree, most people without mathematical training would be able to understand a logistic regression, but would have trouble understanding a neural net or a random forest. But sacrificing accuracy for interpretability? Preferring simple over complex models is a rather simplistic interpretation of Occam’s razor. Occam’s advice is to choose the simplest model when the competing models have the same predictive ability. So he puts model accuracy first!

One of the golden rules in analytics consulting is that customers need to trust the analytic methods you use before they are willing to accept and implement the outcomes of your analysis. Understanding the method and its outcomes (interpretability) is one way for your customer to gain trust. For a simple model or method this is relatively easy, but what if the method becomes more complex? Your customer would have to become a mathematician to understand the model you created and verify that it is correct, but there is no need for that. Objectively reporting the model quality is another way to build trust, for example by reporting the model calibration results (how well the model fits the data) or its predictive accuracy. A simple and straightforward way to show the predictive accuracy of a model is to use a confusion matrix and report performance indicators derived from it.

Suppose you want to predict the quality of wine based on its chemical components. You are considering a logistic regression and a random forest and want to select the best model. First both models are trained, in this case using data from the UCI Machine Learning Repository, which contains the results of the chemical analysis of 6497 Portuguese "Vinho Verde" wines. To test both models, the quality is predicted for a randomly selected subset of the wines that was excluded from the data before training. The results of the tests are summarized in the confusion matrices below. For each model, the matrix compares the predicted quality of 1948 wines with their true classification.
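To make the hold-out step concrete, here is a minimal sketch in Python of setting aside a random test set before training. It is not the R code used for this post: the function name `holdout_split` and the seed are hypothetical, and only the row counts (6497 wines, 1948 of them held out) come from the text above.

```python
import random


def holdout_split(n_rows, n_test, seed=0):
    """Return (train_indices, test_indices) for a random hold-out split.

    The test rows are drawn without replacement and are never seen
    during training.
    """
    if not 0 < n_test < n_rows:
        raise ValueError("n_test must be between 1 and n_rows - 1")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    test = sorted(rng.sample(range(n_rows), n_test))
    test_set = set(test)
    train = [i for i in range(n_rows) if i not in test_set]
    return train, test


# Row counts from the post; the seed value is an arbitrary assumption.
train_idx, test_idx = holdout_split(n_rows=6497, n_test=1948, seed=2015)
print(len(train_idx), len(test_idx))  # prints "4549 1948"
```

Both models would then be fitted on the training rows only and scored on the same test rows, so that their confusion matrices are directly comparable.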

Based on the confusion matrix, several criteria can be constructed to assess the prediction quality of the trained models, such as:

  • Accuracy, the proportion of correct predictions
  • Error rate, 1 - Accuracy
  • Sensitivity, the proportion of good quality wines that are correctly predicted as good
  • Specificity, the proportion of bad quality wines that are correctly predicted as bad
  • Lift, the proportion of wines predicted as good that really are good, divided by the proportion of good wines in the test set. It measures how much better the model's positive classifications are than picking wines at random.
  • False Positive Rate, the proportion of actual negatives (bad wines) that are incorrectly predicted positive
  • False Negative Rate, the proportion of actual positives (good wines) that are incorrectly predicted negative
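The criteria above can be computed directly from the four cells of a two-class confusion matrix. The sketch below is in Python rather than the R used for this post, and the counts in the usage example are made up for illustration; they are not the wine results. It treats "good wine" as the positive class, assumes both classes and both predictions occur at least once, and uses one common definition of lift (precision divided by the share of actual positives).

```python
def confusion_metrics(tp, fp, fn, tn):
    """Performance indicators from a 2x2 confusion matrix.

    tp: good wines predicted good    fp: bad wines predicted good
    fn: good wines predicted bad     tn: bad wines predicted bad
    """
    total = tp + fp + fn + tn
    actual_good = tp + fn
    actual_bad = fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)        # share of "good" predictions that are right
    prevalence = actual_good / total  # share of good wines in the test set
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "sensitivity": tp / actual_good,
        "specificity": tn / actual_bad,
        "lift": precision / prevalence,
        "false_positive_rate": fp / actual_bad,
        "false_negative_rate": fn / actual_good,
    }


# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
metrics = confusion_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the false positive rate is 1 - specificity and the false negative rate is 1 - sensitivity, so the last two indicators carry no extra information; they are reported because they are often easier to explain to a customer.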

Based on the computed performance measures, the random forest model outperforms the logistic regression on all measures. It is the better model for predicting Portuguese "Vinho Verde" wine quality. Of course we need to measure the model performance regularly as new data becomes available and update the model if required.

The above example shows that accuracy can require more complex prediction models. It’s also a lesson I have learned in using both classical statistical (econometric) methods and machine learning to create prediction models for my customers. Simple models tend to be worse predictors; adding more variables (more information) tends to increase the accuracy of predictions. As Leo Breiman, the inventor of the random forest algorithm, states in "Statistical Modeling: The Two Cultures", in predictive modelling the primary goal is to supply accurate predictions, not interpretability. The focus should therefore be on accuracy and, when models are level on that score, follow Occam and choose the simplest one.
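The selection rule in the last sentence can be written down as a small procedure. The Python sketch below is only an illustration: the function `select_model`, the complexity ranks, the tolerance that decides when two models count as "level", and the accuracy figures in the usage example are all assumptions of mine, not results from the wine comparison.

```python
def select_model(candidates, tolerance=0.0):
    """Pick a model by accuracy first, simplicity second.

    candidates: list of (name, accuracy, complexity_rank) tuples, where a
    lower complexity_rank means a simpler model.
    tolerance: accuracy gap below which models are treated as level.
    """
    best_accuracy = max(accuracy for _, accuracy, _ in candidates)
    # Keep every model whose accuracy is level with the best one.
    level = [c for c in candidates if best_accuracy - c[1] <= tolerance]
    # Among those, follow Occam and take the simplest.
    return min(level, key=lambda c: c[2])[0]


# Hypothetical accuracies for illustration only.
clear_gap = [("logistic regression", 0.75, 1), ("random forest", 0.83, 2)]
near_tie = [("logistic regression", 0.795, 1), ("random forest", 0.80, 2)]

print(select_model(clear_gap, tolerance=0.01))  # prints "random forest"
print(select_model(near_tie, tolerance=0.01))   # prints "logistic regression"
```

How large the tolerance should be is a judgment call to make with your customer; it expresses how much accuracy they are willing to give up for a model they can read.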


  1. The R code that I used for this blog can be found on my GitHub
  2. All estimation procedures used for this blog are part of the caret (Classification And REgression Training) package in R