Thursday, 13 October 2016

The Error in Predictive Analytics

We are all well aware of the predictive analytical capabilities of companies like Netflix, Amazon and Google. Netflix predicts the next film you are going to watch, Amazon shortens delivery times by predicting what you are going to buy next, and Google even lets you use its algorithms to build your own prediction models. Following these successes, companies in telecom, finance, insurance and retail have developed their own analytical capabilities and started using predictive models to improve their business. Predictive analytics can be applied to a wide range of business questions and has been a key technique in search, advertising and recommendations.

Many of today's applications of predictive analytics are in the commercial arena, focusing on predicting customer behaviour, but first steps in other fields are being taken: organisations in healthcare, industry and utilities are investigating what value predictive analytics can bring. In taking these first steps, much can be learned from the experience the front-running industries have in building and using predictive models. However, care must be taken, because the context in which predictive analytics has been used so far is quite different from the new application areas, especially when it comes to the impact of prediction errors.

Leveraging the data

It goes without saying that the success of Amazon comes not only from its infinite shelf space but also from its recommendation engine; the same holds for Netflix. According to McKinsey, 35 percent of what consumers purchase on Amazon and 75 percent of what they watch on Netflix comes from algorithmic product recommendations. Recommendation engines work well because there is a lot of data available on customers, products and transactions, especially online. This abundance of data is why there are so many predictive analytics initiatives in sales and marketing. The main objective of these initiatives is to predict customer behaviour: which customer is likely to churn or to buy a specific product or service, which ads will be clicked on, or which marketing channel to use to reach a certain type of customer. In these applications, predictive models are created using either statistical techniques (like regression, probit or logit models) or machine learning techniques (like random forests or deep learning). With the insights gained from these predictive models, many organisations have been able to increase their revenues.
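To make the two families of techniques concrete, here is a minimal sketch of a churn model built both ways, with scikit-learn. The customer features, labels and data sizes are entirely made up for illustration; real churn data would of course look different.

```python
# Sketch: fitting a statistical model (logistic regression) and a machine
# learning model (random forest) on the same synthetic churn data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Hypothetical features: monthly spend and number of support calls.
X = np.column_stack([rng.normal(50, 15, n), rng.poisson(2, n)])
# Synthetic churn label: more support calls -> higher churn probability.
y = (rng.random(n) < 1 / (1 + np.exp(-(X[:, 1] - 3)))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(), RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, "accuracy:", model.score(X_test, y_test))
```

Both models happily return predictions for every customer; the difference discussed below is in what they can tell you about the uncertainty of those predictions.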

Predictions always contain errors!

Predictive analytics has many applications; the examples above are just the tip of the iceberg. Many of them will add value, but it remains important to stress that the outcome of a prediction model will always contain an error, and decision makers need to know how big that error is. To illustrate: by using historic data to predict the future, you assume that the future will have the same dynamics as the past, an assumption which history has proven to be dangerous. The 2008 financial crisis is proof of that. Even though there is no shortage of data nowadays, there will be factors that influence the phenomenon you're predicting (like churn) that are not included in your data. The data itself will also contain errors, as measurements always include some kind of error. Last but not least, models are always an abstraction of reality and can't contain every detail, so something is always left out. All of this affects the accuracy and precision of your predictive model, and decision makers should be aware of these errors and the impact they may have on their decisions.

When statistical techniques are used to build a predictive model, the model error can be estimated and is usually provided in the form of confidence intervals. Any statistical package will provide them, helping you assess the model quality and its prediction errors. In the past few years other techniques have become popular for building predictive models, such as deep learning and random forests. Although these techniques are powerful and able to provide accurate predictive models, they typically do not provide confidence intervals (or error bars) for their predictions out of the box, so there is no way of telling how accurate or precise the predictions are. In marketing and sales this may be less of an issue: the consequence might be that you call the wrong people or show an ad to the wrong audience. The consequences can, however, be more severe. You might remember the offensive auto-tagging by Flickr, labelling images of people with tags like "ape" or "animal", or the racial bias in predictive policing algorithms.


Where is the error bar?

The point I would like to make is that when adopting predictive modelling, be sure to have a way of estimating the error in your predictions, both in terms of accuracy and precision. In statistics this is common practice and helps improve both models and decision making. Models constructed with machine learning techniques usually provide only point estimates (for example, the probability of churn for a customer is some percentage), which gives little insight into the accuracy or precision of the prediction. When using machine learning it is possible to construct error estimates (see for example the research of Michael I. Jordan), but it is not yet common practice, and many analytical practitioners are not even aware of the possibility.

Especially now that predictive modelling is being used in environments where errors can have a large impact, this should be top of mind for both the analytics professional and the decision maker. Just imagine your doctor concluding that your liver needs to be taken out because a predictive model estimates a high probability of a very nasty disease. Wouldn't your first question be how certain he or she is about that prediction? So, my advice to decision makers: only use the outcomes of predictive models if accuracy and precision measures are provided. If they are not there, ask for them. Without them, a decision based on these predictions comes close to a blind leap of faith.
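As a pointer for the analytics professional: even without the formal machinery from the research literature, a rough error bar can often be attached to a machine-learning prediction. The sketch below uses the spread of the individual trees in a random forest as a crude uncertainty measure; this is a heuristic on synthetic data, not a rigorous confidence interval, and the features are invented for illustration.

```python
# Sketch: attaching a rough error bar to a random-forest prediction by
# looking at the spread of the individual trees' probability estimates.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
# Synthetic binary label driven mainly by the first feature.
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

x_new = np.array([[0.1, -0.3, 0.8]])
# Each tree votes with its own class probability; the standard deviation
# across trees is a crude uncertainty measure to report alongside the
# point estimate, rather than the bare percentage on its own.
per_tree = np.array([t.predict_proba(x_new)[0, 1] for t in forest.estimators_])
print(f"estimate {per_tree.mean():.2f} +/- {per_tree.std():.2f}")
```

Reporting the prediction as "estimate plus or minus spread" instead of a single percentage is a small step, but it gives the decision maker exactly the kind of handle on uncertainty argued for above.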