Wednesday, 28 October 2015

How bad is having a bacon sandwich really?

Reading this week’s headlines warning us that eating bacon and sausages causes bowel cancer will probably turn you into a vegetarian instantly. But the way this message has been put forward is very misleading. The headlines are referring to a press release from the International Agency for Research on Cancer (IARC) in which processed and cured meats like bacon are classified as group 1 carcinogens. The IARC reached that conclusion after carefully studying research which convinced them that there is a causal link between consuming these meats and bowel cancer. Eating 50 grams of processed meat a day will increase your risk of bowel cancer by 18%. To put things in perspective, group 1 carcinogens also include tobacco, alcohol, arsenic and asbestos, all known to cause certain cancers. So, is having a bacon sandwich as risky as smoking a cigarette or having a drink? Before diving into the interpretation of the outcome let’s first have a look at the research conducted.

It’s just an opinion

The above estimated risk increase originates from a meta-analysis of prospective studies on meat consumption in relation to bowel cancer published in 2011; it’s a “mashup” of research conducted in the past. In a prospective study a group of individuals with a meat-rich diet is monitored over a period of time and compared to a control group that has a different diet (no meat, or meat to a much lesser extent). So, a relative risk is estimated. As this is not a controlled experiment, the estimated relative risks are corrected for factors like age, BMI, alcohol consumption, sex, hypertension, diabetes, etc. to get a like-for-like comparison between the two groups. As different studies probably correct for different factors and in different ways, it’s at least questionable whether the results from different studies can be compared. Are apples being compared to apples? More important however, there might be a factor not included in the corrections that explains both why an individual eats more meat (red or cured) and why that individual develops bowel cancer. This is known as confounding, which leads to spurious correlations. So, the IARC found that there is a positive correlation between meat consumption and bowel cancer, but can’t provide the proof that the relation is a causal one. It’s the opinion of the IARC.

Is eating bacon as risky as smoking?

What everybody should know is that the classifications of the IARC are based on the strength of the evidence, not on the degree of risk. So, two risk factors could be classified similarly even if one causes many more types of cancers than the other. Therefore bacon ends up in the same class as cigarettes and asbestos. These classifications are not meant to convey how dangerous something is, just how certain we are that something is dangerous. I can imagine that this way of communicating the risk is really confusing to anyone trying to work out how to lead a healthy life. Cancer Research UK is doing a much better job, by making explicit what the risks are. Clearly smoking is much riskier than eating a bacon sandwich.

Source : Cancer Research UK

Should you change your diet?

The IARC reports an 18% increased risk of getting bowel cancer when eating 50 grams of cured meat a day. But what does this mean? Increased risk compared to what? Clearly the missing information to know whether we should be worried about the increase in risk is how many people will get bowel cancer even if they don’t eat bacon sandwiches. If the baseline risk is large, an 18% increase will be a lot worse than when the baseline risk is small. The risk for an individual to get bowel cancer in the Netherlands is about 1 in 20, or 5%. It’s this risk of people getting bowel cancer that increases by 18%. So the risk becomes 5.9% instead of 5%. In absolute numbers, when 100 people eat a bacon sandwich every day, the number of people diagnosed with bowel cancer will increase from 5 to about 6. So the 18% increase sounds like a lot, but in absolute terms we should expect just one extra person to get bowel cancer. What do you think? Will this stop you from eating bacon sandwiches? I’m not too fond of them, but with the above you should be able to decide if you want one. And in case you do, enjoy!
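
For readers who like to check the arithmetic, here is a minimal sketch of the step from relative to absolute risk. The 5% baseline and the 18% increase are the figures quoted above; the function and variable names are only illustrative.

```python
def absolute_risk(baseline_risk, relative_increase):
    """Apply a relative risk increase to a baseline (absolute) risk."""
    return baseline_risk * (1 + relative_increase)


baseline = 0.05            # about 1 in 20, the figure quoted for the Netherlands
relative_increase = 0.18   # the 18% increase reported by the IARC

new_risk = absolute_risk(baseline, relative_increase)
extra_cases_per_100 = (new_risk - baseline) * 100

print(f"Risk with daily processed meat: {new_risk:.1%}")
print(f"Extra cases per 100 people: {extra_cases_per_100:.1f}")
```

The same 18% applied to a baseline of, say, 50% would mean nine extra cases per 100 people, which is why the baseline matters so much.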

Monday, 31 August 2015

HR Analytics: Breaking through the wall in HR measurement

Data and analytics are key to solving all kinds of business problems. Already, many organisations are using data and analytics to gain insights on their performance and use mathematical models to find viable directions for improvement while keeping track of the gains of this fact-based way of decision making. Organisations apply analytics to all kinds of challenges in business areas like Operations, Customer Services and Marketing & Sales. The business area that seems to lag behind in using these advanced methods is Human Resources (HR). Of course a lot of HR related data is being gathered. However, much of the current HR related analytics creates little impact. Even though more and more data is gathered and more sophisticated analysis becomes possible, HR rarely drives a strategic change. As Boudreau and Cascio indicate in Investing in People: “There is increasing sophistication in technology, data availability, and the capacity to report and disseminate HR information, but investments in HR data systems, scorecards and ERP fail to create strategic insights needed to drive organisational effectiveness. In short, many organisations are ‘hitting the wall’ in HR measurement.” It’s my conviction that HR will be able to demolish this wall of measurement by turning to advanced analytical methods and as a result increase its impact on organisational decision making and performance.

In practice, much of the HR related analytics happens in Business Intelligence (BI) tools like Spotfire, Tableau, QlikView or even MS Excel. These are great at accomplishing routine production of HR related reports and dashboards, but do not provide the support required to, for example, find the drivers for employee satisfaction or steer preventive measures to reduce turnover. To find those, predictive analytics capabilities are required, which BI tools typically don’t offer, nor will the typical dashboard user have the capability to use these methods wisely. A BI tool will allow drill downs and supports the analysis of KPIs of subgroups, but will not provide the explanation why this subgroup has these scores. You need to find your own explanation (or adopt a belief) which may be incorrect, causing you to set up expensive change programs, possibly addressing the wrong issues. To find the real drivers and causes, more advanced analytics tools and skills are required.

To illustrate, let’s have a look at employee turnover (data can be found here). Being able to understand the drivers of employee turnover and predict who is going to leave is of crucial importance to any company. It is estimated that for entry-level employees the cost of replacing them is between 30% and 50% of their annual salary. For mid-level employees, it costs upwards of 150% of their annual salary and for high-level or highly specialized employees, you're looking at 400% of their annual salary. Clearly, understanding the drivers and being able to react to them can be a huge cost saver.

A typical way of presenting employee turnover in BI tools is by means of a histogram. The histogram clearly shows that a lot of employees leave the company in the second year. This is especially true for the Human Resources and the Research and Development departments. Also, at the Research and Development department, there is another peak of employees leaving the company at 10 years. The question is why. To answer that question, the histogram is not very useful and more advanced methods are required.

Given the similarity with customer churn, it might be tempting to go for a logistic regression to predict the probability of turnover, or use a decision tree to find the relevant factors that drive turnover. However, that would imply that we can’t incorporate an important factor that we are interested in, and that is time till resignation. A method that explicitly takes the time to an event into account is survival analysis, also known as reliability analysis or duration modelling. The survival curve expresses the probability of survival (in this case staying at the company) over time. Survival analysis allows us to account for censoring (employees who have not resigned yet at the moment of analysis) and for time-dependent explanatory variables, such as the time since last salary raise or the time since last promotion. By estimating survival curves for different departments, job roles or other dimensions of interest, comparisons can be made and differences in resignation probabilities over time analysed.
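
To make the idea of a survival curve concrete, below is a minimal sketch of the Kaplan-Meier estimator, a standard way to estimate such a curve from censored data. It is an illustration only, not the code behind this post: the tenure data and the names `kaplan_meier`, `tenure` and `resigned` are made up for the example.

```python
def kaplan_meier(durations, event_observed):
    """Return [(time, survival probability)] for each time an event occurs.

    durations      -- time in the company (years) for each employee
    event_observed -- True if the employee resigned, False if still
                      employed at the time of analysis (censored)
    """
    event_times = sorted({t for t, e in zip(durations, event_observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Everyone with a tenure of at least t was still "at risk" at time t.
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, event_observed) if d == t and e)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical data: ten employees, tenure in years.
tenure = [1, 2, 2, 2, 3, 5, 5, 8, 10, 10]
resigned = [True, True, True, False, True, False, True, False, True, False]

curve = kaplan_meier(tenure, resigned)
for years, probability in curve:
    print(f"after {years:>2} years: {probability:.2f}")
```

Censored employees still count in the "at risk" group up to their current tenure, which is exactly the information a plain histogram of leavers throws away.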

Using the data from the histogram I created the following survival curves per department and job role. The high level of turnover in years 2 and 10 as seen in the histogram shows up as strong reductions in the survival probability. Clearly visible are the differences per department and job role (Sales reps seem to have a short future). The survival curves help us understand the rate of resignation better than the histograms as they show how the probability of resignation develops over time, but they don’t provide the reason why people resign. For this we need a method that allows for additional explanatory variables to explain the resignations over time. A much used method for answering this type of question is the Cox proportional hazards model. In medicine it is commonly used to describe the outcome of drug studies. To find out why so many people leave the Research & Development department, I used the Cox model to find that Years Since Last Promotion, Overtime and Job Satisfaction are the most significant factors. Job Involvement, Job Level and Frequent Business Travel also explain resignation but are less significant. With these insights the HR department can turn to the manager of the Research & Development department and pro-actively come up with ways to reduce resignation levels by addressing the key factors.

The above example is just an illustration of how advanced analytics can be of value to the HR department and the organisation it is part of. With access to these advanced methods strategic impact of HR will increase, tearing down the wall of HR measurement. However, as this type of analysis is typically not routine and hence difficult to capture in a standard tool or way of working, HR departments also need to acquire the right analytical skills and mind set. There is more to using advanced analytical methods than just loading data in some analytics platform and pushing the run button, accepting the outcome as the best possible answer. Adequate business knowledge, being able to select and use the right analytical method and communicating outcomes to business owners are as much a requirement as having access to analytical software. With this in mind, for sure the HR measurement wall will cease to exist.

Sunday, 21 June 2015

Prescriptive analytics, the next big step?

Now that you have hooked all the data of your organisation to your KPI dashboard to monitor everyday performance and are busy estimating forecasting models for order intake and customer satisfaction, you’re wondering what will be your next step in analytics. Should it be prescriptive analytics? It’s the most advanced, most promising variant of analytics, at least that’s what vendors of analytics software are saying, but it is also the most demanding.
Reviewing the literature on analytics you deduce that the only way to be able to use prescriptive analytics is to gradually grow your analytics maturity from descriptive, diagnostic, and predictive to prescriptive analytics. The graph Tom Davenport uses in Competing on Analytics to position the different types of analytics clearly shows that. Gartner positions prescriptive analytics as an emerging technology in the hype cycle, comparable to autonomous vehicles and biochips, suggesting it is a new high tech kind of thing. Something that still needs to prove its value. You wonder if it's the right way to go...
Gartner hype cycle, August 2014
My experience is that analytics maturity is of less importance when it comes to the kind and complexity of analytics used to solve a business problem. Analytics maturity is about the factors that determine an organisation’s readiness to adopt analytics in decision making throughout the organisation. Davenport uses the DELTA (Data, Enterprise orientation, Leadership, Targets, and Analysts) model to assess an organisation’s maturity. When you review Davenport’s DELTA model, you will see that the complexity of the analytics used is not a driving factor for maturity. The other way around also holds.
Gartner’s positioning of prescriptive analytics as a new technology is strange to me. Prescriptive analytics (or Operations Research as we used to call it) has been around for some time already. It originated from the research done by the British Army to beat the Nazis during the 2nd World War. At that time, analytics was essentially the application of common sense and the careful study of data to the messiness of war. With success, as the insights from the analysts led to the defeat of the German U-boat campaign. Since then Operations Research has been applied to all kinds of decision problems within big and small organisations. Some of them could be called analytical competitors (organisations like Google and Amazon), many of them analytically impaired.
In my 25 year career as an analytics professional I have come across many examples in which operations research (or prescriptive analytics) proved to be of immediate value, even though the organisation didn’t have sophisticated analytical skills. I have supported Mom and Pop 3PLs with route optimisation models to create routing schedules for their trucks. With the low margins they get, making the most out of their assets is crucial to them. In healthcare, not really a sector in which analytics has gained a strong foothold, the use of shift optimisation and shift scheduling has led to better balanced schedules, reducing illness and stress, beneficial to both nurses and patients, and lowering the cost of healthcare. Similarly, benchmarking using optimisation modelling resulted in better insights into hospital performance and the identification of best practices. Governments are not very analytically mature either, but then again using optimisation to construct routes for the de-icing of highways and local roads reduced cost and improved road safety. I could go on with many more examples, but I guess you get the point: it is not your analytics maturity that determines whether you can use prescriptive analytics, but the problem you need to solve.
In summary, prescriptive analytics is not a concept in a hype stage, nor an approach with little use in everyday decision making. The above examples prove that. It doesn’t require big budgets nor is it only available to you when you have mastered predictive or descriptive analytics. It is the problem you need to solve that determines the analytics technique you require. So what’s keeping you? Start optimising and start today!

Monday, 25 May 2015

There is more to analytics than just fishing in the data lake

We live in an era in which we celebrate technology, we live for the latest gadgets. Data is no longer a scarce resource, and expectations about what can be done with it are rising fast. On the other hand, lakes of data are overwhelming and frustrating people while hard- and software vendors are inviting us to go on a data fishing trip. They tempt us to spend many Euros on data warehouses, hardware, and state of the art analytics software. However, no matter how many Euros you’re spending, if people who work with the data don’t know how to make sense of it or are unable to clearly present what they find, that investment is clearly wasted. The problem of our time is not the lack of data, but rather the inability to make sense of it. In a typical analytics project, data is loaded into software and the “find the best model” button is pushed. According to the ads of the software vendors, decision makers can act immediately on the outcomes as the software is guaranteed to find the best possible model. This however can cause serious problems.
Best practice before running any statistical analysis is to first visually inspect the data. In 1973, Anscombe presented four data sets that have become a classic illustration for the importance of visualizing data, not merely relying on summary statistics or model fitting procedures of analytics software. The four data sets are now known as "Anscombe's quartet."
When summarizing the data of the four series it becomes clear that the summary statistics are the same. Assuming a simple linear relationship between each X and Y results in four identical models, Y=3.000+ 0.500 X. But are these series indeed the same?
Things turn out to be very different when we visualise the data. As can be seen directly from the graphs it’s dangerous to assume you understand the nature of the data just from its summary statistics or the model output. Each of Anscombe’s examples shows an interesting and valid relationship, but only one of them matches the story drawn out from the summary and the fitted model. Set 2 clearly isn’t linear but quadratic. Set 3 is linear, but the outlier (upper right) skews the fitted model. Set 4 is a more extreme example of the effect of an outlier. A linear relation between X and Y in this case doesn’t make any sense.       
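The identical summary statistics are easy to verify. The sketch below uses the standard published values of Anscombe's quartet and only the Python standard library; it is not the code from my GitHub page, and the helper name `fit_line` is made up for the example.

```python
from statistics import mean, variance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "Set 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Set 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Set 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "Set 4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
              [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Least squares intercept and slope for the model y = a + b * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope


summary = {}
for name, (x, y) in quartet.items():
    intercept, slope = fit_line(x, y)
    summary[name] = (mean(x), variance(x), mean(y), intercept, slope)
    print(f"{name}: mean x={mean(x):.2f}  var x={variance(x):.2f}  "
          f"mean y={mean(y):.2f}  fit: Y={intercept:.3f}+{slope:.3f} X")
```

The four lines of output are practically indistinguishable, which is the whole point: only a plot reveals the curve in set 2 and the outliers in sets 3 and 4.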
Data visualisations help us perceive and appreciate the features of the data but also let us look behind such features and let us see what else is there. Good analysis is not a routine matter and will require switching between graphical display of the data, model estimation results and crunching the numbers.
To be successful in analytics two skills are essential.  First of all statistical thinking, the ability to find insights that live in the data and make sense of them. Second visual thinking, the ability to see meaningful patterns in data by representing and interacting with them visually. Having lots of data, the latest hard- and software and the urge to go on a fishing trip are no substitute for these skills.
Code to reproduce the data, tables and graphs can be found on my GitHub page

Saturday, 25 April 2015

What’s stronger than Moore’s law?

Moore’s law turned 50 this week. In a now famous paper from 1965 Gordon Moore predicts that every 1-2 years the number of transistors on an integrated circuit will double, lowering production cost and increasing its capabilities. Even more, in the same paper Moore predicts that “integrated circuits will lead to such wonders as home computers, automatic controls for automobiles and personal portable communication equipment”. Can you imagine today’s world without them? This technological progress has boosted computational power enormously and enabled us to solve larger and larger optimisation problems faster and faster. But, even though the progress has been phenomenal, there is an even greater power available. It’s called mathematics.
The impact of Moore’s law is best illustrated by the cost per transistor. This cost decreased from about $10 per transistor in 1970 to less than $0.000000001 in 2010. That’s less than the cost of ink for one letter of newsprint. It allowed Google to develop self-driving cars, NASA to send satellites into space and allows us to navigate to our destination using real time traffic information. Moreover, it puts computing power at our fingertips and stimulates the application of techniques from Operations Research and artificial intelligence to real world problems.

When looking at the performance improvement over the years there is a remarkable development. Martin Grötschel (citing work by Robert Bixby) reports a 43 million (!) fold speedup over a period of 15 years for solving one of the key problems in optimisation, the linear programming problem. Algorithms to solve linear programs are the most important ingredient of the techniques for solving combinatorial and integer programming problems. They are one of the key tools for an analytics consultant in solving real world decision problems. Grötschel shows that a benchmark production planning problem would take 85 years to solve on 1988 hard- and software, but that it can be solved within 1(!) minute using the latest hard- and software. Breaking the speedup down into a machine independent speedup (better algorithms) and the speedup of computing power shows that the progress in algorithms beats Moore’s law by a factor of 43.
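
As a rough sanity check on these numbers, the breakdown can be sketched in a few lines. The 43 million total, the 15 years and the 85 years come from the text above; the doubling period of 18 months is my own assumption (it lies within the 1-2 years Moore mentions), so the resulting ratio is only indicative.

```python
total_speedup = 43_000_000      # total speedup reported by Grötschel/Bixby
years = 15                      # period over which it was achieved
doubling_period_years = 1.5     # ASSUMPTION: computing power doubles every 18 months

# Speedup attributable to faster hardware under the assumed doubling period.
hardware_speedup = 2 ** (years / doubling_period_years)

# Whatever is left must come from better algorithms.
algorithm_speedup = total_speedup / hardware_speedup
ratio = algorithm_speedup / hardware_speedup

# Cross-check: 85 years expressed in minutes versus a 1 minute solve time.
minutes_in_85_years = 85 * 365.25 * 24 * 60

print(f"hardware speedup : {hardware_speedup:,.0f}x")
print(f"algorithm speedup: {algorithm_speedup:,.0f}x")
print(f"algorithms vs hardware: factor {ratio:.0f}")
print(f"85 years = {minutes_in_85_years:,.0f} minutes")
```

Under this assumption hardware accounts for a factor of roughly a thousand and algorithms for a factor of roughly forty thousand, which is in line with the factor of 43 quoted above.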


With trends like big data, decision models will increase in size and will become more optimisation driven. As Tom Davenport puts it “Although Analytics 3.0 includes all three types [descriptive, predictive, prescriptive], it emphasizes the last”. Davenport predicts that prescriptive models will be embedded into key processes and support us in our everyday decision making. This requires the models to be fast and robust. Technological progress is not the only power that enables this; mathematics does too. And mathematics seems to have the upper hand in this.

Sunday, 5 April 2015

Do numbers really speak for themselves with big data?
Chris Anderson, former editor in chief of Wired was clear about it in his provocative essay “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”. He states that with enough data, computing power and statistical algorithms we can find patterns where science cannot. No need for theory, formal methods to test validity and causation. Correlation is enough, according to Anderson and with him many others.

How would this work in practice? Suppose we would like to create a prediction model for some variable Y. This could for example be the stock price of a company, the click-through rates of online ads or next week’s weather. Next we gather all the data we can lay our hands on and put it in some statistical procedure to find the best possible prediction model for Y. A common procedure is to first estimate the model using all the variables, screen out the unimportant ones (the ones not significant at some predefined significance level), re-estimate the model with the selected subset of variables and repeat this procedure until a significant model is found. Simple enough, isn't it?

Anderson’s suggested way of analysis has some serious drawbacks however. Let me illustrate. Following the above example, I created a set of data points for Y by drawing 100 samples from a uniform distribution between zero and one, so it’s random noise. Next I created a set of 50 explanatory variables X(i) by drawing 100 samples from a uniform distribution between zero and one for each of them. So, all 50 explanatory variables are random noise as well. I estimate a linear regression model using all X(i) variables to predict Y. Since nothing is related (all variables are uniformly distributed and independent) an R squared of zero is expected, but in fact it isn't. It turns out to be 0.5. Not bad for a regression based on random noise! Luckily, the model is not significant. The variables that are not significant are eliminated step by step and the model re-estimated. This procedure is repeated until a significant model is found. After a few steps a significant model is found with an Adjusted R squared of 0.4 and 7 variables at a significance level of at least 99%. Again, we are regressing random noise, there is absolutely no relationship in it, but still we find a significant model with 7 significant parameters. This is what would happen if we just feed data to statistical algorithms to go find patterns.
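
The first step of this experiment is easy to repeat. The sketch below is not the code from my GitHub (which contains the full example); it is a dependency-free illustration that fits an ordinary least squares regression on pure noise and reports the R squared. The helper names `solve` and `r_squared` and the seed are arbitrary choices, and the stepwise elimination of variables is left out.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def r_squared(xs, y):
    """R squared of an OLS fit of y on the columns of xs plus an intercept."""
    rows = [[1.0] + list(row) for row in xs]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


random.seed(2015)
n_obs, n_vars = 100, 50
y = [random.random() for _ in range(n_obs)]                            # noise
xs = [[random.random() for _ in range(n_vars)] for _ in range(n_obs)]  # more noise

r2 = r_squared(xs, y)
print(f"R squared when regressing noise on noise: {r2:.2f}")
```

With 50 unrelated regressors and 100 observations the expected R squared is about 50/99, so a value around 0.5 says nothing about the data and everything about the number of variables thrown at it.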

So yes, Chris Anderson is right. With data, enough computing power and statistical algorithms patterns will be found. But are these patterns of any interest? Not many of them will be, as spurious patterns vastly outnumber the meaningful ones. Anderson’s recipe for analysis lacks the scientific rigour required to find meaningful insights that can change our decision making for the better. Data will never speak for itself, we give numbers their meaning, the Volume, Variety or Velocity of data cannot change that.

Remark: Details of the regression example can be found on my GitHub

Tuesday, 10 March 2015

A toast to Occam’s razor; Accuracy vs Interpretability

A question that I get asked a lot these days is how to make the trade-off between model accuracy and model interpretability when selecting a predictive model. The reason for this is that methods like neural nets and random forests are becoming more popular in predictive analytics. They tend to generate more accurate predictions than traditional statistical methods like a logistic regression but are much harder to interpret. Some practitioners, following Occam’s razor principle, prefer simple methods over complex ones in supporting their customers. And I agree, most non mathematically trained people would be able to understand a logistic regression, but would have trouble understanding a neural net or a random forest. But sacrificing accuracy for interpretability? It’s a rather simplistic interpretation of Occam’s razor to prefer simple over complex models. Occam’s advice is to choose the simplest model in case the competing models have the same predictive ability. So he puts model accuracy first!

One of the golden rules in analytics consulting is that a customer needs to trust the analytic methods you use before your customer is willing to accept and implement the outcomes of your analysis. Understanding the analytics method and the outcomes (interpretability) is one way for your customer to gain trust. For a simple model or method this is relatively easy, but what if the method becomes more complex? It would require your customer to become a mathematician to understand the model you created and verify if it is correct, but there is no need to do so. Objectively reporting the model quality is another way. For example by reporting the model calibration results (how well did the model fit the data) or its predictive accuracy. To show the predictive accuracy of a model a simple and straightforward method is to use a confusion matrix and report performance indicators derived from it.

Suppose you want to predict the quality of wine based on its chemical components. You are considering a logistic regression and a random forest and want to select the best model. First both models are trained, in this case using the data from the UCI Machine Learning Repository which contains results of the chemical analysis of 6497 Portuguese "Vinho Verde" wines. To test both models, the quality of wines is predicted for a randomly selected subset of the wines which was excluded from the data before training. The results of the tests are summarized in the confusion matrices below. For both models, the matrix compares the predicted quality of 1948 wines with their true classification.

Based on the confusion matrix several criteria can be constructed to assess the prediction quality of the trained models. Criteria such as

  • Accuracy, the proportion of correct predictions
  • Error rate, 1 - Accuracy
  • Sensitivity, the proportion of correctly predicted good quality wines versus the total number of good quality wines
  • Specificity, the proportion of correctly predicted bad quality wines versus the total number of bad quality wines
  • Lift, the ratio of the proportion of predicted good wines that really are good to the proportion of good wines in the whole test set. So, it measures how much better the positive classifications of the model are than picking wines at random.
  • False Positive Rate, proportion of actual negatives that are incorrectly predicted positive
  • False Negative Rate, proportion of actual positives that are incorrectly predicted negative
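
The criteria above can be computed directly from the four cells of a confusion matrix. The sketch below is only an illustration: the counts are made up (they merely add up to 1948) and are not the results of either model, and the lift formula reflects my reading of the definition above, precision divided by the share of good wines.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Performance measures from a 2x2 confusion matrix (good wine = positive)."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)        # correctly predicted good wines
    specificity = tn / (tn + fp)        # correctly predicted bad wines
    precision = tp / (tp + fp)          # predicted good wines that are good
    prevalence = (tp + fn) / total      # share of good wines in the test set
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lift": precision / prevalence,
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


# HYPOTHETICAL counts for 1948 test wines, for illustration only.
metrics = confusion_metrics(tp=1000, fn=230, fp=300, tn=418)
for name, value in metrics.items():
    print(f"{name:>20}: {value:.3f}")
```

Note that the false positive rate is simply 1 - specificity and the false negative rate is 1 - sensitivity, so in practice reporting one of each pair is enough.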

Based on the computed performance measures the random forest model outperforms the logistic regression on all measures. It is the best model to predict Portuguese "Vinho Verde" wine quality. Of course we need to regularly measure the model performance as new data becomes available and update the model if required.

The above example shows that accuracy can require more complex prediction models. It’s also a lesson I have learned in using both classical statistical (econometric) methods and machine learning to create prediction models for my customers. Simple models tend to be worse predictors; adding more variables (more information) increases the accuracy of predictions. As the inventor of the random forest algorithm Leo Breiman states in Statistical Modelling: The Two Cultures, in predictive modelling the primary goal is to supply accurate predictions, not interpretability. Focus should therefore be on accuracy and when models are level on that score, follow Occam and choose the simplest one.


  1. The R code that I used for this blog can be found on my GitHub
  2. All estimation procedures used for this blog are part of the caret (Classification And REgression Training) package in R

Sunday, 18 January 2015

Is Big Data Objective, Truthful and Credible?

In the past few years the attention for big data has grown enormously. Both business and science are focused on the use of large datasets to find answers to previously unsolvable questions. In the size of the data there seems to hide some kind of magic, which will answer any question that can be imagined. As former Wired editor-in-chief Chris Anderson puts it: “with enough data, the numbers speak for themselves.” As if massive data sets and some predictive analytics always will reflect the objective truth. But can big data really deliver on that promise? Is big data objective, truthful and credible?
Source : IBM Global Data Growth
The amount of digital data has grown tremendously in the past few years. It is predicted that this year we will reach around 8 zettabytes worldwide. The amount of data is growing and is expected to grow exponentially because more and more devices are connected to the internet (the internet of things). A second factor stimulating the growth of data is the use of social media. It is expected that the total amount of data with an IP address will reach a whopping 44 zettabytes by 2020. Can we treat all that new data the same as the data from traditional sources, like the ERP system? Let’s start with social media content. To what level do you trust the content of a customer review, a tweet or Facebook post? How to distinguish rumours from fact and how to deal with contradicting information? Also, can we really expect that we have all data? There are many examples in which observation bias negatively impacts the outcomes of an analysis based on social media content. See Kate Crawford’s HBR post on this. Even companies like Google struggle with it, as became clear in their overestimation of the number of flu infections. I guess it’s fair to say that social media content is highly uncertain in both expression and content. With sensor data it’s not better. If you use a satnav system you will probably know what I mean. Try navigating the inner-city streets of Amsterdam and you’ll see measurement error in action. Due to measurement errors, but also sensor malfunctions, approximation errors, sampling errors, etc., sensor data is highly uncertain as well. So although the amount of data grows (exponentially), the uncertainty in the data grows as well (exponentially).

Decision makers must understand the impact of data uncertainty on their decisions and should think of ways of making this impact explicit. This is not new and is not dependent on whether the data comes from a big data source or not. Data uncertainty has been around ever since the first optimisation model was created. In practice this uncertainty is simplified by using a single measure, for example the minimum, maximum or average. The impact of that simplification is manifold as Sam Savage explains in The Flaw of Averages. Without explicitly taking into account the uncertainty in (big) data, the outcomes of optimisation models using that data are no better than a wild guess. With the high level of uncertainty of big data, explicitly taking into account the data uncertainty is even more important. Luckily Operations Research offers various ways to incorporate this uncertainty into the modelling and changes a wild guess into an informed decision. Some well-known approaches are what-if analysis, fuzzy logic, robust optimisation and simulation.
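
A small simulation shows what goes wrong when uncertainty is replaced by a single average. The sketch below is a made-up stocking example in the spirit of Savage’s flaw of averages, not taken from his book: the demand range, price, cost and order quantity are all illustrative assumptions.

```python
import random


def profit(demand, order_qty=100, price=10.0, unit_cost=6.0):
    """Profit when unsold stock is worthless and unmet demand is lost."""
    return price * min(demand, order_qty) - unit_cost * order_qty


random.seed(42)
# ASSUMPTION: demand is uncertain, uniformly distributed between 50 and 150.
demands = [random.uniform(50, 150) for _ in range(100_000)]

average_demand = sum(demands) / len(demands)
profit_at_average_demand = profit(average_demand)                 # plan on the average
average_profit = sum(profit(d) for d in demands) / len(demands)   # simulate instead

print(f"average demand          : {average_demand:.1f}")
print(f"profit at average demand: {profit_at_average_demand:.0f}")
print(f"average simulated profit: {average_profit:.0f}")
```

Planning on the average demand promises a profit of about 400, while the simulation shows that the profit to expect is closer to 275, because high demand cannot make up for what is lost when demand is low.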

Big data is not objective, truthful or credible; it’s a creation of human design and therefore biased. Numbers get their meaning because we draw inferences from them. Biases in the data collection, data analysis and modelling stages present considerable risks to decision quality, and are as important to the big-data equation as the numbers themselves. Decision makers must know about this uncertainty and know how it will impact decision making.

“Not everything that counts can be counted,

and not everything that can be counted counts.”

Albert Einstein