Sunday, 5 April 2015

Do numbers really speak for themselves with big data?

http://xkcd.com/1289/
Chris Anderson, former editor in chief of Wired, was clear about it in his provocative essay “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”. He states that with enough data, computing power and statistical algorithms, we can find patterns where science cannot. There is no need for theory or formal methods to test validity and causation; correlation is enough, according to Anderson and many others with him.

How would this work in practice? Suppose we would like to create a prediction model for some variable Y. This could, for example, be the stock price of a company, the click-through rate of online ads or next week’s weather. Next, we gather all the data we can lay our hands on and feed it into some statistical procedure to find the best possible prediction model for Y. A common procedure is to first estimate the model using all the variables, screen out the unimportant ones (the ones not significant at some predefined significance level), re-estimate the model with the selected subset of variables, and repeat this until a significant model is found. Simple enough, isn't it?
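As a minimal sketch, this naive backward-elimination recipe could look something like the following in Python (the function name, the choice of statsmodels and the 5% cut-off are illustrative assumptions, not part of Anderson’s argument):

```python
# A naive backward-elimination recipe: keep dropping the least significant
# predictor and refitting until every remaining predictor is "significant".
import numpy as np
import statsmodels.api as sm


def backward_eliminate(y, X, alpha=0.05):
    """Return a fitted OLS model containing only predictors with p < alpha,
    together with the column indices that survived the elimination."""
    cols = list(range(X.shape[1]))
    while cols:
        model = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
        pvals = model.pvalues[1:]        # p-values of the predictors (skip the intercept)
        worst = int(np.argmax(pvals))    # least significant remaining predictor
        if pvals[worst] < alpha:         # everything left passes the threshold
            return model, cols
        cols.pop(worst)                  # drop it and refit
    return None, []                      # nothing survived the elimination
```

Note that nothing in this loop asks whether the surviving predictors make any substantive sense; it only chases p-values.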

Anderson’s suggested way of analysis has some serious drawbacks, however. Let me illustrate. Following the above example, I created a set of data points for Y by drawing 100 samples from a uniform distribution between zero and one, so it’s pure random noise. Next, I created a set of 50 explanatory variables X(i) by drawing 100 samples from a uniform distribution between zero and one for each of them, so all 50 explanatory variables are random noise as well. I then estimated a linear regression model using all X(i) variables to predict Y. Since nothing is related (all variables are uniformly distributed and independent), one might expect an R squared of zero, but in fact it turns out to be 0.5. Not bad for a regression based on random noise! Luckily, the model is not significant. Next, the variables that are not significant are eliminated step by step and the model is re-estimated, repeating this procedure until a significant model is found. After a few steps a significant model appears, with an adjusted R squared of 0.4 and 7 variables that are significant at the 99% level. Again, we are regressing random noise; there is absolutely no relationship in it, yet we still find a significant model with 7 significant parameters. This is what happens if we just feed data to statistical algorithms and let them go find patterns.
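A minimal sketch of the experiment, reusing the backward_eliminate helper and imports from above (the exact script is on GitHub; this is only an illustration, and since everything is random noise the numbers will vary from run to run). With 50 noise predictors fitted to only 100 observations, chance alone pushes the in-sample R squared to roughly k/(n-1) ≈ 0.5, which is why the full model appears to fit so well:

```python
# The noise experiment: 100 observations of a pure-noise target and 50
# pure-noise predictors, fed to the backward-elimination recipe above.
rng = np.random.default_rng(1)        # any seed; the effect is not seed-specific
n, k = 100, 50

y = rng.uniform(0, 1, n)              # Y: 100 draws from U(0, 1), i.e. random noise
X = rng.uniform(0, 1, (n, k))         # 50 explanatory variables, also random noise

full = sm.OLS(y, sm.add_constant(X)).fit()
print("Full model R squared:    ", round(full.rsquared, 2))    # around 0.5 by chance alone
print("Full model F-test p-value:", round(full.f_pvalue, 2))   # typically not significant

model, kept = backward_eliminate(y, X, alpha=0.05)
if model is not None:
    print("Predictors retained:     ", len(kept))
    print("Adjusted R squared:      ", round(model.rsquared_adj, 2))
    print("Max p-value of retained: ", round(float(model.pvalues[1:].max()), 3))
```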

So yes, Chris Anderson is right: with enough data, computing power and statistical algorithms, patterns will be found. But are these patterns of any interest? Not many of them will be, as spurious patterns vastly outnumber the meaningful ones. Anderson’s recipe for analysis lacks the scientific rigour required to find meaningful insights that can change our decision making for the better. Data will never speak for itself; we give numbers their meaning, and no Volume, Variety or Velocity of data can change that.

Remark: Details of the regression example can be found on my GitHub.
