Saturday, 25 April 2015

What’s stronger than Moore’s law?

Moore’s law turned 50 this week. In a now famous paper from 1965, Gordon Moore predicted that the number of transistors on an integrated circuit would double every one to two years, lowering production cost and increasing capabilities. In the same paper Moore also predicted that “integrated circuits will lead to such wonders as home computers, automatic controls for automobiles and personal portable communication equipment”. Can you imagine today’s world without them? This technological progress has boosted computational power enormously and enabled us to solve ever larger optimisation problems ever faster. But even though the progress has been phenomenal, an even greater power is available. It’s called mathematics.
Source: https://www.cis.upenn.edu/~cis501/papers/mooreslaw-reprint.pdf
The impact of Moore’s law is best illustrated by the cost per transistor, which decreased from about $10 in 1970 to less than $0.000000001 in 2010. That’s less than the cost of ink for one letter of newsprint. It allowed Google to develop self-driving cars and NASA to send satellites into space, and it lets us navigate to our destination using real-time traffic information. Moreover, it puts computing power at our fingertips and stimulates the application of techniques from Operations Research and artificial intelligence to real-world problems.

Looking at the performance improvement over the years, there is a remarkable development. Martin Grötschel (reporting work by Robert Bixby) documents a 43-million-fold (!) speedup over a period of 15 years for solving one of the key problems in optimisation, the linear programming problem. Algorithms for solving linear programs are the most important ingredient of techniques for combinatorial and integer programming, and they are one of the key tools of an analytics consultant in solving real-world decision problems. Grötschel shows that a benchmark production planning problem would have taken 85 years to solve with the hardware and software of 1988, but can be solved within 1 (!) minute using the latest hardware and software. Breaking the speedup down into a machine-independent (algorithmic) part and a computing-power part shows that the progress in algorithms beats Moore’s law by a factor of 43.
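To see where that factor of 43 comes from, here is a back-of-the-envelope decomposition. The rounded 1,000-fold and 43,000-fold figures are the split commonly quoted for Bixby’s benchmark; treat them as approximations rather than exact numbers taken from Grötschel’s slides.

```python
# Rough decomposition of the 43-million-fold speedup (rounded figures):
# roughly 1,000x is usually attributed to faster machines and roughly
# 43,000x to machine-independent algorithmic progress.
machine_speedup = 1_000
algorithm_speedup = 43_000

print(machine_speedup * algorithm_speedup)    # 43,000,000-fold in total
print(algorithm_speedup // machine_speedup)   # algorithms win by a factor of 43
```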

Source: http://www.math.washington.edu/mac/talks/20090122SeattleCOinAction.ppt

With trends like big data, decision models will increase in size and become more optimisation driven. As Tom Davenport puts it, “Although Analytics 3.0 includes all three types [descriptive, predictive, prescriptive], it emphasizes the last”. Davenport predicts that prescriptive models will be embedded into key processes and support us in our everyday decision making. This requires the models to be fast and robust. Technological progress is not the only power that enables this; mathematics does too. And mathematics seems to have the upper hand.

Sunday, 5 April 2015

Do numbers really speak for themselves with big data?

http://xkcd.com/1289/
Chris Anderson, former editor-in-chief of Wired, was clear about it in his provocative essay “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”. He states that with enough data, computing power and statistical algorithms we can find patterns where science cannot; there is no need for theory, nor for formal methods to test validity and causation. Correlation is enough, according to Anderson and many others with him.

How would this work in practice? Suppose we would like to create a prediction model for some variable Y. This could, for example, be the stock price of a company, the click-through rate of online ads or next week’s weather. Next we gather all the data we can lay our hands on and feed it into some statistical procedure to find the best possible prediction model for Y. A common procedure is to first estimate the model using all the variables, screen out the unimportant ones (those not significant at some predefined significance level), re-estimate the model with the selected subset of variables, and repeat until a significant model is found. Simple enough, isn’t it?
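As a rough sketch of that backward elimination procedure, the loop below uses statsmodels. This is my own illustration, not the exact code behind the example that follows; the function name and the 0.01 threshold are choices made here for clarity.

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(X, y, alpha=0.01):
    """Repeatedly drop the least significant variable and re-estimate
    until every remaining coefficient is significant at level alpha."""
    X = sm.add_constant(pd.DataFrame(X))
    while True:
        model = sm.OLS(y, X).fit()
        pvalues = model.pvalues.drop("const")        # ignore the intercept
        if pvalues.empty or pvalues.max() <= alpha:  # all survivors significant
            return model
        X = X.drop(columns=[pvalues.idxmax()])       # drop the worst and refit
```

Many statistical packages offer a variant of this stepwise selection out of the box; the sketch only makes the mechanics explicit.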

Anderson’s suggested way of analysis has some serious drawbacks, however. Let me illustrate. Following the above example, I created a set of data points for Y by drawing 100 samples from a uniform distribution between zero and one, so Y is pure random noise. Next I created a set of 50 explanatory variables X(i), each again consisting of 100 samples drawn from a uniform distribution between zero and one. So all 50 explanatory variables are random noise as well. I then estimated a linear regression model using all X(i) variables to predict Y. Since nothing is related (all variables are uniformly distributed and independent), one would expect an R squared of zero, but in fact it turns out to be about 0.5. Not bad for a regression on random noise! Luckily, the model as a whole is not significant. Next, the variables that are not significant are eliminated step by step and the model is re-estimated, repeating this procedure until a significant model is found. After a few steps a significant model emerges with an adjusted R squared of 0.4 and 7 variables significant at the 99% level. Again, we are regressing random noise, there is absolutely no relationship in it, yet we still find a significant model with 7 significant parameters. This is what happens if we just feed data to statistical algorithms and let them go find patterns.
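Here is a minimal sketch of that experiment, reusing the backward_eliminate function from the sketch above. The seed is arbitrary and the exact numbers will differ from run to run (and from the version on GitHub), but the qualitative outcome is the same.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2015)       # arbitrary seed
y = rng.uniform(size=100)               # 100 observations of pure noise
X = rng.uniform(size=(100, 50))         # 50 explanatory variables, also pure noise

full = sm.OLS(y, sm.add_constant(X)).fit()
print(full.rsquared)                    # around 0.5, despite zero true signal
print(full.f_pvalue)                    # the full model is usually not significant

reduced = backward_eliminate(X, y, alpha=0.01)
print(reduced.summary())                # a handful of "significant" noise predictors remain
```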

So yes, Chris Anderson is right: with enough data, computing power and statistical algorithms, patterns will be found. But are these patterns of any interest? Few of them will be, as spurious patterns vastly outnumber the meaningful ones. Anderson’s recipe for analysis lacks the scientific rigour required to find meaningful insights that can change our decision making for the better. Data will never speak for itself; we give numbers their meaning, and no Volume, Variety or Velocity of data can change that.

Remark: Details of the regression example can be found on my GitHub.