Monday, 25 May 2015

There is more to analytics than just fishing in the data lake

We live in an era that celebrates technology; we live for the latest gadgets. Data is no longer a scarce resource, and expectations about what can be done with it are rising fast. At the same time, lakes of data are overwhelming and frustrating people, while hardware and software vendors invite us to go on a data fishing trip. They tempt us to spend many euros on data warehouses, hardware, and state-of-the-art analytics software. However, no matter how many euros you spend, if the people who work with the data don’t know how to make sense of it, or are unable to present clearly what they find, that investment is wasted. The problem of our time is not a lack of data, but rather the inability to make sense of it.
In a typical analytics project, data is loaded into software and the “find the best model” button is pushed. According to the vendors’ ads, decision makers can act immediately on the outcomes because the software is guaranteed to find the best possible model. This approach, however, can cause serious problems.
Best practice before running any statistical analysis is to inspect the data visually. In 1973, Anscombe presented four data sets that have become a classic illustration of the importance of visualizing data rather than merely relying on summary statistics or the model-fitting procedures of analytics software. The four data sets are now known as "Anscombe's quartet."
Summarizing the four series shows that their summary statistics are virtually the same. Assuming a simple linear relationship between each X and Y results in four practically identical models, Y = 3.000 + 0.500 X. But are these series indeed the same?
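The comparison described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code behind the original tables; the helper names `fit_line` and `summarize` are made up for this example, and the data values are those published by Anscombe (1973).

```python
from statistics import mean, variance

# Anscombe's quartet (Anscombe, 1973). Sets 1-3 share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b


def summarize(x, y):
    """Collect the summary statistics and fitted line for one series."""
    a, b = fit_line(x, y)
    return {
        "mean_x": mean(x), "var_x": variance(x),
        "mean_y": mean(y), "var_y": variance(y),
        "intercept": a, "slope": b,
    }


summaries = {k: summarize(x, y) for k, (x, y) in quartet.items()}
for k, s in summaries.items():
    print(k, {name: round(value, 2) for name, value in s.items()})
```

The four printed rows are nearly indistinguishable, which is exactly why the numbers alone cannot tell the series apart.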
Things turn out to be very different when we visualise the data. As can be seen directly from the graphs, it’s dangerous to assume you understand the nature of the data just from its summary statistics or the model output. Each of Anscombe’s examples shows an interesting and valid relationship, but only one of them matches the story told by the summary and the fitted model. Set 2 clearly isn’t linear but quadratic. Set 3 is linear, but the outlier (upper right) skews the fitted model. Set 4 is a more extreme example of the effect of an outlier: a linear relation between X and Y doesn’t make any sense in this case.
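The effect of the outlier in set 3 can also be checked numerically. The sketch below refits the line after dropping the point with the largest absolute residual. That rule for flagging the outlier is an assumption made for this illustration (it is not a general outlier test), and `fit_line` is a hypothetical helper, not part of any library.

```python
from statistics import mean

# Set 3 of Anscombe's quartet (Anscombe, 1973).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b


# Fit on all eleven points.
a_all, b_all = fit_line(x, y)

# Flag the point with the largest absolute residual (illustrative rule only).
residuals = [yi - (a_all + b_all * xi) for xi, yi in zip(x, y)]
worst = max(range(len(x)), key=lambda i: abs(residuals[i]))
outlier = (x[worst], y[worst])

# Refit on the remaining ten points.
x_rest = [xi for i, xi in enumerate(x) if i != worst]
y_rest = [yi for i, yi in enumerate(y) if i != worst]
a_rest, b_rest = fit_line(x_rest, y_rest)

print("flagged point:", outlier)
print("all points:      intercept %.3f, slope %.3f" % (a_all, b_all))
print("without outlier: intercept %.3f, slope %.3f" % (a_rest, b_rest))
```

With the flagged point removed, the slope drops noticeably below 0.5 and the intercept rises, so a single observation is driving the model that the summary table reports.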
Data visualisations help us perceive and appreciate the features of the data, but they also let us look behind those features and see what else is there. Good analysis is not a routine matter: it requires switching between graphical displays of the data, model estimation results, and crunching the numbers.
To be successful in analytics, two skills are essential. The first is statistical thinking: the ability to find the insights that live in the data and make sense of them. The second is visual thinking: the ability to see meaningful patterns in data by representing and interacting with them visually. Having lots of data, the latest hardware and software, and the urge to go on a fishing trip are no substitute for these skills.
Code to reproduce the data, tables and graphs can be found on my GitHub page.