Sunday, 18 January 2015

Is Big Data Objective, Truthful and Credible?

In the past few years, attention to big data has grown enormously. Both business and science are focused on using large datasets to find answers to previously unsolvable questions. There seems to be some kind of magic hidden in the sheer size of the data, as if it could answer any question we can imagine. As former Wired editor-in-chief Chris Anderson puts it: “with enough data, the numbers speak for themselves.” As if massive datasets and some predictive analytics will always reflect the objective truth. But can big data really deliver on that promise? Is big data objective, truthful and credible?
(Figure: IBM Global Data Growth)
The amount of digital data has grown tremendously in the past few years. It is predicted that this year we will reach around 8 zettabytes worldwide, and the volume is expected to keep growing exponentially as more and more devices are connected to the internet (the internet of things). A second factor driving this growth is the use of social media. The total amount of data with an IP address is expected to reach a whopping 44 zettabytes by 2020.

Can we treat all that new data the same as data from traditional sources, like the ERP system? Let's start with social media content. To what extent do you trust the content of a customer review, a tweet or a Facebook post? How do we separate rumour from fact, and how do we deal with contradicting information? Also, can we really expect to have all the data? There are many examples in which observation bias negatively impacts the outcomes of an analysis based on social media content; see Kate Crawford's HBR post on this. Even companies like Google struggle with it, as became clear from their overestimation of the number of flu infections. I guess it's fair to say that social media content is highly uncertain in both expression and content.

With sensor data it's no better. If you use a satnav system you will probably know what I mean: try navigating the inner-city streets of Amsterdam and you'll see measurement error in action. Due to measurement errors, but also sensor malfunctions, approximation errors, sampling errors, etc., sensor data is highly uncertain as well. So although the amount of data grows (exponentially), the uncertainty in the data grows as well (exponentially).
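A small sketch of why more data does not automatically mean more truth: if a data source carries a systematic bias (like the observation bias above), collecting more observations only makes us more confident in the wrong answer. The numbers below (true value, bias, noise level) are illustrative assumptions, not figures from any real sensor.

```python
import random

random.seed(1)

# Hypothetical sensor: it reads 0.5 units too high on average (systematic
# bias), plus random noise. The true value is 20.0.
true_value = 20.0
bias = 0.5

def reading():
    return true_value + bias + random.gauss(0, 2.0)

# Averaging more readings shrinks the random noise, but the estimate
# converges to the *biased* value (20.5), not the truth (20.0).
estimates = {}
for n in (10, 1_000, 100_000):
    estimates[n] = sum(reading() for _ in range(n)) / n
    print(n, round(estimates[n], 2))
```

More data defeats random noise, but it never defeats bias; that distinction is exactly what "the numbers speak for themselves" glosses over.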

Decision makers must understand the impact of data uncertainty on their decisions and should think of ways to make this impact explicit. This is not new, and it does not depend on whether the data comes from a big data source or not. Data uncertainty has been around ever since the first optimisation model was created. In practice this uncertainty is simplified by using a single measure, for example the minimum, maximum or average. The impact of that simplification is manifold, as Sam Savage explains in The Flaw of Averages. Without explicitly taking the uncertainty in (big) data into account, the outcomes of optimisation models using that data are no better than a wild guess. Given the high level of uncertainty in big data, explicitly accounting for data uncertainty is even more important. Luckily, Operations Research offers various ways to incorporate this uncertainty into the modelling, turning a wild guess into an informed decision. Some well-known approaches are what-if analysis, fuzzy logic, robust optimisation and simulation.
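The Flaw of Averages can be shown in a few lines with simulation, one of the Operations Research approaches mentioned above. The scenario and numbers here are invented for illustration: a project with two parallel tasks, each averaging 10 days but uncertain, where the project finishes when the slowest task finishes. Planning on averages says 10 days; the simulated average finish is noticeably later.

```python
import random

random.seed(42)

# Each task takes 10 days on average, but is uncertain: anywhere
# between 5 and 15 days, uniformly.
def task_duration():
    return random.uniform(5, 15)

# Plan based on averages: both tasks "take 10 days", and the project
# ends when the slowest one finishes, so the plan says 10 days.
plan_on_averages = max(10, 10)

# Monte Carlo simulation: average the actual finish time over many
# scenarios. E[max(A, B)] is larger than max(E[A], E[B]).
n = 100_000
simulated = sum(max(task_duration(), task_duration()) for _ in range(n)) / n

print(plan_on_averages)        # 10
print(round(simulated, 1))     # roughly 11.7 days
```

The average of the inputs is correct, yet the plan built on those averages is systematically optimistic; that is the flaw, and it is exactly why simulation and robust optimisation earn their keep.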

Big data is neither objective, truthful nor credible; it is a creation of human design and therefore biased. Numbers get their meaning because we draw inferences from them. Biases in the data collection, data analysis and modelling stages present considerable risks to decision quality, and are as important to the big-data equation as the numbers themselves. Decision makers must be aware of this uncertainty and understand how it will impact decision making.

”Not everything that counts can be counted, 

and not everything that can be counted counts.”

Albert Einstein