Correlation, Causation and Confirmation Bias

This post is less an indictment of data mining, and more a word of caution.  Data mining should never be a marketer’s first strategy for gaining insight, and a combination of logic and psychology can show us why.

A special report in the Economist, referenced in the post “Data Exhaust, from the Woods to the Highway”, had this to say about the overwhelming amount of data being collected:

Sophisticated quantitative analysis is being applied to many aspects of life, not just missile trajectories or financial hedging strategies, as in the past. For example, Farecast, a part of Microsoft’s search engine Bing, can advise customers whether to buy an airline ticket now or wait for the price to come down by examining 225 billion flight and price records. The same idea is being extended to hotel rooms, cars and similar items. Personal-finance websites and banks are aggregating their customer data to show up macroeconomic trends, which may develop into ancillary businesses in their own right. Number-crunchers have even uncovered match-fixing in Japanese sumo wrestling. (from “Data, Data Everywhere”)

After the awesome size of “225 billion” wears off, I start to wonder: what if I had that dataset of 225 billion flight and price records, and decided the price of farm machinery and the population of zebras in central Africa were the deciding variables in price prediction? You would think it a terrible price predictor and probably tell me so.

But what if I decided to include the time of day of the flight as well as the weather conditions, for which I used temperature as a proxy?  What if including both caused variance inflation factors to skyrocket, because lower temperatures are correlated with nighttime?  What if I didn’t have time to check the variance inflation factors, or didn’t think to, and this model was published on the Internet for consumers to use?  What if customers, upon inputting their preferred time of departure, received poor predictions because the weight the model applied to that variable was unreliable? What if nobody could tell anything was wrong?
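To make the variance inflation worry concrete, here is a minimal sketch in Python. It is not Farecast's model: the data are invented, the variable names are hypothetical, and it uses the two-predictor special case, where the variance inflation factor reduces to 1 / (1 - r^2), with r the correlation between the two predictors.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def vif_two_predictors(xs, ys):
    """VIF of either predictor when a regression has exactly two predictors."""
    r = pearson(xs, ys)
    return 1.0 / (1.0 - r * r)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Invented data: 1,000 flights with a random departure hour.
hours = [rng.uniform(0, 24) for _ in range(1000)]

# Hypothetical "time of day" variable: 1 for night flights, 0 otherwise.
is_night = [1.0 if (h < 6 or h >= 20) else 0.0 for h in hours]

# Assumed temperature proxy: about 10 degrees cooler at night, plus noise.
temperature = [20.0 - 10.0 * night + rng.gauss(0, 2) for night in is_night]

# A predictor that has nothing to do with time of day, for comparison.
unrelated = [rng.gauss(0, 1) for _ in hours]

vif_correlated = vif_two_predictors(is_night, temperature)
vif_independent = vif_two_predictors(is_night, unrelated)

print(f"VIF, night flag with temperature: {vif_correlated:.2f}")
print(f"VIF, night flag with unrelated:   {vif_independent:.2f}")
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a warning sign. The check is a few lines long, but it only runs if someone thinks to ask the question first.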

The science of data mining (systems, networks, and complex multivariate statistics) is critical to bringing understanding to the new data age, but the bottom line is that most people do not have the time to learn or apply the most complicated algorithms and statistical methods to their data.  In these day-to-day situations, the hidden problems of a strictly data-driven approach grow larger without ever becoming more apparent. The human mind is still the best analytic thinker around, and an analytic framework, again, should always come before the data processing.

It can be very tempting to defer to algorithms or statistical techniques so far beyond our comprehension that we simply trust them to work away behind the scenes, providing us with deep insight as we ask them basic questions (think Google). But, as the current literature on machine learning and statistics suggests, we are still a long way from computers that match patterns in a theoretically sound, human-like way, and so a long way from statistical methods that can establish real causation.

While computers don’t have an intuitive grasp of how theory should be applied to patterns, we humans do, and we have to be careful to apply our intuition to the problem before we apply our computers to the data.  The order matters: if we start from the computer’s output, we risk overlooking the problem at hand and focusing on the output instead.

If we directly apply our analytical skills to the output from the computer, we are hunting for a solution without ever properly defining the problem.  The actual causal process becomes a needle in a haystack of other temptingly similar-looking needles.  And at this point, I think all of us would just choose the needle we like the most.

Researchers have a term for this: it’s called confirmation bias.  And marketers, like researchers, earn their keep by making decisions that are as free of bias as possible.  While data mining has become an essential weapon in the marketing arsenal, consumers remain complex in their behavior, and human intuition is still required to foster their brand relationships.
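The haystack of similar-looking needles can be simulated in a few lines. The sketch below is an illustration on made-up noise, not an analysis of any real marketing data: it generates one random target and 200 random candidate predictors that are unrelated to it by construction, then picks the candidate that happens to correlate best. The sizes (30 observations, 200 candidates) are arbitrary assumptions.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)  # fixed seed so the sketch is repeatable
n_obs, n_candidates = 30, 200

# The target and every candidate are independent random noise,
# so no candidate has any real relationship with the target.
target = [rng.gauss(0, 1) for _ in range(n_obs)]
candidates = [
    [rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_candidates)
]

# Search the haystack for the candidate with the strongest correlation.
correlations = [pearson(candidate, target) for candidate in candidates]
best_index = max(range(n_candidates), key=lambda i: abs(correlations[i]))
best_r = correlations[best_index]

print(f"Best of {n_candidates} noise predictors: #{best_index}, r = {best_r:.2f}")
```

With this many candidates and so few observations, the winning correlation is usually large enough to look meaningful even though it is pure chance. Deciding which variables should matter before searching is what keeps us from adopting it.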

About Web Liquid

Web Liquid is a digital marketing agency with offices in London, New York and Lagos.