Correlation, Causation and Confirmation Bias

This post is less an indictment of data mining, and more a word of caution.  Data mining should never be a marketer’s first strategy for gaining insight, and a combination of logic and psychology can show us why.

A special report in The Economist, referenced in the post “Data Exhaust, from the Woods to the Highway”, had this to say about the overwhelming amount of data being collected:

Sophisticated quantitative analysis is being applied to many aspects of life, not just missile trajectories or financial hedging strategies, as in the past. For example, Farecast, a part of Microsoft’s search engine Bing, can advise customers whether to buy an airline ticket now or wait for the price to come down by examining 225 billion flight and price records. The same idea is being extended to hotel rooms, cars and similar items. Personal-finance websites and banks are aggregating their customer data to show up macroeconomic trends, which may develop into ancillary businesses in their own right. Number-crunchers have even uncovered match-fixing in Japanese sumo wrestling. (from “Data, Data Everywhere”)

After the awesome size of “225 billion” wears off, I start to wonder: what if I had that dataset of 225 billion flight and price records, and decided the price of farm machinery and the population of zebras in central Africa were the deciding variables in price prediction? You would think it a terrible price predictor and probably tell me so.

But what if I decided to include the time of day of the flight as well as the weather conditions, using temperature as a proxy for the latter?  What if including both caused variance inflation factors to skyrocket, because lower temperatures are correlated with nighttime?  What if I didn’t have time to check the variance inflation factors, or didn’t think to, and this model was published on the Internet for consumers to use?  What if customers, upon entering their preferred time of departure, received poor predictions because the weight the model applied to that variable was unstable? What if nobody could tell anything was wrong?
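The variance inflation check described above can be sketched in a few lines. This is a minimal illustration on simulated data, not the Farecast model: the predictor "hours from mid-afternoon", the temperature formula and the noise levels are all assumptions made up for the example. With exactly two predictors, each one's variance inflation factor is 1 / (1 - r²), where r is the correlation between them. A common rule of thumb treats values above roughly 5 to 10 as a warning sign.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(xs, ys):
    """VIF shared by both predictors in a two-predictor model: 1 / (1 - r^2)."""
    r = pearson(xs, ys)
    return 1.0 / (1.0 - r * r)


def simulate(noise_sd, n=500, seed=0):
    """Simulated flights (hypothetical data, not real records).

    hours_from_peak: how far the departure is from mid-afternoon (0 to 12).
    temps: temperature falls as departures move toward night, plus noise.
    """
    rng = random.Random(seed)
    hours_from_peak = [rng.uniform(0, 12) for _ in range(n)]
    temps = [25 - 1.0 * h + rng.gauss(0, noise_sd) for h in hours_from_peak]
    return hours_from_peak, temps


# Temperature tracks time of day closely: the two predictors are nearly redundant.
vif_tight = vif_two_predictors(*simulate(noise_sd=1.0))

# Temperature is mostly noise relative to time of day: little redundancy.
vif_loose = vif_two_predictors(*simulate(noise_sd=10.0))

print(f"VIF when temperature tracks time of day: {vif_tight:.1f}")
print(f"VIF when temperature is mostly noise:    {vif_loose:.1f}")
```

A high value does not say which variable to drop; that judgment still has to come from thinking about what temperature and time of day are each supposed to explain.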

The science of data mining (systems, networks, and complex multivariate statistics) is critical to making sense of the new data age, but the bottom line is that most people do not have the time to learn or apply the most complicated algorithms and statistical methods.  In these day-to-day situations, the hidden problems of a strictly data-driven approach grow larger without ever becoming more apparent. The human mind is still the best analytic thinker around, and an analytic framework should always come before the data processing.

It can be very tempting to defer to algorithms or statistical techniques so far beyond our comprehension that we simply trust them to work away behind the scenes, providing us with deep insight as we ask them basic questions (think Google). But, as the current literature on machine learning algorithms and statistics attempts to demonstrate, we still have a long way to go before computers show anything like a human’s theoretically sound pattern-matching, and before statistical methods can reliably infer real causation.

While computers don’t have an intuitive grasp of how theory should be applied to patterns, we humans do, and we have to be careful to apply our intuition to the problem before we apply our computers to the data.  The order matters: if we start from the computer’s output, we risk losing sight of the problem at hand.

If we directly apply our analytical skills to the output from the computer, we are hunting for a solution without ever properly defining the problem.  The actual causal process becomes a needle in a haystack of other temptingly similar looking needles.  And at this point, I think all of us would just choose the needle we like the most.
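The haystack of similar-looking needles is easy to reproduce. The sketch below uses purely random numbers (the row count, the number of candidate variables and the seed are arbitrary assumptions): no candidate has any real relationship to the target, yet searching all of them for the strongest correlation turns up something that looks far more convincing than a single variable chosen in advance.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)
n_rows = 30          # hypothetical number of observations
n_candidates = 200   # hypothetical number of variables we could "mine"

# The target and every candidate are independent random noise.
target = [rng.gauss(0, 1) for _ in range(n_rows)]
candidates = [[rng.gauss(0, 1) for _ in range(n_rows)]
              for _ in range(n_candidates)]

correlations = [abs(pearson(c, target)) for c in candidates]

# A variable chosen before looking at the data (here simply the first one).
pre_specified = correlations[0]

# The variable we would "like the most" after searching every candidate.
best_after_search = max(correlations)

print(f"|r| for the variable chosen in advance: {pre_specified:.2f}")
print(f"|r| for the best of {n_candidates} candidates:       {best_after_search:.2f}")
```

Nothing in the output distinguishes the winning noise variable from a real driver, which is why the hypothesis has to come before the search.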

Researchers have a term for this: it’s called confirmation bias.  And marketers, like researchers, earn their keep by making decisions that are as free of bias as possible.  While data mining has become an essential weapon in the marketing arsenal, consumers remain complex in their behavior, and human intuition is still required to foster their brand relationships.





2 Responses to “Correlation, Causation and Confirmation Bias”



  1. good stuff Nate. you’re right, we need to earn our keep by remaining free of bias, despite an overwhelming amount of tasty data points that the digital world kicks up. simple things like knowing that advertising channels work together, not in isolation. So if search appears to be your most effective ROI channel, it’s probably the result of combined channel influence, not just search itself (that’s a whole other topic however). Point being, experienced intuition is invaluable!

  2. Interesting piece. We use data mining to look at the insights we can take to agencies for their campaigns. There are two big challenges we’re seeing with this. Firstly, you need to have the right kind of people that can look at the statistical complexity of the output while bringing some marketing intuition to understand the noise (i.e. where there’s correlation but no causality) from the insight. Secondly, you need enough time with your own team and the client to communicate that complexity as well as the value it can bring. As the techniques become more sophisticated, the industry needs to keep up with the people it brings in to address these issues.



About Web Liquid

Web Liquid is a digital marketing agency with offices in London, New York and Lagos.
