The dangers of data dredging
A major goal of any scientific analysis is to answer real-world questions by studying measured variables. We may ask questions such as “Is the earth warmer this year than last year?”, or “Is diabetes associated with kidney disease?”, or even “Is Donald Trump leading the field of Republican candidates?”. Each of these questions encodes a specific hypothesis, and given the right type of data, we might hope to test that hypothesis with real-world evidence.
But what if we flipped the script? Instead of starting with a specific hypothesis, we start with some data and analyze it with an open mind. This is the scenario that much of modern data-science is facing. At Koneksa, we analyze large databases, so we face this situation regularly. The great thing about these data-sets is that they give us the opportunity to test hundreds of hypotheses. But we have to be careful. In fact, the more hypotheses we test, the more careful we have to be. The fundamental challenge is that the more places we look for something interesting, the more likely it is that something will appear interesting just due to random chance.
Consider a simple example of flipping coins. If we flip a coin 10 times and get 10 heads in a row, we would be surprised. We could “quantify our surprise” with a p-value (more on p-values later) and, given the low p-value, we might conclude that our coin had a statistically significant bias towards producing heads. But what if we had 1000 coins, and flipped each one 10 times? If one of those coins came up heads 10 times in a row, this would be far less surprising. In fact, it would be more or less expected based on random chance. This “multiple testing problem” applies when we analyze a big data-set too. If we test for a relationship between every pair of variables in our data, we may test millions of relationships. Just by chance, some of these relationships may appear surprisingly strong.
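To put rough numbers on the coin example, here is a minimal Python sketch. It assumes every coin is fair and every flip is independent; the function names are ours, chosen only for illustration.

```python
def prob_all_heads(flips: int) -> float:
    """Chance that one fair coin lands heads on every one of `flips` tosses."""
    return 0.5 ** flips


def prob_at_least_one_all_heads(coins: int, flips: int) -> float:
    """Chance that at least one of `coins` fair coins lands all heads."""
    p = prob_all_heads(flips)
    # Complement of "no coin at all produces an all-heads run".
    return 1.0 - (1.0 - p) ** coins


def expected_all_heads(coins: int, flips: int) -> float:
    """Expected number of fair coins that land all heads."""
    return coins * prob_all_heads(flips)


if __name__ == "__main__":
    print(f"one coin, 10 flips: {prob_all_heads(10):.5f}")
    print(f"1000 coins, at least one: {prob_at_least_one_all_heads(1000, 10):.3f}")
    print(f"1000 coins, expected count: {expected_all_heads(1000, 10):.3f}")
```

For a single coin, the chance of 10 heads in a row is 1 in 1024, a little under 0.001. With 1000 fair coins, the chance that at least one of them does it is roughly 0.62, and we would expect about one such coin on average.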
If we simply report these as likely positives, we’re in danger of reporting relationships that don’t hold up in the real world. That’s called data dredging and is a sin in the world of data-science. Under some scenarios, the vast majority of results reported by data dredging will actually be products of random chance. To avoid this, we can simply raise our threshold for being surprised. We ask if an observed relationship is statistically significant after accounting for the number of hypotheses tested. Is a relationship so strong that we wouldn’t expect to observe it by chance even after testing a million hypotheses?
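One common way to raise the threshold is the Bonferroni correction, which divides the usual significance level by the number of hypotheses tested. The post does not say which correction we have in mind, so treat the sketch below as one illustrative option; the function names and the 0.05 default are assumptions made for the example.

```python
def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


def is_significant(p_value: float, num_tests: int, alpha: float = 0.05) -> bool:
    """True if p_value still clears the threshold after correcting for num_tests."""
    return p_value < bonferroni_threshold(alpha, num_tests)


if __name__ == "__main__":
    # One-sided p-value for 10 heads in 10 flips of a fair coin.
    p_ten_heads = 0.5 ** 10
    print(is_significant(p_ten_heads, num_tests=1))     # a single coin tested
    print(is_significant(p_ten_heads, num_tests=1000))  # one of 1000 coins tested
    print(bonferroni_threshold(0.05, 1_000_000))        # bar for a million tests
```

Tested on its own, the 10-heads coin clears the usual 0.05 bar. As one of 1000 coins, it no longer does, because the corrected bar drops to 0.00005. Bonferroni is deliberately conservative; less strict alternatives exist, such as procedures that control the false discovery rate.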
By carefully accounting for the hypotheses we are testing, we can ensure that our results are not simply due to random chance. Indeed, when it comes to data-science, we must seek carefully if we want to find truth.
By Gaurav Bhatia, Data Scientist at Koneksa Health