"Pornhounds are friendlier," and other data-mining trivia
There's a story in a recent issue of Time about data mining. Apart from the usual big-brotheresque material, a couple of points struck me as really kooky, including:
A major hotel chain discovered that guests who opted for X-rated flicks spent more money and were less likely to make demands on the hotel staff, according to privacy consultant Larry Ponemon. These low-maintenance customers were rewarded with special frequent-traveler promotions. Victoria's Secret stopped uniformly stocking its stores once MicroStrategy showed that the chain sold 20 times as many size-32 bras in New York City as in other cities ...
A loan company using predictive-analysis software from Sightward, based in Bellevue, Wash., discovered that the No. 1 indicator of whether Web applicants will go through with a loan rather than merely check current quotes was whether they voluntarily identified their gender on the website.
What interests me about data-mining is that it throws all the old scientific warnings out the window. Scientists caution us that merely finding a link between two things doesn't mean they have any logical connection. Correlation, they insist, is not causation.
In contrast, data mining is all about finding zillions and zillions of new correlations, in a desperate bid to cash in on them. Screw the scientific method; it's like a sort of conspiracy theory on a mass scale -- finding as many loopy threads as possible to knit together the chaos of the world. Except of course, many of those loopy threads actually do turn out to be useful. For whatever reason, knowing someone's gender -- and knowing that the person is willing to reveal it to you -- means they really want a loan. Who knew?
The point is: Correlation may not be causation, but who cares? If you're a scientist, you want to prove X causes Y, because you're trying to figure out the laws of nature. Marketers, on the other hand, have no such standard of proof. They find out that people who rent porn are friendly customers. Who cares why? It's still useful info. Subrational processes are more useful than we think they are -- and our machines are proving it.
Posted by Clive Thompson at January 07, 2003 01:37 PM
Hey Clive,
Sure, whether you understand the cause is often irrelevant if you find a correlation. However, the real pitfall with this sort of data mining is that a correlation in past data doesn't necessarily imply a correlation in future data. If you look at enough samples of random numbers, some of them are going to have an empirical correlation in your sample just by chance. This is how we get garbage like the idea that the Super Bowl predicts the stock market.
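To make the "enough random numbers" point concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the function names, the 1,000 candidate series, the 20-observation sample and the seed are all assumptions, not anything from the Time story. It mines pure noise for the series that best "predicts" a noise target, then checks how that kind of winner behaves on fresh draws.

```python
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spurious_demo(n_candidates=1000, n_in=20, n_out=2000, seed=0):
    """Mine pure noise for the best 'predictor' of a noise target.

    Returns (best in-sample correlation, correlation on fresh data).
    All sizes and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)

    def noise(n):
        return [rng.gauss(0, 1) for _ in range(n)]

    target = noise(n_in)  # stand-in for "the stock market"
    best_r = 0.0
    for _ in range(n_candidates):
        r = pearson(noise(n_in), target)  # each candidate is unrelated noise
        if abs(r) > abs(best_r):
            best_r = r

    # The winning series was only ever noise, so its later observations are
    # just new independent draws; the same goes for the target.
    fresh_r = pearson(noise(n_out), noise(n_out))
    return best_r, fresh_r


if __name__ == "__main__":
    best, fresh = spurious_demo()
    print(f"best in-sample correlation: {best:+.2f}")
    print(f"correlation on fresh data:  {fresh:+.2f}")
```

With settings like these, the best in-sample correlation should look impressive while the fresh-data figure sits near zero, which is the Super Bowl indicator in miniature.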
Heh. True! And indeed, paying too much attention to suggestive but ultimately nonsensical correlations leaves you as barking mad as that dude in the movie Pi.
But I think the point with some of the data-mining is that they find correlations that are sufficiently persistent -- i.e. they happen time and time and time again -- such that they do, in fact, become useful for prediction. The stock-market/Super Bowl correlation has never achieved that level of reliability, but many other super-weird correlations have. In these latter cases, it's not that there are no logical causal connections; it's merely that they're so obscure that we can't figure out what they are. Or they might be removed by a few phases: We observe that, reliably and with predictability, event A occurs whenever signal F is perceived. We don't know why. Forty years from now, we'll figure out that it's because F is a symptom of L that is triggered by C catalyzed by A. But merely because we don't know the link doesn't mean that we can't usefully exploit our noting the correlation between A and F.
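One rough way to picture what "persistent" means here, again as a hedged Python sketch with invented names and numbers: suppose A and F are both nudged by a hidden factor that we never get to observe. The hidden factor, the noise levels, the window size and the seed are all assumptions for illustration. Checking the A/F correlation separately in each of several independent time windows shows it turning up every time, even though nothing in the check knows why.

```python
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def windowed_correlations(n_windows=10, window=200, seed=1):
    """Correlation between F and A in each of several independent windows.

    F and A never influence each other directly; both are a hidden factor
    plus their own noise. All parameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_windows):
        hidden = [rng.gauss(0, 1) for _ in range(window)]  # the unknown link
        f = [h + rng.gauss(0, 1) for h in hidden]  # the signal we perceive
        a = [h + rng.gauss(0, 1) for h in hidden]  # the event we care about
        results.append(pearson(f, a))
    return results


if __name__ == "__main__":
    for i, r in enumerate(windowed_correlations(), start=1):
        print(f"window {i:2d}: r = {r:+.2f}")
```

A pure-chance correlation would be expected to wander around zero from window to window; one with a real but unseen cause keeps showing up, which is what makes it exploitable before anyone can explain it.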
Indeed, an astonishingly large part of everyday medicine is guided by correlation and not causation. One hundred years on, scientists still don't have a clear idea why aspirin cures headaches, but they know beyond a shadow of a doubt that it almost always does.