Dredging data mining

We have come a long way in data science, and yet there is still lots and lots of ground to cover. In the old days, “data mining” used to have a bad reputation because “if you torture data for long enough, they will confess to anything.” Although it is fairly easy to lie with statistics, I would like to point out that it is much easier to lie without them! One problem that is fairly well understood, though, is “overfitting” of models. Not in the least because so many colleagues (me too) have been stung by it!

Overfitting of models occurs when idiosyncrasies in the training data become part of the model you use in production. Consequently, the final model winds up performing worse than you expected based on the “fit” to the training data. Predictive modeling can be an amazing tool, but if you point the gun down, it is eminently possible to shoot yourself in the foot. In extreme cases, these models might even perform worse than not using a model at all… When overfitting causes your model to suggest misleading relations in the data, and you find out only after deployment, money blows out the window. Not only do you damage business objectives, but your stakeholders will lose confidence in the value data science may offer.

From my experience, there are two common reasons why overfitting occurs. One is when data are (too) sparse, the other because of leakers (Berry & Linoff, 2004), sometimes called anachronistic variables (Pyle, 2003). These are completely different and unrelated reasons for overfitting. The “symptoms” show up very differently, and mitigating the risk for each requires (very) different measures. Some colleagues might not look at leakers as a cause of overfitting, and that’s a valid perspective (too).

Leakers are “predictive” variables that are in some way causally related to the target variable you are trying to predict. The causal direction, however, runs the wrong way around: they are the result of, rather than the cause of, whatever your target variable represents. Leakers are nasty little creatures that wreak havoc on your project. Because the same leaker variable is present in both the training and test data, no amount of cross-validation between subsamples of your model set will ever surface them. It’s not until you deploy a model with leakers that predictive accuracy plummets, and you as a data scientist wind up with egg on your face.
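To make that concrete, here is a small synthetic sketch of my own (not from Berry & Linoff or Pyle). The churn setup, the “account_closed_flag” feature, and every number in it are invented for illustration, and it assumes you have numpy and scikit-learn installed. The flag only gets set after a customer has churned, so it is a textbook leaker: cross-validation loves it, deployment does not.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
n = 5_000

# Two honest predictors with a weak, genuine relation to churn.
tenure = rng.normal(24, 8, n)
spend = rng.normal(50, 15, n)
logit = 0.08 * (tenure - 24) - 0.03 * (spend - 50)
churn = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# The leaker: a flag that is set AFTER a customer has churned.
# It is a result of the target, not a cause (a few flips add noise).
account_closed_flag = churn.copy()
flips = rng.random(n) < 0.02
account_closed_flag[flips] = 1 - account_closed_flag[flips]

X_leaky = np.column_stack([tenure, spend, account_closed_flag])
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Cross-validation cannot surface the leaker, because it sits in
# every fold: training and test subsamples both contain the flag.
cv_auc = cross_val_score(model, X_leaky, churn, cv=5, scoring="roc_auc")
print(f"cross-validated AUC with leaker: {cv_auc.mean():.3f}")

# Simulate deployment: when you score live customers, nobody has
# churned yet, so the flag is all zeros and the lift evaporates.
X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, churn, random_state=0)
model.fit(X_tr, y_tr)
X_live = X_te.copy()
X_live[:, 2] = 0
print(f"held-out AUC (leaker present): "
      f"{roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]):.3f}")
print(f"'deployment' AUC (flag not set): "
      f"{roc_auc_score(y_te, model.predict_proba(X_live)[:, 1]):.3f}")

Both the cross-validated AUC and the held-out AUC will look stellar; only the last number gives the game away, and in real life you only get to see it after go-live.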
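For contrast, the sparse-data flavour of overfitting does show its symptoms before deployment, as long as you hold data out. Another invented sketch along the same lines: fit polynomials of growing degree to a dozen noisy points and watch training and test error diverge.

import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

x_train, y_train = sample(12)   # (too) sparse training data
x_test, y_test = sample(500)

for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)
    err = lambda x, y: np.mean((np.polyval(coefs, x) - y) ** 2)
    # Higher degrees chase the noise: training error shrinks
    # while test error blows up.
    print(f"degree {degree}: train MSE {err(x_train, y_train):.3f}, "
          f"test MSE {err(x_test, y_test):.3f}")

Here a simple held-out sample does flag the problem, which is exactly why the two causes of overfitting call for such different countermeasures.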