From the course: Mistakes to Avoid in Machine Learning
Data leakage
- [Instructor] A common rule of thumb in machine learning: if your result looks too good to be true, it probably is. In these cases, the primary culprit is usually data leakage. Data leakage can be thought of as any time information from outside of your training set enters your model. It is especially prevalent when working with time series data and in environments where there are data cleanliness issues. The end result is that you may be fooled into thinking your model generalizes much better than it really does. So how can we detect and prevent data leakage? Here's an example. Let's say you are working to predict customer cancellations, and you have a theory that a recently introduced product makes for stickier customers. If you formulate the problem with historical data, you will likely find that none of the canceling customers bought this product. As you validate this model on new unseen data,…
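The cancellation example can be made concrete with a small sketch. It rests on one plausible reading that the excerpt itself does not spell out: customers who cancelled before the product launched never had a chance to buy it, so "did not buy" partly encodes "cancelled early". The launch date, the records, and the helper names `could_have_bought` and `cancel_rate` are all hypothetical, chosen only for illustration.

```python
from datetime import date

# Assumed launch date of the new product (illustrative, not from the course).
PRODUCT_LAUNCH = date(2023, 6, 1)

# Hypothetical customer records; cancelled_on is None for active customers.
customers = [
    {"id": 1, "cancelled_on": date(2023, 2, 10), "bought_product": False},
    {"id": 2, "cancelled_on": date(2023, 4, 5), "bought_product": False},
    {"id": 3, "cancelled_on": date(2023, 8, 20), "bought_product": False},
    {"id": 4, "cancelled_on": date(2023, 9, 15), "bought_product": True},
    {"id": 5, "cancelled_on": None, "bought_product": True},
    {"id": 6, "cancelled_on": None, "bought_product": False},
]


def could_have_bought(customer, launch=PRODUCT_LAUNCH):
    """True if the customer was still active on or after the launch date."""
    cancelled = customer["cancelled_on"]
    return cancelled is None or cancelled >= launch


def cancel_rate(rows, bought):
    """Share of cancelled customers among buyers (or non-buyers)."""
    group = [r for r in rows if r["bought_product"] == bought]
    if not group:
        return None
    return sum(r["cancelled_on"] is not None for r in group) / len(group)


# Naive view: every historical customer, including those who left pre-launch.
naive_gap = cancel_rate(customers, False) - cancel_rate(customers, True)

# Fairer view: only customers who had the opportunity to buy the product.
eligible = [c for c in customers if could_have_bought(c)]
fair_gap = cancel_rate(eligible, False) - cancel_rate(eligible, True)

print(f"naive gap in cancel rate: {naive_gap:.2f}")
print(f"gap among eligible customers: {fair_gap:.2f}")
```

In this toy data the product looks protective only while pre-launch cancellations are included; once the comparison is limited to customers who could have bought it, the gap disappears. Real data will rarely be this clean, so treat the filter as one check among several, not as a complete leakage audit.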
Contents
- Assuming data is good to go (2m 2s)
- Neglecting to consult subject matter experts (1m 48s)
- Overfitting your models (3m 25s)
- Not standardizing your data (2m 57s)
- Focusing on the wrong factors (2m 11s)
- Data leakage (2m 40s)
- Forgetting traditional statistics tools (1m 57s)
- Assuming deployment is a breeze (1m 47s)
- Assuming machine learning is the answer (1m 35s)
- Developing in a silo (2m 16s)
- Not treating for imbalanced sampling (3m 29s)
- Interpreting your coefficients without properly treating for multicollinearity (3m 19s)
- Evaluating by accuracy alone (6m 8s)
- Giving overly technical presentations (1m 56s)