From the course: Mistakes to Avoid in Machine Learning
Assuming data is good to go
- You scoped out your machine learning project. You have a pretty good sense of how you want this model to look and what direction you want to go in. You've even collected your data, and you're ready to build your model. But please, don't yet. A critical mistake in machine learning is assuming that your data is simply good to go. We've all heard "garbage in, garbage out," and I want your models to be a lot better than garbage. Here are some tips on how to ensure clean data prior to modeling.

First, visualize your data. If you're using Python, I recommend the describe function in pandas, or better yet, the pandas-profiling package, which outputs interactive reports that visualize descriptive statistics, correlations, and the completeness and distribution of your data. This is also a good method for eyeballing outliers in your dataset.

Next, check for duplicates. A hastily prepared SQL query can often result in duplicates. The pandas duplicated function will return any duplicated rows in your dataframe, and drop_duplicates will keep the first row and drop subsequent duplicates.

Next, be aware of missing values. You can call isnull().sum() on a dataframe to see missing values by column. So how do you correct this? Two of the most common treatments are: first, dropping those records with dropna, which removes rows containing null values entirely; and second, replacing null values with, say, zero using the fillna function. In either case, proceed very carefully. An alternative, called imputing, estimates the null values instead. Scikit-learn has built-in impute methods. For example, the SimpleImputer can fill missing values with the mean, median, or mode for that column.

To create great machine learning models, you need clean data. So the next time you start a project, set aside some time at the beginning to investigate the cleanliness of your data, and think carefully about how best to treat it before you proceed with your modeling.
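The first tip above, visualizing your data with describe, can be sketched as follows. The column names and values here are illustrative only, not from the course:

```python
import pandas as pd

# Hypothetical example data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "income": [48000, 54000, 61000, 75000, 58000],
})

# describe() summarizes count, mean, std, min, quartiles, and max
# for each numeric column -- a quick first look at distributions.
summary = df.describe()
print(summary)
```

For the richer interactive report mentioned in the transcript, the pandas-profiling package (since renamed ydata-profiling) generates an HTML report from a dataframe in a similar one-liner fashion.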
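The duplicate check described above might look like this in practice, again with a made-up toy dataframe:

```python
import pandas as pd

# Toy frame with one repeated row (hypothetical data).
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "value": ["a", "b", "b", "c"],
})

# duplicated() flags rows that repeat an earlier row.
dupes = df[df.duplicated()]
print(dupes)

# drop_duplicates() keeps the first occurrence and drops the rest.
deduped = df.drop_duplicates()
print(deduped)
```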
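The two common missing-value treatments mentioned (dropping rows and filling with a constant) can be sketched like so; the data is invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a couple of missing entries.
df = pd.DataFrame({
    "age": [25, np.nan, 47, 51],
    "income": [48000, 54000, np.nan, 75000],
})

# Count missing values per column.
missing = df.isnull().sum()
print(missing)

# Treatment 1: drop any row that contains a null value.
dropped = df.dropna()

# Treatment 2: replace nulls with a constant, here zero.
filled = df.fillna(0)
```

As the transcript warns, both options change your data's distribution, so apply either one deliberately rather than by default.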
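The SimpleImputer alternative mentioned at the end can be sketched as a mean imputation; the values are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column with one missing value.
df = pd.DataFrame({"age": [25.0, np.nan, 47.0, 51.0]})

# Replace missing values with the column mean; strategy can also be
# "median" or "most_frequent" (the mode).
imputer = SimpleImputer(strategy="mean")
df["age"] = imputer.fit_transform(df[["age"]])
print(df)
```

Here the missing entry is filled with the mean of the observed values, (25 + 47 + 51) / 3 = 41.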
Contents
- Assuming data is good to go (2m 2s)
- Neglecting to consult subject matter experts (1m 48s)
- Overfitting your models (3m 25s)
- Not standardizing your data (2m 57s)
- Focusing on the wrong factors (2m 11s)
- Data leakage (2m 40s)
- Forgetting traditional statistics tools (1m 57s)
- Assuming deployment is a breeze (1m 47s)
- Assuming machine learning is the answer (1m 35s)
- Developing in a silo (2m 16s)
- Not treating for imbalanced sampling (3m 29s)
- Interpreting your coefficients without properly treating for multicollinearity (3m 19s)
- Evaluating by accuracy alone (6m 8s)
- Giving overly technical presentations (1m 56s)