From the course: Mistakes to Avoid in Machine Learning

Assuming data is good to go


- You scoped out your machine learning project. You have a pretty good sense of how you want this model to look and what direction you want to go in. You've even collected your data and you are ready to build your model. But please don't yet. A critical mistake in machine learning is assuming that your data is simply good to go. We've all heard garbage in, garbage out. And I want your models to be a lot better than garbage. Here are some tips on how to ensure clean data prior to modeling.

First, visualize your data. If you're using Python, I recommend using the describe function in pandas, or better yet, the pandas-profiling package, which outputs interactive reports that visualize descriptive statistics, correlations, and the completeness and distribution of your data. This is also a good method for eyeballing outliers in your dataset.

Next, check for duplicates. A hastily prepared SQL query can often result in duplicates. The pandas duplicated function will return any duplicated rows in your dataframe, and drop_duplicates will keep the first row and drop subsequent duplicates.

Next, be aware of missing values. You can call isnull().sum() on a dataframe to see missing values by column. So how do you correct this? Two of the most common treatments are: first, dropping these records using dropna, which removes rows with null values entirely; and second, replacing null values with, say, zero using the fillna function. But in either case, do this very carefully. An alternative, called imputing, estimates the null values. Scikit-learn has built-in impute methods. For example, SimpleImputer can fill missing values with the mean, median, or mode for that column.

To create great machine learning models, you need clean data. So the next time you start a project, set aside some time at the beginning to investigate the cleanliness of your data. And think carefully about how best to treat it before you proceed in your modeling.
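The sketch below pulls these steps together in Python. It assumes a small, hypothetical pandas DataFrame (the column names and values are made up for illustration) and shows the describe, duplicated/drop_duplicates, isnull().sum(), dropna, fillna, and SimpleImputer calls mentioned above; the pandas-profiling report is noted in a comment rather than run.

```python
# A minimal sketch of the data-cleaning checks described above, using a
# hypothetical DataFrame with a duplicate row and missing values.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [34, 41, np.nan, 41, 29],
    "income": [52000, 61000, 48000, 61000, np.nan],
})

# 1. Summarize the data: descriptive statistics for each column.
#    For a richer interactive report, pandas-profiling's ProfileReport(df)
#    can be used instead.
print(df.describe())

# 2. Check for duplicates, then keep the first occurrence and drop the rest.
print(df.duplicated().sum())   # number of duplicated rows
df = df.drop_duplicates()

# 3. Inspect missing values by column.
print(df.isnull().sum())

# Treatment A: drop records that contain null values entirely.
dropped = df.dropna()

# Treatment B: replace null values with a constant such as zero.
filled = df.fillna(0)

# Treatment C: impute missing values, e.g. with the column mean.
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```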
