From the course: Cleaning Bad Data in R

Duplicated rows and values


- [Instructor] It's not uncommon to find duplicated rows in a dataset, and it's definitely a situation that requires further investigation. The most important thing to remember is that duplicated rows are not necessarily incorrect. There may be a perfectly valid reason for the same row to appear twice in your dataset, so you shouldn't simply assume that duplicated values are erroneous and need to be removed. This is a situation that calls for careful thought and a surgical scalpel rather than a blunt instrument.

Let's think about some reasons you might have duplicated values in your dataset. First, you might have records corresponding to two different people with the same name; those names may appear, on the surface, to be duplicated values. Second, the same customer may have ordered the same product two different times. The important thing to remember is that a dataset in R is not the same as a database table that follows the normal forms. While a database table should have a primary key that distinguishes rows, there is no such requirement for a unique identifier in an R dataset. Therefore, it's entirely possible for duplicate rows in your dataset to be legitimate. The bottom line is that any duplicate records you identify should serve as a starting point for further investigation.

If the investigation determines that they really are unwanted duplicates, you have two options for removing them. First, if they truly are completely identical, you can use R's unique function, which removes any duplicated rows from a data frame. However, if the rows are not 100% identical, the unique function won't remove them. For example, if your dataset has two different records for a person, the unique function will see them as different rows and leave them both intact. In that case, you'll need to manually remove whichever row you believe is incorrect.

Let's take a look at an example in R. I have the code set up here to load a data file containing information about continents. I'll begin by loading the tidyverse, change my working directory, and then load the continents.csv data file. I see that the file contained 10 observations, and I already know there must be a problem here because there are only seven continents. Let's take a look at the data frame using the View function and see if we can sort this out.

Well, I can see the first problem immediately: there are three different records for Europe, and they all have the same data. I know that the unique function can take care of that, so let's go ahead and overwrite the continents data frame with a copy of itself run through the unique function. Now when I view continents again, I see that I've already eliminated those completely duplicated rows for Europe, but I still have two records for Antarctica.

I can apply some common sense here to figure out which row is correct. Do you think that there are 4,000 people living in Antarctica, or four million? Unless there's been a population explosion that I haven't heard about, 4,000 sounds much more reasonable. There are many different ways that I could remove this incorrect row. I'm going to use a filter function to remove any records for Antarctica that have a population of more than 100,000 people.
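Here's a minimal sketch of the code up to this point. The file name continents.csv comes from the video; the working directory path is a placeholder, since the actual path isn't shown.

```r
library(tidyverse)

# Placeholder path; point this at wherever continents.csv lives
setwd("~/data")

# Load the file: read_csv() reports 10 observations, even though
# there are only seven continents
continents <- read_csv("continents.csv")

# Inspect the data frame (opens the data viewer in RStudio)
View(continents)

# unique() drops rows that are complete duplicates, collapsing
# the three identical Europe records into one
continents <- unique(continents)

View(continents)
```

(In tidyverse code, dplyr's distinct() does the same job as unique() here and is worth knowing as an alternative.)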
So I'll go back to my code. I'm going to begin with the continents data frame and overwrite it with a copy of itself sent through a filter function. What I have to remember is that the filter function specifies the rows that I do want, and in this case, I have some rows that I don't want: rows where the continent is Antarctica and the population is over 100,000. So I'm going to write my filter condition using the not operator, the exclamation point, which negates whatever expression follows it. I want all the records where it's not the case that they're for Antarctica and have over 100,000 people. So I'll go ahead and specify those two conditions, excluding records where the continent is Antarctica and the population is over 100,000. When I run that and take a look at the continents table again, I see that I'm down to seven records for seven continents, and my dataset looks like I've resolved those duplicated rows correctly.
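A sketch of that filter step, continuing from the previous one. The column names continent and population are assumptions inferred from the narration; they aren't confirmed on screen.

```r
# filter() keeps the rows where the condition is TRUE, so negate
# the condition that describes the rows to drop: Antarctica
# records with a population over 100,000
continents <- continents %>%
  filter(!(continent == "Antarctica" & population > 100000))

View(continents)  # seven rows remain, one per continent
```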
