From the course: Learning Data Analytics: 1 Foundations

Finding and removing duplicates

- [Instructor] I don't really trust any dataset until I explore it just a little bit. That's because we often inherit the skills of the person who sent it to us. What if they didn't join it correctly? Or what if they didn't understand the ask? So when I work with data, I always do some exploration. What I've found is that most people discover issues when they start to build calculations and something seems off. So if we look at our data here, we see we have a total order amount, and we want to vet that to make sure that it includes all of the items.

Let's go take a look at our query. I'll go to Data, Queries & Connections. If I already see Queries & Connections, I can just right-click my table and choose Edit. Let me scroll over and look at my fields here. I have price and quantity and discount and tax, and I have a sales amount. It's really easy to add a quick column in M code. So I'll go to Add Column and choose Custom Column, and I'll name this Test Calculation. If I take my unit price and multiply that by my quantity, then deduct my discount amount and add my tax amount, I should get something equivalent to the sales amount. It says no syntax errors, so I'll go ahead and click OK.

I can immediately see something is off. I have $7.95 for my sales amount, but my test calculation gives me $8.59. Let me go ahead and set this column to currency here. So if I take the $7.95 and add the $0.64, I'm going to get that $8.59. So that makes sense. But where's the $78.76 coming from? I need to explore my data. So I'll go to my Home tab and choose Close & Load.

This is the type of analysis people don't talk about when they think about data analysis. I clearly see something is off in my data, and I have to analyze what that is. The sales amount implies that there is some type of total, but it doesn't match my test calculation, and my total order amount is totally different. So again, it requires me to do a deeper analysis of my data.

Let's go look at the original data on the order data tab, and for the sake of screen space, I'll close that Queries & Connections pane. I'll scroll over and hit refresh, and I'll check my sales data. There's my test calculation. Earlier I said if I just add these together, there's that $8.59, so that's confirmed, but again, it doesn't explain that total order amount.

So I'll go back over here to my order data. I notice that I have a sales order line number, so that's interesting. And if I scroll over here, I have a sales order number. Now, if I have a sales order number, it's not exactly obvious whether it's repeated in the data unless I look at it. One of the things that I'll do to spot duplicates is use conditional formatting. So I'll select column F and go to Home, Conditional Formatting, Highlight Cells Rules, and then Duplicate Values. Immediately I can see that it's highlighted anything that may be duplicated in red. I don't have to go any further. And if I look a little bit down in my data on rows six and seven, I can clearly see it. Sometimes those duplicates are really hard to spot. I'll go ahead and choose Cancel here because I don't need to keep that conditional formatting.

Let me go back to my clean data and my query settings. I'll go to Data, Queries & Connections, click my table, and choose Edit. So what I actually have here are the individual line items on my sales order. So I'm going to go ahead and name this table.
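For reference, here's a minimal sketch of that test-calculation step as it might appear in the query's M code. The column names (UnitPrice, OrderQuantity, DiscountAmount, TaxAmount) and the prior step name are assumptions based on the fields described, not taken from the actual workbook:

    // Hypothetical applied step; column and step names are assumed for illustration.
    #"Added Test Calculation" = Table.AddColumn(
        #"Changed Type",    // whatever the prior applied step is actually named
        "Test Calculation",
        each [UnitPrice] * [OrderQuantity] - [DiscountAmount] + [TaxAmount],
        Currency.Type       // matches setting the new column to currency
    )

The sanity check is simple: on any row where Test Calculation doesn't equal the sales amount, the "total" column isn't the line-level total you assumed it was.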
I'll right-click it and rename it, and I'll call this Sales Order Lines. Now I want to get to just the individual sales orders, so I'll right-click this query and duplicate it. When I duplicate the query, I get the exact same applied steps as before. You can see I have the exact same steps here, and it's completely disconnected from Sales Order Lines, so if I make changes in Sales Order Lines, they won't carry over. Let's go ahead and rename this one to Sales Order Header.

I'm now ready to remove the duplicates. But to do that, I need to remove any columns that keep each row unique. So I'll go to Choose Columns. I do want to keep my customer key, the first and last name, and my city, state, and postal code. The product key, though, is for each individual line, so I'll go ahead and uncheck that. The sales order number is pretty important, so I think I'll keep it, along with the date ordered and the date shipped. This line number is for each individual line that a product was ordered on, so I'll uncheck that. The unit price, order quantity, discount amount, tax amount, and sales amount are all based on the individual line, so we can uncheck those. And then we no longer need our test calculation. I think I now have all of the columns that would give me a unique list of sales orders once the duplicates are removed, so I'll go ahead and click OK. Then, using the button to the left of the first column header, I'll scroll down and choose Remove Duplicates. There are a ton of options here. I'll remove the duplicates, and then I'll close and load this to my spreadsheet. So I have 60,351 rows of order lines and 27,639 actual individual orders.

Let me go ahead and name these sheet tabs. I'll call this one Sales Orders, and then I'll rename that cleaned tab to Sales Order Lines. And if you notice, when I'm on a sheet, the pane actually highlights the query for the table I'm on. So if I go to Sales Orders, it highlights Sales Order Header. This just makes it easy for you to understand which dataset is associated with which query on the right-hand side.

There are many ways that people remove duplicates in data. This is just one way to leverage the tools of Power Query. Because it maintains the steps, if you update the data, those steps are reapplied automatically. If you manually remove the duplicates before you start working with the data, just understand it may degrade the data quality. So anytime you can automate the removal of duplicates, you're going to have a higher-quality dataset.
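The Sales Order Header query built above can be sketched in M roughly like this. The query reference and column names are assumptions for illustration, since the video builds these steps through the Choose Columns and Remove Duplicates buttons rather than by typing code:

    let
        // The video duplicates the lines query; referencing it is an equivalent sketch.
        Source = #"Sales Order Lines",
        // Keep only order-level columns so no line-level detail keeps rows unique.
        OrderColumns = Table.SelectColumns(Source,
            {"CustomerKey", "FirstName", "LastName", "City", "State", "PostalCode",
             "SalesOrderNumber", "DateOrdered", "DateShipped"}),
        // Remove Duplicates compares entire rows, collapsing 60,351 lines to 27,639 orders.
        Orders = Table.Distinct(OrderColumns)
    in
        Orders

Because Table.Distinct compares whole rows, leaving even one line-level column (like unit price) in place would make nearly every row unique and the deduplication would silently do nothing, which is why the Choose Columns step comes first.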
