From the course: Learning Data Analytics: 1 Foundations

Learning Excel datasets

From the course: Learning Data Analytics: 1 Foundations

Learning Excel datasets

- [Instructor] I've noticed over the years that people focus on the technical skills needed to work with data, but there is another learning curve to consider, and that's learning the actual data. When you get new-to-you data, be sure that you spend time looking at the data. Let me show you a few of my tricks. I've got a spreadsheet open here called Learn Data. One of the very first things that I note is that the sheet tab is called Order Data. This makes me wonder, Is there a query called Order Data or did someone name this? I would probably ask that question. Okay, I see things like employee ID, last name and first name, and they're together. So I believe that this employee ID and this last name and first name are probably the employees. And because I work at this company, I would likely know that Nancy is actually an employee. I also see things like customer ID and order ID. This tells me I'm looking at data about orders. I know this even more when I see things like quantity, price and product ID and product name. I see things like order date and ship date. So one of the very first things that I noticed is that the numbers all appear to be right-aligned. And I look up top and they're not set right-aligned. So they are right-aligned by default because they are numbers. That's an important distinction. I also see the order date. Now it looks like a date, which is the human version of a date, but if I look in the numbers section, I can also see that it's actually a date. All right, that's awesome. So when I first start looking at any new data, I want to know, How many column headings am I actually dealing with? Now I know that I can scroll over to the right, but I don't want to count them. I would rather the program tell me. So I'm in A1. I'll do shift, control and right arrow. And then using my auto calculate on the bottom of my screen, I say, "Count," and it says 14. Now, if you don't see information there, you can simply right-click the Count, or in that area, and then you can scroll through your settings here. You want to make sure, as an analyst, you have average, count, numerical count, minimum, maximum and sum. These are just baseline aggregate information. The difference between count is count will count everything, text or numbers, and numerical count will only count numbers. Now that's important. We'll look at that. Okay, so right now I have 14 column headers and they seem to be named pretty meaningfully. How much data do I actually have? And is that data broken? Let's take a look. I'm in A1, If I do Control A, I can select all of the data that Excel recognizes as this dataset. Now what's important about that is, if I have any solid blank columns or any solid blank rows, that's where it stops recognizing the data. So I'm going to zoom and I'm going to just check to make sure I don't have any broken data. Perfect. All right, let me do Control Home to go back to A1. I'll click in the center of my zoom. All right, perfect. So now that I don't have any broken data, let's see what my record count will be. Remember, count counts everything and count numerical counts actually just numbers. So I'll do Shift Control and down arrow from A1. And this navigates me all the way to 3685. Okay, take a look at your auto calculate. Notice that count says 3685, meaning you have 3685 sales highlighted. The numerical count is 3684. That's because everything underneath that header is a number. So I actually have 3684 records and it is 3685 rows because of the header. All right, I'll go ahead and do Control Home. Okay, so I want to know, How much in price would I have if I totaled this whole set? But I don't want to go add those; I just want to take a look. So I'll go ahead and highlight Column G, and using my auto calculate and my sum feature at the bottom, I notice that I have approximately 2 million. That means that if I have 4 million or 8 million show up in my report, something is wrong, because the raw data tells me that if I totaled all of it, it would only be around 2 million. It's important to keep up with this type of information. I call it quick and dirty analytics. Okay, so let me show you one area some people fail to pay attention to, and that's the duplicates. All right, I see order ID. Order ID, when I see them, after years of experience, tells me I'm either looking at a single order or I'm looking at line items of orders. So what I'm going to do is highlight Column E. I'll go to my Conditional Formatting. I'll go to Highlight Cells Rules and I'll do Duplicate Values. When I do this, it asks me if I want to highlight sales that are duplicate, and choose what color. I'll keep it to light red. Again, I'm just spot checking. Already, I can see those duplicates standing out. Now this tells me that I'm looking at line items. So if I'm going to do a total by order ID, I have to group those order IDs up together to get the total. That's valuable information. All right, I'll go ahead and go back to Conditional Formatting. I'll go to Clear Rules and I'll clear the rules from the entire sheet. All right, I'll go ahead and click on A1.

Contents