From the course: Learning Data Analytics: 1 Foundations

Working with joins and validation

From the course: Learning Data Analytics: 1 Foundations

Working with joins and validation

- [Instructor] When you join data, you were using a field of data that is the same in both tables. And it is the same data type. Sometimes they will be named the same and sometimes they will not. If you will open up bad link in design view, you're going to notice that product ID is related to sales order ID. I've noticed through the years that this is a common mistake when people are learning to link data together. When you run this query if you see either no results or limited results, the very first thing that you'll check is did you join the appropriate fields? I'll go ahead and run this. And I see no results. That's because product ID and sales order ID should not be linked. I'll go ahead and go back to my design view. I'll delete this link. The appropriate link would be product ID to product ID. And then when I run my results, I'll see the answer of where my product is related to my sales orders. Let's talk about another common mistake that produces a cross join. Right click and open in design view the query named cross. When you add two tables to a query window and you do not have a join type, this is called a cross join. A cross join we'll actually take every record from one table and attach it to every record in another table. Let's go ahead and get the sales order detail count and also the product count. So with that opened up my sales order detail. I'll notice that I have 79,584 rows of data. If I open up my product I have 504 rows of data. In a cross join it would take all of these sales records and attach them to all of the product records and it would multiply them. So the record results would be 79,584 times 504 to produce over 4 million records. So if you see an inflation in records, it could be that you didn't join your tables. Okay, let's go ahead and close our product table, our sales order table. And we'll go ahead and close that cross query. We don't need to save any changes. Let's go ahead and open up inner in design view. I'll right click and go to design view. By default, when you join two tables they're going to produce an inner join type. I'll highlight product ID and drag it to product ID. I know that this is an inner join because I do not see arrow shapes on either side of the join. Just be aware that when you do adjust to join it will change the way the join line looks. Now when I run this query I will say products that have ordered and have sales records. I'll go ahead and run that. Because I've checked my record counts, I'll say that I have all 79,584 sales records. But what this doesn't show me is all of the products that have not been ordered. I can only see the sales records of the products that have been ordered. Okay, I'll go ahead and save and close this. So if my goal was to actually report and analyze products that have not been ordered, I would need to perform an outer join. Okay. Let's check those record counts of products one more time. I'll open up the product table. I do have 504 product records. So if I'm looking at product records and I'm assuming some have not been ordered, then I can't tell it from just looking at product. I have to actually tie it to the sales order and then filter for the ones that have not been ordered. Okay, I'll close product. I don't need to save any changes. And I'll go ahead and do left outer in design view . To adjust the joint properties is simple, you can double click the join line, or you can right click and go to join properties. Every program sees left and right based on different ways and reasons. Here, I can see my left table name is product, and my right table name is sales order detail. If for any reason I wanted to change those I could. So let's write the three different join types. Include rows where joining fields from both tables are equal, that's the default inner join. Include all records from production product and only those records from sales order detail where the join fields are equal. And because product is in the left side that would be a left outer join. The third join type that's included here is include all records from sales order detail and only those records from production product where the joint fields are equal. In the case of this data that would produce the same result as an inner join. So I want option two. I want to see all of my products and then their associated sales records. I'll go ahead and click Okay. So when I run that query I'll see the results of that left outer join. Most notably you'll notice that starting at sales order ID, the fields are null and there are null all the way across. That's because there are no records in sales for these products. If I scroll down through my record set, the moment I find a product that's been ordered, I start to say the sales order record is no longer null. This would give me a comprehensive list of all products and their associated sales orders. If I only want a list of products that have never been ordered, I can then flip the design and place is null in the criteria for sales order. Let me show you. I'll right click and go to the design view. I'll go to sales order and type, is null. This particular query when I run it will give me every product that has not been ordered. So of our 504 products, there are 246 products that have no associated sales order records. When you're responsible for engineering your own datasets, your reporting requirement or data objective will determine what data you need to join and what type of join will be required of that data. If you're a data analyst you'll have plenty of opportunity to practice this in real life as a part of your everyday job.

Contents