From the course: Advanced SQL: Logical Query Processing, Part 2

Self and inequality joins - SQL Tutorial

From the course: Advanced SQL: Logical Query Processing, Part 2

Start my 1-month free trial

Self and inequality joins

- [Narrator] The same table expression can be used multiple times in a FROM clause, as long as each instance of it is uniquely aliased. We call it a self join. All the rules we learned regarding join processing hold true for self joins. Each instance of the table expression is completely independent of the other. One of the most common use cases for self joins is querying adjacency list. This is the definition of an adjacency list, go ahead, read it, I'll be waiting. Adjacency lists are commonly used to represent employee hierarchies, bill of materials, post threads and other types of graphs. They are very intuitive, but not necessarily the most efficient or flexible. And don't worry, I'm not going to make you suffer through another employee hierarchy example, I'm sure you've seen a gazillion of these. Practically every book and every article on joins has one. There are other techniques, such as enumerated paths and nested sets that can also be used to represent graphs and SQL. Let me know in the Q and A section, if you would like to see a course about them, they're really cool. Most joins use equality, qualification predicates. The most common join is between a parent and it's child rows. Customers with orders, orders with items or animals with adoptions. Non-equality joins are joins that use non-quality qualification operators. There's nothing complex about them if we follow join processing order. Now, despite my earlier promise, a quick review of join order will be very helpful here. Every join begins with a Cartesian product where every element from the blue set is matched with every element from the orange set. Cross joins end here. All other join types proceed to a qualification phase where each row of the Cartesian product is evaluated using the qualification predicate. And this is true, regardless what qualification predicate we use. It can be an equality operator and non equality, or even a constant Boolean expression. If the qualification predicate uses an equality operator, only the row with a two twos qualifies. For a different than operator, all the rows except the previous one qualify, and for a larger than operator, only the row with a blue three and orange two qualifies. These are best explained with a code demo, and I remind you that it's perfectly okay to pause the video as many times as needed if you feel it's going a little too fast for you. Our task is to report adopters who adopted two animals on the same day. The report should show the adopter, both adopted animals in separate columns and the adoption date. I said two animals, not more than one for a reason. And you can try and solve it for three or more animals, and I would love to see your solution in the Q and A, we're not going to cover it here. Let's begin by joining two instances of the adoption's table, alias them A1 and A2 respectively, and use an equality predicate, where both the adopter email and the adoption date are the same. I'll add an order by just for convenience. We can immediately see a few issues. First, each adoption row is matched with itself, and of course we need to filter these out. So, let's add a predicate, A1 name different than A2 name and execute it. Success. The second challenge, is that we get two rows for each date for the same animals, only where their positions swapped. A row for Gus and Sam, another for Sam and Gus. There is a cool trick to get rid of these duplicate easily. Pause the video for a minute and try to find it. One way is to change the different than operator to a larger than. Only one of the rows we'll evaluate to true, and the duplicate will be eliminated. There was no requirement to show a specific animal on either of the columns, so name order is as good as any other. But we're not done yet. There's a hidden bug lurking in the solution. Even though it doesn't show with this data, do you think you can spot it? Pause the video once again for a minute, and see if you can spot the lurking bug. Comparing the animal's name between A1 and A2, just isn't enough. An animal's identifier is species and name. And someday, we may have two animals with the same name, but of a different species. Let's try to demonstrate it. I'll add two animals of a different species with the same name. Duplicate sounds like a nice name. Let's get them adopted on the same day, execute the query again, and indeed duplicate doesn't show up. Can you think of a solution, so that duplicate the dog and duplicate the rabbit do show up. Once again, pause the video for a minute and see if you can find a solution. You might have been tempted to use a predicate for either name is larger, or the species is different. Let's try this, execute, and now duplicate does show up, but this change had an unfortunate side effect. The trick we previously used with A1 name larger than A2 name, no longer works. We get back duplicate rows for adoptions of animals, with a different name and a different species. Sam the cat and Gus the dog, but also Gus the dog and Sam the cat. The OR causes the predicates to evaluate to true for both. If this isn't perfectly clear, you can execute the query without the last predicate, scroll down to Francis rows, and evaluate the predicate for each row in your head. If you're attempted to change the species predicates to larger than, that's not going to help either. As dog is larger than cat, but Sam is larger than Gus. So, if we're going to try this, both are returned. How would you solve it? Once again, pause the video for a minute, and see if you can find a solution. We have three possible conditions for an animal. Either their names are the same, but the species is different, or their species is the same, but the names are different, or both their names and their species are different. And we must account for all three possibilities. So, let's write these down as individual predicates separated by OR, and execute once again. Now something else has gone wrong. Where did the adoptions of Gus and Sam and Archie and Abby disappear. And for the last time, pause the video for a minute and see if you can find what's wrong. The last predicate is wrong. There is no guarantee that both the name and the species from A1 will be larger than those from A2. Even though Sam is larger than a Gus, cat isn't larger than dog. And the same is true for Archie and Abby. We cannot have both predicates use the larger than operator, and in order to fix this query, we must one of them back to being a different than, and it doesn't matter which one. Let's change it, execute again, and now we get the correct result. If this still doesn't make sense, execute the query without the last predicates and evaluated in your head like before. Can you think of another way to solve this challenge? I invite you to post your suggested solutions in the Q and A section.

Contents