From the course: Advanced SQL for Data Scientists

B-tree indexes

From the course: Advanced SQL for Data Scientists

Start my 1-month free trial

B-tree indexes

- [Instructor] In addition to designing tables, part of data modeling is developing an indexing strategy. So let's take a look at indexes, different types of indexes, and how they apply when we're working with analytical queries. We'll start with B-Tree indexes. So, indexing is used primarily to reduce the amount of work we need to do when we have to go and fetch data. Now, typically this means we don't want to be scanning a lot of data blocks, so indexes are used to help us reduce that. Now, there is a cost associated with this because we have to maintain these indexes. So what that means is we're going to take up additional space, but we're also going to be doing additional rights, so anytime we're loading data or deleting data, we're going to need to update the index. Now, some things to keep in mind is that when we index a column, the higher the cardinality which means, you know, just the number of distinct values, really influences how well the index helps improve query performance. So for example, if you had a table of codes and the codes were numbers, one through 10, the codes are evenly distributed you can expect about 10% of the table to be returned if you look up a particular code. So 10% is not bad but it doesn't reduce a lot say compared to if the cardinality was say a 1000, and if you looked up a particular code you could reduce the amount of work that you need to do to maybe 1000th of the size of the table. Now, we're going to be talking about indexing, but again I want to point out in terms of things that are different in analytical databases, indexing is not used in analytical databases like Google BigQuery or AWS Redshift, again, they have different strategies. Now, there are several different types of indexes, we're going to pay attention to three broad categories, the B-tree, the bitmap, and the hash index, and then we're also going to look at special purpose index. So first, let's take a look at B-trees indexes. Now the B in B-tree stands for balanced. And what this means is that what we're trying to do is keep a record or information about rows of data by capturing a small amount of information, that's the attribute or attributes that we're indexing. And one of the things we want to be able to do with an index is to be able to look up a value very quickly. Well, B-trees are sort of the workhorse of indexing, they work really well in many different cases and what they give us is basically the ability to look up a value in logarithmic time. So let's see how that works. Say, we are indexing say a column that has one to 100, and the first one we index or the item right in the middle is a 50, and that makes sense because we want to be able to split sort of divide and conquer and with a balanced tree the way it works is the next time we insert a value, if the value is less than 50 then we're going to put it in toward the left node and if it's greater than 50 we're going to put it toward the right note. So for example, 25 would go to the left and 75 would go to the right. Now, let's say you want to insert a value say between 25 and 50, in that case, it's going to be less than 50, so we're going to go to the left, we'll go toward 25 but then if it's greater than 25 we're going to go to the right. So in this example, you know, 37 is placed first to the left and then to the right. And we can continue this example, and what we can see here is even though there could be a 100 different nodes in this tree, we're never going to have to go more than log n or log 100 steps to actually find the value. So that's one of the big values of B-tree indexes is it gives us that order and look up time.

Contents