From the course: Advanced SQL for Data Scientists

EXPLAIN and ANALYZE commands

- [Instructor] As data scientists working with SQL, it's sometimes helpful to understand a little bit about how the database engine actually executes SQL statements. That's particularly important if we're trying to optimize the performance of our queries. So let's start from first principles. SQL is basically a declarative language: we specify what we want. For example, I want this set of columns returned from this particular table, subject to this filter criteria. That is stating what we want. I'm not saying in any way how the database should go about doing that, for instance by limiting the data it has to scan to a subset of rows. Many languages like Python or Java can be used procedurally, and when we do that, we're specifying how to do something. We're directly manipulating data structures and determining the sequence of steps. So one of the things the database engine does for us is figure out those procedural steps and then execute them for us.

One of the simplest kinds of procedural steps is scanning a table. Basically, we start at the beginning of the table, fetch a row, and then move on to the next row. And maybe we're doing a comparison. For example, we're looking at a status level and we want to filter: show me all the customers whose status level is gold. That's exactly what this kind of simple scan can do. We simply walk through each row in the table, check the particular column that's part of the filter, and if the row passes the filter, we return it. So the procedure is fairly straightforward. In a lot of ways, scanning is simple: we look at each row, fetch the data block that contains the row, and apply the filter or condition. If we think about the time complexity of this, the cost is basically based on the number of rows in the table.

Now I should say that scanning isn't always this simple. I'm making the assumption that we're working with a database that uses row-oriented storage. Some databases, like AWS Redshift or Google BigQuery, use columnar storage, and we typically see columnar storage with data warehouses. So some of the things I say here about scanning and row fetching certainly apply when we're using Postgres, but maybe not if you're working with, say, a petabyte-scale data warehouse.

So cost is based on the number of rows. That doesn't mean scanning is always bad. Scanning can be really efficient when tables are small. For example, rather than look something up in an index and then go fetch the row from the table, it may be faster just to scan the table and not maintain an index at all. Scanning a large table can also be efficient if we only scan it a very few times. Say you have a very large table and you only have to scan it once: it's probably not worth building an index on it, because of the cost of building the index, the storage that's required, as well as the time. So even with very large tables, there may be times when scanning is efficient. But in general, with large tables that we query repeatedly, sequential scans, or full table scans, are not efficient, so we want to find some other way to work with those tables. One way to do that is to use indexes.
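To see which access strategy the database picks for a query like that, we can ask Postgres with the EXPLAIN command. Here's a minimal sketch, assuming a hypothetical customers table with a status column; the table, the columns, and the cost numbers in the sample output are illustrative.

  EXPLAIN
  SELECT customer_id, first_name, last_name
  FROM customers
  WHERE status = 'gold';

  -- With no usable index, the plan reports a sequential scan, shaped roughly like:
  --   Seq Scan on customers  (cost=0.00..1540.00 rows=120 width=36)
  --     Filter: (status = 'gold'::text)

The estimated cost grows with the size of the table, which is exactly the point about scan cost being driven by the number of rows.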
And of course, we've already talked about this, but just to recap: indexes are ordered, and they're faster to search for an attribute value than, say, scanning the whole table. Since an index is basically a set of attributes and a pointer to a particular row, we can use the index for our filter and do the filtering based on the index rather than based on the data in the rows.

Now, another thing we want to watch when we're thinking about performance is joining tables. The basic idea with joining tables is that we have data in, say, two tables, and we want to select a row from one table and link it, or join it, to a row in another table, and basically produce a projection or a new view over that data. So the question is, how do we go about matching those rows? Well, one table is going to have a foreign key in it, and the other table is going to have the primary key. So the question of how to match rows becomes: how do we match keys?

There are a few ways to do this. There is a technique known as the nested loop join, where the database engine compares all rows in both tables to each other. There is the hash join, in which we calculate a hash value of a key and then join based on matching hash values. And finally, there's the sort merge join, in which we sort both tables and then join rows while taking advantage of the order.

For a nested loop join, the basic sequence of steps is: we loop through one table, and for each row we loop through the other table, and at each step we compare the keys. The nice thing about nested loop joins is that they're simple to implement, but they can be expensive if the tables are large. A nested loop join is often a good choice when the tables are small. A hash join computes the hash value of the keys in the smaller table and stores those hashes in a hash table, along with some row attributes. Then we scan the larger table and find rows in the hash table whose hash value matches the hash of the key from the larger table. With a sort merge join, we sort both tables and compare rows like we do with a nested loop join, but because both tables are ordered, we can stop when it's no longer possible to find a match later in the table because of the sort order. And this is nice because we scan the driving table only once.

So we've looked at indexes and we've looked at joins as things to consider for query optimization. One of the things the database engine, or the plan builder, looks at is information about the tables themselves. The plan builder relies on statistics about the data in the tables we're working with. Usually statistics are kept up to date, and in Postgres there is a process called autovacuum which does that, and which also cleans up after deletes and does some other background housekeeping. Sometimes, though, statistics can get out of date. There is a command called ANALYZE which you can run to update the statistics. So suppose you're looking at a query plan and you can't quite figure out why the plan builder chose a really inefficient approach, like doing a nested loop join when maybe a hash join would be more appropriate.
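To see which join method the planner actually chose, you can run EXPLAIN on the join itself. This is a hedged sketch, assuming hypothetical customers and orders tables related by customer_id; the plan shape and the numbers are only illustrative of what Postgres might print.

  EXPLAIN
  SELECT c.last_name, o.order_total
  FROM customers c
  JOIN orders o ON o.customer_id = c.customer_id;

  -- With larger tables, the plan will often show a hash join, roughly like:
  --   Hash Join  (cost=35.50..420.75 rows=10000 width=18)
  --     Hash Cond: (o.customer_id = c.customer_id)
  --     ->  Seq Scan on orders o   ...
  --     ->  Hash
  --           ->  Seq Scan on customers c   ...

If the plan shows something unexpected, like a nested loop over two big tables, that's a hint to look at the statistics the planner is working from.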
A possible reason for that, an outside chance, is that for some reason the statistics are out of date, so just run the ANALYZE command. There are different parameters, so you can cover the whole database or target particular tables, and you can definitely get more detail on the ANALYZE command. It's not something you necessarily need to use a lot, but if you do, there are options to either cover everything or really target a particular table. And some things to keep in mind: when a table has a relatively small amount of data, ANALYZE will look at all of the data. When it's a very large table, it will only sample the data, so it builds the statistics out of a sampling. If you were to run ANALYZE, say, two times over a very large table, you might get slightly different statistics, but that's because of sampling.
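For reference, here are a few hedged examples of running ANALYZE in Postgres; customers is an illustrative table name and status an illustrative column.

  ANALYZE;                        -- update statistics for every table in the current database
  ANALYZE customers;              -- update statistics for one table
  ANALYZE customers (status);     -- update statistics for specific columns only
  ANALYZE VERBOSE customers;      -- same, but print progress messages while it runs

After that, you can re-run EXPLAIN on the problem query and see whether the plan builder makes a different choice.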
