From the course: Advanced SQL for Data Scientists

Rules of normalization

From the course: Advanced SQL for Data Scientists

Start my 1-month free trial

Rules of normalization

- [Instructor] When we talk about relational databases in SQL and data modeling, we're almost always going to be talking about normalization to some degree. Now, normalization is a practice in which we follow a set of rules for designing database tables which help us minimize the risk of data anomalies. Now, normalization is an important practice and it's widely adopted, but it's not always essential and there are cases where we actually don't want to normalize. So let's take a look at normalization and then we'll take a look at when we actually break the rules of normalization. So data anomalies are basically errors or inconsistencies in data that we really want to avoid. And there's three types we'll talk about here. There is an update anomaly. And an update anomaly occurs when you have redundant data and you only partially update that data. So for example, if we had a database of books and in that we tracked books and authors, and for authors we track things like their address. Well, if every time an author had a book entry, we captured the author's address along with the book, then we might have multiple addresses, say an author has 10 books published, there would be 10 copies of the address. Now, if for some reason we updated that address but only updated five of them, then we'd have five with old data and five with new. That's an example of an update anomaly. An insertion anomaly basically results when we're not able to add data to the database due to the absence of other data. So for example, if to add a new book I also have to be able to enter the author's address and I don't have the author's address, then I'm not able to insert that data. That's an example of an insertion anomaly. And then a deletion anomaly is when we unintentionally lose data because we've deleted other data. So for example, we might lose an author's address because we deleted the author's only book that we had in the database. Now we may have wanted to delete the book but actually keep the author and author information in the database for future use. So that's an example of a deletion anomaly. So the way we avoid anomalies like that is we follow normalization rules. Now there are many normalization rules but really the most important and the most widely used are the first three. The first normalization rule which is called first normal form, basically states that each value in a column is an atomic unit. So it's going to be a particular number or a unit string or a Boolean or some scalar values, some unit like that that can't be broken down further. The second normal form or second rule of normalization states that any attributes or any column we have in a table is dependent on the key. So it's a function of that key. So there's nothing in the table that's not related to that particular key. That gets us to second normal form. And then third normal form says we don't have any transitive dependencies or there's some pieces that are dependent on something in the table, but not the key. And that something else might be, for example, an author's address is dependent on the author and the author depends on, for example, the book, if we had that kind of relationship, that would violate third normal form. So we want to avoid that kind of transitive dependency. Now, oftentimes when we visualize normalized databases, we have diagrams which show rectangles representing tables and then we have lines connecting the tables which represent relationships, and then we have some kind of indicator for what kind of relationship it is. So for example, in many cases we have one table that's like a primary table, for example, like an order table. And then we have like a secondary cable that has a lot of detail, like an order's item. So if you have an order and there's maybe 10 different books on that order, we'd have one order in the order's table and 10 rows or 10 order items in the order items table that would be a one to many relationship or one to N. And sometimes the relationship can be one to one or one to zero or one or one to zero or many. And those all fit, these are all allowed under the rules of normalization. And we often see this kind of modeling when we're working with OLTP or online transaction processing systems. Now OLTP systems, for example, typically have many reads and writes so they're constantly being updated. Data is written once but then it may be updated again frequently by many different processes. So you can imagine like an e-commerce application with users updating many orders at the same time. Now, oftentimes these kind of OLTB systems are normalized to third normal form. Sometimes per performance they're de-normalized slightly but typically they're still considered normalized. Now contrast that with analytical databases, these are used typically for data analysis. And here we have many reads by many processes, but typically with analytic databases, we're not updating say a single customer's address or a single customer's order, we might be reading a single order or a single customer record, but in an analytical database we might look at thousands of orders or thousands of customers. So our reads tend to span many rows, but have fewer columns that we actually include in our query. We have many writes and with batch processing, many of those writes are done all at once like bulk updates where a job will run and it'll start from end to finish until the large number of rows are loaded in. We may also have streaming data where a process is ingesting data in near real time and writing it to a database. That's different from this idea of having maybe thousands of different users doing small updates or small writes, here we're doing fairly large amounts of writes when we have batch processes or we're consistently doing writes if we have a streaming process. These analytical databases are often de-normalized. And we'll take a look in the next video as to why that's the case.

Contents