From the course: Designing Highly Scalable and Highly Available SQL Databases

Choose a datastore: SQL, NoSQL, or analytical?

From the course: Designing Highly Scalable and Highly Available SQL Databases

Start my 1-month free trial

Choose a datastore: SQL, NoSQL, or analytical?

- [Instructor] Now as we consider designing a scalable database, one of the first things we need to decide is what type of data store we're going to use. Now, for decades, relational databases have been the workhorse of database applications, especially in enterprises and large organizations. But within the last decade or so, another class of data store has emerged called NoSQL databases. There's no single type of NoSQL database, in fact, there's at least three broad categories that we'll talk about. And then another recent addition to the data store landscape are analytical databases, which are designed for data warehousing and large scale analytics. Now, relational databases are designed for use with fixed schemas. Now that means that the information we're capturing is fairly well structured, it's well defined, it's fairly static. And most of the entities that we're describing all need to be described with virtually the same attributes. When that's the case, we can organize the data in highly structured rows and tables, and that provides a lot of advantages. One of the biggest is that we can use a standard query language called SQL. And SQL is quite powerful, and if you are not familiar with it, I'd recommend looking through the catalog for other courses on SQL. Also relational databases have long supported ACID transactions, which are really important for business applications in particular. So that just ensures that a logical business operation either completes or fails, but there's never partially done and partially undone. So for example, you wouldn't want to make a transfer from one account in your bank to another account, and have the money withdrawn from one account, but not posted to the second account. That kind of logical transfer is a transaction from a database perspective, and either both operations succeed or both fail. Now, some NoSQL databases are supporting ACID transaction, and that will probably grow, but that is something that relational databases have had from the beginning. And just as an example of a relational database is PostgreSQL, which is a widely used open source database. There are many others, MySQL is a popular open source database, also SQL Server from Microsoft, Oracle database, those are widely used as well. As I mentioned, NoSQL databases have sort of emerged in the last decade or so, and they're really nonrelational, and they were designed in response to the limitations of relational databases. Basically what happened was, people were adding more and more data to relational databases, and we were starting to hit performance walls. So people took a look and decided, "Well, what can we trade off in terms of features in relational databases, in such a way that we get the performance we want, but we lose certain features?" Well, out of that kind of questioning and exploration came three types of NoSQL databases, document databases, wide column databases and graph databases. Now, document databases are designed for semistructured data. So if you're familiar with JSON and the idea of having key value pairs, and then some of those values can themselves be structured items like other JSON structures or arrays, then you're familiar with the organization of document databases. So it's a hierarchical structure, but it can vary from one entity to another. Another thing that is different from relational databases is that in document databases, we store related information together. So for example, a customer and their address may be stored together in the same JSON structure, whereas in our relational database, we'd be more likely to have customers modeled in one table and addresses in another. A popular document database is MongoDB. Again, there are others like Couchbase, Cloud Fire Store in Google Cloud are also document databases. Wide column databases are another type of NoSQL database. Again, they're semistructured, so the set of attributes that we track across entities and across instances of entities can vary, and it's modeled after something known as a sparse multidimensional matrix. So if you think of a table like an Excel spreadsheet and then think of multiple sheets, so it's like almost like in 3D. Well, that's a three dimensional matrix. Well now imagine you can do that in many more dimensions, that's what a multidimensional matrix is. Now, by sparse we mean that oftentimes many of the attributes aren't actually filled in, we don't actually have values for them. And so people take advantage of that fact to make highly efficient storage structures. Now, again, like with document databases, we typically store related information together, because we don't work with joins. Now, sometimes we group columns together. So for example, columns that are related to, say customer, are grouped together, columns that are related to address are grouped together, and those are grouped into something called column families. And rows are defined as a set of column families. Now, Cassandra is a example of a wide column database. Google Bigtable is another example. The third type of NoSQL database is a graph database. Now graph databases can be structured or semistructured, there's sort of a spectrum. And so it's really useful for working with both. And here we're basically modeling the data store on a directed graph. And you can think of that as a set of nodes that have connections or links between them. So for example, a directed graph could model cities and roads that go between cities. And when we talk about directed graphs, we often talk about nodes, which for example, would be a city and edges, which might model a road. And those nodes and edges themselves have properties. So a city might have a name, a population. A road might have maximum speed, minimum speed, certain other descriptive properties. A widely used graph database is Neo4j. And then finally there are analytical databases, these are structured databases. So they're fairly fixed structures, and they may use SQL, but they're not relational. So for example, Google BigQuery is an analytical database. It's structured, it uses tables, it has SQL, but it's not a relational database. And that's because while it's designed for large volumes of data, it doesn't have a lot of the features of traditional relational databases. For example, it relies on massively parallel scans, there are no indexes, and it's not designed for transactions. So you wouldn't use an analytical database say for an eCommerce site, analytical databases are structured and look very similar to relational, but they're designed are optimized for doing large scale aggregate data analysis. So just to summarize our data store options, we typically use relational for highly structured transactional data. Now, one of the challenges there is it can be difficult to scale, right? And that was the motivation for NoSQL databases. The focus on this course is going to be on relational databases, how do we make them as scalable as possible? NoSQL is a good option when you're dealing with semistructured data. When you can deal with something called eventual consistency. So remember we talked about ACID transactions, and part of that is having consistent reads. With NoSQL, one of the things that NoSQL databases do is it replicates data, so there are multiple copies of data, and that makes it more efficient to do very large numbers of reads. However, that does create a situation where there may be times where copies are out of sync. So for example, one replica might be updated at time one, but another replica might not be updated until time three, but at time two, another replica was read, but that data was not consistent with what was the latest data. So eventually all of the replicas have the same data. And so if you can tolerate that delay, then NoSQL databases may be a good option for you. And analytical databases, again, those are designed for structured data, but really the focus there is on how much data you're using and how you're using it. We're no longer in the realm of transactions and updating a small number of rows. With analytical databases, they tend to be read intensive and focused on reading large amounts of data and doing like aggregate operations or analysis on them.

Contents