From the course: Amazon Web Services: Data Analytics

Query AWS public datasets - Amazon Web Services (AWS) Tutorial


Query AWS public datasets

- [Instructor] In this movie we're going to consider using public datasets to enhance our business data for analytics. Amazon now offers services that let us cheaply try out querying data that we might not understand as well as our business data, and that we might not be sure is going to add value. These include Redshift Spectrum, for aggregate queries, and Athena, for SQL queries. We now want to look at the process for working with public data. The first part is finding the datasets. With my various customers, whether they're government entities, education, or biomedical, I'm finding that there are repositories of data around various industries. But to keep it simple, we're going to look at Amazon's repository, because it gives instructions on what the data looks like and which service might be best for working with it, which can be very helpful. Then, once we're set up, we're going to look at querying the dataset, and then using the dataset.
A tip that I'll give you is that a lot of these repositories are really huge. Even though you might be using a serverless service such as Athena, which charges only by the query, you want to start out small. You don't want to query the whole dataset, because it might not be useful. I usually limit it to a few hundred, a few thousand, maybe even ten thousand rows. But I look at the size of what I'm querying and estimate the cost.
So this is the main website for Amazon's public datasets, and you can see that they're grouped by vertical. We have geospatial and environmental datasets; a lot of these have to do with mapping or satellite pictures. Then if I scroll down, there are genomics and life science datasets. These are being used as reference data for looking at genomic variants for personalized medicine and other research.
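
To make the earlier tip about starting small and estimating cost concrete, here is a minimal Python sketch. The price per terabyte scanned and the per-query minimum are assumptions written in as constants, not figures from this movie, so check the current Athena pricing for your region. The function names and the table name are hypothetical.

```python
# Assumed pricing values (not from the transcript); verify before relying on them.
PRICE_PER_TB_USD = 5.00               # assumption: on-demand price per TB scanned
MIN_BYTES_PER_QUERY = 10 * 1024 ** 2  # assumption: 10 MB minimum billed per query
BYTES_PER_TB = 1024 ** 4


def estimate_query_cost(bytes_scanned: int) -> float:
    """Estimate the cost in USD of one query from the bytes it scans."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    # Bill at least the assumed per-query minimum.
    billable_bytes = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable_bytes / BYTES_PER_TB * PRICE_PER_TB_USD


def preview_query(table: str, row_limit: int = 1000) -> str:
    """Build a small exploratory query that returns at most row_limit rows."""
    if row_limit <= 0:
        raise ValueError("row_limit must be positive")
    return f"SELECT * FROM {table} LIMIT {row_limit}"


if __name__ == "__main__":
    # Hypothetical table name, and a 250 GB scan as an example size.
    print(preview_query("public_dataset_table", 1000))
    print(f"${estimate_query_cost(250 * 1024 ** 3):.2f}")
```

A LIMIT clause caps the rows returned, but the amount of data scanned can still depend on how the dataset is laid out, which is why the size check matters alongside the row limit.
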
And we have datasets for machine learning, and this section also has datasets that store entire libraries or views of the sky. What they have in common is that they're really huge. The last section is regulatory and statistical data. This is really just a subset of the financial data that I've worked with.
Another site that I'll bring up here, because I've used it so frequently, is Quandl. Quandl has a wealth of financial data, in two sections: a free section and a premium section. I have many clients that have worked with Quandl data. What's neat about Quandl is that you don't have to set up any Amazon service to browse the data. You can look at a subset of most of the data. For some of the premium data you have to pay a small fee, but you can set this up with the free data. And you can see, if I want to look at Wiki End of Day stock prices, I just click here, and I can preview. And they have the different file formats here.
Now let's get back to the Amazon world. To take a look at one of these datasets, let's click on USAspending.gov on AWS. This is government spending. And here, like I said, Amazon helpfully tells you how they recommend accessing this data. They've stored the information as a snapshot that you can restore into an RDS instance. Now, you could go through the console and set this up, but because we have the CLI, it's a little bit quicker to just run the CLI command. So I'm going to scroll down, and you can see we're going to restore from the snapshot and create an instance. This is going to take a couple of minutes, so let's go back and look at what we're going to do with it once it's created. We're going to use our client, that's Navicat, which you might remember from earlier movies, connect to the instance, and then we can explore our data.
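
The restore-from-snapshot step could look roughly like the following Python sketch, which only assembles the AWS CLI command and prints it rather than running it. The instance name and the snapshot identifier are placeholders, not values from this movie; the real snapshot identifier comes from the dataset's page.

```python
import shlex
from typing import List


def build_restore_command(instance_id: str, snapshot_id: str) -> List[str]:
    """Assemble the AWS CLI call that restores an RDS instance from a snapshot."""
    if not instance_id or not snapshot_id:
        raise ValueError("instance_id and snapshot_id are required")
    return [
        "aws", "rds", "restore-db-instance-from-db-snapshot",
        "--db-instance-identifier", instance_id,
        "--db-snapshot-identifier", snapshot_id,
    ]


if __name__ == "__main__":
    # Both values below are hypothetical placeholders.
    command = build_restore_command(
        "my-public-data-db",
        "<snapshot-identifier-from-the-dataset-page>",
    )
    # Print a shell-safe version to review before running it yourself.
    print(shlex.join(command))
```

Keeping the command as a list makes it easy to pass to `subprocess.run` later, once the placeholders are replaced and you have confirmed the region and credentials the CLI is using.
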
Now, we might have to open our port to allow traffic, so I'll check that and let you know if I needed to update it for this implementation once it's set up. And here we are in the RDS Dashboard, and the name of my new instance is mytestdb-cli. Notice it's using the PostgreSQL engine, so we'll wait for it to be created. All right, our instance is now available, so we're going to click into it, scroll down, and see, here is our endpoint. We're going to copy this, and then go back over to our directions.
Now we're going to use our client: add a new connection to Amazon RDS for PostgreSQL, call it demo public data, paste in the endpoint, copy the database name, enter the user name "root" and the password "password", and click Test Connection. Click OK, and click Save. Right-click to open the connection, right-click the data_store_api database to open it, right-click the public schema, click Open Schema, and you can see here is our data, which we can now query using regular relational techniques.
One tip I'm going to give you before I close out this movie is that you will need to set up a security rule to access this, so let me show you what that looks like. In the RDS console, scroll down to the security groups and click this link, which opens the security group. You'll need to click Edit and then add an inbound rule for PostgreSQL, and set the source to your IP address. Of course, this is mine, so yours will differ.
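
The same inbound rule can also be added from the CLI instead of the console. The Python sketch below assembles that command without running it. It assumes the instance listens on the default PostgreSQL port, 5432, and the security group ID and IP address shown are placeholders (the address is from a range reserved for documentation).

```python
import ipaddress
from typing import List

POSTGRES_PORT = 5432  # assumption: the instance uses the default PostgreSQL port


def build_ingress_command(group_id: str, client_ip: str,
                          port: int = POSTGRES_PORT) -> List[str]:
    """Assemble the AWS CLI call that allows one IPv4 address in on one TCP port."""
    if not group_id:
        raise ValueError("group_id is required")
    address = ipaddress.ip_address(client_ip)  # raises ValueError if malformed
    if address.version != 4:
        raise ValueError("this sketch only handles IPv4 addresses")
    cidr = f"{address}/32"  # /32 restricts the rule to exactly one address
    return [
        "aws", "ec2", "authorize-security-group-ingress",
        "--group-id", group_id,
        "--protocol", "tcp",
        "--port", str(port),
        "--cidr", cidr,
    ]


if __name__ == "__main__":
    # Hypothetical group ID and documentation-range IP; substitute your own.
    print(" ".join(build_ingress_command("sg-0123456789abcdef0", "203.0.113.10")))
```

Restricting the rule to a single address, as in the movie, keeps the database closed to everyone else; a wider CIDR range would work the same way but expose the instance more broadly.
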