From the course: Data Science Foundations: Data Mining in Python

Handwritten digits dataset - Python Tutorial

Handwritten digits dataset

- [Instructor] And let me show you how this works in Python. Now, when we open up Python, I just want to point out one thing before I get started: you usually see it in a browser window if you're using Jupyter in the Anaconda distribution, and you have the menu bars from your web browser. I'm going to do two things to make the demonstrations work a little better. First off, I'm going to switch my browser to full screen. Then, in the Jupyter interface, there are a few other things I can do. I can come to View and toggle the header, that's the top there, and I can also toggle this toolbar. This gets me to the minimalist version of the Jupyter interface, and it's the one that is best for this demonstration, so it's what I'm going to be using throughout the course. Just be aware that you can bring back the toolbars and the browser menus if you want. But back to the handwritten digits dataset.

To do this, I'm first going to load several libraries and functions, and then I'm going to come down and load the dataset. Now, I've saved this as a local CSV file to make it much easier to deal with, and it's in the data folder for our course, so I'm going to open this data file. You can read it directly from the University of California Irvine Machine Learning Repository if you want: just uncomment these three lines, run that one, and you'll get the exact same thing. But let's take a look at the data very quickly. What we have are five rows of data here, zero through four, and then we have numbered columns. These are the pixel data, and the very last one is the class identifier.

So let's rename these columns so they're a little more informative. What I'm going to do is take the attributes, the ones that tell us how many pixels there are in a particular part of the image, and rename them, starting with p for pixels. And then the last one gets a lowercase y, which is what I call a class or outcome variable unless I have a reason to use something else. So I run this command. Also, the dataset includes the numbers zero through nine, and that makes sense, but to make our demonstration work a little faster, I'm going to restrict it to just the numbers one, three, and six. This command will restrict it to those numbers, and we'll take a look at the first few rows. Here you can see we have rows 4, 11, and 14, because these are the ones that have the six, the one, or the three in them. And so this is the dataset that we're going to be working with.

There is, however, one additional step we need to take, and that's because we're using machine learning algorithms, and it's most common in these situations to create a training dataset and a testing dataset. We have a large dataset, so we can make that split. I'll be using the train_test_split function and creating the X_train, y_train, X_test, and y_test datasets. This will split it up the way that we need. And now we can begin exploring the training dataset. That's the one we're supposed to be working with as though it's the only one we know, and then we later go to the testing dataset.

Let's start by looking at the first 20 images that are in the training dataset. This will set up a grid for the images, and then we'll actually plot them from the underlying pixel values. And so you can see, this is probably a one, a one, a one, a six. You see they're low resolution, and so it is going to be a challenge for the algorithm to correctly classify them.
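The course notebook isn't shown here, but the steps described above might look roughly like the following sketch. The file path data/optdigits.csv, the column names p0 through p63 and y, and the random_state are my assumptions, not the course's exact code; the UCI optical digits data has 64 pixel-count attributes plus a class column, which is what this assumes.

```python
# Minimal sketch of the steps described above; file path, column names,
# and split settings are assumptions, not the course's exact code.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Load the local CSV copy of the UCI optical digits data (assumed path);
# you could instead read it directly from the UCI repository.
df = pd.read_csv('data/optdigits.csv', header=None)

# Rename the 64 pixel columns p0..p63 and the class column y
df.columns = [f'p{i}' for i in range(64)] + ['y']

# Restrict to the digits 1, 3, and 6 to speed up the demonstrations
df = df[df['y'].isin([1, 3, 6])]

# Split predictors and outcome into training and testing sets
X = df.drop(columns='y')
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Plot the first 20 training images as 8 x 8 pixel maps on a 4 x 5 grid
fig, axes = plt.subplots(4, 5, figsize=(8, 7))
for ax, (_, row), label in zip(axes.ravel(),
                               X_train.head(20).iterrows(),
                               y_train.head(20)):
    ax.imshow(row.to_numpy().reshape(8, 8), cmap='gray_r')
    ax.set_title(str(label))
    ax.axis('off')
plt.tight_layout()
plt.show()
```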
Although it's pretty easy to tell that this is a three and this is a six, this one is a smudge and I can't tell what it is. But we can also explore the attribute variables in addition to the class variable. What we're going to do here is, just sort of at random or haphazardly, choose a few variables, p25, p30, p45, and p60, and look at the associations between them by creating a grid. What we have here are the distributions on the diagonal, where one is in red, three is in green, and six is in blue, and then we have scatter plots and two-dimensional density plots. The point here is that you can see there's separation between these things. It doesn't even really matter what the variables are, but you can see that there is separation. This is critical for our algorithms to work properly.

And so now that we've checked that we've got the data imported, we've renamed the columns, and we've done a quick visual check that there's going to be adequate separation, you can now save the data. Now, I've already saved it, so I don't need to repeat this one, but the .to_csv command will take each of the datasets we've created and save them as CSV files, so they're easy to import and get started with in the next step, when we start looking at the individual algorithms.
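Continuing from the variables in the previous sketch (X_train, y_train, and so on), the exploration grid and the final save might look something like this. The use of seaborn's pairplot and the output file names are my assumptions; the course may build the grid and name the files differently.

```python
# Hedged sketch of the exploration and save steps; seaborn's pairplot and
# the output file names are assumptions, not the course's exact code.
# Assumes X_train, X_test, y_train, y_test from the previous sketch.
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise grid of a few haphazardly chosen pixel variables, colored by class:
# densities on the diagonal, scatter plots off the diagonal
train = X_train.copy()
train['y'] = y_train
sns.pairplot(train, vars=['p25', 'p30', 'p45', 'p60'],
             hue='y', diag_kind='kde')
plt.show()

# Save each split as its own CSV so later videos can start from these files
X_train.to_csv('data/digits_X_train.csv', index=False)
X_test.to_csv('data/digits_X_test.csv', index=False)
y_train.to_csv('data/digits_y_train.csv', index=False)
y_test.to_csv('data/digits_y_test.csv', index=False)
```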
