From the course: R Essential Training Part 2: Modeling Data

Data science with R: A case study


- [Instructor] Let's begin our hands-on discussion of R with a case study. I want to give you an idea of the sorts of things you can learn to do in this course. This is a continuation of the case study from part one of our Essential Training, which was about wrangling and visualizing data. In this half I'm going to do additional visualizations, build statistical models, and use some machine learning methods, all to give you a flavor of what's possible.

One thing before we get started: I tend to write my code a little differently, in that I use a very narrow window and organize things vertically. You can see that right here. I do that for a few reasons. First, I'm a teacher, and this lets me comment individual lines much more cleanly. Second, because I'm filming, I use a narrow window; I only have 1,280 pixels of width, so I keep to 60 characters per line. Most people use 80 or 100, something much broader; use whatever works well for you. Third, I just think it makes the code a little easier to follow and see what's happening. So while I organize things vertically, you can set it up however you want.

I'm going to start by loading some packages. If you don't already have pacman installed, I find it very useful; you can install it with this command. Once it's loaded, you can use it to load all of these other packages: if you already have them it will simply load them, and if you don't, it will install them and then load them. So I'm going to run that command.

Then I'm going to import a data set, the Big Five, which we saw in part one. It contains nearly 20,000 observations on eight variables: a person's age, their gender, whether they speak English as a native language, and five personality variables: extroversion, neuroticism, agreeableness, conscientiousness, and openness to experience.

I'm going to use openness as the example outcome, so let's start by looking at its distribution. I'll pull the variable, get a boxplot, and calculate the median. When I do that and zoom in on the plot, you can see that the median is right around four, the scores go up to the maximum of five, and anything below about two and a quarter is flagged as an outlier. I check the median because I want to create a dichotomous version of this variable as well.

In fact, that's what I'm going to do right here. I take the data frame and, using the compound assignment pipe, create a new variable and add it to the data set. The new variable is called open_t, for "text" or "two categories," and it simply checks whether a person's score on open is greater than or equal to four: if it is, the value is "high"; if not, it's "low." Then we convert it to a factor, add it to the data set, and print the data again. If we zoom in at the bottom, you can see this additional variable at the end. Then I'll zoom back out.

Now let's start exploring the data, which is easiest to do visually. Let's get a bar chart of open_t, the text version with two values, using ggplot and geom_bar. Because this was a median split, we expect the two bars to be about the same height, and they're very close.
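Here is a minimal sketch of the setup steps just described. The file path and the column names (open and so on) are assumptions for illustration, not the course's exact script.

```r
# Sketch of the setup steps; file path and column names are assumed
# install.packages("pacman")   # run once if pacman is not yet installed
pacman::p_load(magrittr, tidyverse, GGally, tictoc, caret, rpart, randomForest)

# Import the Big Five data (hypothetical file name)
df <- read_csv("data/b5.csv")

# Distribution of openness: boxplot and median
df %>% pull(open) %>% boxplot()
df %>% pull(open) %>% median()

# Median split: add a dichotomous high/low version of openness
df %<>%                                   # compound assignment pipe from magrittr
  mutate(open_t = ifelse(open >= 4, "high", "low") %>% as_factor())

# Bar chart of the two openness groups
df %>% ggplot(aes(x = open_t)) + geom_bar()
```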
We can also get a scatterplot matrix of all the variables using the ggpairs function. This is a very rich graphic, and it takes a little while to draw. I'm actually going to run three commands at once, because I'm also using the tictoc package, which lets you time how long commands take. I'll run them in order; it normally takes about 15 seconds to calculate and another 10 seconds to appear, and you can see a progress report here on the left. We're going to use this as a guide for the overall visualization and analysis. Now it's done calculating, and it will be displayed in just a moment.

Now that the graph is there, I'll zoom in on it; it takes a few seconds to adjust to the new resolution. What we have is a matrix of the variables. We start with open_t, the dichotomized version of openness, then open, extroversion, neuroticism, agreeableness, conscientiousness, age, gender, and whether a person speaks English as a native language, which is yes or no. Those variables are listed across the top and also down the side. ggpairs first shows the distribution of each variable on the diagonal: the two bars for open_t, density plots for openness, extroversion, and the other quantitative variables, then bar charts for gender, where we have about 50% more female respondents than male, and for native language, where we have about twice as many native English speakers as non-native speakers.

ggpairs then draws different kinds of plots depending on the combination of variables. Over here we have histograms broken down by the high or low score on openness, and you look to see whether there are any major differences. Down here we have grouped bar charts, where we're looking for patterns in the differences. Up at the top we have grouped box plots, and in these cells we have the correlation coefficients that correspond to the scatterplots below the diagonal. This is a lot of information, and you can tell there's nothing enormously obvious going on, so we'll do some more focused analyses. I'll zoom this one back out.

Now let's get some summary statistics by taking the entire data frame and using the summary command. If we zoom in on the console, we have several descriptive statistics for the quantitative variables: the minimum, first quartile, median, mean, third quartile, and maximum, along with the frequencies of the categories within the categorical variables. That gives us a good idea of what we're dealing with, and it's a nice way to follow up the visual analysis we did a moment ago.

Now we can start doing some statistical analyses, because that's one of the emphases of part two of our Essential Training. A very simple one is a t-test. If we compare extroversion across the high- and low-openness groups, we find that the difference is statistically significant; the value of t is very large, although the difference between the two groups is fairly small. The mean extroversion for people high in openness is 3.13, and the mean for people in the low-openness group is 2.89, so it's a few tenths of a point of difference.
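A short sketch of those exploration steps, under the same assumed column names; "extrav" is a placeholder for the extroversion variable.

```r
# Scatterplot matrix, timed with tictoc
tic("ggpairs")
df %>% ggpairs()
toc()

# Summary statistics for every variable
df %>% summary()

# t-test: extroversion by high/low openness group ("extrav" is a placeholder name)
t.test(extrav ~ open_t, data = df)
```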
However, when you're working with a very large data set, even small differences can add value and meaning to help guide your predictions. Instead of running a whole collection of t-tests, which runs into problems with repeated, non-independent tests, let's do a linear regression on the full data set. To do this, I'll save an object called fit_lm; "fit" is a very common name for a saved model, and I add "lm" for linear model. We take the data frame, select the variables we want, and run the regression command. The model is saved over here, but let's get the model summary by feeding it into the summary function. We get the various coefficients, a t-test for each, and the corresponding probability, and we find that nearly all of the predictors are statistically significant; again, not a surprise when you have almost 20,000 cases. We can also get diagnostic plots with the plot command; I'll have separate videos on each of these procedures with more information about what each one does and means. These include normality and residual plots for the values in the regression equation.

Now let's go from a basic regression model, of which there are several variations, to some of the more advanced methods: the machine learning approaches we'll cover in this course. To do this, I'm going to use the very common approach of splitting the data into training and testing sets, where you explore and build your model on the training data and then, once you've settled on a model, test it on the data you set apart. This is a way of maximizing generalizability. First I set a random seed, which makes the random processes repeatable; you can pick any number, and I've used 333. If you want to save time, you can take a random subsample, maybe 1,000 or 500 cases; just uncomment this line and run it if you want. I split the data so 70% goes into the training set, and for the test set we simply use an anti_join, which gives everything else to the test data. You can see how those two data sets are now formed over here.

We'll start with kNN, a k-nearest neighbors model for classifying data. I save some parameters, run the model with those parameters, and then fit it to the training data, which can take a couple of minutes. I paused the recording while this was running, and even though tictoc says it took a fraction of a second, it was two or three minutes. That's when running the full training set of roughly 13,000 cases; it's quicker when you start with a subset. Now let's apply this model to the training data just to see how well it's working. We save a predicted value for openness, then get an accuracy by running table, giving it the actual class, the outcome, as well as the predicted class, and feeding that into a confusion matrix. When I run that, over here you can see how many cases fall into each combination of actual and predicted class, and we have an overall accuracy of about .6247, so about 62% were correctly classified. This is a little different from the last time I ran it, when I had about 61.8%. But that's on the training data; the important question is how well it works on the testing data.
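Here is one way the regression, the train/test split, and the kNN step could look. The predictor names, the cross-validation settings, and the preprocessing choices are assumptions added for illustration; only the seed, the 70/30 split, the anti_join, and the kNN method come from the narration.

```r
# Linear regression on the quantitative openness score
fit_lm <- df %>%
  select(open, extrav, neurot, agree, consc, age, gender, engnat) %>%
  lm(open ~ ., data = .)       # "." in the formula means all other columns
summary(fit_lm)                # coefficients, t-tests, p-values
plot(fit_lm)                   # diagnostic plots for the residuals

# 70/30 train/test split
set.seed(333)                  # make the random split repeatable
# df <- df %>% sample_n(1000)  # optional: subsample to save time
train <- df %>% sample_frac(.70)
test  <- df %>% anti_join(train)   # everything not in train

# k-nearest neighbors via caret; control and preprocessing are illustrative
ctrl <- trainControl(method = "cv", number = 10)
fit_knn <- train(
  open_t ~ age + gender + engnat + extrav + neurot + agree + consc,
  data       = train,
  method     = "knn",
  preProcess = c("center", "scale"),
  trControl  = ctrl
)

# Accuracy on the training data
open_p <- predict(fit_knn, newdata = train)
table(predicted = open_p, actual = train$open_t) %>% confusionMatrix()
```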
So now we come down and do this again with the testing data. We don't have to compute the model again; we simply apply it. Now we see an accuracy of about 57.6%. It's lower, which is what you expect when you take a model built on one data set and apply it to new data, but it's still holding up and it's better than chance.

Another method is a decision tree, which is a way of graphically laying out the process by which you decide whether a case goes into high openness or low openness. To do this, I use train, and down here I specify rpart, for recursive partitioning, and run it on the full training data set. That took about two seconds; I didn't pause. Now we can get the processing summary; we saved the model as fit_dt, for decision tree. This gives us information about accuracy and about kappa, which is a chance-corrected version of accuracy. Let's describe the model, which gives us the collection of decisions it uses, and we can actually plot it; this is one of the great things about decision trees. The plot shows the decisions we have to make: start by determining whether a case is at or above 3.7 on extroversion. If yes, you predict high openness, and about 62% of those cases are in fact high. If they're lower, go check agreeableness: if they're higher on agreeableness, you predict high openness; otherwise you predict low. These are the only two particularly useful predictors in this data set. We can also look at a classification table like we had before, and we see about 58% accuracy. Let's come down and apply the model to the testing data: we take fit_dt, the fitted decision tree, apply it to the test data, calculate the predicted value of open, and then get the table with the confusion matrix. When we zoom in, we see 57.76% accuracy this time around.

Now, one of the things about decision trees is that they're a little sensitive to exactly how you split the data. One way around that is to get into what you might call full-scale machine learning and use a random forest, which computes many decision trees using random splits on the variables, creating a forest of randomly built decision trees. To do this, I first specify some of the parameters I'm going to use, and then I train the forest. This takes about 10 or 12 minutes on my computer, so I'll pause while it runs; maybe you can go get a cup of tea. We're going to ask it to build 300 decision trees; you could do more, 500, 800, or 1,000, but 300 is enough to get a feel for what's happening. I'll run that command now, and after pausing for about 14 minutes, we have a computed random forest. It doesn't have much to show us; there's no picture of this thing, but we can get a processing summary that describes the steps it took, along with the accuracy and kappa, the chance-corrected accuracy. We can also plot accuracy by the number of predictors, and what this tells us is that the greatest accuracy came with one predictor and accuracy diminished from there, though note that the scale does not run from zero to 100; it runs from roughly 57% to 59%.
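A rough sketch of these tree-based steps under the same assumed names. Plotting the rpart model with base plot and text is one option, not necessarily the helper used in the course, and the control settings are illustrative.

```r
# Apply the trained kNN model to the held-out test data
open_p <- predict(fit_knn, newdata = test)
table(predicted = open_p, actual = test$open_t) %>% confusionMatrix()

# Decision tree via recursive partitioning (rpart) through caret
fit_dt <- train(
  open_t ~ age + gender + engnat + extrav + neurot + agree + consc,
  data   = train,
  method = "rpart"
)
fit_dt                          # processing summary: accuracy and kappa
fit_dt$finalModel               # the fitted decision rules
plot(fit_dt$finalModel)         # draw the tree
text(fit_dt$finalModel)         # label the splits

# Decision-tree accuracy on training and testing data
table(predicted = predict(fit_dt, newdata = train),
      actual    = train$open_t) %>% confusionMatrix()
table(predicted = predict(fit_dt, newdata = test),
      actual    = test$open_t) %>% confusionMatrix()

# Random forest with 300 trees, also through caret (slow on the full data)
ctrl_rf <- trainControl(method = "cv", number = 5)
fit_rf <- train(
  open_t ~ age + gender + engnat + extrav + neurot + agree + consc,
  data      = train,
  method    = "rf",
  trControl = ctrl_rf,
  ntree     = 300               # passed through to randomForest()
)
plot(fit_rf)                    # accuracy by number of predictors sampled
```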
But let's see how it works with the training data; we're going to pull out this information here. The important number is the out-of-bag error rate. Out-of-bag error is estimated on the cases each tree left out when it was built, so you can think of it as an average or overall error; if you take the flip side of that, you get about 59% accuracy, which is what we see right there. We can also plot error by the number of trees by taking that final model and running it through the generic plot command; let me zoom in on that one. There are lines for the two categories, one for high openness and one for low openness, and the black line down the middle is a combined estimate across the two. This plot shows error, so again, subtract it from 100% to get accuracy.

But let's see how well this model fits the data. We create a variable and get a table on the training data; zooming in, we see about 71% accuracy. Again, though, that's training; the important thing is how well this generalizes to the testing data. So let's come back out and apply fit_rf, the model we created with the random forest, to the test data and look at that accuracy. What we see is about 58.5% accuracy: lower than the roughly 71% we had with the training data, but it's the testing data that matters.

When we come back and look at these predictions, we have a total sample size of 18,837, split 70% into training and 30% into testing, and when we look at how things worked out, we find this. These results are built on models I created earlier, so the numbers will be a little different from what we have right now, but you can see that while the models varied a fair amount on the training data, all of them had 58 to 59% accuracy on the test data. So these are several different methods you can use to build models that predict an outcome that matters to you, and to see how well they generalize. That's the entire approach: preparing our data, visualizing it, fitting a regression model, and then using machine learning predictive techniques like k-nearest neighbors, decision trees, and random forests to get additional, more reliable insight out of your data. As we go through the course, I'll give you more details on each of these and how you can use them in your own projects.
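And a short sketch of those evaluation steps for the random forest, again under the assumed object and column names.

```r
# Out-of-bag estimates from the final random forest
fit_rf$finalModel               # OOB error rate; accuracy is 100% minus this
plot(fit_rf$finalModel)         # error by number of trees, per class and overall

# Accuracy on the training data
table(predicted = predict(fit_rf, newdata = train),
      actual    = train$open_t) %>% confusionMatrix()

# Accuracy on the testing data (the number that matters)
table(predicted = predict(fit_rf, newdata = test),
      actual    = test$open_t) %>% confusionMatrix()
```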
