From the course: R Essential Training Part 2: Modeling Data

Predicting outcomes with linear regression


- [Instructor] Probably the most useful and flexible approach for working with data and trying to predict scores is linear regression. It's a general-purpose tool that allows you to use several variables, usually quantitative but they can be of any kind, to predict a single quantitative outcome. I want to show you a brief example of how this works in R. I'm going to start by loading a few packages, including one called GGally, which we'll use for a scatterplot matrix. Then I'm going to load the state data set and save it into an object called df, for data frame. You can see down here it's the same data set that we've used before, and there are many different variables in there. Now let's use ggpairs to get a scatterplot matrix of all of the variables in the data set. This sometimes takes a minute. Let's zoom in on that for a moment. What we have here are all the variables in our data set, from Instagram through modernDance, and it's showing us the density plot for each of the variables, and we've got scatterplots showing the relationships between the variables. We have a few with peculiar distributions, some with outliers, but let's zoom out for a moment and start by looking at one pair of variables in particular: museum and volunteering, which is near the bottom right of the scatterplot matrix. We'll get a scatterplot of just those two along with a regression line, and this is a standard linear regression. The blue line is the linear regression line that allows you to use levels of museum to predict levels of volunteering. Now, if we actually want to get the model for that, we're going to use a bivariate regression. I'm going to do this by selecting the two variables, volunteering and museum, and please note that when you do this you have to put the Y variable, the outcome variable, first and the X variable, the predictor variable, second, which is the opposite of what you do when you're making the graph. Then I'm going to feed that into lm and save it into fit1.
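The steps described so far might be sketched in R like this. The data file name is hypothetical, and the variable names (volunteering, museum) are taken from the transcript; your own columns may differ.

```r
# Sketch of the setup described above -- assumes a state-level
# data set with columns named "volunteering" and "museum".
library(tidyverse)  # dplyr, ggplot2, readr, %>%
library(GGally)     # ggpairs() for the scatterplot matrix

df <- read_csv("StateData.csv")  # hypothetical file name

# Scatterplot matrix of every variable (this can take a moment)
df %>% ggpairs()

# One pair: museum (x) predicting volunteering (y), with a fit line
df %>%
  ggplot(aes(x = museum, y = volunteering)) +
  geom_point() +
  geom_smooth(method = "lm")

# Bivariate regression: outcome column first, predictor second,
# then feed the two-column data frame straight into lm()
fit1 <- df %>%
  select(volunteering, museum) %>%
  lm()
fit1  # brief output: intercept and slope
```

Note that lm() accepts a bare data frame because R treats the first column as the response, which is why the outcome has to come first in select().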
Fit is a common term used for saving models in R, and since I'm creating more than one fit, I'm numbering them to tell them apart. So now you can see that that saved over here: we have fit1, and if we want to see the model, we just run fit1, and this is the very brief output that it gives us, the slope and the intercept. If you want to get the full regression table, we use summary. I'll zoom in on that one. Here we have the residuals, and we have the estimates, which we had up here, but now we also have the standard error, the t-value, and the probability value for the t-value, which lets us know that the slope is statistically significant. Then we have a little more information here about the residuals and about the R-squared. Specifically, the adjusted R-squared is .3497, so if you know a state's score on museum, you can account for approximately 35% of the variance in the states' scores on volunteering. Let's zoom back out. You can get some additional information. You can get confidence intervals for the coefficients by using confint; those are the 95% confidence intervals. You can get the predicted values of volunteering; we have 48 observations because there are 48 states in this data set. So those are the predicted values, and you wouldn't normally want to print all of those. You can also get prediction intervals for those, and when we zoom in on that, you can see we have a high and a low estimate for each one. And you can get regression diagnostics. We have measures of influence; zoom in on those. It's a fair amount of output for what is really a small data set, and you can get some other influence measures, and I'll zoom in on those, too. So you have a number of diagnostics that, depending on what you're doing, may be useful in your modeling.
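The follow-up commands just described could look like this, assuming fit1 is the bivariate model saved above:

```r
# Inspecting the fitted bivariate model (fit1 from above)
summary(fit1)                           # full regression table:
                                        # estimates, SEs, t, p,
                                        # adjusted R-squared
confint(fit1)                           # 95% CIs for coefficients
predict(fit1)                           # one fitted value per state
predict(fit1, interval = "prediction")  # fit plus lwr/upr bounds
lm.influence(fit1)                      # basic influence measures
influence.measures(fit1)                # fuller diagnostic set
```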
Now, you can also get diagnostic plots by simply taking the regression, fit1, and running it through plot, and what happens here is you actually get four different plots. You come down to the console and it says, "Hit return to see next plot," so I'm going to come down here and hit return. There are our residuals versus fitted values, there's the normal Q-Q plot, where we're pretty close, there's the scale versus location plot, and there are the residuals versus the leverage. These are several different ways of evaluating how well your data meet the assumptions of linear regression, but this is all for bivariate regression. I want to show you a more common procedure, multiple regression, where you have several predictors for a single outcome. Now, what I'm going to do here is a little bit of cleaning up and rearranging of the data set, which makes it easier to set up regression models in R. The first thing I'm going to do is take the outcome variable, which in this case will be volunteering, move it to the front, and then put everything else after it using everything(). So now we're going to run that command, and here's our data set, a little glimpse of it down here: volunteering is at the front and then we have all the other variables coming afterwards. Now, there are several different ways you can specify a multiple regression model in R. The most concise, the easiest, is when you have your data frame set up this way, with the outcome at the very front, so that everything else in there is going to be used as a predictor variable. All you have to do is feed that into lm, for linear model. You can even do the shorter version: lm and then, in parentheses, df. So let's run that one, and here you see the intercept and the slopes for each of the variables that are included as predictors in the model. There are other ways you can specify this same model. You can be specific about what the outcome is.
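A minimal sketch of the diagnostics and the rearrangement step just described (fit1 and the column names follow the transcript):

```r
# Four base-R diagnostic plots; press Return in the console
# to step through residuals vs. fitted, normal Q-Q,
# scale-location, and residuals vs. leverage
plot(fit1)

# Move the outcome column to the front so lm(df) "just works":
# first column = outcome, remaining columns = predictors
df <- df %>%
  select(volunteering, everything())

# Most concise multiple-regression call
lm(df)  # intercept plus one slope per remaining variable
```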
You write volunteering, then a tilde, then a dot to mean everything else in the data set, and then you specify the data frame. This will get you the exact same output; you can see that it's appearing twice here. Or you can spell out every element of the model. This is especially important if you have some variables in your data frame that you don't want to use. In this case you write lm, for linear model, then volunteering, the outcome variable. The tilde means "is a function of" or "as predicted by," then this variable plus this one, plus this one, and so on, and then you finish by specifying the data frame. When we do that, again, we get the exact same output. In fact, if we zoom out, you can see the same thing three times. Now we can save the model into fit2, meaning my second model of the data. We can show the model, and again you see the same output, but we can now get the summary table for the regression model by using summary. When we do that, you get some information about the residuals, and then for each of the slopes you get the estimate of the slope, the standard error, the t-value, and the probability value, or the p-value, for that t-test. You can see that, within the context of this single-entry multiple regression, two of the variables, Instagram and Facebook, have statistically significant slopes. Now, that doesn't mean those would be the same if we removed the other variables. Again, when you do a regression like this, these slopes are only valid in the context of all of the variables that you selected. But you can also see, down here, that our adjusted R-squared has gone all the way up to .6181: we can account for nearly 62% of the variance in volunteering by using these other variables as predictors. And this is the F-statistic, which does a significance test on the entire model, and not surprisingly, it's really statistically significant. Now, you can also get confidence intervals for the coefficients.
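The equivalent model specifications just walked through might look like this; the predictor names spelled out in the third call are illustrative, based on variables mentioned in the transcript:

```r
# Three equivalent ways to fit the same multiple regression,
# assuming volunteering is the first column of df
lm(df)                           # shortest: first column = outcome
lm(volunteering ~ ., data = df)  # "." = all other variables

# Spelling out predictors -- useful when you want to omit some
# (variable names here are illustrative)
lm(volunteering ~ instagram + facebook + museum, data = df)

# Save the model and get the full regression table
fit2 <- lm(df)
summary(fit2)  # slopes, SEs, t, p, adjusted R-squared, F test
```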
If you prefer those to hypothesis tests, there they are. And you can get the other measures, the prediction intervals, the predicted values, and the regression diagnostics, using the same commands that I showed you earlier for bivariate regression. All told, this lets you know that linear regression is really easy to set up in R, especially if you have your data properly organized, and the dplyr commands and the tidyverse make that part really easy. That facilitates the regression models, and that, in turn, helps you get some very quick, meaningful insight out of your data.