From the course: Tableau and R for Analytics Projects

Analyze regression variables for significance in R


- [Instructor] When you perform linear regression in R, the routine you call creates an equation that best fits your data using all of the variables you provide. Not all of the variables will be statistically significant, though, so it makes sense to remove the ones that your data identifies as insignificant. In this movie, I will show you how to analyze your regression variables and identify which ones are significant. I've continued the work that I did in the previous movie. What I did was import data into a data frame called rdata from a file called SalesData.csv in the Chapter03 folder of the exercise files collection, using the command that you see in the first red line on your screen. I then created two linear regression models, assigned them to the variables model1 and model2, and those variables are still active. I also displayed the coefficients for each of these two models, but by themselves, they don't tell you a lot. So in this movie, I will show you how to get a more complete summary of the models and how to interpret it. The command that you use to get more information about a linear regression model is summary. So I'll type summary and then, in parentheses, model1. A lowercase L and a one look very similar, but that is model1. Press Enter and let's take a look at the summary we got. You can see here that we have information about the minimum, the median, and so on; those figures describe the model's residuals, not the actual minimums or maximums within the dataset. Below that, if we look at the coefficients, we can see how significant each one is. The intercept is always significant, so you can ignore that; it's just a built-in part of the model. If we look at distance, we see that there are no asterisks, or stars, next to it.
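The steps narrated above can be sketched in R roughly as follows. The file path and the column names Sales and Distance are assumptions based on the narration; the actual names in SalesData.csv may differ.

```r
# Read the sales data into a data frame (path and column names assumed)
rdata <- read.csv("Chapter03/SalesData.csv")

# Fit a simple linear regression: sales predicted by distance alone
model1 <- lm(Sales ~ Distance, data = rdata)

# Display the full summary: residual quartiles, coefficients with
# significance stars, and the R-squared values discussed below
summary(model1)
```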
So that means, in this case, that the variable is not significant. The significance codes can be interpreted as follows: three stars means very significant; two stars, significant; one star, also significant but not quite as much. A period indicates that the variable is on the border of being significant, and if the space is blank, then the variable is not considered to be significant. Below that, we have the Multiple R-squared and the Adjusted R-squared. Multiple R-squared is 0.01269, and you interpret that to mean that the model we created explains about 1.269% of the variation in sales based just on distance. So practically none. And in fact, if you look at the Adjusted R-squared, it's negative, which tells you that this model is pretty much worthless. Now let's take a look at model2, which used both distance and order count, so two variables instead of just one. I'll type summary, then model2 in parentheses, press Enter, and we get a much different result. The intercept is still significant, of course. We see that distance is mildly significant, and we also see that order count is very significant: there are three asterisks next to it, and the probability that this value occurred by chance is very small. It's 8.91e-05, meaning 8.91 with the decimal point moved five places to the left, or 0.0000891. So the order count variable is very significant, and distance, combined with it, is significant as well. If we look down at Multiple R-squared, we see that the value is 0.6097. If we look at Adjusted R-squared, which is more relevant when you have multiple variables, that is 0.5638. So about 56.38% of the variation in sales is explained by the inputs to this model, distance and order count. But as we can see, order count provides the majority of the predictive power within this model.
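A minimal sketch of the two-variable model described above, again assuming the column names Sales, Distance, and OrderCount (the real column names in the exercise file may differ):

```r
# Fit a multiple regression: sales predicted by distance and order count
# (column names are assumed from the narration, not confirmed)
model2 <- lm(Sales ~ Distance + OrderCount, data = rdata)

# The summary reports a significance code for each term, plus the
# Multiple and Adjusted R-squared values quoted in the movie
summary(model2)

# Adjusted R-squared can also be extracted directly from the summary
summary(model2)$adj.r.squared
```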
