15 Deeper Linear Regression
Let’s chat about why understanding linear regression is so important.
While there always seems to be something new, cool, and shiny in the field of AI/ML, classic statistical methods that leverage machine learning techniques remain powerful and practical for solving many real-world business problems.
Let’s look at a very simple model first. For this example, we will need to import the Introduction to Statistical Learning package (ISLR). We will use the Credit data set that is part of the ISLR package.
lm() is the function we use to create linear regression models. Before we discuss interpreting the results we get from this function, let’s break down the different parts of the model call. The ~ (tilde) symbol is the key to the entire formula: we are telling R to predict whatever is on the left side of the tilde using the variables on the right.
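Here is a minimal sketch of how the model is built. (The library() and attach() calls are our assumption, made so that Balance, Limit, and Ethnicity can be referenced by name, matching the Call line in the summary output below.)

library(ISLR)   # load the package containing the Credit data set
attach(Credit)  # assumed setup: lets us reference columns by name
M1 <- lm(Balance ~ Limit + Ethnicity)  # predict Balance from Limit and Ethnicity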
15.1 Interpretation of the Model
Let’s run a summary on this model and see what we get.
summary(M1)
#>
#> Call:
#> lm(formula = Balance ~ Limit + Ethnicity)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -677.39 -145.75   -8.75  139.56  776.46 
#>
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)        -3.078e+02  3.417e+01  -9.007   <2e-16 ***
#> Limit               1.718e-01  5.079e-03  33.831   <2e-16 ***
#> EthnicityAsian      2.835e+01  3.304e+01   0.858    0.391    
#> EthnicityCaucasian  1.381e+01  2.878e+01   0.480    0.632    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 234 on 396 degrees of freedom
#> Multiple R-squared: 0.743, Adjusted R-squared: 0.7411
#> F-statistic: 381.6 on 3 and 396 DF, p-value: < 2.2e-16
There is a lot of statistical jargon in our summary that may be unfamiliar to those who have not taken statistics before. That is okay, because we are going to break down the main statistics we are interested in. Let’s start with our variables and their significance in the model.
15.1.1 P-Values
The p-value helps us evaluate the null hypothesis of our test. In the case of this class, the null hypothesis is that there is no relationship between the variables we are using to make the predictions and the variable we are predicting. In other words, the smaller our p-value, the stronger the evidence of a relationship between our variables. When we run a summary of our linear regression model, we are given multiple p-values.

First, under Coefficients, a p-value is listed for each variable. This can help us refine our model, because we can see which variables are helping the model versus those that may be hindering its performance. Also notice the asterisks next to the p-values: R kindly puts up to three stars next to each variable to help us visually tell whether it is significant; more stars means a lower p-value and thus stronger evidence of a relationship. The second place we see a p-value is at the bottom of the summary. This one comes from the F-test and tells us whether the model as a whole is significant. As we see in this case, the p-value for our model is < 2.2e-16; that is a tiny number and frankly a great p-value. Typically we want our p-value to be .05 or smaller, which corresponds to a 95% confidence level.
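If we want the per-variable p-values programmatically rather than reading them off the printed table, they are stored in the coefficient matrix of the summary object. A short sketch using our model M1:

# pull the p-value column out of the coefficient table of summary(M1)
coef(summary(M1))[, "Pr(>|t|)"]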
15.1.2 Multiple R-Squared
R-squared tells us how well our model explains the variance in the dependent variable. In other words, how much of the change in the variable we are predicting is accounted for by the model? The higher the r-squared, the better the data fit the model. The maximum value r-squared can take is 1.

In our model’s case, we have a multiple r-squared of .743, which means our model explains approximately 74.3% of the variance in Balance. (Note that this is not the same thing as the model being 74.3% accurate.) Our r-squared could certainly be better. How high is “good enough” depends on the field, but for many real-world prediction tasks you would typically aim for an r-squared above .9.
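Both r-squared values are stored in the summary object, so we can pull them out directly rather than reading the printed output. A quick sketch with M1:

summary(M1)$r.squared      # multiple r-squared (about 0.743 here)
summary(M1)$adj.r.squared  # adjusted r-squared (about 0.7411 here)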
15.2 Applying the Model to Make Predictions
This type of regression is referred to as linear for a reason. If we were to visualize our model on a Cartesian plane, we would see a line of best fit traveling through our data. This means we can simplify the model to fit the slope-intercept equation:

y = mx + b

In our case, each independent variable has its own slope: its coefficient. The prediction is the sum of each slope times the value of its variable, plus the intercept provided by the model summary. If we modify the equation to be more applicable to our situation, we get something like this:

y = m1x1 + m2x2 + … + b
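Plugging the estimates from our summary into this form, the fitted equation for M1 is approximately:

Balance = 1.718e-01*Limit + 2.835e+01*EthnicityAsian + 1.381e+01*EthnicityCaucasian + (-3.078e+02)

The slopes and intercept can also be retrieved directly with coef(M1).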
Let’s look back at our example model from before:
M1 <- lm(Balance ~ Limit + Ethnicity)
summary(M1)
#>
#> Call:
#> lm(formula = Balance ~ Limit + Ethnicity)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -677.39 -145.75   -8.75  139.56  776.46 
#>
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)        -3.078e+02  3.417e+01  -9.007   <2e-16 ***
#> Limit               1.718e-01  5.079e-03  33.831   <2e-16 ***
#> EthnicityAsian      2.835e+01  3.304e+01   0.858    0.391    
#> EthnicityCaucasian  1.381e+01  2.878e+01   0.480    0.632    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 234 on 396 degrees of freedom
#> Multiple R-squared: 0.743, Adjusted R-squared: 0.7411
#> F-statistic: 381.6 on 3 and 396 DF, p-value: < 2.2e-16
We see that our Limit variable has an estimate of 1.718e-01; this is its slope. When dealing with quantitative variables, we simply multiply the slope by the value of the independent variable. So, if we wanted to find the balance of someone with a limit of 400, we would multiply 1.718e-01 by 400.

With qualitative variables, in this case Ethnicity, R creates a 0/1 dummy variable for each level except the baseline (here African American, which is folded into the intercept). We multiply the estimate of the level that applies (TRUE) by 1 and the other levels (FALSE) by 0, thus cancelling them out. The encoding is shown in the sketch below.
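To see exactly how R encodes Ethnicity into those 0/1 dummy columns, we can inspect the model matrix. A small sketch:

# each row shows the dummy columns R built for the factor levels of Ethnicity
head(model.matrix(M1))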
Let’s look at an example. If we used our above equation to predict the balance of someone who was Caucasian and has a credit limit of 500, here is the equation we would set up:
y = (1.718e-1*500) + (1.381e+01*1) + (2.835e+01*0) + (-3.078e+02)
y
#> [1] -208.09
So, according to our model, our customer would have a balance of -208.09. This number may seem funny (a negative balance is not realistic), but keep in mind that our r-squared was not the best for this model, and the ethnicity of the customer was not significantly related to the balance. Both of these facts may cause our prediction to be off. If we were actually creating a model to predict balance, we would want to look at some of the more strongly related variables in the data set.
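For completeness, we do not have to do this arithmetic by hand: R’s predict() function applies the fitted equation for us. Here is a short sketch reproducing the calculation above:

# same prediction as the manual equation: a Caucasian customer with a limit of 500
predict(M1, newdata = data.frame(Limit = 500, Ethnicity = "Caucasian"))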
15.3 Review Questions
1. Create a linear model using the ISLR Credit data set that predicts a customer’s credit limit based on their age, current balance, and the number of cards they have.
2. What is the p-value of this model? What does this tell us?
3. List the variables in order from most significant to least. How do you know they are significant?
4. What is the multiple r-squared of the model? What does this tell us? Is it good or bad?
5. What would be the credit limit of a 29-year-old with 5 cards and a total balance of 1500?
6. Explain what the following piece of code does: