An Algebraic View of Multiple Regression Part 2

15 Sep 2018

Back to Business

In the previous tutorial, we covered how to approach regression from an algebraic viewpoint. However, we only discussed how to implement this framework working with numerical data without any interaction terms. Here, we will expand what was covered in the previous tutorial to include incoporating an interaction term and illustrating a few different ways for working with categorical data. Before starting, we are going to incorporate what we did last tutorial into a function called “my_lm” that will take in a design matrix and response variable. “my_lm” will output what we are used to seeing from base r, regression coefficients, standard errors, t and p values etc.

A Handy Function

my_lm<-function(x,y){
  
XtX<-t(x)%*%x

XtY<-t(x)%*%y  
  
betas<-solve(XtX)%*%XtY
  
hat_matrix<- x%*%solve(XtX)%*%t(x)

predicted<-hat_matrix%*%y  
  
residuals<-(y-predicted)  
  
var_est <- (1/(nrow(x)-ncol(x)))*t(residuals)%*%residuals

ses<-sqrt(diag(as.numeric(var_est)*solve(XtX)))  
  
tvals<-betas/ses

pvals<-sapply(tvals, function (p) 2*pt(abs(-p), nrow(x)-ncol(x), lower=FALSE))  

r2<-sum((predicted-mean(y))^2)/sum((y-mean(y))^2)

adjusted_r2<-1-(((1-r2)*(nrow(x)-1))/(nrow(x)-(ncol(x)-1)-1))

return(round(data.frame(b=betas,se=ses,t_value = tvals,p_value=pvals,r2=c(r2,rep(NA,ncol(x)-1)),adj_r2=c(adjusted_r2,rep(NA,ncol(x)-1))),6))

}

Incorporating an Interaction Term into the Design Matrix

Now, we have a function that will take in our response variable and a design matrix. In this post, we are going to create 3 different design matrices. Let’s start by setting up a design matrix for an interaction between two numeric predictor variables. To start, we will partition off our response variable (mpg), predictor variables (hp and disp), and create a variable for our intercept term.

y<-mtcars[,"mpg"]

x<-mtcars[,c("hp","disp")]

x0<-rep(1,nrow(x))

Now, we simply need to create our interaction term which we will do by multiplying our 2 predictor variables by each other and adding that term to our design matrix.

x<-as.matrix(cbind(x0,x,x[,1]*x[,2]))

Our design matrix is ready to go. It contains 4 columns: an intercept, disp, hp, and the product of hp * disp. We can fit a model with our lm function, “my_lm”, and compare the results to “lm” from base r.

my_lm(x,y)

##                         b       se   t_value  p_value       r2   adj_r2
## x0              39.674263 2.914172 13.614248 0.000000 0.819849 0.800548
## hp              -0.097894 0.024745 -3.956175 0.000473       NA       NA
## disp            -0.073373 0.014387 -5.100113 0.000021       NA       NA
## x[, 1] * x[, 2]  0.000290 0.000087  3.336151 0.002407       NA       NA

summary(lm(mpg ~ hp * disp, data = mtcars))

## 
## Call:
## lm(formula = mpg ~ hp * disp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5153 -1.6315 -0.6346  0.9038  5.7030 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.967e+01  2.914e+00  13.614 7.18e-14 ***
## hp          -9.789e-02  2.474e-02  -3.956 0.000473 ***
## disp        -7.337e-02  1.439e-02  -5.100 2.11e-05 ***
## hp:disp      2.900e-04  8.694e-05   3.336 0.002407 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.692 on 28 degrees of freedom
## Multiple R-squared:  0.8198, Adjusted R-squared:  0.8005 
## F-statistic: 42.48 on 3 and 28 DF,  p-value: 1.499e-10

Categorical Data: Dummy Coding

Let’s get to work on categorical data. First, we will implement a common scheme for dealing with categorical data known as dummy coding where we will use 0s and 1s to convey all possible information about level membership for a factor. The first thing we need to do is get our categorical data in that form. Below, I’ve created a function that will do just that - read in a vector representing some categorical data and output it in dummy coded form. There is undoubtedly a neater way to accomplish this, but I think the funciton below highlights clearly what is going on.

Making Dummies

make_dummy<-function(vec){

unq<-as.character(unique(vec))
unq<-sort(unq)

dummy_list<-list()

for (i in unq){
  
dummy_list[[i]]<-as.numeric(vec==i)

out<-do.call("cbind",dummy_list)

}
  
 return(out) 
  
}

Now we can create an appropriate design matrix representing dummy coded data with our helper function. Our response variable will still be mpg and we will use 2 categorical variables (cyl and gear) as our predictor variables. We’ll start by grabbing those variables and running them through our “make_dummy” function to create a dummy coding scheme for those 2 predictors and take a peek to make sure things are running correctly.

y<-mtcars[,"mpg"]

cyl_temp<-make_dummy(mtcars[,"cyl"])

gear_temp<-make_dummy(mtcars[,"gear"])

head(gear_temp)

##      3 4 5
## [1,] 0 1 0
## [2,] 0 1 0
## [3,] 0 1 0
## [4,] 1 0 0
## [5,] 1 0 0
## [6,] 1 0 0

We can now finish our design matrix. Notice that we have made the first level of each group the reference level by dropping the first column. This will immitate the default behavior of r with respect to dummy coding.

x<-as.matrix(cbind(cyl_temp[,-1],gear_temp[,-1]))
head(x)

##      6 8 4 5
## [1,] 1 0 1 0
## [2,] 1 0 1 0
## [3,] 0 0 1 0
## [4,] 1 0 0 0
## [5,] 0 1 0 0
## [6,] 1 0 0 0

Dummy Results

An intercept column will complete our design matrix. We can fit a model and compare our output to that of base r.

x0<-rep(1,nrow(x))

x<-cbind(x0,x)

my_lm(x,y)

##             b       se   t_value  p_value       r2   adj_r2
## x0  25.427899 1.881164 13.517111 0.000000 0.739788 0.701238
## 6   -6.655978 1.629136 -4.085589 0.000353       NA       NA
## 8  -10.542210 1.957977 -5.384235 0.000011       NA       NA
## 4    1.324094 1.927619  0.686907 0.498000       NA       NA
## 5    1.500181 1.854852  0.808787 0.425707       NA       NA

summary(lm(mpg ~ as.factor(cyl) + as.factor(gear), data = mtcars))

## 
## Call:
## lm(formula = mpg ~ as.factor(cyl) + as.factor(gear), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3520 -1.7633 -0.3789  1.7393  7.1480 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        25.428      1.881  13.517 1.55e-13 ***
## as.factor(cyl)6    -6.656      1.629  -4.086 0.000353 ***
## as.factor(cyl)8   -10.542      1.958  -5.384 1.09e-05 ***
## as.factor(gear)4    1.324      1.928   0.687 0.498000    
## as.factor(gear)5    1.500      1.855   0.809 0.425707    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.294 on 27 degrees of freedom
## Multiple R-squared:  0.7398, Adjusted R-squared:  0.7012 
## F-statistic: 19.19 on 4 and 27 DF,  p-value: 1.405e-07

Categorical Data: User Defined Contrasts

We’ve seen one way (out of many see: simple coding, deviation coding, forward difference coding, helmert coding, etc ) to accomodate caterogical data and incorporate it into a design matrix. Here, we will take a look at an additional scheme for handling categorical data, user-defined custom contrasts. This will allow us some additional control over the manipulation of our factor levels.

Below, we will be using the “warpbreaks” data because yarn is fascinating, and I’m getting tired of “mtcars”. We’ll be looking at the relationship between breaks (y) and tension (x). We’ll start by creating a contrast matrix depicting the differences between levels in tension we want to investigate. Tension has 3 levels which means we will be able to create 2 (k-1, where k = levels of a factor) orthogonal contrasts.

contrast 1: will compare “l” to the average of “m” and “h”
contrast 2: we will compare “m” to “h”

To ensure that these contasts are orthogonal we will multiply the contrasts in a pairwise fashion and ensure that the sum of the resulting products are equal to 0.

data("warpbreaks")

lVSmh<-c(1, -.5,-.5)
mVsh<-c(0,1,-1)

sum(lVSmh*mVsh)

## [1] 0

Don’t forget to Invert!

At this point, we have an initial contrast matrix, but before moving on we need to take the inverse of it to ensure out beta coefficients accurately represent the differences in estimates we have supplied. We can do this in base r, assuming we have a square matrix, as shown below. The ones are a constant representing the intercept term.

mat<-solve(rbind(c(1,1,1),lVSmh,mVsh))

With a final contrast matrix, we can move on and incorporate that scheme into our design matrix. Instead of using the “make_dummy” function, we will use the “model matrix” command to apply our inverted contrasts to our design matrix. Finally, we can use our “my_lm” function and compare it to “lm”.

y<-warpbreaks[,"breaks"]

x<-warpbreaks[,"tension"]

x0<-rep(1,length(x))

x<-apply(model.matrix(~x+0),2,function(x) as.numeric(x))
colnames(x)<-sub("x","",colnames(x))

x<-x%*%mat[,-1]

x<-as.matrix(cbind(x0,x))

head(x)

##      x0     lVSmh mVsh
## [1,]  1 0.6666667    0
## [2,]  1 0.6666667    0
## [3,]  1 0.6666667    0
## [4,]  1 0.6666667    0
## [5,]  1 0.6666667    0
## [6,]  1 0.6666667    0

my_lm(x,y)

##               b       se   t_value  p_value       r2   adj_r2
## x0    28.148148 1.616742 17.410415 0.000000 0.220329 0.189754
## lVSmh 12.361111 3.429628  3.604214 0.000711       NA       NA
## mVsh   4.722222 3.960193  1.192422 0.238614       NA       NA

contrasts(warpbreaks$tension)<-mat[,-1]

summary(lm(breaks ~ tension, data = warpbreaks))

## 
## Call:
## lm(formula = breaks ~ tension, data = warpbreaks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.389  -8.139  -2.667   6.333  33.611 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    28.148      1.617  17.410  < 2e-16 ***
## tensionlVSmh   12.361      3.430   3.604 0.000711 ***
## tensionmVsh     4.722      3.960   1.192 0.238614    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.88 on 51 degrees of freedom
## Multiple R-squared:  0.2203, Adjusted R-squared:  0.1898 
## F-statistic: 7.206 on 2 and 51 DF,  p-value: 0.001753

A Few Quick Checks

In addition, we can also use the “model.matrix” command to compare the design matrix from r to our own.

model.matrix(lm(breaks ~ tension, data = warpbreaks))[1:10,]

##    (Intercept) tensionlVSmh tensionmVsh
## 1            1    0.6666667         0.0
## 2            1    0.6666667         0.0
## 3            1    0.6666667         0.0
## 4            1    0.6666667         0.0
## 5            1    0.6666667         0.0
## 6            1    0.6666667         0.0
## 7            1    0.6666667         0.0
## 8            1    0.6666667         0.0
## 9            1    0.6666667         0.0
## 10           1   -0.3333333         0.5

x[1:10,]

##       x0      lVSmh mVsh
##  [1,]  1  0.6666667  0.0
##  [2,]  1  0.6666667  0.0
##  [3,]  1  0.6666667  0.0
##  [4,]  1  0.6666667  0.0
##  [5,]  1  0.6666667  0.0
##  [6,]  1  0.6666667  0.0
##  [7,]  1  0.6666667  0.0
##  [8,]  1  0.6666667  0.0
##  [9,]  1  0.6666667  0.0
## [10,]  1 -0.3333333  0.5

Finally, we can double check to make sure the our beta coefficients are representing what they should be.

cat("mean(l)-(mean(m)+mean(h))/2:",aggregate(breaks ~ tension, data = warpbreaks, mean)[1,2]-((aggregate(breaks ~ tension, data = warpbreaks, mean)[2,2]+aggregate(breaks ~ tension, data = warpbreaks, mean)[3,2])/2),"result from my_lm:",my_lm(x,y)[2,1])

## mean(l)-(mean(m)+mean(h))/2: 12.36111 result from my_lm: 12.36111

cat("mean(m)-mean(h):",aggregate(breaks ~ tension, data = warpbreaks, mean)[2,2] - aggregate(breaks ~ tension, data = warpbreaks, mean)[3,2],"result from my_lm:",my_lm(x,y)[3,1])

## mean(m)-mean(h): 4.722222 result from my_lm: 4.722222

Hopefully, you now have a better intuition of how categorical data is incorporated into a regression model and a better understanding of the algebra underlying regression!