One of the very first learning algorithms that you’ll encounter when studying data science and machine learning is least squares linear regression. Linear regression is one of the easiest learning algorithms to understand; it’s suitable for a wide array of problems, and is already implemented in many programming languages. Most users are familiar with the lm()
function in R, which allows us to perform linear regression quickly and easily. One drawback of the lm()
function, however, is that it takes care of the computations needed to obtain the parameter estimates (and many diagnostic statistics as well) on its own, leaving the user out of the equation. So, how can we obtain these parameter estimates from scratch?
In this post, I will outline the process from first principles in R. I will use only matrices, vectors, and matrix operations to obtain parameter estimates using the closed-form linear algebraic solution. After reading this post, you’ll see that it’s actually quite simple, and you’ll be able to replicate the process with your own data sets (though using the lm()
function is of course much more efficient and robust).
For this example, I’ll use the well-known “Boston” data set from the MASS library. For those who aren’t familiar with it, the Boston data set contains 14 economic, geographic, and demographic variables for 506 census tracts in the Boston area, drawn from the 1970 US Census. Our target variable will be the median home value for each tract, the column titled ‘medv’. Let’s get an idea of what this data set looks like:
library(MASS)
str(Boston)

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
To learn more about the definition of each variable, type help(Boston)
into your R console.
Now we’re ready to start. Linear regression typically takes the form
$$y = X\beta + \epsilon$$

where ‘y’ is a vector of the response variable, ‘X’ is the matrix of our feature variables (sometimes called the ‘design’ matrix), and \(\beta\) is a vector of parameters that we want to estimate. \(\epsilon\) is the error term; it represents features that affect the response but are not explicitly included in our model.
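To make the dimensions explicit, for \(n\) observations and \(p\) features (plus an intercept column), the same model can be written as:

$$\underbrace{y}_{n \times 1} = \underbrace{X}_{n \times (p+1)}\,\underbrace{\beta}_{(p+1) \times 1} + \underbrace{\epsilon}_{n \times 1}$$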
The first thing we will need is a vector of our response variable, typically called ‘y’. In this case, our response variable is the median home value in thousands of US dollars: ‘medv’:
y <- Boston$medv
Next, we’ll need our ‘X’ or ‘design’ matrix of the input features. We can do this with the as.matrix()
function, grabbing all the variables except ‘medv’ from the Boston data frame. We’ll also need to add a column of ones to our X matrix for the intercept term.
# Matrix of feature variables from Boston
X <- as.matrix(Boston[-ncol(Boston)])

# Vector of ones with same length as rows in Boston
int <- rep(1, length(y))

# Add intercept column to X
X <- cbind(int, X)
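As a quick sanity check (a small aside on my part, not part of the original workflow), we can confirm that X now has one row per observation and a leading column of ones:

# X should have 506 rows and 14 columns: the intercept plus 13 features
dim(X)

# The first column should be all ones
head(X[, 1:3])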
Now that we have our response vector and our ‘X’ matrix, we can use them to solve for the parameters using the following closed form solution:
$$\beta = (X^TX)^{-1}X^Ty$$

The derivation of this equation is beyond the scope of this post. If you are interested in the derivation, a good place to start is Wikipedia or any good undergraduate textbook on linear regression.
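For readers who just want the gist: the closed form comes from minimizing the residual sum of squares and setting its gradient with respect to \(\beta\) to zero, which yields the normal equations (assuming \(X^TX\) is invertible):

$$\begin{aligned} \text{RSS}(\beta) &= (y - X\beta)^T(y - X\beta) \\ \nabla_{\beta}\,\text{RSS} &= -2X^T(y - X\beta) = 0 \\ X^TX\beta &= X^Ty \\ \beta &= (X^TX)^{-1}X^Ty \end{aligned}$$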
We can implement this in R using our ‘X’ matrix and ‘y’ vector. The %*%
operator is simply matrix multiplication. The t()
function takes the transpose of a matrix, and solve()
calculates the inverse of any (invertible) matrix.
# Implement closed-form solution
betas <- solve(t(X) %*% X) %*% t(X) %*% y

# Round for easier viewing
betas <- round(betas, 2)
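As an aside, the same estimates can be computed without forming an explicit inverse by solving the normal equations directly. This is a minimal alternative sketch using base R’s crossprod(); it should reproduce the same numbers as the line above:

# Equivalent: solve (X'X) b = X'y directly rather than inverting X'X
# crossprod(X) computes t(X) %*% X; crossprod(X, y) computes t(X) %*% y
betas_alt <- solve(crossprod(X), crossprod(X, y))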
Now we have our parameter vector \(\beta\). These are the estimated linear relationships between the response variable ‘medv’ and each of the features. For example, because ‘medv’ is measured in thousands of dollars, an increase of one unit in the average number of rooms (rm) is associated with a $3,810 increase in median home value, holding the other variables constant. We can find the relationship between the response and any of the feature variables in the same manner, paying careful attention to the units in the data description. Notice that one of our features, ‘chas’, is a dummy variable which takes a value of 1 if the tract is adjacent to the Charles River and 0 otherwise. The coefficient of ‘chas’ tells us that homes in tracts adjacent to the Charles River (coded as 1) have a median price that is $2,690 higher than homes in tracts that do not border the river (coded as 0), when the other variables are held constant.
print(betas)

##            [,1]
## int       36.46
## crim      -0.11
## zn         0.05
## indus      0.02
## chas       2.69
## nox      -17.77
## rm         3.81
## age        0.00
## dis       -1.48
## rad        0.31
## tax       -0.01
## ptratio   -0.95
## black      0.01
## lstat     -0.52
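As a quick illustrative follow-up (not part of the original comparison), we can also use these estimates to compute fitted values and residuals by hand; because betas was rounded above, the results will differ very slightly from what lm() reports:

# Fitted values and residuals from our (rounded) estimates
fitted_vals <- as.vector(X %*% betas)
resids <- y - fitted_vals
head(cbind(medv = y, fitted = fitted_vals, residual = resids))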
Now let’s verify that these results are in fact correct. We want to compare our results to those produced by the lm()
function. Most of you will already know how to do this.
# Linear regression model
lm.mod <- lm(medv ~ ., data = Boston)

# Round for easier viewing
lm.betas <- round(lm.mod$coefficients, 2)

# Create data.frame of results
results <- data.frame(our.results = betas, lm.results = lm.betas)

print(results)

##         our.results lm.results
## int           36.46      36.46
## crim          -0.11      -0.11
## zn             0.05       0.05
## indus          0.02       0.02
## chas           2.69       2.69
## nox          -17.77     -17.77
## rm             3.81       3.81
## age            0.00       0.00
## dis           -1.48      -1.48
## rad            0.31       0.31
## tax           -0.01      -0.01
## ptratio       -0.95      -0.95
## black          0.01       0.01
## lstat         -0.52      -0.52
The parameter estimates are identical! Now you know how to obtain parameter estimates on your own. Keep in mind that you’re unlikely to favor implementing linear regression in this way over using lm()
. The lm()
function is very fast and requires very little code. Using it also provides a number of diagnostic statistics, including \(R^2\), t-statistics, and the oft-maligned p-values, among others. Furthermore, lm() solves for the parameters using a QR decomposition, which is more numerically stable than explicitly inverting \(X^TX\) as I’ve done here. Nevertheless, I wanted to show one way in which it can be done.
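If you would like to try the QR route yourself, base R exposes it directly through qr.solve(). This is only a rough illustration of the idea rather than a re-creation of lm()’s internals, but the coefficients should match:

# Least-squares estimates via QR decomposition rather than an explicit inverse
betas_qr <- qr.solve(X, y)
round(betas_qr, 2)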
There are other ways to solve for the parameters in linear regression. For example, gradient descent can be used to obtain parameter estimates when the number of features is extremely large, since the closed-form solution requires inverting \(X^TX\), which becomes expensive as the number of features grows. Stochastic gradient descent is often used when both the number of features and the number of samples are very large. I hope to detail these in a future post.
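To give a flavor of what that looks like, here is a minimal sketch of batch gradient descent for least squares; the function name, learning rate, iteration count, and the choice to standardize the features are all my own illustrative choices, not a standard recipe:

# Minimal batch gradient descent for least squares (illustrative only)
gradient_descent <- function(X, y, alpha = 0.05, iters = 10000) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  for (i in seq_len(iters)) {
    grad <- (2 / n) * t(X) %*% (X %*% beta - y)  # gradient of the mean squared error
    beta <- beta - alpha * grad                  # step downhill
  }
  beta
}

# Gradient descent behaves much better on standardized features, so these
# estimates are on the standardized scale and are NOT directly comparable
# to the coefficients above.
X_std <- cbind(int = 1, scale(X[, -1]))
betas_gd <- gradient_descent(X_std, y)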
I have made the code from this post available on my GitHub here.
Cheers!