In this post, I will show how to run a linear regression analysis for multiple independent or dependent variables. This should not be confused with a multivariable-adjusted model; this tutorial is not about multivariable models.
The goal is to automate the regression analysis when you want to investigate more than one independent or dependent variable at a time, and to avoid repeating many separate models such as the ones shown below:
# lm(dependent ~ independent1 + var1 + var2, data=data)
# lm(dependent ~ independent2 + var1 + var2, data=data)
# lm(dependent ~ independent3 + var1 + var2, data=data)
Such investigations are common in genetics, but they also come in handy whenever I have several variables to investigate. I have previously seen similar posts that build loops to run regression analyses for multiple variables, but loops can be challenging to create and edit.
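For reference, a loop-based version might look something like the sketch below; the data frame and variable names (data, dependent, independent1, and so on) are placeholders matching the commented formulas above, not objects defined in this post.
# Hypothetical loop: fit one model per independent variable of interest.
# 'data', 'dependent', 'independent1'..'independent3', 'var1', 'var2' are placeholders.
independents <- c("independent1", "independent2", "independent3")
models <- list()
for (x in independents) {
  f <- as.formula(paste("dependent ~", x, "+ var1 + var2"))
  models[[x]] <- lm(f, data = data)
}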
Below I will show how you can quickly achieve the same results by using the broom and other packages.
Load the libraries
library(reshape)
library(tidyverse)
library(broom)
Create the dataset
This dataset has 7 variables: id, age, sex, bmi, sbp, dbp, and gluc.
set.seed(3319)
dt <- data.frame(
  id   = 1:100,
  age  = rnorm(100, mean = 65, sd = 8),
  sex  = sample(c("Women", "Men"), 100, replace = TRUE, prob = c(0.4, 0.6)),
  bmi  = rnorm(100, mean = 27, sd = 4),
  sbp  = rnorm(100, mean = 145, sd = 30),
  dbp  = rnorm(100, mean = 80, sd = 20),
  gluc = rnorm(100, mean = 140, sd = 50)
)
dt %>%
slice(1:5)
## id age sex bmi sbp dbp gluc
## 1 1 62.93794 Women 29.17697 114.7606 80.10420 82.78492
## 2 2 77.51981 Women 32.85751 162.5616 84.43917 109.96731
## 3 3 63.61055 Men 38.33631 181.8345 67.32783 184.93135
## 4 4 54.69469 Women 27.92725 115.0467 97.35000 208.07860
## 5 5 63.26105 Women 21.67979 156.6037 69.39819 149.68155
Let's suppose I am interested in the association of bmi (dependent) with gluc, sbp, and dbp. Normally, I would have to create 3 separate models, one for each independent variable. See the example below:
# lm(bmi ~ gluc + age + sex, data=dt)
# lm(bmi ~ sbp + age + sex, data=dt)
# lm(bmi ~ dbp + age + sex, data=dt)
While 3 models are doable, 100 models would be extremely time-consuming, so it is worth automating the process. The first step is to transform the dataset from wide to long format. The variables I am interested in evaluating (in this example gluc, sbp, and dbp) are the ones reshaped into long format, while the rest are kept in id.vars.
The code below transforms the dataset from wide to long using the melt function.
long = melt(dt, id.vars = c("id", "bmi", "age", "sex"))
long %>%
slice(1:5)
## id bmi age sex variable value
## 1 1 29.17697 62.93794 Women sbp 114.7606
## 2 2 32.85751 77.51981 Women sbp 162.5616
## 3 3 38.33631 63.61055 Men sbp 181.8345
## 4 4 27.92725 54.69469 Women sbp 115.0467
## 5 5 21.67979 63.26105 Women sbp 156.6037
Two new columns are created: one column (variable) shows the variable name, and the other (value) holds the corresponding value for that variable. If the transformation from wide to long is not clear, go back to the dt dataset and compare its values with those in long.
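If you prefer to stay within the tidyverse, the same reshaping can be done with pivot_longer from tidyr (loaded as part of the tidyverse); this is a minimal sketch that should produce an equivalent long dataset, apart from row order and the tibble class:
# Reshape sbp, dbp and gluc into variable/value columns with tidyr
long2 <- dt %>%
  pivot_longer(cols = c(sbp, dbp, gluc),
               names_to = "variable",
               values_to = "value")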
Linear regression
Now that the dataset is ready, I will run a linear regression by group. The grouping variable (variable) includes gluc, sbp, and dbp, and the value column used in the model contains the values of whichever variable defines the group. In addition, I do some further coding: I filter by term to keep only the variables of interest, and I round and rename the estimate, std.error, and p.value columns.
long %>%
  group_by(variable) %>%
  do(tidy(lm(bmi ~ value + age + sex, .))) %>%
  filter(term == "value") %>%
  mutate(Beta = as.character(round(estimate, 3)),
         "P Value" = round(p.value, 3),
         SE = round(std.error, 3)) %>%
  select(Beta, SE, "P Value") %>%
  as.data.frame()
## variable Beta SE P Value
## 1 sbp -0.001 0.013 0.965
## 2 dbp -0.02 0.020 0.313
## 3 gluc -0.001 0.007 0.883
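As a side note, do() has been superseded in recent versions of dplyr, so the same table can also be built with nest() and map() from tidyr and purrr (both loaded with the tidyverse); this is a minimal sketch, assuming a recent tidyr version:
# Nest the data by variable, fit one model per group, and tidy the results
long %>%
  nest(data = -variable) %>%
  mutate(fit = map(data, ~ lm(bmi ~ value + age + sex, data = .x)),
         tidied = map(fit, tidy)) %>%
  unnest(tidied) %>%
  filter(term == "value") %>%
  select(variable, estimate, std.error, p.value)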
Having the results of the regression analysis in this form allows me to export them to a table or a PDF report.
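For example, the pipeline above can be stored in an object and then exported; the file name below is only an example:
# Same pipeline as above, assigned to an object
results <- long %>%
  group_by(variable) %>%
  do(tidy(lm(bmi ~ value + age + sex, .))) %>%
  filter(term == "value") %>%
  mutate(Beta = as.character(round(estimate, 3)),
         "P Value" = round(p.value, 3),
         SE = round(std.error, 3)) %>%
  select(Beta, SE, "P Value") %>%
  as.data.frame()

# Write to CSV; knitr::kable(results) would render the same table in an R Markdown report
write.csv(results, "regression_results.csv", row.names = FALSE)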
If you have questions or comments about this post please let me know.