A friend asked me whether I could create a loop that runs multiple regression models. She wanted to evaluate the association between 100 dependent variables (outcomes) and 100 independent variables (exposures), which means 10,000 regression models. Regression models with multiple dependent (outcome) and independent (exposure) variables are common in genetics.
So the models will look something like this (dx are the dependent variables, ix are the independent variables, and v are other covariates):
    dx1 = ix1 + v1 + v2 + v3
    dx1 = ix2 + v1 + v2 + v3
    dx1 = ix3 + v1 + v2 + v3
    ...
    dx1 = ix100 + v1 + v2 + v3
    dx2 = ix1 + v1 + v2 + v3
    ...
    dx100 = ix100 + v1 + v2 + v3
The output should be a data frame with 5 columns: the dependent variable, the independent variable, the beta estimate, the standard error and the p-value.
Something like this (those numbers are just for illustration purposes):
        d    i    beta  se     pvalue
    1   dx1  ix1  0.1   0.002  0.950
    2   dx2  ix2  0.2   0.002  0.826
    3   dx3  ix3  0.3   0.005  0.123
OK, now let's begin. The dataset that I received had all the variables in columns and the observations in rows (the data is not real, just random numbers for illustration purposes):
    id  dx1  dx2  ...  dx100  ix1  ...  ix100  v1  v2  v3
    10  324  124  ...  214    32   ...  32     ax  b4  c3
    11  431  982  ...  114    12   ...  77     ce  b2  c5
    12  545  123  ...  104    34   ...  11     ar  c2  a5
    ....
Position of variables
Record the column positions of the dependent and independent variables in your dataset, and create empty vectors to hold the results.
    # outcome: columns 2 to 101 (dx1 ... dx100)
    out_start = 2
    out_end = 101
    out_nvar = out_end - out_start + 1

    out_variable = rep(NA, out_nvar)
    out_beta = rep(NA, out_nvar)
    out_se = rep(NA, out_nvar)
    out_pvalue = rep(NA, out_nvar)

    # exposure: columns 102 to 201 (ix1 ... ix100)
    exp_start = 102
    exp_end = 201
    exp_nvar = exp_end - exp_start + 1

    exp_variable = rep(NA, exp_nvar)
    exp_beta = rep(NA, exp_nvar)
    exp_se = rep(NA, exp_nvar)
    exp_pvalue = rep(NA, exp_nvar)

    # running index into the result vectors (they grow as the loop fills them)
    number = 1
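If you would rather not count columns by hand, the positions can be looked up from the column names. This is only a minimal sketch: it assumes the outcomes are named `dx...` and the exposures `ix...` as in the example above, and the tiny `dat` below is a made-up stand-in for the real dataset.

```r
# Made-up stand-in for the real dataset (3 outcomes, 2 exposures)
dat <- data.frame(id = 1:3, dx1 = 0, dx2 = 0, dx3 = 0,
                  ix1 = 0, ix2 = 0, v1 = 0, v2 = 0, v3 = 0)

out_cols <- grep("^dx", colnames(dat))  # positions of the outcome columns
exp_cols <- grep("^ix", colnames(dat))  # positions of the exposure columns

# The loop iterates over start:end, so each block has to be contiguous
stopifnot(all(diff(out_cols) == 1), all(diff(exp_cols) == 1))

out_start <- min(out_cols); out_end <- max(out_cols)
exp_start <- min(exp_cols); exp_end <- max(exp_cols)

c(out_start, out_end, exp_start, exp_end)
# [1] 2 4 5 6
```

The contiguity check is there because a stray column between two outcomes would otherwise be silently treated as an outcome.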
For Loop
I used a linear mixed effects model, so I loaded the lme4 library. The loop should work with other regression analyses (e.g. linear regression) if you modify it according to your regression model. If you don't know which part to modify, leave a comment below and I will try to help.
Like other loops, this one takes the variables of interest one by one and, for each pair, extracts and stores the beta, the standard error, and the p-value. Remember, this code is specific to linear mixed effects models.
    library(lme4)

    for (i in out_start:out_end){
      outcome = colnames(dat)[i]
      for (j in exp_start:exp_end){
        exposure = colnames(dat)[j]

        # one model per outcome/exposure pair
        model <- lmer(get(outcome) ~ get(exposure) + v1 + (1|v2) + (1|v3),
                      na.action = na.exclude,
                      data = dat)

        # Wald z-test for the fixed effects
        Vcov <- vcov(model, useScale = FALSE)
        beta <- fixef(model)
        se <- sqrt(diag(Vcov))
        zval <- beta / se
        pval <- 2 * pnorm(abs(zval), lower.tail = FALSE)

        # element 2 is the exposure coefficient (element 1 is the intercept)
        out_beta[number] = as.numeric(beta[2])
        out_se[number] = as.numeric(se[2])
        out_pvalue[number] = as.numeric(pval[2])
        out_variable[number] = outcome
        number = number + 1

        # the same estimates are stored again under the exposure name;
        # outcome rows land on odd positions and exposure rows on even ones,
        # and the NA gaps in between are dropped later with na.omit()
        exp_beta[number] = as.numeric(beta[2])
        exp_se[number] = as.numeric(se[2])
        exp_pvalue[number] = as.numeric(pval[2])
        exp_variable[number] = exposure
        number = number + 1
      }
    }
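For ordinary linear regression, the parts to change are the model call and the extraction of the estimates. Here is a minimal sketch, not a drop-in replacement: the simulated `dat` and the helper name `fit_pair` are my own assumptions for illustration. It uses `lm()` instead of `lmer()`, builds the formula with `reformulate()` instead of `get()`, and reads the estimate, standard error and p-value from `coef(summary(model))`. Note that the p-value from `lm()` is based on the t distribution rather than the normal approximation used above.

```r
set.seed(1)
n <- 50
# Simulated data, for illustration only
dat <- data.frame(
  dx1 = rnorm(n), dx2 = rnorm(n),
  ix1 = rnorm(n), ix2 = rnorm(n),
  v1  = rnorm(n)
)

# Hypothetical helper: fit outcome ~ exposure + v1 with lm()
fit_pair <- function(outcome, exposure, data) {
  model <- lm(reformulate(c(exposure, "v1"), response = outcome),
              data = data, na.action = na.exclude)
  # coefficient table row for the exposure, selected by name
  est <- coef(summary(model))[exposure, ]
  c(beta   = est[["Estimate"]],
    se     = est[["Std. Error"]],
    pvalue = est[["Pr(>|t|)"]])
}

res <- fit_pair("dx1", "ix1", dat)
print(res)
```

Inside the double loop you would call the helper with the current `outcome` and `exposure` names and store `res[["beta"]]`, `res[["se"]]` and `res[["pvalue"]]` in the same way as before.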
Create two data frames with the results, one for the outcomes and one for the exposures:
    outcome = data.frame(out_variable, out_beta, out_se, out_pvalue)
    exposure = data.frame(exp_variable, exp_beta, exp_se, exp_pvalue)
Management of the dataframe
We now have 2 different data frames with our results, and we need to combine them into one. With the help of the tidyverse package this is a simple task. Basically, we rename the variables so that both data frames use the same names, and then we bind the two data frames together.
    library(tidyverse)

    outcome = outcome %>%
      rename(
        variable = out_variable,
        beta = out_beta,
        se = out_se,
        pvalue = out_pvalue
      )

    exposure = exposure %>%
      rename(
        variable = exp_variable,
        beta = exp_beta,
        se = exp_se,
        pvalue = exp_pvalue
      )

    all = rbind(outcome, exposure)
    all = na.omit(all)   # drop the empty rows left by the alternating index

    head(all)

      variable  beta  se     pvalue
    1 dx1       0.1   0.002  0.950
    3 dx2       0.2   0.002  0.826
    ........
    2 ix1       0.1   0.002  0.950
    4 ix2       0.2   0.002  0.826
    ........
Yet this is not the data frame we are looking for. We need the dependent and the independent variable of each model in the same row. Therefore, we do the final transformation as follows:
    data = all %>%
      mutate(
        type = substr(variable, 1, 2)
      ) %>%
      spread(type, variable) %>%
      rename(
        d = dx,
        i = ix
      ) %>%
      mutate(
        beta = round(beta, 5),
        se = round(se, 5),
        pvalue = round(pvalue, 5)
      ) %>%
      select(d, i, beta, se, pvalue)

    head(data)

        d    i    beta  se     pvalue
    1   dx1  ix1  0.1   0.002  0.950
    2   dx2  ix2  0.2   0.002  0.826
    3   dx3  ix3  0.3   0.005  0.123
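The reshaping above pairs each outcome row with its exposure row through their shared beta, se and pvalue values. An alternative, sketched here under my own assumptions rather than as the method used in this post, is to collect one row per model from the start, so no reshaping is needed. The simulated `dat` is made up, and `lm()` stands in for the mixed model so that the example runs without lme4.

```r
set.seed(42)
n <- 40
# Simulated data, for illustration only
dat <- data.frame(
  dx1 = rnorm(n), dx2 = rnorm(n),
  ix1 = rnorm(n), ix2 = rnorm(n), ix3 = rnorm(n),
  v1  = rnorm(n)
)

outcomes  <- grep("^dx", colnames(dat), value = TRUE)
exposures <- grep("^ix", colnames(dat), value = TRUE)

# Every outcome/exposure combination, one row each
pairs <- expand.grid(d = outcomes, i = exposures, stringsAsFactors = FALSE)

rows <- lapply(seq_len(nrow(pairs)), function(k) {
  d <- pairs$d[k]
  i <- pairs$i[k]
  # lm() is a stand-in here; swap in your own model and extraction
  model <- lm(reformulate(c(i, "v1"), response = d), data = dat)
  est <- coef(summary(model))[i, ]
  data.frame(d = d, i = i,
             beta   = est[["Estimate"]],
             se     = est[["Std. Error"]],
             pvalue = est[["Pr(>|t|)"]])
})

results <- do.call(rbind, rows)
head(results)
```

With 2 outcomes and 3 exposures this gives 6 rows, already in the `d`, `i`, `beta`, `se`, `pvalue` layout.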
I hope you find this post useful for your research and data analysis!