When I wanted to calculate the correlation coefficients for 25 variables it became tricky. As I aimed to export results in a table, the function cor
was not helpful.
Moreover, I was interested in filtering the results by the p-value and export only the significant variables. Ok, after some search and help around I was able to accomplish this task and I am happy to share the code with you.
Load libraries
library(tidyverse)
library(broom)
Create the dataset
dt <- data.frame(a = rnorm(10) , b = rnorm(10), c = rnorm(10), d = rnorm(10))
dt
## a b c d
## 1 0.1296433 -0.04097794 1.05208987 1.1829971
## 2 1.1535071 -0.20391734 -0.41343509 0.3806597
## 3 1.0543683 0.76701232 -1.85961221 0.4893538
## 4 -0.2447034 -1.12584839 0.02915253 0.2820233
## 5 0.1323569 -0.74283490 0.12871463 -2.0830425
## 6 -1.7342506 -1.21694577 0.36379475 1.3919810
## 7 -0.3090799 0.71859777 -0.83134000 -0.5394136
## 8 0.9378126 -0.32229890 0.25171600 1.7425813
## 9 -0.6672502 -0.29552463 -0.12799691 1.2762921
## 10 0.3649565 -0.87161556 -0.82802152 0.1716353
Assess the correlation of all variables in the dataset
cor(dt)
## a b c d
## a 1.0000000 0.45227252 -0.3843381 -0.11109833
## b 0.4522725 1.00000000 -0.5235755 -0.03721051
## c -0.3843381 -0.52357546 1.0000000 0.24206394
## d -0.1110983 -0.03721051 0.2420639 1.00000000
Up to now everything is simple and nothing new, but as I mentioned above, that my interest was to export the results and to filter by p-value, the result of cor()
is useless. In this case, some more coding is, and the map()
functions from the broom package will be useful.
First, I will create a dataset with two columns which will show all variable names and every possible combination. In our example will be 6 rows.
dt1 = t(combn(names(dt), 2)) %>%
as_data_frame() %>%
setNames(c("x", "y"))
dt1
## # A tibble: 6 x 2
## x y
## <chr> <chr>
## 1 a b
## 2 a c
## 3 a d
## 4 b c
## 5 b d
## 6 c d
Now that we have the variable names and the dataset with the values will use the map2()
to call the variables and apply cor.test
function.
cor_result = dt1 %>%
mutate(results = map2(x, y, ~ cor.test(dt[[.x]], dt[[.y]], method = "pearson")),
results = map(results, tidy)) %>%
unnest(results)
cor_result
## # A tibble: 6 x 10
## x y estimate statistic p.value parameter conf.low conf.high
## <chr> <chr> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 a b 0.452 1.43 0.189 8 -0.248 0.842
## 2 a c -0.384 -1.18 0.273 8 -0.816 0.324
## 3 a d -0.111 -0.316 0.760 8 -0.692 0.558
## 4 b c -0.524 -1.74 0.120 8 -0.867 0.158
## 5 b d -0.0372 -0.105 0.919 8 -0.652 0.607
## 6 c d 0.242 0.706 0.500 8 -0.457 0.756
## # ... with 2 more variables: method <chr>, alternative <chr>
To select and filter the results by the p value.
cor_result %>%
select(x, y, estimate, p.value) %>%
filter(p.value < 0.5)
## # A tibble: 3 x 4
## x y estimate p.value
## <chr> <chr> <dbl> <dbl>
## 1 a b 0.452 0.189
## 2 a c -0.384 0.273
## 3 b c -0.524 0.120
With filtering, I end this post and hope you find it useful. If you have a better idea to accomplish the same results, please share with me.