Imputation is a complex process that requires a good knowledge of your data. For example, it is crucial to know whether the missing is at random or not before you impute the data. I have read a nice tutorial which visualize the missing data and help to understand the type of missing, and another post showing how to impute the data with MICE
package.
In this short post, I will focus on management of the missing data using the tidyverse
package. Specifically, I will show how to manage missings in the long data format (i.e., more than one observation for id).
Load package and create a dataset
For this example, I will create a small data frame with 3 ids and 4 observations per id. The dataset has three variables including id, quarter, and spending. The spending variable has missings values. The first id has no missing, the second id has 1 missing, and the third id has 2 missings.
library(tidyverse)
dat = data.frame(
id = c(1,1,1,1, 2,2,2,2, 3,3,3,3),
quarter = rep(c("Jan", "Apr", "Jul", "Oct"), times=3),
spending = c(22,35,10,64, 55,23,NA,10, 42,NA,NA,18)
)
dat
## id quarter spending
## 1 1 Jan 22
## 2 1 Apr 35
## 3 1 Jul 10
## 4 1 Oct 64
## 5 2 Jan 55
## 6 2 Apr 23
## 7 2 Jul NA
## 8 2 Oct 10
## 9 3 Jan 42
## 10 3 Apr NA
## 11 3 Jul NA
## 12 3 Oct 18
Find the missings by group:
dat %>%
group_by(id) %>%
summarise(missing=sum(is.na(spending)))
## # A tibble: 3 x 2
## id missing
## <dbl> <int>
## 1 1 0
## 2 2 1
## 3 3 2
Management of missing values
The missing values are allocated on different IDs and observations. Therefore, first I will exclude all missings and keep only id/observation with valid data. Second, I will replace the missings by substituting with the average of the group ID, and third, replace missing by filling with valid data within the group id.
Removing all the missing values
To remove all the missing values will use the na.omit
function.
na.omit(dat)
## id quarter spending
## 1 1 Jan 22
## 2 1 Apr 35
## 3 1 Jul 10
## 4 1 Oct 64
## 5 2 Jan 55
## 6 2 Apr 23
## 8 2 Oct 10
## 9 3 Jan 42
## 12 3 Oct 18
Replace missing values with the average of the group
I think this is a most common method to replace missings values within the group. I would advise checking the distribution of data before deciding to replace missings with mean or median. In this example, I will use mean.
dat %>%
group_by(id) %>%
mutate(spending_mean = ifelse(is.na(spending), mean(spending, na.rm=T), spending))
## # A tibble: 12 x 4
## # Groups: id [3]
## id quarter spending spending_mean
## <dbl> <fct> <dbl> <dbl>
## 1 1 Jan 22 22
## 2 1 Apr 35 35
## 3 1 Jul 10 10
## 4 1 Oct 64 64
## 5 2 Jan 55 55
## 6 2 Apr 23 23
## 7 2 Jul NA 29.3
## 8 2 Oct 10 10
## 9 3 Jan 42 42
## 10 3 Apr NA 30
## 11 3 Jul NA 30
## 12 3 Oct 18 18
Filling the missings with other values within the group
This approach is useful when you need to replace the missing with the former or next value within the group. Filling of missing with other values can be in two directions, up and down. See the code below to distinguish between up and down.
dat_fill_down = dat %>%
group_by(id) %>%
fill(spending, .direction = c("down"))
dat_fill_up = dat %>%
group_by(id) %>%
fill(spending, .direction = c("up"))
Now, lets see the differences. When the direction is down the missing is replaced with the previous value within id. Direction up the replacement of missing will be with the next available value.
dat
## id quarter spending
## 1 1 Jan 22
## 2 1 Apr 35
## 3 1 Jul 10
## 4 1 Oct 64
## 5 2 Jan 55
## 6 2 Apr 23
## 7 2 Jul NA
## 8 2 Oct 10
## 9 3 Jan 42
## 10 3 Apr NA
## 11 3 Jul NA
## 12 3 Oct 18
dat_fill_down
## # A tibble: 12 x 3
## # Groups: id [3]
## id quarter spending
## <dbl> <fct> <dbl>
## 1 1 Jan 22
## 2 1 Apr 35
## 3 1 Jul 10
## 4 1 Oct 64
## 5 2 Jan 55
## 6 2 Apr 23
## 7 2 Jul 23
## 8 2 Oct 10
## 9 3 Jan 42
## 10 3 Apr 42
## 11 3 Jul 42
## 12 3 Oct 18
dat_fill_up
## # A tibble: 12 x 3
## # Groups: id [3]
## id quarter spending
## <dbl> <fct> <dbl>
## 1 1 Jan 22
## 2 1 Apr 35
## 3 1 Jul 10
## 4 1 Oct 64
## 5 2 Jan 55
## 6 2 Apr 23
## 7 2 Jul 10
## 8 2 Oct 10
## 9 3 Jan 42
## 10 3 Apr 18
## 11 3 Jul 18
## 12 3 Oct 18
In this short post I showed a few tips and tricks how to manage missing in the longitudinal data with tidyverse
package.