Your dataset may not be complete. When a piece of information is not available, we call it a missing value. In R, missing values are coded with the symbol NA. To identify missing values in your dataset, use the function is.na().
First, let's create a small dataset:
Name <- c("John", "Tim", NA)
Sex <- c("men", "men", "women")
Age <- c(45, 53, NA)
dt <- data.frame(Name, Sex, Age)
Here is our dataset, called dt:
dt
  Name   Sex Age
1 John   men  45
2  Tim   men  53
3 <NA> women  NA
Now we will check for missing values in the dataset:
is.na(dt)
      Name   Sex   Age
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE FALSE
[3,]  TRUE FALSE  TRUE
You can also find the total number and the proportion of missing values in your dataset with the code below:
sum(is.na(dt))
[1] 2
mean(is.na(dt))
[1] 0.2222222
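The totals above do not show which columns the missing values come from. A minimal sketch of a per-column summary, assuming the same dt data frame as above; colSums() and colMeans() are base R functions that the original example does not use:

```r
# Rebuild the example data frame so this snippet runs on its own
dt <- data.frame(
  Name = c("John", "Tim", NA),
  Sex  = c("men", "men", "women"),
  Age  = c(45, 53, NA)
)

# Number of missing values in each column
na_count <- colSums(is.na(dt))

# Proportion of missing values in each column
na_share <- colMeans(is.na(dt))

print(na_count)
print(na_share)
```

This works because is.na(dt) returns a logical matrix, and TRUE is counted as 1 when summed or averaged.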
When you import a dataset from another statistical application, missing values might be coded with a number, for example 99. To let R know that this number represents a missing value, you need to recode it:
dt$Age[dt$Age == 99] <- NA
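When the column already contains some NA values, the comparison with 99 itself returns NA for those rows. A hedged variant of the recoding step is sketched below; the data frame raw and its values are hypothetical and only serve as an illustration, and which() is used here to select only the positions where the comparison is actually TRUE:

```r
# Hypothetical imported data in which 99 stands for "missing" in Age
raw <- data.frame(
  Name = c("John", "Tim", "Ann"),
  Age  = c(45, 99, NA)
)

# which() ignores NA comparisons, so only rows where Age equals 99 are selected
raw$Age[which(raw$Age == 99)] <- NA

print(raw)
print(sum(is.na(raw$Age)))
```

Before recoding, it is worth checking that 99 cannot be a legitimate value in the column in question.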
Another useful function in R for dealing with missing values is na.omit(), which deletes incomplete observations, that is, every row that contains at least one NA.
Let's look at another example, starting with another small dataset:
Name <- c("John", "Tim", NA)
Sex <- c("men", NA, "women")
Age <- c(45, 53, NA)
dt <- data.frame(Name, Sex, Age)
Here is the dataset, again called dt:
dt
  Name   Sex Age
1 John   men  45
2  Tim  <NA>  53
3 <NA> women  NA
Now we will use na.omit() to remove the rows with missing values:
na.omit(dt)
  Name Sex Age
1 John men  45
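Dropping every incomplete row can discard a lot of data, as in the example above where only one of three rows survives. The sketch below shows one possible way to control which rows are kept. It uses complete.cases(), a base R function not covered in the original text, and the object names keep, dt_complete and dt_age_known are chosen only for illustration:

```r
# Rebuild the second example data frame so this snippet runs on its own
dt <- data.frame(
  Name = c("John", "Tim", NA),
  Sex  = c("men", NA, "women"),
  Age  = c(45, 53, NA)
)

# TRUE for rows that have no missing value in any column
keep <- complete.cases(dt)

# Keeps the same rows that na.omit(dt) would keep
dt_complete <- dt[keep, ]

# Less strict: keep every row where Age is known, whatever the other columns hold
dt_age_known <- dt[!is.na(dt$Age), ]

print(dt_complete)
print(dt_age_known)
```

Filtering on a single column keeps Tim's row, which na.omit() removes because his Sex value is missing.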
This was an introduction to dealing with missing values. To learn how to impute missing data, please read this post.